Extract data from PDF

I have attached an image of part of a PDF. I would like to extract the data in the square boxes for fields such as DHS case no., case name, creation date, etc.

Does anyone know of tools or software that would do this? (Attachment: PDFImage.png)

Joe Winograd (Fellow & MVE, Developer) commented:
I realize you posted it as a PNG image, but does the actual PDF file have text in it or is it an image-only PDF? If it already has text, then you may extract it in a variety of ways, such as the PDFtoText utility described in this 5-minute EE video Micro Tutorial:
Xpdf - Convert PDF Files to Plain Text Files - Part 3
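If the PDF already contains text, the extraction step can be scripted. Below is a minimal sketch, assuming the Xpdf `pdftotext` executable is on the PATH. The function names and the file name `referral.pdf` are hypothetical; `-layout` (keep the page's column layout), `-enc UTF-8`, and a final `-` (write to stdout) are standard pdftotext options.

```python
import subprocess
from pathlib import Path


def build_pdftotext_command(pdf_path, layout=True):
    """Build the argument list for a pdftotext call that writes to stdout."""
    cmd = ["pdftotext"]
    if layout:
        # -layout keeps the physical column layout of the page.
        cmd.append("-layout")
    # -enc UTF-8 requests UTF-8 output; the final "-" sends the text to stdout.
    cmd += ["-enc", "UTF-8", str(Path(pdf_path)), "-"]
    return cmd


def pdf_to_text(pdf_path, layout=True):
    """Run pdftotext and return the extracted text as a string."""
    result = subprocess.run(
        build_pdftotext_command(pdf_path, layout),
        capture_output=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout


# Example (hypothetical file name):
# text = pdf_to_text("referral.pdf")
```

Keeping the command construction in its own function makes it easy to inspect the exact call before running it against a real file.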

But if it is an image-only PDF, then you need to perform OCR on it first. For example, attached is a PDF Searchable Image file created from the PNG you posted — it contains both the image and searchable text created by the OCR process. Btw, the OCR results should be much more accurate with the original PDF rather than the low resolution PNG. Regards, Joe
HLRosenberger (Author) commented:
It's not an image; I just posted an image of it. I already own a tool that will extract text from a PDF, but I still need to parse the string that it returns.

I was hoping to eliminate the parsing by using a tool that would recognize the border lines between the cells.
Joe Winograd (Fellow & MVE, Developer) commented:
OK, if it's not an image and already has text and is a form, then you can use PDFtk Server (free) to generate an FDF file:

Don't be misled by "Server" in the name. I don't know why they called it that, but it's just an executable (pdftk.exe, with a supporting DLL, libiconv2.dll) that runs on XP, Vista, W7, and W8 (i.e., it does not have to run on a "server" OS). The command line would be like this:

pdftk.exe inputfile.pdf generate_fdf output outputfile.fdf

The border lines between the cells should not be a problem if it's a form. Regards, Joe
HLRosenberger (Author) commented:
But it's not a form. I have a tool called Aspose that extracts PDF text. It also extracts form fields, but when I use it, it tells me there are no form fields.

What I'm looking for is a way to extract all the data in those "cells" (surrounded by borders).
Joe Winograd (Fellow & MVE, Developer) commented:
I'm not familiar with Aspose, but it sounds as if it's similar to the PDFtoText utility that I mentioned earlier, which is what you'll want — that is, plain text to work with. For example, here's the output from a PDFtoText call with its -layout option:

CUA In-Home Services Referral                                                      Case No, 294118 Page 1 of 12

     CUA IN-HOME SERVICES      DHS CASE NO.: aaaaaa                        PHILADELPHIA DEPARTMENT
                 REFERRAL                                                        OF HUMAN SERVICES
                               CASE NAME:           aaaaaa                      CHILDREN AND YOUTH

                               CREATION DATE: aaaaaa

CASE WORKER:         aaaaaa    SUPERVISOR:          aaaaaa                 ADMINISTRATOR:  aaaaaa

PHONE:        aaaaaa           PHONE:               aaaaaa                 PHONE:  aaaaaa

EMAIL:        aaaaaa''' -      EMAIL:               aaaaaa                 EMAIL:  aaaaaa




That's from the PDF that I created with your low-res PNG file, which is why you see some garbage characters, such as in the email field — that won't be an issue with your original PDF that already has text (it's an artifact of my running OCR on the PNG).

I presume that your Aspose tool can do something like the above — if not, use Xpdf's PDFtoText.

As you can see in the PDFtoText output, it has the fields from the "cells" and the borders are not a problem. Regards, Joe
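The layout-preserving text still has to be split into label/value pairs. Here is a rough sketch of one way to do that, assuming every label is written in upper case and ends with a colon, and that columns are separated by at least two spaces, as in the output above. The names `Field`, `FIELD_RE`, and `parse_fields` are made up for illustration, and the sample text is a shortened copy of the output shown earlier.

```python
import re
from typing import List, NamedTuple


class Field(NamedTuple):
    label: str   # e.g. "CASE NAME"
    value: str   # text that follows the label
    column: int  # character offset of the label within its line


# A label is one or more upper-case words joined by single spaces, ending in ":".
# A value runs until a gap of two or more spaces, or until the end of the line.
FIELD_RE = re.compile(
    r"(?<!\S)([A-Z][A-Z.]*(?: [A-Z][A-Z.]*)*):\s*(\S.*?)(?=\s{2,}|$)"
)


def parse_fields(layout_text: str) -> List[Field]:
    """Return every LABEL: value pair found, in reading order."""
    fields = []
    for line in layout_text.splitlines():
        for m in FIELD_RE.finditer(line.rstrip()):
            fields.append(Field(m.group(1), m.group(2), m.start(1)))
    return fields


SAMPLE = """\
     CUA IN-HOME SERVICES      DHS CASE NO.: aaaaaa          PHILADELPHIA DEPARTMENT
                               CASE NAME:    aaaaaa          CHILDREN AND YOUTH
CASE WORKER:  aaaaaa           SUPERVISOR:   aaaaaa          ADMINISTRATOR:  aaaaaa
PHONE:        aaaaaa           PHONE:        aaaaaa          PHONE:  aaaaaa
"""

for field in parse_fields(SAMPLE):
    print(field.column, field.label, "=", field.value)
```

Labels such as PHONE repeat, so the `column` offset is kept to tie each one to the field printed above it in the same column. Two limitations of this sketch: a value containing two or more consecutive spaces is cut short, and an empty field picks up whatever text follows it on the line.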

HLRosenberger (Author) commented:
OK, thanks. I'll give it a shot.