Extracting data from pdf

Hello,

I need to extract data from a pdf file so that it can be normalized and imported into MS Access.  The pdf file has the data in a form-like layout.  I've downloaded some pdf-to-Excel conversion tools, but the output is basically just an Excel document with no easy way to extract the records.  A pdf file may contain 136 pages.  I'd like some recommendations on tools I can use to extract specific data from pdf files so that the data appears in rows and columns in a spreadsheet table.
Juan Velasquez Asked:

arnold Commented:
There are pdf2 converters. Depending on what programming languages you are familiar with, there are PDF interfaces that could allow you to do what you want.  If the PDF is I be you design, and how and who fills them out and you get them back, there are mechanisms that the sata sent back would be the form data rather than the entire PDF.
Juan Velasquez (Author) Commented:
"If the PDF is I be you design, and how and who fills them out and you get them back, there are mechanisms that the sata sent back would be the form data rather than the entire PDF."

I'm not sure what you are trying to say.  I'm looking for a tool to mine data from a pdf file and export that data into a database.
Joe Winograd, Fellow & MVE (Developer) Commented:
I have numerous tools that can convert from PDF to Excel, but the results vary by document. If you can post a page or two of your PDF file (being careful, of course, not to include any private/sensitive data), I'll run it through various products and post back the best results. Btw, is it an image-only PDF file (a bitmap/graphic/raster image) or does it have text, such as a PDF Normal file or a PDF Searchable Image file (the latter typically being created by scanning, and containing both the scanned image and the text created by OCR)? Regards, Joe
arnold Commented:
Which programming languages do you use/have access to?
PDF is a standard, and there are guides/libraries for different programming/scripting languages that will let you extract the data from a PDF.
I.e. the PDF contains PDF formatting references, and part of that content is the data.

What are you planning to use to crunch the PDF, and what are you using to insert the data into the Access DB? VB, C#, etc.?
Juan Velasquez (Author) Commented:
The data is private and sensitive, and it is laid out in a form format, so it is not tabular.  The final destination for the data is an Access database where it will be used in queries.  Each "record" can take up more than one pdf page. I've looked at a product called Astera Report Miner, but it is a bit expensive.
Joe Winograd, Fellow & MVE (Developer) Commented:
Is the data text or image? In other words, is OCR needed?
arnold Commented:
Your selection of topic is what limits your approach. Access is the database into which you want to put the data. PDF is the source of the data. What programming languages do you know? Programmatically, there are ways to extract the data you want.
There are pdf2html converters, pdf2ps, etc. The difficult part, unless you want to manually copy and paste data, is covering the variations in the types of documents that the PDFs represent: understanding which sections are important and how they should be referenced within the database.
Joe Winograd, Fellow & MVE (Developer) Commented:
If the data is text (not image), then you can use the Xpdf command line utility called PDFtoText to convert PDF files to plain text files (free for non-commercial use). Here are two 5-minute EE video Micro Tutorials about Xpdf:

Xpdf - Command Line Utility for PDF Files - Part 1
Xpdf - Convert PDF Files to Plain Text Files - Part 3

The first video discusses how/where to download all of the utilities (there are nine of them) and the second video is about PDFtoText specifically. You could write a program that calls PDFtoText to convert the PDF to text, then parse the text and load the data into Access.
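
As a rough illustration of that "call PDFtoText, then parse" idea, here is a minimal Python sketch. It assumes the Xpdf `pdftotext` executable is on the PATH; the function names are made up for this example. The `-layout`, `-raw` and `-table` switches select pdftotext's formatting modes (`-table` is assumed to be available in the Xpdf build being used), `-enc UTF-8` sets the output encoding, and passing `-` as the output file sends the text to stdout. pdftotext separates pages with a form feed character, which is handy when a record runs over more than one page.

```python
import subprocess

# Formatting modes mapped to pdftotext switches.
# "-table" is assumed to exist in the installed Xpdf build.
MODES = {"default": [], "layout": ["-layout"], "raw": ["-raw"], "table": ["-table"]}


def build_pdftotext_command(pdf_path, mode="layout", exe="pdftotext"):
    """Build the argument list for a pdftotext call that writes to stdout."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    # An output file name of "-" tells pdftotext to write to stdout.
    return [exe, *MODES[mode], "-enc", "UTF-8", str(pdf_path), "-"]


def split_pages(text):
    """Split pdftotext output into pages; pages are separated by form feeds."""
    return [page.strip() for page in text.split("\f") if page.strip()]


def pdf_to_pages(pdf_path, mode="layout", exe="pdftotext"):
    """Run pdftotext and return one string per page (needs pdftotext installed)."""
    result = subprocess.run(
        build_pdftotext_command(pdf_path, mode, exe),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return split_pages(result.stdout.decode("utf-8", errors="replace"))


# Example (not run here): pages = pdf_to_pages("TEST.pdf", mode="layout")
```

Which mode works best depends on the document, so it is worth comparing the output of each mode on a sample page before writing the parser.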

If you need finer control over converting the text than PDFtoText provides, look at the comprehensive iText library:
http://itextpdf.com/

And iTextSharp, the .NET port of iText:
http://sourceforge.net/projects/itextsharp/

It is a robust library for working with PDF files. It has two licensing/pricing models — AGPL (free) and commercial/OEM (not free).

I understand that your data is private/sensitive, but without seeing a sample document, it's difficult to conjecture how to process it. Perhaps you could create a sample page or two that has the format of the real documents, but content that is simply test data. Regards, Joe
Juan Velasquez (Author) Commented:
The documents are pdf text
Juan Velasquez (Author) Commented:
Attached is a sample file
TEST.pdf
arnold Commented:
What generates these PDFs? Do you control that? Can you add functionality to that mechanism so that, in addition to outputting the PDF, it also enters the data in a database? Or are these notifications sent to you?
Joe Winograd, Fellow & MVE (Developer) Commented:
Attached are the four text files created by PDFtoText from your sample file, using four of its formatting options: default, layout, raw, table. Regards, Joe
TEST-default.txt
TEST-layout.txt
TEST-raw.txt
TEST-table.txt
aikimark Commented:
Will this report ever cover multiple customers?

Is there a
Cumulative Quantity 234 Start Date 20150101
section after the last line on the sample page?
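
If that section is present, a line in that shape can be picked apart with a regular expression once the PDF has been converted to text. The sketch below is only an illustration: it assumes the labels appear exactly as quoted above, that the quantity is a whole number, and that the date is in YYYYMMDD form. The function name and the dictionary keys are hypothetical.

```python
import re
from datetime import datetime

# Assumed line shape: "Cumulative Quantity 234 Start Date 20150101"
SUMMARY_RE = re.compile(
    r"Cumulative Quantity\s+(?P<qty>\d+)\s+Start Date\s+(?P<start>\d{8})"
)


def parse_summary(line):
    """Return the quantity and start date from a summary line, or None if absent."""
    match = SUMMARY_RE.search(line)
    if match is None:
        return None  # the section may be optional, so callers must handle None
    return {
        "cumulative_quantity": int(match.group("qty")),
        # Assumption: the date is YYYYMMDD, as 20150101 suggests.
        "start_date": datetime.strptime(match.group("start"), "%Y%m%d").date(),
    }
```

Returning `None` for lines that do not match keeps the parser usable whether or not every page carries this section.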
aikimark Commented:
It looks like you need to populate three tables
Juan Velasquez (Author) Commented:
Hello Arnold,

The pdfs are sent via email to the company from an outside source. They will not change the formatting. Period. EOD
Juan Velasquez (Author) Commented:
Hello Aikimark,

The reports only cover one customer, with many POs
aikimark Commented:
So, the Cumulative Quantity section is optional?
ProfessorJimJam Commented:
Perhaps try the trial version of Able2Extract Professional and see if it works for you.
arnold Commented:
The formatting is set.
The tools that are available might do portions of what you want. Creating your own lets you deal with the specific issue of extracting the data you need from the file.
It sounds as though the PDF is a multi-page report including data belonging to multiple customers who may have placed several orders within this time frame, and you want those ........

Look at a PDF API that can be used with a programming language that you are familiar with.
Juan Velasquez (Author) Commented:
I'll try the pdf api route.
Joe Winograd, Fellow & MVE (Developer) Commented:
> I'll try the pdf api route.

Yes, good idea, as I mentioned in my previous post about the comprehensive iText library, one of the best PDF APIs out there. Or, as I also mentioned in the same post, try the much simpler Xpdf command line utility called PDFtoText, which may be called from any programming/scripting language that can make a command line call. Good luck on the project! Regards, Joe
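
Whichever route is used to get the text out, the parsed records still have to reach Access. One low-tech option, sketched here in Python, is to write them to a CSV file, which Access can import as a text file. The column names `po_number` and `cumulative_quantity` are placeholders for illustration, not fields taken from the real documents.

```python
import csv
import io


def write_records_csv(records, fieldnames, out):
    """Write a list of dicts as CSV rows, with a header row, to a text stream."""
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)


# Demonstration with an in-memory buffer and placeholder data.
buffer = io.StringIO()
write_records_csv(
    [{"po_number": "PO-1", "cumulative_quantity": 234}],
    ["po_number", "cumulative_quantity"],
    buffer,
)
csv_text = buffer.getvalue()
```

To write a real file instead of a buffer, open it with `open("records.csv", "w", newline="", encoding="utf-8")` so the csv module controls the line endings.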