Solved

PDF Extraction to Excel

Posted on 2011-03-23
Last Modified: 2012-06-21
I have a multi-page scanned PDF document that contains several one-page invoices. I need a solution that will OCR the document so that the data can be extracted, and then let me select specific fields to export to a spreadsheet. The same fields are repeated on each page.

I've looked at a couple of solutions, but they require copying each field from every page by hand to extract the data I want, and that takes too much time.
Question by:curtconner
7 Comments
 
LVL 33

Accepted Solution

by:
jppinto earned 100 total points
ID: 35200184
Have you tried PDF2XL? Take a look at my review of this program on my blog here:

http://excel-user.blogspot.com/2010/11/pdf-to-excel.html

jppinto

Author Comment

by:curtconner
ID: 35200581
jppinto:  The OCR piece didn't work very well with the document that I'm scanning.  Loved the features, but the OCR failed.
LVL 3

Assisted Solution

by:InfoStranger
InfoStranger earned 100 total points
ID: 35202192
Do you have Adobe Acrobat?

My instructions below are for Acrobat 8.0. To convert the scanned image to text using OCR:
1) Open the PDF in Acrobat
2) Select the Document menu
3) Select OCR Text Recognition
4) Select Recognize Text Using OCR...
5) Click OK

You may want to try this first and then try the extraction again. The OCR may not work as well if the document is faded or too crooked.
LVL 26

Assisted Solution

by:redmondb
redmondb earned 100 total points
ID: 35203938
curtconner,

I've frequently used ABBYY FineReader for tasks such as this. (My version is V8, the current is V10 - http://www.abbyy.com/.)

Initially, you create a template specifying the fields that you want to extract from the invoice (a few minutes' work for a typical invoice layout) and set up a job to open, read and export the fields to Excel (another minute's work).

From then on, simply run the job which opens the PDF, OCRs the required fields and exports them to Excel.
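
For anyone who wants to try the same "template of fields, then export" idea outside FineReader, here is a minimal sketch in Python. It is not FineReader's method or API. It assumes the PDF has already been OCR'd and that you have the recognized text of each page as a string. The field names and regular expressions (`invoice_no`, `date`, `total`) are hypothetical and would need adjusting to the real invoice layout.

```python
import csv
import io
import re

# Hypothetical "template": field name -> regex with exactly one capture group.
# These patterns are assumptions for illustration, not taken from any real invoice.
FIELD_PATTERNS = {
    "invoice_no": re.compile(r"Invoice\s*(?:No\.?|#)\s*:?\s*(\S+)", re.IGNORECASE),
    "date": re.compile(r"\bDate\s*:?\s*(\d{1,2}/\d{1,2}/\d{2,4})", re.IGNORECASE),
    "total": re.compile(r"\bTotal\b\s*:?\s*\$?\s*([\d,]+\.\d{2})", re.IGNORECASE),
}


def extract_fields(page_text):
    """Return one row (dict) for a page; fields that are not found are left empty."""
    row = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(page_text)
        row[name] = match.group(1) if match else ""
    return row


def pages_to_csv(pages):
    """Build CSV text with a header row and one row per page of OCR'd text."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(FIELD_PATTERNS))
    writer.writeheader()
    for page_text in pages:
        writer.writerow(extract_fields(page_text))
    return out.getvalue()
```

Saving the returned text to a `.csv` file gives something Excel can open directly. Pages where the OCR missed a field show up as blank cells, which makes the failures easy to spot and fix by hand.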

Regards,
Brian.
LVL 1

Assisted Solution

by:jyk_aus
jyk_aus earned 100 total points
ID: 35205871
curtconner,

Have you considered purchasing the full version of Adobe Acrobat (as opposed to the free Reader)?  Amongst other things it has the facility to convert PDF to quite a few formats, Excel included.

See here:
http://www.adobe.com/products/acrobatstandard.html

Best regards
Jacob
LVL 20

Assisted Solution

by:viki2000
viki2000 earned 100 total points
ID: 35419680
Try this http://www.abbyyusa.com/finereader/
It is programmable with macros, has customizable areas...
LVL 26

Expert Comment

by:redmondb
ID: 35857619
Thanks, curtconner.

Hope it worked out OK in the end.

Regards,
Brian.