pdf to word or excel

D_wathi
Dear Experts
We have a 6-month project in which data must be captured from PDF documents, entered into Excel, and then imported into the application. We have found that in the PDF documents some sections are scanned, some sections are handwritten, and some sections are tables.
1. We are looking for a solution/software. Please let us know the best solution to handle this work.
2. Also, please suggest whether we could read the documents aloud and use some voice reorganization software; if so, please suggest the software.
Dr. Klahn, Principal Software Engineer
Commented:
How critical is this information?  No OCR software is 100% perfect, and handwriting recognition is usually not much better than 80% even after training.

In this situation one might look into Amazon Mechanical Turk instead.  Commission three separate transcriptions, compare the results, and deal with any discrepancies manually.
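The compare-and-reconcile step works the same way whether the three transcriptions come from an outside service or from in-house staff. Below is a minimal sketch in Python, assuming each transcription is a dictionary of field name to typed value; the field names and sample values are hypothetical. A value is accepted only when a strict majority agrees on it, and everything else is set aside for manual review.

```python
from collections import Counter


def reconcile(transcriptions):
    """Compare independent transcriptions of the same fields.

    transcriptions: list of dicts mapping field name -> transcribed value.
    Returns (agreed, disputed). agreed maps field -> value chosen by a strict
    majority; disputed maps field -> all values seen, for manual review.
    """
    agreed, disputed = {}, {}
    fields = sorted(set().union(*(t.keys() for t in transcriptions)))
    for field in fields:
        # A missing field counts as an empty value.
        values = [t.get(field, "").strip() for t in transcriptions]
        value, count = Counter(values).most_common(1)[0]
        if count > len(transcriptions) / 2 and value != "":
            agreed[field] = value
        else:
            disputed[field] = values
    return agreed, disputed


# Hypothetical sample: three people typed the same form.
a = {"invoice_no": "10432", "amount": "1,250.00", "name": "J. Smith"}
b = {"invoice_no": "10432", "amount": "1,250.00", "name": "J. Smyth"}
c = {"invoice_no": "10482", "amount": "1,250.00", "name": "I. Smith"}

agreed, disputed = reconcile([a, b, c])
print(agreed)
print(disputed)
```

In this sample the invoice number and amount are settled by majority, while the name, on which all three disagree, is left for a person to resolve.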
Joe Winograd, Developer
Fellow 2017
Most Valuable Expert 2018

Commented:
> We have project for 6 months
> looking for solution/software

What is your budget for this solution, including software and services? Regards, Joe

Author

Commented:
The data is confidential, hence we cannot outsource. Our budget is approximately $3,000 to $4,000; however, if some solution is possible, please let us know. The budget can be discussed, but I would like to know the solution approach so that I can discuss the budget internally.
IT Professional | Freelance Journalist | Looking for Opportunities
Distinguished Expert 2018
Commented:
Hi D_wathi,

I've had a lot of success converting PDF documents to Excel using PDFelement Pro by Wondershare, and I also found that it outperformed the genuine (and considerably more expensive) Adobe Acrobat products when it came to OCR capabilities.

I've tried a lot of OCR software and found PDFelement to be one of the best for converting into Excel. That said, I also found that when tables were mixed with text and numbers that didn't belong in a table, the software would still sometimes get confused during the conversion, so proofreading any converted document is a must.

Unfortunately, based on my own research, there is no 100% solution for this type of task, regardless of how much you pay.

Regards, Andrew
Joe Winograd, Developer
Fellow 2017
Most Valuable Expert 2018
Commented:
> the data to be captured from the pdf document to captured and entered into excel

For PDF to Excel, I've had excellent results with this free online tool:
http://www.pdftoexcel.org/

It does a good (but not perfect) job of maintaining the formatting, which is always the trick with any PDF-to-Excel (or PDF-to-Word) conversion. Another nice feature of this tool is that it performs OCR if the file is an image-only PDF, thereby automatically creating the text. I don't know if it will work well on your particular PDFs, but it's worth a (free!) shot. If you do like it and would prefer a local install rather than the online tool, it is available for purchase and download (not free, but it has a 7-day free trial):
http://www.investintech.com/prod_downloadsa2e.htm

Another local install (not free, but reasonably priced at $27) is Boxoft PDF to Excel:
http://www.boxoft.com/pdf-to-excel/

Yet another local install worth trying is A-PDF to Excel (also not free, but reasonably priced at $39, and there's a free trial):
http://www.a-pdf.com/to-excel/index.htm

Both Boxoft PDF to Excel and A-PDF to Excel require that the PDF have text (not just an image), e.g., if your docs are scanned images, the PDFs need to have been processed with OCR software. You may do that with any OCR tool that you have. Then the A-PDF and Boxoft products will be able to process the text generated by OCR and attempt to create a properly formatted Excel spreadsheet. If you don't have an OCR tool, Boxoft has a free one:
http://www.boxoft.com/free-ocr/

And A-PDF has a reasonably priced one ($27):
http://www.a-pdf.com/ocr/index.htm

And here's a 5-minute EE video Micro Tutorial that shows another free product to perform OCR on PDFs:
How to OCR pages in a PDF with free software

Here's another approach instead of getting two products (one for OCR and one for conversion to Excel). Take a look at Kofax (previously Nuance) PaperPort. It has built-in OCR and a built-in feature to convert PDF docs to Excel. PaperPort can even scan directly to an Excel spreadsheet, as explained in these two 5-minute EE video Micro Tutorials:
How to create custom scanning profiles in PaperPort - Part 1
How to create custom scanning profiles in PaperPort - Part 2

> it is found that the pdf documents in some sections are scanned

Just because it is scanned doesn't mean that it has text. It may be just a raster image (bitmap/graphic). That's why Dr. Klahn, Andrew, and I all mentioned OCR. Of course, if your PDF already has text (i.e., is not just an image), then ignore all of the recommendations with respect to OCR.
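One way to find out which pages are image-only is to try extracting text from each page and flag the pages that come back nearly empty. The following is a rough Python sketch, not a definitive test. It assumes the third-party pypdf package for the extraction step, and the 20-character threshold is an arbitrary assumption that you would tune on your own documents.

```python
def pages_needing_ocr(page_texts, min_chars=20):
    """Return 1-based page numbers whose extracted text is too short
    to be a real text layer (min_chars is an assumed threshold)."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if len((text or "").strip()) < min_chars
    ]


def extract_page_texts(pdf_path):
    """Extract the text layer of every page.

    Assumes the third-party pypdf package is installed (pip install pypdf).
    """
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]


# Demonstration with already-extracted text, so no PDF file is required.
sample_pages = ["Invoice 10432 total due 1,250.00", "", "  \n", None]
print(pages_needing_ocr(sample_pages))
```

For a real file, you would pass the result of extract_page_texts("your-file.pdf") (a hypothetical path) to pages_needing_ocr. Keep in mind that a page with a poor OCR layer can still pass this check, so it only separates "no text at all" from "some text".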

> some sections it is hand written

Handwriting is a whole different ballgame. Typewritten text is amenable to OCR, which is very accurate these days. But handwriting is a different (and much more difficult) task that requires a process known as Intelligent Character Recognition (ICR) or another one known as Intelligent Word Recognition (IWR). ICR recognizes handwriting a character at a time, while IWR recognizes full words and phrases in handwriting. The accuracy of ICR and IWR is way below that of OCR.

In any case, two products worth trying are ABBYY FineReader and Kofax (previously Nuance) OmniPage:
https://www.abbyy.com/en-us/finereader
https://www.kofax.com/Products/omnipage

I use both and can say that both have very accurate OCR, but I can't say that one is always better than the other. I've tested them on the same documents, and sometimes one is better, sometimes the other is, but for the most part, the accuracy is similar - both very good! And they both can create Excel files. But, again, accuracy on handwriting is likely to be very low.
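If you want to run the same kind of side-by-side test on your own documents, one simple approach is to type a few sample pages by hand and score each product's output against that reference. Here is a minimal sketch using Python's standard difflib module. The engine labels and sample strings are hypothetical, and the score is a rough similarity ratio rather than a formal character error rate.

```python
from difflib import SequenceMatcher


def similarity(reference, ocr_output):
    """Rough 0..1 similarity between hand-checked text and OCR output."""
    # Collapse runs of whitespace so layout differences are not counted as errors.
    ref = " ".join(reference.split())
    out = " ".join(ocr_output.split())
    return SequenceMatcher(None, ref, out, autojunk=False).ratio()


def rank_engines(reference, outputs):
    """outputs: dict of engine label -> OCR text.
    Returns (label, score) pairs, best score first."""
    scores = {name: similarity(reference, text) for name, text in outputs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical sample: one line typed by hand, and two OCR results for it.
reference = "Total due: 1,250.00"
outputs = {
    "engine_a": "Total  due: 1,250.00",
    "engine_b": "Tota1 due: l,250.OO",
}
ranked = rank_engines(reference, outputs)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

Scoring a few dozen representative pages this way, including some of the handwritten sections, should give a more useful picture than any general claim about which product is better.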

> in some section it is table

Maintaining table formatting is difficult. There is no product that does it perfectly in all cases. You'll need to experiment to see what works best on your particular documents.
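While experimenting, a quick automated check can point out the rows most likely to have been mangled, so the manual proofreading can start there. The sketch below is in Python and assumes the converted sheet has been saved as CSV; the column positions and sample data are hypothetical. It flags rows whose cell count differs from the header row and cells that should be numeric but are not.

```python
import csv
import io


def suspicious_rows(csv_text, numeric_columns=()):
    """Return (row_number, reason) pairs for rows that likely broke
    during conversion. Row numbers are 1-based and include the header."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return []
    expected = len(rows[0])  # treat the header row as the reference width
    problems = []
    for number, row in enumerate(rows[1:], start=2):
        if len(row) != expected:
            problems.append((number, f"expected {expected} cells, found {len(row)}"))
            continue
        for col in numeric_columns:
            cell = row[col].replace(",", "").strip()
            try:
                float(cell)
            except ValueError:
                problems.append((number, f"column {col} is not numeric: {row[col]!r}"))
    return problems


# Hypothetical converted table: row 3 has a letter "l" where a digit was
# expected, and row 4 lost a cell.
sample = "item,qty,price\nWidget,2,10.50\nGadget,l,4.00\nSprocket,3\n"
problems = suspicious_rows(sample, numeric_columns=(1, 2))
for number, reason in problems:
    print(number, reason)
```

This only catches structural damage and obvious digit/letter confusion. A wrong digit that still parses as a number will pass, so it complements proofreading rather than replacing it.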

> then imported into the application

Even if you get accurate text, I don't know how difficult that will be...depends on the application.

> use some voice reorganization software

I presume that you mean voice recognition, not voice reorganization (probably a spell-check error), but in any case, I'd say that's worthy of a separate question...and it needs a lot more detail on what you're trying to achieve.

> budget is approx between 3000$ to $4000

Without understanding the project better, I can't say if that's a reasonable amount of money for it.

As a disclaimer, I want to emphasize that I have no affiliation with any of the companies mentioned in this post and no financial interest in them whatsoever. I am simply a happy user/customer. Regards, Joe
