Need an OCR tool

I need a tool I can use to digitize a report, like the one attached here...

Will this kind of report get a 100% successful conversion rate?

Eventually, I need the tool to be part of my website, but I have not yet chosen my back-end technology. For now, a simple Mac-based tool is fine, just so I can convert a report by hand and start using it while programming the back-end.

Windows is okay, if there are limited Mac FREE versions.

I do have Office 365 (Mac) if there is a tool in there which I can use.

I am also interested in hearing what "plug-ins" can work when I deploy this to my website, for online OCR conversions.
curiouswebster (Software Engineer) asked:

David Favor (Linux/LXD/WordPress/Hosting Savant) commented:
A simple way to do this is to convert the .pdf files (most statements come in .pdf format) to text so you can parse them, with something like this:

pdftotext -enc ASCII7 -nopgbrk -layout "$file" "$outfile" 2>/dev/null

Every sensible Linux distro, as well as MacPorts, provides this tool; installing the poppler-utils package gets you pdftotext.
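As a rough sketch of how that command might be applied to a whole folder of statements, the loop below runs pdftotext over every .pdf in a directory and writes one .txt file per input. The function name convert_pdfs and the directory arguments are assumptions made for illustration; they are not part of the suggestion above.

```shell
#!/bin/sh
# Sketch: convert every PDF in a directory to plain text with pdftotext.
# convert_pdfs and its two arguments are hypothetical names for this example.
convert_pdfs() {
    src_dir=$1
    out_dir=$2
    mkdir -p "$out_dir" || return 1
    for file in "$src_dir"/*.pdf; do
        # Skip the unexpanded pattern when the directory holds no PDFs.
        [ -e "$file" ] || continue
        outfile="$out_dir/$(basename "$file" .pdf).txt"
        # -layout keeps columns aligned, which helps when parsing tabular reports.
        if pdftotext -enc ASCII7 -nopgbrk -layout "$file" "$outfile" 2>/dev/null; then
            echo "converted: $file"
        else
            echo "failed: $file" >&2
        fi
    done
}
```

It could be called as convert_pdfs ./statements ./text, after which each .txt file can be parsed with whatever back-end language is chosen later.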

If you're scanning documents, consider a printer like a ScanSnap which will do a double-sided scan in <1 second/page + generate a .pdf file with an embedded text version, which can be parsed too.
curiouswebster (Software Engineer, Author) commented:
I am not looking for a new printer, at the moment. I tried to install ScanSnap, but it required the printer. What is your second choice for a Mac?

Your pdftotext conversion utility looks interesting. It would be most helpful for me to use the same OCR tool during my early research for this project as I would use when I deploy the website.

1) Is this free to use?
2) How do I install it on my Mac?
Joe Winograd (Developer) commented:
> Will this kind of report get a 100% successful conversion rate?

In a word, NO! While today's OCR is very accurate, it is not 100%. There are always issues like the number "0" and the upper case letter "O"; the number "1" and the lower case letter "l"; and words like "modern", where the "r" and the "n" can be nearly touching in a proportional font, thereby causing OCR to think it's "modem". Even on good quality docs, it's very unlikely to be 100%; on the type of doc that you posted, I can guarantee that it won't be 100%.
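Because of those look-alike characters, it can help to post-process fields that are known to be numeric. The following is only a minimal sketch under that assumption: normalize_numeric and is_clean_number are hypothetical helper names, and the substitutions should never be applied to free text, where "O" and "l" are legitimate letters.

```shell
#!/bin/sh
# Sketch: repair common OCR look-alike errors in a field known to be numeric.
# normalize_numeric is a hypothetical helper; use it only on columns that
# must contain digits (amounts, account numbers), never on free text.
normalize_numeric() {
    # Map O and o to 0, and l and I to 1.
    printf '%s\n' "$1" | tr 'OolI' '0011'
}

# Succeed only if the field is non-empty and holds nothing but
# digits, '.', ',' or '-'.
is_clean_number() {
    case $1 in
        ''|*[!0-9.,-]*) return 1 ;;
        *) return 0 ;;
    esac
}
```

A field that still fails is_clean_number after normalization is a good candidate for manual review rather than further automatic guessing.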

> a printer like a ScanSnap
> tried to install ScanSnap, but it required the printer

Just to be clear, ScanSnap is a scanner, not a printer.

> Windows is okay, if there are limited Mac FREE versions.

As you know from our past threads together, I'm a Windows guy, not Mac. If you simply want something to experiment with for free, watch my 5-minute EE video Micro Tutorial:
How to OCR pages in a PDF with free software

If you're willing to pay for something better, these two articles and one video discuss other OCR products:
Batch Conversion of PDF, TIFF, and Other Image Formats via Command Line Interface to PDF, PDF Searchable, and TIFF with Power PDF Advanced
PaperPort - How To Create Searchable PDF Files
Convert Scanned Image-Only PDF Files to PDF Searchable Image Files via OCR with Power PDF Advanced

Back to the freebies, here are some other OCR tools for you to consider and experiment with:

(1) Tesseract OCR Engine, an open source product:

(2) FreeOCR, which uses a compiled version of the Tesseract engine:

(3) GOCR/JOCR, an open source OCR package developed under the GNU General Public License:

(4) OCR Desktop, which is not open source, but is free for personal use (needs to be registered in order to turn off popups and advertising):

(5) SimpleOCR, which is not open source, but is free, with both an end-user version:

and a royalty-free SDK:

(6) Boxoft Free OCR (I use several Boxoft free tools):

(7) Google Drive/Docs has an option to perform OCR on uploaded files, but the last time I tried it (a while ago, so it might be better now), the resulting PDF did not hide the text layer, so the file looked ugly.
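For the first item in the list, here is a hedged sketch of what a command-line run of Tesseract might look like once it is installed. The wrapper name ocr_image is hypothetical, and English ("eng") is an assumed language; both are illustrative choices rather than recommendations from this thread.

```shell
#!/bin/sh
# Sketch: OCR one scanned image with the Tesseract command-line tool.
# ocr_image is a hypothetical wrapper; "eng" is an assumed language.
ocr_image() {
    image=$1
    base=${image%.*}    # strip the extension: scan.png -> scan
    # Writes the recognized plain text to "$base.txt".
    tesseract "$image" "$base" -l eng || return 1
    # Writes a searchable PDF (page image plus text layer) to "$base.pdf".
    tesseract "$image" "$base" -l eng pdf || return 1
    echo "$base.txt $base.pdf"
}
```

Calling ocr_image scan.png would be expected to leave scan.txt and scan.pdf next to the input; the accuracy caveats above still apply to whatever text comes out.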

Regarding the pdftotext tool, it is one of the Xpdf utilities. These two 5-minute EE videos will get you going on it:
Xpdf - Command Line Utility for PDF Files
Xpdf - Convert PDF Files to Plain Text Files

Of course, the PDF file must have text in it for pdftotext to work, i.e., pdftotext does not perform OCR; you must do the OCR first before feeding it to pdftotext.

Regards, Joe
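One way to check whether a given PDF already has a text layer, sketched under some assumptions: run pdftotext to standard output and count the non-whitespace characters it returns. The helper name has_text_layer is hypothetical, and the 20-character threshold is an arbitrary guess meant to ignore stray marks, not a documented rule.

```shell
#!/bin/sh
# Sketch: decide whether a PDF already carries extractable text.
# has_text_layer is a hypothetical helper; the 20-character threshold
# is an arbitrary assumption, not a documented rule.
has_text_layer() {
    # "-" sends the extracted text to stdout instead of a file.
    chars=$(pdftotext "$1" - 2>/dev/null | tr -d '[:space:]' | wc -c | tr -d ' ')
    [ "$chars" -ge 20 ]
}

# Example use (report.pdf is a placeholder file name):
#   if has_text_layer report.pdf; then
#       pdftotext -layout report.pdf report.txt
#   else
#       echo "report.pdf is image-only: run OCR first"
#   fi
```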

curiouswebster (Software Engineer, Author) commented:
Joe Winograd (Developer) commented:
You're welcome. Good luck on the project!