Need an OCR tool

I need a tool I can use to digitize a report, like the one attached here...

Report
Will this kind of report get a 100% successful conversion rate?

Eventually, I need the tool to be part of my website, but I have not yet chosen my back-end technology. For now, a simple Mac-based tool is fine, just so I can hand-convert a report and start using it while I program the back end.

Windows is okay, if there are limited Mac FREE versions.

I do have Office 365 (Mac) if there is a tool in there which I can use.

I am also interested in hearing what "plug-ins" can work when I deploy this to my website, for online OCR conversions.
newbieweb, Sr. Software Engineer, Asked:
 
Joe Winograd, Fellow & MVE, Developer, Commented:
> Will this kind of report get a 100% successful conversion rate?

In a word, NO! While today's OCR is very accurate, it is not 100%. There are always issues like the number "0" and the upper case letter "O"; the number "1" and the lower case letter "l"; and words like "modern", where the "r" and the "n" can be nearly touching in a proportional font, thereby causing OCR to think it's "modem". Even on good quality docs, it's very unlikely to be 100%; on the type of doc that you posted, I can guarantee that it won't be 100%.
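Because those look-alike characters are where OCR output most often goes wrong, it can help to flag suspicious lines for a manual check. Here is a minimal sketch, assuming the OCR output is plain text. The function name flag_suspects and the pattern are illustrative assumptions, not part of any OCR tool, and the pattern only catches a digit sitting next to "O", "l", or "I" (it cannot catch "modern" vs. "modem", and it will also flag legitimate codes that happen to match).

```shell
# Hypothetical helper: print (with line numbers) every line in which a digit
# sits directly next to a letter that OCR commonly confuses with a digit.
# Reads the named files, or standard input when no file is given.
flag_suspects() {
    grep -nE '[0-9][OlI]|[OlI][0-9]' "$@"
}

# Example: only the first line is reported, because "1O45" mixes a digit and "O".
printf 'Invoice 1O45\nmodern report\n' | flag_suspects
```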

> a printer like a ScanSnap
> tried to install ScanSnap, but it required the printer

Just to be clear, ScanSnap is a scanner, not a printer.

> Windows is okay, if there are limited Mac FREE versions.

As you know from our past threads together, I'm a Windows guy, not Mac. If you simply want something to experiment with for free, watch my 5-minute EE video Micro Tutorial:
How to OCR pages in a PDF with free software

If you're willing to pay for something better, these two articles and one video discuss other OCR products:
Batch Conversion of PDF, TIFF, and Other Image Formats via Command Line Interface to PDF, PDF Searchable, and TIFF with Power PDF Advanced
PaperPort - How To Create Searchable PDF Files
Convert Scanned Image-Only PDF Files to PDF Searchable Image Files via OCR with Power PDF Advanced

Back to the freebies, here are some other OCR tools for you to consider and experiment with:

(1) Tesseract OCR Engine, an open source product:
https://github.com/tesseract-ocr/tesseract

(2) FreeOCR, which uses a compiled version of the Tesseract engine:
http://www.paperfile.net/

(3) GOCR/JOCR, an open source OCR package developed under the GNU General Public License:
http://jocr.sourceforge.net/

(4) OCR Desktop, which is not open source, but is free for personal use (needs to be registered in order to turn off popups and advertising):
http://www.ocrtools.com/fi/prdOCRFree.aspx

(5) SimpleOCR, which is not open source, but is free, with both an end-user version:
http://www.simpleocr.com/

and a royalty-free SDK:
http://www.simpleocr.com/Info.asp#SDK

(6) Boxoft Free OCR (I use several Boxoft free tools):
http://www.boxoft.com/free-ocr/

(7) Google Drive/Docs has an option to perform OCR on uploaded files, but the last time I tried it (a while ago, so it might be better now), the resulting PDF did not hide the text layer, so the file looked ugly.
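To try the Tesseract engine from item (1) on a scanned page, a command-line run might look like the sketch below. It assumes the tesseract binary is already installed and on your PATH. The function name ocr_page and the file names are made up for illustration.

```shell
# Hypothetical wrapper around the Tesseract command-line tool.
# Usage: ocr_page IMAGE OUTPUT_BASE
# Writes OUTPUT_BASE.txt (plain text) and OUTPUT_BASE.pdf (searchable PDF).
ocr_page() {
    image=$1
    outbase=$2
    if [ ! -f "$image" ]; then
        echo "ocr_page: no such image: $image" >&2
        return 2
    fi
    if ! command -v tesseract >/dev/null 2>&1; then
        echo "ocr_page: tesseract is not installed" >&2
        return 127
    fi
    # Without an output format, tesseract writes plain text to OUTPUT_BASE.txt.
    tesseract "$image" "$outbase" || return 1
    # The "pdf" config asks for a searchable PDF instead.
    tesseract "$image" "$outbase" pdf
}

# Example (file names are placeholders):
#   ocr_page report-page1.png report-page1
```

Comparing the .txt output against the original page is a quick way to judge how well a given report converts before committing to a tool.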

Regarding the pdftotext tool, it is one of the Xpdf utilities. These two 5-minute EE videos will get you going on it:
Xpdf - Command Line Utility for PDF Files
Xpdf - Convert PDF Files to Plain Text Files

Of course, the PDF file must have text in it for pdftotext to work, i.e., pdftotext does not perform OCR; you must do the OCR first before feeding it to pdftotext.

Regards, Joe
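A quick way to tell whether a PDF still needs OCR is to see whether pdftotext gets any text out of it. The following is a minimal sketch under that assumption. The names has_text_layer and is_blank are hypothetical helpers, and a scanned PDF with a poor-quality text layer would still pass this check.

```shell
# Hypothetical helper: succeed if standard input holds only whitespace
# (spaces, tabs, newlines, form feeds), i.e. no real text.
is_blank() {
    ! grep -q '[^[:space:]]'
}

# Hypothetical helper: succeed if pdftotext extracts any visible text.
# The "-" output name tells pdftotext to write to standard output.
has_text_layer() {
    pdftotext "$1" - 2>/dev/null | { ! is_blank; }
}

# Example (file name is a placeholder):
#   has_text_layer report.pdf || echo "report.pdf needs OCR first"
```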
 
David Favor, Linux/LXD/WordPress/Hosting Savant, Commented:
A simple way to do this is to convert the .pdf files (most statements come in .pdf format) to text so you can parse them, via something like this...

pdftotext -enc ASCII7 -nopgbrk -layout "$file" - >"$outfile" 2>/dev/null

pdftotext is provided by the poppler-utils package on most Linux distros, and by the poppler port on MacPorts.
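On a Mac, pdftotext is usually installed through a package manager. The sketch below only prints the install command for whichever package manager it finds, rather than running it. The function name is hypothetical, and the package names (poppler on Homebrew and MacPorts, poppler-utils on Debian/Ubuntu) are assumptions worth verifying against your package manager before relying on them.

```shell
# Hypothetical helper: print (do not run) the command that installs pdftotext
# with the first package manager found on this machine.
pdftotext_install_cmd() {
    if command -v brew >/dev/null 2>&1; then
        echo "brew install poppler"                 # Homebrew (macOS)
    elif command -v port >/dev/null 2>&1; then
        echo "sudo port install poppler"            # MacPorts (macOS)
    elif command -v apt-get >/dev/null 2>&1; then
        echo "sudo apt-get install poppler-utils"   # Debian/Ubuntu
    else
        echo "no supported package manager found" >&2
        return 1
    fi
}

# Example: show the command, then confirm the tool is present after installing.
#   pdftotext_install_cmd
#   pdftotext -v
```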

If you're scanning documents, consider a printer like a ScanSnap which will do a double-sided scan in <1 second/page + generate a .pdf file with an embedded text version, which can be parsed too.
 
newbieweb, Sr. Software Engineer, Author Commented:
I am not looking for a new printer, at the moment. I tried to install ScanSnap, but it required the printer. What is your second choice for a Mac?

Your pdftotext conversion util looks interesting. It would be most helpful for me to use the same OCR tool during my early research for this project as I would use when I deploy the website.

1) Is this free to use?
2) How do I install it on my Mac?
 
newbieweb, Sr. Software Engineer, Author Commented:
thanks
 
Joe Winograd, Fellow & MVE, Developer, Commented:
You're welcome. Good luck on the project!
Question has a verified solution.
