Very High Level of Accuracy in OCR Recognition

Which OCR engine is the most accurate, commercial or open source? I need very high quality results.
Software Programmer Asked:

Joe Winograd (Fellow & MVE, Developer) Commented:
I can't say what is "Best" for two reasons. First, although I've tried many OCR packages (both free and not free), I haven't tried them all. For example, Readiris gets good reviews, but I've never used it. Second, it sometimes depends on the documents. For example, I've tested different OCR packages on the same documents, and sometimes one is better, sometimes another one is.
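
For anyone who wants to run that kind of side-by-side test on their own documents, here is a minimal sketch of one common way to score the results: character error rate, computed as edit distance divided by the length of a hand-checked transcription. The function names and the sample strings are hypothetical and are not taken from any OCR package.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def character_error_rate(ground_truth: str, ocr_output: str) -> float:
    """Edit distance divided by the length of the ground-truth text."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(ground_truth, ocr_output) / len(ground_truth)


# Hypothetical outputs from two engines for the same scanned line.
truth = "Invoice 2041"
outputs = {"engine_a": "Invoice 2041", "engine_b": "lnvoice 2O41"}
for name, text in outputs.items():
    print(name, character_error_rate(truth, text))
```

A lower rate is better. Running every candidate engine over the same set of pages and comparing the rates per document makes the "sometimes one is better, sometimes another" effect visible.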

Since you are looking for engines, not end-user packages (and your "name" here is Software Programmer), I'll trim down my standard OCR recommendations to just engines (callable via API, SDK, command line):

• Tesseract OCR Engine, an open source product:
https://github.com/tesseract-ocr/tesseract

• GOCR/JOCR, an open source OCR package developed under the GNU General Public License:
http://jocr.sourceforge.net

• SimpleOCR, which is not open source, but has a royalty-free SDK:
https://www.simpleocr.com/OCR-SDK

• ABBYY FineReader Engine, which is commercial:
https://www.abbyy.com/en-us/ocr-sdk

• Nuance's OmniPage Capture SDK, which is commercial:
https://www.nuance.com/print-capture-and-pdf-solutions/optical-character-recognition/omnipage/omnipage-capture-sdk-for-windows.html
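
As an illustration of the command-line route, here is a minimal sketch that calls Tesseract from Python. It assumes the `tesseract` executable is installed and on the PATH, and that it is version 4 or later (older releases spell the option `-psm`). The function names and the `scan.png` file name are placeholders.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, lang="eng", psm=3):
    """Build the argument list for the Tesseract command-line tool.

    Using "stdout" as the output base makes Tesseract print the
    recognized text instead of writing it to a .txt file.
    """
    return ["tesseract", image_path, "stdout", "-l", lang, "--psm", str(psm)]


def run_tesseract(image_path, lang="eng", psm=3):
    """Run Tesseract on one image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_tesseract_command(image_path, lang, psm),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout


# Example call ("scan.png" is a placeholder):
# text = run_tesseract("scan.png", lang="eng", psm=6)
```

The `-l` option selects the language data and `--psm` selects the page segmentation mode. Both are worth experimenting with, because the best values depend on the documents.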

I use both FineReader and OmniPage for the bulk of my OCR (the end-user packages, not the SDKs) and can say that both are very accurate, but, as mentioned above, I can't say that one is always better than the other. Sometimes one is better, sometimes the other is, but for the most part, the accuracy is similar: both excellent!

I use the OmniPage engine the most, because my go-to document scanning/imaging packages are Nuance's PaperPort (I use the latest version, 14.5 with Patch 1) and Power PDF Advanced (I use the latest version, 2.1), both of which use the OmniPage engine under the covers.

Regards, Joe
Software Programmer (Author) Commented:
How about Google Cloud Vision OCR? Have you tried it?
Joe Winograd (Fellow & MVE, Developer) Commented:
> How about Google Cloud Vision OCR? Have you tried it?

No, but if it uses the same OCR engine as Google Docs/Drive, it's mediocre. As you probably know, Google Drive/Docs has an option to perform OCR on uploaded files, but the last time I tried it, the accuracy wasn't good. Also, the resulting PDF did not hide the text layer, so the file looked ugly, although for your purposes that probably doesn't matter, since all you care about is the accuracy of the engine.

I haven't tried it in a while, so maybe it's better now, and it's also possible that the Cloud Vision API uses a different OCR engine from Drive/Docs. Can't hurt to put it on your short list, but I'd be very surprised if it has the accuracy of FineReader or OmniPage...unless Google is OEM'ing one of them. :)

It's more likely to be based on Tesseract, since Google has been involved with Tesseract for more than a decade, and all of my personal experience with Tesseract places it well short of FineReader and OmniPage in accuracy. That said, if you try Tesseract, this article may help:
http://vbridge.co.uk/2012/11/05/how-we-tuned-tesseract-to-perform-as-well-as-a-commercial-ocr-package/
Software Programmer (Author) Commented:
I tried the following and it works pretty well. pls test and let me know

1. Go to https://cloud.google.com/vision/ or https://cloud.google.com/vision/docs/drag-and-drop
2. In the middle of the page there is a drag-and-drop window for image files.
3. Drop a file and scan.
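
The drag-and-drop demo above can also be reproduced from code. The following is a minimal sketch, assuming the Cloud Vision REST endpoint `images:annotate` with the `TEXT_DETECTION` feature and an API key for a project that has the Vision API enabled. The function names are hypothetical, and the exact request and response fields should be checked against Google's current documentation before relying on them.

```python
import base64
import json
import urllib.request

VISION_ENDPOINT = "https://vision.googleapis.com/v1/images:annotate"


def build_ocr_request(image_bytes):
    """Build the JSON body for a single TEXT_DETECTION request."""
    return {
        "requests": [{
            "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
            "features": [{"type": "TEXT_DETECTION"}],
        }]
    }


def detect_text(image_path, api_key):
    """Send one image file to the API and return the recognized text."""
    with open(image_path, "rb") as f:
        body = json.dumps(build_ocr_request(f.read())).encode("utf-8")
    request = urllib.request.Request(
        f"{VISION_ENDPOINT}?key={api_key}",
        data=body,  # supplying data makes this a POST request
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        reply = json.load(response)
    # Assumption: fullTextAnnotation is missing when no text is found.
    return reply["responses"][0].get("fullTextAnnotation", {}).get("text", "")


# Example call (placeholder file name and key):
# print(detect_text("scan.png", "YOUR_API_KEY"))
```

Only the standard library is used here. Google also publishes client libraries that wrap the same request.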
Software Programmer (Author) Commented:
It shows a better result than Tesseract. Does any configuration need to be done in Tesseract to get the same result as the Google OCR API?
Joe Winograd (Fellow & MVE, Developer) Commented:
> pls test and let me know

It doesn't accept PDF files...gives this:

[Image: Google Vision cannot handle PDF file]

That makes it difficult to test thoroughly because you can't feed it a multi-page file (of course, that has nothing to do with the accuracy of the engine). I sent it a one "page" PNG and it did very well on that (same results as PaperPort 14.5/Patch 1 and Power PDF Advanced V2.1).

> Does any configuration need to be done in Tesseract to get the same result as the Google OCR API?

Read the article that I sent in my previous post:
How we tuned Tesseract to perform as well as a commercial OCR package

Regards, Joe
Software Programmer (Author) Commented:
We don't need PDF conversion; just an image conversion is required. Google OCR seems good. Why is Tesseract behaving differently from the Google OCR engine if both are developed by Google?
Joe Winograd (Fellow & MVE, Developer) Commented:
> We don't need PDF conversion and just an image conversion is required.

That's good, since the Google engine doesn't seem to be able to handle PDF.

> Google OCR seems good.

Yes, it does.

> Why is Tesseract behaving differently from the Google OCR engine?

Probably because they're different engines, but I don't know that for a fact. Could also be that Google Cloud Vision OCR is a heavily tweaked Tesseract, but I doubt it.

> if both are developed by Google

First, Google didn't develop Tesseract initially...it was HP, beginning in 1985. The Wikipedia article on it discusses its history:
https://en.wikipedia.org/wiki/Tesseract_(software)

Note the comment that "Tesseract development has been sponsored by Google since 2006."

Second, a company can certainly develop two different products in the same space. In fact, that happens often.

Regards, Joe