
Variability in quality and accuracy of OCR

Hello,

What factors determine the quality and accuracy of OCR (optical character recognition) and is there much variability among different OCR software?

If there is variability in software, what applications are best (both free and purchased)?

Thanks

WeThotUWasAToad

ASKER

Addendum:

Which applications are simplest/quickest (ie have the fewest steps)?

For example, is there an application which OCR's the text as soon as it is on the clipboard — ie so you can grab a screenshot of some image text and then paste it directly into Notepad or Word as editable text.

Microsoft OneNote has an OCR capability but it requires the following steps:

1) Capture a screenshot of the desired text
2) Paste the screenshot in OneNote
3) Right-click and select Copy Text from Picture
4) Content is now on the clipboard
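
For illustration, the one-step workflow asked about above (image on the clipboard in, editable text on the clipboard out) could be sketched roughly as follows. This is a minimal sketch, not how OneNote or any tool named in this thread works internally. It assumes three third-party Python packages that nobody here has mentioned: Pillow (`ImageGrab.grabclipboard`), pytesseract (`image_to_string`, which also needs the Tesseract engine installed) and pyperclip (`copy`). The names `tidy`, `clipboard_ocr` and `run_with_defaults` are made up for the example.

```python
from typing import Callable, Optional


def tidy(text: str) -> str:
    """Strip trailing spaces per line and blank lines at both ends of OCR output."""
    lines = [line.rstrip() for line in text.splitlines()]
    return "\n".join(lines).strip("\n")


def clipboard_ocr(
    grab_image: Callable[[], Optional[object]],
    recognize: Callable[[object], str],
    set_text: Callable[[str], None],
) -> Optional[str]:
    """Replace an image on the clipboard with its recognized text.

    Returns the text, or None when the clipboard holds no image.
    """
    image = grab_image()
    if image is None:
        return None
    text = tidy(recognize(image))
    set_text(text)
    return text


def run_with_defaults() -> Optional[str]:
    """Wire the pipeline to Pillow, pytesseract and pyperclip (assumed installed)."""
    # Imported lazily so the pipeline above can be loaded without these packages.
    from PIL import Image, ImageGrab
    import pytesseract
    import pyperclip

    def grab():
        data = ImageGrab.grabclipboard()
        # grabclipboard() may also return a list of file names or None; skip those.
        return data if isinstance(data, Image.Image) else None

    return clipboard_ocr(grab, pytesseract.image_to_string, pyperclip.copy)
```

With those packages in place, the idea would be to take a screenshot, call `run_with_defaults()` (for example from a hotkey), and then paste straight into Notepad or Word. Accuracy would depend entirely on the OCR engine behind it.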

Alan

Hi WeThotUWasAToad,

I usually use OneNote for a local app (free to all), and onlineocr.net if doing it online (there are loads of online options of course).

I find the latter better, but if the item is confidential, then I tend to be cautious using the online options.

Alan.

Joe Winograd

Hi Steve,
A quick note to let you know that I'm working on a comprehensive post for you. Regards, Joe

ASKER CERTIFIED SOLUTION

Joe Winograd


dbrunton

Variability I have found can be caused by:

Source documents. Newspapers, cheap books, worn books and thin-paged books can be hard to scan.
Scanner. Get a high-resolution one. A base resolution minimum of 2400 dpi optical. Note the word optical: if the scanner relies on interpolated resolution to reach that 2400 dpi, you don't want it.

If you use the free Irfanview image viewer then there is the Kadmos OCR plugin you can get which offers basic OCR for small sections of text in graphics.

Joe Winograd

> Source documents.

Absolutely!

> A base resolution minimum of 2400 dpi optical.

I disagree on this. Scanning at 2400 DPI is not needed for OCR. In fact, I have found that 300 DPI B&W is fine to OCR most documents. Occasionally I'll do 300 DPI grayscale and sometimes crank it up to 600 DPI B&W, but never more than that for OCR. I'll mention again Wayne Fulton's excellent site, especially the section that discusses OCR, where he says, "for OCR, use 300 dpi and Line art mode. Line art mode is 1-bit 2-color (B&W)".
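
To make "Line art mode is 1-bit 2-color (B&W)" concrete, here is a small sketch of what that conversion amounts to and what the modes cost in raw data. It is only an illustration: the fixed threshold of 128 is an assumption (scanner software picks its own cutoff), the sizes are uncompressed and ignore row padding, and the function names are made up for the example.

```python
def to_line_art(gray_rows, threshold=128):
    """Map 8-bit grayscale pixels (0 = black, 255 = white) to 1-bit values.

    Pixels below the threshold become 0 (black); all others become 1 (white).
    """
    return [[0 if px < threshold else 1 for px in row] for row in gray_rows]


def raw_scan_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size in bytes of a scan, ignoring row padding and headers."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8  # round up to whole bytes


# A US Letter page (8.5 x 11 inches) in the modes discussed here:
letter_300_bw = raw_scan_bytes(8.5, 11, 300, 1)    # 300 DPI line art
letter_300_gray = raw_scan_bytes(8.5, 11, 300, 8)  # 300 DPI grayscale
letter_2400_bw = raw_scan_bytes(8.5, 11, 2400, 1)  # 2400 DPI line art
```

Under those assumptions, a Letter page is about 1 MB raw at 300 DPI line art, about 8.4 MB at 300 DPI grayscale, and about 67 MB at 2400 DPI line art, which is 64 times the data of the 300 DPI line-art scan.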

> the Kadmos OCR plugin you can get which offers basic OCR for small sections of text in graphics

I've experimented with the IrfanView KADMOS plugin in the past and found its OCR to have relatively low accuracy. Maybe it's better now — I haven't downloaded an updated version in at least a year or two. So, question for dbrunton: how is the KADMOS OCR accuracy these days? Thanks, Joe

P.S.: I just noticed at the KADMOS IrfanView plugin site that the last KADMOS update was on 19-Dec-2013. Mine is dated 21-Jan-2011 and that's what I did my experiments with — how time flies! It's certainly possible that the 2013 plugin is better.

dbrunton

>> How is the KADMOS OCR accuracy these days

Good enough for my purposes, which are scanning old motorcycle owner's handbooks and turning print into HTML. For small chunks of text, say a couple of paragraphs, it is ideal. For anything larger, use a proper OCR application.

>> In fact, I have found that 300 DPI B&W is fine to OCR

Which is a little different to my results. I went from an old HP which was doing either 300 or 600 to a newer one that did 2400 and the OCR (OmniPage) jumped in recognition accuracy. Now my source documents for this were very old car hand books (sixty years old) with extremely small print.
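
A rough way to see why small print might want more resolution: the number of pixels a character gets scales with both type size and DPI. The sketch below is plain arithmetic using 1 point = 1/72 inch. The point sizes are hypothetical (nobody measured the handbooks), and it says nothing about how many pixels any particular OCR engine actually needs.

```python
def em_height_px(points, dpi):
    """Height in pixels of a font's nominal size when scanned at a given DPI.

    Uses 1 point = 1/72 inch; actual letter shapes are smaller than this box.
    """
    return points * dpi / 72.0


# Hypothetical sizes: 10 pt for an ordinary business document,
# 5 pt standing in for "extremely small print".
for points in (10, 5):
    for dpi in (300, 600, 2400):
        print(f"{points:>2} pt at {dpi:>4} DPI: {em_height_px(points, dpi):7.1f} px")
```

Under these assumptions, 5 pt type scanned at 600 DPI gets the same pixel height as 10 pt type at 300 DPI, so halving the type size means doubling the DPI to keep the same detail per character.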

Joe Winograd

> my source documents for this were very old car hand books (sixty years old) with extremely small print

I suppose that could explain it. I deal mostly in typical business documents.

This piqued my interest and I just ran an experiment on a public domain page from Pride and Prejudice (attached). Here's a screenshot of the KADMOS results (after loading the PNG into IrfanView):

Here's a screenshot of the Capture2Text results, based on OCR'ing the on-screen image:

When I get some spare time, I'll experiment with the 2013 KADMOS plugin — maybe it will do better than what's shown above. Thanks for letting us know about your results. Regards, Joe
ocr-test-page.png

WeThotUWasAToad

ASKER

This is a masterful post Joe. In fact, "comprehensive" almost seems like an understatement — but I can't think of a more accurate adjective at present. :)

You answered everything I was wondering plus more.

Many thanks.

WeThotUWasAToad

ASKER

PS "exhaustive" maybe?

PPS P&P is one of my very favorite novels. In fact, I love just about everything by Jane Austen. The P&P movies are great too, eg the one with Rosamund Pike (yowza) & Donald Sutherland — and especially the 6 hr A&E version starring Colin Firth & Jennifer Ehle. And mentioning the latter, also brings to mind Jane Austen's S&S from the 90's (Emma Thompson, Alan Rickman, Hugh Grant). Great books/movies.

Joe Winograd

Hi Steve,
I'll take "exhaustive". :) Thanks for all the compliments — very nice to hear!

Yes, the Jennifer Ehle and Colin Firth 6-episode show was great — another winner from the BBC. So was the Rosamund Pike and Donald Sutherland movie — and don't forget, a not-too-shabby Keira Knightley was in that, too. :) Regards, Joe