Patrick O'Dea

OCR Me!

Hi again all and thanks for your help so far,

See attached.  This document was scanned by a cheap handheld scanner but it looks okay.
Proof of concept time!
I do not have OCR software.
I would like this single page transferred from jpeg to Excel.

The data that I am interested in is in the central piece of the paper.
I.e. Form numbers 0000018 to 0000622.  The other data is not needed.

Unfortunately there are a few "tick" marks in pen beside some of the numbers.
Ultimately, if I pursue this matter, I will have many, many pages with NO "tick" marks.

Bottom Line: Could someone send me back this file in Excel format?

Thanks,
OCRJACKA.jpg
John Easton

I have just tried this using OpenOCR and it seems to work fine.  OK, the ticks did cause an issue, but the other data comes out pretty well.

Go here to download: CutePDF Link

Although the site isn't in English, the software is, provided you download the "Cognitive OpenOCR (CuneiForm), english version (EXE, 31,9)" file.

Although the output opens as a Word document, it is a proper table and can therefore be copied to Excel quite easily.
Joe Winograd
Hi Dewsbury,
Hope it's a great day there in Dublin. Here's what I did:

(1) Used PaperPort to crop the image so that it has just the data you're interested in. The tick marks are killers, so I tried to crop them, too. The cropped JPG is attached.

(2) Used the Send To Bar in PaperPort Pro 14 to send the cropped JPG to Excel, having configured it to invoke OmniPage Pro 18.

(3) OmniPage does the OCR and gives you a chance to correct errors using its Proofreader. Here's the screen showing its initial OCR results:
[Image: OmniPage screen showing the initial OCR results]

And here's the screen showing the first Proofreader item:

[Image: OmniPage Proofreader screen, first item]

(4) I did not correct any errors. I simply clicked the Document Ready button and it automatically created the attached Excel spreadsheet. As you can see, it did pretty well, except for where the tick marks were. I wasn't able to crop them fully, but without the tick marks, you'll be in great shape! Also, notice that the "4" is missing in column A of row 15. That's because it is very faint in the source document and OCR missed it.

As I mentioned in a previous thread, JPGs are fine, and everything above was done with your JPG. That said, I want to point out that I do practically all of my scanning (of typical business docs) in black&white at 300 DPI. Many folks mistakenly think that the higher the resolution, the better it is for OCR. Not true, at least, not usually true. I have seen many situations where the OCR accuracy is better on a 300 DPI image than a 600 DPI one. In any case, I am getting excellent accuracy on b&w docs at 300 DPI with ABBYY FineReader, Nuance OmniPage, and Nuance PaperPort. On rare occasions, I'll scan in grayscale (usually at 200 DPI) and even rarer in color (at 150 or 200 DPI). I suggest that you take a look at Wayne Fulton's excellent website, "A few scanning tips":
http://scantips.com/

It has some very good tips and advice on scanning, including a section about OCR. Regards, Joe
OCRJACKA-cropped.jpg
PP14-OP18-on-OCRJACKA.xlsx
One other thing. Here's the OmniPage OCR Proofreader screen showing how the tick mark affects recognition of the 30:
[Image: OmniPage Proofreader screen showing the tick mark's effect on the 30]

It's a very simple matter to correct the error: just type a "0" where the "d" is and click the Change button. Of course, this won't be an issue for you if your source docs don't have the tick marks, but it's good to know that it's easy to correct any OCR errors, if you choose to.
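
For a long run of pages, the same kind of fix could also be applied after OCR rather than by hand in the Proofreader. Below is a minimal Python sketch, assuming the OCR results have already been exported as plain text cells. It is not part of OmniPage or any other OCR product; the substitution table and the function names are illustrative assumptions. It swaps letters that OCR commonly confuses with digits and flags anything that still isn't a 7-digit form number (the format of 0000018 to 0000622) for a manual look.

```python
import csv
import io
import re

# Assumed look-alike substitutions (letter misread for a digit).
# This table is illustrative; extend it to match your own scans.
LOOKALIKES = str.maketrans(
    {"d": "0", "O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"}
)

# Form numbers in the sample page are 7 digits, e.g. 0000018 .. 0000622.
FORM_NUMBER = re.compile(r"\d{7}")


def clean_form_number(raw):
    """Return (value, ok): the repaired text and whether it is a valid form number."""
    candidate = raw.strip().translate(LOOKALIKES)
    return candidate, FORM_NUMBER.fullmatch(candidate) is not None


def to_csv(raw_values):
    """Build CSV text with a second column marking cells that need manual review."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["form_number", "needs_review"])
    for raw in raw_values:
        value, ok = clean_form_number(raw)
        writer.writerow([value, "" if ok else "yes"])
    return out.getvalue()


if __name__ == "__main__":
    # Hypothetical OCR output: one clean cell, one "d" for "0", one unfixable cell.
    print(to_csv(["0000018", "00001d5", "0000x22"]))
```

One thing to watch: Excel drops leading zeros when it reads such a column as numbers, so import the column as text if the zeros matter.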

Btw, I configured it to create an Excel 2007-2010 file (.xlsx), but it can also be configured to create an Excel 97-2003 file (.xls). Cheers, Joe
Patrick O'Dea

ASKER

Joe,

The quality of your spreadsheet data is excellent.
I had been concerned that the fact that the paper is on a blue background might cause problems.

Can I just check: is it necessary to use two bits of software to achieve these results, (a) PaperPort and (b) OmniPage?

I will read Wayne Fulton's tips later.

Without giving away any state secrets ... I have just started working with a number of pharmacies (drugstores) who are provided with pages and pages of data on a monthly basis.  They need to analyse this data but it is not in electronic format, etc.

If I can get the data into Excel format then the rest is easy for me.

Thanks again,

Dewsbury
Joe,

A further thought.
Say I have a document with "x" pages and the first page is different in format from the remaining ones.

Can I set up a template in advance to process the first page in one way and the subsequent ones slightly differently?  Or is it necessary to "re-map" the document each time I process it?

I accept that some manual intervention will often be necessary.

Hopefully you understand what I mean!
No, you do not need both software packages. I used PaperPort only because I know it inside-and-out, and so it was easier to crop the JPG with it. But all you really need is the OmniPage software. I just OCR'ed both your original, uncropped JPG and the cropped version of it directly with OmniPage – no PaperPort involved in the process. The two resulting spreadsheets are attached. Regards, Joe
OCRJACKA-uncropped-straight-to-O.xlsx
OCRJACKA-cropped-straight-to-OP1.xlsx
I just noticed that the OCR came out all in one row with the uncropped JPG. That's fixable by setting up zones in OmniPage (and the zones may be stored in templates for easy re-use later). The cropped JPG came out great. Regards, Joe
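
To illustrate the template idea in general terms, here is a hypothetical Python sketch of choosing one set of zones for page 1 and another for every later page. It is not OmniPage's template format or API; the class, the names, and the pixel coordinates are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Zone:
    """A rectangular region of the page to OCR, in pixels (hypothetical layout)."""
    name: str
    left: int
    top: int
    right: int
    bottom: int


# Made-up coordinates: page 1 has a taller header, so its table starts lower.
FIRST_PAGE_ZONES = (Zone("form_numbers", 120, 900, 2300, 3200),)
LATER_PAGE_ZONES = (Zone("form_numbers", 120, 300, 2300, 3200),)


def zones_for_page(page_number):
    """Pick the stored zone set for a 1-based page number."""
    if page_number < 1:
        raise ValueError("page numbers start at 1")
    return FIRST_PAGE_ZONES if page_number == 1 else LATER_PAGE_ZONES


if __name__ == "__main__":
    for page in (1, 2, 3):
        print(page, zones_for_page(page))
```

The point of the design is that the zones are defined once and reused, so only the page number decides which layout applies.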
Also, attached is the spreadsheet created by printing out the cropped JPG and scanning it in with a Fujitsu ScanSnap S1500. It scanned straight to an Excel spreadsheet using the ABBYY FineReader that came bundled with the scanner (Build 8.0.2.650). I'm sure the commercial ABBYY FineReader 11 Professional would do a better job. Regards, Joe
OCRJACKA-cropped---ScanSnap-S150.xlsx
Lots of things for me to evaluate.

Final question for the moment.

I note that the ABBYY FineReader is much cheaper than the OmniPage Pro 18.
(At this stage the OmniPage would be too expensive.)

Will the ABBYY FineReader suffice for me (i.e. getting from jpeg/PDF to Excel cleanly)?
Hang on a minute ...

I probably don't need OmniPage Professional 18.

Their website suggests a price of €499 but there also appear to be much cheaper versions with less functionality as you delve into the website.
ASKER CERTIFIED SOLUTION
Joe Winograd
Our posts just crossed each other – you figured it out yourself! :)
Thx.

Now all I need is for my client to engage with the project ... he seems keen but who knows!
Good luck! I hope you land the deal! Even though the question is closed, I'd love to hear if you win the business and, if so, how the project goes. Please post back here with any exciting news. Regards, Joe