Software or process to harvest contact details from scanned paper checks

I'm looking for a software package or process to take 1,000 or so paper checks, harvest the contact details (and, if possible, the date and amount), and add them to a sortable list such as a CSV or Excel file.

I don't need to capture any of the MICR data, account numbers, routing numbers, or signature.

This isn't to process checks for payment, only for who wrote a check for how much and when, in a list.

The checks can either be scanned in at the time, or they can be processed in batch from a directory of JPG images or PDFs.

Ideally we'd like to see if there's something no cost, or low cost as this will only be done maybe once per year... so buying 'neat receipts' or the like is probably not cost effective.

What are some ideas that fit the task?

Steven Harris

If you already have the images in PDF and can run an OCR tool for text recognition, then you may be able to just run a VBA process to extract the text from the PDF to Excel.

FocIS

ASKER

If that's the case, that's what I'd need help with - the "run a vba process to extract the text from pdf to excel"

How? :)

Scanning them in isn't hard, and I can certainly run the OCR wizard in Acrobat Pro (not sure if it will get the total paid). I don't need to capture the written total in words, but the numbers from the amount block would be great.

Typically the date and total are hand-written. In either case, how to harvest the OCR'd text in a meaningful format?

ASKER CERTIFIED SOLUTION

Steven Harris

This solution is only available to members.

Joe Winograd

Are the checks handwritten or typewritten?

That makes a big difference. Typewritten text is amenable to Optical Character Recognition (OCR) and there are numerous free (and low cost) products out there that do an excellent job with high accuracy (I'll get to those in a moment). But handwriting is a different (and much more difficult) ballgame that requires a process known as Intelligent Character Recognition (ICR) or another one known as Intelligent Word Recognition (IWR). ICR recognizes cursive handwriting a character at a time, while IWR recognizes full words and phrases in cursive handwriting. The accuracy of ICR and IWR is way, way below that of OCR. I suspect that if your checks are handwritten, you will be extremely disappointed with the results, and you will be much better off hiring some low-cost labor to type the data into an Excel spreadsheet for you.

That said, here are some free OCR tools for you to consider and experiment with:

(1) Tesseract OCR Engine, an open source product now maintained by Google:
http://code.google.com/p/tesseract-ocr/

It has numerous add-ons:
http://code.google.com/p/tesseract-ocr/wiki/AddOns

(2) FreeOCR, which uses a compiled version of the Tesseract engine:
http://www.paperfile.net/

(3) GOCR/JOCR, an open source OCR package developed under the GNU General Public License:
http://jocr.sourceforge.net/

(4) OCR Desktop, which is not open source, but is free for personal use (needs to be registered in order to turn off popups and advertising):
http://www.ocrtools.com/fi/prdOCRFree.aspx

(5) SimpleOCR, which is not open source, but is free, with both an end-user version and a royalty-free SDK:
http://www.simpleocr.com/
http://www.simpleocr.com/Info.asp

(6) Boxoft Free OCR (I use several Boxoft free tools):
http://www.boxoft.com/free-ocr/

(7) Google Drive/Docs has an option to perform OCR on uploaded files, but the resulting PDF doesn't hide the text layer, so the files look ugly.

You said that "low cost" is OK, but you didn't define that. Assuming they qualify as "low cost" for you, two very well regarded OCR programs are Nuance OmniPage and ABBYY FineReader. Here are links to more information:
http://nuance.com/for-individuals/by-product/omnipage/index.htm
http://finereader.abbyy.com/

Here are links to feature comparison charts:
http://nuance.com/ucmprod/groups/imaging/@web-enus/documents/collateral/nc_016052.pdf
http://finereader.abbyy.com/editions_comparison_chart/

I use both and can say that both are very accurate, but I can't say that one is always better than the other. I've tested them on the same documents, and sometimes one is better, sometimes the other is, but for the most part, the accuracy is similar - both very good! And they both can create Excel files.

Another (non-free) idea is Nuance's PaperPort product, which is not a dedicated OCR package, but can perform OCR via Nuance's OmniPage, which is included "under the covers" (the OmniPage OCR engine is built into PaperPort):
http://nuance.com/for-individuals/by-product/paperport/index.htm

PaperPort is a robust scanning/imaging package that does a lot more than just OCR (but for pure OCR, is not as robust as OmniPage and FineReader). I use PaperPort extensively (more than OmniPage and FineReader combined). Its OCR capabilities (via the built-in OmniPage) may be adequate for your purposes. But if not, then go with OmniPage or FineReader.

Another non-free, but inexpensive, product is Nuance's PDF Converter Professional 8:
http://www.nuance.com/for-business/document-imaging-and-scanning/pdf-converter-professional/index.htm

Although the list price is $100, the street price is substantially less. It is currently $64 at Amazon:
http://www.amazon.com/Nuance-Communications-Inc-M109A-G00-8-0-Professional/dp/B0084PK8CS/

Yet another (non-free) possibility is Adobe Acrobat (not Adobe Reader), which is also a lot more than just OCR:
http://www.adobe.com/products/acrobat.html

I'm not a big fan of Acrobat (it's too expensive for what it does, in my opinion), but many folks like it and its built-in OCR is good.

One more idea: Microsoft Office Document Imaging (MODI) was bundled with Office 2003 and 2007. Here's a link to some good info about it:
http://office.microsoft.com/en-us/help/about-microsoft-office-document-imaging-HP001077103.aspx

MODI was removed from Office 2010, but here's an article on how to install it in 2010:
http://support.microsoft.com/kb/982760

Of course, MS Office is not free, but if you already have MS Office, then MODI is included at no additional charge.

Now for a key point. While today's OCR is very accurate, it is not 100%. There are always issues like the number "0" and the upper case "O"; the number "1" and the lower case "l"; and last names like "Turner", where the "r" and the "n" can be nearly touching in a proportional font, thereby causing the OCR to think it's the name "Tumer".

When creating searchable PDF files (a primary usage of OCR these days), most users are willing to live with the occasional OCR error. But since you're OCRing checks, where you expect the data to be 100% accurate, OCR alone won't do it. I like to say that the good news of OCR is that it's 99% accurate, and the bad news of OCR is that it's 99% accurate. :) This is why some folks, in some situations, use heads-down data entry instead of, or in conjunction with, OCR. And as stated earlier, you should definitely do the heads-down data entry approach if the checks are handwritten. In fact, for just 1,000 items only once a year, you may want to go that route instead of OCR, even if the checks are typewritten. Regards, Joe
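The digit/letter confusions described above can be partly cleaned up in fields that are known to be numeric, such as the amount block. The following is a minimal Python sketch under stated assumptions: the look-alike mapping and the function name `parse_amount` are illustrative, not part of any OCR product, and anything that still does not look like a dollar amount is returned as `None` so that a person can review it. It does nothing for name errors like "Turner" vs. "Tumer", which still need a human eye.

```python
import re

# Letters that OCR commonly confuses with digits (an assumed mapping; extend as needed).
# Apply this ONLY to fields expected to be numeric, never to names or addresses.
DIGIT_LOOKALIKES = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"})

def parse_amount(raw):
    """Return the amount as a float, or None if the field needs manual review."""
    cleaned = raw.strip().translate(DIGIT_LOOKALIKES)
    cleaned = cleaned.replace("$", "").replace(",", "").replace(" ", "")
    # Accept whole dollars or dollars with exactly two decimal places.
    if re.fullmatch(r"\d+(\.\d{2})?", cleaned):
        return float(cleaned)
    return None
```

Returning `None` rather than guessing keeps the uncertain checks visible, which matters when the data is expected to be accurate.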

Joe Winograd

FocIS,
My message crossed with yours...took me a while to write it. :)

With the date and total (even just the numbers in the amount block) being handwritten (and the payee, also, I presume), I think you will be very disappointed with Acrobat's OCR results. But it's simple to test. Forget for the moment how to harvest the OCRed text in a meaningful format. Let's see if it's worth harvesting! My bet is no, but I'd be happy to be proven wrong. Use Acrobat's OCR on a dozen checks and take a look at the OCR results. Regards, Joe

FocIS

ASKER

I was able to get the VB code working in a macro. It does harvest the text, but sort of just smears it all over the place - which is still better than typing it all in from scratch, but requires a lot of editing.

Joe, you've given a lot of ideas. I'm going to try each one, especially the Nuance line. I'll need a day or so to go through those things. I do have Acrobat X Pro, Office 2003 and Office 2010 already. Acrobat does decent OCR obviously, but I'm interested in seeing what Nuance can do, especially in regards to saving into Excel.

Thanks for the direction so far, both of you

FocIS

ASKER

As a quick followup, the Excel macro via Acrobat X Pro was able to glean most of the contact details and none of the dates/totals, and it interspersed seemingly endless useless data (parts of bank names, whatever text it made up from logos, etc.)

The payee is not important, the total is mildly important and doesn't have to be exact, dates would be nice... but in the end, getting close to a Christmas card list would be a good goal. Being able to rank payers based on how much they paid per year would be awesome.
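Once the payer, date, and amount are in a CSV (however they got there, by OCR or by typing), ranking payers by how much they paid per year is a small scripting job. A minimal Python sketch, assuming hypothetical column names `payer`, `date`, and `amount`, ISO-style dates, and made-up sample rows:

```python
import csv
import io
from collections import defaultdict

def rank_payers(rows):
    """Sum amounts per (year, payer) and list each year's payers, largest total first."""
    totals = defaultdict(float)
    for row in rows:
        year = row["date"][:4]  # assumes ISO dates such as 2012-11-03
        totals[(year, row["payer"].strip())] += float(row["amount"])
    ranked = defaultdict(list)
    for (year, payer), total in totals.items():
        ranked[year].append((payer, round(total, 2)))
    for year in ranked:
        ranked[year].sort(key=lambda item: item[1], reverse=True)
    return dict(ranked)

# Made-up sample data standing in for the real check list.
SAMPLE = """payer,date,amount
Pat Example,2012-03-01,50.00
Lee Sample,2012-04-15,125.00
Pat Example,2012-11-20,100.00
"""

ranking = rank_payers(csv.DictReader(io.StringIO(SAMPLE)))
```

For a real file, the `io.StringIO(SAMPLE)` part would be replaced by an opened CSV file. A pivot table in Excel would do the same job without any code.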

Joe Winograd

You're welcome. I have all of the Nuance products I mentioned and would be happy to run some tests for you if you feel comfortable posting sample checks (of course, redacting any private/sensitive data). I also have ABBYY FineReader 11. But my concern is your comment that it "requires a lot of editing"...and in the end, I wonder if the combination of OCRing and editing really is better than straight typing. Definitely worth some experimentation, though, and I look forward to hearing the results of your tests. Regards, Joe

Joe Winograd

Our posts crossed again. With "none of the dates/totals, and interspersed seemingly endless useless data (parts of bank names, what text it made up from logos, etc)", the editing task could prove to be as difficult as straight-up data entry. And if all you're looking for is two fields - the Payer and the Amount (or maybe three – the Date, too), I think low-cost labor may be the solution. Ferreting out those two (or three) fields from the large amount of garbage created by OCR (or ICR or IWR) could be more time-consuming (and expensive) than good ol' data entry. Regards, Joe

FocIS

ASKER

I'm really starting to think just manually entering the data is probably the best route.

I was (am) hoping there might be a product specifically set up to judge the "hot spots" of a paper check format, grab the fields, and flip them around into line format in Excel.

Every check (that I have) is machine printed for the contact details, and hand-written for the date/total... but what I just learned is -all the rest- of the data is unimportant and should be ignored.

Obviously a human can ignore the rest, but I think that only a program specifically designed for check scanning might be capable of this.
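Since the contact details are machine printed, one way to ignore the rest of the OCR output is to look for the one line that has a predictable shape, the city/state/ZIP line, and keep only it and the lines just above it. This is a rough Python sketch, not a tested product: the regular expression assumes US-style addresses, the two-lines-above rule assumes a three-line name/street/city block, and `find_contact_block` is a made-up name.

```python
import re

# City, two-letter state, ZIP (US-style): an assumed pattern for the last line
# of the printed address block.
CITY_STATE_ZIP = re.compile(r"^[A-Za-z .'-]+,?\s+[A-Z]{2}\s+\d{5}(-\d{4})?$")

def find_contact_block(ocr_text, lines_above=2):
    """Return [name, street, city/state/ZIP] for the first address-like block, or None."""
    lines = [ln.strip() for ln in ocr_text.splitlines() if ln.strip()]
    for i, line in enumerate(lines):
        if CITY_STATE_ZIP.match(line) and i >= lines_above:
            # Keep the matching line plus the lines directly above it.
            return lines[i - lines_above : i + 1]
    return None
```

Checks with a second name line, a phone number, or a bank address printed in the same style would need extra handling, so the output still has to be reviewed.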

SOLUTION

Joe Winograd

This solution is only available to members.

FocIS

ASKER

You're right, Joe. I have checked out FineReader and it's a nice program, but it still takes longer than 15 seconds per check to edit the data, as does Adobe.

I'm going to split points 85/15 if I can, because Joe has the ultimate answer, but I do like the VB macro from ThinkSpaceSolutions. I can customize that for some other things too.

I'll close this case on Sunday, assuming some rockstar doesn't pop in and say "this check-to-excel program works 100%" or some other such home run.

Joe Winograd

> I'm really starting to think just manually entering the data is probably the best route

Saw this after submitting my last post - agreed! That said, there are products that can do the "hot spots", as you called them. In the OCR business, they are known as "zones", and that type of OCR is known as zonal OCR (as opposed to full-text OCR, where the entire page is captured). Advanced OCR packages such as ABBYY FineReader and Nuance's OmniPage support zonal OCR, and you may be able to define those zones. However, since checks come in different sizes and types, the placement of those "hot spots" differs from check to check (unlike a fixed form with zones), so even the zonal OCR approach on your checks would be very iffy.
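To make the idea of zones concrete, here is a minimal Python sketch that defines zones as fractions of the image size and converts them to pixel boxes, so that scans at different resolutions share one template. The zone coordinates are invented for illustration and would have to be measured per check layout, which is exactly the weakness noted above. The OCR step itself is left out.

```python
# Hypothetical zones as fractions of the check image: (left, top, right, bottom).
# These numbers are illustrative only; each check layout would need its own template.
ZONES = {
    "payer":  (0.03, 0.05, 0.45, 0.30),
    "date":   (0.65, 0.15, 0.97, 0.30),
    "amount": (0.75, 0.32, 0.97, 0.48),
}

def zone_boxes(width, height, zones=ZONES):
    """Convert fractional zones to pixel boxes (left, upper, right, lower)."""
    return {
        name: (round(l * width), round(t * height), round(r * width), round(b * height))
        for name, (l, t, r, b) in zones.items()
    }

# Each box could then be used to crop the scan (for example with Pillow's
# Image.crop, which takes the same left/upper/right/lower order) before
# handing only that region to an OCR engine.
boxes = zone_boxes(1000, 400)
```

Cropping to a zone keeps bank names and logos out of the OCR input, but it does not make handwritten dates or amounts any easier to recognize.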

Joe Winograd

Our posts keep crossing, but it shows we're both working hard. :) Splitting points however you want is fine with me. So is waiting for the rock-star to come along! But since you're closing it tomorrow, the metaphor should be that we're waiting for a touchdown, not a home run. :)

Steven Harris

No need to send any points my way. Ultimately, Joe is the expert here; I was just running off of an idea on this one.

Joe Winograd

TSS,
Nice of you to say that...but, hey, that's an interesting chunk of code you found...definitely worth some points. Regards, Joe

FocIS

ASKER

The macro posted is very useful, but ultimately after having tested the suggested programs, it was far easier to just do it by hand.

Joe Winograd

I think that's the right call in this case. Regards, Joe