Solved

Software or process to harvest contact details from scanned paper checks

Posted on 2014-02-01
Last Modified: 2014-02-04
I'm looking for a software package or process that can take 1,000 or so paper checks, harvest the contact details and, if possible, the date and amount, and add them to a sortable list such as a CSV or Excel file.

I don't need to capture any of the MICR data, account numbers, routing numbers, or signature.

This isn't to process checks for payment, only to list who wrote a check, for how much, and when.

The checks can either be scanned in at the time, or processed in batch from a directory of JPG images or PDFs.

Ideally we'd like something no-cost or low-cost, as this will only be done maybe once per year, so buying NeatReceipts or the like is probably not cost-effective.

What are some ideas that fit the task?

Question by:FocIS

Expert Comment

by:Steven Harris
If you already have the images in PDF and can run an OCR tool for text recognition, you may be able to just run a VBA process to extract the text from the PDF to Excel.

Author Comment

by:FocIS
If that's the case, that's what I'd need help with: the "run a vba process to extract the text from pdf to excel" part.

How?  :)

Scanning them in isn't hard, and I can certainly run the OCR wizard in Acrobat Pro (not sure if it will get the total paid). I don't need to capture the written total in words, but the numbers from the amount block would be great.

Typically the date and total are hand-written. In either case, how do I harvest the OCR'd text in a meaningful format?

Accepted Solution

by:Steven Harris (earned 100 total points)
Here is some code I ran across from a user by the name of crimson_b1ade. Run from Excel, this script opens a PDF file, then uses the SendKeys command to copy all recognized text and paste it into Excel.

If possible, one large PDF would be best. I would test with a few pages first to see what the output looks like and whether it can be edited to fit your needs.

This code was written for Adobe Reader 9, so adjust the AdobeApp path to match your installation. You will need to put the file path in place of the "FILE NAME HERE" string, keeping the quotes around the path.

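Sub StartAdobe()
'=========
'by crimson_b1ade
'http://tinyurl.com/lsulgh7
'=========
' Opens the PDF in Adobe Reader, then schedules the copy and paste steps.
Dim AdobeApp As String
Dim AdobeFile As String
Dim TaskId As Variant
AdobeApp = "C:\Program Files\Adobe\Reader 9.0\Reader\AcroRd32.exe"
AdobeFile = "FILE NAME HERE"
' Wrap both paths in double quotes so spaces in them do not split the command line.
TaskId = Shell("""" & AdobeApp & """ """ & AdobeFile & """", vbNormalFocus)
' Give Reader five seconds to open the file before any keystrokes are sent.
Application.OnTime Now + TimeValue("00:00:05"), "FirstStep"
End Sub

Private Sub FirstStep()
' Reader has the focus at this point: select all recognized text and copy it.
' The second argument makes VBA wait until each keystroke has been processed.
SendKeys "^a", True
SendKeys "^c", True
Application.OnTime Now + TimeValue("00:00:10"), "SecondStep"
End Sub

Private Sub SecondStep()
' Bring Excel back to the foreground; adjust the title string if your
' Excel window is titled differently.
AppActivate "Microsoft Excel"
Range("A1").Activate
' Paste the copied text starting at cell A1.
SendKeys "^v"
End Sub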

Expert Comment

by:Joe Winograd, EE MVE
Are the checks handwritten or typewritten?

That makes a big difference. Typewritten text is amenable to Optical Character Recognition (OCR) and there are numerous free (and low cost) products out there that do an excellent job with high accuracy (I'll get to those in a moment). But handwriting is a different (and much more difficult) ballgame that requires a process known as Intelligent Character Recognition (ICR) or another one known as Intelligent Word Recognition (IWR). ICR recognizes cursive handwriting a character at a time, while IWR recognizes full words and phrases in cursive handwriting. The accuracy of ICR and IWR is way, way below that of OCR. I suspect that if your checks are handwritten, you will be extremely disappointed with the results, and you will be much better off hiring some low-cost labor to type the data into an Excel spreadsheet for you.

That said, here are some free OCR tools for you to consider and experiment with:

(1) Tesseract OCR Engine, an open source product now maintained by Google:
http://code.google.com/p/tesseract-ocr/

It has numerous add-ons:
http://code.google.com/p/tesseract-ocr/wiki/AddOns

(2) FreeOCR, which uses a compiled version of the Tesseract engine:
http://www.paperfile.net/

(3) GOCR/JOCR, an open source OCR package developed under the GNU General Public License:
http://jocr.sourceforge.net/

(4) OCR Desktop, which is not open source, but is free for personal use (needs to be registered in order to turn off popups and advertising):
http://www.ocrtools.com/fi/prdOCRFree.aspx

(5) SimpleOCR, which is not open source, but is free, with both an end-user version and a royalty-free SDK:
http://www.simpleocr.com/
http://www.simpleocr.com/Info.asp

(6) Boxoft Free OCR (I use several Boxoft free tools):
http://www.boxoft.com/free-ocr/

(7) Google Drive/Docs has an option to perform OCR on uploaded files, but the resulting PDF doesn't hide the text layer, so the files look ugly.

You said that "low cost" is OK, but you didn't define that. Assuming they qualify as "low cost" for you, two very well regarded OCR programs are Nuance OmniPage and ABBYY FineReader. Here are links to more information:
http://nuance.com/for-individuals/by-product/omnipage/index.htm
http://finereader.abbyy.com/

Here are links to feature comparison charts:
http://nuance.com/ucmprod/groups/imaging/@web-enus/documents/collateral/nc_016052.pdf
http://finereader.abbyy.com/editions_comparison_chart/

I use both and can say that both are very accurate, but I can't say that one is always better than the other. I've tested them on the same documents, and sometimes one is better, sometimes the other is, but for the most part, the accuracy is similar - both very good! And they both can create Excel files.

Another (non-free) idea is Nuance's PaperPort product, which is not a dedicated OCR package, but can perform OCR via Nuance's OmniPage, which is included "under the covers" (the OmniPage OCR engine is built into PaperPort):
http://nuance.com/for-individuals/by-product/paperport/index.htm

PaperPort is a robust scanning/imaging package that does a lot more than just OCR (but for pure OCR, is not as robust as OmniPage and FineReader). I use PaperPort extensively (more than OmniPage and FineReader combined). Its OCR capabilities (via the built-in OmniPage) may be adequate for your purposes. But if not, then go with OmniPage or FineReader.

Another non-free, but inexpensive, product is Nuance's PDF Converter Professional 8:
http://www.nuance.com/for-business/document-imaging-and-scanning/pdf-converter-professional/index.htm

Although the list price is $100, the street price is substantially less. It is currently $64 at Amazon:
http://www.amazon.com/Nuance-Communications-Inc-M109A-G00-8-0-Professional/dp/B0084PK8CS/

Yet another (non-free) possibility is Adobe Acrobat (not Adobe Reader), which is also a lot more than just OCR:
http://www.adobe.com/products/acrobat.html

I'm not a big fan of Acrobat (it's too expensive for what it does, in my opinion), but many folks like it and its built-in OCR is good.

One more idea: Microsoft Office Document Imaging (MODI) was bundled with Office 2003 and 2007. Here's a link to some good info about it:
http://office.microsoft.com/en-us/help/about-microsoft-office-document-imaging-HP001077103.aspx

MODI was removed from Office 2010, but here's an article on how to install it in 2010:
http://support.microsoft.com/kb/982760

Of course, MS Office is not free, but if you already have MS Office, then MODI is included at no additional charge.
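
If MODI is installed, its OCR engine can also be driven from VBA directly, without going through the clipboard. The following is a minimal sketch, assuming the scans are TIFF files in one folder (the C:\Checks\ path is a placeholder) and that the MODI.Document COM object is registered on the machine. It writes each file name and its recognized text to columns A and B of the active sheet. Like any full-text OCR, it should not be expected to read handwriting reliably.

```vba
' Sketch only: batch-OCR a folder of TIFF scans with MODI (Office 2003/2007).
' SCAN_FOLDER is a hypothetical path; change it to where your scans live.
Sub OcrFolderWithModi()
    Const SCAN_FOLDER As String = "C:\Checks\"
    Const miLANG_ENGLISH As Long = 9      ' MODI language ID for English
    Dim doc As Object, f As String, r As Long

    r = 1
    f = Dir(SCAN_FOLDER & "*.tif")
    Do While f <> ""
        Set doc = CreateObject("MODI.Document")   ' late binding, no reference needed
        doc.Create SCAN_FOLDER & f
        ActiveSheet.Cells(r, 1).Value = f

        On Error Resume Next                      ' OCR raises an error when it finds no text
        doc.OCR miLANG_ENGLISH, True, True        ' auto-orient and straighten the image
        If Err.Number = 0 Then
            ActiveSheet.Cells(r, 2).Value = doc.Images(0).Layout.Text
        End If
        On Error GoTo 0

        doc.Close False                           ' do not save OCR data back into the file
        Set doc = Nothing
        r = r + 1
        f = Dir                                   ' next matching file
    Loop
End Sub
```

Each check ends up as one row, which is easier to clean up than a single pasted block of text, but the recognized text still contains everything printed on the check.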

Now for a key point. While today's OCR is very accurate, it is not 100%. There are always issues like the number "0" and the upper case "O"; the number "1" and the lower case "l"; and last names like "Turner", where the "r" and the "n" can be nearly touching in a proportional font, thereby causing the OCR to think it's the name "Tumer".
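
For a field that should be purely numeric, the look-alike characters described above can be mapped back to digits before the value is stored. A minimal VBA sketch follows; the function name NormalizeAmount and its substitution list are illustrative assumptions, not part of any OCR product. It cannot help with errors like "Tumer" for "Turner", which need a human eye.

```vba
' Hypothetical helper: map common OCR look-alikes to digits in a numeric field.
' Apply it only to cells that are supposed to hold an amount; it would
' corrupt names and other ordinary text.
Function NormalizeAmount(ByVal raw As String) As String
    Dim s As String
    s = Trim$(raw)
    s = Replace(s, "O", "0")   ' upper-case letter O -> zero
    s = Replace(s, "o", "0")   ' lower-case letter o -> zero
    s = Replace(s, "l", "1")   ' lower-case letter L -> one
    s = Replace(s, "I", "1")   ' upper-case letter i -> one
    s = Replace(s, "$", "")    ' drop the currency sign
    s = Replace(s, ",", "")    ' drop thousands separators
    NormalizeAmount = s
End Function
```

For example, NormalizeAmount("$1,2O5.l0") returns 1205.10. It can also be called from a worksheet cell as =NormalizeAmount(A2).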

When creating searchable PDF files (a primary usage of OCR these days), most users are willing to live with the occasional OCR error. But since you're OCRing checks, where you expect the data to be 100% accurate, OCR alone won't do it. I like to say that the good news of OCR is that it's 99% accurate, and the bad news of OCR is that it's 99% accurate. :)  This is why some folks, in some situations, use heads-down data entry instead of, or in conjunction with, OCR. And as stated earlier, you should definitely do the heads-down data entry approach if the checks are handwritten. In fact, for just 1,000 items only once a year, you may want to go that route instead of OCR, even if the checks are typewritten. Regards, Joe

Expert Comment

by:Joe Winograd, EE MVE
FocIS,
My message crossed with yours...took me a while to write it. :)

With the date and total (even just the numbers in the amount block) being handwritten (and the payee, also, I presume), I think you will be very disappointed with Acrobat's OCR results. But it's simple to test. Forget for the moment how to harvest the OCRed text in a meaningful format. Let's see if it's worth harvesting! My bet is no, but I'd be happy to be proven wrong. Use Acrobat's OCR on a dozen checks and take a look at the OCR results. Regards, Joe

Author Comment

by:FocIS
I was able to get the VBA code working in a macro. It does harvest the text but sort of just smears it all over the place, which is still better than typing straight up, but requires a lot of editing.

Joe, you've given a lot of ideas, and I'm going to try each one, especially the Nuance line. I'll need a day or so to go through those things. I do have Acrobat X Pro, Office 2003 and Office 2010 already. Acrobat does decent OCR obviously, but I'm interested in seeing what Nuance can do, especially in regard to saving into Excel.

Thanks for the direction so far, both of you.

Author Comment

by:FocIS
As a quick follow-up, the Excel macro via Acrobat X Pro was able to garner most of the contact details, none of the dates/totals, and interspersed seemingly endless useless data (parts of bank names, what text it made up from logos, etc).

The payee is not important, the total is mildly important and doesn't have to be exact, and dates would be nice. In the end, getting close to a Christmas card list would be a good goal. Being able to rank payers based on how much they paid per year would be awesome.
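
Once the payer, date, and amount are in a sheet (however they got there), ranking payers by yearly total is a short macro. This is a minimal sketch under an assumed layout that is not from the thread: the active sheet has a header in row 1, with column A holding the payer, B the check date, and C the amount. The results overwrite columns E to G.

```vba
' Sketch: total the amounts per payer per year and rank them.
' Assumed layout: header in row 1, A = payer, B = check date, C = amount.
Sub RankPayersByYear()
    Dim totals As Object, ws As Worksheet
    Dim lastRow As Long, r As Long, k As Variant, parts() As String

    Set ws = ActiveSheet
    Set totals = CreateObject("Scripting.Dictionary")
    totals.CompareMode = vbTextCompare        ' treat "SMITH" and "Smith" as one payer
    lastRow = ws.Cells(ws.Rows.Count, "A").End(xlUp).Row

    ' Accumulate one running total per "year|payer" key; skip unusable rows.
    For r = 2 To lastRow
        If IsDate(ws.Cells(r, "B").Value) And IsNumeric(ws.Cells(r, "C").Value) Then
            k = Year(ws.Cells(r, "B").Value) & "|" & Trim$(ws.Cells(r, "A").Value)
            totals(k) = totals(k) + CDbl(ws.Cells(r, "C").Value)
        End If
    Next r

    ' Write the totals to E:G.
    ws.Range("E1:G1").Value = Array("Year", "Payer", "Total")
    r = 2
    For Each k In totals.Keys
        parts = Split(k, "|", 2)              ' split on the first "|" only
        ws.Cells(r, "E").Value = CLng(parts(0))
        ws.Cells(r, "F").Value = parts(1)
        ws.Cells(r, "G").Value = totals(k)
        r = r + 1
    Next k

    ' Sort by year, then by total with the largest first.
    If r > 2 Then
        ws.Range("E1:G" & r - 1).Sort Key1:=ws.Range("E2"), Order1:=xlAscending, _
            Key2:=ws.Range("G2"), Order2:=xlDescending, Header:=xlYes
    End If
End Sub
```

Because the totals do not have to be exact, rows with an unreadable date or amount are simply skipped rather than flagged.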

Expert Comment

by:Joe Winograd, EE MVE
You're welcome. I have all of the Nuance products I mentioned and would be happy to run some tests for you if you feel comfortable posting sample checks (of course, redacting any private/sensitive data). I also have ABBYY FineReader 11. But my concern is your comment that it "requires a lot of editing"...and in the end, I wonder if the combination of OCRing and editing really is better than straight typing. Definitely worth some experimentation, though, and I look forward to hearing the results of your tests. Regards, Joe

Expert Comment

by:Joe Winograd, EE MVE
Our posts crossed again. With "none of the dates/totals, and interspersed seemingly endless useless data (parts of bank names, what text it made up from logos, etc)", the editing task could prove to be as difficult as straight-up data entry. And if all you're looking for is two fields - the Payer and the Amount (or maybe three – the Date, too), I think low-cost labor may be the solution. Ferreting out those two (or three) fields from the large amount of garbage created by OCR (or ICR or IWR) could be more time-consuming (and expensive) than good ol' data entry. Regards, Joe

Author Comment

by:FocIS
I'm really starting to think just manually entering the data is probably the best route

I was (am) hoping there might be a product specifically set up to find the "hot spots" of a paper check format, grab those fields, and flip them around into line format in Excel.

Every check (that I have) is machine-printed for the contact details and hand-written for the date/total, but what I just learned is that all the rest of the data is unimportant and should be ignored.

Obviously a human can ignore the rest, but I think that only a program specifically designed for check scanning might be capable of this.
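
Since the contact details are machine-printed, one rough way to discard most of the other text is to search the pasted OCR output for a city/state/ZIP line and keep only that line and the two above it. The sketch below is a heuristic under several assumptions that are not from the thread: US-style addresses, one OCR line per cell in column A, and a name line and street line directly above the city line. The macro name and the pattern are illustrative.

```vba
' Sketch: pull candidate contact blocks out of pasted OCR text.
' Assumes one OCR line per cell in column A; results go to columns C:E.
Sub ExtractContactBlocks()
    Dim re As Object, ws As Worksheet
    Dim lastRow As Long, r As Long, outRow As Long

    Set ws = ActiveSheet
    Set re = CreateObject("VBScript.RegExp")
    ' Matches lines such as "Springfield, IL 62704" or "Springfield IL 62704-1234".
    re.Pattern = "^[A-Za-z .'-]+,?\s+[A-Z]{2}\s+\d{5}(-\d{4})?\s*$"

    lastRow = ws.Cells(ws.Rows.Count, "A").End(xlUp).Row
    outRow = 1
    For r = 3 To lastRow                      ' start at 3 so rows r-2 and r-1 exist
        If re.Test(CStr(ws.Cells(r, "A").Value)) Then
            ws.Cells(outRow, "C").Value = ws.Cells(r - 2, "A").Value   ' name line
            ws.Cells(outRow, "D").Value = ws.Cells(r - 1, "A").Value   ' street line
            ws.Cells(outRow, "E").Value = ws.Cells(r, "A").Value       ' city, state, ZIP
            outRow = outRow + 1
        End If
    Next r
End Sub
```

A bank's own address printed on the check matches the same pattern, so the output still needs to be weeded by hand.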

Assisted Solution

by:Joe Winograd, EE MVE (earned 400 total points)
I just brought up an Excel spreadsheet and typed in three cells: a name (first and last), an amount (dollars and cents) and a date. It took 15 seconds. Let's be ultra-conservative and add 10 seconds per check. For 1,000 checks, that's 25,000 seconds, or just under 7 hours. One person could do it in a single work day, including a lunch break. :)  You may (or may not) have noticed that I am ranked the #1 Expert Overall at EE in OCR (and by a wide margin), so with that credential you might think that I'd push OCR at every turn, but I am suggesting that OCR is not the answer in this case. Regards, Joe
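
The same estimate can be rerun with your own timings. A tiny sketch (the function name EntryHours is made up for illustration):

```vba
' Hours of typing for a batch, given the time per check in seconds.
Function EntryHours(ByVal checkCount As Long, ByVal secondsPerCheck As Double) As Double
    EntryHours = checkCount * secondsPerCheck / 3600#
End Function
```

EntryHours(1000, 25) gives about 6.94, which is the "just under 7 hours" figure above. Plugging in the time it takes to correct one OCR'd check gives a like-for-like comparison with the OCR route.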

Author Comment

by:FocIS
You're right, Joe. I have checked out FineReader and it's a nice program, but it still takes longer than 15 seconds per check to edit the data, as does Adobe.

I'm going to split points 85/15 if I can because Joe has the ultimate answer, but I do like the VBA macro from ThinkSpaceSolutions; I can customize that for some other things too.

I'll close this case on Sunday, assuming some rockstar doesn't pop in and say "this check-to-excel program works 100%" or some other such home run.

Expert Comment

by:Joe Winograd, EE MVE
> I'm really starting to think just manually entering the data is probably the best route

Saw this after submitting my last post - agreed! That said, there are products that can do the "hot spots", as you called them. In the OCR business, they are known as "zones", and that type of OCR is known as zonal OCR (as opposed to full-text OCR, where the entire page is captured). Advanced OCR packages, like ABBYY FineReader and Nuance's OmniPage, support zonal OCR, and you may be able to define those zones. However, since checks come in different sizes and types, the placement of those "hot spots" differs from check to check (unlike a fixed form with zones), so even the zonal OCR approach on your checks would be very iffy.

Expert Comment

by:Joe Winograd, EE MVE
Our posts keep crossing, but it shows we're both working hard. :)  Splitting points however you want is fine with me. So is waiting for the rock-star to come along! But since you're closing it tomorrow, the metaphor should be that we're waiting for a touchdown, not a home run. :)

Expert Comment

by:Steven Harris
No need to send any points my way.  Ultimately, Joe is the expert here, I was just running off of an idea on this one.

Expert Comment

by:Joe Winograd, EE MVE
TSS,
Nice of you to say that...but, hey, that's an interesting chunk of code you found...definitely worth some points. Regards, Joe

Author Closing Comment

by:FocIS
The macro posted is very useful, but ultimately after having tested the suggested programs, it was far easier to just do it by hand.

Expert Comment

by:Joe Winograd, EE MVE
I think that's the right call in this case. Regards, Joe