Software or process to harvest contact details from scanned paper checks

Posted on 2014-02-01
Medium Priority
Last Modified: 2014-02-04
I'm looking for a software package or process to take 1000 or so paper checks and harvest the contact details, and if possible the date and amount, and add that to a sortable list like CSV or Excel.

I don't need to capture any of the MICR data, account numbers, routing numbers, or signature.

This isn't to process checks for payment, only for who wrote a check for how much and when, in a list.

The checks can be either scanned in at the time, or they can be processed in batch from a directory of JPG images or PDFs.

Ideally we'd like to see if there's something no cost, or low cost as this will only be done maybe once per year... so buying 'neat receipts' or the like is probably not cost effective.  

What are some ideas that fit the task?
Question by:FocIS
LVL 18

Expert Comment

by:Steven Harris
ID: 39827177
If you have the images in PDF already and can run the OCR tool for text recognition, then you may be able to just run a VBA process to extract the text from PDF to Excel.

Author Comment

ID: 39827179
If that's the case, that's what i'd need help with - the "run a vba process to extract the text from pdf to excel"

How?  :)

Scanning them in isn't hard, and i can certainly run the OCR wizard in Acrobat Pro (not sure if it will get the total paid).  I don't need to capture the written total in words, but the numbers from the amount block would be great.

Typically the date and total are hand-written.  In either case, how to harvest the OCR'd text in a meaningful format?
LVL 18

Accepted Solution

Steven Harris earned 400 total points
ID: 39827188
Here is some code I ran across from a user by the name of crimson_b1ade.  From Excel, this script opens a PDF file, then uses the SendKeys command to copy all recognized text and paste it into Excel.

If possible, one large PDF would be best.  I would test with a few pages first and see what the output is and if it can be edited to fit your needs.

This code is written for Adobe Reader 9 (note the AcroRd32.exe path); adjust the path if you have a different version installed.  You will need to set the file path in place of the "FILE NAME HERE" string, keeping the quotes around the path.

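Sub StartAdobe()
'by crimson_b1ade
'Opens the PDF in Adobe Reader, then schedules the copy step.
Dim AdobeApp As String
Dim AdobeFile As String
Dim TaskId As Double
AdobeApp = "C:\Program Files\Adobe\Reader 9.0\Reader\AcroRd32.exe"
AdobeFile = "FILE NAME HERE"
'Chr(34) wraps each path in double quotes so spaces in the paths are handled.
TaskId = Shell(Chr(34) & AdobeApp & Chr(34) & " " & Chr(34) & AdobeFile & Chr(34), vbNormalFocus)
'Give Reader five seconds to load before sending keystrokes.
Application.OnTime Now + TimeValue("00:00:05"), "FirstStep"
End Sub

Private Sub FirstStep()
'Select all text in the active Reader window and copy it to the clipboard.
SendKeys ("^a")
SendKeys ("^c")
'Allow ten seconds for the copy to finish before pasting.
Application.OnTime Now + TimeValue("00:00:10"), "SecondStep"
End Sub

Private Sub SecondStep()
'Switch back to Excel and paste at the current selection.
AppActivate "Microsoft Excel"
SendKeys ("^v")
End Sub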


LVL 58
ID: 39827193
Are the checks handwritten or typewritten?

That makes a big difference. Typewritten text is amenable to Optical Character Recognition (OCR) and there are numerous free (and low cost) products out there that do an excellent job with high accuracy (I'll get to those in a moment). But handwriting is a different (and much more difficult) ballgame that requires a process known as Intelligent Character Recognition (ICR) or another one known as Intelligent Word Recognition (IWR). ICR recognizes cursive handwriting a character at a time, while IWR recognizes full words and phrases in cursive handwriting. The accuracy of ICR and IWR is way, way below that of OCR. I suspect that if your checks are handwritten, you will be extremely disappointed with the results, and you will be much better off hiring some low-cost labor to type the data into an Excel spreadsheet for you.

That said, here are some free OCR tools for you to consider and experiment with:

(1) Tesseract OCR Engine, an open source product now maintained by Google:

It has numerous add-ons:

(2) FreeOCR, which uses a compiled version of the Tesseract engine:

(3) GOCR/JOCR, an open source OCR package developed under the GNU Public License:

(4) OCR Desktop, which is not open source, but is free for personal use (needs to be registered in order to turn off popups and advertising):

(5) SimpleOCR, which is not open source, but is free, with both an end-user version and a royalty-free SDK:

(6) Boxoft Free OCR (I use several Boxoft free tools):

(7) Google Drive/Docs has an option to perform OCR on uploaded files, but the resulting PDF doesn't hide the text layer, so the files look ugly.

You said that "low cost" is OK, but you didn't define that. Assuming they qualify as "low cost" for you, two very well regarded OCR programs are Nuance OmniPage and ABBYY FineReader. Here are links to more information:

Here are links to feature comparison charts:

I use both and can say that both are very accurate, but I can't say that one is always better than the other. I've tested them on the same documents, and sometimes one is better, sometimes the other is, but for the most part, the accuracy is similar - both very good! And they both can create Excel files.

Another (non-free) idea is Nuance's PaperPort product, which is not a dedicated OCR package, but can perform OCR via Nuance's OmniPage, which is included "under the covers" (the OmniPage OCR engine is built into PaperPort):

PaperPort is a robust scanning/imaging package that does a lot more than just OCR (but for pure OCR, is not as robust as OmniPage and FineReader). I use PaperPort extensively (more than OmniPage and FineReader combined). Its OCR capabilities (via the built-in OmniPage) may be adequate for your purposes. But if not, then go with OmniPage or FineReader.

Another non-free, but inexpensive, product is Nuance's PDF Converter Professional 8:

Although the list price is $100, the street price is substantially less. It is currently $64 at Amazon:

Yet another (non-free) possibility is Adobe Acrobat (not Adobe Reader), which is also a lot more than just OCR:

I'm not a big fan of Acrobat (it's too expensive for what it does, in my opinion), but many folks like it and its built-in OCR is good.

One more idea: Microsoft Office Document Imaging (MODI) was bundled with Office 2003 and 2007. Here's a link to some good info about it:

MODI was removed from Office 2010, but here's an article on how to install it in 2010:

Of course, MS Office is not free, but if you already have MS Office, then MODI is included at no additional charge.

Now for a key point. While today's OCR is very accurate, it is not 100%. There are always issues like the number "0" and the upper case "O"; the number "1" and the lower case "l"; and last names like "Turner", where the "r" and the "n" can be nearly touching in a proportional font, thereby causing the OCR to think it's the name "Tumer".
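
For fields that should contain only digits, such as the amount block, some of these look-alike errors can be caught automatically. The following is a minimal Python sketch, not a feature of any of the products above: the function name `clean_amount` and the substitution list are assumptions for illustration (the capital "I" mapping is an addition beyond the examples given here). Anything that still does not look like a dollar amount after substitution is returned as `None` so a person can review it.

```python
import re

# Letters that OCR may return in place of digits in a numeric field.
# Illustrative list only; extend it after looking at real OCR output.
LOOKALIKES = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1"})

# A plain amount: digits, optionally followed by a dot and two decimals.
AMOUNT = re.compile(r"^\d+(\.\d{2})?$")


def clean_amount(raw):
    """Return a normalised amount string, or None if it needs manual review."""
    text = raw.strip().lstrip("$").replace(",", "")
    text = text.translate(LOOKALIKES)
    return text if AMOUNT.match(text) else None
```

This only helps where the field type is known in advance. A name such as "Tumer" cannot be repaired this way, because nothing in the text itself says it is wrong.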

When creating searchable PDF files (a primary usage of OCR these days), most users are willing to live with the occasional OCR error. But since you're OCRing checks, where you expect the data to be 100% accurate, OCR alone won't do it. I like to say that the good news of OCR is that it's 99% accurate, and the bad news of OCR is that it's 99% accurate. :)  This is why some folks, in some situations, use heads-down data entry instead of, or in conjunction with, OCR. And as stated earlier, you should definitely do the heads-down data entry approach if the checks are handwritten. In fact, for just 1,000 items only once a year, you may want to go that route instead of OCR, even if the checks are typewritten. Regards, Joe
LVL 58
ID: 39827200
My message crossed with yours...took me a while to write it. :)

With the date and total (even just the numbers in the amount block) being handwritten (and the payee, also, I presume), I think you will be very disappointed with Acrobat's OCR results. But it's simple to test. Forget for the moment how to harvest the OCRed text in a meaningful format. Let's see if it's worth harvesting! My bet is no, but I'd be happy to be proven wrong. Use Acrobat's OCR on a dozen checks and take a look at the OCR results. Regards, Joe

Author Comment

ID: 39827202
I was able to get the VB code working in a macro, it does harvest the text but sort of just smears it all over the place - which is still better than typing straight up, but requires a lot of editing

Joe, you've given a lot of ideas, i'm going to try each one, especially the Nuance line.  I'll need a day or so to go thru those things.  I do have Acrobat X Pro, office 2003 and office 2010 already.  Acrobat does decent OCR obviously, but i'm interested in seeing what Nuance can do especially in regards to saving into excel

Thanks for the direction so far, both of you

Author Comment

ID: 39827206
As a quick followup, the Excel macro via Acrobat X Pro was able to harvest most of the contact details, none of the dates/totals, and interspersed seemingly endless useless data (parts of bank names, what text it made up from logos, etc)

The payee is not important, the total is mildly important and doesn't have to be exact, dates would be nice... but in the end, getting close to a Christmas card list would be a good goal.  Being able to rank payers based on how much they paid per year would be awesome
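
However the data ends up in a list, the per-year ranking itself is simple once there is a CSV. Here is a rough Python sketch, assuming a file with the hypothetical column names `payer`, `amount`, and `date`, and dates written as YYYY-MM-DD. The names in the sample are made up.

```python
import csv
import io
from collections import defaultdict
from decimal import Decimal


def rank_payers(csv_text):
    """Return {year: [(payer, total), ...]} with the largest total first."""
    totals = defaultdict(lambda: defaultdict(Decimal))
    for row in csv.DictReader(io.StringIO(csv_text)):
        year = row["date"][:4]  # assumes ISO dates such as 2013-12-24
        totals[year][row["payer"].strip()] += Decimal(row["amount"])
    return {
        year: sorted(by_payer.items(), key=lambda item: item[1], reverse=True)
        for year, by_payer in totals.items()
    }


# Made-up sample rows standing in for the real list of checks.
sample = """payer,amount,date
Pat Example,50.00,2013-03-01
Lee Sample,125.50,2013-06-15
Pat Example,100.00,2013-11-20
"""
ranking = rank_payers(sample)
```

`Decimal` is used instead of floating point so that dollar amounts add up without rounding surprises. For a real file, the text could be read with `open(path, newline="").read()` and passed in the same way.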
LVL 58
ID: 39827211
You're welcome. I have all of the Nuance products I mentioned and would be happy to run some tests for you if you feel comfortable posting sample checks (of course, redacting any private/sensitive data). I also have ABBYY FineReader 11. But my concern is your comment that it "requires a lot of editing"...and in the end, I wonder if the combination of OCRing and editing really is better than straight typing. Definitely worth some experimentation, though, and I look forward to hearing the results of your tests. Regards, Joe
LVL 58
ID: 39827218
Our posts crossed again. With "none of the dates/totals, and interspersed seemingly endless useless data (parts of bank names, what text it made up from logos, etc)", the editing task could prove to be as difficult as straight-up data entry. And if all you're looking for is two fields - the Payer and the Amount (or maybe three – the Date, too), I think low-cost labor may be the solution. Ferreting out those two (or three) fields from the large amount of garbage created by OCR (or ICR or IWR) could be more time-consuming (and expensive) than good ol' data entry. Regards, Joe

Author Comment

ID: 39827225
I'm really starting to think just manually entering the data is probably the best route

i was (am) hoping there might be a product specifically set up to judge the "hot spots" of a paper check format, and grab fields, flip them around into line format in excel.

Every check (that i have) is machine printed for the contact details, and hand-written for the date/total... but what i just learned is -all the rest- of the data is unimportant and should be ignored.

Obviously a human can ignore the rest but i think that only a program specifically designed for check scanning might be capable of this
LVL 58

Assisted Solution

by:Joe Winograd, EE Fellow 2017, MVE 2016, MVE 2015
Joe Winograd, EE Fellow 2017, MVE 2016, MVE 2015 earned 1600 total points
ID: 39827229
I just brought up an Excel spreadsheet and typed in three cells: a name (first and last), an amount (dollars and cents) and a date. It took 15 seconds. Let's be ultra-conservative and add 10 seconds per check. For 1,000 checks, that's 25,000 seconds, or just under 7 hours. One person could do it in a single work day, including a lunch break. :)  You may (or may not) have noticed that I am ranked the #1 Expert Overall at EE in OCR (and by a wide margin), so with that credential you might think that I'd push OCR at every turn, but I am suggesting that OCR is not the answer in this case. Regards, Joe
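
The estimate above can be reproduced with a few lines of Python. The function name is illustrative only, and the per-check time can be swapped for whatever a trial run on a dozen checks shows.

```python
def data_entry_hours(checks, seconds_per_check):
    """Total typing time in hours for a batch of checks."""
    return checks * seconds_per_check / 3600


# 15 seconds measured, plus a conservative 10 seconds of padding per check.
hours = data_entry_hours(1000, 15 + 10)
print(f"{hours:.2f} hours")
```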

Author Comment

ID: 39827233
You're right Joe.. I have checked out FineReader and it's a nice program, but it still takes longer than 15 seconds per check to edit the data, as does Adobe

I'm going to split points 85/15 if I can, because Joe has the ultimate answer, but I do like the VB macro from ThinkSpaceSolutions; I can customize that for some other things too

I'll close this case on Sunday, assuming some rockstar doesn't pop in and say "this check-to-excel program works 100%" or some other such homerun
LVL 58
ID: 39827234
> I'm really starting to think just manually entering the data is probably the best route

Saw this after submitting my last post - agreed! That said, there are products that can do the "hot spots", as you called them. In the OCR business, they are known as "zones", and that type of OCR is known as zonal OCR (as opposed to full-text OCR, where the entire page is captured). So with an advanced OCR package, like ABBYY FineReader and Nuance's OmniPage, zonal OCR is supported, and you may be able to define those zones. However, since checks are different sizes and types, the placement of those "hot spots" is different from check to check (unlike a fixed form with zones), so even the zonal OCR approach on your checks would be very iffy.
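
To make the zone idea concrete, here is a small Python sketch. It is not how FineReader or OmniPage define zones; the zone names and fractions are invented for illustration. Each zone is stored as a fraction of the check's width and height, then converted to a pixel box for a given scan size.

```python
# Each zone is (left, top, right, bottom) as a fraction of the image size.
# These values are hypothetical and would need tuning per check layout.
ZONES = {
    "payer": (0.03, 0.05, 0.50, 0.30),
    "date": (0.65, 0.15, 0.95, 0.30),
    "amount": (0.75, 0.35, 0.97, 0.50),
}


def zone_boxes(width, height, zones=ZONES):
    """Convert fractional zones to pixel boxes for one scanned image."""
    return {
        name: (
            round(left * width),
            round(top * height),
            round(right * width),
            round(bottom * height),
        )
        for name, (left, top, right, bottom) in zones.items()
    }


boxes = zone_boxes(1000, 400)
```

Fractions cope with checks scanned at different resolutions, but not with checks that put the date or amount in a different place, which is the problem described above. A separate set of zones would be needed for each layout.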
LVL 58
ID: 39827239
Our posts keep crossing, but it shows we're both working hard. :)  Splitting points however you want is fine with me. So is waiting for the rock-star to come along! But since you're closing it tomorrow, the metaphor should be that we're waiting for a touchdown, not a home run. :)
LVL 18

Expert Comment

by:Steven Harris
ID: 39827242
No need to send any points my way.  Ultimately, Joe is the expert here, I was just running off of an idea on this one.
LVL 58
ID: 39833068
Nice of you to say that...but, hey, that's an interesting chunk of code you found...definitely worth some points. Regards, Joe

Author Closing Comment

ID: 39833863
The macro posted is very useful, but ultimately after having tested the suggested programs, it was far easier to just do it by hand.
LVL 58
ID: 39833878
I think that's the right call in this case. Regards, Joe
