OCR Data Acquisition

Posted on 2008-01-30
Medium Priority
Last Modified: 2010-04-21
Hi Experts,

This is mostly a Linux question, but there is also a Windows question at the bottom.

Does anyone know of non-commercial OCR software that can be configured to look for specific words (data recognition) in a scanned document? For example, given a scanned document, I would have the software search for "Purchase Order", "PO", and other abbreviations of "Purchase Order." The software might return to my program "Found/Not Found", the actual string found, or even the location on the image where the string was found (e.g. [x1,y1] upper left and [x2,y2] lower right).

Furthermore, can the OCR software scan the immediate area where the "Purchase Order" string was found and return the string found there (e.g. the purchase order number)? The "immediate area" could be defined by the program as upper-left and lower-right coordinates.

Another Purchase Order example is to find the purchaser's name and address information, delivery date, delivery location, item number, quantity, etc. Very ambitious.

I'm running Ubuntu 7.10 (Gutsy Gibbon), which gives me immediate access to Debian packages. Are there any packages for Fedora, SuSE, Slackware, Mandriva, Gentoo, Xandros, etc., as RPMs or whatever format they use to manage application software?

Are there any C libraries that aid in (1) OCR and looking for specific strings and (2) scanning a specific area of an image and returning any data found?

For Windows Experts, are there OCX controls to do OCR and data acquisition?

Thanks much!!!

Question by:IT79637
LVL 19

Accepted Solution

http:// thevpn.guru earned 360 total points
ID: 20781715
Well, for Linux I know of these two programs you might want to have a look at. Since they are open source, you might also want to look at their source code. They both have OCR features.

kooka - scanner program for KDE
unpaper - post-processing tool for scanned pages

As for Windows components, you can surely find something relevant.
LVL 14

Assisted Solution

trigger-happy earned 320 total points
ID: 20781716
You can use Tesseract (http://code.google.com/p/tesseract-ocr/) to scan the documents and then create a script/program to handle the actual searching.
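A rough sketch of that "OCR first, search afterwards" idea, in Python. It assumes a recent Tesseract release whose command-line tool accepts `stdout` as the output name; the function names and the keyword pattern are made up for illustration and would need tuning for real documents (for example, "P.O. Box" would also match).

```python
import re
import subprocess

# Hypothetical pattern: "Purchase Order", "P.O." or "PO", case-insensitive.
# The trailing (?!\w) keeps "po" inside words such as "point" from matching.
PO_PATTERN = re.compile(r"\b(?:purchase\s+order|p\.o\.|po)(?!\w)", re.IGNORECASE)


def ocr_to_text(image_path):
    """Run the tesseract command-line tool and return the recognized text.

    Assumes `tesseract` is installed and on PATH.
    """
    result = subprocess.run(
        ["tesseract", image_path, "stdout"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def find_keywords(text):
    """Return (matched_string, character_offset) for every keyword hit."""
    return [(m.group(0), m.start()) for m in PO_PATTERN.finditer(text)]
```

Usage would be something like `find_keywords(ocr_to_text("scan.png"))`, where `scan.png` is a placeholder file name. An empty list means "Not Found"; otherwise each entry gives the actual string found and where it sits in the recognized text. Note that these are offsets into the text, not pixel coordinates on the image.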

LVL 41

Assisted Solution

noci earned 320 total points
ID: 20792409
gocr is another one: http://jocr.sourceforge.net

Author Closing Comment

ID: 31426444
The data acquisition part of the question is a very difficult one. I'm looking for a keyword, such as "Purchase Order", on an image, and then want to find the purchase order data around that keyword. That type of intelligence is significantly more difficult than vanilla OCR. The experts' responses pointed me to several Linux-based packages.
Thank you all very much!!!
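For anyone attempting the keyword-then-neighbourhood step described above, here is a minimal sketch in Python. It assumes the OCR engine can emit per-word bounding boxes; recent Tesseract releases can write these as TSV (for example `tesseract scan.png out tsv`, with columns `level`, `page_num`, `block_num`, `par_num`, `line_num`, `word_num`, `left`, `top`, `width`, `height`, `conf`, `text`). The helper names are hypothetical, and "the rest of the same text line" is only one simple definition of the immediate area.

```python
import csv
import io


def parse_tsv(tsv_text):
    """Return word-level entries (level 5) from Tesseract TSV output."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    words = []
    for row in reader:
        text = (row.get("text") or "").strip()
        if row.get("level") != "5" or not text:
            continue  # skip page/block/paragraph/line rows and empty words
        left, top = int(row["left"]), int(row["top"])
        words.append({
            "line": (row["page_num"], row["block_num"],
                     row["par_num"], row["line_num"]),
            "left": left,
            "top": top,
            "right": left + int(row["width"]),
            "bottom": top + int(row["height"]),
            "text": text,
        })
    return words


def find_value_after(words, phrase):
    """Locate `phrase` on a single text line.

    Returns ((x1, y1, x2, y2), rest_of_line) or None if not found.
    """
    target = [w.lower() for w in phrase.split()]
    n = len(target)
    for i in range(len(words) - n + 1):
        window = words[i:i + n]
        line = window[0]["line"]
        texts = [w["text"].lower().rstrip(":#") for w in window]
        if texts == target and all(w["line"] == line for w in window):
            bbox = (
                min(w["left"] for w in window),
                min(w["top"] for w in window),
                max(w["right"] for w in window),
                max(w["bottom"] for w in window),
            )
            rest = [w["text"] for w in words[i + n:] if w["line"] == line]
            return bbox, " ".join(rest)
    return None
```

Calling `find_value_after(parse_tsv(tsv_text), "Purchase Order")` gives the upper-left and lower-right corners of the keyword plus whatever follows it on that line. Real purchase orders often put the number below or in a separate box, so a coordinate-based search around the returned rectangle would likely be needed as well.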

Question has a verified solution.
