OCR Tool to read files

I am looking for an OCR app that reads scanned files and inserts the data into a SQL table. For example, I want the app to look at multiple invoices from multiple vendors (suppliers) and store the data fields in a table. Is there good OCR software available on the market?

Scott Fell

There are plenty of options for the OCR part, from free and paid open source to purely commercial. If the text is generally clear and in a plain format, or you have PDFs generated by a script, this is going to be easier. If it comes from scans, it can start getting more difficult.

The question is how much do you want to rely on automation for the first step of OCR? Text can get mangled or misinterpreted thus munging up your data.

Once you have the text, it is a matter of coding to get it to your database.

Can you provide a demo file of what you are starting with? What server side language do you want to use to code in order to bring the data to your database?

If the invoice format is always going to be the same, it is going to be a matter of setting it up once and running it. I do think you will want to have a step to QC the data.
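To make the "set up once, then QC" idea concrete, here is a minimal Python sketch for a single fixed invoice layout. The field labels ("Invoice No", "Invoice Date", "Total"), the ISO date format, and the sample text are all assumptions made up for illustration; real vendor layouts will need their own patterns.

```python
import re
from datetime import datetime
from decimal import Decimal, InvalidOperation

# Hypothetical labels for ONE vendor's layout; adjust per vendor.
PATTERNS = {
    "invoice_no": re.compile(r"Invoice\s*(?:No\.?|#)\s*:?\s*(\S+)", re.I),
    "invoice_date": re.compile(r"Invoice\s*Date\s*:?\s*(\d{4}-\d{2}-\d{2})", re.I),
    "total": re.compile(r"\bTotal\s*:?\s*\$?\s*([\d,]+\.\d{2})", re.I),
}


def extract_fields(text):
    """Return (fields, problems); problems lists what a person should review."""
    fields, problems = {}, []
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match is None:
            problems.append(f"{name}: not found")
            continue
        fields[name] = match.group(1)

    # QC step: a value must also parse, not merely be present.
    if "invoice_date" in fields:
        try:
            datetime.strptime(fields["invoice_date"], "%Y-%m-%d")
        except ValueError:
            problems.append("invoice_date: not a valid date")
    if "total" in fields:
        try:
            fields["total"] = Decimal(fields["total"].replace(",", ""))
        except InvalidOperation:
            problems.append("total: not a valid amount")
    return fields, problems


# Made-up OCR output; the date is deliberately impossible to show the QC step.
sample = """ACME Supplies
Invoice No: INV-1001
Invoice Date: 2024-02-30
Subtotal: 1,100.00
Total: $1,234.50
"""
fields, problems = extract_fields(sample)
print(fields)
print(problems)
```

Anything that lands in `problems` would go to a person for review rather than straight into the database.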

Are you wanting to code this on your own with our help or looking for somebody to do this for you?

Sanjay Kukreja

ASKER

Hi Scott,

The invoices are going to be coming from suppliers, so in many different formats. I have a team that can write a query program to clean up the data. I would rather go with a paid solution. Basically, I would have the invoices in a SQL table, compare them with the orders, and then make payments based on the converted invoices. The server-side language can be ASP.NET, VBScript, JavaScript, Python, SQL stored procedures, etc.
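For illustration, the invoice-versus-order comparison could look roughly like the sketch below, written in Python with an in-memory SQLite database. The table names, column names, and sample rows are hypothetical, and amounts are stored as integer cents to avoid floating-point comparisons.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and data, purely for illustration.
conn.executescript("""
CREATE TABLE orders   (po_no TEXT PRIMARY KEY, amount_cents INTEGER);
CREATE TABLE invoices (invoice_no TEXT PRIMARY KEY, po_no TEXT, amount_cents INTEGER);
INSERT INTO orders VALUES ('PO-1', 50000), ('PO-2', 12050);
INSERT INTO invoices VALUES
    ('INV-A', 'PO-1', 50000),
    ('INV-B', 'PO-2', 12500),
    ('INV-C', 'PO-9', 9900);
""")


def reconcile(conn):
    """Map each invoice number to a payment decision based on its order."""
    rows = conn.execute("""
        SELECT i.invoice_no,
               CASE
                   WHEN o.po_no IS NULL THEN 'no matching order'
                   WHEN o.amount_cents <> i.amount_cents THEN 'amount mismatch'
                   ELSE 'ok to pay'
               END
        FROM invoices AS i
        LEFT JOIN orders AS o ON o.po_no = i.po_no
        ORDER BY i.invoice_no
    """).fetchall()
    return dict(rows)


result = reconcile(conn)
print(result)
```

The `LEFT JOIN` keeps invoices that have no order at all, so they show up as exceptions instead of silently disappearing.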

Dr. Klahn

is there a good OCR software available in the market?

Not good enough for this application. The best OCR is less reliable than 99%, and 99% would not be good enough in this application.

You might consider Mechanical Turk if the source material can be expurgated enough that it is just columns of numbers. Have it done three times and see if the results agree.
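A rough Python sketch of the "do it three times and compare" check follows. The field names and values are invented, and the `min_agree` threshold is an assumption: the default requires all three transcriptions to match, while `min_agree=2` would accept a simple majority.

```python
from collections import Counter


def consensus(transcriptions, min_agree=3):
    """Per field, accept a value only if at least min_agree transcriptions match."""
    accepted, disputed = {}, []
    field_names = sorted({name for t in transcriptions for name in t})
    for name in field_names:
        # A missing field counts as an empty answer.
        values = [t.get(name, "").strip() for t in transcriptions]
        value, count = Counter(values).most_common(1)[0]
        if value and count >= min_agree:
            accepted[name] = value
        else:
            disputed.append(name)
    return accepted, disputed


# Three hypothetical transcriptions of the same invoice.
runs = [
    {"qty": "12", "amount": "340.00"},
    {"qty": "12", "amount": "340.00"},
    {"qty": "12", "amount": "840.00"},
]
accepted, disputed = consensus(runs)
print(accepted, disputed)
```

Fields that end up in `disputed` would be sent back for another pass or checked by hand.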

Scott Fell

In that case, I would look at something that is ready to go:

https://www.simpleocr.com/product-category/abbyy-flexicapture/
https://docparser.com/blog/scan-to-database/

Once you have the data extracted, your team can take over.

If you want to dive into rolling your own, look at https://github.com/tesseract-ocr/tesseract. I started using tesseract myself. It is not something you will be able to do quickly, especially with many different versions of invoices and many unknowns. Docparser seems like a good way to start.

ASKER CERTIFIED SOLUTION

Scott Fell

Jim Dettman (EE MVE)

I'd just kick in that these types of setups often don't work well, even when the incoming form is consistent. It's why many web sites employ a captcha of letters and numbers distorted or with lines through them; it's extremely difficult for a computer program to make sense of the result. If you're thinking of doing this with just regular OCR, the failure rate will be high and the reliability low.

I know of a situation at a utility company that put in a system like this using high-end software. They scan vendor invoices and try to pick out five pieces of info: invoice #, invoice date, PO #, quantity, and dollar amount. The system is AI based, meaning it learns as it goes along (i.e. the same format comes from a vendor each time).

After a year's worth of system training, the success rate is about 71% with those five pieces of info (500 out of 700 invoices a day). I don't know what the error rate is on the 71% it can read. What it fails at is multi-line invoices or invoices that have any kind of handwriting on them.

FWIW,
Jim.

David Favor

I do something similar to this for processing bank statements + credit card statements producing a tax preparer report.

Here's my approach...

1) For physical/paper documents, I scan this using a ScanSnap device.

These can be purchased off Amazon for around $100 USD.

ScanSnap devices approach 100% correct scanning. They also scan at an incredible speed. Roughly 1 second per page, full duplex.

2) If I'm provided a .pdf file or image, I scan this with tesseract, which is slow as it runs many algorithms attempting to correctly produce text output.

tesseract scans approach 100% correct scanning also.

Then I merge the text component back into the .pdf file, so I only ever have to do this once.

3) If the .pdf has a text component, no work to do.

Almost every .pdf produced these days has a text component, so rarely is OCR actually required anymore.

4) At this point I have a .pdf document with both an image component + text component.

5) Pass the .pdf to the next step which extracts the text component using poppler, as in...

pdftotext -enc ASCII7 -nopgbrk -layout "$file" "$file.txt" 2>/dev/null

At this point, all starting points (#1-#3) have a $file.txt component to process.

6) Run each $file.txt file through mechanical tools to align any slightly misaligned fields, producing $file.txt.clean files.

7) Review each $file.txt.clean to verify all columns align correctly.

If required, hand-edit the files to align the columns correctly.

Note: This step is essential, as no PDF text component will ever be 100% correct across 100s or 1000s of files.

8) At this point all finalized $file.txt.clean files have all columns correctly aligned, for post processing.

9) Run all the finalized files through any other software, which might include inserting data into an SQL database or in my case kicking out a tax return category file for my tax preparer.
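As a rough illustration of steps 7 and 9, the Python sketch below checks that every line of a cleaned text file splits into the expected number of columns and then loads the good rows into SQLite. The three-column layout (date, description, amount), the two-or-more-spaces column separator, and the table name are assumptions, not the actual format used above.

```python
import re
import sqlite3

EXPECTED_COLUMNS = 3  # hypothetical layout: date, description, amount


def parse_clean_text(text):
    """Split non-blank lines on runs of 2+ spaces; collect misaligned lines."""
    rows, rejects = [], []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        cells = re.split(r"\s{2,}", line.strip())
        if len(cells) == EXPECTED_COLUMNS:
            rows.append(tuple(cells))
        else:
            rejects.append((line_no, line))  # needs a human edit
    return rows, rejects


def load_rows(conn, rows):
    """Insert parsed rows into a (hypothetical) statement_lines table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS statement_lines "
        "(txn_date TEXT, description TEXT, amount TEXT)"
    )
    conn.executemany("INSERT INTO statement_lines VALUES (?, ?, ?)", rows)
    conn.commit()


# Made-up stand-in for the contents of a $file.txt.clean file.
clean = (
    "2024-01-03  OFFICE DEPOT   45.10\n"
    "2024-01-04  ACME SUPPLIES  120.00\n"
    "2024-01-05 BADLY ALIGNED 9.99\n"
)
rows, rejects = parse_clean_text(clean)
conn = sqlite3.connect(":memory:")
load_rows(conn, rows)
print(len(rows), "loaded;", len(rejects), "need review")
```

In practice the rejected lines would be fixed by hand and the file re-run before anything is loaded, which matches the manual review in step 7.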

Consideration: This may sound like a lot of work, but once it's automated, one person can normally go through 100s of files/hour to get from step #1 to #9. Since input files normally only kick out a few each week/month/year, this process is extremely fast + efficient.

Sanjay Kukreja

ASKER

Thank You All.