Kanwaljit Singh Dhunna asked:

Scan or read the document directly into Excel in proper columnar table

Hi Experts,

We are facing a situation where the person handling and issuing the bilty receipts is not a very qualified person and the receipts are being issued MANUALLY now.
An Invoice is presented to him, and on the basis of particulars of the Invoice the Bilty is issued MANUALLY as of now. Those particulars are entered in Excel by a person later on.

But we wish to enter the same directly in Excel so as to speed up the process and avoid duplication of work. The following options are being considered at present.

1. Scan the invoice directly into Excel. This could be tricky, as sometimes the invoice is folded and/or dirty. It is printed using ordinary ink, so MICR is not supported. Also, we are not sure which scanners or other devices are available at present. I tried Microsoft Office Lens, but that too needs further work.

2. Voice command entry in Excel. I am not sure how that can be done. If it is possible, kindly advise.

I request the experts to kindly suggest how this can be worked out!

Regards
Kanwaljit
aikimark

With OCR, you don't need special ink.  It isn't a perfect process, but you can get pretty good results for forms.

Have you looked at OCR packages?  You should be able to specify different regions of the page and the software will assign values to different variables/fields.
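
The region idea above can be sketched in a few lines, assuming the OCR engine reports a bounding box for each recognised word (many do). This is only an illustration: the `Word` class, the `REGIONS` coordinates, and the field names are hypothetical and would have to be measured on a real scanned invoice.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One recognised word with its bounding box in pixels (hypothetical shape)."""
    text: str
    left: int
    top: int
    width: int
    height: int


# Hypothetical regions as (left, top, right, bottom) in pixels.
# Real values depend on the invoice layout and the scan resolution.
REGIONS = {
    "invoice_no": (400, 50, 700, 90),
    "invoice_date": (400, 100, 700, 140),
}


def centre(word):
    """Return the centre point of a word's bounding box."""
    return (word.left + word.width / 2, word.top + word.height / 2)


def assign_words(words, regions):
    """Assign each word to the first region that contains its centre point."""
    fields = {name: [] for name in regions}
    for word in words:
        cx, cy = centre(word)
        for name, (x0, y0, x1, y1) in regions.items():
            if x0 <= cx <= x1 and y0 <= cy <= y1:
                fields[name].append(word)
                break
    # Join the words of each field from left to right.
    return {
        name: " ".join(w.text for w in sorted(found, key=lambda w: w.left))
        for name, found in fields.items()
    }
```

Words that fall outside every region are simply ignored, which is useful when only a handful of fields on a crowded form are wanted. Fixed regions only work if every invoice is scanned at the same scale and alignment.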
Kanwaljit Singh Dhunna (ASKER)

I have tried the Microsoft Office Lens app for Android.
That is OCR, but it saves documents in Word format only, so it is not fit for the purpose in its present form. We need the data in columnar tables.

> Have you looked at OCR packages?  You should be able to specify different regions of the page and the software will assign values to different variables/fields.

Kindly suggest and advise how to do it!
Do you have OneNote?  It will produce text.

How about MODI? (Microsoft Office Document Imaging)
You might have already installed it when you installed Office/Excel
https://support.microsoft.com/en-us/help/982760/install-modi-for-use-with-microsoft-office-2010

Have you installed a scanner to your PC?  Many times the installation also installs OCR software.

How about Adobe Acrobat?

Tesseract is a reliable open-source OCR library.  I haven't invoked it or consumed its output with VBA.
Hi Kanwal,
There are many document imaging products with scanning and OCR capability. For example, below are links to some articles and videos that I've published here at Experts Exchange on the topic:

PaperPort - How To Create Searchable PDF Files
How to OCR pages in a PDF with free software - PDF-XChange Editor
Convert Scanned Image-Only PDF Files to PDF Searchable Image Files via OCR with Power PDF Advanced
Batch Conversion of PDF, TIFF, and Other Image Formats via Command Line Interface to PDF, PDF Searchable, and TIFF with Power PDF Advanced

Those products are PaperPort, PDF-XChange Editor, and Power PDF Advanced (there are many others).

But there are issues with using OCR for this, including (1) OCR is not 100% accurate (and is likely to be even worse since you say that "the invoice is folded and/or dirty") and (2) maintaining the table format is tricky stuff...very difficult to get right. The results can also vary tremendously by document.

I have all of the products that I mentioned above (and many more, including ABBYY FineReader, Acrobat DC, OmniPage, Tesseract, and others), and would be happy to run some tests for you if you post a sample image (being careful, of course, that it doesn't have any private/sensitive info in it).

Btw, regarding Mark's comment on Microsoft Office Document Imaging (MODI), it was bundled with Office 2003 and 2007. It was removed from Office 2010, but here's an article on how to install it in 2010:
http://support.microsoft.com/kb/982760

The same technique may work with Office 2013, but I never tried that, and reports on the web about it show mixed results...some successes...some failures. I'm nearly certain that it won't work with Office 2016/2019/365. Also, Microsoft OEM'ed the software in MODI from ScanSoft, which became the Document Imaging division of Nuance, and which was recently sold to Kofax. Instead of even trying to get MODI to work with a modern version of Office, you should go with the current version of PaperPort (v14.7) or Power PDF Advanced (v3.1) or some other current document imaging product. Regards, Joe
Thanks Aikimark and Joe,

I am using Windows 10 64-bit and Office 365 64-bit.
I am not sure how to get started with Tesseract. Is this the right source?


I need the data imported into a data table in Excel, so even if I scan the image with OCR, it will not help me until the result is imported into Excel.

As Joe said, MODI is not going to work with Office 365, so that option is gone.

Can speech recognition technology serve our purpose?
Are you thinking about dictating values into Excel?  I think that would be as error prone as what you are doing now.
Yes, I was thinking about that, and I take your word for it.

What would be the best option then?
Install Tesseract. Play with it. Do some manual OCR operations from the command line.  See if you can get the text output in a form that might be consumed by VBA code.  Develop VBA code that does an import.
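
As a rough illustration of those steps, here is a minimal Python sketch (rather than VBA) that runs the Tesseract command line, regroups its TSV output into text lines, and writes a CSV file that Excel can open. It assumes `tesseract` is on the PATH; the function names and the `min_conf` cut-off are hypothetical choices for this sketch, not part of Tesseract.

```python
import csv
import io
import subprocess


def run_tesseract_tsv(image_path):
    """Run the Tesseract CLI and return its TSV output as text.

    Assumes the `tesseract` executable is installed and on the PATH.
    """
    result = subprocess.run(
        ["tesseract", image_path, "stdout", "tsv"],
        capture_output=True, text=True, encoding="utf-8", check=True,
    )
    return result.stdout


def tsv_to_lines(tsv_text, min_conf=0.0):
    """Group recognised words back into text lines, skipping low-confidence words."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    lines = {}
    for row in reader:
        text = (row.get("text") or "").strip()
        if not text:
            continue  # page, block, and line rows carry no text
        if float(row["conf"]) < min_conf:
            continue
        key = (int(row["page_num"]), int(row["block_num"]),
               int(row["par_num"]), int(row["line_num"]))
        lines.setdefault(key, []).append(text)
    return [" ".join(words) for _, words in sorted(lines.items())]


def write_csv(lines, csv_path):
    """Write one text line per row; the UTF-8 BOM helps Excel detect the encoding."""
    with open(csv_path, "w", newline="", encoding="utf-8-sig") as handle:
        writer = csv.writer(handle)
        writer.writerow(["line_no", "text"])
        for number, line in enumerate(lines, start=1):
            writer.writerow([number, line])
```

A typical call would be `write_csv(tsv_to_lines(run_tesseract_tsv("invoice.png")), "invoice.csv")`, where `invoice.png` is a placeholder file name. The per-word confidence column is what makes the TSV output more useful than plain text here, since doubtful words can be filtered out or flagged for manual checking.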
Do the receipts have barcodes or QR codes? If so, buy a barcode scanner; it can easily insert the information into an Excel spreadsheet. All you need to do is add the bar code item to the Quick Access Toolbar. I have it running in Excel to scan receipts with my barcode scanner. It works well, and you don't have to worry about typing mistakes when you add information manually.
Hi Kanwal,
Tesseract is fine for personal use and if you're looking for something that is free. But since you're obviously in a business situation where you're dealing with invoices, I think that you should go for a top quality, commercial OCR product, with better accuracy than Tesseract, such as ABBYY FineReader or Nuance/Kofax OmniPage (the latter is the OCR engine built into Nuance/Kofax PaperPort and Power PDF). They're not free, but well worth the improvement in OCR accuracy, imo. The other issue is the ability to create the Excel spreadsheet, where the ABBYY and Nuance/Kofax products are very good (although not perfect...no OCR product that I'm aware of is perfect at it).

As I offered earlier, if you post a sample image (being careful that it doesn't have any private/sensitive info), I'll run it through a few OCR products and will post the Excel spreadsheets that they create. Also, you may try two of the really good ones yourself, as they both offer free trials:
https://www.abbyy.com/en-us/lp/finereader15-download-free-trial
https://www.kofax.com/Products/omnipage/ultimate-trial-version

In addition, Kofax offers free trials of PaperPort Professional and Power PDF Advanced, both of which are easier to use than OmniPage, yet have the OmniPage OCR engine under the covers:
https://www.kofax.com/Products/paperport/professional-trial-version
https://www.kofax.com/Products/power-pdf/advanced-free-trial

As a disclaimer, I want to emphasize that I have no affiliation with any of the companies mentioned in this post and no financial interest in them whatsoever. I am simply a happy user/customer. Regards, Joe
Thanks Aikimark ! Kindly direct me from where to download the Tesseract. There are a lot of links and I am not sure which one to follow.
We may not use it in the actual scenario as the user is not too keen on using an OCR due to the quality of physical Invoices (the source document), but as you and Joe are pretty sure about it, I feel we must give it a try, even if that is for any future use. You never know. Thanks !

Thanks Robert ! We are trying to convince the Invoicing company to print a bar code on the invoices, which we can scan via a Bar code scanner. But they are using SAP and a big company like them may choose to give us a helping hand or not. But that is a Great option, I agree.

Thanks Joe ! Whenever I see your comments, I can feel the passion with which you experts on EE go all out to help the person in need. Thanks a lot, first of all, for being so enthusiastic, and thanks a lot for your time. I can see the merit in those products. As the samples, of physical invoices I got, are of poor quality in general (Truck Operators tend to treat them poorly for reasons not so obvious to us), the results were not so good. In fact, we need 6 fields out of the many fields on the invoices. So it may take a lot more work than just scanning the invoice.

I have attached a sample though (One pdf File and One Word File, scanned via Microsoft Lens App for Android)
We need to extract
-Invoice No
-Invoice Date
-Way Bill No
-Vehicle Number
-Lorry Receipt No
-Delivery Instruction Number

-Customer Name and Code (in separate Columns)
-Consignee Name and Code (in separate Columns)
-Destination and City Code (in separate Columns)
-HSN Code
-Quantity (MT) and Bag (in separate Columns)
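
Once OCR text is available, picking out the first six fields listed above could look roughly like the Python sketch below. It assumes each label is printed on the invoice exactly as written in the list and is followed by its value on the same line, which may not hold for the real document; `extract_fields` and `write_rows` are hypothetical helper names.

```python
import csv
import re

# Labels taken from the list above; the wording printed on the real invoice may differ.
FIELDS = [
    "Invoice No",
    "Invoice Date",
    "Way Bill No",
    "Vehicle Number",
    "Lorry Receipt No",
    "Delivery Instruction Number",
]


def label_pattern(label):
    """Build a regex matching 'Label', an optional '.', ':' or '-', then the value."""
    words = [re.escape(word) for word in label.split()]
    return re.compile(
        r"[ \t]+".join(words) + r"\.?[ \t]*[:\-]?[ \t]*(\S.*)",
        re.IGNORECASE,
    )


def extract_fields(ocr_text):
    """Return a dict of field name to value; fields that are not found stay empty."""
    record = {}
    for label in FIELDS:
        match = label_pattern(label).search(ocr_text)
        record[label] = match.group(1).strip() if match else ""
    return record


def write_rows(records, handle):
    """Write the records as CSV with one column per field, ready to open in Excel."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)
```

Leaving a field empty when its label is not found makes the gaps visible in the spreadsheet, so an operator can key in only the missing values.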

We are still exploring other options, such as dictating to Excel 365.
What is pen input? I have heard of it as a way to feed data into the system, but I am not sure what it is.

Regards
Kanwal
emamiEE.docx
emamiEE.pdf
This should be a reliable download for Windows:
https://github.com/tesseract-ocr/tesseract/wiki/4.0-with-LSTM#400-alpha-for-windows

Thank you for uploading the documents.
#1 rule when scanning documents: Remove the staple and flatten the page against the scan plane. (alternatively, feed multiple sheets into a scanner page feeder)
Thanks Aikimark,

https://github.com/tesseract-ocr/tesseract/wiki/4.0-with-LSTM#400-alpha-for-windows

There are a lot of files there. I am not able to find the exact .exe file for installation. Am I missing something? GitHub seems to work differently, and I have not used it until now.
Look for Windows downloads.
Hi Kanwal,

> but as you and Joe are pretty sure about it, I feel we must give it a try,

I have some other comments for you, but I want to respond to this one first, as I want it to be clear that I am NOT "pretty sure about it", especially after seeing the quality of your source documents. I worked in the high-end document imaging business (literally, million-dollar systems) for over 20 years and more often than not recommended against OCR. Yes, it can be worthwhile in some cases, but most times heads-down data entry was the more cost-effective approach.

Now, on to other comments.

> the user is not too keen on using an OCR due to the quality of physical Invoices (the source document)

After looking at your sample doc, the user is absolutely correct!

> Kindly direct me from where to download the Tesseract.

The latest version is 5.0.0-alpha. The pre-built binaries for Windows are available at the UB Mannheim Tesseract wiki:
https://github.com/UB-Mannheim/tesseract/wiki

Here are the direct download links:

32-bit:
https://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-w32-setup-v5.0.0-alpha.20191030.exe
64-bit:
https://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-w64-setup-v5.0.0-alpha.20191030.exe

Note this comment at the site (copied here under "Fair Use"):
We don't provide an installer for Tesseract 4.1.0 because we think that the latest version 5.0.0-alpha is better for most Windows users in many aspects (functionality, speed, stability). Version 4.1 is only needed for people who develop software based on the Tesseract API and who need 100 % API compatibility with version 4.0.

> As the samples, of physical invoices I got, are of poor quality in general (Truck Operators tend to treat them poorly for reasons not so obvious to us), the results were not so good.

Indeed! I ran the PDF through ABBYY FineReader and Nuance OmniPage (did not bother with PaperPort or Power PDF, since they use the OmniPage engine under the covers). The Excel spreadsheets that they created are attached...as you say, not so good!

Tesseract does not support PDF as the input, so I converted your PDF to a PNG (at 300 DPI) using the Xpdf utility called PDFtoPNG. The text file that Tesseract 5.0.0-alpha created is also attached...and also not so good. :)
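
For anyone who wants to repeat that conversion, the two steps can be scripted roughly as follows. This is a sketch, assuming the Xpdf `pdftopng` and `tesseract` executables are both on the PATH and that `pdftopng` names its output files `<root>-<page number>.png`; the function names are hypothetical.

```python
import glob
import subprocess


def pdftopng_command(pdf_path, png_root, dpi=300):
    """Command line that renders each PDF page to a PNG at the given resolution."""
    return ["pdftopng", "-r", str(dpi), pdf_path, png_root]


def tesseract_command(png_path, output_base):
    """Command line that OCRs one image; Tesseract writes <output_base>.txt."""
    return ["tesseract", png_path, output_base]


def ocr_pdf(pdf_path, png_root):
    """Convert the PDF to PNG pages, OCR each page, and return the text file names."""
    subprocess.run(pdftopng_command(pdf_path, png_root), check=True)
    text_files = []
    # Assumed naming: pdftopng writes one file per page as <root>-<page number>.png.
    for png_path in sorted(glob.glob(glob.escape(png_root) + "-*.png")):
        output_base = png_path[: -len(".png")]
        subprocess.run(tesseract_command(png_path, output_base), check=True)
        text_files.append(output_base + ".txt")
    return text_files
```

Calling `ocr_pdf("emamiEE.pdf", "page")` would then leave one text file per page next to the PNG files.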

I don't know if voice input is a viable method for your project, but I'm certain that OCR is not. Regards, Joe

Attachments below:
kanwal-emamiEE-ABBYY-FineReader.xlsx
kanwal-emamiEE-Nuance-OmniPage.xlsx
kanwal-emamiEE-Tesseract.txt
Thanks Aikimark !
Thanks Joe !

I am going to try Tesseract "on a personal level", as the user seems a bit scared of OCR.
Thanks a Lot to both of You again for the guidance !!!

Regarding Speech recognition, here is something which looks good.
https://www.youtube.com/watch?v=OmgxAOACXRk

I downloaded nuance_dragon_naturallyspeaking_3836309580 from the site and installed it.
But it brought 110 PUPs (potentially unwanted programs) along with it and installed them on my system, so I had to uninstall it.
But it really seems promising.

Regards
Kanwaljit
In the comments, the YouTuber has given the following link:
https://github.com/t4ngo/dragonfly
I need to give speech-to-text a try in Excel.
Can you guide me on how to do it?
ASKER CERTIFIED SOLUTION
Joe Winograd
Thanks Joe !
That is a pretty honest assessment, I feel.

We have been trying to hire a person for this purpose, but loading goes on 24x7, and it is quite difficult to hire a good person for the 6 pm to 8 am shift. That is the reason for this approach. I thought it was worth a try.
You don't have to hire someone on-premises.  You can send the images to data entry operators anywhere in the world (in any time zone) to meet your data entry needs.

The US Postal System did this, scanning the envelope/package image in Washington DC, transmitting the image (a couple of states away) to West Virginia.  Data entry operators in West Virginia did data entry about the address, usually the zip code.  The image ID and the data entry characters were transmitted back to Washington where the envelope/package was physically processed.
Thanks Guys !
We have a better vision now !
You're welcome, Kanwaljit. Good luck on the project! Regards, Joe