
Post-scan OCR

I need some recommendations on a scanning project I will define and manage - converting an employee's paper files to digital. The docs are being scanned, QC'd, renamed, and moved to SharePoint.

The owner/user of the files raised the question of text searchability. Is it best practice to perform OCR during the scan? The production scanner being used isn't capable of OCR, so it needs to be done post-scan, but I'd like an idea of how that would impact outcomes.

Joe Winograd

Some people prefer doing OCR during the scan, but others prefer doing it after the scan. The advantage of the latter is that scanning time is faster without doing the OCR. One clarification - scanners simply produce the bitmapped images (typically PDF or TIFF for documents, and JPG for photos) and feed the images to software, which then OCRs them. So the production scanner is capable of OCR in the sense that it feeds the images to software which, if so configured, can perform the OCR as soon as it receives the images. However, I understand that in your case, the production scanner is sending along images that are just bitmaps...no text...they have not been OCRed.

So my recommendation is to get a good quality OCR package that can OCR the images and convert them to searchable PDF files, meaning the PDF file will have the original image as well as a layer of text (searchable and copy/pastable) created by the OCR. Two well regarded OCR programs are Nuance OmniPage and ABBYY FineReader. Here are links to more information:

http://nuance.com/for-business/by-product/omnipage/professional/index.htm
http://finereader.abbyy.com/

Here are links to feature comparison charts:

http://nuance.com/ucmprod/groups/imaging/@web-enus/documents/collateral/nc_016052.pdf
http://finereader.abbyy.com/Default.aspx?DN=baebec7c-e952-44f3-93bc-065b59dd59bb

I use both products and can say that both are very accurate, but I can't say that one is always better than the other. I've tested them on the same documents, and sometimes one is better, sometimes the other is, but for the most part, the accuracy is similar - both very good! They both can make searchable PDF files, as mentioned above, i.e., PDF files with both the scanned images and a layer of text created by the OCR process.

Another idea is Nuance's PaperPort product, which is not a dedicated OCR package, but can perform OCR via Nuance's OmniPage, which is included "under the covers" (the OmniPage OCR engine is built into PaperPort):
http://nuance.com/for-business/by-product/paperport/index.htm

PaperPort is a robust scanning/imaging package that does a lot more than just OCR (but for pure OCR, does not have as many options as OmniPage and FineReader). I use PaperPort extensively (more than OmniPage and FineReader combined) to create PDF Searchable Image files. It can also scan directly to CSV, DOC, XLS, and many other file types.

PaperPort can scan and OCR at the same time, or it can OCR docs from other sources, such as your production scanner, which is probably dumping the scanned images into a shared folder. This leads to the answer to your question about how doing the OCR post-scan impacts outcomes...the answer is - it doesn't! The same OCR software is invoked at scan time or post-scan by packages like FineReader, OmniPage, and PaperPort. The OCR accuracy will be equivalent at scan time or post-scan time, since they're using the same OCR engine. An important point, though, is that the production scanner needs to send images that lend themselves well to the OCR process. I prefer black&white, 300 DPI, except in unusual cases. Wayne Fulton's excellent site, "A few scanning tips", has some good advice on creating docs that play nicely with OCR:
http://www.scantips.com/basics04.html
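
To make the post-scan workflow concrete, here is a minimal sketch of a batch step that works through a shared folder of scans. It is only an illustration under stated assumptions: the folder layout and the function names `pending_scans` and `ocr_command` are hypothetical, and the command line uses OCRmyPDF, an open-source tool that is not one of the products discussed in this thread (PaperPort, OmniPage, and FineReader are driven through their own interfaces).

```python
from pathlib import Path


def pending_scans(incoming: Path, searchable: Path) -> list[Path]:
    """Return scanned PDFs in `incoming` that have no same-named file in `searchable`."""
    # Names already present in the output folder count as "done".
    done = {p.name.lower() for p in searchable.iterdir() if p.is_file()}
    return sorted(
        p for p in incoming.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf" and p.name.lower() not in done
    )


def ocr_command(src: Path, searchable: Path) -> list[str]:
    """Build (but do not run) an OCRmyPDF command line for one scanned file.

    Assumption: OCRmyPDF stands in for whatever OCR package is actually used.
    --skip-text leaves pages that already contain text untouched.
    """
    return ["ocrmypdf", "--skip-text", str(src), str(searchable / src.name)]
```

Each command returned by `ocr_command` could then be handed to `subprocess.run(cmd, check=True)`. Keeping the image-only originals and the searchable copies in separate folders makes it easy to rerun the step without OCRing the same file twice.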

Regards, Joe

K_Deutsch

ASKER

I am looking in to the info provided, thank you. You mentioned 300 dpi. Please look at the two files attached. One uses the "compact PDF" setting on our production scanner, and the other uses "PDF 300 X 300." The file size jumps from 136 KB to 1,699 KB for an all-text three-pager. That seems crazy and will drastically increase the storage needs for this project.
compact-pdf.pdf
pdf-300X300.pdf

Joe Winograd

The reason for the size is only partly due to 300x300. It is mostly due to its being a color (24-bit) scan. The Konica Minolta bizhub C353 supports both color scanning and black&white (1-bit) scanning. For OCR purposes, 300DPI/B&W is fine. Try that on your 3-page document and it will be in the 150KB range. Send it to me and I'll convert it with PaperPort (using the built-in OmniPage OCR engine) to a searchable PDF file, which will be only slightly bigger than the original (the text adds relatively little size compared to the image). I took your 300x300 color doc and converted it to a B&W PDF searchable doc using PaperPort...it is attached, and it is only 138KB! Open it with Adobe Reader and you will be able to search and copy/paste the text. Regards, Joe
pdf-300X300-OCRed.pdf
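
The effect of bit depth on size can be checked with a little arithmetic. The sketch below computes the uncompressed bitmap size of one page, assuming a US Letter page (8.5 x 11 inches), which the thread does not state; the function name `raw_page_bytes` is made up for illustration. Real PDF files are compressed, so the actual files are far smaller than these figures, but the 24-to-1 ratio between color and B&W raw data shows why color depth matters more than resolution here.

```python
def raw_page_bytes(width_in: float, height_in: float, dpi: int, bits_per_pixel: int) -> int:
    """Uncompressed bitmap size in bytes for one scanned page."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    # Round up to whole bytes.
    return (pixels * bits_per_pixel + 7) // 8


# Assumption: US Letter page scanned at 300 DPI.
color = raw_page_bytes(8.5, 11, 300, 24)  # 24-bit color
bw = raw_page_bytes(8.5, 11, 300, 1)      # 1-bit black & white

print(f"color: {color:,} bytes")   # color: 25,245,000 bytes
print(f"b&w:   {bw:,} bytes")      # b&w:   1,051,875 bytes
print(f"ratio: {color // bw}x")    # ratio: 24x
```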

Joe Winograd

Just to be clear, the 138KB PDF searchable file that I sent to you was based on your 1.7MB file, not the 136KB one.

K_Deutsch

ASKER

Had I read more carefully, I would've seen you mentioned B&W from the beginning. I actually do have PaperPort installed. What you describe makes sense, and you have demonstrated it effectively. The only remaining issue is this...I can see how if you have the document opened in your PDF reader, you can search the text. But what about Windows file searches? Is there a way for the words to appear there? Like if you index the files or something?

Joe Winograd

Yes, Windows Search (WS4) or any decent search product will be able to index and then search the text. I use a search tool called dtSearch on all of the PDF Searchable Image files that I create with PaperPort - it indexes and searches them perfectly. I know of PaperPort users who do the same with Copernic, X1, WS4, and other search products. They can all do it because the text is embedded in the PDF file (along with the images).

Btw, what version of PaperPort do you have? The first version that could make a PDF Searchable Image file without OmniPage installed was PP12 (PP14, the latest release, can do it, too - yes, Nuance got superstitious and did not release a version 13). So if you have PP11 or earlier, you will need to either install the separate OmniPage program or upgrade your PP to either 12 or 14, both of which can create PDF Searchable Image files with the built-in OmniPage OCR engine. Regards, Joe

K_Deutsch

ASKER

PP 12 is what I have installed. I have retested, using b&w 300X300 on the bizhub c353. File size after creating searchable doc is 415KB.

Joe Winograd

What was the file size before conversion? How did you convert in PP12?

I also did the conversion in PP12 (actually, 12.1). I did a Save As on the PP Desktop and changed the Color drop-down from Original to B&W, and the Resolution drop-down from Original to 300 DPI. Try that and see what happens.

Btw, if you're on PP12.0, I suggest you upgrade to 12.1 (FREE!), which fixed numerous bugs. My article describes how to do it at no charge via the Nuance website, as long as you have your PP12 serial number:
https://www.experts-exchange.com/Web_Development/Document_Imaging/A_6537-PaperPort-Upgrade-How-to-download-and-install-updated-versions-of-PaperPort-11-and-12.html

Regards, Joe

K_Deutsch

ASKER

Let's not compare results based on converting that 1.7 MB file. I have attached a scan that was made at 300X300 black and white from the start. Here is what I do to make the conversion: I open PDF Viewer from Start Menu/PaperPort, choose the PDF, and then choose Tools > Make Searchable PDF. Files attached.
bw-300X300.pdf
bw-300X300-searchable.pdf

ASKER CERTIFIED SOLUTION

Joe Winograd


Joe Winograd

Btw, since you're a business, not an individual, and given how expensive the Konica Minolta bizhub C353 is (in the $10K range last time I looked), I suggest you spend $150 or so and get PP14 Professional:
http://shop.nuance.com/store/nuanceus/pd/productID.234284400

List price is $200 but street price is $100-150. For example, it is currently $140 at Amazon:
http://www.amazon.com/Nuance-Communications-Inc-F309A-G00-140-Professional/dp/B005CELL1G

PP14 uses a newer OmniPage OCR engine than PP12 does, and it is more accurate. You could save a few bucks by getting PP14 Standard instead of Professional, but for your business purposes, there may be some features in Pro that make the price difference (less than 100 bucks in street price) easily worthwhile. Here's a comparison matrix of PP14 Std vs. Pro:
http://www.nuance.com/ucmprod/groups/imaging/@web-enus/documents/collateral/nc_017733.pdf

Regards, Joe

K_Deutsch

ASKER

Okay, I duplicated your parameters, and enjoyed the same result. I will check out your hardware & software suggestions. Nice work, Joe. I am very grateful.

Joe Winograd

My pleasure. I always enjoy working on the interesting issues, like this one. Good luck on the project! Regards, Joe