Solved

Need to be able to OCR and then Text search a 1TB directory containing a variety of documents, tiffs, PDFs, WPD, DOC...

Posted on 2014-03-26
Last Modified: 2014-05-15
Hello,

As I stated in the title, I have a large directory with many sub-folders and files of different types, the bulk of them TIFF, WPD, DOC, and PDF. I need a way to OCR that directory and then text search for specific words to find documents and the exact file paths for those documents. I am sure there is software out there, but I am lost with this.
Question by:sXmont1j6
9 Comments
 
LVL 53

Expert Comment

by:Joe Winograd, EE MVE
ID: 39956871
I recommend a top quality OCR package and a top quality search product. You haven't mentioned budget, so I'm going to assume that you're willing to spend money to get really good software. But if you have a limited budget for the project, let me know and I can make some other recommendations.

For OCR, two very well regarded programs are Nuance OmniPage and ABBYY FineReader. Here are links to more information:
http://nuance.com/for-individuals/by-product/omnipage/index.htm
http://finereader.abbyy.com/

Here are links to feature comparison charts:
http://nuance.com/ucmprod/groups/imaging/@web-enus/documents/collateral/nc_016052.pdf
http://finereader.abbyy.com/editions_comparison_chart/

I use both and can say that both are very accurate, but I can't say that one is always better than the other. I've tested them on the same documents, and sometimes one is better, sometimes the other is, but for the most part, the accuracy is similar - both very good!

Another idea is Nuance's PaperPort product, which is not a dedicated OCR package, but can perform OCR via Nuance's OmniPage, which is included "under the covers" (the OmniPage OCR engine is built into PaperPort):
http://nuance.com/for-individuals/by-product/paperport/index.htm

PaperPort is a robust scanning/imaging package that does a lot more than just OCR (but for pure OCR, is not as robust as OmniPage and FineReader). I use PaperPort extensively (more than OmniPage and FineReader combined). Its OCR capabilities (via the built-in OmniPage) may be adequate for your purposes. But if not, then go with OmniPage or FineReader. PaperPort has a built-in search tool (called All-in-One Search), but I do not use (or recommend) it.

For search, I recommend dtSearch:
http://www.dtsearch.com

This is an extraordinary search product that can index and search any file type that has text in it. You may be interested in an article I wrote called "dtSearch and PaperPort", available here:
https://sites.google.com/site/wikipaperport/articles-tutorials

As a disclaimer, I want to emphasize that I have no affiliation with these companies and no financial interest in them whatsoever. I am simply a happy user/customer. Regards, Joe
 

Author Comment

by:sXmont1j6
ID: 39957031
I am familiar with all of the products. I suppose my scenario is the following:

A shared directory with a ton of sub-directories. Will I be able to OCR from the top folder level, and will it OCR each file in each sub-directory?
 
LVL 53

Expert Comment

by:Joe Winograd, EE MVE
ID: 40018032
Hi sXmont1j6,
I'm sorry for taking so long to reply to your post. I normally get an email from EE every time there's a post on threads, but I either didn't get it this time or I missed it. In any case, the answer to your question for OmniPage is YES. You will be able to start the OCR process at a folder and tell it to include all subfolders. Regards, Joe

Author Comment

by:sXmont1j6
ID: 40018124
I presume it handles many different file formats as well? TIFF, DOC, WPD, PDF, etc.
 
LVL 53

Expert Comment

by:Joe Winograd, EE MVE
ID: 40018186
Yes, all of those and more. Of course, it doesn't OCR file types that already have text, such as DOC and WPD, but it can save to those. Attached is a document created from the OmniPage Ultimate (aka version 19) help file that details its supported file types. Regards, Joe
OP19-supported-file-types.pdf
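
Before starting a large OCR run, it may help to see how many files actually need OCR. The following is a minimal Python sketch, not part of OmniPage or any other product mentioned in this thread. It walks a top folder and all of its subfolders and groups files by extension. The extension groupings and the `triage` function name are assumptions for illustration; PDFs are set aside separately because a PDF may be image-only or may already contain text.

```python
import os
import sys
from collections import defaultdict

# Assumed groupings, for illustration only; adjust to the real file mix.
IMAGE_ONLY = {".tif", ".tiff"}    # scanned images: no text layer, need OCR
ALREADY_TEXT = {".doc", ".wpd"}   # word-processor files: text already present
MAYBE_TEXT = {".pdf"}             # may be image-only or already searchable


def triage(top_folder):
    """Walk top_folder and every subfolder, grouping full file paths by OCR need."""
    groups = defaultdict(list)
    for folder, _subfolders, files in os.walk(top_folder):
        for name in files:
            ext = os.path.splitext(name)[1].lower()
            path = os.path.join(folder, name)
            if ext in IMAGE_ONLY:
                groups["needs_ocr"].append(path)
            elif ext in ALREADY_TEXT:
                groups["has_text"].append(path)
            elif ext in MAYBE_TEXT:
                groups["check_pdf"].append(path)
            else:
                groups["other"].append(path)
    return groups


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python triage.py <top folder>
    for group, paths in sorted(triage(sys.argv[1]).items()):
        print(f"{group}: {len(paths)} files")
```

The counts give a rough idea of how much of the 1TB is image-only material, which is the part the OCR step has to process.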
 
LVL 53

Expert Comment

by:Joe Winograd, EE MVE
ID: 40040700
Hi sXmont1j6,
Any update on this? I'm happy to help in any way. Just let me know. Regards, Joe
 
LVL 53

Accepted Solution

by:
Joe Winograd, EE MVE earned 500 total points
ID: 40059907
Hi sXmont1j6,
Since my last post, Nuance has released a new product called Power PDF. It comes in two versions — Standard and Advanced. There's a 30-day free trial of the Advanced version here:
http://www.nuance.com/for-business/document-imaging-and-scanning/power-pdf-converter-advanced/index.htm#resources

I recently installed it and have been experimenting with it — so far, I'm happy with it. It has a Watched Folder feature that you could use to OCR from the top folder into each subfolder. Here's what the Watched Folder dialog looks like:

[Image: PPA WF include subs]

The list of supported file types for the source files is extensive:

[Image: PPA supported source file types]

And here's the list of supported file types for the destination files:

[Image: PPA supported destination file types]

Of course, for your purposes, you would select Searchable PDF, which performs OCR.

It also has a Batch Converter feature, but the GUI version of it doesn't support subfolders, so that is why I recommended the Watched Folder feature. However, the Batch Converter also comes in a command line version, which does support subfolders, so you may be interested in that.

This is just one more idea for the OCR piece of it, and then you can feed the OCRed files (Searchable PDFs) to dtSearch for indexing/searching, as I mentioned in a previous post. Speaking of dtSearch, here's a link to a recent thread here at EE where I provided some details about it:
http://www.experts-exchange.com/Software/Office_Productivity/Q_28418575.html#a40018425
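
To illustrate the index-then-search idea in general terms, here is a minimal Python sketch. It is not dtSearch and does not use its API. It assumes, purely for illustration, that the text of each document has already been exported to plain `.txt` files somewhere under the top folder; the `build_index` and `search` names are hypothetical. It builds a word-to-paths index once and then returns the exact file paths that contain every search word.

```python
import os
import re
from collections import defaultdict

WORD = re.compile(r"[a-z0-9]+")


def build_index(top_folder):
    """Map each lowercase word to the set of .txt file paths that contain it."""
    index = defaultdict(set)
    for folder, _subfolders, files in os.walk(top_folder):
        for name in files:
            if not name.lower().endswith(".txt"):
                continue  # assumption: extracted text lives in .txt files
            path = os.path.join(folder, name)
            with open(path, encoding="utf-8", errors="ignore") as handle:
                for line in handle:
                    for word in WORD.findall(line.lower()):
                        index[word].add(path)
    return index


def search(index, *words):
    """Return the sorted paths of files that contain every one of the given words."""
    hits = [index.get(word.lower(), set()) for word in words]
    return sorted(set.intersection(*hits)) if hits else []
```

For example, `search(build_index(top), "lease", "smith")` would list the paths of the files containing both words. A dedicated product adds much more (phrase and proximity search, native file-format parsing, a persistent index), so this only shows the shape of the approach.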

Regards, Joe
 

Author Comment

by:sXmont1j6
ID: 40067773
I tested this with a dummy set of data and it worked properly. I truly thank you for your help.
 
LVL 53

Expert Comment

by:Joe Winograd, EE MVE
ID: 40067812
You're very welcome. I'm really glad that it worked for you. Regards, Joe