• Status: Solved

Need to be able to OCR and then Text search a 1TB directory containing a variety of documents, tiffs, PDFs, WPD, DOC...

Hello,

As I stated in the title, I have a large directory with many sub-folders and files of different types; TIFF, WPD, DOC, and PDF make up the bulk of them. I need a way to OCR that directory and then text-search it for specific words, so that I can find the matching documents and their exact file paths. I am sure there is software out there, but I am lost with this.
Asked:
sXmont1j6
1 Solution
 
Joe Winograd, EE MVE 2015 & 2016, Developer, Commented:
I recommend a top quality OCR package and a top quality search product. You haven't mentioned budget, so I'm going to assume that you're willing to spend money to get really good software. But if you have a limited budget for the project, let me know and I can make some other recommendations.

For OCR, two very well regarded programs are Nuance OmniPage and ABBYY FineReader. Here are links to more information:
http://nuance.com/for-individuals/by-product/omnipage/index.htm
http://finereader.abbyy.com/

Here are links to feature comparison charts:
http://nuance.com/ucmprod/groups/imaging/@web-enus/documents/collateral/nc_016052.pdf
http://finereader.abbyy.com/editions_comparison_chart/

I use both and can say that both are very accurate, but I can't say that one is always better than the other. I've tested them on the same documents, and sometimes one is better, sometimes the other is, but for the most part, the accuracy is similar - both very good!

Another idea is Nuance's PaperPort product, which is not a dedicated OCR package, but can perform OCR via Nuance's OmniPage, which is included "under the covers" (the OmniPage OCR engine is built into PaperPort):
http://nuance.com/for-individuals/by-product/paperport/index.htm

PaperPort is a robust scanning/imaging package that does a lot more than just OCR (but for pure OCR, is not as robust as OmniPage and FineReader). I use PaperPort extensively (more than OmniPage and FineReader combined). Its OCR capabilities (via the built-in OmniPage) may be adequate for your purposes. But if not, then go with OmniPage or FineReader. PaperPort has a built-in search tool (called All-in-One Search), but I do not use (or recommend) it.

For search, I recommend dtSearch:
http://www.dtsearch.com

This is an extraordinary search product that can index and search any file type that has text in it. You may be interested in an article I wrote called "dtSearch and PaperPort", available here:
https://sites.google.com/site/wikipaperport/articles-tutorials

As a disclaimer, I want to emphasize that I have no affiliation with these companies and no financial interest in them whatsoever. I am simply a happy user/customer. Regards, Joe
 
sXmont1j6 (Author) Commented:
I am familiar with all of the products. I suppose my scenario is the following:

A shared directory with a ton of sub-directories. Will I be able to start the OCR at the top folder level, and will it OCR each file in each sub-directory?
 
Joe Winograd, EE MVE 2015 & 2016, Developer, Commented:
Hi sXmont1j6,
I'm sorry for taking so long to reply to your post. I normally get an email from EE every time there's a post on threads, but I either didn't get it this time or I missed it. In any case, the answer to your question for OmniPage is YES. You will be able to start the OCR process at a folder and tell it to include all subfolders. Regards, Joe
 
sXmont1j6 (Author) Commented:
I presume it handles many different file formats as well? TIFF, DOC, WPD, PDF, etc.
 
Joe Winograd, EE MVE 2015 & 2016, Developer, Commented:
Yes, all of those and more. Of course, it doesn't OCR file types that already have text, such as DOC and WPD, but it can save to those. Attached is a document created from the OmniPage Ultimate (aka version 19) help file that details its supported file types. Regards, Joe
OP19-supported-file-types.pdf
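
To illustrate the point above about which file types need OCR, here is a minimal Python sketch that walks a top-level folder and all of its subfolders and sorts the files into rough groups by extension. It is not part of OmniPage or any other product mentioned in this thread; the function name `triage_tree` and the two extension sets are assumptions made up for this example, and a PDF listed as an OCR candidate may in fact already contain text.

```python
import os
from collections import defaultdict

# Assumed extension groups, for illustration only; adjust to the real data set.
IMAGE_EXTS = {".tif", ".tiff", ".pdf"}   # may need OCR (some PDFs already have text)
TEXT_EXTS = {".doc", ".docx", ".wpd"}    # already contain text, no OCR needed


def triage_tree(root):
    """Walk root and every subfolder, grouping full file paths by category."""
    groups = defaultdict(list)
    for folder, _subdirs, files in os.walk(root):
        for name in files:
            ext = os.path.splitext(name)[1].lower()
            if ext in IMAGE_EXTS:
                category = "ocr_candidate"
            elif ext in TEXT_EXTS:
                category = "has_text"
            else:
                category = "other"
            groups[category].append(os.path.join(folder, name))
    return groups
```

Calling `triage_tree` on the top of the shared directory gives a quick count of how many files would actually go through OCR before committing to a long batch run.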
 
Joe Winograd, EE MVE 2015 & 2016, Developer, Commented:
Hi sXmont1j6,
Any update on this? I'm happy to help in any way. Just let me know. Regards, Joe
 
Joe Winograd, EE MVE 2015 & 2016, Developer, Commented:
Hi sXmont1j6,
Since my last post, Nuance has released a new product called Power PDF. It comes in two versions — Standard and Advanced. There's a 30-day free trial of the Advanced version here:
http://www.nuance.com/for-business/document-imaging-and-scanning/power-pdf-converter-advanced/index.htm#resources

I recently installed it and have been experimenting with it — so far, I'm happy with it. It has a Watched Folder feature that you could use to OCR from the top folder into each subfolder. Here's what the Watched Folder dialog looks like:

[Image: PPA WF include subs]

The list of supported file types for the source files is extensive:

[Image: PPA supported source file types]

And here's the list of supported file types for the destination files:

[Image: PPA supported destination file types]

Of course, for your purposes, you would select Searchable PDF, which performs OCR.

It also has a Batch Converter feature, but the GUI version of it doesn't support subfolders, so that is why I recommended the Watched Folder feature. However, the Batch Converter also comes in a command line version, which does support subfolders, so you may be interested in that.

This is just one more idea for the OCR piece of it, and then you can feed the OCRed files (Searchable PDFs) to dtSearch for indexing/searching, as I mentioned in a previous post. Speaking of dtSearch, here's a link to a recent thread here at EE where I provided some details about it:
http://www.experts-exchange.com/Software/Office_Productivity/Q_28418575.html#a40018425
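
To make the "index, then search" step concrete, here is a small Python sketch of the general idea. It is not how dtSearch works internally and is no substitute for it. It assumes, purely for illustration, that the text of each document has already been exported to a plain `.txt` file somewhere under the folder tree; the names `build_index` and `search` are hypothetical. It builds a simple word-to-paths index and returns the exact paths of files that contain every search word.

```python
import os
import re
from collections import defaultdict

WORD_RE = re.compile(r"[a-z0-9]+")


def build_index(root):
    """Map each lowercased word to the set of .txt file paths containing it."""
    index = defaultdict(set)
    for folder, _subdirs, files in os.walk(root):
        for name in files:
            if not name.lower().endswith(".txt"):
                continue  # assumption: only plain-text exports are indexed
            path = os.path.join(folder, name)
            with open(path, encoding="utf-8", errors="ignore") as handle:
                for line in handle:
                    for word in WORD_RE.findall(line.lower()):
                        index[word].add(path)
    return index


def search(index, *words):
    """Return the sorted paths of files that contain every given word."""
    sets = [index.get(word.lower(), set()) for word in words]
    if not sets:
        return []
    return sorted(set.intersection(*sets))
```

An in-memory index like this would probably not be practical for a 1TB directory, which is one reason a dedicated indexing product is the better fit here.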

Regards, Joe
 
sXmont1j6 (Author) Commented:
I tested this with a dummy set of data and it worked properly. I truly thank you for your help.
 
Joe Winograd, EE MVE 2015 & 2016, Developer, Commented:
You're very welcome. I'm really glad that it worked for you. Regards, Joe