Solved

Need to OCR and then text-search a 1 TB directory containing a variety of documents: TIFF, PDF, WPD, DOC...

Posted on 2014-03-26
Last Modified: 2014-05-15
Hello,

As I stated in the title, I have a large directory with many subfolders and files of different types, mostly TIFF, WPD, DOC, and PDF. I need a way to OCR that directory and then text-search it for specific words, so that I can find the matching documents and their exact file paths. I am sure there is software out there for this, but I am lost.
Question by:sXmont1j6
 
LVL 52

Expert Comment

by:Joe Winograd, EE MVE
ID: 39956871
I recommend a top-quality OCR package and a top-quality search product. You haven't mentioned budget, so I'm going to assume that you're willing to spend money to get really good software. But if you have a limited budget for the project, let me know and I can make some other recommendations.

For OCR, two very well regarded programs are Nuance OmniPage and ABBYY FineReader. Here are links to more information:
http://nuance.com/for-individuals/by-product/omnipage/index.htm
http://finereader.abbyy.com/

Here are links to feature comparison charts:
http://nuance.com/ucmprod/groups/imaging/@web-enus/documents/collateral/nc_016052.pdf
http://finereader.abbyy.com/editions_comparison_chart/

I use both and can say that both are very accurate, but I can't say that one is always better than the other. I've tested them on the same documents, and sometimes one is better, sometimes the other is, but for the most part, the accuracy is similar - both very good!

Another idea is Nuance's PaperPort product, which is not a dedicated OCR package, but can perform OCR via Nuance's OmniPage, which is included "under the covers" (the OmniPage OCR engine is built into PaperPort):
http://nuance.com/for-individuals/by-product/paperport/index.htm

PaperPort is a robust scanning/imaging package that does a lot more than just OCR (but for pure OCR, is not as robust as OmniPage and FineReader). I use PaperPort extensively (more than OmniPage and FineReader combined). Its OCR capabilities (via the built-in OmniPage) may be adequate for your purposes. But if not, then go with OmniPage or FineReader. PaperPort has a built-in search tool (called All-in-One Search), but I do not use (or recommend) it.

For search, I recommend dtSearch:
http://www.dtsearch.com

This is an extraordinary search product that can index and search any file type that has text in it. You may be interested in an article I wrote called "dtSearch and PaperPort", available here:
https://sites.google.com/site/wikipaperport/articles-tutorials

As a disclaimer, I want to emphasize that I have no affiliation with these companies and no financial interest in them whatsoever. I am simply a happy user/customer. Regards, Joe
 

Author Comment

by:sXmont1j6
ID: 39957031
I am familiar with all of those products. My scenario is the following:

A shared directory with a ton of subdirectories. Will I be able to start the OCR at the top folder level, and will it OCR each file in each subdirectory?
 
LVL 52

Expert Comment

by:Joe Winograd, EE MVE
ID: 40018032
Hi sXmont1j6,
I'm sorry for taking so long to reply to your post. I normally get an email from EE every time there's a post on threads, but I either didn't get it this time or I missed it. In any case, the answer to your question for OmniPage is YES. You will be able to start the OCR process at a folder and tell it to include all subfolders. Regards, Joe
 

Author Comment

by:sXmont1j6
ID: 40018124
I presume it handles many different file formats as well? TIFF, DOC, WPD, PDF, etc.

 
LVL 52

Expert Comment

by:Joe Winograd, EE MVE
ID: 40018186
Yes, all of those and more. Of course, it doesn't OCR file types that already have text, such as DOC and WPD, but it can save to those. Attached is a document created from the OmniPage Ultimate (aka version 19) help file that details its supported file types. Regards, Joe
OP19-supported-file-types.pdf
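
To illustrate the two points above (starting at a top folder with all subfolders included, and only OCRing file types that do not already contain text), here is a minimal Python sketch that inventories a directory tree before any OCR is run. It is not part of OmniPage. The function name `classify_tree` and the two extension sets are assumptions made for illustration, not any product's supported-file list.

```python
import os
from collections import defaultdict

# Assumption: illustrative extension groups only, not an official list.
# Some PDFs are already searchable; this sketch does not inspect file contents.
NEEDS_OCR = {".tif", ".tiff", ".pdf"}
HAS_TEXT = {".doc", ".wpd"}


def classify_tree(top):
    """Walk `top` and every subfolder, grouping full file paths by handling."""
    groups = defaultdict(list)
    for folder, _subdirs, files in os.walk(top):
        for name in files:
            ext = os.path.splitext(name)[1].lower()
            path = os.path.join(folder, name)
            if ext in NEEDS_OCR:
                groups["needs_ocr"].append(path)   # image-like: candidate for OCR
            elif ext in HAS_TEXT:
                groups["has_text"].append(path)    # already text: index directly
            else:
                groups["other"].append(path)       # unknown: review by hand
    return groups
```

Calling `classify_tree` on the top-level shared folder returns the full path of every file in every subfolder, which gives a rough count of how much of the 1 TB actually needs OCR.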
 
LVL 52

Expert Comment

by:Joe Winograd, EE MVE
ID: 40040700
Hi sXmont1j6,
Any update on this? I'm happy to help in any way. Just let me know. Regards, Joe
 
LVL 52

Accepted Solution

by:Joe Winograd, EE MVE (earned 500 total points)
ID: 40059907
Hi sXmont1j6,
Since my last post, Nuance has released a new product called Power PDF. It comes in two versions — Standard and Advanced. There's a 30-day free trial of the Advanced version here:
http://www.nuance.com/for-business/document-imaging-and-scanning/power-pdf-converter-advanced/index.htm#resources

I recently installed it and have been experimenting with it — so far, I'm happy with it. It has a Watched Folder feature that you could use to OCR from the top folder into each subfolder. Here's what the Watched Folder dialog looks like:

[Screenshot: PPA WF include subs]

The list of supported file types for the source files is extensive:

[Screenshot: PPA supported source file types]

And here's the list of supported file types for the destination files:

[Screenshot: PPA supported destination file types]

Of course, for your purposes, you would select Searchable PDF, which performs OCR.

It also has a Batch Converter feature, but the GUI version of it doesn't support subfolders, so that is why I recommended the Watched Folder feature. However, the Batch Converter also comes in a command line version, which does support subfolders, so you may be interested in that.

This is just one more idea for the OCR piece of it, and then you can feed the OCRed files (Searchable PDFs) to dtSearch for indexing/searching, as I mentioned in a previous post. Speaking of dtSearch, here's a link to a recent thread here at EE where I provided some details about it:
http://www.experts-exchange.com/Software/Office_Productivity/Q_28418575.html#a40018425
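
To make the "index, then search" half of that workflow concrete, here is a minimal Python sketch of a word-to-file-path index. It is not dtSearch and does not read PDFs. It assumes, purely for illustration, that the text of each OCRed document has already been exported to a plain `.txt` file somewhere under the top folder. The names `build_index` and `search` are hypothetical.

```python
import os
import re
from collections import defaultdict


def build_index(top):
    """Map each lowercased word to the set of .txt file paths containing it."""
    index = defaultdict(set)
    for folder, _subdirs, files in os.walk(top):
        for name in files:
            if not name.lower().endswith(".txt"):
                continue  # assumption: text was exported to .txt beforehand
            path = os.path.join(folder, name)
            with open(path, encoding="utf-8", errors="ignore") as handle:
                for word in re.findall(r"\w+", handle.read().lower()):
                    index[word].add(path)
    return index


def search(index, *words):
    """Return the sorted paths of files that contain every given word."""
    hits = [index.get(word.lower(), set()) for word in words]
    return sorted(set.intersection(*hits)) if hits else []
```

Building the index once and then calling `search(index, "invoice", "overdue")` returns the exact file paths of the matching documents, which is the result the original question asks for. A dedicated product adds much more on top of this (phrase, proximity, and fuzzy search, plus native file-format support).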

Regards, Joe
 

Author Comment

by:sXmont1j6
ID: 40067773
I tested this with a dummy set of data and it worked properly. I truly thank you for your help.
 
LVL 52

Expert Comment

by:Joe Winograd, EE MVE
ID: 40067812
You're very welcome. I'm really glad that it worked for you. Regards, Joe