Solved

Need to be able to OCR and then text search a 1TB directory containing a variety of documents: TIFFs, PDFs, WPD, DOC...

Posted on 2014-03-26
Last Modified: 2014-05-15
Hello,

As I stated in the title, I have a large directory with many sub-folders and files of different types, mostly TIFF, WPD, DOC, and PDF. I need a way to OCR that directory and then text search it for specific words, so I can find the matching documents and their exact file paths. I am sure there is software out there, but I am lost with this.
Question by:sXmont1j6
9 Comments
 
LVL 51

Expert Comment

by:Joe Winograd, EE MVE
ID: 39956871
I recommend a top quality OCR package and a top quality search product. You haven't mentioned budget, so I'm going to assume that you're willing to spend money to get really good software. But if you have a limited budget for the project, let me know and I can make some other recommendations.

For OCR, two very well regarded programs are Nuance OmniPage and ABBYY FineReader. Here are links to more information:
http://nuance.com/for-individuals/by-product/omnipage/index.htm
http://finereader.abbyy.com/

Here are links to feature comparison charts:
http://nuance.com/ucmprod/groups/imaging/@web-enus/documents/collateral/nc_016052.pdf
http://finereader.abbyy.com/editions_comparison_chart/

I use both and can say that both are very accurate, but I can't say that one is always better than the other. I've tested them on the same documents, and sometimes one is better, sometimes the other is, but for the most part, the accuracy is similar - both very good!

Another idea is Nuance's PaperPort product, which is not a dedicated OCR package, but can perform OCR via Nuance's OmniPage, which is included "under the covers" (the OmniPage OCR engine is built into PaperPort):
http://nuance.com/for-individuals/by-product/paperport/index.htm

PaperPort is a robust scanning/imaging package that does a lot more than just OCR (but for pure OCR, is not as robust as OmniPage and FineReader). I use PaperPort extensively (more than OmniPage and FineReader combined). Its OCR capabilities (via the built-in OmniPage) may be adequate for your purposes. But if not, then go with OmniPage or FineReader. PaperPort has a built-in search tool (called All-in-One Search), but I do not use (or recommend) it.

For search, I recommend dtSearch:
http://www.dtsearch.com

This is an extraordinary search product that can index and search any file type that has text in it. You may be interested in an article I wrote called "dtSearch and PaperPort", available here:
https://sites.google.com/site/wikipaperport/articles-tutorials

As a disclaimer, I want to emphasize that I have no affiliation with these companies and no financial interest in them whatsoever. I am simply a happy user/customer. Regards, Joe
 

Author Comment

by:sXmont1j6
ID: 39957031
I am familiar with all of the products. I suppose my scenario is the following:

A shared directory with a ton of sub-directories. Will I be able to OCR from the top folder level, and will it OCR each file in each sub-directory?
 
LVL 51

Expert Comment

by:Joe Winograd, EE MVE
ID: 40018032
Hi sXmont1j6,
I'm sorry for taking so long to reply to your post. I normally get an email from EE every time there's a post on threads, but I either didn't get it this time or I missed it. In any case, the answer to your question for OmniPage is YES. You will be able to start the OCR process at a folder and tell it to include all subfolders. Regards, Joe
 

Author Comment

by:sXmont1j6
ID: 40018124
I presume it handles many different file formats as well? TIFF, DOC, WPD, PDF, etc.
 
LVL 51

Expert Comment

by:Joe Winograd, EE MVE
ID: 40018186
Yes, all of those and more. Of course, it doesn't OCR file types that already have text, such as DOC and WPD, but it can save to those. Attached is a document created from the OmniPage Ultimate (aka version 19) help file that details its supported file types. Regards, Joe
OP19-supported-file-types.pdf
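
For anyone who wants to see how a large tree splits between files that need OCR and files that already contain text, here is a minimal Python sketch. It is not part of OmniPage or any product mentioned in this thread. The extension groups and the function name `classify_tree` are assumptions for illustration only, and a PDF listed under "needs_ocr" may in fact already be searchable.

```python
import os
from collections import defaultdict

# Assumption: illustrative extension groups, not any product's supported-types list.
IMAGE_EXTS = {".tif", ".tiff", ".pdf"}  # may need OCR (a PDF can already hold text)
TEXT_EXTS = {".doc", ".wpd"}            # already contain text, no OCR needed


def classify_tree(root):
    """Walk root and every subfolder, grouping full file paths by category."""
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            ext = os.path.splitext(name)[1].lower()
            path = os.path.join(dirpath, name)
            if ext in IMAGE_EXTS:
                groups["needs_ocr"].append(path)
            elif ext in TEXT_EXTS:
                groups["has_text"].append(path)
            else:
                groups["other"].append(path)
    return groups
```

Calling `classify_tree` on the top-level shared folder and checking the length of each group gives a rough count of how many files the OCR step would have to process.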
 
LVL 51

Expert Comment

by:Joe Winograd, EE MVE
ID: 40040700
Hi sXmont1j6,
Any update on this? I'm happy to help in any way. Just let me know. Regards, Joe
 
LVL 51

Accepted Solution

by:
Joe Winograd, EE MVE earned 500 total points
ID: 40059907
Hi sXmont1j6,
Since my last post, Nuance has released a new product called Power PDF. It comes in two versions — Standard and Advanced. There's a 30-day free trial of the Advanced version here:
http://www.nuance.com/for-business/document-imaging-and-scanning/power-pdf-converter-advanced/index.htm#resources

I recently installed it and have been experimenting with it — so far, I'm happy with it. It has a Watched Folder feature that you could use to OCR from the top folder into each subfolder. Here's what the Watched Folder dialog looks like:

[Image: PPA WF include subs]

The list of supported file types for the source files is extensive:

[Image: PPA supported source file types]

And here's the list of supported file types for the destination files:

[Image: PPA supported destination file types]

Of course, for your purposes, you would select Searchable PDF, which performs OCR.

It also has a Batch Converter feature, but the GUI version of it doesn't support subfolders, so that is why I recommended the Watched Folder feature. However, the Batch Converter also comes in a command line version, which does support subfolders, so you may be interested in that.

This is just one more idea for the OCR piece of it, and then you can feed the OCRed files (Searchable PDFs) to dtSearch for indexing/searching, as I mentioned in a previous post. Speaking of dtSearch, here's a link to a recent thread here at EE where I provided some details about it:
http://www.experts-exchange.com/Software/Office_Productivity/Q_28418575.html#a40018425

Regards, Joe
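
To illustrate the "index, then search" idea in general terms, here is a rough Python sketch of a tiny inverted index that reports the exact file paths containing given words. It is not how dtSearch works internally and does not use its API. It assumes, hypothetically, that the text of each OCRed document has already been exported to a `.txt` file somewhere under the root folder; the names `build_index` and `search` are made up for this example.

```python
import os
import re
from collections import defaultdict

WORD_RE = re.compile(r"[A-Za-z0-9]+")


def build_index(root):
    """Map each lowercased word to the set of .txt file paths containing it."""
    index = defaultdict(set)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Assumption: text was already extracted to .txt files by an earlier step.
            if not name.lower().endswith(".txt"):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="ignore") as fh:
                for line in fh:
                    for word in WORD_RE.findall(line):
                        index[word.lower()].add(path)
    return index


def search(index, *words):
    """Return sorted paths of files that contain every one of the given words."""
    sets = [index.get(w.lower(), set()) for w in words]
    return sorted(set.intersection(*sets)) if sets else []
```

Building the index once and then calling `search(index, "contract", "invoice")` returns the full paths of files that contain both words. For a 1TB collection a dedicated indexing product is the realistic choice; this only shows the shape of the approach.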
 

Author Comment

by:sXmont1j6
ID: 40067773
I tested this with a dummy set of data and it worked properly. I truly thank you for your help.
 
LVL 51

Expert Comment

by:Joe Winograd, EE MVE
ID: 40067812
You're very welcome. I'm really glad that it worked for you. Regards, Joe