Solved

Need to be able to OCR and then Text search a 1TB directory containing a variety of documents, tiffs, PDFs, WPD, DOC...

Posted on 2014-03-26
Last Modified: 2014-05-15
Hello,

As I stated in the title, I have a large directory with many sub-folders and files of different types, mostly TIFF, WPD, DOC, and PDF. I need a way to OCR that directory and then text-search it for specific words, so that I can find the matching documents and their exact file paths. I am sure there is software out there for this, but I am lost.
Question by:sXmont1j6
9 Comments
 
LVL 53

Expert Comment

by:Joe Winograd, EE MVE
ID: 39956871
I recommend a top quality OCR package and a top quality search product. You haven't mentioned budget, so I'm going to assume that you're willing to spend money to get really good software. But if you have a limited budget for the project, let me know and I can make some other recommendations.

For OCR, two very well regarded programs are Nuance OmniPage and ABBYY FineReader. Here are links to more information:
http://nuance.com/for-individuals/by-product/omnipage/index.htm
http://finereader.abbyy.com/

Here are links to feature comparison charts:
http://nuance.com/ucmprod/groups/imaging/@web-enus/documents/collateral/nc_016052.pdf
http://finereader.abbyy.com/editions_comparison_chart/

I use both and can say that both are very accurate, but I can't say that one is always better than the other. I've tested them on the same documents, and sometimes one is better, sometimes the other is, but for the most part, the accuracy is similar - both very good!

Another idea is Nuance's PaperPort product, which is not a dedicated OCR package, but can perform OCR via Nuance's OmniPage, which is included "under the covers" (the OmniPage OCR engine is built into PaperPort):
http://nuance.com/for-individuals/by-product/paperport/index.htm

PaperPort is a robust scanning/imaging package that does a lot more than just OCR (but for pure OCR, is not as robust as OmniPage and FineReader). I use PaperPort extensively (more than OmniPage and FineReader combined). Its OCR capabilities (via the built-in OmniPage) may be adequate for your purposes. But if not, then go with OmniPage or FineReader. PaperPort has a built-in search tool (called All-in-One Search), but I do not use (or recommend) it.

For search, I recommend dtSearch:
http://www.dtsearch.com

This is an extraordinary search product that can index and search any file type that has text in it. You may be interested in an article I wrote called "dtSearch and PaperPort", available here:
https://sites.google.com/site/wikipaperport/articles-tutorials

As a disclaimer, I want to emphasize that I have no affiliation with these companies and no financial interest in them whatsoever. I am simply a happy user/customer. Regards, Joe
 

Author Comment

by:sXmont1j6
ID: 39957031
I am familiar with all of those products. My scenario is the following:

A shared directory with a ton of sub-directories. Will I be able to start the OCR at the top folder level, and will it OCR each file in each sub-directory?
 
LVL 53

Expert Comment

by:Joe Winograd, EE MVE
ID: 40018032
Hi sXmont1j6,
I'm sorry for taking so long to reply to your post. I normally get an email from EE every time there's a post on threads, but I either didn't get it this time or I missed it. In any case, the answer to your question for OmniPage is YES. You will be able to start the OCR process at a folder and tell it to include all subfolders. Regards, Joe
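To picture what "start at a folder and include all subfolders" means, here is a minimal Python sketch. It is not how OmniPage works internally; it only lists the files such a run would need to visit. The function name `find_documents` and the extension list are assumptions for illustration, taken from the file types mentioned in the question.

```python
import os

# Extensions taken from the question; adjust for your own data (assumption).
TARGET_EXTENSIONS = {".tif", ".tiff", ".pdf", ".wpd", ".doc"}


def find_documents(top_folder):
    """Yield the full path of every matching file under top_folder,
    descending into all subfolders."""
    for folder, _subfolders, filenames in os.walk(top_folder):
        for name in filenames:
            # Compare extensions case-insensitively (SCAN.TIF vs scan.tif).
            if os.path.splitext(name)[1].lower() in TARGET_EXTENSIONS:
                yield os.path.join(folder, name)
```

Running `for path in find_documents(top): print(path)` on a test copy of the share is a quick way to see how many files an OCR pass would have to handle before committing to a full run.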

 

Author Comment

by:sXmont1j6
ID: 40018124
I presume it handles many different file formats as well? TIFF, DOC, WPD, PDF, etc.
 
LVL 53

Expert Comment

by:Joe Winograd, EE MVE
ID: 40018186
Yes, all of those and more. Of course, it doesn't OCR file types that already have text, such as DOC and WPD, but it can save to those. Attached is a document created from the OmniPage Ultimate (aka version 19) help file that details its supported file types. Regards, Joe
OP19-supported-file-types.pdf
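The distinction between files that need OCR and files that already have text can be sketched as a simple sort by extension. This is only an illustration under my own assumptions: the function name `classify` and the category labels are hypothetical, and a PDF cannot be judged by its name alone, since it may be image-only or may already contain text.

```python
import os

IMAGE_ONLY = {".tif", ".tiff"}    # scanned images: no text until OCRed
ALREADY_TEXT = {".doc", ".wpd"}   # word-processing files: text is already there


def classify(path):
    """Return a rough category for a file, based only on its extension."""
    ext = os.path.splitext(path)[1].lower()
    if ext in IMAGE_ONLY:
        return "ocr"
    if ext in ALREADY_TEXT:
        return "index-directly"
    if ext == ".pdf":
        # May be image-only or already searchable; needs a look inside.
        return "check"
    return "other"
```

On a 1TB share, sorting the files this way first can keep the OCR run limited to the files that actually need it.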
 
LVL 53

Expert Comment

by:Joe Winograd, EE MVE
ID: 40040700
Hi sXmont1j6,
Any update on this? I'm happy to help in any way. Just let me know. Regards, Joe
 
LVL 53

Accepted Solution

by:Joe Winograd, EE MVE
ID: 40059907
Hi sXmont1j6,
Since my last post, Nuance has released a new product called Power PDF. It comes in two versions — Standard and Advanced. There's a 30-day free trial of the Advanced version here:
http://www.nuance.com/for-business/document-imaging-and-scanning/power-pdf-converter-advanced/index.htm#resources

I recently installed it and have been experimenting with it — so far, I'm happy with it. It has a Watched Folder feature that you could use to OCR from the top folder into each subfolder. Here's what the Watched Folder dialog looks like:

[Image: PPA WF include subs]

The list of supported file types for the source files is extensive:

[Image: PPA supported source file types]

And here's the list of supported file types for the destination files:

[Image: PPA supported destination file types]

Of course, for your purposes, you would select Searchable PDF, which performs OCR.

It also has a Batch Converter feature, but the GUI version of it doesn't support subfolders, so that is why I recommended the Watched Folder feature. However, the Batch Converter also comes in a command line version, which does support subfolders, so you may be interested in that.

This is just one more idea for the OCR piece of it, and then you can feed the OCRed files (Searchable PDFs) to dtSearch for indexing/searching, as I mentioned in a previous post. Speaking of dtSearch, here's a link to a recent thread here at EE where I provided some details about it:
http://www.experts-exchange.com/Software/Office_Productivity/Q_28418575.html#a40018425
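For anyone wondering what "indexing" buys you in the search step, here is a minimal sketch of the general idea (an inverted index that maps each word to the files containing it). This is not dtSearch's API or implementation. The function names `build_index` and `search` and the sample paths are hypothetical, and it assumes the text has already been extracted from each file.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Build a word -> set-of-paths index.

    documents: iterable of (file_path, extracted_text) pairs.
    """
    index = defaultdict(set)
    for path, text in documents:
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(path)
    return index


def search(index, *words):
    """Return the sorted paths of files that contain every given word."""
    hits = [index.get(word.lower(), set()) for word in words]
    if not hits:
        return []
    return sorted(set.intersection(*hits))


# Hypothetical sample data, standing in for text pulled from OCRed files.
docs = [
    ("shared/legal/lease-2009.pdf", "Lease agreement for the warehouse"),
    ("shared/hr/memo.doc", "Memo about the warehouse move"),
]
index = build_index(docs)
print(search(index, "warehouse", "lease"))  # ['shared/legal/lease-2009.pdf']
```

The index is built once, after which each search is a lookup rather than a rescan of the whole directory, which is what makes searching 1TB of documents practical.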

Regards, Joe
 

Author Comment

by:sXmont1j6
ID: 40067773
I tested this with a dummy set of data and this worked properly.   I truly thank you for your help.
 
LVL 53

Expert Comment

by:Joe Winograd, EE MVE
ID: 40067812
You're very welcome. I'm really glad that it worked for you. Regards, Joe
