
Solved

Using Maestro for OCR, now looking for a good way to search the text in those PDFs.

Posted on 2014-07-15
Last Modified: 2014-12-03
Hello,

I am using Maestro to OCR a large number of files. We are in the process of determining the best solution for searching those files for specific keywords. In essence, we would like to "data mine" those files for different sets of keywords/phrases and output the results. We are not looking for a free product. We are looking at this from a higher level, both for old data and going forward. Does anyone have suggestions for which search software to look into?
Question by:sXmont1j6
Accepted Solution

by: Joe Winograd, EE MVE 2015&2016 earned 1500 total points
ID: 40197670
For indexing and searching files that already have text in them (such as after Maestro OCRs them), I strongly recommend dtSearch:
http://www.dtsearch.com/

I have been using it for around 20 years...extraordinarily good piece of software!

When it indexes documents that are mixed binary and text files (such as a PDF Searchable Image file that has been created by scanning and OCR), it has an option to filter out the binary. This makes the index much smaller than other products which also index the binary code (for no good reason). dtSearch has an interesting filtering algorithm that scans a binary file for anything that looks like text using multiple encoding detection methods. The algorithm detects sequences of text with different encodings or formats, and ignores the binary. This is perfect for PDF Searchable Image files created by OCR.
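To make the idea of filtering text out of a mixed binary file concrete, here is a minimal sketch in Python. It is not dtSearch's algorithm, which is proprietary and surely more sophisticated; it only assumes that "text" means runs of printable characters in either ASCII or UTF-16LE, and the function name `extract_text_runs` and the `min_len` threshold are my own choices for illustration.

```python
import re

def extract_text_runs(data: bytes, min_len: int = 4) -> list[str]:
    """Return printable runs found in binary data (illustrative only).

    Two passes are made, one per assumed encoding:
    plain ASCII, then UTF-16LE (printable byte followed by a zero byte).
    Runs shorter than min_len characters are treated as binary noise.
    """
    ascii_pat = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    utf16_pat = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % min_len)

    runs = [m.group().decode("ascii") for m in ascii_pat.finditer(data)]
    runs += [m.group().decode("utf-16-le") for m in utf16_pat.finditer(data)]
    return runs

# Hypothetical sample: binary junk around one ASCII and one UTF-16LE string.
sample = (b"\x00\x01Hello world\xff\xfe\x02ab\x03"
          + "invoice".encode("utf-16-le") + b"\x90")
print(extract_text_runs(sample))
```

Only the extracted runs would go into an index, which is why skipping the binary keeps the index small.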

It has built-in viewers for most common file types (PDF, of course...see below), but can also launch an external program automatically when the hit is on a file type for which it doesn't have a viewer. You can control whether or not the external viewer is launched on a case-by-case basis, that is, you can have different actions for each and every file type.

It has special handling for PDF files, allowing you to view a PDF either in place (inside dtSearch) or in a separate instance of Adobe Reader; in both cases, hits are highlighted. Also, to improve performance, there's an option that lets you tell dtSearch to automatically open Adobe Reader for PDF files. The point is that Adobe Reader runs embedded in dtSearch, and it opens PDF files much more quickly if Adobe Reader is already running separately when a PDF is opened in dtSearch.

It has extensive search options, including stemming, phonic, fuzzy, synonym, any words, all words, Boolean, and, of course, exact/specific phrases. Here's the search request dialog:

[Screenshot: dtSearch Search Request dialog]

It utilizes the Windows Task Scheduler to update indexes. I currently have 44 indexes set up and I have it configured to update (a subset of) them every day in the wee hours. Of course, you may set it up to update the indexes as frequently/infrequently as you want (and you may specify which ones get updated – if some data is static, there's no need to update its index). You may have any number of indexes, each of which may index any number of folders/files, and searches may take place on one or more of the indexes. I often build an index on the fly for a folder/subfolders that I want to search – indexing is very fast (as is searching).
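For anyone wondering why searching an index is so much faster than scanning every file, here is a rough sketch of the general inverted-index idea in Python. It is not how dtSearch is implemented; the function names and the sample file names are hypothetical, and the text is assumed to have been extracted from the OCRed PDFs already. It covers only the simplest "all words" and "any words" searches.

```python
import re
from collections import defaultdict

def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each lowercase word to the set of document names containing it."""
    index: dict[str, set[str]] = defaultdict(set)
    for name, text in docs.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(name)
    return index

def search_all(index: dict[str, set[str]], words: list[str]) -> set[str]:
    """Documents containing every one of the given words."""
    sets = [index.get(w.lower(), set()) for w in words]
    return set.intersection(*sets) if sets else set()

def search_any(index: dict[str, set[str]], words: list[str]) -> set[str]:
    """Documents containing at least one of the given words."""
    return set().union(*(index.get(w.lower(), set()) for w in words))

# Hypothetical extracted text from three OCRed files.
docs = {
    "a.pdf": "Invoice for pump repair",
    "b.pdf": "Pump manual",
    "c.pdf": "Meeting notes",
}
index = build_index(docs)
print(sorted(search_all(index, ["pump", "invoice"])))
print(sorted(search_any(index, ["pump", "notes"])))
```

The index is built once (and refreshed on a schedule), so each search is a few set lookups rather than a pass over every file.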

The capabilities go on and on, but at $199, it is not an inexpensive product, so I'm glad to hear you say that you're not looking for a freebie. One comment about the high initial cost, though: a positive point is dtSearch's approach to technical support and product updates. Their store page says, "Technical support and product updates are free for a minimum of one year with all purchases." The "minimum of one year" wording is vague, and no fee is mentioned. Also, the dtSearch Desktop/Network Upgrades page says it is a "free upgrade", but it's not clear whether these upgrades are free forever. So I wrote to dtSearch asking for a clarification of the policy, and here's what they wrote back (with permission to share the answer publicly):

----- Begin dtSearch response -----

I appreciate your email, and sorry for the confusion!

Our setup licenses provide for a minimum of one year of support and upgrades on all licenses. That said, we have provided support and upgrades at no charge since Year 2000 for all end-user Desktop / Network licenses (!). Because of the higher average cost of developer support, we have been charging annually for developer (Web / Engine / Publish) upgrades and support, but again not Desktop / Network upgrades and support.

I can't always guarantee that this will be the case until the end of time, but that's why you don't find any "upgrade charge" indicators for Desktop / Network on our site currently.

----- End dtSearch response -----

Amortized over a large number of years for technical support and software upgrades/updates, the $199 license fee becomes much more reasonable. dtSearch was careful to say in the response that they "can't always guarantee" no upgrade charge, but I have been using dtSearch for around 20 years, have received technical support and product upgrades on a continuous basis (am currently running the latest release), and have never paid anything beyond the initial license fee. So it's a pretty good bet, if not a guarantee.

As a disclaimer, I want to emphasize that I have no affiliation with dtSearch and no financial interest in it whatsoever. I am simply a happy user/customer. Regards, Joe
Expert Comment

by:Joe Winograd, EE MVE 2015&2016
ID: 40411886
Hi sXmont1j6,
I'm trying to clean up some open questions and noticed that I haven't heard from you in more than three months on this one. Please let me know where things stand. Have you tried dtSearch? If so, did it work well for you? If not, what issues did you have with it? Thanks, Joe