Solved

Indexing Large Sets of Word Documents for Searches

Posted on 2013-12-09
Last Modified: 2013-12-24
Hello,

I have been working to find a solution for a client that currently uses some very old legal software to maintain some of its older records. The program stores its information in WordPerfect format, which can be converted to Microsoft Word format with some products I have come across. The plan is to convert these files into Word documents and index them so that they can be easily searched for specific information.

The problem arises when trying to use Windows search to find entries in such a large group of text files. Searches often take a very long time and do not return reliable hits (sometimes Search cannot find data that I know is there).

Does anyone know of any indexing services that can tackle this task, while returning fast and reliable search results? Any help would be appreciated.

Thanks
Question by:productivetech
7 Comments
 
LVL 35

Expert Comment

by:Dan Craciun
ID: 39707771
If you don't mind the fact that it has been discontinued, Google Desktop Search would do a good job in your case.

http://google-desktop.en.softonic.com/

HTH,
Dan
 
LVL 35

Expert Comment

by:Dan Craciun
ID: 39707777
Another free option appears to be DocFetcher, especially since you can use the portable version (I'm a big fan of self-contained programs that can run from anywhere and don't litter the OS).

http://docfetcher.sourceforge.net/en/index.html
 
LVL 54

Accepted Solution

by:
Joe Winograd, EE MVE 2015&2016 earned 250 total points
ID: 39707787
Hi productivetech,

For indexing and searching files that already have text in them (such as Word documents, PDF Normal files, PDF Searchable Image files, etc.), I strongly recommend dtSearch:
http://www.dtsearch.com/

I have been using it for around 20 years...extraordinarily good piece of software!

When it indexes documents that are mixed binary and text files (such as Word files and PDF Searchable Image files that have been created by scanning and OCR), it has an option to filter out the binary. This makes the index much smaller than other products which also index the binary code (for no good reason). dtSearch has an interesting filtering algorithm that scans a binary file for anything that looks like text using multiple encoding detection methods. The algorithm detects sequences of text with different encodings or formats, and ignores the binary. This is perfect for Word documents and PDF Searchable Image files created by OCR.
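
To illustrate the general idea of pulling text out of a mixed binary file, here is a minimal Python sketch. It is not dtSearch's algorithm, which is not public in this thread; it is a simplified stand-in that assumes only two encodings (plain ASCII and UTF-16LE) and a hypothetical minimum run length of 4 characters. The function name `extract_text_runs` is made up for this example.

```python
import re
from typing import List


def extract_text_runs(data: bytes, min_len: int = 4) -> List[str]:
    """Return runs of printable text found in raw bytes, skipping the binary.

    Assumption: text is either single-byte ASCII or UTF-16LE with ASCII-range
    characters. Real products detect many more encodings and formats.
    """
    # Runs of at least min_len printable ASCII bytes (space through tilde).
    ascii_pat = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    # Runs of at least min_len printable characters, each followed by a NUL
    # byte, which is how ASCII-range text looks when stored as UTF-16LE.
    utf16_pat = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % min_len)

    runs = [m.group().decode("ascii") for m in ascii_pat.finditer(data)]
    runs += [m.group().decode("utf-16-le") for m in utf16_pat.finditer(data)]
    return runs


# Example: binary junk wrapped around one ASCII run and one UTF-16LE run.
sample = (
    b"\x00\x01Hello world\xff\xfe"
    + "Smith v. Jones".encode("utf-16-le")
    + b"\x00\x02ab\x03"
)
print(extract_text_runs(sample))
```

Only the recovered text runs would then be fed to the indexer, which is why the resulting index can stay small compared with one built over the raw bytes.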

It has built-in viewers for most common file types, including, of course, Word (both DOC and DOCX) and PDF, but can also launch an external program automatically when the hit is on a file type for which it doesn't have a viewer (in case your client decides to use it on files other than Word docs). You can control whether or not the external viewer is launched on a case-by-case basis, that is, you can have different actions for each and every file type.

It has special handling for PDF files, allowing you to view a PDF either in place (in dtSearch) or in a separate instance of Adobe Reader; in both cases, hits are highlighted. Also, to improve performance, there's an option that tells dtSearch to open Adobe Reader automatically for PDF files. The point is that Adobe Reader runs embedded in dtSearch, and it opens PDF files much more quickly if Adobe Reader is already running separately when a PDF is opened in dtSearch. I realize this project is strictly about Word docs, but law firms have plenty of PDF files, and I'd be surprised if your client doesn't eventually use dtSearch on those.

It has extensive search options, including exact phrase, stemming, phonic, fuzzy, synonym, any words, all words, and Boolean. Here's the search request dialog:
[Image: dtSearch search request]

dtSearch has a strong presence in the legal profession, as you can see here:
http://www.dtsearch.com/CS_legal.html
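
For readers unfamiliar with how an index makes "all words", "any words", and fuzzy searches fast, here is a rough Python sketch of an inverted index. It is only an illustration of the concept, not how dtSearch or any product mentioned in this thread is implemented. The class name `TinyIndex`, the file names, and the 0.8 similarity cutoff are all assumptions made up for the example; fuzzy matching uses the standard library's `difflib`.

```python
import difflib
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyIndex:
    """Toy inverted index: maps each word to the documents containing it."""

    def __init__(self):
        self.postings = defaultdict(set)  # word -> set of document names

    def add(self, name, text):
        for word in tokenize(text):
            self.postings[word].add(name)

    def all_words(self, query):
        """Documents containing every word in the query (Boolean AND)."""
        sets = [self.postings.get(w, set()) for w in tokenize(query)]
        return set.intersection(*sets) if sets else set()

    def any_words(self, query):
        """Documents containing at least one word in the query (Boolean OR)."""
        sets = [self.postings.get(w, set()) for w in tokenize(query)]
        return set().union(*sets)

    def fuzzy(self, word, cutoff=0.8):
        """Documents containing a word that is spelled similarly to `word`."""
        close = difflib.get_close_matches(
            word.lower(), list(self.postings), n=5, cutoff=cutoff
        )
        return set().union(*(self.postings[w] for w in close))


# Hypothetical documents standing in for the converted Word files.
index = TinyIndex()
index.add("lease.doc", "Commercial lease agreement for Smith")
index.add("will.doc", "Last will and testament of Smith")
index.add("memo.doc", "Memo regarding the lease renewal")

print(index.all_words("smith lease"))   # only documents with both words
print(index.any_words("will memo"))     # documents with either word
print(index.fuzzy("testiment"))         # tolerates the misspelling
```

Because each query only looks up words in the index rather than rescanning every file, search time depends on the number of matching entries, not on the total size of the document set.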

It is not an inexpensive product, but it is worth every penny. You are getting what you pay for! It is an excellent search tool.

As a disclaimer, I want to emphasize that I have no affiliation with this company and no financial interest in it whatsoever. I am simply a happy user/customer. Regards, Joe
 
LVL 32

Assisted Solution

by:Paul Sauvé
Paul Sauvé earned 250 total points
ID: 39709027
This product has also been around for some time: Copernic Desktop Search Lite Free.

A paid version is available for US$49.95.
 
LVL 54

Expert Comment

by:Joe Winograd, EE MVE 2015&2016
ID: 39709114
I agree with Paul – Copernic is well-regarded. Another good search product is X1:
http://www.x1.com

But if you can afford it, I think dtSearch is the best choice. Just my opinion, of course. Regards, Joe
 

Author Comment

by:productivetech
ID: 39711199
Thanks everyone for the great advice. dtSearch looks awesome, but I might not need all the features it provides.
 
LVL 54

Expert Comment

by:Joe Winograd, EE MVE 2015&2016
ID: 39711679
> might not need all the features it provides.

Good point! I suggest asking your client how much they're willing to spend on the project and how important this search function is for them. My experience with law firms is that a powerful search capability is of critical importance due to the number of documents they deal with (typically Word and PDF), and that they're willing to pay for a top quality search product, but, of course, that may not be the case with this particular firm (or this particular project). Regards, Joe