Solved

Indexing Large Sets of Word Documents for Searches

Posted on 2013-12-09
Last Modified: 2013-12-24
Hello,

I have been working to find a solution for a client that is currently using some very old legal software to maintain some of their older records. The program stores its information in a WordPerfect format that can be converted to Microsoft Word format with some products I have come across. The plan is to convert these files into Word documents and index them so that they can be easily searched for specific information.

The problem arises when trying to use Windows Search to find entries in such a large group of files. Searches often take a very long time, and the results are unreliable (sometimes Search cannot find data that I know is there).

Does anyone know of any indexing services that can tackle this task, while returning fast and reliable search results? Any help would be appreciated.

Thanks
Question by:productivetech
7 Comments
 
LVL 34

Expert Comment

by:Dan Craciun
ID: 39707771
If you don't mind the fact that it has been discontinued, Google Desktop Search would do a good job in your case.

http://google-desktop.en.softonic.com/

HTH,
Dan
 
LVL 34

Expert Comment

by:Dan Craciun
ID: 39707777
Another free option appears to be DocFetcher, especially since you can use the portable version (I'm a big fan of self-contained programs that can run from anywhere and don't litter the OS).

http://docfetcher.sourceforge.net/en/index.html
 
LVL 51

Accepted Solution

by:Joe Winograd, EE MVE
Joe Winograd, EE MVE earned 250 total points
ID: 39707787
Hi productivetech,

For indexing and searching files that already have text in them (such as Word documents, PDF Normal files, PDF Searchable Image files, etc.), I strongly recommend dtSearch:
http://www.dtsearch.com/

I have been using it for around 20 years. It is an extraordinarily good piece of software!

When it indexes documents that are mixed binary and text files (such as Word files and PDF Searchable Image files that have been created by scanning and OCR), it has an option to filter out the binary. This makes the index much smaller than other products which also index the binary code (for no good reason). dtSearch has an interesting filtering algorithm that scans a binary file for anything that looks like text using multiple encoding detection methods. The algorithm detects sequences of text with different encodings or formats, and ignores the binary. This is perfect for Word documents and PDF Searchable Image files created by OCR.
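To make the idea of "scanning a binary file for anything that looks like text" concrete, here is a minimal Python sketch. It is not dtSearch's actual algorithm, which is not public as far as I know. It only assumes two encodings (plain ASCII and UTF-16LE, both of which appear in older Word files), and the function name `extract_text_runs` and the `min_len` threshold are made up for illustration.

```python
import re
from typing import List


def extract_text_runs(data: bytes, min_len: int = 4) -> List[str]:
    """Return runs of printable text found in a binary blob, in file order.

    Illustrative only: checks single-byte ASCII and UTF-16LE, and skips
    everything else as binary.
    """
    # At least min_len printable ASCII bytes in a row.
    ascii_pat = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    # At least min_len printable ASCII characters stored as UTF-16LE
    # (each character followed by a zero byte).
    utf16_pat = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % min_len)

    runs = [(m.start(), m.group().decode("ascii"))
            for m in ascii_pat.finditer(data)]
    runs += [(m.start(), m.group().decode("utf-16-le"))
             for m in utf16_pat.finditer(data)]

    # Sort by byte offset so the text comes out in document order.
    runs.sort(key=lambda item: item[0])
    return [text for _, text in runs]


if __name__ == "__main__":
    blob = (b"\x00\x01Hello world\xff\xfe"
            + "contract".encode("utf-16-le")
            + b"\x00\x02ab\x03")
    print(extract_text_runs(blob))
```

Only the extracted runs would then be passed to the indexer, which is why an index built this way can be much smaller than one that also covers the binary portions.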

It has built-in viewers for most common file types, including, of course, Word (both DOC and DOCX) and PDF, but can also launch an external program automatically when the hit is on a file type for which it doesn't have a viewer (in case your client decides to use it on files other than Word docs). You can control whether or not the external viewer is launched on a case-by-case basis, that is, you can have different actions for each and every file type.

It has special handling for PDF files, allowing you either to view the PDF file in place (in dtSearch) or in a separate instance of Adobe Reader (and in both cases, hits are highlighted). Also, to improve performance, there's an option that lets you tell dtSearch to automatically open Adobe Reader for PDF files (the point is that Adobe Reader runs embedded in dtSearch and it opens PDF files much more quickly if Adobe Reader is already running separately when a PDF is opened in dtSearch). I realize this project is strictly about Word docs, but law firms have plenty of PDF files, and I'd be surprised if your client doesn't eventually use dtSearch on those.

It has extensive search options, including exact phrase, stemming, phonic, fuzzy, synonym, any words, all words, and Boolean. Here's the search request dialog:
[Image: dtSearch search request dialog]

dtSearch has a strong presence in the legal profession, as you can see here:
http://www.dtsearch.com/CS_legal.html
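For anyone curious how search options like "all words", "any words", and "fuzzy" differ, here is a small Python sketch built on a toy inverted index. It is only an illustration of the concepts, not how dtSearch or any of the other products implement them. The function names and sample documents are hypothetical, and the fuzzy match relies on the standard library's `difflib.get_close_matches`.

```python
import re
from collections import defaultdict
from difflib import get_close_matches


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each word to the set of document names that contain it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for word in tokenize(text):
            index[word].add(name)
    return index


def search_all(index, query):
    """'All words': documents containing every query word (Boolean AND)."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


def search_any(index, query):
    """'Any words': documents containing at least one query word (Boolean OR)."""
    result = set()
    for word in tokenize(query):
        result |= index.get(word, set())
    return result


def search_fuzzy(index, word, cutoff=0.8):
    """'Fuzzy': documents containing a word similar to the (possibly misspelled) query word."""
    matches = get_close_matches(word.lower(), list(index.keys()), n=5, cutoff=cutoff)
    result = set()
    for match in matches:
        result |= index[match]
    return result


if __name__ == "__main__":
    docs = {
        "a.txt": "The lease agreement was signed",
        "b.txt": "Agreement on the merger terms",
        "c.txt": "Deposition transcript",
    }
    index = build_index(docs)
    print(search_all(index, "lease agreement"))
    print(search_any(index, "lease merger"))
    print(search_fuzzy(index, "agreemnt"))
```

The index is built once, so each search is a handful of set lookups instead of a scan through every file. That is the main reason a dedicated indexing tool returns hits faster than a search that has to open each document.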

It is not an inexpensive product, but it is worth every penny. You are getting what you pay for! It is an excellent search tool.

As a disclaimer, I want to emphasize that I have no affiliation with this company and no financial interest in it whatsoever. I am simply a happy user/customer. Regards, Joe

 
LVL 31

Assisted Solution

by:Paul Sauvé
Paul Sauvé earned 250 total points
ID: 39709027
This product has also been around for some time: Copernic Desktop Search Lite Free.

A paid version is available for $49.95 US.
 
LVL 51

Expert Comment

by:Joe Winograd, EE MVE
ID: 39709114
I agree with Paul – Copernic is well-regarded. Another good search product is X1:
http://www.x1.com

But if you can afford it, I think dtSearch is the best choice. Just my opinion, of course. Regards, Joe
 

Author Comment

by:productivetech
ID: 39711199
Thanks everyone for the great advice. dtSearch looks awesome, but I might not need all the features it provides.
 
LVL 51

Expert Comment

by:Joe Winograd, EE MVE
ID: 39711679
> might not need all the features it provides.

Good point! I suggest asking your client how much they're willing to spend on the project and how important this search function is for them. My experience with law firms is that a powerful search capability is of critical importance due to the number of documents they deal with (typically Word and PDF), and that they're willing to pay for a top quality search product, but, of course, that may not be the case with this particular firm (or this particular project). Regards, Joe