Indexing Large Sets of Word Documents for Searches

Hello,

I have been working to find a solution for a client that currently uses some very old legal software to maintain some of its older records. The program stores its information in a WordPerfect format, which can be converted to Microsoft Word format with some products I have come across. The plan is to convert these files into Word documents and index them so that they can be easily searched for specific information.

The problem arises when trying to use Windows Search to find entries in such a large group of text files. Searches often take a very long time and do not return reliable hits (sometimes Search cannot find data that I know is there).

Does anyone know of any indexing services that can tackle this task, while returning fast and reliable search results? Any help would be appreciated.

Thanks
Asked by: productivetech
 
Joe Winograd, Fellow & MVE, Developer, commented:
Hi productivetech,

For indexing and searching files that already have text in them (such as Word documents, PDF Normal files, PDF Searchable Image files, etc.), I strongly recommend dtSearch:
http://www.dtsearch.com/

I have been using it for around 20 years. It is an extraordinarily good piece of software!

When it indexes documents that are mixed binary and text files (such as Word files and PDF Searchable Image files that have been created by scanning and OCR), it has an option to filter out the binary. This makes the index much smaller than other products which also index the binary code (for no good reason). dtSearch has an interesting filtering algorithm that scans a binary file for anything that looks like text using multiple encoding detection methods. The algorithm detects sequences of text with different encodings or formats, and ignores the binary. This is perfect for Word documents and PDF Searchable Image files created by OCR.
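To make the idea of pulling text out of a mixed binary file concrete, here is a minimal Python sketch. It is not dtSearch's actual algorithm, which I have not seen; it is only an illustration of the general technique, assuming two encodings (plain ASCII and UTF-16LE) and a hypothetical helper name, extract_text_runs.

```python
import re


def extract_text_runs(data: bytes, min_len: int = 4) -> list[str]:
    """Return text-like runs found in a binary blob, in file order.

    Illustration only: looks for runs of printable ASCII stored either
    as single bytes or as UTF-16LE (each character followed by a zero
    byte). Everything else is treated as binary and ignored.
    """
    # At least `min_len` consecutive printable single-byte characters.
    ascii_pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    # At least `min_len` printable characters, each followed by 0x00.
    utf16_pattern = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % min_len)

    runs = []
    for match in ascii_pattern.finditer(data):
        runs.append((match.start(), match.group().decode("ascii")))
    for match in utf16_pattern.finditer(data):
        runs.append((match.start(), match.group().decode("utf-16-le")))

    runs.sort()  # order by position in the file
    return [text for _, text in runs]


# A made-up blob: binary noise, an ASCII run, and a UTF-16LE run.
sample = (
    b"\x00\x01Hello world\xff\xfe"
    + "Contract".encode("utf-16-le")
    + b"\x00\x02ab\x03"
)
print(extract_text_runs(sample))
```

Only the recovered runs would be passed to the indexer, which is why an index built this way can be much smaller than one that also indexes the binary.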

It has built-in viewers for most common file types, including, of course, Word (both DOC and DOCX) and PDF, but can also launch an external program automatically when the hit is on a file type for which it doesn't have a viewer (in case your client decides to use it on files other than Word docs). You can control whether or not the external viewer is launched on a case-by-case basis, that is, you can have different actions for each and every file type.

It has special handling for PDF files, allowing you to view the PDF file either in place (in dtSearch) or in a separate instance of Adobe Reader; in both cases, hits are highlighted. Also, to improve performance, there's an option that tells dtSearch to open Adobe Reader automatically for PDF files. The point is that Adobe Reader runs embedded in dtSearch, and it opens PDF files much more quickly if Adobe Reader is already running separately when a PDF is opened in dtSearch. I realize this project is strictly about Word docs, but law firms have plenty of PDF files, and I'd be surprised if your client doesn't eventually use dtSearch on those.

It has extensive search options, including exact phrase, stemming, phonic, fuzzy, synonym, any words, all words, and Boolean. Here's the search request dialog:
[Image: dtSearch search request]

dtSearch has a strong presence in the legal profession, as you can see here:
http://www.dtsearch.com/CS_legal.html
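
For anyone wondering why an indexed search is so much faster than scanning every file, and what options such as "all words", "any words", and "fuzzy" mean in practice, here is a small Python sketch of an inverted index. It is not how dtSearch (or any other product mentioned here) is implemented; the function names and the sample documents are made up for illustration, and the fuzzy match simply uses the standard library's difflib.

```python
import re
from collections import defaultdict
from difflib import get_close_matches


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each word to the set of document names that contain it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for word in tokenize(text):
            index[word].add(name)
    return dict(index)


def search_all(index: dict[str, set[str]], query: str) -> set[str]:
    """'All words' (Boolean AND): documents containing every query word."""
    words = tokenize(query)
    if not words:
        return set()
    return set.intersection(*(index.get(w, set()) for w in words))


def search_any(index: dict[str, set[str]], query: str) -> set[str]:
    """'Any words' (Boolean OR): documents containing at least one word."""
    return set().union(*(index.get(w, set()) for w in tokenize(query)))


def search_fuzzy(index: dict[str, set[str]], word: str,
                 cutoff: float = 0.8) -> set[str]:
    """Fuzzy: documents containing a word that is close to `word`."""
    matches = get_close_matches(word.lower(), list(index), n=5, cutoff=cutoff)
    return set().union(*(index[m] for m in matches))


# Hypothetical document texts, as if already extracted from Word files.
docs = {
    "lease.doc": "Commercial lease agreement for the tenant",
    "will.doc": "Last will and testament",
    "memo.doc": "Memo regarding the lease renewal",
}
index = build_index(docs)

print(search_all(index, "lease tenant"))
print(search_any(index, "will memo"))
print(search_fuzzy(index, "testamant"))  # misspelled on purpose
```

The index is built once, so each search becomes a few dictionary lookups instead of a pass over every file. A real product adds much more (stemming, phrase positions, incremental updates), but the basic trade of indexing time for search speed is the same.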

It is not an inexpensive product, but it is worth every penny. You are getting what you pay for! It is an excellent search tool.

As a disclaimer, I want to emphasize that I have no affiliation with this company and no financial interest in it whatsoever. I am simply a happy user/customer. Regards, Joe
 
Dan Craciun, IT Consultant, commented:
If you don't mind the fact that it has been discontinued, Google Desktop Search would do a good job in your case.

http://google-desktop.en.softonic.com/

HTH,
Dan
 
Dan Craciun, IT Consultant, commented:
Another free option appears to be DocFetcher, especially since you can use the portable version (I'm a big fan of self-contained programs that can run from anywhere and don't litter the OS).

http://docfetcher.sourceforge.net/en/index.html
 
Paul Sauvé, Retired, commented:
This product has also been around for some time: Copernic Desktop Search Lite Free.

A paid version is available for US$49.95.
 
Joe Winograd, Fellow & MVE, Developer, commented:
I agree with Paul – Copernic is well-regarded. Another good search product is X1:
http://www.x1.com

But if you can afford it, I think dtSearch is the best choice. Just my opinion, of course. Regards, Joe
 
productivetech, Author, commented:
Thanks everyone for the great advice. dtSearch looks awesome, but I might not need all the features it provides.
 
Joe Winograd, Fellow & MVE, Developer, commented:
> might not need all the features it provides.

Good point! I suggest asking your client how much they're willing to spend on the project and how important this search function is for them. My experience with law firms is that a powerful search capability is critically important because of the number of documents they deal with (typically Word and PDF), and that they're willing to pay for a top-quality search product. Of course, that may not be the case with this particular firm (or this particular project). Regards, Joe
Question has a verified solution.