

C# or SQL: What is the best way to search/match a large number of keywords against large text/content?

Posted on 2015-02-09
Medium Priority
Last Modified: 2015-02-15

We have a requirement to search a huge number of keywords against text from large documents. The search also involves some fuzzy logic.
Fuzzy logic reference: http://www.codeguru.com/vb/gen/vb_database/microsoftaccess/article.php/c13137/Fuzzy-Matching-Demo-in-Access.htm

Currently we OCR the scanned documents to convert them to text, then perform the search by looping through the list of keywords one by one against the scanned text/words. We have 4 sets of fuzzy logic to go through.

The current keyword list is limited to 10-20 words, but in production the list can grow to 1,000 or more. We need the search to be as fast as possible. What would be the best approach: SQL Server Full-Text Search, or some search algorithm apart from fuzzy logic?

We thought about multi-threading, but had a bad experience with it in the past: we kept running into database deadlocks. Is there a more efficient approach or architecture?

Any suggestions or approaches are welcome, whether programming logic, application architecture, hardware setup, etc.

Question by:Nadarajan Rajappen

Assisted Solution

by:Phillip Burton
Phillip Burton earned 750 total points
ID: 40598283
Have you considered using SQL Server full text search? That could reduce the number of database deadlocks you may encounter.

Accepted Solution

SStory earned 750 total points
ID: 40598302
Just some thoughts: maybe scan the document and build a list of the unique words in it first, to avoid repeated lookups against the keyword list. (I'm not sure whether this would take longer than just doing the search for every word.)

As to algorithms in C#, what about b-trees? There are plenty of articles about b-trees online; a Google search will find them.

However, if the list is fixed and not added to at run time, perhaps a simple binary tree keyed alphabetically would work well.
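Since the keyword list is fixed once loaded, a balanced-tree lookup over an alphabetically keyed list can be approximated with a binary search over a sorted array. This is an illustrative Python sketch (the keyword list is invented; in C# the equivalent would be `Array.BinarySearch` or a `SortedSet<string>`):

```python
import bisect

# Fixed keyword list, sorted once at load time (illustrative sample data).
keywords = sorted(["apple", "banana", "invoice", "morris", "total"])

def is_keyword(word):
    """Binary search: each comparison halves the remaining candidates,
    so a 1,000-word list needs at most about 10 comparisons."""
    i = bisect.bisect_left(keywords, word)
    return i < len(keywords) and keywords[i] == word

print(is_keyword("apple"))   # True
print(is_keyword("orange"))  # False
```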

There are also articles on the built-in .NET collections worth reviewing.

The thing about a binary tree is that when you look at the current node, say "Morris", and you are searching for "Apple", you immediately know that "Apple" is on the left side of the tree, eliminating close to half of the search terms. That repeats at each node, so search time is very quick. This does require loading the tree so that it is balanced on both sides, or using a self-balancing binary tree. If I remember correctly, balancing is done by keeping an extra field on each node, set to -1, 0, or 1 depending on how unbalanced that subtree is, and rebalancing the tree when needed. However, if I understand correctly, you will load a fixed list of 1,000 words or so, so you should be able to load it in a balanced fashion from the start, since that part won't be changing.

Then I'd experiment. First, try taking every word not in your ignored-words list (which will probably contain "the", "a", "an", "of", etc.) and sending it to the search function for a result. Then try running through the entire document building a collection of unique words, and running through that unique-word list to see whether each word is found. Compare the times for both and decide which is better.
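The two strategies described above (look up every scanned word versus build the unique-word set first) can be sketched in a few lines; this Python sketch uses a made-up keyword list and document text for illustration:

```python
# Compare the two lookup strategies on a toy document.
ignored = {"the", "a", "an", "of"}
keywords = {"invoice", "total", "morris"}   # illustrative fixed keyword list

text = "the total of the invoice is the total the morris account of".split()

# Strategy 1: look up every word as it is scanned.
hits1 = {w for w in text if w not in ignored and w in keywords}

# Strategy 2: build the unique-word set first, then check each unique word once.
unique = set(text) - ignored
hits2 = {w for w in unique if w in keywords}

print(hits1 == hits2)  # True: both strategies find the same keywords
```

Timing each variant with real documents (e.g. `Stopwatch` in C#) would show which wins for your data.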

Personally, I'd try some of the built-in classes first and see if they yield tolerable results; if so, there is no need to reinvent the wheel. If you have multiple processors and can get each thread running on a different one, you could collect so many words to send to one thread and so many more to send to another, but I suspect the lookup will already be so fast that the threading overhead wouldn't be worth it.

I'd also try building an index of the unique words in one of those large documents to see how many there actually are. I'll bet it's fewer than you might think; people generally use a limited vocabulary.
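That vocabulary claim is easy to check; a quick Python sketch over a made-up snippet of text (a C# `HashSet<string>` would do the same job):

```python
# Count total vs. unique words to see how small the vocabulary really is.
text = ("the quick brown fox jumps over the lazy dog "
        "the dog barks and the fox runs over the lazy dog")
words = text.split()
unique = set(words)
print(len(words), len(unique))  # 20 words, only 11 unique
```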

Author Comment

by:Nadarajan Rajappen
ID: 40598340
Hi Phillip Burton,

Thanks for your comment.

I have not tried SQL Server Full-Text Search yet; I will definitely try it out, but I am not sure full-text search will fulfill the fuzzy-logic features, because what OCR reads won't be 100% accurate and the text will be mixed with junk characters. For example:
Original text: Nadarajan
OCR text: N2dra|an
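One common way to tolerate OCR junk characters like these is edit (Levenshtein) distance: accept a keyword if only a few character insertions, deletions, or substitutions separate it from the scanned word. A hedged Python sketch, with an invented threshold of 40% of the keyword's length:

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming, O(len(a) * len(b))."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def fuzzy_match(ocr_word, keyword, max_ratio=0.4):
    """Accept if the edit distance is small relative to the keyword length.
    The 0.4 threshold is an assumption to tune against real OCR output."""
    return edit_distance(ocr_word.lower(), keyword.lower()) <= max_ratio * len(keyword)

print(edit_distance("N2dra|an".lower(), "nadarajan"))  # 3 (sub '2'->'a', insert 'a', sub '|'->'j')
print(fuzzy_match("N2dra|an", "Nadarajan"))            # True: 3 <= 0.4 * 9
```

The same dynamic-programming loop translates directly to C#.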

Meanwhile how can I scale it to enterprise architecture is there alternative approach?


Expert Comment

by:Phillip Burton
ID: 40598347
This slideshow might be of interest. It shows the different ways you can use fuzzy matching in full text search.

Expert Comment

ID: 40600615
One more thought: the grep or egrep program will already do this for you:
egrep "(word1|word2|word3)" yourfilename
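For reference, the same single-pass alternation idea behind that egrep command can be sketched in Python with one compiled regex (keyword list invented for illustration):

```python
import re

keywords = ["invoice", "total", "morris"]  # illustrative list

# One compiled pattern checks every keyword in a single pass over the text,
# instead of re-scanning the document once per keyword.
pattern = re.compile(r"\b(" + "|".join(map(re.escape, keywords)) + r")\b",
                     re.IGNORECASE)

text = "The Invoice total was approved by Morris."
print(sorted({m.group(1).lower() for m in pattern.finditer(text)}))
# ['invoice', 'morris', 'total']
```

.NET's `System.Text.RegularExpressions.Regex` supports the same alternation pattern.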
