C# or SQL: What is the best way to search/match large number of keywords against large text/context.


We have requirement to search huge number of key words against text from large documents. The search includes few fuzzy logic as well.  
Fuzzy logic Ref: http://www.codeguru.com/vb/gen/vb_database/microsoftaccess/article.php/c13137/Fuzzy-Matching-Demo-in-Access.htm

Currently we are OCR the scanned document and converting to text and then performing search by looping the list of key words one by one against the scanned text/words. We have 4 set of fuzzy logic to go thru.

Current keyword list is limited to 10 – 20 words but in production the list can go up to 1000 or more. We need to perform the search as fast as possible. What would be the best approach is? Using SQL Server Full Text search or any search algorithm apart from fuzzy logic?

Thought about multi-threading, but had a bad experience in the past. We always encountered database deadlocks. Is there any other efficient approach or architecture?

Any suggestions or approaches are welcome, either programming logic, application architecture, hardware setup etc.

Nadarajan RajappenDeveloperAsked:
Who is Participating?
SStoryConnect With a Mentor Commented:
Just some thoughts: Maybe you scan a doc building a list of unique works in that document first to avoid repeated searches in keyword list. (I'm not sure if this would take longer than just doing the search for every word.)

As to algorithms in C#, what about b-trees?
Here is an article about b-trees. You can also Google and find plenty.

However, if the list is fixed and not being added to at run time, perhaps a simple binary tree would work well, keyed alphabetically.

Also there is an article on built in dotnet collections here:

The thing about a binary tree is that you look at the current node, say "Morris" and you are searching for Apple. You immediately know that Apple is on the left side of the tree and eliminate close to half of the search terms.  For each node that repeats and search time is very quick.  This does require loading the tree correctly so that it is balanced on both sides, or using a balanced binary tree.  I think, if I remember correctly that is done by having another field on each node and it being set to -1,1 or 0, etc depending upon how balanced or unbalanced it is and sometimes balancing the tree.  However, again, if I understand you will load a fixed list of 1000 words or so, so you should be able to load that in a balanced fashion from the start since that part won't be changing. Then I'd experiment. Try just taking every word not in your ignoredwords list which will probably contain "the", "a","an","of" etc and sending it to the search function for a result. Then try running through the entire document building a collection of unique words and the run through that unique word list seeing if found or not. Compare the times needed for both and decide which is better.

Personally I'd try some of the built in classes first and see if they yield tolerable results. If so, there is no need to reinvent the wheel. If you did have multiple processors and could get each thread running on a different one, you could collect so many words then send off to one thread and so many more to send to the other, but I suspect it will be so fast already that the overhead wouldn't be worth it.

I'd try building an index of unique words in one of those large documents to see how many actual unique words are contained. I'll bet not as many as you might think. People generally use a limited vocabulary.
Phillip BurtonConnect With a Mentor Director, Practice Manager and Computing ConsultantCommented:
Have you considered using SQL Server full text search? That could reduce the number of database deadlocks you may encounter.
Nadarajan RajappenDeveloperAuthor Commented:
Hi Phillip Burton,

Thanks for your comment.

I have not tried SQL Server full text search yet, definitely will try it out but not sure if full text search will fulfill the fuzzy logic features cause when OCR reads the content wont be 100% , text will be mixed with junk char. Example,
Original Text: Nadarajan :
OCR Text : N2dra|an

Meanwhile how can I scale it to enterprise architecture is there alternative approach?

Phillip BurtonDirector, Practice Manager and Computing ConsultantCommented:
This slideshow might be of interest. It shows the different ways you can use fuzzy matching in full text search.
One more thought. The grep program or egrep program will already do this for you.
egrep "(word1|word2|word3)" yourfilename
Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.

All Courses

From novice to tech pro — start learning today.