Solved

C# or SQL:  What is the best way to search/match large number of keywords against large text/context.

Posted on 2015-02-09
5
464 Views
Last Modified: 2015-02-15
Hi,

We have requirement to search huge number of key words against text from large documents. The search includes few fuzzy logic as well.  
Fuzzy logic Ref: http://www.codeguru.com/vb/gen/vb_database/microsoftaccess/article.php/c13137/Fuzzy-Matching-Demo-in-Access.htm

Currently we are OCR the scanned document and converting to text and then performing search by looping the list of key words one by one against the scanned text/words. We have 4 set of fuzzy logic to go thru.

Current keyword list is limited to 10 – 20 words but in production the list can go up to 1000 or more. We need to perform the search as fast as possible. What would be the best approach is? Using SQL Server Full Text search or any search algorithm apart from fuzzy logic?

Thought about multi-threading, but had a bad experience in the past. We always encountered database deadlocks. Is there any other efficient approach or architecture?

Any suggestions or approaches are welcome, either programming logic, application architecture, hardware setup etc.

Thanks.
Nada
0
Comment
Question by:Ash_dotnet
  • 2
  • 2
5 Comments
 
LVL 24

Assisted Solution

by:Phillip Burton
Phillip Burton earned 250 total points
ID: 40598283
Have you considered using SQL Server full text search? That could reduce the number of database deadlocks you may encounter.
0
 
LVL 25

Accepted Solution

by:
SStory earned 250 total points
ID: 40598302
Just some thoughts: Maybe you scan a doc building a list of unique works in that document first to avoid repeated searches in keyword list. (I'm not sure if this would take longer than just doing the search for every word.)

As to algorithms in C#, what about b-trees?
Here is an article about b-trees. You can also Google and find plenty.
http://www.codeproject.com/Articles/96397/B-Tree-Sorted-Dictionary

However, if the list is fixed and not being added to at run time, perhaps a simple binary tree would work well, keyed alphabetically.

Also there is an article on built in dotnet collections here:
http://geekswithblogs.net/BlackRabbitCoder/archive/2011/06/16/c.net-fundamentals-choosing-the-right-collection-class.aspx

The thing about a binary tree is that you look at the current node, say "Morris" and you are searching for Apple. You immediately know that Apple is on the left side of the tree and eliminate close to half of the search terms.  For each node that repeats and search time is very quick.  This does require loading the tree correctly so that it is balanced on both sides, or using a balanced binary tree.  I think, if I remember correctly that is done by having another field on each node and it being set to -1,1 or 0, etc depending upon how balanced or unbalanced it is and sometimes balancing the tree.  However, again, if I understand you will load a fixed list of 1000 words or so, so you should be able to load that in a balanced fashion from the start since that part won't be changing. Then I'd experiment. Try just taking every word not in your ignoredwords list which will probably contain "the", "a","an","of" etc and sending it to the search function for a result. Then try running through the entire document building a collection of unique words and the run through that unique word list seeing if found or not. Compare the times needed for both and decide which is better.

Personally I'd try some of the built in classes first and see if they yield tolerable results. If so, there is no need to reinvent the wheel. If you did have multiple processors and could get each thread running on a different one, you could collect so many words then send off to one thread and so many more to send to the other, but I suspect it will be so fast already that the overhead wouldn't be worth it.

I'd try building an index of unique words in one of those large documents to see how many actual unique words are contained. I'll bet not as many as you might think. People generally use a limited vocabulary.
0
 

Author Comment

by:Ash_dotnet
ID: 40598340
Hi Phillip Burton,

Thanks for your comment.

I have not tried SQL Server full text search yet, definitely will try it out but not sure if full text search will fulfill the fuzzy logic features cause when OCR reads the content wont be 100% , text will be mixed with junk char. Example,
Original Text: Nadarajan :
OCR Text : N2dra|an

Meanwhile how can I scale it to enterprise architecture is there alternative approach?

Thanks
Nada
0
 
LVL 24

Expert Comment

by:Phillip Burton
ID: 40598347
This slideshow might be of interest. It shows the different ways you can use fuzzy matching in full text search.
0
 
LVL 25

Expert Comment

by:SStory
ID: 40600615
One more thought. The grep program or egrep program will already do this for you.
egrep "(word1|word2|word3)" yourfilename
0

Featured Post

Why You Should Analyze Threat Actor TTPs

After years of analyzing threat actor behavior, it’s become clear that at any given time there are specific tactics, techniques, and procedures (TTPs) that are particularly prevalent. By analyzing and understanding these TTPs, you can dramatically enhance your security program.

Join & Write a Comment

This article explains how to reset the password of the sa account on a Microsoft SQL Server.  The steps in this article work in SQL 2005, 2008, 2008 R2, 2012, 2014 and 2016.
Slowly Changing Dimension Transformation component in data task flow is very useful for us to manage and control how data changes in SSIS.
Via a live example combined with referencing Books Online, show some of the information that can be extracted from the Catalog Views in SQL Server.
Using examples as well as descriptions, and references to Books Online, show the different Recovery Models available in SQL Server and explain, as well as show how full, differential and transaction log backups are performed

706 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

21 Experts available now in Live!

Get 1:1 Help Now