  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 1734
  • Last Modified:

How to combine these keywords quickly

Hi experts,

Here is a question on how to combine keywords quickly.

Assume that the following is a small part of a large static key file that contains nearly one million lines.
30234804 highestpaysurveys.com
30234805 highestpaysurveys.org
30234806 highpaysurveys
30234807 highpaysurveys.com
30234808 hoem based
30234809 home automation market research
30234810 home based accounting jobs
30234811 home based employment
30234812 home based extra income
30234813 home based free jobs
30234814 home based internet
30234816 home based internet work
30234817 home based jobs

Now I have the keywords "home", "based", "free", "jobs". How can I quickly combine them to get
30234813 home based free jobs
30234817 home based jobs

This post follows the thread http://www.experts-exchange.com/Programming/Languages/C/Q_23417637.html. I already have an algorithm that is very fast; however, I want to see if I can find one faster than that.

I know this question is quite challenging, so I have set the point value to 500.

Thanks for all replies.


davidw88 Asked:
2 Solutions
 
Infinity08Commented:
>> 30234813 home based free jobs
>> 30234817 home based jobs

How do you decide whether a line matches?

Why didn't these get matched, for example:

30234810 home based accounting jobs
30234808 hoem based
 
davidw88Author Commented:
Hi Infinity08,

Thanks for your inquiry.

A line is matched if and only if all its words are keywords. Using that same example, for keywords "home", "based", "free", "jobs":

> 30234810 home based accounting jobs
This is not matched, as it contains "accounting", which is not a keyword.

> 30234808 hoem based
Same reason as above: "hoem" is not a keyword, so this line is not matched.

Case and word order are both ignored. For example,

12345   Free home based jobs      
12346   free based HOME jobs      
12347   home free baseD jobS      
30234813    home based free jobs
30234817    home based jobs

Given keywords "home", "free", "based" and "jobs", I should retrieve all 5 lines above.

Is this clear?
 
davidw88Author Commented:
As for the case issue, I can handle that myself.

Please feel free to ask if you have more questions or need any other information.
 
Infinity08Commented:
Do you want to delete this question? Or do you want to keep it?
 
davidw88Author Commented:
I want to delete this question, as there are no replies except yours.
 
Infinity08Commented:
If you want to delete it, no problem. But you could give people some time (including me ...). I am, however, not sure what more you hope to get out of this thread. Your other thread already covered quite a lot of ideas. If the current code is not fast enough, then maybe you should look into the more specialized algorithms that were proposed?
 
davidw88Author Commented:
Hi Infinity08 and Night-Eagle,

Thanks so much for your replies; I truly appreciate them, and I will take your suggestions. Let's leave this thread here and see.

Actually, I have studied the ideas in that thread in detail. To be frank, I feel that none of them can meet my requirement. Or is there no solution to my requirement in C?

The running time of my code implementing Infinity08's algorithm turned out to be 1000+ ms, so I am still thinking about a new algorithm to implement.

The best time I can reach now is 100+ ms. I wonder if there is no solution to reach a time of less than 40 ms.
 
Infinity08Commented:
>> I wonder if there is no solution to reach a time of less than 40 ms.

Get faster hardware ;)

Seriously, these speed measurements are highly dependent on the data you feed into the algorithm (amount as well as the data values themselves), as well as the platform the algorithm is running on.

So, it's difficult to say whether a different algorithm might be that much faster.

In fact, you might find a faster algorithm for your test data, but different data might be processed slower by the same algorithm.


I also don't know which algorithm is your current fastest.


Btw, where did this requirement of 40 ms come from? Is it based on an analysis of the specific problem domain? Or is it just an arbitrary value?
 
davidw88Author Commented:
40 ms comes from the current version that is in production. That version is Java code, and it handles at most 3 keywords in a struct. The new requirement is to handle at least 6 keywords.

For the version currently in use, we found that it is not fast enough to respond to internet requests, so we wanted to implement new code that runs faster than that one.

So far I have implemented 8 or 9 versions. I used a suffix tree (trie), the Boyer-Moore and Aho-Corasick algorithms, hash lists/arrays, etc. Now I really wonder if there is a C solution that can beat 40 ms.
 
sunnycoderCommented:
Did you try the hashing algorithm whose link I posted in your previous thread?
 
Infinity08Commented:
>> This version is java code and it only handles 3 keywords at most in a struct.

What algorithm did they use?


>> The new requirement is to handle at least 6 keywords.

And you expect it to be as fast as the original 3-word code? There's some serious optimizing to do there then, heh. You'll definitely need to look into more specialized algorithms (read: more complicated lol), as sunnycoder already suggested.


Also, the earlier comment I made about upgrading the hardware was not just a joke.
 
davidw88Author Commented:
Hi Sunnycoder and I8,

Thanks so much for your replies!

I am now working on the fast hash that sunnycoder suggested. As for the Java code's algorithm, I am sorry I cannot talk about it in detail, as my company's policy does not allow it. However, I can say that it uses an approach similar to I8's, counting how many keywords match.

>> And you expect it to be as fast as the original 3-word code? There's some serious optimizing to do there then, heh. You'll definitely need to look into more specialized algorithms (read: more complicated lol), as sunnycoder already suggested.

I am working on optimizing my code now. I am also researching specialized algorithms.

>> Also, the earlier comment I made about upgrading the hardware was not just a joke.

I will mention this to my supervisor.

Thanks again.
 
davidw88Author Commented:
Hi Night-Eagle,

Can you please NOT delete this thread? The replies from I8 and Sunnycoder are helpful, and as I am still working on this project, I hope to get their comments. And, of course, I want to assign points to them for their time, knowledge and enthusiasm.

Thanks.
 
Infinity08Commented:
>> As I am working on this project, I hope to get their comments.

I'll be happy to do that :)
 
davidw88Author Commented:
Hi Sunnycoder,

I posted a new thread: http://www.experts-exchange.com/Programming/Languages/C/Q_23509308.html. Could you take a look at it? Any help would be appreciated.

Thanks.
 
davidw88Author Commented:
Hi experts,

Now, back to this algorithm. For a linked list, we know that the traversal time is O(n), and we can only traverse from the head (singly linked list) or from the head and tail (doubly linked list).

My question is: is there any way to speed up this traversal so that it takes less time? I am not allowed to use multi-threading.

Any ideas?

Thanks so much.
 
sunnycoderCommented:
If you are using a plain linked list: no. To get to a node, you need to get its address from another node ... there is no way to optimize that, even with multithreading.

You can use additional data structures such as a trie index (a B+ tree is logically similar), but insertions and deletions become a little more complicated.
 
Infinity08Commented:
>> is there any way to quicken this traverse speed so that I can use less time?

Are you talking about the main hash table data structure, or about its buckets? For the main hash table data structure, an array is by far the fastest (for lookup), since you can do simple array indexing.

The hash table should be dimensioned so that the buckets don't get too big, which makes the bucket data structure less of a concern. A linked list, however, is one of the slowest structures when it comes to retrieving data.
In your case, the filling of the hash table happens just once if I'm not mistaken, so you can optimize for retrieval rather than insertion. A linked list is probably not the best choice there; a tree-like structure would probably perform better for the buckets.
 
davidw88Author Commented:
>> In your case, the filling of the hash table will happen just once if I'm not mistaken, so you can optimize for retrieval rather than insertion.

Yes, you are completely right. The filling of the hash table happens just once. All data from the key file (.txt) are loaded into linked lists, and these lists are then embedded into the hash table. You are also right that I should optimize retrieval instead of insertion.

>> A linked list is probably not the best choice there. A tree-like structure would probably perform better for the buckets.

Do you mean a (binary) tree, embedded into the buckets of the hash table?

Here is a little more information about one of my versions. I implemented a version with the hash table as a 2-D array; the hash table just contains integer values that serve as indexes into the array. What I found, surprisingly, is that it costs more time than the embedded linked list.

The critical thing here is that I need to traverse the entire linked list retrieved from the hash table. Even with an array, I still need to check all the values inside it. So it does not matter whether access is random or sequential, right? There is no way for me to skip any values, otherwise I may lose a match.

Now let me replace the list with a tree and see how it works.

Thanks I8!
 
davidw88Author Commented:
Hi Night-Eagle,

Thanks for coming back to check!
 
sunnycoderCommented:
If you use a decent hash function, you will have minimal collisions and hence small linked lists. You don't necessarily have to use linked lists for collision resolution: use double hashing!
Profile your program to identify the bottlenecks.

>> What I found, surprisingly, is that it costs more time than the embedded linked list.
How are you timing it? Are you using a profiler?
 
davidw88Author Commented:
Even for a tree, the traversal time is still O(n), as I have to visit each node ... the same as a linked list's O(n). In that case, are there any advantages to using a tree?
 
sunnycoderCommented:
If you have to visit all nodes, then you need O(n) time ... there is no way around that ...
The gain you can achieve is by minimizing n: better hashing, better collision resolution.
 
davidw88Author Commented:
>> If you use a decent hash function, you would have minimal collisions and hence small linked lists. You don't necessarily have to use linked lists for collision resolution

The collisions come not from the hash function but from the fact that the keyword records in the .txt file share the same starting word. For example, "yahoo dot com", "yahoo comes" and "yahoo we love yahoo" are all saved in the same embedded list because they all start with "yahoo". Assume I have keywords such as "yahoo, comes, dot, com"; then I use "yahoo" to query the hash table and retrieve the embedded list, and traverse it to check each value. In this case my final result will be "yahoo dot com" and "yahoo comes".

Do you have any ideas here? How would double hashing work in this case?
 
sunnycoderCommented:
Why was "yahoo we love yahoo" left out of the search results?
Would you be querying on a single keyword?
If you are going to use multiple keywords, then do you require all keywords to be present in the results, or do you declare a match if any one word matches?

I remember from your older posts that you needed every word of a record to be present in the search string, even if the search string was a superset of those words. Does that still hold?
 
davidw88Author Commented:
>>Why was "yahoo we love yahoo" left out of the search results?
The rule is that a value (i.e.  a record in the .txt file) in the list is said to be matched if and only if all the contents of this value occur in the keyword set. In this case "we love" among "yahoo we love yahoo" do not occur in keyword set "yahoo, comes, dot, com" therefore it is not retrieved.

This rule is still the same as before.
0
 
davidw88Author Commented:
Hi Sunnycoder and I8:

I reported on this project to my supervisor. He agrees that I have done my best, and we accept that what we have now is the best we can get, so we are closing this project for the moment. However, I will continue searching for better algorithms.

Now it's time to say thanks to both of you. I truly appreciate your time, your knowledge and your enthusiasm. This project is quite challenging; nevertheless, you experts showed your talents, and I learnt so much from you.

I split the points equally between you.

Thanks again!
 
Infinity08Commented:
Sorry for taking so long to get back here - today was quite hectic :)


>> Even for a tree, the traverse time is still O(n) as I have to visit each node... this is the same as linked list O(n).  In this case is there any advantages to use a tree?

Not if the data is sorted ... Ideally, a bucket shouldn't be large, and if the data in the bucket is sorted, it should be relatively fast to retrieve the needed value, especially if the bucket data structure takes good advantage of the fact that the data is sorted (like a search tree).
