Solved

How to combine these keywords quickly

Posted on 2008-06-18
30
1,147 Views
Last Modified: 2012-05-05
Hi experts,

Here is a question on how to combine keywords quickly.

Assume that following is a small part of a large static key file that contains nearly one million lines.
30234804 highestpaysurveys.com
30234805 highestpaysurveys.org
30234806 highpaysurveys
30234807 highpaysurveys.com
30234808 hoem based
30234809 home automation market research
30234810 home based accounting jobs
30234811 home based employment
30234812 home based extra income
30234813 home based free jobs
30234814 home based internet
30234816 home based internet work
30234817 home based jobs

Now I have keywords "home", "based", "free", "jobs", how can I quickly combine them to get
30234813 home based free jobs
30234817 home based jobs

This post follows thread http://www.experts-exchange.com/Programming/Languages/C/Q_23417637.html. I have got an algorithm that is already very fast, however I want to see if I can get one faster than that

I know this question is quite challenging so I assign the points to 500.

Thanks for all replies.


0
Comment
Question by:davidw88
  • 15
  • 8
  • 5
30 Comments
 
LVL 53

Expert Comment

by:Infinity08
ID: 21819494
>> 30234813 home based free jobs
>> 30234817 home based jobs

How do you decide whether a line matches ?

Why didn't these get matched for example :

30234810 home based accounting jobs
30234808 hoem based
0
 

Author Comment

by:davidw88
ID: 21823103
Hi Infinity08,

Thanks for your inquiry.

A line is matched if and only if all it's words are key words. I still use that example. For keywords "home", "based", "free", "jobs",

> 30234810 home based accounting jobs
This is not matched as it has "accounting" that is not a keyword.

> 30234808 hoem based
It is the same reason as above. "hoem" is not a keyword therefore this is not matched.

Case and orders are all ignored. For example,

12345   Free home based jobs      
12346   free based HOME jobs      
12347   home free baseD jobS      
30234813    home based free jobs
30234817    home based jobs

Given key keywords "home", "free", "based" and "jobs", then I should retrieve all above 5 lines of information.

Is this clear?





0
 

Author Comment

by:davidw88
ID: 21823503
For case issue, I can handle it by myself.

Please feel free to ask me if you have more questions or you need any other information.
0
 
LVL 53

Expert Comment

by:Infinity08
ID: 21833397
Do you want to delete this question ? Or do you want to keep it ?
0
 

Author Comment

by:davidw88
ID: 21833833
I want to delete this question as there are no replies except yours.
0
 
LVL 53

Expert Comment

by:Infinity08
ID: 21834082
If you want to delete, no problem. But, yuou could give people some time (including me ...). I am however not sure what more you think to get out of this thread ? Your other thread already covered quite a lot of ideas. If the current code is not fast enough, then maybe you should look into the more specialized algorithms that were proposed ?
0
 

Author Comment

by:davidw88
ID: 21835052
Hi Infinity08 and Night-Eagle,

Thanks so much for your replies. I truly appreciated. I take your suggestions. Let's leave this thread here and see.

Actually I have studied ideas in that thread in detail. To be frank I feel that none of them can meet my requirement. Or there is no resolution to my requirement with C?

The running time of my code implementing Infinity08's algorithm was later found to be 1000+ ms. So I am still thinking about getting a new algorithm to implement.

The best time that I can reach now is 100+ ms. I wonder if there is no solution to reach a time of less than 40 ms.



0
 
LVL 53

Accepted Solution

by:
Infinity08 earned 250 total points
ID: 21835106
>> I wonder if there is no solution to reach a time of less than 40 ms.

Get faster hardware ;)

Seriously, these speed measurements are highly dependent on the data you feed into the algorithm (amount as well as the data values themselves), as well as the platform the algorithm is running on.

So, it's difficult to say whether a different algorithm might be that much faster.

In fact, you might find a faster algorithm for your test data, but different data might be processed slower by the same algorithm.


I also don't know which algorithm is your current fastest.


Btw, where did this requirement of 40 ms come from ? Is that based on an analysis of the specific problem domain ? Or is that just some value ?
0
 

Author Comment

by:davidw88
ID: 21835269
40ms comes from the current version that is in production. This version is java code and it only handles 3 keywords at most in a struct. The new requirement is to handle at least 6 keywords.

For the current version that is being used, we found that it is not fast enough to respond to internet requests therefore we wanted to implement a new code that could run fater than that one.

So far I have implemented 8 or 9 versions. I used suffix tree (trie), boyer-moore/aho-corasick algorithm, hash list/array, etc. Now I really wonder if there is a solution with C to beat 40ms.
0
 
LVL 45

Assisted Solution

by:sunnycoder
sunnycoder earned 250 total points
ID: 21836184
Did you try the hashing algorithm whose link I posted in your previous thread?
0
 
LVL 53

Expert Comment

by:Infinity08
ID: 21836799
>> This version is java code and it only handles 3 keywords at most in a struct.

What algorithm did they use ?


>> The new requirement is to handle at least 6 keywords.

And you expect it to be as fast as the original 3-word code ? There's some serious optimizing to do there then heh. You'll definitely need to look into more specialized algorithms (read : more complicated lol), as sunnycoder already suggested.


Also, the earlier comment I made about upgrading the hardware was not just a joke.
0
 

Author Comment

by:davidw88
ID: 21848315
Hi Sunnycoder and I8,

Thanks so much for your replies!

I am now working on the fast hash that Sunnycoder suggested. For the java code algorithm, I am sorry I can not talk about it in details as my company's policy does not allow it. However, I can tell that it uses a similiar approach by counting how many keywords to the algorithm that I8 used.

>>And you expect it to be as fast as the original 3-word code ? There's some serious optimizing to do there then >>heh. You'll definitely need to look into more specialized algorithms (read : more complicated lol), as sunnycoder >>already suggested.

I am working on optimizing my code now. I also research on specialized algorithms.

>>Also, the earlier comment I made about upgrading the hardware was not just a joke.
I will mention this to my super.

Thanks again.


0
 

Author Comment

by:davidw88
ID: 21848596
HI Night-Eagle,

Can you please NOT delete this thread? The replies from I8 and Sunnycoder are helpful. As I am working on this project, I hope to get their comments. And, of course, I want to assign points to them for their time, knowledge and enthusiasm.

Thanks.
0
 
LVL 53

Expert Comment

by:Infinity08
ID: 21849461
>> As I am working on this project, I hope to get their comments.

I'll be happy to do that :)
0
How to run any project with ease

Manage projects of all sizes how you want. Great for personal to-do lists, project milestones, team priorities and launch plans.
- Combine task lists, docs, spreadsheets, and chat in one
- View and edit from mobile/offline
- Cut down on emails

 

Author Comment

by:davidw88
ID: 21850798
Hi Sunnycoder,

I posted a new thread http://www.experts-exchange.com/Programming/Languages/C/Q_23509308.html. Can you check this new thread? any help?

Thanks.
0
 

Author Comment

by:davidw88
ID: 21851400
Hi experts,

Now back to this algorithm. For a linked list, we know that the traverse time is O(n). And we can only traverse either from the head (single linked list) or from the head and tail (double linked list).

Now my question is: is there any way to quicken this traverse speed so that I can use less time? I am not allowed to use multi-threading.

Any ideas?

Thanks so much.
0
 
LVL 45

Expert Comment

by:sunnycoder
ID: 21852395
If you are using plain linked list - no. To get to a node, you need to get its address from another node ... no way to optimize that even if you are using multithreading.

You can use additional data structures such as trie index (B+ tree is logically similar) but insertions and deletions become a little complicated.
0
 
LVL 53

Expert Comment

by:Infinity08
ID: 21853149
>> is there any way to quicken this traverse speed so that I can use less time?

Are you talking about the main hash table data structure, or about its buckets. For the main hash table data structure, an array is by far the fastest (for lookup), since you can do simple array indexing.

The dimensioning of the hash table should be such that the buckets don't get too big, so the bucket data structure is less of a concern. A linked list however is one of the slowest when it comes to retrieval of data.
In your case, the filling of the hash table will happen just once if I'm not mistaken, so you can optimize for retrieval rather than insertion. A linked list is probably not the best choice there. A tree-like structure would probably perform better for the buckets.
0
 

Author Comment

by:davidw88
ID: 21857416
>>In your case, the filling of the hash table will happen just once if I'm not mistaken, so you can optimize for >>retrieval rather than insertion.

Yes, you are completely right. The filling of the hash table will happen just once. All data from the key file (.txt) are loaded into linked lists. Then these lists are embedded into that hash table. And you are also right that I should optimize retrieval instead of insertion.

>>A linked list is probably not the best choice there. A tree-like structure would  probably perform better for the buckets.

Do you mean to use a (binary) tree? Do you mean to embed this tree into the buckets of the hash table?

Here is a little more information about one of my versions. I implemented a version with hash table --- 2 D array. The hash table just contains an integer value that means the index in the array. What I found surprisingly that it costs more time than the embeded linked list.

The critical thing here is that I need to traverse all the linked list retrieved from the hash table. Even though it is an array, I still need to check all values inside this array. So it does not matter if it is random access or sequential access, right? There is no way for me to omit any values otherwise I may lose a match.

Now let me replace list with tree and see how it works.

Thanks I8!









0
 

Author Comment

by:davidw88
ID: 21857434
Hi Night-Eagle,

Thanks for coming back to check!
0
 
LVL 45

Expert Comment

by:sunnycoder
ID: 21857466
If you use a decent hash function, you would have minimal collisions and hence small linked lists. You dont necessarily have to use linked lists for collision resolution - use double hashing!!
Profile your program to identify the bottlenecks.

>What I found surprisingly that it costs more time than the embeded linked list.
How are you timing it? Are you using a profiler?
0
 

Author Comment

by:davidw88
ID: 21857553
Even for a tree, the traverse time is still O(n) as I have to visit each node... this is the same as linked list O(n).  In this case is there any advantages to use a tree?
0
 
LVL 45

Expert Comment

by:sunnycoder
ID: 21857569
If you have to visit all nodes then you need O(n) time ... there is no way around that ...
The gain you can achieve is by minimizing n - better hashing, better collision resolution
0
 

Author Comment

by:davidw88
ID: 21857852
>> If you use a decent hash function, you would have minimal collisions and hence small linked lists. You dont necessarily have to use linked lists for collision resolution

The collision comes not because of hash function but because these keyword records in the .txt keyword file all have the same starting keyword. For example, "yahoo dot com", "yahoo comes" and "yahoo we love yahoo" are all saved in the same embedded list as they all start from "yahoo". Assume I have some key words such as "yahoo, comes, dot, com", the I use "yahoo" to query the hash table and retrieve this embeded list. Then I traverse the list to check values. In this case my final result will be "yahoo  dot com" and "yahoo comes".

Do you have any ideas here? How to do with double hashtable?

0
 
LVL 45

Expert Comment

by:sunnycoder
ID: 21857991
Why was "yahoo we love yahoo" left out of the search results?
Would you be querying on a single keyword?
If you are going to use multiple keywords then do you require all keywords to be present in the results or you declare a match if any one word matches?

I remember from your older posts that you needed all keywords on a document to be present in the search string even if search string was a superset of these keywords. Does that still hold?
0
 

Author Comment

by:davidw88
ID: 21858166
>>Why was "yahoo we love yahoo" left out of the search results?
The rule is that a value (i.e.  a record in the .txt file) in the list is said to be matched if and only if all the contents of this value occur in the keyword set. In this case "we love" among "yahoo we love yahoo" do not occur in keyword set "yahoo, comes, dot, com" therefore it is not retrieved.

This rule is still the same as before.
0
 

Author Comment

by:davidw88
ID: 21860674
Hi Sunnycoder and I8:

I reported what I thought about this project to my super.  He sees that I have done all my best. We admit what we have got is the best so we close this project for now. However, I will continue searching better algoirthms.

Now it's time to say thanks to both of you. I truly appreciate your time, your knowledge and your enthusiasm. I should say that this project is quite challenging, however, you experts still showed your talents and I learnt so much from you.

I split points equally between you.

Thanks again!
0
 
LVL 53

Expert Comment

by:Infinity08
ID: 21860975
Sorry for taking so long to get back here - today was quite hectic :)


>> Even for a tree, the traverse time is still O(n) as I have to visit each node... this is the same as linked list O(n).  In this case is there any advantages to use a tree?

Not if the data is sorted ... Ideally, a bucket shouldn't be large, and if the data in the bucket is sorted, it should be relatively fast to retrieve the needed value, especially if the bucket data structure takes good advantage of the fact that the data is sorted (like a search tree).
0

Featured Post

How your wiki can always stay up-to-date

Quip doubles as a “living” wiki and a project management tool that evolves with your organization. As you finish projects in Quip, the work remains, easily accessible to all team members, new and old.
- Increase transparency
- Onboard new hires faster
- Access from mobile/offline

Join & Write a Comment

Suggested Solutions

Have you thought about creating an iPhone application (app), but didn't even know where to get started? Here's how: ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ Important pre-programming comments: I’ve never tri…
An Outlet in Cocoa is a persistent reference to a GUI control; it connects a property (a variable) to a control.  For example, it is common to create an Outlet for the text field GUI control and change the text that appears in this field via that Ou…
The goal of this video is to provide viewers with basic examples to understand and use structures in the C programming language.
The goal of this video is to provide viewers with basic examples to understand recursion in the C programming language.

706 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

19 Experts available now in Live!

Get 1:1 Help Now