redpencilmedia
asked on
Fastest database design?
I have a database with more than 2000K words. Each word can have many synonyms, e.g. color = red, blue, green etc.
The application I am building is quite simple: you can search for a word/string, and in the results, all matching words are returned.
I wonder what is the best database design for this application?
Should each row consist of word, synonym
or should a word have a self join to another word in the db.
The application I am building is quite simple: you can search for a word/string, and in the results, all matching words are returned.
I wonder what is the best database design for this application?
Should each row consist of word, synonym
or should a word have a self join to another word in the db.
Do you mean 2000 or 2 million records? I can make recommendations that vary greatly on what you meant by 2000k.
Side note, your indexes and queries that search the DB itself will have the biggest impact, and the underlying schema will have only so much of an impact. You are searching full text, or you want begins / ends with? Do you use any kind of stemmer already or care about that?
http://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
I suppose with such a general question, my best recommendation is to take a look at SPHINX SEARCH:
http://sphinxsearch.com/
It's faster than MySQL, it is short for "search phrase index" so it will be more apt to handle the types of searches you mentioned, and best of all it's free.
http://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
I suppose with such a general question, my best recommendation is to take a look at SPHINX SEARCH:
http://sphinxsearch.com/
It's faster than MySQL, it is short for "search phrase index" so it will be more apt to handle the types of searches you mentioned, and best of all it's free.
ASKER
Thanks for your answers so far. What I'm creating is a crossword solver, and this how the application should work:
1) Enter hint (using autocomplete)
2) Enter pattern
...and search.
Database design:
|hint|word|
|english king|arthur
|color|blue|
|color|red
|dan brown|angels and deamons
|dan brown|writer
So for the search, stemming is not important, because in 1) when you search for a hint, using auto-complete, you will only search for a word/phrase that exist.
Regarding number of words, I weren't precise. The database consist of ca 2000K rows with words and phrases. So I am looking for how I can do this better.
1) Enter hint (using autocomplete)
2) Enter pattern
...and search.
Database design:
|hint|word|
|english king|arthur
|color|blue|
|color|red
|dan brown|angels and deamons
|dan brown|writer
So for the search, stemming is not important, because in 1) when you search for a hint, using auto-complete, you will only search for a word/phrase that exist.
Regarding number of words, I weren't precise. The database consist of ca 2000K rows with words and phrases. So I am looking for how I can do this better.
So to clarify, when you write 2000K you really mean 2,000,000 (two million), right?
And how many hits per second?
Thanks.
And how many hits per second?
Thanks.
You're going to need to use MATCH() AGAINST() and set a score to the value of the hint you want in my opinion.
ASKER
Ray:
Yes, I mean two million.
The number of hits per second are not many. Less than 10.
Yes, I mean two million.
The number of hits per second are not many. Less than 10.
ASKER CERTIFIED SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
My sense is that the design will not matter much at low utilization levels, and you can make any easy and expedient choice for the table design, including normalized junction tables. At scale, however, you will want to de-normalize your tables and throw hardware at the problem.
How many hits per second do you forecast?