Link to home
Start Free TrialLog in
Avatar of redpencilmedia
redpencilmedia

asked on

Fastest database design?

I have a database with more than 2000K words. Each word can have many synonyms, e.g. color = red, blue, green etc.
The application I am building is quite simple: you can search for a word/string, and in the results, all matching words are returned.

I wonder what is the best database design for this application?

Should each row consist of word, synonym
or should a word have a self join to another word in the db.
Avatar of Ray Paseur
Ray Paseur
Flag of United States of America image

Please show us a little of your test data.  Also, please confirm that when you say 2000K what you really mean is two million words (there aren't that many in English.  A robust vocabulary is usually around 80,000 words)

My sense is that the design will not matter much at low utilization levels, and you can make any easy and expedient choice for the table design, including normalized junction tables.  At scale, however, you will want to de-normalize your tables and throw hardware at the problem.

How many hits per second do you forecast?
Do you mean 2000 or 2 million records? I can make recommendations that vary greatly on what you meant by 2000k.
Side note, your indexes and queries that search the DB itself will have the biggest impact, and the underlying schema will have only so much of an impact. You are searching full text, or you want begins / ends with? Do you use any kind of stemmer already or care about that?

http://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html

I suppose with such a general question, my best recommendation is to take a look at SPHINX SEARCH:

http://sphinxsearch.com/

It's faster than MySQL, it is short for "search phrase index" so it will be more apt to handle the types of searches you mentioned, and best of all it's free.
Avatar of redpencilmedia
redpencilmedia

ASKER

Thanks for your answers so far. What I'm creating is a crossword solver, and this how the application should work:

1) Enter hint (using autocomplete)
2) Enter pattern
...and search.

Database design:
|hint|word|
|english king|arthur
|color|blue|
|color|red
|dan brown|angels and deamons
|dan brown|writer

So for the search, stemming is not important, because in 1) when you search for a hint, using auto-complete, you will only search for a word/phrase that exist.

Regarding number of words, I weren't precise. The database consist of ca 2000K rows with words and phrases. So I am looking for how I can do this better.
So to clarify, when you write 2000K you really mean 2,000,000 (two million), right?

And how many hits per second?

Thanks.
You're going to need to use MATCH() AGAINST() and set a score to the value of the hint you want in my opinion.
Ray:
Yes, I mean two million.
The number of hits per second are not many. Less than 10.
ASKER CERTIFIED SOLUTION
Avatar of Ray Paseur
Ray Paseur
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial