Fastest database design?

I have a database with more than 2000K words. Each word can have many synonyms, e.g. color = red, blue, green etc.
The application I am building is quite simple: you can search for a word/string, and in the results, all matching words are returned.

I wonder what is the best database design for this application?

Should each row consist of word, synonym
or should a word have a self join to another word in the db.
redpencilmediaAsked:
Who is Participating?
I wear a lot of hats...

"The solutions and answers provided on Experts Exchange have been extremely helpful to me over the last few years. I wear a lot of hats - Developer, Database Administrator, Help Desk, etc., so I know a lot of things but not a lot about one thing. Experts Exchange gives me answers from people who do know a lot about one thing, in a easy to use platform." -Todd S.

Ray PaseurCommented:
Please show us a little of your test data.  Also, please confirm that when you say 2000K what you really mean is two million words (there aren't that many in English.  A robust vocabulary is usually around 80,000 words)

My sense is that the design will not matter much at low utilization levels, and you can make any easy and expedient choice for the table design, including normalized junction tables.  At scale, however, you will want to de-normalize your tables and throw hardware at the problem.

How many hits per second do you forecast?
0
F PCommented:
Do you mean 2000 or 2 million records? I can make recommendations that vary greatly on what you meant by 2000k.
0
F PCommented:
Side note, your indexes and queries that search the DB itself will have the biggest impact, and the underlying schema will have only so much of an impact. You are searching full text, or you want begins / ends with? Do you use any kind of stemmer already or care about that?

http://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html

I suppose with such a general question, my best recommendation is to take a look at SPHINX SEARCH:

http://sphinxsearch.com/

It's faster than MySQL, it is short for "search phrase index" so it will be more apt to handle the types of searches you mentioned, and best of all it's free.
0
Cloud Class® Course: Ruby Fundamentals

This course will introduce you to Ruby, as well as teach you about classes, methods, variables, data structures, loops, enumerable methods, and finishing touches.

redpencilmediaAuthor Commented:
Thanks for your answers so far. What I'm creating is a crossword solver, and this how the application should work:

1) Enter hint (using autocomplete)
2) Enter pattern
...and search.

Database design:
|hint|word|
|english king|arthur
|color|blue|
|color|red
|dan brown|angels and deamons
|dan brown|writer

So for the search, stemming is not important, because in 1) when you search for a hint, using auto-complete, you will only search for a word/phrase that exist.

Regarding number of words, I weren't precise. The database consist of ca 2000K rows with words and phrases. So I am looking for how I can do this better.
0
Ray PaseurCommented:
So to clarify, when you write 2000K you really mean 2,000,000 (two million), right?

And how many hits per second?

Thanks.
0
F PCommented:
You're going to need to use MATCH() AGAINST() and set a score to the value of the hint you want in my opinion.
0
redpencilmediaAuthor Commented:
Ray:
Yes, I mean two million.
The number of hits per second are not many. Less than 10.
0
Ray PaseurCommented:
OK, you can use a full-text search with a wild card (the percent character) and that might work if the indexes are good.  But full-text searching 2,000,000 rows 10X per second means that each search must process each row in 1/20,000,000 of a second.  That implies the need for a separate database engine.  I think I might design it this way.

Keep each of the hint letters in its own column, indexed.  Start the auto-complete when you have four or five characters of input (or maybe fewer if the client only gives you fewer, like "dan").  Your first query would be a down-select to create a temporary table ENGINE=MEMORY containing just the matching rows.  As each additional character can be applied, you would delete from the memory table WHERE the next letters do not match each positional letter.  Eventually you will have a sufficiently succinct collection of hint and word to make it worth displaying to the client.
0

Experts Exchange Solution brought to you by

Your issues matter to us.

Facing a tech roadblock? Get the help and guidance you need from experienced professionals who care. Ask your question anytime, anywhere, with no hassle.

Start your 7-day free trial
It's more than this solution.Get answers and train to solve all your tech problems - anytime, anywhere.Try it for free Edge Out The Competitionfor your dream job with proven skills and certifications.Get started today Stand Outas the employee with proven skills.Start learning today for free Move Your Career Forwardwith certification training in the latest technologies.Start your trial today
MySQL Server

From novice to tech pro — start learning today.

Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.