Table Schema Best practices

Hi, I have used mySQL before and never had any issues, however, mydatabase was small scale (<100,000 rows)
So I never really considered the impact of my table schema too much...

Lets take the example from phpbb - there are three important tables I am concerned with in this question (trying to keep it simple and somewhat theoretical):

Lets say I have 1 million users, each user creates 10 topics, and each topic has 10 posts to it.

user table now has 1 million rows
topics table now has 10 million rows
posts table now has 100 million rows

Is this the common practice? I mean, is this the best way for performance?
Maybe in order to answer that question, you need more information - such as what kinds of queries I will want to do...

Here are some examples (pretty obvious, but...)
1. I would want to show all the users topics - so I would just do a select where (topic.user_id = loggedinuser)
2. Show the posts for a particular topic - so I would just do a select where (post.topic_id = selectedtopic)

I mean, is this the way very large sites are setup, like friendster, etc...?
Or is there another way out there that scales that is a "trade secret" for DBA's?

Who is Participating?
I wear a lot of hats...

"The solutions and answers provided on Experts Exchange have been extremely helpful to me over the last few years. I wear a lot of hats - Developer, Database Administrator, Help Desk, etc., so I know a lot of things but not a lot about one thing. Experts Exchange gives me answers from people who do know a lot about one thing, in a easy to use platform." -Todd S.

This is a pretty standard normalized data structure, yes.  Indexing will play a key role in performance of the application that depends on this data.  In some cases, you might want to create a summary table for specific tasks (say, watched_topics, which can contain up to 25 different topics per user, etc.).

Experts Exchange Solution brought to you by

Your issues matter to us.

Facing a tech roadblock? Get the help and guidance you need from experienced professionals who care. Ask your question anytime, anywhere, with no hassle.

Start your 7-day free trial
A million rows isn't too big for MySQL. 10 million is useable, but you'll need to be careful with your indexes and consider each query carefully.  100 million might be getting a little big for a single table, but that depends more on the OS imposed limit for file sizes than MySQL's limits.  For that table, and possibly the topics table, consider using a MERGE table type, where the data is physically split among several different tables (and therefore files), but you can logically view them (or a subset of them) as one.  The manual at explains how to set that up, and shows its limitations.
cdfllcAuthor Commented:
todd, so I imagine that the hypothetical "watched_topics" table contains this information: user_id, topic_id
So if I wanted to show the users watched topics,
then I would just:    select from topics where the topic_id in (SELECT from watched_topics where user_id = loggedinuser)

Something like that?

Is that faster than having a flag in the topics table to have something like this:
SELECT from topics where user_id = loggedinuser AND watch_flag = 1
Cloud Class® Course: Microsoft Exchange Server

The MCTS: Microsoft Exchange Server 2010 certification validates your skills in supporting the maintenance and administration of the Exchange servers in an enterprise environment. Learn everything you need to know with this course.

Yes - you could also do this with a straight join instead of a subquery:

SELECT t.* FROM topics t INNER JOIN watched_topics w ON (w.topic_id=t.topic_id)
WHERE w.user_id= loggedinuser;

You don't want to have a watch_flag on the topics table because there should be only one entry in that table per distinct topic.  Whether a topic is watched by a user is specific to a user/topic combination, so you want to keep that out of the topics table.

In fact, you don't want to have a user_id column on the topics table, either.  Think of a topic in abstract terms - they are independent of users.  For example, if there is a topic of "Computer Hardware", is it any different for user_1 than user_2?  Not in general terms, no.  If you want to associate topics to users, you need another table (the watched_topics table is an example of this) to define the associations.  But keep the topics table pure - don't include information it does not need to include.
cdfllcAuthor Commented:
:) todd, you're right - I wasn't really thinking it through there... about the watched topics. I was more focused on the AND watch_flag = 1 part.
Which, I think I found another answer to this type of question in another post -
if you have the "AND watch_flag = 1" part - I think they said that it would have to do a table scan to find all matches where the flag = 1 -- is that correct?

I guess I am making it harder than it really is - it just seems too easy...
I just remember we tried to create something like this where I used to work - we had this huge join table (user_id, related_user_id, relationship_type)
and the DBA said it wouldn't scale - it's haunted me ever since :)
I don't think that adding the watch_flag = 1 would REQUIRE a full table scan - it depends on the indexes.  If you have appropriate indexes, you can do it without a full table scan.

Even extremely large tables can be joined efficiently if the SQL makes good use of indexes.  I cringe to think of a DBA saying that well-normalized table definitions won't scale well - I'd love to see the alternative!  There are times (particularly for reporting), where de-normalized data is more efficient.  But generally you want to use normalized data to generate de-normalized data in summary (rollup) tables for more efficient reporting, while keeping and using the normalized data in normal applications.

I'm sure this is how EE does it.  Check it out - the points summary tables in the left-hand column are likely generated from summary tables that are updated periodically.  If you look at an actual expert profile, the point total may be slightly different.  That data is pulled in real time, which is easy to do efficiently because you know what you are looking for (a particular expert's point total).  But the leaderboard is a different thing - you don't want to calculate the total points per a topic area for each user for a given period of time EVERY time a user visits this page.  Periodically refreshed summary data is sufficient.
cdfllcAuthor Commented:
yeah, the DBA couldn't ever give us any alternatives! ;)
I think it was an excuse for something else... anyways thanks for the help todd!
I am planning on splitting the points 400 for you and 100 for snoyes, since you both contributed...
It's more than this solution.Get answers and train to solve all your tech problems - anytime, anywhere.Try it for free Edge Out The Competitionfor your dream job with proven skills and certifications.Get started today Stand Outas the employee with proven skills.Start learning today for free Move Your Career Forwardwith certification training in the latest technologies.Start your trial today
MySQL Server

From novice to tech pro — start learning today.

Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.