asked on

Table Schema Best practices

Hi, I have used mySQL before and never had any issues, however, mydatabase was small scale (<100,000 rows)
So I never really considered the impact of my table schema too much...

Lets take the example from phpbb - there are three important tables I am concerned with in this question (trying to keep it simple and somewhat theoretical):
user,topic,post

Lets say I have 1 million users, each user creates 10 topics, and each topic has 10 posts to it.

user table now has 1 million rows
topics table now has 10 million rows
posts table now has 100 million rows

Is this the common practice? I mean, is this the best way for performance?
Maybe in order to answer that question, you need more information - such as what kinds of queries I will want to do...

Here are some examples (pretty obvious, but...)
1. I would want to show all the users topics - so I would just do a select where (topic.user_id = loggedinuser)
2. Show the posts for a particular topic - so I would just do a select where (post.topic_id = selectedtopic)

I mean, is this the way very large sites are setup, like friendster, etc...?
Or is there another way out there that scales that is a "trade secret" for DBA's?

thanks!

ASKER CERTIFIED SOLUTION

todd_farmer

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

SOLUTION

snoyes_jw

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

cdfllc

ASKER

todd, so I imagine that the hypothetical "watched_topics" table contains this information: user_id, topic_id
So if I wanted to show the users watched topics,
then I would just: select from topics where the topic_id in (SELECT from watched_topics where user_id = loggedinuser)

Something like that?

Is that faster than having a flag in the topics table to have something like this:
SELECT from topics where user_id = loggedinuser AND watch_flag = 1

todd_farmer

Yes - you could also do this with a straight join instead of a subquery:

SELECT t.* FROM topics t INNER JOIN watched_topics w ON (w.topic_id=t.topic_id)
WHERE w.user_id= loggedinuser;

You don't want to have a watch_flag on the topics table because there should be only one entry in that table per distinct topic. Whether a topic is watched by a user is specific to a user/topic combination, so you want to keep that out of the topics table.

In fact, you don't want to have a user_id column on the topics table, either. Think of a topic in abstract terms - they are independent of users. For example, if there is a topic of "Computer Hardware", is it any different for user_1 than user_2? Not in general terms, no. If you want to associate topics to users, you need another table (the watched_topics table is an example of this) to define the associations. But keep the topics table pure - don't include information it does not need to include.

cdfllc

ASKER

:) todd, you're right - I wasn't really thinking it through there... about the watched topics. I was more focused on the AND watch_flag = 1 part.
Which, I think I found another answer to this type of question in another post -
if you have the "AND watch_flag = 1" part - I think they said that it would have to do a table scan to find all matches where the flag = 1 -- is that correct?

I guess I am making it harder than it really is - it just seems too easy...
I just remember we tried to create something like this where I used to work - we had this huge join table (user_id, related_user_id, relationship_type)
and the DBA said it wouldn't scale - it's haunted me ever since :)

todd_farmer

I don't think that adding the watch_flag = 1 would REQUIRE a full table scan - it depends on the indexes. If you have appropriate indexes, you can do it without a full table scan.

Even extremely large tables can be joined efficiently if the SQL makes good use of indexes. I cringe to think of a DBA saying that well-normalized table definitions won't scale well - I'd love to see the alternative! There are times (particularly for reporting), where de-normalized data is more efficient. But generally you want to use normalized data to generate de-normalized data in summary (rollup) tables for more efficient reporting, while keeping and using the normalized data in normal applications.

I'm sure this is how EE does it. Check it out - the points summary tables in the left-hand column are likely generated from summary tables that are updated periodically. If you look at an actual expert profile, the point total may be slightly different. That data is pulled in real time, which is easy to do efficiently because you know what you are looking for (a particular expert's point total). But the leaderboard is a different thing - you don't want to calculate the total points per a topic area for each user for a given period of time EVERY time a user visits this page. Periodically refreshed summary data is sufficient.

cdfllc

ASKER

yeah, the DBA couldn't ever give us any alternatives! ;)
I think it was an excuse for something else... anyways thanks for the help todd!
I am planning on splitting the points 400 for you and 100 for snoyes, since you both contributed...