Table Schema Best Practices

Posted on 2006-03-23
Last Modified: 2012-08-14
Hi, I have used MySQL before and never had any issues; however, my database was small scale (<100,000 rows),
so I never really considered the impact of my table schema too much...

Let's take the example from phpBB - there are three important tables I am concerned with in this question (trying to keep it simple and somewhat theoretical):

Let's say I have 1 million users, each user creates 10 topics, and each topic has 10 posts.

user table now has 1 million rows
topics table now has 10 million rows
posts table now has 100 million rows

Is this the common practice? I mean, is this the best way for performance?
Maybe in order to answer that question, you need more information - such as what kinds of queries I will want to do...

Here are some examples (pretty obvious, but...)
1. I would want to show all of a user's topics - so I would just do a select where (topic.user_id = loggedinuser)
2. Show the posts for a particular topic - so I would just do a select where (post.topic_id = selectedtopic)

I mean, is this the way very large sites are set up, like Friendster, etc.?
Or is there another way out there that scales that is a "trade secret" for DBAs?

Question by:cdfllc
    LVL 30

    Accepted Solution

    This is a pretty standard normalized data structure, yes.  Indexing will play a key role in performance of the application that depends on this data.  In some cases, you might want to create a summary table for specific tasks (say, watched_topics, which can contain up to 25 different topics per user, etc.).
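
    To make the indexing point concrete, here is a minimal sketch of the three tables from the question. The column names, types, and index names are assumptions for illustration, not phpBB's actual schema. The idea is simply that each of the two example queries filters on a column that has its own index.

    ```sql
    -- Hypothetical schema: names and types are illustrative, not phpBB's.
    CREATE TABLE users (
      user_id  INT UNSIGNED NOT NULL AUTO_INCREMENT,
      username VARCHAR(64)  NOT NULL,
      PRIMARY KEY (user_id)
    );

    CREATE TABLE topics (
      topic_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
      user_id  INT UNSIGNED NOT NULL,          -- creator of the topic
      title    VARCHAR(255) NOT NULL,
      PRIMARY KEY (topic_id),
      INDEX idx_topics_user (user_id)          -- supports "all topics for a user"
    );

    CREATE TABLE posts (
      post_id  INT UNSIGNED NOT NULL AUTO_INCREMENT,
      topic_id INT UNSIGNED NOT NULL,
      user_id  INT UNSIGNED NOT NULL,
      body     TEXT         NOT NULL,
      PRIMARY KEY (post_id),
      INDEX idx_posts_topic (topic_id)         -- supports "all posts for a topic"
    );

    -- Query 1: all topics created by one user (42 stands in for loggedinuser).
    SELECT topic_id, title FROM topics WHERE user_id = 42;

    -- Query 2: all posts in one topic (1001 stands in for selectedtopic).
    SELECT post_id, body FROM posts WHERE topic_id = 1001 ORDER BY post_id;
    ```

    Running EXPLAIN on each SELECT is the usual way to confirm that the index is actually being used on your data.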
    LVL 33

    Assisted Solution

    A million rows isn't too big for MySQL. 10 million is usable, but you'll need to be careful with your indexes and consider each query carefully.  100 million might be getting a little big for a single table, but that depends more on the OS-imposed limit for file sizes than on MySQL's limits.  For that table, and possibly the topics table, consider using a MERGE table type, where the data is physically split among several different tables (and therefore files), but you can logically view them (or a subset of them) as one.  The manual explains how to set that up, and shows its limitations.
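
    A rough sketch of what a MERGE setup could look like, assuming the posts are split by year. The table names and the split rule are made up for illustration. MERGE only works over MyISAM tables whose column and index definitions are identical, so the same definition is repeated for each table.

    ```sql
    -- Hypothetical split of posts by year; the underlying tables must be MyISAM
    -- and must have identical columns and indexes.
    CREATE TABLE posts_2005 (
      post_id  INT UNSIGNED NOT NULL,
      topic_id INT UNSIGNED NOT NULL,
      body     TEXT         NOT NULL,
      INDEX idx_post (post_id),
      INDEX idx_topic (topic_id)
    ) ENGINE=MyISAM;

    CREATE TABLE posts_2006 (
      post_id  INT UNSIGNED NOT NULL,
      topic_id INT UNSIGNED NOT NULL,
      body     TEXT         NOT NULL,
      INDEX idx_post (post_id),
      INDEX idx_topic (topic_id)
    ) ENGINE=MyISAM;

    -- The MERGE table holds no data of its own; it is a logical view over the
    -- tables listed in UNION. INSERT_METHOD=LAST sends new rows to posts_2006.
    CREATE TABLE posts_all (
      post_id  INT UNSIGNED NOT NULL,
      topic_id INT UNSIGNED NOT NULL,
      body     TEXT         NOT NULL,
      INDEX idx_post (post_id),
      INDEX idx_topic (topic_id)
    ) ENGINE=MERGE UNION=(posts_2005, posts_2006) INSERT_METHOD=LAST;

    -- Queries go against the MERGE table as if it were a single table.
    SELECT post_id, body FROM posts_all WHERE topic_id = 1001;
    ```

    Note that the sketch uses plain indexes rather than a primary key: a MERGE table cannot enforce uniqueness across its underlying tables, which is one of the limitations to check before committing to this approach.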
    LVL 1

    Author Comment

    todd, so I imagine that the hypothetical "watched_topics" table contains this information: user_id, topic_id
    So if I wanted to show the user's watched topics,
    then I would just:    SELECT * FROM topics WHERE topic_id IN (SELECT topic_id FROM watched_topics WHERE user_id = loggedinuser)

    Something like that?

    Is that faster than having a flag in the topics table and doing something like this:
    SELECT * FROM topics WHERE user_id = loggedinuser AND watch_flag = 1
    LVL 30

    Expert Comment

    Yes - you could also do this with a straight join instead of a subquery:

    SELECT t.* FROM topics t INNER JOIN watched_topics w ON (w.topic_id=t.topic_id)
    WHERE w.user_id= loggedinuser;

    You don't want to have a watch_flag on the topics table because there should be only one entry in that table per distinct topic.  Whether a topic is watched by a user is specific to a user/topic combination, so you want to keep that out of the topics table.

    In fact, you don't want to have a user_id column on the topics table, either.  Think of a topic in abstract terms - they are independent of users.  For example, if there is a topic of "Computer Hardware", is it any different for user_1 than user_2?  Not in general terms, no.  If you want to associate topics to users, you need another table (the watched_topics table is an example of this) to define the associations.  But keep the topics table pure - don't include information it does not need to include.
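
    As a sketch of that association table (column types and the index name are assumptions), a composite primary key on (user_id, topic_id) guarantees one row per user/topic pair and also serves lookups by user, which is what the join above needs:

    ```sql
    -- Minimal topics table so the example stands on its own (hypothetical columns).
    CREATE TABLE IF NOT EXISTS topics (
      topic_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
      title    VARCHAR(255) NOT NULL,
      PRIMARY KEY (topic_id)
    );

    -- One row per (user, topic) pair; the watch relationship lives here,
    -- not on the topics table.
    CREATE TABLE watched_topics (
      user_id  INT UNSIGNED NOT NULL,
      topic_id INT UNSIGNED NOT NULL,
      PRIMARY KEY (user_id, topic_id),        -- lookup by user, no duplicate pairs
      INDEX idx_watched_topic (topic_id)      -- lookup by topic ("who watches this?")
    );

    -- Topics watched by one user (42 stands in for loggedinuser).
    SELECT t.*
    FROM topics t
    INNER JOIN watched_topics w ON w.topic_id = t.topic_id
    WHERE w.user_id = 42;
    ```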
    LVL 1

    Author Comment

    :) todd, you're right - I wasn't really thinking it through there... about the watched topics. I was more focused on the AND watch_flag = 1 part.
    Which, I think I found another answer to this type of question in another post -
    if you have the "AND watch_flag = 1" part - I think they said that it would have to do a table scan to find all matches where the flag = 1 -- is that correct?

    I guess I am making it harder than it really is - it just seems too easy...
    I just remember we tried to create something like this where I used to work - we had this huge join table (user_id, related_user_id, relationship_type)
    and the DBA said it wouldn't scale - it's haunted me ever since :)
    LVL 30

    Expert Comment

    I don't think that adding the watch_flag = 1 would REQUIRE a full table scan - it depends on the indexes.  If you have appropriate indexes, you can do it without a full table scan.
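
    For example, taking the flag-on-the-table layout from the question purely to illustrate the indexing point (the table and index names here are hypothetical), a composite index covering both columns in the WHERE clause gives the optimizer a way to find the matching rows without reading the whole table:

    ```sql
    -- Hypothetical table mirroring the layout asked about, with a watch_flag column.
    CREATE TABLE topics_with_flag (
      topic_id   INT UNSIGNED NOT NULL AUTO_INCREMENT,
      user_id    INT UNSIGNED NOT NULL,
      watch_flag TINYINT      NOT NULL DEFAULT 0,
      title      VARCHAR(255) NOT NULL,
      PRIMARY KEY (topic_id)
    );

    -- Composite index on both filter columns.
    ALTER TABLE topics_with_flag
      ADD INDEX idx_user_watch (user_id, watch_flag);

    -- EXPLAIN shows which index (if any) MySQL chooses for the query.
    EXPLAIN
    SELECT topic_id, title
    FROM topics_with_flag
    WHERE user_id = 42 AND watch_flag = 1;
    ```

    An index on watch_flag alone would be much less useful, since a two-valued column is not selective; leading with user_id is what narrows the search.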

    Even extremely large tables can be joined efficiently if the SQL makes good use of indexes.  I cringe to think of a DBA saying that well-normalized table definitions won't scale well - I'd love to see the alternative!  There are times (particularly for reporting) when de-normalized data is more efficient.  But generally you want to use normalized data to generate de-normalized data in summary (rollup) tables for more efficient reporting, while keeping and using the normalized data in normal applications.

    I'm sure this is how EE does it.  Check it out - the points leaderboards in the left-hand column are likely generated from summary tables that are updated periodically.  If you look at an actual expert profile, the point total may be slightly different.  That data is pulled in real time, which is easy to do efficiently because you know what you are looking for (a particular expert's point total).  But the leaderboard is a different thing - you don't want to calculate the total points per topic area for each user for a given period of time EVERY time a user visits this page.  Periodically refreshed summary data is sufficient.
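
    A minimal sketch of that pattern, with made-up table and column names (this is not how EE actually stores its data, just an illustration): the normalized table records every award, and a scheduled job rebuilds a small rollup table that the leaderboard page reads.

    ```sql
    -- Normalized source data: one row per award (hypothetical schema).
    CREATE TABLE point_awards (
      expert_id     INT UNSIGNED NOT NULL,
      topic_area_id INT UNSIGNED NOT NULL,
      points        INT UNSIGNED NOT NULL,
      awarded_at    DATETIME     NOT NULL,
      INDEX idx_awards_expert (expert_id),
      INDEX idx_awards_time (awarded_at)
    );

    -- De-normalized rollup: one row per (topic area, expert).
    CREATE TABLE leaderboard_summary (
      topic_area_id INT UNSIGNED NOT NULL,
      expert_id     INT UNSIGNED NOT NULL,
      total_points  INT UNSIGNED NOT NULL,
      PRIMARY KEY (topic_area_id, expert_id)
    );

    -- Periodic refresh (e.g. from a scheduled job): rebuild the last 30 days.
    DELETE FROM leaderboard_summary;
    INSERT INTO leaderboard_summary (topic_area_id, expert_id, total_points)
    SELECT topic_area_id, expert_id, SUM(points)
    FROM point_awards
    WHERE awarded_at >= NOW() - INTERVAL 30 DAY
    GROUP BY topic_area_id, expert_id;

    -- Leaderboard page: cheap read from the rollup (7 is an example topic area).
    SELECT expert_id, total_points
    FROM leaderboard_summary
    WHERE topic_area_id = 7
    ORDER BY total_points DESC
    LIMIT 10;

    -- Profile page: real-time total for one expert, served by idx_awards_expert.
    SELECT SUM(points) AS total_points
    FROM point_awards
    WHERE expert_id = 42;
    ```

    The leaderboard can lag behind the profile total by up to one refresh interval, which is the trade-off described above.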
    LVL 1

    Author Comment

    yeah, the DBA couldn't ever give us any alternatives! ;)
    I think it was an excuse for something else... anyways thanks for the help todd!
    I am planning on splitting the points 400 for you and 100 for snoyes, since you both contributed...
