

Table Schema Best Practices

Posted on 2006-03-23
Medium Priority
Last Modified: 2012-08-14
Hi, I have used MySQL before and never had any issues; however, my database was small-scale (<100,000 rows),
so I never really considered the impact of my table schema too much...

Let's take the example from phpBB - there are three important tables (users, topics, posts) I am concerned with in this question (trying to keep it simple and somewhat theoretical):

Let's say I have 1 million users, each user creates 10 topics, and each topic has 10 posts.

user table now has 1 million rows
topics table now has 10 million rows
posts table now has 100 million rows

Is this the common practice? I mean, is this the best way for performance?
Maybe in order to answer that question, you need more information - such as what kinds of queries I will want to do...

Here are some examples (pretty obvious, but...)
1. I would want to show all of a user's topics - so I would just do: SELECT * FROM topics WHERE user_id = loggedinuser
2. Show the posts for a particular topic - so I would just do: SELECT * FROM posts WHERE topic_id = selectedtopic

I mean, is this the way very large sites are set up, like Friendster, etc.?
Or is there another way out there that scales and is a "trade secret" for DBAs?

Question by:cdfllc
LVL 30

Accepted Solution

todd_farmer earned 1600 total points
ID: 16271206
This is a pretty standard normalized data structure, yes.  Indexing will play a key role in performance of the application that depends on this data.  In some cases, you might want to create a summary table for specific tasks (say, watched_topics, which can contain up to 25 different topics per user, etc.).
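As a rough illustration of the indexing point, here is a minimal sketch of the two larger tables with one secondary index for each of the two queries in the question. All column names, types, and index names are assumptions made for this example; they are not taken from the actual phpBB schema.

```sql
-- Hypothetical, simplified tables; names and types are assumptions.
CREATE TABLE topics (
  topic_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  user_id  INT UNSIGNED NOT NULL,           -- creator, as in query 1 of the question
  title    VARCHAR(255) NOT NULL,
  PRIMARY KEY (topic_id),
  KEY idx_topics_user (user_id)             -- serves: WHERE user_id = ?
);

CREATE TABLE posts (
  post_id  INT UNSIGNED NOT NULL AUTO_INCREMENT,
  topic_id INT UNSIGNED NOT NULL,
  user_id  INT UNSIGNED NOT NULL,
  body     TEXT NOT NULL,
  PRIMARY KEY (post_id),
  KEY idx_posts_topic (topic_id)            -- serves: WHERE topic_id = ?
);

-- EXPLAIN shows whether the optimizer picks the index for each lookup.
EXPLAIN SELECT * FROM topics WHERE user_id = 42;
EXPLAIN SELECT * FROM posts  WHERE topic_id = 1001;
```

On populated tables, the `key` column of the EXPLAIN output would be expected to name the secondary index; a `type` of `ALL` would indicate a full table scan instead.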
LVL 33

Assisted Solution

snoyes_jw earned 400 total points
ID: 16271268
A million rows isn't too big for MySQL. 10 million is usable, but you'll need to be careful with your indexes and consider each query carefully.  100 million might be getting a little big for a single table, but that depends more on the OS-imposed limit for file sizes than on MySQL's limits.  For that table, and possibly the topics table, consider using a MERGE table type, where the data is physically split among several different tables (and therefore files), but you can logically view them (or a subset of them) as one.  The manual at mysql.com explains how to set that up, and shows its limitations.
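A minimal sketch of what such a MERGE setup might look like follows. The table names, the columns, and the idea of splitting posts by year are assumptions for illustration only. MERGE works over MyISAM tables with identical structure, and the manual remains the authority on the details and limitations.

```sql
-- Hypothetical split of posts into per-year MyISAM tables.
CREATE TABLE posts_2005 (
  post_id  INT UNSIGNED NOT NULL,
  topic_id INT UNSIGNED NOT NULL,
  body     TEXT NOT NULL,
  PRIMARY KEY (post_id),
  KEY idx_topic (topic_id)
) ENGINE=MyISAM;

-- Same structure (and engine) as posts_2005.
CREATE TABLE posts_2006 LIKE posts_2005;

-- The MERGE table presents both underlying tables as one.
-- It uses a plain KEY on post_id instead of a PRIMARY KEY because
-- uniqueness cannot be enforced across the underlying tables.
CREATE TABLE posts_all (
  post_id  INT UNSIGNED NOT NULL,
  topic_id INT UNSIGNED NOT NULL,
  body     TEXT NOT NULL,
  KEY (post_id),
  KEY idx_topic (topic_id)
) ENGINE=MERGE UNION=(posts_2005, posts_2006) INSERT_METHOD=LAST;

-- Reads span every table in the UNION list; with INSERT_METHOD=LAST,
-- new rows inserted through posts_all go into posts_2006.
SELECT * FROM posts_all WHERE topic_id = 1001;
```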

Author Comment

ID: 16272890
todd, so I imagine that the hypothetical "watched_topics" table contains this information: user_id, topic_id
So if I wanted to show the user's watched topics,
then I would just:    SELECT * FROM topics WHERE topic_id IN (SELECT topic_id FROM watched_topics WHERE user_id = loggedinuser)

Something like that?

Is that faster than having a flag in the topics table and doing something like this:
SELECT * FROM topics WHERE user_id = loggedinuser AND watch_flag = 1
LVL 30

Expert Comment

ID: 16273028
Yes - you could also do this with a straight join instead of a subquery:

SELECT t.* FROM topics t INNER JOIN watched_topics w ON (w.topic_id = t.topic_id)
WHERE w.user_id = loggedinuser;

You don't want to have a watch_flag on the topics table because there should be only one entry in that table per distinct topic.  Whether a topic is watched by a user is specific to a user/topic combination, so you want to keep that out of the topics table.

In fact, you don't want to have a user_id column on the topics table, either.  Think of topics in abstract terms - they are independent of users.  For example, if there is a topic of "Computer Hardware", is it any different for user_1 than for user_2?  Not in general terms, no.  If you want to associate topics with users, you need another table (the watched_topics table is an example of this) to define the associations.  But keep the topics table pure - don't include information it does not need to include.
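A minimal sketch of such an association table, assuming integer keys named user_id and topic_id as in the author's comment (the types and index name are assumptions):

```sql
-- One row per (user, topic) pair that is being watched.
CREATE TABLE watched_topics (
  user_id  INT UNSIGNED NOT NULL,
  topic_id INT UNSIGNED NOT NULL,
  PRIMARY KEY (user_id, topic_id),      -- prevents duplicates; serves WHERE user_id = ?
  KEY idx_watched_topic (topic_id)      -- serves the reverse lookup: who watches a topic
);

-- Watching and un-watching touch only this table, never topics.
INSERT INTO watched_topics (user_id, topic_id) VALUES (42, 1001);
DELETE FROM watched_topics WHERE user_id = 42 AND topic_id = 1001;
```

Because user_id is the leading column of the primary key, the join shown above can find one user's watched rows without scanning the whole table.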

Author Comment

ID: 16273283
:) todd, you're right - I wasn't really thinking it through there... about the watched topics. I was more focused on the AND watch_flag = 1 part.
I think I found another answer to this type of question in another post -
if you have the "AND watch_flag = 1" part, I think they said that it would have to do a table scan to find all matches where the flag = 1 -- is that correct?

I guess I am making it harder than it really is - it just seems too easy...
I just remember we tried to create something like this where I used to work - we had this huge join table (user_id, related_user_id, relationship_type)
and the DBA said it wouldn't scale - it's haunted me ever since :)
LVL 30

Expert Comment

ID: 16273413
I don't think that adding the watch_flag = 1 would REQUIRE a full table scan - it depends on the indexes.  If you have appropriate indexes, you can do it without a full table scan.
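To make the index point concrete, here is a sketch using a hypothetical table that carries the flag. The table name topics_flagged, its columns, and the index name are assumptions, and the table exists only to illustrate index usage, not as a recommended design.

```sql
-- Hypothetical table with a per-row flag, for illustration only.
CREATE TABLE topics_flagged (
  topic_id   INT UNSIGNED NOT NULL AUTO_INCREMENT,
  user_id    INT UNSIGNED NOT NULL,
  watch_flag TINYINT(1) NOT NULL DEFAULT 0,
  title      VARCHAR(255) NOT NULL,
  PRIMARY KEY (topic_id),
  KEY idx_user_watch (user_id, watch_flag)  -- composite index covering both conditions
);

-- Both WHERE conditions match the leading columns of idx_user_watch,
-- so the lookup can be resolved through the index.
EXPLAIN SELECT * FROM topics_flagged
WHERE user_id = 42 AND watch_flag = 1;
```

An index on watch_flag alone would be much less useful: with only two distinct values it is not selective, and the optimizer may well prefer a scan in that case.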

Even extremely large tables can be joined efficiently if the SQL makes good use of indexes.  I cringe to think of a DBA saying that well-normalized table definitions won't scale well - I'd love to see the alternative!  There are times (particularly for reporting) when de-normalized data is more efficient.  But generally you want to use normalized data to generate de-normalized data in summary (rollup) tables for more efficient reporting, while keeping and using the normalized data in normal applications.

I'm sure this is how EE does it.  Check it out - the points summaries in the left-hand column are likely generated from summary tables that are updated periodically.  If you look at an actual expert profile, the point total may be slightly different.  That data is pulled in real time, which is easy to do efficiently because you know what you are looking for (a particular expert's point total).  But the leaderboard is a different thing - you don't want to calculate the total points per topic area for each user for a given period of time EVERY time a user visits this page.  Periodically refreshed summary data is sufficient.
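A minimal sketch of a periodically refreshed summary table, translated to the forum example from the question. It is not a description of how EE actually does it. The table user_post_totals, the posts columns, and the refresh schedule are all assumptions.

```sql
-- Assumed shape of the normalized source table (skipped if it already exists).
CREATE TABLE IF NOT EXISTS posts (
  post_id  INT UNSIGNED NOT NULL AUTO_INCREMENT,
  topic_id INT UNSIGNED NOT NULL,
  user_id  INT UNSIGNED NOT NULL,
  body     TEXT NOT NULL,
  PRIMARY KEY (post_id)
);

-- Hypothetical rollup table: one row per user.
CREATE TABLE user_post_totals (
  user_id    INT UNSIGNED NOT NULL,
  post_count INT UNSIGNED NOT NULL,
  PRIMARY KEY (user_id)
);

-- Refresh step, run periodically (for example from a scheduled job).
-- REPLACE overwrites the existing row for each user_id.
REPLACE INTO user_post_totals (user_id, post_count)
SELECT user_id, COUNT(*) FROM posts GROUP BY user_id;

-- The leaderboard page then reads the small rollup table only.
SELECT user_id, post_count
FROM user_post_totals
ORDER BY post_count DESC
LIMIT 10;
```

The expensive GROUP BY over the large posts table runs once per refresh rather than once per page view, at the cost of totals that can lag slightly behind the live data.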

Author Comment

ID: 16273508
yeah, the DBA couldn't ever give us any alternatives! ;)
I think it was an excuse for something else... anyways thanks for the help todd!
I am planning on splitting the points 400 for you and 100 for snoyes, since you both contributed...


Question has a verified solution.
