Optimizing database design: TEXT vs. VARCHAR?

I'm working on some statistical queries that I need to run against a single table with about 3 million records.  I'm having some issues optimizing one query in particular, and while running EXPLAIN against it, I noticed something that bothers me.  So, I sat down and wrote two painfully artificial, minimal test cases to verify, and here are the results, run on MySQL Ver 14.12 Distrib 5.0.18, for Win32 (ia32), tables are both MyISAM (default):

(1)
-- Create the first table
CREATE TABLE ex1
(
    firstname VARCHAR(255),
    age INTEGER
);

-- Create the second table
CREATE TABLE ex2
(
    firstname TINYTEXT,
    age INTEGER
);

-- load ex1 with two records
INSERT INTO ex1(firstname, age) VALUES ('Bob', 23);
INSERT INTO ex1(firstname, age) VALUES ('Bob', 25);
INSERT INTO ex1(firstname, age) VALUES ('Bob', 30);

-- load ex2 with the same data
INSERT INTO ex2 SELECT * FROM ex1;

-- Create indexes
CREATE INDEX index_on_both ON ex1 (firstname, age);
CREATE INDEX index_on_both ON ex2 (firstname, age);

-- Test the queries
EXPLAIN SELECT firstname, age FROM ex1 GROUP BY firstname;
+----+-------------+-------+-------+---------------+---------------+---------+------+------+-------------+
| id | select_type | table | type  | possible_keys | key           | key_len | ref  | rows | Extra       |
+----+-------------+-------+-------+---------------+---------------+---------+------+------+-------------+
|  1 | SIMPLE      | ex1   | index | NULL          | index_on_both | 263     | NULL |    3 | Using index |
+----+-------------+-------+-------+---------------+---------------+---------+------+------+-------------+

EXPLAIN SELECT firstname, age FROM ex2 GROUP BY firstname;
+----+-------------+-------+------+---------------+------+---------+------+------+---------------------------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows | Extra                           |
+----+-------------+-------+------+---------------+------+---------+------+------+---------------------------------+
|  1 | SIMPLE      | ex2   | ALL  | NULL          | NULL | NULL    | NULL |    3 | Using temporary; Using filesort |
+----+-------------+-------+------+---------------+------+---------+------+------+---------------------------------+

Why isn't my index being used on columns of any of the TEXT dataypes?  Is there a way to modify my index or table to get it to work?  Is it using the index, but claiming not to?  Surely there's some explanation that doesn't require me to change all my TEXT fields to VARCHAR?

Thanks!
LVL 9
smidgie82Asked:
Who is Participating?
 
Kim RyanIT ConsultantCommented:
Ok then, index will be valid. Your query is very simple, just grouping on a single column so presumably the optimizer thinks  filesort is the most effiecient way to run this query when using TINYTEXT. As to why, I do not know.  You could try a more complex query with a join and see if that uses the index.

My thought would be the less data, the quicker the search time, so a varchar would use much less data than fixed width text.
0
 
Kim RyanIT ConsultantCommented:
For indexes on BLOB and TEXT columns, you must specify an index prefix length, such as
CREATE INDEX index_on_both ON ex2 (firstname(5), age);

Using varchars would save a lot of space if there is a big variation in the length of your text data.
0
 
smidgie82Author Commented:
Sorry, I was mistaken in my initial post.  I did specify the prefix length (MySQL wouldn't let me create the index without it).  I was just typing the commands above from memory, which apparently isn't as good as I thought.  (c:

So, here's the correction:
CREATE INDEX index_on_both ON ex2 (firstname(255), age);

But the rest of the data is accurate.  It's not using the index, apparently, even though the index prefix length is the maximum allowable length of the TINYTEXT field.  Why?
0
 
smidgie82Author Commented:
Okay, I've done a little research and a lot of experimenting, and here's what I've come up with:

I tried a more complicated query against a larger table, but a comparable scenario (the table that prompted this question in the first place).  Here's the table:

CREATE TABLE log
(
    id SERIAL PRIMARY KEY,
    ip VARCHAR(15),
    server VARCHAR(40),
    request TINYTEXT,
    date DATE,
    time TIME,
    response SMALLINT(3)
);

And the index:
CREATE INDEX index_all ON log (server, request(255), date, time);

And the query:
SELECT C.server, C.request, C.date, C.time, C.response
FROM
(
      SELECT A.server AS server, A.request AS request, A.date AS date, B.time AS time
      FROM
      (
            SELECT server, request, MAX(date) AS date
            FROM log
            GROUP BY server, request
      ) AS A
      INNER JOIN
      (
            SELECT server, request, date, MAX(time) AS time
            FROM log
            GROUP BY server, request, date
      ) AS B
      ON (A.server = B.server) AND (A.request = B.request) AND (A.date = B.date)
) AS D
INNER JOIN
log AS C
ON (D.server = C.server) AND (D.request = C.request) AND (D.date = c.date) AND (D.time = C.time);

The ultimate goal is to find the most recent response for each request, and the IP of the host making the request.  (Which I've found a much faster way to do since using a summary table, but whatever.)

For 10,000 records, the query took 53 seconds.  Here's the result of EXPLAIN:
+----+-------------+------------+------+---------------+-----------+---------+----------+-------+---------------------------------+
| id | select_type | table      | type | possible_keys | key       | key_len | ref      | rows  | Extra                           |
+----+-------------+------------+------+---------------+-----------+---------+----------+-------+---------------------------------+
|  1 | PRIMARY     | <derived2> | ALL  | NULL          | NULL      | NULL    | NULL     |  1006 |                                 |
|  1 | PRIMARY     | C          | ref  | index_all     | index_all | 32      | D.server |  3333 | Using where                     |
|  2 | DERIVED     | <derived3> | ALL  | NULL          | NULL      | NULL    | NULL     |  1006 |                                 |
|  2 | DERIVED     | <derived4> | ALL  | NULL          | NULL      | NULL    | NULL     |  1228 | Using where                     |
|  4 | DERIVED     | log3       | ALL  | NULL          | NULL      | NULL    | NULL     | 10000 | Using temporary; Using filesort |
|  3 | DERIVED     | log3       | ALL  | NULL          | NULL      | NULL    | NULL     | 10000 | Using temporary; Using filesort |
+----+-------------+------------+------+---------------+-----------+---------+----------+-------+---------------------------------+

After dropping the index, changing the request to VARCHAR(255), and adding the index back in (tried it both with and without a prefix length, same results).
Repeat the query, takes 0.64 seconds.
+----+-------------+------------+-------+---------------+-----------+---------+----------------------------------+-------+-------------+
| id | select_type | table      | type  | possible_keys | key       | key_len | ref                              | rows  | Extra       |
+----+-------------+------------+-------+---------------+-----------+---------+----------------------------------+-------+-------------+
|  1 | PRIMARY     | <derived2> | ALL   | NULL          | NULL      | NULL    | NULL                             |  1006 |             |
|  1 | PRIMARY     | C          | ref   | index_all     | index_all | 296     | D.server,D.request,D.date,D.time |     1 | Using where |
|  2 | DERIVED     | <derived3> | ALL   | NULL          | NULL      | NULL    | NULL                             |  1006 |             |
|  2 | DERIVED     | <derived4> | ALL   | NULL          | NULL      | NULL    | NULL                             |  1228 | Using where |
|  4 | DERIVED     | log2       | index | NULL          | index_all | 296     | NULL                             | 10000 | Using index |
|  3 | DERIVED     | log2       | index | NULL          | index_all | 296     | NULL                             | 10000 | Using index |
+----+-------------+------------+-------+---------------+-----------+---------+----------------------------------+-------+-------------+

After reading the section of the manual that deals with GROUP BY optimization via indexes, I came to the (apparently incorrect) conclusion that it decided to do a table scan because the index on the TINYTEXT column was assumed to be only a prefix, and therefore didn't necessarily uniquely identify all the columns.  But I used several prefix lengths on the VARCHAR version of the request field, and it always used the index, despite the fact that the index didn't uniquely identify all the values of the request column in that case, either.

Then I dropped all rows from the log table and added a unique index on log (server, request(255)) while request was specified as a TINYTEXT field, and repopulated the table, ignoring duplicates.  Tried again, and even though the index was guaranteed to uniquely identify all server, request pairs, it still wasn't being used.  I can't imagine that the optimizer would think filesort and a temporary table would be the fastest way to do this, given the evidence.  So, I'm gonna assume it's a bug, until I get evidence to the contrary.

In the mean time, while your advice didn't really answer my question (why isn't it using my index on the TINYTEXT column?), it certainly solved the ultimate concern (why is it taking forever to run this query?).  From 53 seconds to 0.64 for the query when I changed it from TINYTEXT to VARCHAR.  So, the points are yours.

Thanks!
0
 
Kim RyanIT ConsultantCommented:
Thanks for that. It may be woht psoting this to a an official MySQL forum or bug list.
0
Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.

All Courses

From novice to tech pro — start learning today.