Solved

MySQL: creating an index of type TEXT

Posted on 2009-05-11
26
573 Views
Last Modified: 2013-12-12
I'm trying to improve a slow query in my database by adding an index to a text column, but it's not working.
How can I speed up this query?

I've added an index to the table below for the 'msg' column:
ALTER TABLE logs ADD INDEX msg (msg(50));

But when I run EXPLAIN, it still isn't using the index:
EXPLAIN SELECT  host,msg, count(msg) as count FROM logs GROUP by host ORDER BY count DESC LIMIT 10;

+----+-------------+-------+-------+---------------+------+---------+------+-------+---------------------------------+
| id | select_type | table | type  | possible_keys | key  | key_len | ref  | rows  | Extra                           |
+----+-------------+-------+-------+---------------+------+---------+------+-------+---------------------------------+
|  1 | SIMPLE      | logs  | index | NULL          | host | 131     | NULL | 33638 | Using temporary; Using filesort |
+----+-------------+-------+-------+---------------+------+---------+------+-------+---------------------------------+




My Table:
 

Create Table: CREATE TABLE `logs` (
  `id` bigint(20) NOT NULL auto_increment,
  `host` varchar(128) default NULL,
  `facility` varchar(10) default NULL,
  `priority` varchar(10) default NULL,
  `level` varchar(10) default NULL,
  `tag` varchar(10) default NULL,
  `datetime` datetime default NULL,
  `program` varchar(15) default NULL,
  `msg` text,
  `seq` bigint(20) NOT NULL default '0',
  `counter` int(11) NOT NULL default '1',
  `fo` datetime default NULL,
  `lo` datetime default NULL,
  PRIMARY KEY  (`id`),
  KEY `datetime` (`datetime`),
  KEY `sequence` (`seq`),
  KEY `priority` (`priority`),
  KEY `facility` (`facility`),
  KEY `program` (`program`),
  KEY `host` (`host`)
) ENGINE=MyISAM AUTO_INCREMENT=15550475 DEFAULT CHARSET=latin1


Question by:cdukes

26 Comments
 
LVL 8

Expert Comment

by:Bobaran98
ID: 24357714
I've never used an ALTER TABLE statement to add an index after creation.  I've done some looking around, though, and it doesn't appear you can use an ALTER TABLE statement to do this... looks like it should be a CREATE INDEX statement.  Check out the documentation here:

http://dev.mysql.com/doc/refman/5.0/en/create-index.html
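
For reference, the equivalent CREATE INDEX form from that page (using the same 50-character prefix as the ALTER TABLE above) would be something like:

CREATE INDEX msg ON logs (msg(50));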
 

Author Comment

by:cdukes
ID: 24358790
It's the same as using ADD INDEX.

 
LVL 8

Expert Comment

by:Bobaran98
ID: 24358921
Okay.  Like I said, I've never used it.  Frankly, I've not used much in the way of indexes other than keys.  I'll let you know if I see anything else.  I'm always interested in learning of optimization techniques.
 
LVL 23

Expert Comment

by:Tiggerito
ID: 24363170
The GROUP BY host has triggered MySQL to use your host index. MySQL will generally only use one index per table in a query.

When you do the count(msg), that returns the number of non-null messages for each host. Is that what you want? If all entries have messages then I think count(*) gets the same results and may be a little faster.

I'm not sure if count(msg) is indexable. If it is, you could create a compound index on (host, msg) and see if that helps.

Changing host to an integer that is a foreign key to a host table would improve the host index size and speed. Numbers are better to index on than text.

You have a GROUP BY, are sorting by grouped data, and have no WHERE clause. This means MySQL has to load the whole table, find the counts, sort by the counts, then pick the top 10. I don't think indexing will help much as the whole table needs loading anyhow.
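
A minimal sketch of the compound index idea (assuming a 50-character prefix on msg, since a TEXT column can only be indexed by prefix):

ALTER TABLE logs ADD INDEX host_msg (host, msg(50));

Given the GROUP BY on host alone and no WHERE clause, this may not change the EXPLAIN output much, so it's worth re-checking with EXPLAIN after adding it.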
 

Author Comment

by:cdukes
ID: 24364619
Hmm...what I'm trying to do is get the top 10 messages per host in the database. The output should look something like this:

Can you help me by providing some examples?
mysql> SELECT  host,msg, count(msg) as count FROM logs GROUP by host ORDER BY count DESC LIMIT 10;

+-----------------------+-----------------------------------------+-------+
| host                  | msg                                     | count |
+-----------------------+-----------------------------------------+-------+
| 14.1.71.60            | msg1                                    | 23070 |
| 11.16.254.202         | msg2                                    | 16503 |
| 11.12.21.222          | msg3                                    | 11095 |
| 11.16.254.208         | msg4                                    |  9952 |
| 11.16.254.195         | msg5                                    |  7858 |
| 14.1.71.179           | msg6                                    |  7637 |
| 14.1.71.64            | msg7                                    |  6549 |
| 14.1.71.63            | msg8                                    |  6504 |
| 11.31.130.22          | msg9                                    |  6500 |
| 11.16.254.210         | msg10                                   |  5569 |
+-----------------------+-----------------------------------------+-------+

10 rows in set (0.95 sec)

 

Author Comment

by:cdukes
ID: 24364632
Note that I replaced the real messages above with msgX for readability...



 

Author Comment

by:cdukes
ID: 24364694
Also, I don't understand what you mean by:
Changing host to an integer that is a foreign key to a host table would improve the host index size and speed. numbers are better to index on than text.
 
LVL 23

Expert Comment

by:Tiggerito
ID: 24364853
Argh, Firefox crashed before I submitted! I'll try again...

Your query returns the top 10 hosts with the most messages, not the top 10 messages per host. And the message displayed will be a random one from each host. Could you clarify your needs?




 
LVL 23

Expert Comment

by:Tiggerito
ID: 24364894
I was suggesting you create a new hosts table with fields like 'id', 'ip', 'domainname' etc.

Then have your logs table refer to the id of the host, which would be an integer.

This way you reduce the size of your logs and speed up the indexing on the host field which will then be an integer.

You can then add more data about hosts without affecting the size of your logs.
 

Author Comment

by:cdukes
ID: 24366168
>Your query returns the top 10 hosts with the most messages, not the top 10 messages per host.
That's correct - I want the top 10 hosts with the most messages (not the top 10 messages per host)

>I was suggesting you create a new hosts table with fields like 'id', 'ip', 'domainname' etc.
I still don't understand how to do that, can you help?

Here's an example of an entry (hopefully it will format properly):



+----------+-----------------------+----------+----------+-------+------+---------------------+-----------+--------------------------------------------------------+-----+---------+------+------+
| id       | host                  | facility | priority | level | tag  | datetime            | program   | msg                                                    | seq | counter | fo   | lo   |
+----------+-----------------------+----------+----------+-------+------+---------------------+-----------+--------------------------------------------------------+-----+---------+------+------+
| 15551439 | 12.11.222.1           | syslog   | err      | err   | 2b   | 2009-05-11 11:16:02 | syslog-ng | syslog-ng[30729]: Connection broken; time_reopen='10'  |   0 |       1 | NULL | NULL |
+----------+-----------------------+----------+----------+-------+------+---------------------+-----------+--------------------------------------------------------+-----+---------+------+------+

 
LVL 23

Expert Comment

by:Tiggerito
ID: 24372775
The code snippet includes table structures and example data. The query you would use is this:

SELECT  hosts.ip, count(*) as count FROM logs LEFT JOIN hosts ON (logs.host = hosts.id) GROUP by host ORDER BY count DESC LIMIT 10;

Notes:

As I suspected, indexing in the logs table has no use here; your query requires every row.

I dropped the msg field in the results as it has no meaning.

I changed count(msg) to count(*) as it will be slightly faster.

I've added the join to the hosts table to get the ip. The join will slow things down a bit but improves your structure and the index on the GROUP BY. You may want to do timings with and without the join and hosts table to see which is faster in your real-world environment. Things will vary depending on the sizes of your logs and hosts tables.

logs
=====

CREATE TABLE `logs` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `host` int(10) unsigned NOT NULL,
  `facility` varchar(10) DEFAULT NULL,
  `priority` varchar(10) DEFAULT NULL,
  `level` varchar(10) DEFAULT NULL,
  `tag` varchar(10) DEFAULT NULL,
  `datetime` datetime DEFAULT NULL,
  `program` varchar(15) DEFAULT NULL,
  `msg` text,
  `seq` bigint(20) NOT NULL DEFAULT '0',
  `counter` int(11) NOT NULL DEFAULT '1',
  `fo` datetime DEFAULT NULL,
  `lo` datetime DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `datetime` (`datetime`),
  KEY `sequence` (`seq`),
  KEY `priority` (`priority`),
  KEY `facility` (`facility`),
  KEY `program` (`program`),
  KEY `host` (`host`)
) ENGINE=MyISAM AUTO_INCREMENT=15550485 DEFAULT CHARSET=latin1;

+----------+-----------------------+----------+----------+-------+------+---------------------+-----------+--------------------------------------------------------+-----+---------+------+------+
| id       | host                  | facility | priority | level | tag  | datetime            | program   | msg                                                    | seq | counter | fo   | lo   |
+----------+-----------------------+----------+----------+-------+------+---------------------+-----------+--------------------------------------------------------+-----+---------+------+------+
| 15551439 | 1                     | syslog   | err      | err   | 2b   | 2009-05-11 11:16:02 | syslog-ng | syslog-ng[30729]: Connection broken; time_reopen='10'  |   0 |       1 | NULL | NULL |
+----------+-----------------------+----------+----------+-------+------+---------------------+-----------+--------------------------------------------------------+-----+---------+------+------+

hosts
=====

CREATE TABLE `hosts` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `ip` varchar(15) NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=3 DEFAULT CHARSET=latin1;

+----------+-----------------------+
| id       | ip                    |
+----------+-----------------------+
| 1        | 12.11.222.1           |
+----------+-----------------------+

 

Author Comment

by:cdukes
ID: 24374478
Thanks!
I think I'm still missing something here though - how do I populate the data in the hosts table so that it matches the host in the logs table?

i.e. if I create the table and run that query I get:


mysql> SELECT  hosts.ip, count(*) as count FROM logs LEFT JOIN hosts ON (logs.host = hosts.id) GROUP by host ORDER BY count DESC LIMIT 10;
 

+------+-------+
| ip   | count |
+------+-------+
| NULL | 11870 |
| NULL |  6752 |
| NULL |  4534 |
| NULL |  4104 |
| NULL |  3863 |
| NULL |  3224 |
| NULL |  3199 |
| NULL |  3185 |
| NULL |  2670 |
| NULL |  2038 |
+------+-------+

 

Author Comment

by:cdukes
ID: 24374701
Oh... I see that you converted the original logs.host to an int, but how do I get what's currently there and put it into the new hosts table's ip column?

 
LVL 23

Expert Comment

by:Tiggerito
ID: 24374898
If you're not familiar with joining tables via primary and foreign keys then this may be more effort than it's worth. I'm not sure if it will improve things anyhow.

You would also need to update your logging code so that it detects or creates new hosts as required. Even more work!
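
For completeness, a hedged backfill sketch for that migration (the host_id column name is an assumption, and it presumes logs.host still holds the IP string):

INSERT INTO hosts (ip) SELECT DISTINCT host FROM logs;
ALTER TABLE logs ADD COLUMN host_id int(10) unsigned, ADD INDEX host_id (host_id);
UPDATE logs JOIN hosts ON hosts.ip = logs.host SET logs.host_id = hosts.id;
-- once verified, the old host column could be dropped and host_id renamed to host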

so stick with

SELECT  host, count(*) as count FROM logs GROUP by host ORDER BY count DESC LIMIT 10;

A simpler trick is that IP addresses can be represented as a number. If you store it that way (i.e. the field is numeric) then your select may be faster.

http://www.globalweb.com.ru/forum/about74.html
http://www.justin-cook.com/wp/2006/11/28/convert-an-ip-address-to-ip-number-with-php-asp-c-and-vbnet/
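
MySQL has built-in functions for that conversion (assuming IPv4 addresses), for example:

SELECT INET_ATON('12.11.222.1');  -- 202104321
SELECT INET_NTOA(202104321);      -- '12.11.222.1'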
 

Author Comment

by:cdukes
ID: 24375053
This:
SELECT  host, count(*) as count FROM logs GROUP by host ORDER BY count DESC LIMIT 10;
Provides the top 10 hosts and is fairly fast (compared to messages).
But I also need to get the top 10 messages, thus I was trying to do:
SELECT  host, msg, count(*) as count FROM logs GROUP by msg ORDER BY count DESC LIMIT 10;
which is very slow on a large table (6-10 million rows).

There has to be a fast way to do this as there are commercial software packages out there for logging systems that perform quickly - they're doing it somehow :-)

 
LVL 23

Expert Comment

by:Tiggerito
ID: 24380910

SELECT  host, msg, count(*) as count FROM logs GROUP by msg ORDER BY count DESC LIMIT 10;

Different query and different issues. In this case a msg index may help, but as msg is TEXT it cannot be indexed directly (only with a prefix length, which won't help a GROUP BY on the full value). You would need to change it to a limited-size varchar so you can index it.

Another way is to have a hash code for each msg and index on that, but that's another story in itself.

Better yet, you should track each message group with its own id, which you can index and query on very quickly.

 

Author Comment

by:cdukes
ID: 24381155
>you should track each message group with its own id which you can index
Can you expand upon this? How can I accomplish it?

btw, thanks for all your help!

 
LVL 23

Expert Comment

by:Tiggerito
ID: 24382937
Could you explain how you group messages? Is it just by the msg field?

What you should do is something like forums, which create threads of messages. When a new thread is started it is allocated a number/id. Logs will have a thread field so each new log is linked with a thread.

Then your select would be

SELECT  host, msg, count(*) as count FROM logs GROUP by thread ORDER BY count DESC LIMIT 10;
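
A minimal sketch of that idea, assuming a separate messages table and an indexed msg_id column on logs (both names are hypothetical):

CREATE TABLE messages (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `msg` text,
  PRIMARY KEY (`id`)
) ENGINE=MyISAM;

-- logs would then store an indexed msg_id instead of grouping on the raw text:
SELECT host, messages.msg, count(*) as count
FROM logs JOIN messages ON logs.msg_id = messages.id
GROUP BY logs.msg_id ORDER BY count DESC LIMIT 10;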
 

Author Comment

by:cdukes
ID: 24391352
Messages are grouped on the msg field alone; an incoming syslog message is stored like so:


INSERT INTO logs (host, facility, priority, level, tag, datetime, program, msg) VALUES ( '$HOST', '$FACILITY', '$PRIORITY', '$LEVEL', '$TAG', '$YEAR-$MONTH-$DAY $HOUR:$MIN:$SEC', '$PROGRAM', '$MSG')

 
LVL 23

Expert Comment

by:Tiggerito
ID: 24410206
If it's just a plain dump of a syslog then you probably will have to stick to using msg as your index.

It would be quite a bit more complex to take the data and re-structure it on the fly.
 

Author Comment

by:cdukes
ID: 24411801
I do have the option of selecting how the data gets inserted from a config file, if that helps (see template below).
I can also use a FIFO pipe to do more magic through a Perl script, etc. if needed.

 
 template("INSERT INTO logs (host, facility, priority, level, tag, datetime, program, msg)

    VALUES ( '$HOST', '$FACILITY', '$PRIORITY', '$LEVEL', '$TAG', '$YEAR-$MONTH-$DAY $HOUR:$MIN:$SEC', '$PROGRAM', '$MSG');\n")

 
LVL 23

Accepted Solution

by:
Tiggerito earned 500 total points
ID: 24412107
It's quite a bit more work to pull the msg out into a separate table, see if it's already there (add or reuse), get the id for the msg, and then add your log.

I don't know PHP so you will need other help in setting up that process. It will end up slowing down logging for a potential improvement in your query speed.

Another option is to log as normal then run nightly scripts to adjust the data so that it can be indexed better. Again, not trivial.

If your log table is going to get very large, you may find that indexes become a speed issue when adding logs. In that case the solution is often to do daily transfers of the data from the live log to an archived log. The archived log is tuned for queries, while the live log is tuned for inserts (i.e. no indexes). Even less trivial.
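
A hedged sketch of that daily transfer (logs_archive is a hypothetical table created with the same structure as logs), run nightly from a script:

INSERT INTO logs_archive SELECT * FROM logs WHERE datetime < CURDATE();
DELETE FROM logs WHERE datetime < CURDATE();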

The simplest idea I can think of to try and improve the index is to create a hash of the msg field. Hashes are smaller and so index better. The downside is you may get the odd false positive match.
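
A minimal sketch of the hash idea, assuming a new msg_hash column and MD5 (both are assumptions, not part of the original schema):

ALTER TABLE logs ADD COLUMN msg_hash char(32), ADD INDEX msg_hash (msg_hash);
UPDATE logs SET msg_hash = MD5(msg);
-- new inserts would also set msg_hash = MD5(msg); grouping then uses the indexed hash:
SELECT host, msg, count(*) as count FROM logs GROUP BY msg_hash ORDER BY count DESC LIMIT 10;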

 
LVL 8

Expert Comment

by:Bobaran98
ID: 24457774
@cdukes... Well, I got left behind early in this discussion, but I continued to get the email updates every time one of you posted.  I've not received anything for a while, though.  Did you get your issue resolved?  Did one of these posts help?  If so, I suggest closing out the question and awarding Tiggerito some well-deserved points.
 

Author Comment

by:cdukes
ID: 24473323
@Bobaran98,
Although the solution still hasn't been found, I do feel Tiggerito should be awarded the points for his work.
I'll go ahead and assign them.
Thanks.
 

Author Closing Comment

by:cdukes
ID: 31580178
Although we didn't find an exact solution, Tiggerito has been very helpful so I feel he deserves the points.
 
LVL 23

Expert Comment

by:Tiggerito
ID: 24478834
thanks guys