cdukes
asked on
MySQL: creating an index of type TEXT
I'm trying to improve a slow query in my database by adding an index to a text column, but it's not working.
How can I speed up this query?
I've added an index to the table below for the 'msg' column:
ALTER TABLE logs ADD INDEX msg (msg(50));
But when I run EXPLAIN, it still isn't using the index:
EXPLAIN SELECT host,msg, count(msg) as count FROM logs GROUP by host ORDER BY count DESC LIMIT 10;
+----+-------------+------ -+-------+ ---------- -----+---- --+------- --+------+ -------+-- ---------- ---------- ---------- -+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------ -+-------+ ---------- -----+---- --+------- --+------+ -------+-- ---------- ---------- ---------- -+
| 1 | SIMPLE | logs | index | NULL | host | 131 | NULL | 33638 | Using temporary; Using filesort |
+----+-------------+------ -+-------+ ---------- -----+---- --+------- --+------+ -------+-- ---------- ---------- ---------- -+
How can I speed up this query?
I've added an index to the table below for the 'msg' column:
ALTER TABLE logs ADD INDEX msg (msg(50));
But when I run EXPLAIN, it still isn't using the index:
EXPLAIN SELECT host,msg, count(msg) as count FROM logs GROUP by host ORDER BY count DESC LIMIT 10;
+----+-------------+------
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------
| 1 | SIMPLE | logs | index | NULL | host | 131 | NULL | 33638 | Using temporary; Using filesort |
+----+-------------+------
My Table:
Create Table: CREATE TABLE `logs` (
`id` bigint(20) NOT NULL auto_increment,
`host` varchar(128) default NULL,
`facility` varchar(10) default NULL,
`priority` varchar(10) default NULL,
`level` varchar(10) default NULL,
`tag` varchar(10) default NULL,
`datetime` datetime default NULL,
`program` varchar(15) default NULL,
`msg` text,
`seq` bigint(20) NOT NULL default '0',
`counter` int(11) NOT NULL default '1',
`fo` datetime default NULL,
`lo` datetime default NULL,
PRIMARY KEY (`id`),
KEY `datetime` (`datetime`),
KEY `sequence` (`seq`),
KEY `priority` (`priority`),
KEY `facility` (`facility`),
KEY `program` (`program`),
KEY `host` (`host`)
) ENGINE=MyISAM AUTO_INCREMENT=15550475 DEFAULT CHARSET=latin1
ASKER
It's the same as using ADD INDEX.
Okay. Like I said, I've never used it. Frankly, I've not used much in the way of indexes other than keys. I'll let you know if I see anything else. I'm always interested in learning of optimization techniques.
The GROUP BY host has triggered MySQL to use your host index. MySQL only uses one index per table!
When you do the count(msg), that returns the number of non-null messages for each host. is that what you want? If all entries have messages then I think count(*) gets the same results and may be a little faster.
I'm not sure if count(msg) is indexable. If it is, you should create a compound index with [host,msg] and see if that helps.
Changing host to an integer that is a foreign key to a host table would improve the host index size and speed. numbers are better to index on than text.
You have a GROUP BY and are sorting by grouped data and have no WHERE clause. This means MySQL has to load the whole table, find the counts, sort by the counts then pick the top 10. I don't think indexing will help much as the whole table needs loading anyhow.
When you do the count(msg), that returns the number of non-null messages for each host. is that what you want? If all entries have messages then I think count(*) gets the same results and may be a little faster.
I'm not sure if count(msg) is indexable. If it is, you should create a compound index with [host,msg] and see if that helps.
Changing host to an integer that is a foreign key to a host table would improve the host index size and speed. numbers are better to index on than text.
You have a GROUP BY and are sorting by grouped data and have no WHERE clause. This means MySQL has to load the whole table, find the counts, sort by the counts then pick the top 10. I don't think indexing will help much as the whole table needs loading anyhow.
ASKER
Hmm...what I'm trying to do is get the top 10 messages per host in the database. The output should look something like this:
Can you help me by providing some examples?
Can you help me by providing some examples?
mysql> SELECT host,msg, count(msg) as count FROM logs GROUP by host ORDER BY count DESC LIMIT 10;
+-----------------------+-----------------------------------------+-------+
| host | msg | count |
+-----------------------+-----------------------------------------+-------+
| 14.1.71.60 | msg1 | 23070 |
| 11.16.254.202 | msg2 | 16503 |
| 11.12.21.222 | msg3 | 11095 |
| 11.16.254.208 | msg4 | 9952 |
| 11.16.254.195 | msg5 | 7858 |
| 14.1.71.179 | msg6 | 7637 |
| 14.1.71.64 | msg7 | 6549 |
| 14.1.71.63 | msg8 | 6504 |
| 11.31.130.22 | msg9 | 6500 |
| 11.16.254.210 | msg10 | 5569 |
+-----------------------+-----------------------------------------+-------+
10 rows in set (0.95 sec)
ASKER
note that I replaced the real messages above with msgX for readability...
ASKER
Also, I don't understand what you mean by:
Changing host to an integer that is a foreign key to a host table would improve the host index size and speed. numbers are better to index on than text.
Changing host to an integer that is a foreign key to a host table would improve the host index size and speed. numbers are better to index on than text.
Argh, firefox crashed before I submitted! I'll try again...
Your query returns the top 10 hosts with the most messages, not the top 10 messages per host. And the message displayed will be a random one from each host. Could you clarify your needs?
Your query returns the top 10 hosts with the most messages, not the top 10 messages per host. And the message displayed will be a random one from each host. Could you clarify your needs?
I was suggesting you create a new hosts table with fields like 'id', 'ip', 'domainname' etc.
Then have your logs table refer to the id of the host, which would be an integer.
This way you reduce the size of your logs and speed up the indexing on the host field which will then be an integer.
You can then add more data about hosts without affecting the size of your logs.
Then have your logs table refer to the id of the host, which would be an integer.
This way you reduce the size of your logs and speed up the indexing on the host field which will then be an integer.
You can then add more data about hosts without affecting the size of your logs.
ASKER
>Your query returns the top 10 hosts with the most messages, not the top 10 messages per host.
That's correct - I want the top 10 hosts with the most messages (not the top 10 messages per host)
>I was suggesting you create a new hosts table with fields like 'id', 'ip', 'domainname' etc.
I still don't understand how to do that, can you help?
Here's an example of and entry (hopefully it will format properly):
That's correct - I want the top 10 hosts with the most messages (not the top 10 messages per host)
>I was suggesting you create a new hosts table with fields like 'id', 'ip', 'domainname' etc.
I still don't understand how to do that, can you help?
Here's an example of and entry (hopefully it will format properly):
+----------+-----------------------+----------+----------+-------+------+---------------------+-----------+--------------------------------------------------------+-----+---------+------+------+
| id | host | facility | priority | level | tag | datetime | program | msg | seq | counter | fo | lo |
+----------+-----------------------+----------+----------+-------+------+---------------------+-----------+--------------------------------------------------------+-----+---------+------+------+
| 15551439 | 12.11.222.1 | syslog | err | err | 2b | 2009-05-11 11:16:02 | syslog-ng | syslog-ng[30729]: Connection broken; time_reopen='10' | 0 | 1 | NULL | NULL |
+----------+-----------------------+----------+----------+-------+------+---------------------+-----------+--------------------------------------------------------+-----+---------+------+------+
The code snippet includes table structures and example data. The query you would use is this:
SELECT hosts.ip, count(*) as count FROM logs LEFT JOIN hosts ON (logs.host = hosts.id) GROUP by host ORDER BY count DESC LIMIT 10;
Notes:
As I suspected, indexing in the logs table has no use here. your query requires every row.
I dropped the msg field in the results as it has no meaning
I changed count(msg) to count(*) as it will be slightly faster
I've added the join to the host table to get the ip. The join will slow things down a bit but improves your structure and the index on the GROUP BY. You may want to do timings with or without the join and hosts table to see which is faster in your real world environment. Things will vary depending on the sizes of your logs and hosts tables.
SELECT hosts.ip, count(*) as count FROM logs LEFT JOIN hosts ON (logs.host = hosts.id) GROUP by host ORDER BY count DESC LIMIT 10;
Notes:
As I suspected, indexing in the logs table has no use here. your query requires every row.
I dropped the msg field in the results as it has no meaning
I changed count(msg) to count(*) as it will be slightly faster
I've added the join to the host table to get the ip. The join will slow things down a bit but improves your structure and the index on the GROUP BY. You may want to do timings with or without the join and hosts table to see which is faster in your real world environment. Things will vary depending on the sizes of your logs and hosts tables.
logs
=====
CREATE TABLE `logs` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`host` int(10) unsigned NOT NULL,
`facility` varchar(10) DEFAULT NULL,
`priority` varchar(10) DEFAULT NULL,
`level` varchar(10) DEFAULT NULL,
`tag` varchar(10) DEFAULT NULL,
`datetime` datetime DEFAULT NULL,
`program` varchar(15) DEFAULT NULL,
`msg` text,
`seq` bigint(20) NOT NULL DEFAULT '0',
`counter` int(11) NOT NULL DEFAULT '1',
`fo` datetime DEFAULT NULL,
`lo` datetime DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `datetime` (`datetime`),
KEY `sequence` (`seq`),
KEY `priority` (`priority`),
KEY `facility` (`facility`),
KEY `program` (`program`),
KEY `host` (`host`)
) ENGINE=MyISAM AUTO_INCREMENT=15550485 DEFAULT CHARSET=latin1;
+----------+-----------------------+----------+----------+-------+------+---------------------+-----------+--------------------------------------------------------+-----+---------+------+------+
| id | host | facility | priority | level | tag | datetime | program | msg | seq | counter | fo | lo |
+----------+-----------------------+----------+----------+-------+------+---------------------+-----------+--------------------------------------------------------+-----+---------+------+------+
| 15551439 | 1 | syslog | err | err | 2b | 2009-05-11 11:16:02 | syslog-ng | syslog-ng[30729]: Connection broken; time_reopen='10' | 0 | 1 | NULL | NULL |
+----------+-----------------------+----------+----------+-------+------+---------------------+-----------+--------------------------------------------------------+-----+---------+------+------+
hosts
=====
CREATE TABLE `hosts` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`ip` varchar(15) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=3 DEFAULT CHARSET=latin1;
+----------+-----------------------+
| id | ip |
+----------+-----------------------+
| 1 | 12.11.222.1 |
+----------+-----------------------+
ASKER
Thanks!
I think I'm still missing something here though - how do I populate the data in the hosts table so that it matches the host in the logs table?
i.e. if I create the table and run that query I get:
I think I'm still missing something here though - how do I populate the data in the hosts table so that it matches the host in the logs table?
i.e. if I create the table and run that query I get:
mysql> SELECT hosts.ip, count(*) as count FROM logs LEFT JOIN hosts ON (logs.host = hosts.id) GROUP by host ORDER BY count DESC LIMIT 10;
+------+-------+
| ip | count |
+------+-------+
| NULL | 11870 |
| NULL | 6752 |
| NULL | 4534 |
| NULL | 4104 |
| NULL | 3863 |
| NULL | 3224 |
| NULL | 3199 |
| NULL | 3185 |
| NULL | 2670 |
| NULL | 2038 |
+------+-------+
ASKER
oh...I see that you converted the original logs.host to an int, but how do I get what's currently there and put it into the new hosts.ip table?
If your not familiar with joining tables via primary and foreign keys then this may be more effort than its worth. I'm not sure if it will improve things anyhow.
You would also need to update your logging code so that it detects or creates new hosts as required. Even more work!
so stick with
SELECT host, count(*) as count FROM logs GROUP by host ORDER BY count DESC LIMIT 10;
A simpler trick is that ip addresses can be represented as a number. If you store it that way (i.e. the field is numeric) then your select may be faster.
http://www.globalweb.com.ru/forum/about74.html
http://www.justin-cook.com/wp/2006/11/28/convert-an-ip-address-to-ip-number-with-php-asp-c-and-vbnet/
You would also need to update your logging code so that it detects or creates new hosts as required. Even more work!
so stick with
SELECT host, count(*) as count FROM logs GROUP by host ORDER BY count DESC LIMIT 10;
A simpler trick is that ip addresses can be represented as a number. If you store it that way (i.e. the field is numeric) then your select may be faster.
http://www.globalweb.com.ru/forum/about74.html
http://www.justin-cook.com/wp/2006/11/28/convert-an-ip-address-to-ip-number-with-php-asp-c-and-vbnet/
ASKER
This:
SELECT host, count(*) as count FROM logs GROUP by host ORDER BY count DESC LIMIT 10;
Provides the top 10 hosts and is fairly fast(compared to messages)
But I also need to get the top 10 messages, thus I was trying to do:
SELECT host, msg, count(*) as count FROM logs GROUP by msg ORDER BY count DESC LIMIT 10;
which is very slow on a large table (6-10 million rows)
There has to be a fast way to do this as there are commercial software packages out there for logging systems that perform quickly - they're doing it somehow :-)
SELECT host, count(*) as count FROM logs GROUP by host ORDER BY count DESC LIMIT 10;
Provides the top 10 hosts and is fairly fast(compared to messages)
But I also need to get the top 10 messages, thus I was trying to do:
SELECT host, msg, count(*) as count FROM logs GROUP by msg ORDER BY count DESC LIMIT 10;
which is very slow on a large table (6-10 million rows)
There has to be a fast way to do this as there are commercial software packages out there for logging systems that perform quickly - they're doing it somehow :-)
SELECT host, msg, count(*) as count FROM logs GROUP by msg ORDER BY count DESC LIMIT 10;
Different query and different issues. In this case a msg index may help, but as msg is TEXT it cannot be indexed. You would need to change it to a limited size varchar so you can index it.
Another way is to have a hash code for each msg and index on that, but thats another story in itself.
Best yet, you should track each message group with its own id which you can index and query on very quickly
ASKER
>you should track each message group with its own id which you can index
Can you expand upon this? How can I accomplish it?
btw, thanks for all your help!
Can you expand upon this? How can I accomplish it?
btw, thanks for all your help!
Could you explain how you group messages. Is it just by the msg field?
What you should do is something like forums which create threads of messages. When a new thread is started it is allocate a number/id. Logs will have a thread field so each new log is linked with a thread.
Then your select woud be
SELECT host, msg, count(*) as count FROM logs GROUP by thread ORDER BY count DESC LIMIT 10;
What you should do is something like forums which create threads of messages. When a new thread is started it is allocate a number/id. Logs will have a thread field so each new log is linked with a thread.
Then your select woud be
SELECT host, msg, count(*) as count FROM logs GROUP by thread ORDER BY count DESC LIMIT 10;
ASKER
Messages are grouped alone, an incoming syslog message is stored like so:
INSERT INTO logs (host, facility, priority, level, tag, datetime, program, msg) VALUES ( '$HOST', '$FACILITY', '$PRIORITY', '$LEVEL', '$TAG', '$YEAR-$MONTH-$DAY $HOUR:$MIN:$SEC', '$PROGRAM', '$MSG')
If its just a plain dump of a syslog then you probably will have to stick to using msg as your index.
It would be quite a bit more complex to take the data and re-structure it on the fly.
It would be quite a bit more complex to take the data and re-structure it on the fly.
ASKER
I do have the option of selecting how the data get's inserted from a config file if that helps? (see template below)
I can also use a FIFO pipe to do more magic through a perl script, etc. if needed.
I can also use a FIFO pipe to do more magic through a perl script, etc. if needed.
template("INSERT INTO logs (host, facility, priority, level, tag, datetime, program, msg)
VALUES ( '$HOST', '$FACILITY', '$PRIORITY', '$LEVEL', '$TAG', '$YEAR-$MONTH-$DAY $HOUR:$MIN:$SEC', '$PROGRAM', '$MSG');\n")
ASKER CERTIFIED SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
@cdukes... Well, I got left behind early in this discussion, but I continued to get the email updates every time one of you posted. I've not received anything for a while, though. Did you get your issue resolved? Did one of these posts help? If so, I suggest closing out the question and awarding Tiggerito some well-deserved points.
ASKER
@Bobran,
Although the solution still hasn't been found, I do feel Tiggerito should be awarded the points for his work.
I'll go ahead and assign them.
Thanks.
Although the solution still hasn't been found, I do feel Tiggerito should be awarded the points for his work.
I'll go ahead and assign them.
Thanks.
ASKER
Although we didn't find an exact solution, Tiggerito has been very helpful so I feel he deserves the points.
thanks guys
http://dev.mysql.com/doc/refman/5.0/en/create-index.html