Link to home
Start Free TrialLog in
Avatar of cdukes
cdukesFlag for United States of America

asked on

MySQL: creating an index of type TEXT

I'm trying to improve a slow query in my database by adding an index to a text column, but it's not working.
How can I speed up this query?

I've added an index to the table below for the 'msg' column:
ALTER TABLE logs ADD INDEX msg (msg(50));

But when I run EXPLAIN, it still isn't using the index:
EXPLAIN SELECT  host,msg, count(msg) as count FROM logs GROUP by host ORDER BY count DESC LIMIT 10;

 +----+-------------+-------+-------+---------------+------+---------+------+-------+---------------------------------+
| id | select_type | table | type  | possible_keys | key  | key_len | ref  | rows  | Extra                           |
+----+-------------+-------+-------+---------------+------+---------+------+-------+---------------------------------+
|  1 | SIMPLE      | logs  | index | NULL          | host | 131     | NULL | 33638 | Using temporary; Using filesort |
+----+-------------+-------+-------+---------------+------+---------+------+-------+---------------------------------+




My Table:
 
Create Table: CREATE TABLE `logs` (
  `id` bigint(20) NOT NULL auto_increment,
  `host` varchar(128) default NULL,
  `facility` varchar(10) default NULL,
  `priority` varchar(10) default NULL,
  `level` varchar(10) default NULL,
  `tag` varchar(10) default NULL,
  `datetime` datetime default NULL,
  `program` varchar(15) default NULL,
  `msg` text,
  `seq` bigint(20) NOT NULL default '0',
  `counter` int(11) NOT NULL default '1',
  `fo` datetime default NULL,
  `lo` datetime default NULL,
  PRIMARY KEY  (`id`),
  KEY `datetime` (`datetime`),
  KEY `sequence` (`seq`),
  KEY `priority` (`priority`),
  KEY `facility` (`facility`),
  KEY `program` (`program`),
  KEY `host` (`host`)
) ENGINE=MyISAM AUTO_INCREMENT=15550475 DEFAULT CHARSET=latin1

Open in new window

Avatar of Bobaran98
Bobaran98
Flag of United States of America image

I've never used an ALTER TABLE statement to add an index after creation.  I've done some looking around, though, and it doesn't appear you can use an ALTER TABLE statement to do this... looks like it should be a CREATE INDEX table.  Check out the documentation here:

http://dev.mysql.com/doc/refman/5.0/en/create-index.html
Avatar of cdukes

ASKER

It's the same as using ADD INDEX.

Okay.  Like I said, I've never used it.  Frankly, I've not used much in the way of indexes other than keys.  I'll let you know if I see anything else.  I'm always interested in learning of optimization techniques.
The GROUP BY host has triggered MySQL to use your host index. MySQL only uses one index per table!

When you do the count(msg), that returns the number of non-null messages for each host. is that what you want? If all entries have messages then I think count(*) gets the same results and may be a little faster.

I'm not sure if count(msg) is indexable. If it is, you should create a compound index with [host,msg] and see if that helps.

Changing host to an integer that is a foreign key to a host table would improve the host index size and speed. numbers are better to index on than text.

You have a GROUP BY and are sorting by grouped data and have no WHERE clause. This means MySQL has to load the whole table, find the counts, sort by the counts then pick the top 10. I don't think indexing will help much as the whole table needs loading anyhow.
Avatar of cdukes

ASKER

Hmm...what I'm trying to do is get the top 10 messages per host in the database. The output should look something like this:

Can you help me by providing some examples?
mysql> SELECT  host,msg, count(msg) as count FROM logs GROUP by host ORDER BY count DESC LIMIT 10;
+-----------------------+-----------------------------------------+-------+
| host                  | msg                                                                                                                                                                             | count |
+-----------------------+-----------------------------------------+-------+
| 14.1.71.60            | msg1                                    | 23070 |
| 11.16.254.202         | msg2                                    | 16503 |
| 11.12.21.222          | msg3                                    | 11095 |
| 11.16.254.208         | msg4                                    |  9952 |
| 11.16.254.195         | msg5                                    |  7858 |
| 14.1.71.179           | msg6                                    |  7637 |
| 14.1.71.64            | msg7                                    |  6549 |
| 14.1.71.63            | msg8                                    |  6504 |
| 11.31.130.22          | msg9                                    |  6500 |
| 11.16.254.210         | msg10                                   |  5569 |
+-----------------------+-----------------------------------------+-------+
10 rows in set (0.95 sec)

Open in new window

Avatar of cdukes

ASKER

note that I replaced the real messages above with msgX for readability...



Avatar of cdukes

ASKER

Also, I don't understand what you mean by:
Changing host to an integer that is a foreign key to a host table would improve the host index size and speed. numbers are better to index on than text.
Argh, firefox crashed before I submitted! I'll try again...

Your query returns the top 10 hosts with the most messages, not the top 10 messages per host. And the message displayed will be a random one from each host. Could you clarify your needs?




I was suggesting you create a new hosts table with fields like 'id', 'ip', 'domainname' etc.

Then have your logs table refer to the id of the host, which would be an integer.

This way you reduce the size of your logs and speed up the indexing on the host field which will then be an integer.

You can then add more data about hosts without affecting the size of your logs.
Avatar of cdukes

ASKER

>Your query returns the top 10 hosts with the most messages, not the top 10 messages per host.
That's correct - I want the top 10 hosts with the most messages (not the top 10 messages per host)

>I was suggesting you create a new hosts table with fields like 'id', 'ip', 'domainname' etc.
I still don't understand how to do that, can you help?

Here's an example of and entry (hopefully it will format properly):



+----------+-----------------------+----------+----------+-------+------+---------------------+-----------+--------------------------------------------------------+-----+---------+------+------+
| id       | host                  | facility | priority | level | tag  | datetime            | program   | msg                                                    | seq | counter | fo   | lo   |
+----------+-----------------------+----------+----------+-------+------+---------------------+-----------+--------------------------------------------------------+-----+---------+------+------+
| 15551439 | 12.11.222.1           | syslog   | err      | err   | 2b   | 2009-05-11 11:16:02 | syslog-ng | syslog-ng[30729]: Connection broken; time_reopen='10'  |   0 |       1 | NULL | NULL |
+----------+-----------------------+----------+----------+-------+------+---------------------+-----------+--------------------------------------------------------+-----+---------+------+------+

Open in new window

The code snippet includes table structures and example data. The query you would use is this:

SELECT  hosts.ip, count(*) as count FROM logs LEFT JOIN hosts ON (logs.host = hosts.id) GROUP by host ORDER BY count DESC LIMIT 10;

Notes:

As I suspected, indexing in the logs table has no use here. your query requires every row.

I dropped the msg field in the results as it has no meaning

I changed count(msg) to count(*) as it will be slightly faster

I've added the join to the host table to get the ip. The join will slow things down a bit but improves your structure and the index on the GROUP BY. You may want to do timings with or without the join and hosts table to see which is faster in your real world environment. Things will vary depending on the sizes of your logs and hosts tables.

logs
=====
CREATE TABLE  `logs` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `host` int(10) unsigned NOT NULL,
  `facility` varchar(10) DEFAULT NULL,
  `priority` varchar(10) DEFAULT NULL,
  `level` varchar(10) DEFAULT NULL,
  `tag` varchar(10) DEFAULT NULL,
  `datetime` datetime DEFAULT NULL,
  `program` varchar(15) DEFAULT NULL,
  `msg` text,
  `seq` bigint(20) NOT NULL DEFAULT '0',
  `counter` int(11) NOT NULL DEFAULT '1',
  `fo` datetime DEFAULT NULL,
  `lo` datetime DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `datetime` (`datetime`),
  KEY `sequence` (`seq`),
  KEY `priority` (`priority`),
  KEY `facility` (`facility`),
  KEY `program` (`program`),
  KEY `host` (`host`)
) ENGINE=MyISAM AUTO_INCREMENT=15550485 DEFAULT CHARSET=latin1;
 
+----------+-----------------------+----------+----------+-------+------+---------------------+-----------+--------------------------------------------------------+-----+---------+------+------+
| id       | host                  | facility | priority | level | tag  | datetime            | program   | msg                                                    | seq | counter | fo   | lo   |
+----------+-----------------------+----------+----------+-------+------+---------------------+-----------+--------------------------------------------------------+-----+---------+------+------+
| 15551439 | 1                     | syslog   | err      | err   | 2b   | 2009-05-11 11:16:02 | syslog-ng | syslog-ng[30729]: Connection broken; time_reopen='10'  |   0 |       1 | NULL | NULL |
+----------+-----------------------+----------+----------+-------+------+---------------------+-----------+--------------------------------------------------------+-----+---------+------+------+
 
 
 
hosts
=====
CREATE TABLE  `hosts` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `ip` varchar(15) NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=3 DEFAULT CHARSET=latin1;
 
+----------+-----------------------+
| id       | ip                    | 
+----------+-----------------------+
| 1        | 12.11.222.1           |
+----------+-----------------------+

Open in new window

Avatar of cdukes

ASKER

Thanks!
I think I'm still missing something here though - how do I populate the data in the hosts table so that it matches the host in the logs table?

i.e. if I create the table and run that query I get:


mysql> SELECT  hosts.ip, count(*) as count FROM logs LEFT JOIN hosts ON (logs.host = hosts.id) GROUP by host ORDER BY count DESC LIMIT 10;
 
+------+-------+
| ip   | count |
+------+-------+
| NULL | 11870 |
| NULL |  6752 |
| NULL |  4534 |
| NULL |  4104 |
| NULL |  3863 |
| NULL |  3224 |
| NULL |  3199 |
| NULL |  3185 |
| NULL |  2670 |
| NULL |  2038 |
+------+-------+

Open in new window

Avatar of cdukes

ASKER

oh...I see that you converted the original logs.host to an int, but how do I get what's currently there and put it into the new hosts.ip table?

If your not familiar with joining tables via primary and foreign keys then this may be more effort than its worth. I'm not sure if it will improve things anyhow.

You would also need to update your logging code so that it detects or creates new hosts as required. Even more work!

so stick with

SELECT  host, count(*) as count FROM logs GROUP by host ORDER BY count DESC LIMIT 10;

A simpler trick is that ip addresses can be represented as a number. If you store it that way (i.e. the field is numeric) then your select may be faster.

http://www.globalweb.com.ru/forum/about74.html
http://www.justin-cook.com/wp/2006/11/28/convert-an-ip-address-to-ip-number-with-php-asp-c-and-vbnet/
Avatar of cdukes

ASKER

This:
SELECT  host, count(*) as count FROM logs GROUP by host ORDER BY count DESC LIMIT 10;
Provides the top 10 hosts and is fairly fast(compared to messages)
But I also need to get the top 10 messages, thus I was trying to do:
SELECT  host, msg, count(*) as count FROM logs GROUP by msg ORDER BY count DESC LIMIT 10;
which is very slow on a large table (6-10 million rows)

There has to be a fast way to do this as there are commercial software packages out there for logging systems that perform quickly - they're doing it somehow :-)


SELECT  host, msg, count(*) as count FROM logs GROUP by msg ORDER BY count DESC LIMIT 10;

Different query and different issues. In this case a msg index may help, but as msg is TEXT it cannot be indexed. You would need to change it to a limited size varchar so you can index it.

Another way is to have a hash code for each msg and index on that, but thats another story in itself.


Best yet, you should track each message group with its own id which you can index and query on very quickly

Avatar of cdukes

ASKER

>you should track each message group with its own id which you can index
Can you expand upon this? How can I accomplish it?

btw, thanks for all your help!

Could you explain how you group messages. Is it just by the msg field?

What you should do is something like forums which create threads of messages. When a new thread is started it is allocate a number/id. Logs will have a thread field so each new log is linked with a thread.

Then your select woud be

SELECT  host, msg, count(*) as count FROM logs GROUP by thread ORDER BY count DESC LIMIT 10;
Avatar of cdukes

ASKER

Messages are grouped alone, an incoming syslog message is stored like so:


INSERT INTO logs (host, facility, priority, level, tag, datetime, program, msg) VALUES ( '$HOST', '$FACILITY', '$PRIORITY', '$LEVEL', '$TAG', '$YEAR-$MONTH-$DAY $HOUR:$MIN:$SEC', '$PROGRAM', '$MSG')

Open in new window

If its just a plain dump of a syslog then you probably will have to stick to using msg as your index.

It would be quite a bit more complex to take the data and re-structure it on the fly.
Avatar of cdukes

ASKER

I do have the option of selecting how the data get's inserted from a config file if that helps? (see template below)
I can also use a FIFO pipe to do more magic through a perl script, etc. if needed.

 
 template("INSERT INTO logs (host, facility, priority, level, tag, datetime, program, msg)
    VALUES ( '$HOST', '$FACILITY', '$PRIORITY', '$LEVEL', '$TAG', '$YEAR-$MONTH-$DAY $HOUR:$MIN:$SEC', '$PROGRAM', '$MSG');\n")

Open in new window

ASKER CERTIFIED SOLUTION
Avatar of Tony McCreath
Tony McCreath
Flag of Australia image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
@cdukes... Well, I got left behind early in this discussion, but I continued to get the email updates every time one of you posted.  I've not received anything for a while, though.  Did you get your issue resolved?  Did one of these posts help?  If so, I suggest closing out the question and awarding Tiggerito some well-deserved points.
Avatar of cdukes

ASKER

@Bobran,
Although the solution still hasn't been found, I do feel Tiggerito should be awarded the points for his work.
I'll go ahead and assign them.
Thanks.
Avatar of cdukes

ASKER

Although we didn't find an exact solution, Tiggerito has been very helpful so I feel he deserves the points.
thanks guys