Link to home
Start Free TrialLog in
Avatar of dolythgoe
dolythgoe

asked on

High speed MySQL search on an index of integers

Hi all,

I'm really trying to squeeze speed and optimise where possible out of the search index I'm building.

I have a php worker script which looks for tags in a textdump and sorts them into a searchable index based on word_id's and link_id's.

I've setup the table link this:

CREATE TABLE IF NOT EXISTS `linkindex_tags` (
  `link_id` int(10) unsigned NOT NULL,
  `word_id` mediumint(10) unsigned NOT NULL,
  PRIMARY KEY (`link_id`,`word_id`),
  KEY `word_id` (`word_id`),
  KEY `link_id` (`link_id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

Open in new window


Now, I'm using the below SQL query which takes a series of word_id's and orders them based on number of matches in the above table .

SELECT link_id 
FROM linkindex_tags
WHERE word_id <= 4129 
GROUP BY link_id
ORDER BY 
SUM( word_id IN ('56, 65,4129') ) 
DESC LIMIT 15

Open in new window


I've added some indexes and assigned a fair bit of resource from the server.

Here's the results of this query using 2 word_id's (using SQL_NO_CACHE) on a table with 7.9 million rows - this table will likely get to 30+ million.

(15 total, Query took 3.0226 sec)

Server: CentOs 64bit - 8GB RAM Intel(R) Xeon(R) CPU X3470 @ 2.93GHz (x8)

Now, ideally need this down to 0.5 or less and it's looking like it might fall off a cliff if more and more rows are added.

Does anyone have any experience in this type of index and how to make it as rapid as possible. It's hard to think how I would split the tables up into any logical partioning as the word_id's can be 1 to 99,000 and any one of them be related to the link id.

Help is much appreciated to help me take it that extra mile :)

Edited - here's the EXPLAIN:

SQL result

Host: localhost
Database: rampants_db
Generation Time: Sep 08, 2011 at 10:27 PM
Generated by: phpMyAdmin 3.4.4 / MySQL 5.1.56
SQL query: EXPLAIN SELECT SQL_NO_CACHE link_id FROM linkindex_tags WHERE word_id IN ('107','27') GROUP BY link_id ORDER BY SUM( word_id IN ('107','27') ) DESC LIMIT 15;
Rows: 1
id       select_type       table       type       possible_keys       key       key_len       ref       rows       Extra
1       SIMPLE       linkindex_tags       index       word_id       PRIMARY       7       NULL      7958382       Using where; Using index; Using temporary; Using filesort
SOLUTION
Avatar of Guy Hengel [angelIII / a3]
Guy Hengel [angelIII / a3]
Flag of Luxembourg image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
First, I differ to a3's answer above regarding index and issues you are having there; however, I notice that the query you said you needed to run isn't the same as the one diagnosed by EXPLAIN. The one in the EXPLAIN is what a3 is discussing with passing strings that require implicit conversion to integer.

The query you showed though is:

word_id IN ('56, 65,4129')

Note that passing a string in this fashion does not translate into:

word_in IN ('56', '65', '4129') or word_in IN (56, 65, 4129)

But rather is treated as one big string literal. To get around this, you can use a function similar to one a3 has done for t-sql called parmsToList; however, another option is FIND_IN_SET(). Performance-wise, though, I am not sure it is any better given you have to use the column in FIND_IN_SET().

FIND_IN_SET(word_id, '56, 65,4129') > 0

So, to do your conditional aggregate, you can do something like:

SELECT link_id 
FROM linkindex_tags
WHERE word_id <= 4129 
GROUP BY link_id
ORDER BY SUM( IF(FIND_IN_SET(word_id, '56, 65,4129') > 0, 1, 0) ) DESC 
LIMIT 15

Open in new window


Or possibly better approach I just showed someone else:
SELECT link_id 
FROM linkindex_tags
WHERE word_id <= 4129 
GROUP BY link_id
ORDER BY SUM( IF(word_id REGEXP CONCAT('((', REPLACE('56, 65,4129', ', ', ')|('), '))'), 1, 0) ) DESC 
LIMIT 15

Open in new window


If the concern is really the passing of a comma-delimited list of integers, then try the above. If it is just a type-o in question, then as I said, please refer to a3's post.

Best regards,

Kevin
Avatar of dolythgoe
dolythgoe

ASKER

Thanks to you both - great suggestions.

So a3 - I changed the indexes to a PRIMARY with word_id FIRST.

I ran this query:

Here's the EXPLAIN:

id       select_type       table       type       possible_keys       key       key_len       ref       rows       Extra
1       SIMPLE       linkindex_tags       range       PRIMARY       PRIMARY       3       NULL      387586       Using where; Using index; Using temporary; Using filesort

Results...I ran this:

EXPLAIN SELECT SQL_NO_CACHE link_id
FROM linkindex_tags
WHERE word_id
IN ( 107, 27 )
GROUP BY link_id
ORDER BY SUM( word_id
IN ( 107, 27 ) ) DESC
LIMIT 15

Open in new window


( 15 total, Query took 0.6366 sec) Wow! Much faster! I see it's working off 387586 instead of the 7+ million!

Taking SQL_NO_CACHE out and doing a completely different word_id search:

SELECT link_id
FROM linkindex_tags
WHERE word_id
IN ( 102, 103, 553, 20990, 8092, 4218, 386 )
GROUP BY link_id
ORDER BY SUM( word_id
IN ( 102, 103, 553, 20990, 8092, 4218, 386 ) ) DESC
LIMIT 15

Open in new window


( 15 total, Query took 0.1647 sec) - pretty awesome to be honest!

Kevin, my apologies, I posted an EXPLAIN to a query that was different from another test which doesn't help much!! I have control of the input via php and can indeed build the query with integers so no need to pass it a string. Good spot.

Is there more that can be done to this with regards to my.cnf? (ref using filesort / using temporary) although this is now in the good usability speed zones :D
Further note to add this query will make up 95%+ of all queries being a search so any advice on optimising the my.cnf - want to allocate as much of the 8GB as poss to making this speedy where relevant...
SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
ASKER CERTIFIED SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Thanks Kevin,

I'm actually looking for the SUM  of the 'boolean matches' - so for each word_id match - give the score of 1 so the link_id's with the most matches comes out on top...

Do you think this can be done without the SUM?
Yes, since you have a where clause filtering out word_Id values other than those you want a simple count represents how many matches.
Thanks Kevin,

I ran these 2 queries - 1st one:

SELECT SQL_NO_CACHE link_id
FROM linkindex_tags
WHERE word_id
IN ( 103, 363, 850, 257 )
GROUP BY link_id
ORDER BY SUM( word_id
IN ( 103, 363, 850, 257 ) ) DESC
LIMIT 15

Open in new window


15 total, Query took 0.1163 sec

2nd one

SELECT SQL_NO_CACHE link_id
FROM linkindex_tags
WHERE word_id
IN ( 103, 363, 850, 257 )
GROUP BY link_id
ORDER BY COUNT(link_id) DESC
LIMIT 15

Open in new window


15 total, Query took 0.0947 sec

But both gave different results...however both are relevant results..

Here's the EXPLAINS:

1st Query

id       select_type       table       type       possible_keys       key       key_len       ref       rows       Extra
1       SIMPLE       linkindex_tags       range       PRIMARY       PRIMARY       3       NULL      88588       Using where; Using index; Using temporary; Using filesort
2nd Query

id       select_type       table       type       possible_keys       key       key_len       ref       rows       Extra
1       SIMPLE       linkindex_tags       range       PRIMARY       PRIMARY       3       NULL      88588       Using where; Using index; Using temporary; Using filesort

Identical explain outputs and similar speeds but different ordering of results. Wonder why that is..

Then your other query:

SELECT SQL_NO_CACHE link_id
FROM linkindex_tags
WHERE word_id
IN ( 103, 363, 850, 257 )
GROUP BY link_id
ORDER BY COUNT(1)
DESC
LIMIT 15

Open in new window


15 total, Query took 0.0948 sec

Returned same as the COUNT(link_id) - why is that? Is it counting the '1' boolean scores?

EXPLAIN on that one:

id       select_type       table       type       possible_keys       key       key_len       ref       rows       Extra
1       SIMPLE       linkindex_tags       range       PRIMARY       PRIMARY       3       NULL      88588       Using where; Using index; Using temporary; Using filesort


Do you think I've reached my limit in optimisation here? - results seem slightly faster with COUNT()



Yes. The last optimization is probably the SUM() to COUNT() one and really the optimization there is minor as you saw. It is removing the need to compare word_id to get boolean sum. Interesting they are different given you have already filtered on word_id in the WHERE clause making every row a boolean match or 1 in the SUM. Hmm.

As far as COUNT(1), since you have a WHERE clause already limiting the data to matches as I mentioned you can count a literal value.
Thanks Kevin and a3 for all your help here. Will close this down.

I ran the index script again under the new PRIMARY index which sorted the word_id's in a sequential order. I also bumped up key_buffer_size to 4GB and the results are really impressive for over 10 million rows. consistantly getting under 0.1 for words that occur heavily and 0.09 and under for less popular words.

The caching makes them almost instant and it's performing perfectly so thanks to you both for your support here., much appreciated :)

Cheers
David
Fast responsive help, very happy with outcome.