dolythgoe
asked on
High speed MySQL search on an index of integers
Hi all,
I'm really trying to squeeze speed and optimise where possible out of the search index I'm building.
I have a php worker script which looks for tags in a textdump and sorts them into a searchable index based on word_id's and link_id's.
I've set up the table like this:
CREATE TABLE IF NOT EXISTS `linkindex_tags` (
`link_id` int(10) unsigned NOT NULL,
`word_id` mediumint(10) unsigned NOT NULL,
PRIMARY KEY (`link_id`,`word_id`),
KEY `word_id` (`word_id`),
KEY `link_id` (`link_id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
Now, I'm using the SQL query below, which takes a series of word_id's and orders results by the number of matches in the above table.
SELECT link_id
FROM linkindex_tags
WHERE word_id <= 4129
GROUP BY link_id
ORDER BY
SUM( word_id IN ('56, 65,4129') )
DESC LIMIT 15
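For illustration, the ranking pattern in the query above - score each row 1 if its word_id is in the search set, sum per link_id, order by the sum - can be sketched on a toy dataset. This sketch uses SQLite rather than MySQL purely so it is self-contained; the table name and sample values are made up, but the query shape matches the one in the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE linkindex_tags (
        link_id INTEGER NOT NULL,
        word_id INTEGER NOT NULL,
        PRIMARY KEY (link_id, word_id)
    )
""")
# Toy data: link 1 matches two search words, links 2 and 3 match one each.
conn.executemany(
    "INSERT INTO linkindex_tags (link_id, word_id) VALUES (?, ?)",
    [(1, 56), (1, 65), (2, 65), (3, 4129), (3, 999)],
)
# Rank link_ids by how many of the searched word_ids they contain.
# word_id IN (...) evaluates to 0 or 1 per row, so the SUM is a match count.
rows = conn.execute("""
    SELECT link_id
    FROM linkindex_tags
    WHERE word_id IN (56, 65, 4129)
    GROUP BY link_id
    ORDER BY SUM(word_id IN (56, 65, 4129)) DESC
    LIMIT 15
""").fetchall()
print(rows)  # link 1 should come first, with two matches
```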
I've added some indexes and assigned a fair bit of resource from the server.
Here's the results of this query using 2 word_id's (using SQL_NO_CACHE) on a table with 7.9 million rows - this table will likely get to 30+ million.
(15 total, Query took 3.0226 sec)
Server: CentOS 64-bit - 8GB RAM, Intel(R) Xeon(R) CPU X3470 @ 2.93GHz (x8)
Now, I ideally need this down to 0.5 sec or less, and it looks like it might fall off a cliff as more and more rows are added.
Does anyone have any experience with this type of index and how to make it as rapid as possible? It's hard to think how I would split the table up into any logical partitioning, as the word_id's can range from 1 to 99,000 and any one of them can be related to any link_id.
Any help taking it that extra mile is much appreciated :)
Edited - here's the EXPLAIN:
SQL result
Host: localhost
Database: rampants_db
Generation Time: Sep 08, 2011 at 10:27 PM
Generated by: phpMyAdmin 3.4.4 / MySQL 5.1.56
SQL query: EXPLAIN SELECT SQL_NO_CACHE link_id FROM linkindex_tags WHERE word_id IN ('107','27') GROUP BY link_id ORDER BY SUM( word_id IN ('107','27') ) DESC LIMIT 15;
Rows: 1
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE linkindex_tags index word_id PRIMARY 7 NULL 7958382 Using where; Using index; Using temporary; Using filesort
SOLUTION
ASKER
Thanks to you both - great suggestions.
So a3 - I changed the indexes to a PRIMARY KEY with word_id first.
I ran this query:
Here's the EXPLAIN:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE linkindex_tags range PRIMARY PRIMARY 3 NULL 387586 Using where; Using index; Using temporary; Using filesort
Results... I ran this:
EXPLAIN SELECT SQL_NO_CACHE link_id
FROM linkindex_tags
WHERE word_id
IN ( 107, 27 )
GROUP BY link_id
ORDER BY SUM( word_id
IN ( 107, 27 ) ) DESC
LIMIT 15
( 15 total, Query took 0.6366 sec) Wow! Much faster! I see it's working off 387,586 rows instead of the 7+ million!
Taking SQL_NO_CACHE out and doing a completely different word_id search:
SELECT link_id
FROM linkindex_tags
WHERE word_id
IN ( 102, 103, 553, 20990, 8092, 4218, 386 )
GROUP BY link_id
ORDER BY SUM( word_id
IN ( 102, 103, 553, 20990, 8092, 4218, 386 ) ) DESC
LIMIT 15
( 15 total, Query took 0.1647 sec) - pretty awesome to be honest!
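The reason reordering the composite key helps can be sketched with a toy example. A WHERE on word_id alone cannot seek into a (link_id, word_id) key, but it becomes a range seek on a (word_id, link_id) key. This sketch uses SQLite (its EXPLAIN QUERY PLAN plays the role of MySQL's EXPLAIN here); table names and sample values are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Old layout: word_id is not a key prefix, so filtering on word_id alone
# cannot seek into the primary key and forces a scan.
conn.execute("""
    CREATE TABLE tags_old (
        link_id INTEGER NOT NULL,
        word_id INTEGER NOT NULL,
        PRIMARY KEY (link_id, word_id)
    )
""")
# New layout: the WHERE on word_id becomes a seek on the leading key column.
conn.execute("""
    CREATE TABLE tags_new (
        link_id INTEGER NOT NULL,
        word_id INTEGER NOT NULL,
        PRIMARY KEY (word_id, link_id)
    )
""")
for t in ("tags_old", "tags_new"):
    conn.executemany(f"INSERT INTO {t} VALUES (?, ?)",
                     [(1, 107), (1, 27), (2, 27), (3, 500)])

query = "SELECT link_id FROM {} WHERE word_id IN (107, 27) GROUP BY link_id"
plan_new = conn.execute("EXPLAIN QUERY PLAN " + query.format("tags_new")).fetchall()
print(plan_new[0][-1])  # a SEARCH on the (word_id, link_id) index, not a full SCAN
rows = conn.execute(query.format("tags_new") +
                    " ORDER BY COUNT(*) DESC LIMIT 15").fetchall()
print(rows)  # link 1 first: it matches both searched word_ids
```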
Kevin, my apologies - I posted an EXPLAIN for a query that was different from another test, which doesn't help much!! I have control of the input via PHP and can indeed build the query with integers, so there's no need to pass it a string. Good spot.
Is there more that can be done to this with regards to my.cnf (re: Using filesort / Using temporary)? Although this is now in the good usability speed zone :D
ASKER
Further note to add: this query, being a search, will make up 95%+ of all queries, so any advice on optimising the my.cnf would be welcome - I want to allocate as much of the 8GB as possible to making this speedy where relevant...
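For context, the usual knobs for a MyISAM-heavy read workload like this are the MyISAM index cache and the temporary-table/sort buffers (the "Using temporary; Using filesort" steps in the EXPLAIN). The fragment below is a minimal sketch only - the variable names are real MySQL 5.1 settings, but every value is an illustrative assumption that must be tuned against the actual workload, not a recommendation:

```ini
# my.cnf - illustrative starting point for an 8GB MyISAM-only box
[mysqld]
# MyISAM caches only index blocks here; data blocks rely on the OS cache,
# so a few GB is a common starting point on a dedicated server.
key_buffer_size = 2G
# Larger in-memory temporary tables for the GROUP BY ("Using temporary")
tmp_table_size      = 256M
max_heap_table_size = 256M
# Per-connection sort buffer for the "Using filesort" step
sort_buffer_size = 4M
```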
SOLUTION
ASKER CERTIFIED SOLUTION
ASKER
Thanks Kevin,
I'm actually looking for the SUM of the 'boolean matches' - so each word_id match gives a score of 1, and the link_id's with the most matches come out on top...
Do you think this can be done without the SUM?
Yes - since you have a WHERE clause filtering out word_id values other than those you want, a simple COUNT represents how many matches there are.
ASKER
Thanks Kevin,
I ran these 2 queries - 1st one:
SELECT SQL_NO_CACHE link_id
FROM linkindex_tags
WHERE word_id
IN ( 103, 363, 850, 257 )
GROUP BY link_id
ORDER BY SUM( word_id
IN ( 103, 363, 850, 257 ) ) DESC
LIMIT 15
15 total, Query took 0.1163 sec
2nd one
SELECT SQL_NO_CACHE link_id
FROM linkindex_tags
WHERE word_id
IN ( 103, 363, 850, 257 )
GROUP BY link_id
ORDER BY COUNT(link_id) DESC
LIMIT 15
15 total, Query took 0.0947 sec
But both gave different results... however both are relevant results.
Here's the EXPLAINS:
1st Query
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE linkindex_tags range PRIMARY PRIMARY 3 NULL 88588 Using where; Using index; Using temporary; Using filesort
2nd Query
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE linkindex_tags range PRIMARY PRIMARY 3 NULL 88588 Using where; Using index; Using temporary; Using filesort
Identical EXPLAIN outputs and similar speeds, but different ordering of results. I wonder why that is...
Then your other query:
SELECT SQL_NO_CACHE link_id
FROM linkindex_tags
WHERE word_id
IN ( 103, 363, 850, 257 )
GROUP BY link_id
ORDER BY COUNT(1)
DESC
LIMIT 15
15 total, Query took 0.0948 sec
It returned the same results as COUNT(link_id) - why is that? Is it counting the '1' boolean scores?
EXPLAIN on that one:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE linkindex_tags range PRIMARY PRIMARY 3 NULL 88588 Using where; Using index; Using temporary; Using filesort
Do you think I've reached my limit in optimisation here? Results seem slightly faster with COUNT().
Yes. The last optimization is probably the SUM() to COUNT() one, and really that optimization is minor, as you saw: it removes the need to compare word_id to get the boolean sum. It is interesting that they are different, given you have already filtered on word_id in the WHERE clause, making every row a boolean match, i.e. a 1 in the SUM. Hmm.
As for COUNT(1): since you have a WHERE clause already limiting the data to matches, as I mentioned, you can count a literal value.
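Kevin's point - that the WHERE clause already guarantees every surviving row scores 1, so SUM(word_id IN (...)) and COUNT(*) must agree per group - can be checked on a toy table. This sketch uses SQLite purely for illustration (the sample values are invented); the same algebra holds in MySQL, and ties in the count are what allow the two orderings to differ:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE linkindex_tags (link_id INTEGER, word_id INTEGER)")
conn.executemany("INSERT INTO linkindex_tags VALUES (?, ?)",
                 [(1, 103), (1, 363), (2, 850), (3, 257), (3, 999)])

# After the WHERE filter, every remaining row satisfies word_id IN (...),
# so SUM(word_id IN (...)) and COUNT(*) are the same number for each group.
rows = conn.execute("""
    SELECT link_id,
           SUM(word_id IN (103, 363, 850, 257)) AS score,
           COUNT(*) AS cnt
    FROM linkindex_tags
    WHERE word_id IN (103, 363, 850, 257)
    GROUP BY link_id
""").fetchall()
for link_id, score, cnt in rows:
    assert score == cnt  # identical by construction
print(rows)
# Links 2 and 3 tie with one match each; with ORDER BY ... DESC the order
# WITHIN a tie is unspecified, which is why the two queries in the thread
# returned the same set of link_ids in different orders.
```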
ASKER
Thanks Kevin and a3 for all your help here. Will close this down.
I ran the index script again under the new PRIMARY KEY, which sorted the word_id's in sequential order. I also bumped key_buffer_size up to 4GB, and the results are really impressive for over 10 million rows: consistently under 0.1 sec for words that occur heavily, and 0.09 sec and under for less popular words.
The caching makes them almost instant and it's performing perfectly, so thanks to you both for your support here, much appreciated :)
Cheers
David
ASKER
Fast responsive help, very happy with outcome.
The query you showed though is:
word_id IN ('56, 65,4129')
Note that passing a string in this fashion does not translate into:
word_id IN ('56', '65', '4129') or word_id IN (56, 65, 4129)
But rather it is treated as one big string literal. To get around this, you can use a function similar to one a3 has written for T-SQL called parmsToList; however, another option is FIND_IN_SET(). Performance-wise, though, I am not sure it is any better, given you have to apply FIND_IN_SET() to the column:
FIND_IN_SET(word_id, '56, 65,4129') > 0
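The pitfall Kevin describes is easy to reproduce on a toy table. This sketch uses SQLite, which lacks FIND_IN_SET(), so only the IN part is shown; the table and values are invented for illustration. The quoted list is one string literal that no integer ever equals, while the unquoted list matches every row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (word_id INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(56,), (65,), (4129,)])

# One big string literal: '56, 65,4129' is not a number, so no row matches.
bad = conn.execute(
    "SELECT COUNT(*) FROM t WHERE word_id IN ('56, 65,4129')").fetchone()[0]
# A proper list of integer values matches all three rows.
good = conn.execute(
    "SELECT COUNT(*) FROM t WHERE word_id IN (56, 65, 4129)").fetchone()[0]
print(bad, good)
```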
So, to do your conditional aggregate, you can do something like:
[code snippet not shown]
Or possibly better approach I just showed someone else:
[code snippet not shown]
If the concern is really the passing of a comma-delimited list of integers, then try the above. If it was just a typo in the question, then, as I said, please refer to a3's post.
Best regards,
Kevin