Solved

High speed MySQL search on an index of integers

Posted on 2011-09-08
12
525 Views
Last Modified: 2013-12-13
Hi all,

I'm really trying to squeeze speed and optimise where possible out of the search index I'm building.

I have a php worker script which looks for tags in a textdump and sorts them into a searchable index based on word_id's and link_id's.

I've setup the table link this:

CREATE TABLE IF NOT EXISTS `linkindex_tags` (
  `link_id` int(10) unsigned NOT NULL,
  `word_id` mediumint(10) unsigned NOT NULL,
  PRIMARY KEY (`link_id`,`word_id`),
  KEY `word_id` (`word_id`),
  KEY `link_id` (`link_id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

Open in new window


Now, I'm using the below SQL query which takes a series of word_id's and orders them based on number of matches in the above table .

SELECT link_id 
FROM linkindex_tags
WHERE word_id <= 4129 
GROUP BY link_id
ORDER BY 
SUM( word_id IN ('56, 65,4129') ) 
DESC LIMIT 15

Open in new window


I've added some indexes and assigned a fair bit of resource from the server.

Here's the results of this query using 2 word_id's (using SQL_NO_CACHE) on a table with 7.9 million rows - this table will likely get to 30+ million.

(15 total, Query took 3.0226 sec)

Server: CentOs 64bit - 8GB RAM Intel(R) Xeon(R) CPU X3470 @ 2.93GHz (x8)

Now, ideally need this down to 0.5 or less and it's looking like it might fall off a cliff if more and more rows are added.

Does anyone have any experience in this type of index and how to make it as rapid as possible. It's hard to think how I would split the tables up into any logical partioning as the word_id's can be 1 to 99,000 and any one of them be related to the link id.

Help is much appreciated to help me take it that extra mile :)

Edited - here's the EXPLAIN:

SQL result

Host: localhost
Database: rampants_db
Generation Time: Sep 08, 2011 at 10:27 PM
Generated by: phpMyAdmin 3.4.4 / MySQL 5.1.56
SQL query: EXPLAIN SELECT SQL_NO_CACHE link_id FROM linkindex_tags WHERE word_id IN ('107','27') GROUP BY link_id ORDER BY SUM( word_id IN ('107','27') ) DESC LIMIT 15;
Rows: 1
id       select_type       table       type       possible_keys       key       key_len       ref       rows       Extra
1       SIMPLE       linkindex_tags       index       word_id       PRIMARY       7       NULL      7958382       Using where; Using index; Using temporary; Using filesort
0
Comment
Question by:dolythgoe
  • 6
  • 4
  • 2
12 Comments
 
LVL 142

Assisted Solution

by:Guy Hengel [angelIII / a3]
Guy Hengel [angelIII / a3] earned 350 total points
ID: 36508304
but with this: WHERE word_id <= 4129
you basically do a full table scan, eventually.

then, you might try to put both word_id and link_id into 1 index (word_id being first), so just change your primary key definition to change the order of the 2 fields.

anyhow, the real issue is that word_id is numeric, but you pass strings.
try this, mysql is a bit picky about strings vs numerics when it goes to select possible indexes
EXPLAIN SELECT SQL_NO_CACHE link_id FROM linkindex_tags WHERE word_id IN (107,27) GROUP BY link_id ORDER BY SUM( word_id IN (107,27) ) DESC LIMIT 15; 

Open in new window

0
 
LVL 59

Expert Comment

by:Kevin Cross
ID: 36508381
First, I differ to a3's answer above regarding index and issues you are having there; however, I notice that the query you said you needed to run isn't the same as the one diagnosed by EXPLAIN. The one in the EXPLAIN is what a3 is discussing with passing strings that require implicit conversion to integer.

The query you showed though is:

word_id IN ('56, 65,4129')

Note that passing a string in this fashion does not translate into:

word_in IN ('56', '65', '4129') or word_in IN (56, 65, 4129)

But rather is treated as one big string literal. To get around this, you can use a function similar to one a3 has done for t-sql called parmsToList; however, another option is FIND_IN_SET(). Performance-wise, though, I am not sure it is any better given you have to use the column in FIND_IN_SET().

FIND_IN_SET(word_id, '56, 65,4129') > 0

So, to do your conditional aggregate, you can do something like:

SELECT link_id 
FROM linkindex_tags
WHERE word_id <= 4129 
GROUP BY link_id
ORDER BY SUM( IF(FIND_IN_SET(word_id, '56, 65,4129') > 0, 1, 0) ) DESC 
LIMIT 15

Open in new window


Or possibly better approach I just showed someone else:
SELECT link_id 
FROM linkindex_tags
WHERE word_id <= 4129 
GROUP BY link_id
ORDER BY SUM( IF(word_id REGEXP CONCAT('((', REPLACE('56, 65,4129', ', ', ')|('), '))'), 1, 0) ) DESC 
LIMIT 15

Open in new window


If the concern is really the passing of a comma-delimited list of integers, then try the above. If it is just a type-o in question, then as I said, please refer to a3's post.

Best regards,

Kevin
0
 

Author Comment

by:dolythgoe
ID: 36508490
Thanks to you both - great suggestions.

So a3 - I changed the indexes to a PRIMARY with word_id FIRST.

I ran this query:

Here's the EXPLAIN:

id       select_type       table       type       possible_keys       key       key_len       ref       rows       Extra
1       SIMPLE       linkindex_tags       range       PRIMARY       PRIMARY       3       NULL      387586       Using where; Using index; Using temporary; Using filesort

Results...I ran this:

EXPLAIN SELECT SQL_NO_CACHE link_id
FROM linkindex_tags
WHERE word_id
IN ( 107, 27 )
GROUP BY link_id
ORDER BY SUM( word_id
IN ( 107, 27 ) ) DESC
LIMIT 15

Open in new window


( 15 total, Query took 0.6366 sec) Wow! Much faster! I see it's working off 387586 instead of the 7+ million!

Taking SQL_NO_CACHE out and doing a completely different word_id search:

SELECT link_id
FROM linkindex_tags
WHERE word_id
IN ( 102, 103, 553, 20990, 8092, 4218, 386 )
GROUP BY link_id
ORDER BY SUM( word_id
IN ( 102, 103, 553, 20990, 8092, 4218, 386 ) ) DESC
LIMIT 15

Open in new window


( 15 total, Query took 0.1647 sec) - pretty awesome to be honest!

Kevin, my apologies, I posted an EXPLAIN to a query that was different from another test which doesn't help much!! I have control of the input via php and can indeed build the query with integers so no need to pass it a string. Good spot.

Is there more that can be done to this with regards to my.cnf? (ref using filesort / using temporary) although this is now in the good usability speed zones :D
0
 

Author Comment

by:dolythgoe
ID: 36508495
Further note to add this query will make up 95%+ of all queries being a search so any advice on optimising the my.cnf - want to allocate as much of the 8GB as poss to making this speedy where relevant...
0
 
LVL 142

Assisted Solution

by:Guy Hengel [angelIII / a3]
Guy Hengel [angelIII / a3] earned 350 total points
ID: 36508532
you might need to increase the key_buffer_size parameter, if not yet done:
http://dev.mysql.com/doc/refman/5.0/en/server-parameters.html
0
 
LVL 59

Accepted Solution

by:
Kevin Cross earned 150 total points
ID: 36509074
Again, I agree with a3. Just as a note:

SELECT link_id
FROM linkindex_tags
WHERE word_id
IN ( 102, 103, 553, 20990, 8092, 4218, 386 )
GROUP BY link_id
ORDER BY SUM( word_id
IN ( 102, 103, 553, 20990, 8092, 4218, 386 ) ) DESC
LIMIT 15

Since those are the same, just simply use:
SELECT link_id
FROM linkindex_tags
WHERE word_id
IN ( 102, 103, 553, 20990, 8092, 4218, 386 )
GROUP BY link_id
ORDER BY SUM( word_id ) DESC
LIMIT 15

Or

SELECT link_id
FROM linkindex_tags
WHERE word_id
IN ( 102, 103, 553, 20990, 8092, 4218, 386 )
GROUP BY link_id
ORDER BY COUNT(1) DESC
LIMIT 15

The second may perform better if all you want the number of matches. If you actually need the SUM() of word_id, then at least you do not have to do second IN.
0
Top 6 Sources for Identifying Threat Actor TTPs

Understanding your enemy is essential. These six sources will help you identify the most popular threat actor tactics, techniques, and procedures (TTPs).

 

Author Comment

by:dolythgoe
ID: 36510033
Thanks Kevin,

I'm actually looking for the SUM  of the 'boolean matches' - so for each word_id match - give the score of 1 so the link_id's with the most matches comes out on top...

Do you think this can be done without the SUM?
0
 
LVL 59

Expert Comment

by:Kevin Cross
ID: 36510232
Yes, since you have a where clause filtering out word_Id values other than those you want a simple count represents how many matches.
0
 

Author Comment

by:dolythgoe
ID: 36511240
Thanks Kevin,

I ran these 2 queries - 1st one:

SELECT SQL_NO_CACHE link_id
FROM linkindex_tags
WHERE word_id
IN ( 103, 363, 850, 257 )
GROUP BY link_id
ORDER BY SUM( word_id
IN ( 103, 363, 850, 257 ) ) DESC
LIMIT 15

Open in new window


15 total, Query took 0.1163 sec

2nd one

SELECT SQL_NO_CACHE link_id
FROM linkindex_tags
WHERE word_id
IN ( 103, 363, 850, 257 )
GROUP BY link_id
ORDER BY COUNT(link_id) DESC
LIMIT 15

Open in new window


15 total, Query took 0.0947 sec

But both gave different results...however both are relevant results..

Here's the EXPLAINS:

1st Query

id       select_type       table       type       possible_keys       key       key_len       ref       rows       Extra
1       SIMPLE       linkindex_tags       range       PRIMARY       PRIMARY       3       NULL      88588       Using where; Using index; Using temporary; Using filesort
2nd Query

id       select_type       table       type       possible_keys       key       key_len       ref       rows       Extra
1       SIMPLE       linkindex_tags       range       PRIMARY       PRIMARY       3       NULL      88588       Using where; Using index; Using temporary; Using filesort

Identical explain outputs and similar speeds but different ordering of results. Wonder why that is..

Then your other query:

SELECT SQL_NO_CACHE link_id
FROM linkindex_tags
WHERE word_id
IN ( 103, 363, 850, 257 )
GROUP BY link_id
ORDER BY COUNT(1)
DESC
LIMIT 15

Open in new window


15 total, Query took 0.0948 sec

Returned same as the COUNT(link_id) - why is that? Is it counting the '1' boolean scores?

EXPLAIN on that one:

id       select_type       table       type       possible_keys       key       key_len       ref       rows       Extra
1       SIMPLE       linkindex_tags       range       PRIMARY       PRIMARY       3       NULL      88588       Using where; Using index; Using temporary; Using filesort


Do you think I've reached my limit in optimisation here? - results seem slightly faster with COUNT()



0
 
LVL 59

Expert Comment

by:Kevin Cross
ID: 36513872
Yes. The last optimization is probably the SUM() to COUNT() one and really the optimization there is minor as you saw. It is removing the need to compare word_id to get boolean sum. Interesting they are different given you have already filtered on word_id in the WHERE clause making every row a boolean match or 1 in the SUM. Hmm.

As far as COUNT(1), since you have a WHERE clause already limiting the data to matches as I mentioned you can count a literal value.
0
 

Author Comment

by:dolythgoe
ID: 36514509
Thanks Kevin and a3 for all your help here. Will close this down.

I ran the index script again under the new PRIMARY index which sorted the word_id's in a sequential order. I also bumped up key_buffer_size to 4GB and the results are really impressive for over 10 million rows. consistantly getting under 0.1 for words that occur heavily and 0.09 and under for less popular words.

The caching makes them almost instant and it's performing perfectly so thanks to you both for your support here., much appreciated :)

Cheers
David
0
 

Author Closing Comment

by:dolythgoe
ID: 36514519
Fast responsive help, very happy with outcome.
0

Featured Post

How to improve team productivity

Quip adds documents, spreadsheets, and tasklists to your Slack experience
- Elevate ideas to Quip docs
- Share Quip docs in Slack
- Get notified of changes to your docs
- Available on iOS/Android/Desktop/Web
- Online/Offline

Join & Write a Comment

Suggested Solutions

I imagine that there are some, like me, who require a way of getting currency exchange rates for implementation in web project from time to time, so I thought I would share a solution that I have developed for this purpose. It turns out that Yaho…
This article discusses four methods for overlaying images in a container on a web page
The viewer will learn how to look for a specific file type in a local or remote server directory using PHP.
The viewer will learn how to create a basic form using some HTML5 and PHP for later processing. Set up your basic HTML file. Open your form tag and set the method and action attributes.: (CODE) Set up your first few inputs one for the name and …

705 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

19 Experts available now in Live!

Get 1:1 Help Now