Solved

Query across 3 simple many-to-many tables

Posted on 2011-10-02
Medium Priority
250 Views
Last Modified: 2012-05-12
Hi all,

I have 3 simple many-to-many tables:

CREATE TABLE `title_index` (
  `word_id` MEDIUMINT(10) UNSIGNED NOT NULL,
  `link_id` INT(10) UNSIGNED NOT NULL,
  PRIMARY KEY (`word_id`,`link_id`)
) ENGINE=MYISAM DEFAULT CHARSET=latin1

CREATE TABLE `tag_index` (
  `word_id` MEDIUMINT(10) UNSIGNED NOT NULL,
  `link_id` INT(10) UNSIGNED NOT NULL,
  PRIMARY KEY (`word_id`,`link_id`)
) ENGINE=MYISAM DEFAULT CHARSET=latin1

CREATE TABLE `category_index` (
  `word_id` MEDIUMINT(10) UNSIGNED NOT NULL,
  `link_id` INT(10) UNSIGNED NOT NULL,
  PRIMARY KEY (`word_id`,`link_id`)
) ENGINE=MYISAM DEFAULT CHARSET=latin1



They record word_id relationships to link_id for titles, tags and categories, for a large quantity of links (about 2 million).

In terms of table sizes, tags is the biggest, then titles, with categories the smallest.

What I'm trying to do is query across all 3 with a set of word_ids and return a list ordered by occurrence score.

It might be better if I explain this with some pseudo-code:


Select link_id from 3 tables
- If a match is made in the title table - score 5
- If a match is made in the tag table - score 1
- If a match is made in the category table - score 3
Where word_id IN (12,14,15,16,27,56)
Order by sum of the scores per link
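In real SQL, I imagine the shape would be a weighted UNION ALL over the three tables, summed per link. Here's a rough, untested sketch of what I mean (mocked up in Python/SQLite with made-up rows purely to illustrate; the MySQL query inside would be essentially the same):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Simplified stand-ins for the three index tables above.
for t in ("title_index", "tag_index", "category_index"):
    cur.execute(
        f"CREATE TABLE {t} (word_id INTEGER, link_id INTEGER, "
        f"PRIMARY KEY (word_id, link_id))"
    )

# Made-up sample relationships, just to have something to score.
cur.executemany("INSERT INTO title_index VALUES (?, ?)", [(12, 1), (14, 1), (15, 2)])
cur.executemany("INSERT INTO tag_index VALUES (?, ?)", [(12, 1), (16, 2), (27, 2)])
cur.executemany("INSERT INTO category_index VALUES (?, ?)", [(14, 1), (56, 2)])

# Title hit scores 5, category hit 3, tag hit 1; sum per link, order by total.
query = """
SELECT link_id, SUM(score) AS score FROM (
    SELECT link_id, 5 AS score FROM title_index    WHERE word_id IN (12,14,15,16,27,56)
    UNION ALL
    SELECT link_id, 1 AS score FROM tag_index      WHERE word_id IN (12,14,15,16,27,56)
    UNION ALL
    SELECT link_id, 3 AS score FROM category_index WHERE word_id IN (12,14,15,16,27,56)
) AS hits
GROUP BY link_id
ORDER BY score DESC
LIMIT 16
"""
for link_id, score in cur.execute(query):
    print(link_id, score)
```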

A neat query that can achieve this with speed would be most appreciated!

Cheers
David


Question by:dolythgoe

16 Comments
 
LVL 13

Expert Comment

by:Hugh McCurdy
ID: 36899927
I doubt you can do this using only MySQL. It doesn't appear possible using PHP & MySQL:

http://php.net/manual/en/function.mysql-query.php   
mysql_query() sends a unique query (multiple queries are not supported) to the currently active database on the server that's associated with the specified link_identifier.

If it's not possible, then you'll need to write a program. We can talk about that if that's the direction you take.

Before you do that, I suggest waiting until Monday (if possible) to see if someone appears with a way to solve this problem using just MySQL.
 

Author Comment

by:dolythgoe
ID: 36899997
Well, I have this working when querying one table, but the speed is too slow for production use, so I thought I'd have a go at querying across 3 and see if it can be done.

For 1 table query:

SELECT link_id, SUM(cat_id) AS score 
FROM linkindex WHERE word_id IN (107, 27)
GROUP BY link_id 
ORDER BY score DESC LIMIT 0,16;



cat_id is either 5, 3 or 1 depending on whether it's a title, category or tag. This table is 18.5 million rows and slows down a lot when querying frequent word_ids.

INDEX is on (word_id, link_id, cat_id).

I'm trying every creative thing possible to get a faster solution. It's fast for one word_id, but for 10 frequent word_ids it slows down.

MySQL does mention something about a 30% threshold at which it decides a full table scan is better than an index, but tmp_table_size is large enough to avoid I/O on disk.

Explain on an intensive query:

id  select_type  table      type   possible_keys  key      key_len  ref  rows     Extra                                                      
1   SIMPLE       linkindex  range  PRIMARY        PRIMARY  3             1817080  Using where; Using index; Using temporary; Using filesort  

Just can't help thinking there's a faster way - the big hitters are SUM, GROUP BY and ORDER BY.

Speed is about 4 secs for:
SELECT link_id, SUM(cat_id) AS score 
FROM linkindex WHERE word_id IN (102,103,107,46,45,469,68,27,104,260)
GROUP BY link_id 
ORDER BY score DESC LIMIT 0,16;



Ideally I need split-second responses. Creating the tmp table takes most of the time; it sorts quickly. It's not creating the tmp table on disk, so I don't know what to do about this!
 
LVL 60

Expert Comment

by:Kevin Cross
ID: 36900171
Yes, you are taking a hit because of the things you mentioned. With 18.5 million rows, you have to aggregate the data and then sort it all, only to select the top 16; the more results, the more work the ORDER BY has to do. Further, as you said, the selectivity percentage matters: if you are practically selecting the whole table anyway, the SQL optimizer will likely just go with a full table scan. I have seen the selectivity cap as low as 17%. To overcome this, you can use FORCE INDEX.

Another thought is, does the cat_id connect to anything dynamically or is it always going to be the subject of an aggregate? i.e., when the records for this table are generated, is it possible to calculate the total score then? So you can remove the GROUP BY and SUM operations. Combined with FORCE INDEX, this may make this more performant.

Hope that helps!

Best regards and happy coding,

Kevin

 

Author Comment

by:dolythgoe
ID: 36900331
Cheers Kevin,

Initially the cat_ids were going to reference something, but then I realised they didn't need to, and they just became weighting scores, if you like.

I also changed the indexing script to sum up the totals per word_id / link_id, but that only shaved 2 million rows off it, as there's a row for every tag, which has a score of 1.

I'm now looking into the Sphinx search engine, since it is a specialist C++ search indexer that works well with MySQL and has a PHP API. It seems to do all the things I want, like applying weighting to the data fields, unlike MySQL's native FULLTEXT. A lot of people also recommend it for speed, as it indexes numerically.

Looks like I'll be posting a few questions about that next hehe :)

 
LVL 13

Expert Comment

by:Hugh McCurdy
ID: 36900375
I understand the problem better now.  mwvisa1 makes a really good point about the sort if you are selecting most of the table and then sorting it before selecting the top 16.  Have you read this?

http://dev.mysql.com/doc/refman/5.0/en/limit-optimization.html

If you use LIMIT row_count  with ORDER BY, MySQL ends the sorting as soon as it has found the first row_count rows of the sorted result, rather than sorting the entire result. If ordering is done by using an index, this is very fast. If a filesort must be done, all rows that match the query without the LIMIT clause must be selected, and most or all of them must be sorted, before it can be ascertained that the first row_count rows have been found. In either case, after the initial rows have been found, there is no need to sort any remainder of the result set, and MySQL does not do so

The key is using an index, just as mwvisa1 suggests. The point is that you don't want to completely sort n records (where n is huge) when you only need the top 16. You find the top 16 (one pass, linear time) and then sort those 16. (If n is huge, the whole job then takes "linear time", which is excellent for a sort.)
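To make the linear-time point concrete, here's a quick Python sketch with made-up scores: heapq.nlargest does a single scan, keeping only the current top 16 in a small heap, instead of sorting all n rows.

```python
import heapq
import random

random.seed(42)
# Pretend these are (score, link_id) pairs for a large number of links.
scores = [(random.randint(1, 100), link_id) for link_id in range(200_000)]

# nlargest scans the list once, maintaining a 16-element heap: roughly
# O(n) work, versus the O(n log n) full sort an unindexed
# ORDER BY ... LIMIT would have to do before discarding most rows.
top16 = heapq.nlargest(16, scores)
print(len(top16))
```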
 

Author Comment

by:dolythgoe
ID: 36900421
Thanks guys,

I suppose the difficulty with this is the ordering by a score which needs to be worked out in realtime.

The search terms can vary, so you can never know in advance what scores the various links have for a given combination of word_ids. The word_id | link_id | cat_id relationships provide the score for each particular relationship, but the problem occurs when you throw a few word_ids at it and ask it to order by the sum of those cat_id scores before the LIMIT.

It's been a bit of a learning curve but my attention has turned to an open source solution Sphinx which has been dedicated to this type of solution for many years so it makes sense to have a look at their solution.

I don't think I can optimise my query anymore based on the above...ah well!
 
LVL 60

Assisted Solution

by:Kevin Cross
Kevin Cross earned 1000 total points
ID: 36900428
Yes, exactly. I have noticed this on very large data sets when doing a calculation or sorting descending: if it is the natural order of the query and you LIMIT, it is fine, but if the server has to do real work first, it can grind.
David, if you can find a FULLTEXT engine that does what you need, as mentioned in http:#36900331, then that IS probably best. However, on the "...that only shaved 2 million rows..." part: yes, but it also removes the need to GROUP BY/SUM later. With your 16.5 million rows and an index on word_id, link_id, and score (note: in future I believe an ASC|DESC designation on index columns will actually help efficiency if, as in your example, you are always reading from the top down), try this:

SELECT link_id, score FROM linkindex 
WHERE word_id IN (102,103,107,46,45,469,68,27,104,260)
ORDER BY score DESC LIMIT 0,16;



I cannot remember if you went through this before, but seeing that all the tables are MyISAM reminded me to mention that you can tweak key_buffer_size if you have the resources to support it, as that will help with your data size.
http://www.mysqlperformanceblog.com/2006/09/29/what-to-tune-in-mysql-server-after-installation/

Good luck!

Regards,

Kevin
 
LVL 13

Expert Comment

by:Hugh McCurdy
ID: 36900450
I don't know about Sphinx but I agree you should look.  

If this were my problem, I'd try measuring selection speed without the sort to see if it was much faster (like 0.2 seconds instead of 4 seconds). I suspect it wouldn't be, but I'd find out. If it were, I'd work on a plan based on benchmark speeds.

Best of luck.

Hugh
 
LVL 60

Expert Comment

by:Kevin Cross
ID: 36900457
Ah, that is right. The score is over the word_id/link_id combination, but you want the score over the link_id alone to get the top 16 rows for the word search. Hmm. Darn. Still worth trying whether tweaking key_buffer_size, using FORCE INDEX, etc. improves things until you get the C++ solution working:

SELECT link_id, SUM(score) AS score 
FROM linkindex FORCE INDEX(index_id_of_word_link_score_combo)
WHERE word_id IN (102,103,107,46,45,469,68,27,104,260)
GROUP BY link_id
ORDER BY score DESC LIMIT 0,16;

 
LVL 13

Accepted Solution

by:
Hugh McCurdy earned 1000 total points
ID: 36900475
I glossed over the C++ comment. That's what I'd end up using if this were my problem (C++), but I'd code it myself. Sphinx is likely a good approach. (No point in writing code that's already been written by someone else.)

I came back here because I didn't want to muddle up your new thread with a post of "I dunno" and have others think there's a solution already. My guess is that you only need Sphinx on the server (where the MySQL data resides). But that's a guess (one I'd rely on only if you don't get an expert answer in that other thread).
 

Author Comment

by:dolythgoe
ID: 36900480
Thanks again Kevin,

Unfortunately the SUM and GROUP BY are still required; this is illustrated with some data (ignore pos):

link_id  word_id  cat_id  pos  
2        26       5       10  
2        25       5       9    
2        24       5       8    
2        23       5       7    
2        22       5       6    
2        21       5       5    
2        20       5       4    
2        19       5       3    
2        18       5       2    
2        17       5       1    
2        31       1       0    
2        21       1       0    
2        30       1       0    
2        18       1       0    
2        29       1       0    
2        28       1       0    
2        22       1       0    
2        27       1       0    
2        25       1       0    
2        18       3       0    
4        46       5       5    
4        45       5       4    
4        44       5       3    
4        43       5       2    
4        42       5       1    
                       
You'd still need the link_id repeating, and the word_id. Instead of 3 rows (say 5, 3, 1) for one word_id and one link_id, you would have a single row with a score of 9; but that is one row per word_id | link_id relationship, and there are still many more word_ids for that link_id. So although you have a total applied across title/tag/category, you still need to sum over all the word_ids and group by link_id. Does that make sense?
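To put numbers on that, here's the same grouping done on the sample rows above in Python (just a sketch of the work the SQL SUM/GROUP BY still has to do at query time):

```python
from collections import defaultdict

# (link_id, word_id, cat_id) rows from the sample above (pos omitted).
rows = [
    (2, 26, 5), (2, 25, 5), (2, 24, 5), (2, 23, 5), (2, 22, 5),
    (2, 21, 5), (2, 20, 5), (2, 19, 5), (2, 18, 5), (2, 17, 5),
    (2, 31, 1), (2, 21, 1), (2, 30, 1), (2, 18, 1), (2, 29, 1),
    (2, 28, 1), (2, 22, 1), (2, 27, 1), (2, 25, 1), (2, 18, 3),
    (4, 46, 5), (4, 45, 5), (4, 44, 5), (4, 43, 5), (4, 42, 5),
]

# Equivalent of SELECT link_id, SUM(cat_id) ... GROUP BY link_id:
# every word_id row still contributes to its link's total.
scores = defaultdict(int)
for link_id, word_id, cat_id in rows:
    scores[link_id] += cat_id

print(dict(scores))
```

Link 2 totals 62 (ten title hits at 5, nine tag hits at 1, one category hit at 3) and link 4 totals 25, which is exactly the per-link aggregation that can't be precomputed per word.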

Yes, I've tweaked those settings quite a bit, so it's all in memory, and although it's quick, it's not production-quick for this type of thing.

 
LVL 60

Expert Comment

by:Kevin Cross
ID: 36900499
Yes, I remembered that; sorry for forgetting previously. See http:#36900457. Your sample, though, shows rows for 2|25|5 and then 2|25|1; I would store that as a single row, 2|25|6, and use FORCE INDEX to work around the fact that selecting more words increases the percentage of rows selected, to the point where MySQL opts for a table scan. And again, I am not suggesting this instead of Sphinx, but as a stopgap if it improves things. Anyway, I wish I could help with Sphinx, but cannot; good luck!

Best regards,

Kevin
 

Author Comment

by:dolythgoe
ID: 36900505
Ahh out of sync lol
 

Author Comment

by:dolythgoe
ID: 36900515
Huge thanks for your support guys - it's great to see the willingness to help out.

I will have a go at Sphinx and let you know the outcome of some similar speed tests, and also try your stopgap solution, Kevin!

 
LVL 60

Expert Comment

by:Kevin Cross
ID: 36900521
You are most welcome! If you post back every 3 days or so on progress that should keep this active, so you can close when ready without having this marked as abandoned.
 

Author Comment

by:dolythgoe
ID: 36980391
Oops forgot about the 3 day thing!

Anyway, the results of the Sphinx integration are very positive.

It's exceedingly fast compared to MySQL, with many more features than FULLTEXT search. The documentation is a bit lacking in the 'examples' department: most of it centres on what features do what and how to use them, with few 'cookbook' examples that developers can relate to when piecing things together in context.

Sphinx has proximity ranking built in and can be used for multi-faceted searches. Using its API with AJAX can produce some incredible results. I would definitely recommend it for searching where your products/docs run into the millions, and for a multi-faceted approach.

Its speed is down to hash tables written and accessed in C, and to being designed to use multiple cores for a single query. A multi-keyword search returns in a split second on a million+ row index.
