Solved

MySQL query for proximity scoring

Posted on 2011-09-14
9
336 Views
Last Modified: 2012-05-12
Hi all,

I'm trying to work through a particular challenge with SQL to get to an end result which will score proximity of word_ids. I've got this table:

CREATE TABLE IF NOT EXISTS `linkindex` (
  `link_id` int(10) unsigned NOT NULL,
  `word_id` mediumint(10) unsigned NOT NULL,
  `cat_id` tinyint(3) unsigned NOT NULL,
  `pos` tinyint(3) unsigned NOT NULL,
  PRIMARY KEY (`word_id`,`link_id`,`cat_id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

Open in new window


And I have this search query:

SELECT link_id, SUM(cat_id) AS score
				FROM linkindex
				WHERE word_id IN ('.$word_id_list.')
				GROUP BY link_id 
				ORDER BY score DESC
				LIMIT 0,16';

Open in new window


PHP feeds a series or word_ids into the IN (1,2,3,4,5,6) section.

As you can see,at the mo this query orders by the score on the cat_id column but I want to use the pos column to score proximity.

So a run through example - words 'one' 'two' are converted to id numbers 1 and 2.

These are then placed in the query to find matches on the word_id column in linkindex. The pos is the position of these words in the sentance when the index was first built up.

What I'm looking to do is apply some maths on the positions to find the 'distance' between the two matches with a view to building an overall relevancy score.

Apologies if I sound confusing! Any pointers on what I have here would be a massive help and I'm sure a good challenge for an SQL guru!
0
Comment
Question by:dolythgoe
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 6
  • 3
9 Comments
 
LVL 24

Expert Comment

by:johanntagle
ID: 36539884
It would help if you can give a sample desired output.  Thanks.
0
 

Author Comment

by:dolythgoe
ID: 36540385
Hi,

Ok so here's some dummy data firs off:

word_id      link_id      cat_id      pos
1      1      5      1
3      1      5      2
5      1      5      3
2      1      5      4
6      2      5      1
8      2      5      2
11      2      5      3
2      2      5      4
1      3      5      1
2      3      5      2
5      3      5      3
8      3      5      4

The output I'm looking for is (in english terms):

"Find me the link_ids with occurences of word_id 1 or 2 and order by the lowest distance found between the 2 word_ids)

1st off, score more matches higher (so link_id's that have 1 and 2 in that case)
2nd - by looking at the pos column (which the position of the word in the sentence) score those higher where the word_ids are closer together.

So from the example above - link_id 3 is most relevant since it contians both 1 and 2 and the positions of those 2 word_ids is 1 and 2 in the sentance so 0 distance between them.

link_id 1 contains both word_ids too but the position is 1 and 4 respectively so not as good as link_id 3.

link_id 2 only contains word 2 so is bottom of the pile.

cat_id is just the weighting applied to that relationship - so whether the word_id is a tag or title for example. There are 3 of these in total 5, 3 and 1 - 5 being the title.

I would only like to score the proximity/distance on cat_id 5's but still include cat_ids 1 & 3 in the overall score which have a position of 0.

I've attached a view of a fuller table of data containing the cat_id 3 and 1 - you'll prob realise my query at the moment is only scoring on the SUM(cat_id) which is the weighting applied to the occurences (i.e. an occurence in the title carries a weight of 5). I've highlighted the word_ids and the areas which I want to take into account for scoring. The grey areas just distinguish the change in link_id.

 linkindex dummy data  
Pasted here as text here too:

word_id      link_id      cat_id      pos
1      1      5      1
3      1      5      2
5      1      5      3
2      1      5      4
2      1      3      0
3      1      3      0
1      1      3      0
4      1      1      0
15      1      1      0
2      1      1      0
6      2      5      1
8      2      5      2
11      2      5      3
2      2      5      4
1      2      3      0
3      2      3      0
11      2      1      0
45      2      1      0
1      3      5      1
2      3      5      2
5      3      5      3
8      3      5      4
22      3      3      0
21      3      3      0
33      3      1      0
1      3      1      0

The table with cat_id 5,3 and 1 contains 0 where a position is not appliceable - this can be changed to anything.

PRIMARY INDEX spans word_id, link_id and cat_id

So to recap, we have a score for matches on occurences through cat_id 5,3 and 1 (higher the better) and then want to look at the position of the cat_id 5's with a view to adding to the score based on the distance between the positions (lower the better).

Hope this makes sense - bit of a nasty one to explain!
0
 
LVL 24

Expert Comment

by:johanntagle
ID: 36540443
Looks interesting.  Unfortunately I'm busy at work now and can only look into this around 10 hours from now.  If nobody gives you an answer before I'll give it a try.
0
Windows Server 2016: All you need to know

Learn about Hyper-V features that increase functionality and usability of Microsoft Windows Server 2016. Also, throughout this eBook, you’ll find some basic PowerShell examples that will help you leverage the scripts in your environments!

 

Author Comment

by:dolythgoe
ID: 36540468
No worries - I think this is an interesting challenge - I've been looking at STDDEV() and other things which might play a part...
0
 

Author Comment

by:dolythgoe
ID: 36540669
Oh there's one more thing that I'd like to add which I've forgotten to mention :/ (Should be awarding a million points for this!).

That is the order in which the words occur in the position, not just the proximity,

Searching 1,2 is different to 2,1.

So there's 3 main things I would like to score on this table:
- 1: Occurences of the word_ids
- 2: The order in which they occur as matched to the orginal input (i.g. pos 1,2 is a great match for query of word_id IN(1,2)
- 3. The proximity of the positions of the word_ids

Now if this doesn't give you a headache I'm not sure what will!!
0
 

Author Comment

by:dolythgoe
ID: 36540688
Sorry...1 final thing! It needs to be as fast as possible - the real table has 20 million+ rows!
0
 

Author Comment

by:dolythgoe
ID: 36900041
Think I have the proximity part sorted for this to affect score but not the order.

Will post if anyone's still interested..
0
 
LVL 24

Accepted Solution

by:
johanntagle earned 500 total points
ID: 36900810
Hi sorry I honestly forgot all about this.  Do post what you have.  Also, can I ask what you are doing this for?  It looks like it's for a search engine so I'm wondering why build one vs use something like Sphinx or Solr.  Thanks.
0
 

Author Comment

by:dolythgoe
ID: 36900843
Hehe, well it was mainly to get an understanding - kind of like homework! I wanted to see how far just sql can go and the limitations -  I'm using sphinx now as it's designed for the purpose but at least this project has helped me understand a few more things - Sphinx is a beast! Oh, if you have any experience on it, I posted a question relating to the installation across different servers if you're able to comment?

Cheers
0

Featured Post

Major Incident Management Communications

Major incidents and IT service outages cost companies millions. Often the solution to minimizing damage is automated communication. Find out more in our Major Incident Management Communications infographic.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Containers like Docker and Rocket are getting more popular every day. In my conversations with customers, they consistently ask what containers are and how they can use them in their environment. If you’re as curious as most people, read on. . .
This post looks at MongoDB and MySQL, and covers high-level MongoDB strengths, weaknesses, features, and uses from the perspective of an SQL user.
This is a high-level webinar that covers the history of enterprise open source database use. It addresses both the advantages companies see in using open source database technologies, as well as the fears and reservations they might have. In this…
In this video, viewers will be given step by step instructions on adjusting mouse, pointer and cursor visibility in Microsoft Windows 10. The video seeks to educate those who are struggling with the new Windows 10 Graphical User Interface. Change Cu…

734 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question