Solved

MySQL query for proximity scoring

Posted on 2011-09-14
9
323 Views
Last Modified: 2012-05-12
Hi all,

I'm trying to work through a particular challenge with SQL to get to an end result which will score proximity of word_ids. I've got this table:

CREATE TABLE IF NOT EXISTS `linkindex` (
  `link_id` int(10) unsigned NOT NULL,
  `word_id` mediumint(10) unsigned NOT NULL,
  `cat_id` tinyint(3) unsigned NOT NULL,
  `pos` tinyint(3) unsigned NOT NULL,
  PRIMARY KEY (`word_id`,`link_id`,`cat_id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

Open in new window


And I have this search query:

SELECT link_id, SUM(cat_id) AS score
				FROM linkindex
				WHERE word_id IN ('.$word_id_list.')
				GROUP BY link_id 
				ORDER BY score DESC
				LIMIT 0,16';

Open in new window


PHP feeds a series or word_ids into the IN (1,2,3,4,5,6) section.

As you can see,at the mo this query orders by the score on the cat_id column but I want to use the pos column to score proximity.

So a run through example - words 'one' 'two' are converted to id numbers 1 and 2.

These are then placed in the query to find matches on the word_id column in linkindex. The pos is the position of these words in the sentance when the index was first built up.

What I'm looking to do is apply some maths on the positions to find the 'distance' between the two matches with a view to building an overall relevancy score.

Apologies if I sound confusing! Any pointers on what I have here would be a massive help and I'm sure a good challenge for an SQL guru!
0
Comment
Question by:dolythgoe
  • 6
  • 3
9 Comments
 
LVL 24

Expert Comment

by:johanntagle
ID: 36539884
It would help if you can give a sample desired output.  Thanks.
0
 

Author Comment

by:dolythgoe
ID: 36540385
Hi,

Ok so here's some dummy data firs off:

word_id      link_id      cat_id      pos
1      1      5      1
3      1      5      2
5      1      5      3
2      1      5      4
6      2      5      1
8      2      5      2
11      2      5      3
2      2      5      4
1      3      5      1
2      3      5      2
5      3      5      3
8      3      5      4

The output I'm looking for is (in english terms):

"Find me the link_ids with occurences of word_id 1 or 2 and order by the lowest distance found between the 2 word_ids)

1st off, score more matches higher (so link_id's that have 1 and 2 in that case)
2nd - by looking at the pos column (which the position of the word in the sentence) score those higher where the word_ids are closer together.

So from the example above - link_id 3 is most relevant since it contians both 1 and 2 and the positions of those 2 word_ids is 1 and 2 in the sentance so 0 distance between them.

link_id 1 contains both word_ids too but the position is 1 and 4 respectively so not as good as link_id 3.

link_id 2 only contains word 2 so is bottom of the pile.

cat_id is just the weighting applied to that relationship - so whether the word_id is a tag or title for example. There are 3 of these in total 5, 3 and 1 - 5 being the title.

I would only like to score the proximity/distance on cat_id 5's but still include cat_ids 1 & 3 in the overall score which have a position of 0.

I've attached a view of a fuller table of data containing the cat_id 3 and 1 - you'll prob realise my query at the moment is only scoring on the SUM(cat_id) which is the weighting applied to the occurences (i.e. an occurence in the title carries a weight of 5). I've highlighted the word_ids and the areas which I want to take into account for scoring. The grey areas just distinguish the change in link_id.

 linkindex dummy data  
Pasted here as text here too:

word_id      link_id      cat_id      pos
1      1      5      1
3      1      5      2
5      1      5      3
2      1      5      4
2      1      3      0
3      1      3      0
1      1      3      0
4      1      1      0
15      1      1      0
2      1      1      0
6      2      5      1
8      2      5      2
11      2      5      3
2      2      5      4
1      2      3      0
3      2      3      0
11      2      1      0
45      2      1      0
1      3      5      1
2      3      5      2
5      3      5      3
8      3      5      4
22      3      3      0
21      3      3      0
33      3      1      0
1      3      1      0

The table with cat_id 5,3 and 1 contains 0 where a position is not appliceable - this can be changed to anything.

PRIMARY INDEX spans word_id, link_id and cat_id

So to recap, we have a score for matches on occurences through cat_id 5,3 and 1 (higher the better) and then want to look at the position of the cat_id 5's with a view to adding to the score based on the distance between the positions (lower the better).

Hope this makes sense - bit of a nasty one to explain!
0
 
LVL 24

Expert Comment

by:johanntagle
ID: 36540443
Looks interesting.  Unfortunately I'm busy at work now and can only look into this around 10 hours from now.  If nobody gives you an answer before I'll give it a try.
0
 

Author Comment

by:dolythgoe
ID: 36540468
No worries - I think this is an interesting challenge - I've been looking at STDDEV() and other things which might play a part...
0
How to run any project with ease

Manage projects of all sizes how you want. Great for personal to-do lists, project milestones, team priorities and launch plans.
- Combine task lists, docs, spreadsheets, and chat in one
- View and edit from mobile/offline
- Cut down on emails

 

Author Comment

by:dolythgoe
ID: 36540669
Oh there's one more thing that I'd like to add which I've forgotten to mention :/ (Should be awarding a million points for this!).

That is the order in which the words occur in the position, not just the proximity,

Searching 1,2 is different to 2,1.

So there's 3 main things I would like to score on this table:
- 1: Occurences of the word_ids
- 2: The order in which they occur as matched to the orginal input (i.g. pos 1,2 is a great match for query of word_id IN(1,2)
- 3. The proximity of the positions of the word_ids

Now if this doesn't give you a headache I'm not sure what will!!
0
 

Author Comment

by:dolythgoe
ID: 36540688
Sorry...1 final thing! It needs to be as fast as possible - the real table has 20 million+ rows!
0
 

Author Comment

by:dolythgoe
ID: 36900041
Think I have the proximity part sorted for this to affect score but not the order.

Will post if anyone's still interested..
0
 
LVL 24

Accepted Solution

by:
johanntagle earned 500 total points
ID: 36900810
Hi sorry I honestly forgot all about this.  Do post what you have.  Also, can I ask what you are doing this for?  It looks like it's for a search engine so I'm wondering why build one vs use something like Sphinx or Solr.  Thanks.
0
 

Author Comment

by:dolythgoe
ID: 36900843
Hehe, well it was mainly to get an understanding - kind of like homework! I wanted to see how far just sql can go and the limitations -  I'm using sphinx now as it's designed for the purpose but at least this project has helped me understand a few more things - Sphinx is a beast! Oh, if you have any experience on it, I posted a question relating to the installation across different servers if you're able to comment?

Cheers
0

Featured Post

How your wiki can always stay up-to-date

Quip doubles as a “living” wiki and a project management tool that evolves with your organization. As you finish projects in Quip, the work remains, easily accessible to all team members, new and old.
- Increase transparency
- Onboard new hires faster
- Access from mobile/offline

Join & Write a Comment

Suggested Solutions

As a database administrator, you may need to audit your table(s) to determine whether the data types are optimal for your real-world data needs.  This Article is intended to be a resource for such a task. Preface The other day, I was involved …
Introduction Since I wrote the original article about Handling Date and Time in PHP and MySQL (http://www.experts-exchange.com/articles/201/Handling-Date-and-Time-in-PHP-and-MySQL.html) several years ago, it seemed like now was a good time to updat…
This video discusses moving either the default database or any database to a new volume.
Get a first impression of how PRTG looks and learn how it works.   This video is a short introduction to PRTG, as an initial overview or as a quick start for new PRTG users.

760 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

20 Experts available now in Live!

Get 1:1 Help Now