Solved

How to design a ranking algorithm

Posted on 2011-09-14
14
310 Views
Last Modified: 2012-06-27
Hi,

I have several factors to consider when designing an algorithm for ordering thousands of hit records.

1. The number of hits on the urls;
2. The number of times the urls are shared among friends;
3. The number of recommendations on the urls;
4. The number of not-recommendations on the urls.

The score based on the above four factors should fall within (0, 1]. In addition, there is a weight for each of the factors, say 20%, 30%, 40% and 10%.

Please note that the number of hits is usually much larger than each of the other factors. I need to remove such skew from the score calculation.

Please advise!

Thanks
0
Comment
Question by:wsyy
14 Comments
 
LVL 19

Expert Comment

by:Bardobrave
ID: 36535790
It should be easy enough.

For each of your factors you will have a value, and you should store the maximum value seen for each of those parameters (so when a value is modified and surpasses the current maximum, the stored maximum is updated to reflect it).

Then, to compute a ranking for any element, compare the element's value to the current system-wide maximum for each factor; this gives you a value between 0 and 1 per factor. Then apply the weight to each obtained value and finally add them all.

The most difficult (and resource-consuming) part will be keeping the maximum values of all your factors up to date, but that will probably be cheaper than scanning for the maximum every time a ranking must be computed.
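A minimal sketch of this weighted max-normalization, assuming the illustrative weights from the question (20%/30%/40%/10%) and hypothetical field names:

```python
# Weights from the question; the factor names are hypothetical.
WEIGHTS = {"hits": 0.2, "shares": 0.3, "recs": 0.4, "not_recs": 0.1}

def score(url_stats, max_stats):
    """Weighted sum of each factor divided by its site-wide maximum.

    Each normalized factor lies in [0, 1], so the weighted sum
    lies in [0, 1] as well.
    """
    total = 0.0
    for factor, weight in WEIGHTS.items():
        maximum = max_stats.get(factor, 0)
        if maximum > 0:
            total += weight * (url_stats.get(factor, 0) / maximum)
    return total

# Example: a url with 100,000 hits and 100 shares against site-wide
# maxima of 2,500,000 hits and 25,000 shares.
s = score({"hits": 100_000, "shares": 100},
          {"hits": 2_500_000, "shares": 25_000})
```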
0
 

Author Comment

by:wsyy
ID: 36536029
Bardobrave:

I think you underestimate the difficulty, unfortunately.

For example, one url receives 100,000 hits over time but is shared by only 100 friends. Since the hit count far exceeds the share count, the weight of the hit factor no longer matters.

The way I see the issue is how to remove the skewness of the seemingly large hit number. Of course, the potential solution should address the skewness of the other factors as well.
0
 
LVL 19

Expert Comment

by:Bardobrave
ID: 36536771
Maybe I haven't explained myself clearly.

Suppose one url gets 100,000 hits and another gets 275,000. Both are shared by 100 friends.
And the biggest values on your site are from a url getting 2,500,000 hits and 25,000 sharings.

Then the values for your first url would be:
100,000 / 2,500,000 = 0.04; 0.04 * 0.2 = 0.008
100 / 25,000 = 0.004; 0.004 * 0.3 = 0.0012

And those of the second url:
275,000 / 2,500,000 = 0.11; 0.11 * 0.2 = 0.022
100 / 25,000 = 0.004; 0.004 * 0.3 = 0.0012

That gives partial results of 0.0092 and 0.0232 respectively. You see, the combined value is more than two and a half times greater for the second url because it has more than two and a half times as many hits, but shares return a value related directly to the maximum share count on your site.

This way, by comparing each parameter to its maximum possible value, you decouple the parameters from their relative sizes. In this example there are 2,750 times fewer sharings than hits, but in the overall result the sharing contribution is only about 18 times smaller.

Also, in this case you can use another factor, the relation between hits and sharings: in the example above, with the same sharings over far fewer hits, both urls get the same sharing value, although the url with fewer hits clearly deserves a better score on this part.

To solve this, replace the raw sharing count with sharings/hits as your measurement, and you'll get a much better result.
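The worked example above can be sketched as follows (the weights and maxima are taken from the example; the function names are mine):

```python
# Site-wide maxima from the example above.
MAX_HITS, MAX_SHARES = 2_500_000, 25_000

def partial_score(hits, shares):
    """Hits weighted 0.2 and shares weighted 0.3, each
    normalized by its site-wide maximum."""
    return 0.2 * hits / MAX_HITS + 0.3 * shares / MAX_SHARES

def share_ratio(hits, shares):
    """The sharings/hits fix: the same share count over
    fewer hits yields a higher value."""
    return shares / hits if hits else 0.0

a = partial_score(100_000, 100)   # first url,  ~0.0092
b = partial_score(275_000, 100)   # second url, ~0.0232
```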
0
 

Author Comment

by:wsyy
ID: 36537213
Thanks Bardobrave! It is much clearer now.

"And the biggest values on your site are from an url getting 2500000 hits and 25000 sharings."

I assume that the largest hit count and the largest share count can come from different urls, right?

Another question: what if correlation exists between any two factors?

Is there any handy calculation formula?
0
 
LVL 37

Expert Comment

by:TommySzalapski
ID: 36538043
Obviously there will be some correlation. A page that gets no hits will get very few recommendations too. You just have to decide what to put in for the weights and the correlations will sort themselves out.

If you want to be a bit more complex, you can use the ranking of the pages that link to the page in question to give more weight to them.

For example, if a high ranking page links to page x, that should count for more than if a useless page links to it. This prevents people from artificially inflating their own ranks by making lots of pages that link to it. (This is part of how Google does their page ranking).
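For reference, the idea Tommy describes is the textbook PageRank power iteration. A generic sketch (the graph, damping factor, and iteration count below are illustrative, not from this thread):

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank.

    links: dict mapping each page to the list of pages it links to.
    Returns a dict of page -> rank; ranks sum to 1.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets the base (1 - damping) / n share.
        new = {p: (1 - damping) / n for p in pages}
        for p, outgoing in links.items():
            if outgoing:
                # Split this page's rank evenly among its out-links.
                share = damping * rank[p] / len(outgoing)
                for q in outgoing:
                    new[q] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank

# Tiny illustrative graph: "a" is linked to by both "b" and "c".
ranks = pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]})
```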
0
 
LVL 19

Accepted Solution

by:
Bardobrave earned 125 total points
ID: 36541015
Of course those values can come from different urls. You record the maximum of each parameter so you can compare all possible values against it.

This way, every url compares each of its parameters to a known value, and none of them will yield a number greater than 1 (the url holding the maximum for a parameter scores exactly 1 on that parameter, though it will probably score less on the rest).

The simplest way to measure a relationship between two values is to divide the lesser by the greater (this is exactly what I proposed at the end of my second comment to measure the relation between hits and sharings). This way, the same value of one parameter yields a greater result when the related parameter is smaller (as you divide by it before applying the calculation).

In this case you need to compare your value (the result of the division) to the greatest such ratio among all of your recorded urls. If you simply divide the two greatest values you'll introduce a deviation, as they can come from different urls (as stated before), so the result of that division will probably differ from the maximum ratio obtained from the individual urls.
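A sketch of that ratio normalization, with hypothetical names; note that the maximum ratio is taken over individual urls rather than by dividing the two separate maxima:

```python
def normalized_ratio(shares, hits, max_ratio):
    """Normalize a url's shares/hits ratio against the largest
    such ratio observed on any single url."""
    if hits == 0 or max_ratio == 0:
        return 0.0
    return (shares / hits) / max_ratio

# (hits, shares) pairs for a few urls; the maximum ratio comes from
# one specific url, not from the separate hit and share maxima.
urls = [(100_000, 100), (275_000, 100), (2_500_000, 25_000)]
max_ratio = max(s / h for h, s in urls)  # here: 25,000 / 2,500,000

r = normalized_ratio(100, 100_000, max_ratio)
```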
0
 

Author Comment

by:wsyy
ID: 36553606
Bardobrave:

What about when the lesser and the greater occasionally switch?

You mentioned deviation, which I think is very likely in my case.

For example, some url receives an abnormal 100 million hits, say, while the rest receive less than a million each. How can I remove the impact of the abnormal 100 million hits without using a standard-deviation method?

The reason I am reluctant to apply a two- or three-standard-deviation cutoff is that I have to deal with a huge number of urls, say billions, and have to recalculate the ranking score every day.

Thanks.
0
 

Author Comment

by:wsyy
ID: 36553612
TommySzalapski:

Thanks for mentioning the page-rank issue. Unfortunately, I don't know an easy way to implement that algorithm, so it may not help solve the correlation.

Thanks
0
 
LVL 37

Assisted Solution

by:TommySzalapski
TommySzalapski earned 125 total points
ID: 36560913
You can calculate the average and standard deviation on the fly.
The average is easy: just keep a running sum and a count. Then, when you need an average, just divide.
Or (even better), if you want the average to be weighted more heavily toward recent days, use a sliding-window approach. Let's say you use 60 days as your window size.

For the first 60 days, track the average normally (maintain the sum and count and divide); after that, use the following:

newAvg = oldAvg*59/60 + hitsToday/60

It gives the overall average but adds a bit of weight to more recent data.

There is a similar way to maintain the standard deviation. The math/stat types will tell you that the square root of the average of the squares minus the square of the average equals the standard deviation. I'll explain.

Keep a running sum of the hits and of the squares of the hits:
avgSquare = (day1^2 + day2^2 + day3^2 + ... + dayN^2) / N
avg = (day1 + day2 + day3 + ... + dayN) / N

standardDeviation = squareRoot(avgSquare - avg^2)

(If you test in Excel, use STDEVP, not STDEV.) You can apply the same sliding-window technique there too.
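The running bookkeeping above can be sketched as follows (class and function names are mine; the sample data is illustrative). The formula gives the population standard deviation, matching Excel's STDEVP:

```python
import math

class RunningStats:
    """Streaming mean and population standard deviation using
    a running sum and a running sum of squares."""

    def __init__(self):
        self.n = 0
        self.total = 0.0
        self.total_sq = 0.0

    def add(self, x):
        self.n += 1
        self.total += x
        self.total_sq += x * x

    def mean(self):
        return self.total / self.n

    def stddev(self):
        # sqrt(average of squares - square of average)
        avg = self.mean()
        return math.sqrt(self.total_sq / self.n - avg * avg)

def sliding_avg(old_avg, hits_today, window=60):
    """The exponentially weighted 'sliding window' update above."""
    return old_avg * (window - 1) / window + hits_today / window

stats = RunningStats()
for hits in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.add(hits)
# stats.mean() -> 5.0, stats.stddev() -> 2.0 for this sample
```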
0
 

Author Comment

by:wsyy
ID: 36937873
Sorry, I should have given Bardobrave half of the points.
0
 
LVL 19

Expert Comment

by:Bardobrave
ID: 36938289
You can request attention from a moderator and ask them to split the points.
0
 
LVL 37

Expert Comment

by:TommySzalapski
ID: 36938773
I put in the request for attention. They should reopen the question soon so you can close it the way you meant to.
0
 

Author Comment

by:wsyy
ID: 36940583
Great help. Thanks for all!
0
