How to design a ranking algorithm


I have several factors to consider when designing an algorithm for ordering thousands of hit records.

1. The number of hits on the URL;
2. The number of times the URL is shared among friends;
3. The number of recommendations on the URL;
4. The number of non-recommendations on the URL.

The score based on the above four factors should fall within (0, 1]. In addition, each factor has a weight, say 20%, 30%, 40%, and 10%.

Please note that the number of hits is usually much larger than the count for each of the other factors. I need to remove this skew in the score calculation.

Please advise!

Bardobrave Commented:
Of course those maximum values can come from different URLs. You record the maximum of each parameter and compare all URLs' values against that maximum.

This way, every URL compares each of its parameters against the corresponding maximum, and none of them will obtain a number greater than 1 (the URL that holds the maximum for a parameter scores exactly 1 on that parameter, although it will probably score less on the rest).

The simplest way to measure a relationship between two values is to divide the lesser by the greater (this is exactly what I propose at the end of my second comment to measure the relation between hits and sharings). This way, the same share count produces a greater value when the related hit count is smaller (as you divide by the hit count before applying the calculation).

In this case you need to compare your value (the result of the division) to the greatest result of that division across all of your recorded URLs. If you simply divide the two greatest raw values you'll introduce a deviation, because they can come from different URLs (as stated before), so the result of that division will probably differ from the maximum ratio obtained from any single URL.
It should be easy enough.

For each of your factors you will have a value, and you should store the maximum value of each of those parameters somewhere (so when a value is modified and surpasses one of those maximums, the stored maximum is updated to reflect the new one).

Then, to rank any element, you compare the element's value to the current system-wide maximum for each factor; this returns a value between 0 and 1 per factor. Then you apply the weight to each normalized value and finally add them all up.

The most difficult (and resource-consuming) part will be keeping the maximum values of all your factors up to date, but that will probably still be cheaper than scanning for the maximum every time a ranking must be computed.
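The steps above could be sketched roughly like this. The factor names and weights come from the question; the data structures and the sample maximums are illustrative assumptions:

```python
# Sketch of the max-normalization ranking described above.
# Weights are the 20/30/40/10 split from the question; the
# dictionary keys are made-up names for the four factors.

WEIGHTS = {"hits": 0.2, "shares": 0.3, "recs": 0.4, "not_recs": 0.1}

def rank_score(url_stats, max_stats):
    """Normalize each factor by the site-wide maximum, then take
    the weighted sum. The result lies in [0, 1]."""
    score = 0.0
    for factor, weight in WEIGHTS.items():
        maximum = max_stats[factor]
        normalized = url_stats[factor] / maximum if maximum else 0.0
        score += weight * normalized
    return score

# Example with assumed site-wide maximums:
max_stats = {"hits": 2_500_000, "shares": 25_000, "recs": 1_000, "not_recs": 100}
url = {"hits": 100_000, "shares": 100, "recs": 10, "not_recs": 1}
score = rank_score(url, max_stats)
```

A URL that holds every maximum would score exactly 1; all others land strictly below it.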
wsyy (Author) Commented:

I think you underestimate the difficulty, unfortunately.

For example, one URL receives 100,000 hits over time but is only shared by 100 friends. Since the hit count exceeds the share count by so much, the weight on the hit factor no longer matters.

The way I see the issue is how to remove the skewness of the seemingly huge hit numbers. Of course, the solution should address skewness in the other factors as well.
Bardobrave Commented:

Maybe I've not explained myself clearly.

Suppose one URL gets 100000 hits and another one 275000 hits. Both are shared by 100 friends.
And the biggest values on your site are from a URL getting 2500000 hits and 25000 sharings.

Then the values for your first URL would be:
100000 / 2500000 = 0.04 * 0.2 = 0.008
100 / 25000 = 0.004 * 0.3 = 0.0012

And those of the second URL would be:
275000 / 2500000 = 0.11 * 0.2 = 0.022
100 / 25000 = 0.004 * 0.3 = 0.0012

That gives partial results of 0.0092 and 0.0232 respectively. You see... the value is roughly two and a half times greater for the second URL because it has 2.75 times more hits, while the sharing term relates directly to the maximum share count on your site.

This way, by comparing each parameter to its maximum possible value, you decouple the parameters from their relative sizes. In this example the second URL has 2750 times fewer sharings than hits, but in the weighted result the sharing term is only about 18 times smaller (0.0012 vs. 0.022).

Also, in this case you can use another factor: the relation between hits and sharings. In the example above, both URLs get the same sharing value despite one having far fewer hits, although clearly the URL with fewer hits should score better on this part.

To solve this, use sharings/hits as your sharing measurement instead of the raw count, and you'll get a much better result.
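A sketch of the worked example above, plus the sharings/hits refinement from the last paragraph. The weights (0.2 for hits, 0.3 for shares) and the site-wide maximums are the ones used in the example; `MAX_RATIO` is an assumed site-wide maximum of the shares/hits ratio:

```python
# Hits and shares terms only, normalized by the site-wide maximums
# from the worked example above.

MAX_HITS = 2_500_000
MAX_SHARES = 25_000
MAX_RATIO = 0.01  # assumed: best shares/hits ratio among all recorded URLs

def partial_score(hits, shares):
    """Raw share count, normalized by the site-wide maximum."""
    return 0.2 * (hits / MAX_HITS) + 0.3 * (shares / MAX_SHARES)

def partial_score_ratio(hits, shares):
    """Same, but the sharing term uses the shares/hits ratio, so an
    equal share count scores higher when the hit count is lower."""
    return 0.2 * (hits / MAX_HITS) + 0.3 * ((shares / hits) / MAX_RATIO)
```

With the numbers from the comment, `partial_score(100000, 100)` reproduces 0.0092 and `partial_score(275000, 100)` reproduces 0.0232; with the ratio variant, the URL with fewer hits gets the larger sharing term, as intended.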
wsyy (Author) Commented:
Thanks Bardobrave! It is much clearer now.

"And the biggest values on your site are from an url getting 2500000 hits and 25000 sharings."

I assume that the largest hit count and the largest share count can come from different URLs, right?

Another question: what if correlation exists between any two factors?

Is there any handy calculation formula?
TommySzalapski Commented:
Obviously there will be some correlation. A page that gets no hits will get very few recommendations too. You just have to decide what to put in for the weights and the correlations will sort themselves out.

If you want to be a bit more complex, you can use the ranking of the pages that link to the page in question to give more weight to them.

For example, if a high ranking page links to page x, that should count for more than if a useless page links to it. This prevents people from artificially inflating their own ranks by making lots of pages that link to it. (This is part of how Google does their page ranking).
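The link-weighting idea is, as the comment notes, part of how Google's PageRank works. A minimal power-iteration sketch of it (not the poster's algorithm; the graph, iteration count, and the commonly quoted 0.85 damping factor are illustrative):

```python
# Minimal power-iteration sketch: a link from a high-ranking page
# counts for more than a link from a low-ranking one.
# `links[u]` lists the pages that u links to. Assumes every page has
# at least one outgoing link (dangling pages would leak rank mass).

def page_rank(links, damping=0.85, iterations=50):
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start uniform
    for _ in range(iterations):
        new = {p: (1 - damping) / n for p in pages}
        for page, outgoing in links.items():
            # Each page splits its current rank evenly among its links.
            share = damping * rank[page] / len(outgoing)
            for target in outgoing:
                new[target] += share
        rank = new
    return rank

# Tiny example graph: "a" is linked from both "b" and "c".
ranks = page_rank({"a": ["b"], "b": ["a", "c"], "c": ["a"]})
```

In the example graph, "a" ends up ranked highest because it collects links from both other pages, which is exactly the effect described above.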
wsyy (Author) Commented:

What about when the lesser and the greater occasionally switch?

You mentioned deviation, which I think is very likely in my case.

E.g., some URL receives an abnormal 100 million hits, while the rest receive less than a million each. How can I remove the impact of the abnormal 100 million hits without using the standard deviation method?

The reason I am reluctant to apply the two- or three-standard-deviation cutoff is that I have to deal with a huge number of URLs, say billions, and have to recalculate the ranking score every day.

wsyy (Author) Commented:

Thanks for mentioning the PageRank issue. Unfortunately, I don't know an easy way to implement that algorithm, so it may not help me solve the correlation problem.

TommySzalapski Commented:
You can calculate average and standard deviation on the fly.
Average is easy: just keep a running sum and a count. Then, when you need an average, just divide.
Or (even better), if you want the average to be more heavily weighted toward recent days, use a sliding-window approach. Let's say you use 60 days as your window size.

For the first 60 days, track the average normally (maintain sum and count, and divide); then after that use the following:

NewAvg = oldAvg*59/60 + hitsToday/60

It gives the overall average but adds a bit of weight to more recent data.
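The update rule above, as code. The 60-day window size is the one from the comment; note that after the warm-up period this is an exponentially weighted average, with the old average keeping 59/60 of its weight each day:

```python
# The 60-day update rule from the comment. Each day the old average
# keeps (WINDOW - 1)/WINDOW of its weight and today's hits
# contribute 1/WINDOW, so recent days count slightly more.

WINDOW = 60  # window size in days

def update_average(old_avg, hits_today):
    return old_avg * (WINDOW - 1) / WINDOW + hits_today / WINDOW
```

One property worth noting: a single spike moves the average only by spike/60, so a one-day anomaly is naturally damped.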

There is also a similar way to maintain the standard deviation. Any math/stats person will tell you that the square root of the average of the squares minus the square of the average equals the standard deviation. I'll explain.

Keep a running sum of the hits and the squares of the hits.
avgSquare = (day1^2 + day2^2 + day3^2 + ... + dayN^2) / N
avg = (day1 + day2 + day3 + ... + dayN) / N

standardDeviation = squareRoot(avgSquare - avg^2)

(If you test in Excel, use STDEVP, not STDEV.) You can do the same sliding-window trick with these sums too.
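A sketch of the running sums this describes. Only a count and two totals are kept; the class name and structure are mine, the identity is the one quoted above:

```python
# Running mean and standard deviation from two running sums, using
# the identity sd = sqrt(mean of squares - square of mean).
# Nothing is stored per day except the count and the two totals.
import math

class RunningStats:
    def __init__(self):
        self.n = 0
        self.total = 0.0
        self.total_sq = 0.0

    def add(self, hits):
        self.n += 1
        self.total += hits
        self.total_sq += hits * hits

    def mean(self):
        return self.total / self.n

    def stdev(self):
        # Population standard deviation (Excel's STDEVP); max() guards
        # against tiny negative values from floating-point rounding.
        return math.sqrt(max(self.total_sq / self.n - self.mean() ** 2, 0.0))
```

This matches Python's `statistics.pstdev`, which, like STDEVP, is the population form of the standard deviation.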
wsyy (Author) Commented:
Sorry, I should have given Bardobrave half of the points.
You can request attention from a moderator and ask them to split the points.
I put in the request for attention. They should reopen it soon, and you can close it the way you meant to.
wsyy (Author) Commented:
Great help. Thanks for all!
Question has a verified solution.
