Still celebrating National IT Professionals Day with 3 months of free Premium Membership. Use Code ITDAY17

x
Solved

How design a ranking algorithsm

Posted on 2011-09-14
Medium Priority
328 Views
Hi,

I have a few many factors to consider when designing algorithms for ordering thousands of hit records.

1. The number of hits on the urls;
2. The number of the urls being shared among friends;
3. The number of recommendations on the urls;
4. The number of not-recommendations on the urls.

The score based on the above four factors should be converged into (0, 1]. In addition, there is a weight for each of the factors, say 20%, 30%, 40% and 10%.

Note please that the number of hits mostly is much larger than the number of each of the rest factors. I need to remove such skews in the score calculation.

Thanks
0
Question by:wsyy
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

• Help others & share knowledge
• Earn cash & points
• 6
• 4
• 3

LVL 19

Expert Comment

ID: 36535790
It should be easy enough.

For each of your factors you will have a value, and you should have stored the maximum value of any of those parameters anywhere (so when a value is modified and surpass one of those maximum values, it should be updated to reflect the new maximum).

Then, to achieve a ranking for any element, you compare your element value to the current maximum value on the system in each factor, this will return you a value between 0 and 1 for each factor. Then you apply the weight for each obtained value and finally add them all.

The most difficult (and resource consuming) part will be to maintain updated maximum values on all of your factors, but probably it will be better than checking maximum value each time a ranking must be done.
0

Author Comment

ID: 36536029
Bardobrave:

Thought you underestimate the difficulty unfortunately.

For example, one url receives 100,000 hits over time, but only is shared by 100 friends. Since the hit number exceeds the shared time a lot, the weight of the hit factor doesn't matter any more.

The way I am seeing to the issue is how to remove the skewness of the seemly large hit number. Of course, the potential solution should address the skewness of other factors as well.
0

LVL 19

Expert Comment

ID: 36536771
Maybe I've not explained myself clearly.

Supose an Url get 100000 hits and another one 275000 hits. Both are shared by 100 friends.
And the biggest values on your site are from an url getting 2500000 hits and 25000 sharings.

Then the values in your first web should be:
100000 / 2500000 = 0.04 * 0.2 = 0.008
100 / 25000 = 0.004 * 0.3 = 0.0012

And those of the second url will be:
275000 / 2500000 = 0.11 * 0.2 = 0.022
100 / 25000 = 0.004 * 0.3 = 0.0012

For partial results of 0.0092 and 0.0232 respectively. You see... here the values are more than two and a half times greater on the second url because it has more than two and a half times more hits, but shares return a value related directly to the max value of the shares on your site.

This way, comparing each parameter to it's max possible value, you separate them from it's relative size. In this example you have 2750 times less sharings than hits, but on the overall result, the sharing value is only 22 times lesser.

Also, in this case you can refer to another factor, the relation between hits and sharings, as in the previous example with same sharings over much lesser hits, both urls get same sharing value, although clearly the web with less hits should have a better score in this part.

To solve this replace the sharing value with sharings/hits as your measurement, and you'll have a much better result.
0

Author Comment

ID: 36537213
Thanks Bardobrave! it is much clearer now.

"And the biggest values on your site are from an url getting 2500000 hits and 25000 sharings."

I assume that the largest hits and the largest sharings can come from different urls respectively, right?

Another question: what if correlation exists between any two factors?

Is there any handy calculation formula?
0

LVL 37

Expert Comment

ID: 36538043
Obviously there will be some correlation. A page that gets no hits will get very few recommendations too. You just have to decide what to put in for the weights and the correlations will sort themselves out.

If you want to be a bit more complex, you can use the ranking of the pages that link to the page in question to give more weight to them.

For example, if a high ranking page links to page x, that should count for more than if a useless page links to it. This prevents people from artificially inflating their own ranks by making lots of pages that link to it. (This is part of how Google does their page ranking).
0

LVL 19

Accepted Solution

Bardobrave earned 500 total points
ID: 36541015
Of course those values can come from different urls. You record the maximum in each parameter to compare all possible values to this maximum.

This way, any url will compare all it's parameters to a value, and no one of them will obtain a number greater than 1 (this is the result that the url with the maximum value in any parameter will obtain on this parameter, although probably it will obtain lesser values on the rest of parameters).

The simplest way to measure a relationship between two values is to divide the lesser by the greater (this is exactly what I propose on the end of my second comment to measure a relation between hits and sharings). This way you can make that same value on a determinate parameter will throw a greater value if the related parameter is lesser (as you divide by the last one before applying the calculation).

In this case you need to compare your value (the result of the division) to the greatest result of that division in any of your recorded urls. If you simply divide the two greatest values you'll obtain a deviation, as both of them can come from different urls (as stated before), so this way probably the result of the division will be different than the maximum result obtained from the singular urls.
0

Author Comment

ID: 36553606
Bardobrave:

what about the lesser and the greater occasionally switch?

you mentioned deviation, which can be very likely in my opinion.

i.e. some url receives abnormal hits of 100 millions say, while the rest receive less than a million each. How can I remove the impact of the abnormal 100 million hits w/o using standard deviation method?

the reason why I am reluctant to apply the twice and three time SD is that I have to deal with a large number of urls, say billions, and have to calculate the ranking score every day.

thanks.
0

Author Comment

ID: 36553612
TommySzalapski:

Thanks for mentioning the page rank issue. Unfortunately, I don't know an easy way to implement the algorithms, so that may not be helpful to solve the correlation.

Thanks
0

LVL 37

Assisted Solution

TommySzalapski earned 500 total points
ID: 36560913
You can calculate average and standard deviation on the fly.
Average is easy, just keep a running sum and a count. Then, when you need and average, just divide.
Or (even better) if you want the average to be more heavily weighted for recent days, use a sliding window approach. Let's say you use 60 days as your window size

For the first 60 days, track the average normally (maintain sum and count and divide) then after that use the following:

NewAvg = oldAvg*59/60 + hitsToday/60

It gives the overall average but adds a bit of weight to more recent data.

There is also a similar way to maintain the standard deviation. All the math/stat types will tell you that the square root of the average of squares minus the square of averages is the same as the standard deviation. I'll explain.

Keep a running sum of the hits and the squares of the hits.
avgSquare = ( day1^2 + day2^2 + day3^2 ... +dayN^2) / N
avg =  ( day1 + day2 + day3 ... +dayN) / N

standard deviation = sqareRoot(avgSquare - avg^2)

(If you test in Excel use stdevp not stddev) You can do the same sliding window thing with them too.
0

Author Comment

ID: 36937873
Sorry should have given Bardobrave half of the points.
0

LVL 19

Expert Comment

ID: 36938289
You can request for attention a manager and ask him to share the points.
0

LVL 37

Expert Comment

ID: 36938773
I put in the request for attention. They should open it back up soon and you can close it the way you meant to.
0

Author Comment

ID: 36940583
Great help. Thanks for all!
0

Featured Post

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

This algorithm (in C#) will resize any image down to a given size while maintaining the original aspect ratio. The maximum width and max height are both optional but if neither are given, the original image is returned. This example is designed tâ€¦
The greatest common divisor (gcd) of two positive integers is their largest common divisor. Let's consider two numbers 12 and 20. The divisors of 12 are 1, 2, 3, 4, 6, 12 The divisors of 20 are 1, 2, 4, 5, 10 20 The highest number among the câ€¦
I've attached the XLSM Excel spreadsheet I used in the video and also text files containing the macros used below. https://filedb.experts-exchange.com/incoming/2017/03_w12/1151775/Permutations.txt https://filedb.experts-exchange.com/incoming/201â€¦
Do you want to know how to make a graph with Microsoft Access? First, create a query with the data for the chart. Then make a blank form and add a chart control. This video also shows how to change what data is displayed on the graph as well as formâ€¦
Suggested Courses
Course of the Month9 days, 5 hours left to enroll