How to design a ranking algorithm


I have several factors to consider when designing an algorithm for ordering thousands of hit records.

1. The number of hits on the URL;
2. The number of times the URL is shared among friends;
3. The number of recommendations on the URL;
4. The number of non-recommendations on the URL.

The score based on the above four factors should fall within (0, 1]. In addition, each factor has a weight, say 20%, 30%, 40%, and 10%.

Please note that the number of hits is usually much larger than the count for each of the other factors. I need to remove this skew in the score calculation.

Please advise!

Bardobrave Commented:
Of course those maximum values can come from different URLs. You record the maximum of each parameter and compare all URLs' values against that maximum.

This way, every URL compares each of its parameters against the corresponding maximum, and none of them will obtain a number greater than 1 (the URL that holds the maximum for a parameter scores exactly 1 on that parameter, although it will probably score less on the rest).

The simplest way to measure a relationship between two values is to divide the lesser by the greater (this is exactly what I propose at the end of my second comment to measure the relation between hits and sharings). This way, the same share count produces a greater value when the related hit count is smaller (as you divide by the hit count before applying the calculation).

In this case you need to compare your value (the result of the division) to the greatest result of that division across all of your recorded URLs. If you simply divide the two greatest raw values you'll introduce a deviation, because they can come from different URLs (as stated before), so the result of that division will probably differ from the maximum ratio obtained from any single URL.
It should be easy enough.

For each of your factors you will have a value, and you should store the maximum value of each of those parameters somewhere (so when a value is modified and surpasses one of those maximums, the stored maximum is updated to reflect the new one).

Then, to rank any element, you compare the element's value to the current system-wide maximum for each factor; this returns a value between 0 and 1 per factor. Then you apply the weight to each normalized value and finally add them all up.

The most difficult (and resource-consuming) part will be keeping the maximum values of all your factors up to date, but that will probably still be cheaper than scanning for the maximum every time a ranking must be computed.
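The steps above could be sketched roughly like this. The factor names and weights come from the question; the data structures and the sample maximums are illustrative assumptions:

```python
# Sketch of the max-normalization ranking described above.
# Weights are the 20/30/40/10 split from the question; the
# dictionary keys are made-up names for the four factors.

WEIGHTS = {"hits": 0.2, "shares": 0.3, "recs": 0.4, "not_recs": 0.1}

def rank_score(url_stats, max_stats):
    """Normalize each factor by the site-wide maximum, then take
    the weighted sum. The result lies in [0, 1]."""
    score = 0.0
    for factor, weight in WEIGHTS.items():
        maximum = max_stats[factor]
        normalized = url_stats[factor] / maximum if maximum else 0.0
        score += weight * normalized
    return score

# Example with assumed site-wide maximums:
max_stats = {"hits": 2_500_000, "shares": 25_000, "recs": 1_000, "not_recs": 100}
url = {"hits": 100_000, "shares": 100, "recs": 10, "not_recs": 1}
score = rank_score(url, max_stats)
```

A URL that holds every maximum would score exactly 1; all others land strictly below it.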
wsyy (Author) Commented:

I think you underestimate the difficulty, unfortunately.

For example, one URL receives 100,000 hits over time but is only shared by 100 friends. Since the hit count exceeds the share count by so much, the weight on the hit factor no longer matters.

The way I see the issue is how to remove the skewness of the seemingly huge hit numbers. Of course, the solution should address skewness in the other factors as well.
Bardobrave Commented:

Maybe I've not explained myself clearly.

Suppose one URL gets 100000 hits and another one 275000 hits. Both are shared by 100 friends.
And the biggest values on your site are from a URL getting 2500000 hits and 25000 sharings.

Then the values for your first URL would be:
100000 / 2500000 = 0.04 * 0.2 = 0.008
100 / 25000 = 0.004 * 0.3 = 0.0012

And those of the second URL would be:
275000 / 2500000 = 0.11 * 0.2 = 0.022
100 / 25000 = 0.004 * 0.3 = 0.0012

That gives partial results of 0.0092 and 0.0232 respectively. You see... the value is roughly two and a half times greater for the second URL because it has 2.75 times more hits, while the sharing term relates directly to the maximum share count on your site.

This way, by comparing each parameter to its maximum possible value, you decouple the parameters from their relative sizes. In this example the second URL has 2750 times fewer sharings than hits, but in the weighted result the sharing term is only about 18 times smaller (0.0012 vs. 0.022).

Also, in this case you can use another factor: the relation between hits and sharings. In the example above, both URLs get the same sharing value despite one having far fewer hits, although clearly the URL with fewer hits should score better on this part.

To solve this, use sharings/hits as your sharing measurement instead of the raw count, and you'll get a much better result.
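A sketch of the worked example above, plus the sharings/hits refinement from the last paragraph. The weights (0.2 for hits, 0.3 for shares) and the site-wide maximums are the ones used in the example; `MAX_RATIO` is an assumed site-wide maximum of the shares/hits ratio:

```python
# Hits and shares terms only, normalized by the site-wide maximums
# from the worked example above.

MAX_HITS = 2_500_000
MAX_SHARES = 25_000
MAX_RATIO = 0.01  # assumed: best shares/hits ratio among all recorded URLs

def partial_score(hits, shares):
    """Raw share count, normalized by the site-wide maximum."""
    return 0.2 * (hits / MAX_HITS) + 0.3 * (shares / MAX_SHARES)

def partial_score_ratio(hits, shares):
    """Same, but the sharing term uses the shares/hits ratio, so an
    equal share count scores higher when the hit count is lower."""
    return 0.2 * (hits / MAX_HITS) + 0.3 * ((shares / hits) / MAX_RATIO)
```

With the numbers from the comment, `partial_score(100000, 100)` reproduces 0.0092 and `partial_score(275000, 100)` reproduces 0.0232; with the ratio variant, the URL with fewer hits gets the larger sharing term, as intended.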
wsyy (Author) Commented:
Thanks Bardobrave! It is much clearer now.

"And the biggest values on your site are from an url getting 2500000 hits and 25000 sharings."

I assume that the largest hit count and the largest share count can come from different URLs, right?

Another question: what if correlation exists between any two factors?

Is there any handy calculation formula?
TommySzalapski Commented:
Obviously there will be some correlation. A page that gets no hits will get very few recommendations too. You just have to decide what to put in for the weights and the correlations will sort themselves out.

If you want to be a bit more complex, you can use the ranking of the pages that link to the page in question to give more weight to them.

For example, if a high ranking page links to page x, that should count for more than if a useless page links to it. This prevents people from artificially inflating their own ranks by making lots of pages that link to it. (This is part of how Google does their page ranking).
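The link-weighting idea is, as the comment notes, part of how Google's PageRank works. A minimal power-iteration sketch of it (not the poster's algorithm; the graph, iteration count, and the commonly quoted 0.85 damping factor are illustrative):

```python
# Minimal power-iteration sketch: a link from a high-ranking page
# counts for more than a link from a low-ranking one.
# `links[u]` lists the pages that u links to. Assumes every page has
# at least one outgoing link (dangling pages would leak rank mass).

def page_rank(links, damping=0.85, iterations=50):
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start uniform
    for _ in range(iterations):
        new = {p: (1 - damping) / n for p in pages}
        for page, outgoing in links.items():
            # Each page splits its current rank evenly among its links.
            share = damping * rank[page] / len(outgoing)
            for target in outgoing:
                new[target] += share
        rank = new
    return rank

# Tiny example graph: "a" is linked from both "b" and "c".
ranks = page_rank({"a": ["b"], "b": ["a", "c"], "c": ["a"]})
```

In the example graph, "a" ends up ranked highest because it collects links from both other pages, which is exactly the effect described above.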
wsyy (Author) Commented:

What about when the lesser and the greater occasionally switch?

You mentioned deviation, which I think is very likely in my case.

E.g., some URL receives an abnormal 100 million hits, while the rest receive less than a million each. How can I remove the impact of the abnormal 100 million hits without using the standard deviation method?

The reason I am reluctant to apply the two- or three-standard-deviation cutoff is that I have to deal with a huge number of URLs, say billions, and have to recalculate the ranking score every day.

wsyy (Author) Commented:

Thanks for mentioning the PageRank issue. Unfortunately, I don't know an easy way to implement that algorithm, so it may not help me solve the correlation problem.

TommySzalapski Commented:
You can calculate average and standard deviation on the fly.
Average is easy: just keep a running sum and a count. Then, when you need an average, just divide.
Or (even better), if you want the average to be more heavily weighted toward recent days, use a sliding-window approach. Let's say you use 60 days as your window size.

For the first 60 days, track the average normally (maintain sum and count, and divide); then after that use the following:

NewAvg = oldAvg*59/60 + hitsToday/60

It gives the overall average but adds a bit of weight to more recent data.
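The update rule above, as code. The 60-day window size is the one from the comment; note that after the warm-up period this is an exponentially weighted average, with the old average keeping 59/60 of its weight each day:

```python
# The 60-day update rule from the comment. Each day the old average
# keeps (WINDOW - 1)/WINDOW of its weight and today's hits
# contribute 1/WINDOW, so recent days count slightly more.

WINDOW = 60  # window size in days

def update_average(old_avg, hits_today):
    return old_avg * (WINDOW - 1) / WINDOW + hits_today / WINDOW
```

One property worth noting: a single spike moves the average only by spike/60, so a one-day anomaly is naturally damped.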

There is also a similar way to maintain the standard deviation. Any math/stats person will tell you that the square root of the average of the squares minus the square of the average equals the standard deviation. I'll explain.

Keep a running sum of the hits and the squares of the hits.
avgSquare = (day1^2 + day2^2 + day3^2 + ... + dayN^2) / N
avg = (day1 + day2 + day3 + ... + dayN) / N

standardDeviation = squareRoot(avgSquare - avg^2)

(If you test in Excel, use STDEVP, not STDEV.) You can do the same sliding-window trick with these sums too.
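A sketch of the running sums this describes. Only a count and two totals are kept; the class name and structure are mine, the identity is the one quoted above:

```python
# Running mean and standard deviation from two running sums, using
# the identity sd = sqrt(mean of squares - square of mean).
# Nothing is stored per day except the count and the two totals.
import math

class RunningStats:
    def __init__(self):
        self.n = 0
        self.total = 0.0
        self.total_sq = 0.0

    def add(self, hits):
        self.n += 1
        self.total += hits
        self.total_sq += hits * hits

    def mean(self):
        return self.total / self.n

    def stdev(self):
        # Population standard deviation (Excel's STDEVP); max() guards
        # against tiny negative values from floating-point rounding.
        return math.sqrt(max(self.total_sq / self.n - self.mean() ** 2, 0.0))
```

This matches Python's `statistics.pstdev`, which, like STDEVP, is the population form of the standard deviation.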
wsyy (Author) Commented:
Sorry, I should have given Bardobrave half of the points.
You can request attention from a moderator and ask them to split the points.
I put in the request for attention. They should reopen it soon, and you can close it the way you meant to.
wsyy (Author) Commented:
Great help. Thanks for all!
Question has a verified solution.
