remlabinc


File Popularity Equation

The objective is to build a better scoring system for my file-serving application. We have over 1 million files, and about 1k files are added every day. We also serve about 7k-10k downloads per day.

I would like to take the following into consideration when writing the equation.

votes_like, votes_dislike, time_added, time_last_download, total_downloads, bookmarks(sum of people who bookmarked this file)

I think the values are self-explanatory. Some form of adaptive aging based on the timestamps is the direction I would like to head.

I have looked at the Wilson score interval ("Wilson's Score"), and I think I could adapt it with some help.
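For reference, here is a minimal sketch of the lower bound of the Wilson score interval applied to votes_like and votes_dislike. The function name, the choice of z = 1.96 (roughly 95% confidence), and returning 0.0 for files with no votes are assumptions made for illustration, not something stated in the thread.

```python
import math


def wilson_lower_bound(likes: int, dislikes: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for the 'like' proportion.

    z = 1.96 corresponds to roughly 95% confidence (an assumed default).
    Files with no votes get 0.0, which is also an assumption.
    """
    n = likes + dislikes
    if n == 0:
        return 0.0
    p = likes / n                      # observed fraction of likes
    z2 = z * z
    centre = p + z2 / (2 * n)
    margin = z * math.sqrt((p * (1 - p) + z2 / (4 * n)) / n)
    return (centre - margin) / (1 + z2 / n)


if __name__ == "__main__":
    # Same 90% like ratio, but more votes gives a higher (more trusted) score.
    print(wilson_lower_bound(9, 1))
    print(wilson_lower_bound(90, 10))
```

Because the bound only changes when a vote is cast, it can be recomputed on each vote rather than on a timer.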

One constraint: it cannot be an update that runs every second for each file. That would be a waste of resources.
aburr

popularity = variable * your weight * (current time - time stamp) + [repeat for all the other variables]
--
Where
"variable" is one from your list
"your weight" is your estimate of the importance of that variable, perhaps a number from 1 to 10. This is the most important (and subjective) part of the equation. It could also be negative.
"current time" is the time the data is collected
"time stamp" is the least defined part of the equation, and it is something you have to decide. It is the aging factor you wished to include and depends entirely on your particulars.
ASKER CERTIFIED SOLUTION
Markus Fischer

This solution is only available to members.
For "votes_like, votes_dislike" I would just use a "stars" system. One star is boring, five stars is wonderful. The scale runs from 0 to 5, where zero means no vote was cast. An arithmetic mean of the star ratings makes sense as a measure of popularity.
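As a rough sketch of that mean: the function below averages the star ratings for one file. Skipping the zeros (so that "no vote" does not drag the average down) and returning 0.0 when nobody has voted are assumptions on my part, and the function name is made up for the example.

```python
from typing import Sequence


def mean_stars(ratings: Sequence[int]) -> float:
    """Arithmetic mean of star ratings on a 0-5 scale.

    A rating of 0 means no vote was cast, so it is skipped here
    (an assumption). Returns 0.0 when no votes were cast at all.
    """
    cast = [r for r in ratings if r != 0]
    if not cast:
        return 0.0
    return sum(cast) / len(cast)


if __name__ == "__main__":
    print(mean_stars([5, 0, 3, 4]))  # the 0 is ignored
```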

The fields "time_added, time_last_download, total_downloads" all make sense to me, but "bookmarks" does not. If I downloaded a file but did not bookmark it, I might still love it, and what use would it be to me to come back and bookmark it? So I would just ignore this field, and you might find your clients ignoring it, too.

A score decay algorithm might use a query that assigns factors to timeframes. For example, if the score was computed over the last 7 days, the factor is 1. If the score was computed over the period from 7 to 14 days ago, the factor is 0.8, etc. This would give you a scoring mechanism that emphasizes current popularity. As things move back in time, their relative popularity for current queries decreases.
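A minimal sketch of that bucketed decay follows. Only the first two factors (1 and 0.8) come from the description above. The remaining factors, the 0.0 cutoff after five weeks, and the treatment of exactly 7 days as the second bucket are assumptions, as are the function names and the (age_days, score) input shape.

```python
from typing import Iterable, Tuple

# Factor per 7-day bucket. The first two values come from the description;
# the rest are assumed for illustration.
WEEKLY_FACTORS = [1.0, 0.8, 0.6, 0.4, 0.2]


def decay_factor(age_days: float) -> float:
    """Factor for activity that happened age_days ago."""
    week = int(max(age_days, 0) // 7)   # 0 for under 7 days, 1 for 7 to under 14, ...
    if week >= len(WEEKLY_FACTORS):
        return 0.0                      # assumed cutoff for old activity
    return WEEKLY_FACTORS[week]


def decayed_score(activity: Iterable[Tuple[float, float]]) -> float:
    """Sum of (age_days, score) pairs, each scaled by its bucket factor."""
    return sum(score * decay_factor(age) for age, score in activity)


if __name__ == "__main__":
    # 10 downloads yesterday and 10 downloads ten days ago.
    print(decayed_score([(1, 10), (10, 10)]))
```

Since the factors only change at bucket boundaries, something like this could run as a daily batch or at query time rather than as a per-second update.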