Solved

Data Structure and approach for recommendations system

Posted on 2015-01-26
2
158 Views
Last Modified: 2015-01-27
Hi,

I am working on a simple recommendations system. Conceptually the bare bones structure (using relational database terms because thats what I am familiar with) is:

fact_table:
  score
  recommender_id
  item_id

I then want to, for a given recommender_id rank all other recommender_ids by their similarity (e.g. if recommender_id 1 scored item_id 1 as 10 and item_id 100 as 0 then another recommender ranking 1 as 9 and 100 as 1 will rank higher than a recommender scoring 10 for 1 and 8 for 100).

I expect this to be a quite large sparse matrix, possibly tens of thousands on each axis.

I would like to get the approach right from the start and would like recommendations for how to best structure the data and which language / tools to use.
0
Comment
Question by:monoceros
2 Comments
 
LVL 26

Accepted Solution

by:
dpearson earned 500 total points
ID: 40572125
I think language is pretty much wide open.  I'll suggest some specifics using Java, but all modern languages have the lists and  maps I'd recommend.

// I think you want some way to store (score+recommender) info together
public class ScoreInfo {
    public final int recommender_id ;
    public final int score ;
    public ScoreInfo(int recommender, int score) {
        this.recommender_id = recommender ;
        this.score = score ;
    }
}

Then I think you can read your entire fact_table into a datastructure like this:

Map<ItemID, List<ScoreInfo>> scoreMap = new HashMap<ItemID, List<ScoreInfo>>() ;

which if you don't speak Java is a map (a hashtable).
The keys in the hashtable are ItemIDs - perhaps integers - you didn't define any types.
The values in the hashtable are List<ScoreInfo> - lists of (score+recommender) pairs.

I hope it's clear that you can read the entire fact_table from a database and build a map like this with no duplication - each entry in the database would have a single entry in the map.  You do this one time - before you start computing similarities.

Then given a specific recommender which you are trying to find similarities to the algorithm becomes:
a) Get the list of (score+item) pairs for this recommender (SELECT score, item_id FROM fact_table WHERE recommender_id=?)
b) Go through each item in that list
c) For each item - retrieve the List<ScoreInfo> from the map.  This is the list of all recommenders that included that item id in their recommendations.  This lookup is constant time for each item so very efficient (since it's a hashtable).
d) Compute the similarity score for this item (for all recommenders that include that item) by walking this list of ScoreInfo objects.  Record this in another map (hashtable) like Map<RecommenderID, Similarity>.  if there's already an entry in the map, you add to the similarity score (if no entry, you just set an initial similarity score).
e) At the end of going through each item (from step c) you have a Map<RecommenderID, Similarity>
f) Now just sort the entries in that final map by similarity and you have the values you need.

The time to build the original map is O(n) where n is the number of entries in the database.  So no wasted time/space.
The time to compute the recommendations for any specific recommender is
        proportional to the (number of items in the original recommender) * (the number of other recommenders for exactly those items)

so you're only touching the sparse entries in the matrix - you're not scanning empty space.

The sorting at the end is O(n log n) where n is the number of other recommenders that overlap with the list of items in the original recommender.

I hope that helps,

Doug
0
 
LVL 1

Author Comment

by:monoceros
ID: 40572216
Hey, Doug,
Thanks - thats eloquent and looks straightforward (and happens to be in Java, which I do know)! For some reason I was blinkered with a view that "this is math, its big(ish) data - therefore I need one of the fancier newer languages" (partly driven by a desire to learn them!) but this is clean and can get me to a MVP without any steep language learning curve.
Cheers,
Mono.
0

Featured Post

Live: Real-Time Solutions, Start Here

Receive instant 1:1 support from technology experts, using our real-time conversation and whiteboard interface. Your first 5 minutes are always free.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

Title # Comments Views Activity
sorting efficency of sorting algorithm 30 112
listing all functions in JavaScript 19 200
Turning python script into an applet 12 110
MacOS and programming in React 7 34
Iteration: Iteration is repetition of a process. A student who goes to school repeats the process of going to school everyday until graduation. We go to grocery store at least once or twice a month to buy products. We repeat this process every mont…
This article will inform Clients about common and important expectations from the freelancers (Experts) who are looking at your Gig.
In this fourth video of the Xpdf series, we discuss and demonstrate the PDFinfo utility, which retrieves the contents of a PDF's Info Dictionary, as well as some other information, including the page count. We show how to isolate the page count in a…
In this fifth video of the Xpdf series, we discuss and demonstrate the PDFdetach utility, which is able to list and, more importantly, extract attachments that are embedded in PDF files. It does this via a command line interface, making it suitable …

813 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

10 Experts available now in Live!

Get 1:1 Help Now