Data Structure and approach for recommendations system

Posted on 2015-01-26
Last Modified: 2015-01-27

I am working on a simple recommendations system. Conceptually the bare bones structure (using relational database terms because thats what I am familiar with) is:


I then want to, for a given recommender_id rank all other recommender_ids by their similarity (e.g. if recommender_id 1 scored item_id 1 as 10 and item_id 100 as 0 then another recommender ranking 1 as 9 and 100 as 1 will rank higher than a recommender scoring 10 for 1 and 8 for 100).

I expect this to be a quite large sparse matrix, possibly tens of thousands on each axis.

I would like to get the approach right from the start and would like recommendations for how to best structure the data and which language / tools to use.
Question by:monoceros
LVL 26

Accepted Solution

dpearson earned 500 total points
ID: 40572125
I think language is pretty much wide open.  I'll suggest some specifics using Java, but all modern languages have the lists and  maps I'd recommend.

// I think you want some way to store (score+recommender) info together
public class ScoreInfo {
    public final int recommender_id ;
    public final int score ;
    public ScoreInfo(int recommender, int score) {
        this.recommender_id = recommender ;
        this.score = score ;

Then I think you can read your entire fact_table into a datastructure like this:

Map<ItemID, List<ScoreInfo>> scoreMap = new HashMap<ItemID, List<ScoreInfo>>() ;

which if you don't speak Java is a map (a hashtable).
The keys in the hashtable are ItemIDs - perhaps integers - you didn't define any types.
The values in the hashtable are List<ScoreInfo> - lists of (score+recommender) pairs.

I hope it's clear that you can read the entire fact_table from a database and build a map like this with no duplication - each entry in the database would have a single entry in the map.  You do this one time - before you start computing similarities.

Then given a specific recommender which you are trying to find similarities to the algorithm becomes:
a) Get the list of (score+item) pairs for this recommender (SELECT score, item_id FROM fact_table WHERE recommender_id=?)
b) Go through each item in that list
c) For each item - retrieve the List<ScoreInfo> from the map.  This is the list of all recommenders that included that item id in their recommendations.  This lookup is constant time for each item so very efficient (since it's a hashtable).
d) Compute the similarity score for this item (for all recommenders that include that item) by walking this list of ScoreInfo objects.  Record this in another map (hashtable) like Map<RecommenderID, Similarity>.  if there's already an entry in the map, you add to the similarity score (if no entry, you just set an initial similarity score).
e) At the end of going through each item (from step c) you have a Map<RecommenderID, Similarity>
f) Now just sort the entries in that final map by similarity and you have the values you need.

The time to build the original map is O(n) where n is the number of entries in the database.  So no wasted time/space.
The time to compute the recommendations for any specific recommender is
        proportional to the (number of items in the original recommender) * (the number of other recommenders for exactly those items)

so you're only touching the sparse entries in the matrix - you're not scanning empty space.

The sorting at the end is O(n log n) where n is the number of other recommenders that overlap with the list of items in the original recommender.

I hope that helps,


Author Comment

ID: 40572216
Hey, Doug,
Thanks - thats eloquent and looks straightforward (and happens to be in Java, which I do know)! For some reason I was blinkered with a view that "this is math, its big(ish) data - therefore I need one of the fancier newer languages" (partly driven by a desire to learn them!) but this is clean and can get me to a MVP without any steep language learning curve.

Featured Post

Is Your Active Directory as Secure as You Think?

More than 75% of all records are compromised because of the loss or theft of a privileged credential. Experts have been exploring Active Directory infrastructure to identify key threats and establish best practices for keeping data safe. Attend this month’s webinar to learn more.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

This article will show, step by step, how to integrate R code into a R Sweave document
Whether you've completed a degree in computer sciences or you're a self-taught programmer, writing your first lines of code in the real world is always a challenge. Here are some of the most common pitfalls for new programmers.
Viewers will learn how to properly install Eclipse with the necessary JDK, and will take a look at an introductory Java program. Download Eclipse installation zip file: Extract files from zip file: Download and install JDK 8: Open Eclipse and …
In this seventh video of the Xpdf series, we discuss and demonstrate the PDFfonts utility, which lists all the fonts used in a PDF file. It does this via a command line interface, making it suitable for use in programs, scripts, batch files — any pl…

932 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

10 Experts available now in Live!

Get 1:1 Help Now