Solved

Data Structure and approach for recommendations system

Posted on 2015-01-26
2
175 Views
Last Modified: 2015-01-27
Hi,

I am working on a simple recommendations system. Conceptually the bare bones structure (using relational database terms because thats what I am familiar with) is:

fact_table:
  score
  recommender_id
  item_id

I then want to, for a given recommender_id rank all other recommender_ids by their similarity (e.g. if recommender_id 1 scored item_id 1 as 10 and item_id 100 as 0 then another recommender ranking 1 as 9 and 100 as 1 will rank higher than a recommender scoring 10 for 1 and 8 for 100).

I expect this to be a quite large sparse matrix, possibly tens of thousands on each axis.

I would like to get the approach right from the start and would like recommendations for how to best structure the data and which language / tools to use.
0
Comment
Question by:monoceros
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
2 Comments
 
LVL 27

Accepted Solution

by:
dpearson earned 500 total points
ID: 40572125
I think language is pretty much wide open.  I'll suggest some specifics using Java, but all modern languages have the lists and  maps I'd recommend.

// I think you want some way to store (score+recommender) info together
public class ScoreInfo {
    public final int recommender_id ;
    public final int score ;
    public ScoreInfo(int recommender, int score) {
        this.recommender_id = recommender ;
        this.score = score ;
    }
}

Then I think you can read your entire fact_table into a datastructure like this:

Map<ItemID, List<ScoreInfo>> scoreMap = new HashMap<ItemID, List<ScoreInfo>>() ;

which if you don't speak Java is a map (a hashtable).
The keys in the hashtable are ItemIDs - perhaps integers - you didn't define any types.
The values in the hashtable are List<ScoreInfo> - lists of (score+recommender) pairs.

I hope it's clear that you can read the entire fact_table from a database and build a map like this with no duplication - each entry in the database would have a single entry in the map.  You do this one time - before you start computing similarities.

Then given a specific recommender which you are trying to find similarities to the algorithm becomes:
a) Get the list of (score+item) pairs for this recommender (SELECT score, item_id FROM fact_table WHERE recommender_id=?)
b) Go through each item in that list
c) For each item - retrieve the List<ScoreInfo> from the map.  This is the list of all recommenders that included that item id in their recommendations.  This lookup is constant time for each item so very efficient (since it's a hashtable).
d) Compute the similarity score for this item (for all recommenders that include that item) by walking this list of ScoreInfo objects.  Record this in another map (hashtable) like Map<RecommenderID, Similarity>.  if there's already an entry in the map, you add to the similarity score (if no entry, you just set an initial similarity score).
e) At the end of going through each item (from step c) you have a Map<RecommenderID, Similarity>
f) Now just sort the entries in that final map by similarity and you have the values you need.

The time to build the original map is O(n) where n is the number of entries in the database.  So no wasted time/space.
The time to compute the recommendations for any specific recommender is
        proportional to the (number of items in the original recommender) * (the number of other recommenders for exactly those items)

so you're only touching the sparse entries in the matrix - you're not scanning empty space.

The sorting at the end is O(n log n) where n is the number of other recommenders that overlap with the list of items in the original recommender.

I hope that helps,

Doug
0
 
LVL 1

Author Comment

by:monoceros
ID: 40572216
Hey, Doug,
Thanks - thats eloquent and looks straightforward (and happens to be in Java, which I do know)! For some reason I was blinkered with a view that "this is math, its big(ish) data - therefore I need one of the fancier newer languages" (partly driven by a desire to learn them!) but this is clean and can get me to a MVP without any steep language learning curve.
Cheers,
Mono.
0

Featured Post

Free Tool: ZipGrep

ZipGrep is a utility that can list and search zip (.war, .ear, .jar, etc) archives for text patterns, without the need to extract the archive's contents.

One of a set of tools we're offering as a way to say thank you for being a part of the community.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

A short article about a problem I had getting the GPS LocationListener working.
Displaying an arrayList in a listView using the default adapter is rarely the best solution. To get full control of your display data, and to be able to refresh it after editing, requires the use of a custom adapter.
In this fifth video of the Xpdf series, we discuss and demonstrate the PDFdetach utility, which is able to list and, more importantly, extract attachments that are embedded in PDF files. It does this via a command line interface, making it suitable …
With the power of JIRA, there's an unlimited number of ways you can customize it, use it and benefit from it. With that in mind, there's bound to be things that I wasn't able to cover in this course. With this summary we'll look at some places to go…

688 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question