I've been really interested in recommendation systems lately, which has led me down a few rabbit holes. One of these landed on a similarity measure called Jaccard similarity, a fairly straightforward metric for how many elements two sets have in common: the size of their intersection divided by the size of their union.
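To make the definition concrete, here is a minimal sketch of the exact calculation in Python. The function name `jaccard`, the sample viewing histories, and the choice to return 0.0 for two empty sets are my own assumptions for illustration.

```python
def jaccard(a, b):
    """Exact Jaccard similarity: |A intersect B| / |A union B|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumed convention: two empty sets have similarity 0.0.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical viewing histories for two users.
alice = {"Alien", "Blade Runner", "Dune", "Heat"}
bob = {"Blade Runner", "Dune", "Arrival"}

# 2 shared titles out of 5 distinct titles overall.
print(jaccard(alice, bob))  # 0.4
```

This is fine for a single pair of small sets. The cost shows up when every pair in a large collection has to be compared this way.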

Computing this exactly gets expensive when you have to compare many large sets against each other, but it can be estimated with a certain degree of confidence using MinHash. If this sort of thing interests you, I would check out the blog post by Chris McCormick linked below. It's one of the best explanations I came across (even though it's in the context of document similarity). There is even some Python code provided.

I would summarize the technique in one sentence: if you randomly shuffle the union of two sets and look at the first element of the shuffled result, the probability that it belongs to their intersection is equal to the Jaccard similarity.
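The one-sentence summary can be checked with a small simulation. This is only a sketch of the shuffling idea, not the code from the linked tutorial. The name `estimate_jaccard`, the `trials` and `seed` parameters, and the example sets are assumptions of mine. Practical MinHash implementations typically replace the explicit shuffle with random hash functions, so the union never has to be built.

```python
import random


def estimate_jaccard(a, b, trials=5000, seed=0):
    """Estimate Jaccard similarity by repeatedly shuffling the union.

    Each trial shuffles the union and checks whether the first element
    lies in the intersection. The fraction of hits is the estimate.
    """
    a, b = set(a), set(b)
    union = sorted(a | b)  # sorted so a fixed seed gives repeatable results
    if not union:
        return 0.0  # assumed convention for two empty sets
    both = a & b
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        rng.shuffle(union)
        if union[0] in both:
            hits += 1
    return hits / trials


# Hypothetical item IDs: 20 shared items out of 100 distinct items,
# so the exact Jaccard similarity is 0.2.
a = set(range(0, 60))
b = set(range(40, 100))
print(estimate_jaccard(a, b))  # should land close to 0.2
```

More trials tighten the estimate, which is the same trade-off MinHash makes when it uses more hash functions.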

http://mccormickml.com/2015/06/12/minhash-tutorial-with-python-code/