asked on

List similarity?

Hi,

I have an app where users can mark products. A user might generate a list of strings of products they marked like:

truck
wagon
bottle
dog

I want to find other users which have generated *similar* lists of marked product strings. I star "similar" because it's difficult for me to define.

This user would be the most similar:

truck, wagon, bottle, dog

This user would be a little less similar:

truck, wagon, bottle, dog, red

This user would also be a little less similar:

truck, bottle, dog

This user has zero similarity:

duck

This is probably some type of search, I just don't know the name. I want to find users that have similar tastes in the products they marked,

Thanks

ASKER CERTIFIED SOLUTION

themrrobert

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

aburr

" I star "similar" because it's difficult for me to define. "
If you find "similar" difficult to define, how do you expect the computed to do better?

DJ_AM_Juicebox

ASKER

>> The perfect function for this would be worth a great deal of cash.

Ha yeah I realize this has many useful applications.

I can't think of anything better than the solution you propose. Is there a common name for this problem though? I want to reference it a post I'm writing - like "the list similarity problem"? I doubt I'll find a solution for it, but if it's a known problem, I can site it as such and carry on!!

Thanks

aburr

"Is there a common name for this problem though?"
Try The Similarity Problem"

Do not let my previous comment discourage you too much.
One current related problem is face recognition software

DJ_AM_Juicebox

ASKER

@aburr
Yeah bad wording on my part. I meant I don't know what term to label this problem as.

If a user has 4 items in their list, and I want to find the most similar users, those users would each have a similarity of "1", generated by dividing the number of matched items by max(total items in their own lists, total number of items in the original user's list):

userA { red, green, blue }

user1 { red, blue } 2 / 3 = 0.66

user2 { green, blue, red } 3 / 3 = 1

user3 { red, green, blue, yellow } 3 / 4 = 0.75

so:

similarity = matchedItems / max(setUserA.size(), setUserCurrent.size());

Thanks

SOLUTION

aburr

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

DJ_AM_Juicebox

ASKER

Yeah the solution by themrrobert gives the correct answer, the problem is hoping to do it at scale. For instance, if I have a database of 1 million users, there's no way I can cycle through all of them to do the computation. I would need to precompute some value each time they add a new item to their list and store it in their profile so I could at least get a rough set of users to iterate over in a single select statement.

Even better would be a single precomputed value to sort on which just orders the similarity for me in one select! The problem is that the criteria for similarity is variable and depends on the user the computation is being done for.

It's interesting and not easy to solve in 7 minutes!

Thanks

ozo

> if I have a database of 1 million users
About how many products would be in the database, and about how many would you expect each user to mark?

DJ_AM_Juicebox

ASKER

The number of products in the database is large, over 10 million.

The average number of items a user marks is 50.

SOLUTION

ozo

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

SOLUTION

PCableGuy

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

DJ_AM_Juicebox

ASKER

@PCableGuy

Yeah was thinking of something like that, but the set of products is not fixed, new products are added every day. Also there are over 10 million products so the string would be huge (per user too). Also I would have to update every user's string when new products are added to keep it the 'missin' items sorted and up to date.

I can sort products as users mark a new product, I haven't thought of a way that that helps in a sql context though!

Thanks

PCableGuy

http://www.informit.com/articles/article.aspx?p=1070897&seqNum=4

SOLUTION

Member_2_276102

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial