DJ_AM_Juicebox
asked on
List similarity?
Hi,
I have an app where users can mark products. A user might generate a list of strings of products they marked like:
truck
wagon
bottle
dog
I want to find other users which have generated *similar* lists of marked product strings. I star "similar" because it's difficult for me to define.
This user would be the most similar:
truck, wagon, bottle, dog
This user would be a little less similar:
truck, wagon, bottle, dog, red
This user would also be a little less similar:
truck, bottle, dog
This user has zero similarity:
duck
This is probably some type of search, I just don't know the name. I want to find users that have similar tastes in the products they marked,
Thanks
I have an app where users can mark products. A user might generate a list of strings of products they marked like:
truck
wagon
bottle
dog
I want to find other users which have generated *similar* lists of marked product strings. I star "similar" because it's difficult for me to define.
This user would be the most similar:
truck, wagon, bottle, dog
This user would be a little less similar:
truck, wagon, bottle, dog, red
This user would also be a little less similar:
truck, bottle, dog
This user has zero similarity:
duck
This is probably some type of search, I just don't know the name. I want to find users that have similar tastes in the products they marked,
Thanks
ASKER CERTIFIED SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
ASKER
>> The perfect function for this would be worth a great deal of cash.
Ha yeah I realize this has many useful applications.
I can't think of anything better than the solution you propose. Is there a common name for this problem though? I want to reference it a post I'm writing - like "the list similarity problem"? I doubt I'll find a solution for it, but if it's a known problem, I can site it as such and carry on!!
Thanks
Ha yeah I realize this has many useful applications.
I can't think of anything better than the solution you propose. Is there a common name for this problem though? I want to reference it a post I'm writing - like "the list similarity problem"? I doubt I'll find a solution for it, but if it's a known problem, I can site it as such and carry on!!
Thanks
"Is there a common name for this problem though?"
Try The Similarity Problem"
Do not let my previous comment discourage you too much.
One current related problem is face recognition software
Try The Similarity Problem"
Do not let my previous comment discourage you too much.
One current related problem is face recognition software
ASKER
@aburr
Yeah bad wording on my part. I meant I don't know what term to label this problem as.
If a user has 4 items in their list, and I want to find the most similar users, those users would each have a similarity of "1", generated by dividing the number of matched items by max(total items in their own lists, total number of items in the original user's list):
userA { red, green, blue }
user1 { red, blue } 2 / 3 = 0.66
user2 { green, blue, red } 3 / 3 = 1
user3 { red, green, blue, yellow } 3 / 4 = 0.75
so:
similarity = matchedItems / max(setUserA.size(), setUserCurrent.size());
Thanks
Yeah bad wording on my part. I meant I don't know what term to label this problem as.
If a user has 4 items in their list, and I want to find the most similar users, those users would each have a similarity of "1", generated by dividing the number of matched items by max(total items in their own lists, total number of items in the original user's list):
userA { red, green, blue }
user1 { red, blue } 2 / 3 = 0.66
user2 { green, blue, red } 3 / 3 = 1
user3 { red, green, blue, yellow } 3 / 4 = 0.75
so:
similarity = matchedItems / max(setUserA.size(), setUserCurrent.size());
Thanks
SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
ASKER
Yeah the solution by themrrobert gives the correct answer, the problem is hoping to do it at scale. For instance, if I have a database of 1 million users, there's no way I can cycle through all of them to do the computation. I would need to precompute some value each time they add a new item to their list and store it in their profile so I could at least get a rough set of users to iterate over in a single select statement.
Even better would be a single precomputed value to sort on which just orders the similarity for me in one select! The problem is that the criteria for similarity is variable and depends on the user the computation is being done for.
It's interesting and not easy to solve in 7 minutes!
Thanks
Even better would be a single precomputed value to sort on which just orders the similarity for me in one select! The problem is that the criteria for similarity is variable and depends on the user the computation is being done for.
It's interesting and not easy to solve in 7 minutes!
Thanks
> if I have a database of 1 million users
About how many products would be in the database, and about how many would you expect each user to mark?
About how many products would be in the database, and about how many would you expect each user to mark?
ASKER
The number of products in the database is large, over 10 million.
The average number of items a user marks is 50.
The average number of items a user marks is 50.
SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
ASKER
@PCableGuy
Yeah was thinking of something like that, but the set of products is not fixed, new products are added every day. Also there are over 10 million products so the string would be huge (per user too). Also I would have to update every user's string when new products are added to keep it the 'missin' items sorted and up to date.
I can sort products as users mark a new product, I haven't thought of a way that that helps in a sql context though!
Thanks
Yeah was thinking of something like that, but the set of products is not fixed, new products are added every day. Also there are over 10 million products so the string would be huge (per user too). Also I would have to update every user's string when new products are added to keep it the 'missin' items sorted and up to date.
I can sort products as users mark a new product, I haven't thought of a way that that helps in a sql context though!
Thanks
SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
If you find "similar" difficult to define, how do you expect the computed to do better?