Link to home
Start Free TrialLog in
Avatar of DJ_AM_Juicebox
DJ_AM_Juicebox

asked on

List similarity?

Hi,

I have an app where users can mark products. A user might generate a list of strings of products they marked like:

    truck
    wagon
    bottle
    dog

I want to find other users which have generated *similar* lists of marked product strings. I star "similar" because it's difficult for me to define.

This user would be the most similar:

    truck, wagon, bottle, dog

This user would be a little less similar:

    truck, wagon, bottle, dog, red

This user would also be a little less similar:

    truck, bottle, dog

This user has zero similarity:

    duck

This is probably some type of search, I just don't know the name. I want to find users that have similar tastes in the products they marked,

Thanks
ASKER CERTIFIED SOLUTION
Avatar of themrrobert
themrrobert
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Avatar of aburr
" I star "similar" because it's difficult for me to define. "
If you find "similar" difficult to define, how do you expect the computed to do better?
Avatar of DJ_AM_Juicebox
DJ_AM_Juicebox

ASKER

>> The perfect function for this would be worth a great deal of cash.

Ha yeah I realize this has many useful applications.

I can't think of anything better than the solution you propose. Is there a common name for this problem though? I want to reference it a post I'm writing - like "the list similarity problem"? I doubt I'll find a solution for it, but if it's a known problem, I can site it as such and carry on!!

Thanks
"Is there a common name for this problem though?"
Try The Similarity Problem"

Do not let my previous comment discourage you too much.
One current related problem is face recognition software
@aburr
Yeah bad wording on my part. I meant I don't know what term to label this problem as.

If a user has 4 items in their list, and I want to find the most similar users, those users would each have a similarity of "1", generated by dividing the number of matched items by max(total items in their own lists, total number of items in the original user's list):

    userA { red, green, blue }

    user1 { red, blue }  2 / 3 = 0.66

    user2 { green, blue, red }    3 / 3  = 1

    user3 { red, green, blue, yellow } 3 / 4 = 0.75

so:

    similarity = matchedItems / max(setUserA.size(), setUserCurrent.size());  

Thanks
SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Yeah the solution by themrrobert gives the correct answer, the problem is hoping to do it at scale. For instance, if I have a database of 1 million users, there's no way I can cycle through all of them to do the computation. I would need to precompute some value each time they add a new item to their list and store it in their profile so I could at least get a rough set of users to iterate over in a single select statement.

Even better would be a single precomputed value to sort on which just orders the similarity for me in one select! The problem is that the criteria for similarity is variable and depends on the user the computation is being done for.

It's interesting and not easy to solve in 7 minutes!

Thanks
> if I have a database of 1 million users
About how many products would be in the database, and about how many would you expect each user to mark?
The number of products in the database is large, over 10 million.

The average number of items a user marks is 50.


 
SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
@PCableGuy

Yeah was thinking of something like that, but the set of products is not fixed, new products are added every day. Also there are over 10 million products so the string would be huge (per user too). Also I would have to update every user's string when new products are added to keep it the 'missin' items sorted and up to date.

I can sort products as users mark a new product, I haven't thought of a way that that helps in a sql context though!

Thanks
SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial