Still celebrating National IT Professionals Day with 3 months of free Premium Membership. Use Code ITDAY17

x
?
Solved

List similarity?

Posted on 2010-08-12
14
Medium Priority
?
285 Views
Last Modified: 2012-05-10
Hi,

I have an app where users can mark products. A user might generate a list of strings of products they marked like:

    truck
    wagon
    bottle
    dog

I want to find other users which have generated *similar* lists of marked product strings. I star "similar" because it's difficult for me to define.

This user would be the most similar:

    truck, wagon, bottle, dog

This user would be a little less similar:

    truck, wagon, bottle, dog, red

This user would also be a little less similar:

    truck, bottle, dog

This user has zero similarity:

    duck

This is probably some type of search, I just don't know the name. I want to find users that have similar tastes in the products they marked,

Thanks
0
Comment
Question by:DJ_AM_Juicebox
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 5
  • 3
  • 2
  • +3
14 Comments
 
LVL 13

Accepted Solution

by:
themrrobert earned 400 total points
ID: 33425389
You will have to create your own algorithm, Especially if you plan to have a dynamic list now or in the future, it could change the order, thus impeding efforts even further.

The perfect function for this would be worth a great deal of cash.

Here is a good starting point:

$arrayUser1 = split("\n",$UserData1);
$arrayUser2 = split("\n",$UserData2);

this creates two arrays.

Loop through both, in a for x, for y, next y, next x format and add 1 point for each entry that matches.

For each user, divide the number of matches by the number of items of user1. (dividing by user2 gives you the converse 'similarity').

This should give you a good start, let me know if you have any questions
0
 
LVL 27

Expert Comment

by:aburr
ID: 33425419
" I star "similar" because it's difficult for me to define. "
If you find "similar" difficult to define, how do you expect the computed to do better?
0
 

Author Comment

by:DJ_AM_Juicebox
ID: 33425420
>> The perfect function for this would be worth a great deal of cash.

Ha yeah I realize this has many useful applications.

I can't think of anything better than the solution you propose. Is there a common name for this problem though? I want to reference it a post I'm writing - like "the list similarity problem"? I doubt I'll find a solution for it, but if it's a known problem, I can site it as such and carry on!!

Thanks
0
What does it mean to be "Always On"?

Is your cloud always on? With an Always On cloud you won't have to worry about downtime for maintenance or software application code updates, ensuring that your bottom line isn't affected.

 
LVL 27

Expert Comment

by:aburr
ID: 33425448
"Is there a common name for this problem though?"
Try The Similarity Problem"

Do not let my previous comment discourage you too much.
One current related problem is face recognition software
0
 

Author Comment

by:DJ_AM_Juicebox
ID: 33425461
@aburr
Yeah bad wording on my part. I meant I don't know what term to label this problem as.

If a user has 4 items in their list, and I want to find the most similar users, those users would each have a similarity of "1", generated by dividing the number of matched items by max(total items in their own lists, total number of items in the original user's list):

    userA { red, green, blue }

    user1 { red, blue }  2 / 3 = 0.66

    user2 { green, blue, red }    3 / 3  = 1

    user3 { red, green, blue, yellow } 3 / 4 = 0.75

so:

    similarity = matchedItems / max(setUserA.size(), setUserCurrent.size());  

Thanks
0
 
LVL 27

Assisted Solution

by:aburr
aburr earned 400 total points
ID: 33425500
"  similarity = matchedItems / max(setUserA.size(), setUserCurrent.size());"
It looks like you are well on your way especially if the users construct their lists by selecting from a menu so that the items have a well defined form.
ie red will not be confused with redish
0
 

Author Comment

by:DJ_AM_Juicebox
ID: 33425526
Yeah the solution by themrrobert gives the correct answer, the problem is hoping to do it at scale. For instance, if I have a database of 1 million users, there's no way I can cycle through all of them to do the computation. I would need to precompute some value each time they add a new item to their list and store it in their profile so I could at least get a rough set of users to iterate over in a single select statement.

Even better would be a single precomputed value to sort on which just orders the similarity for me in one select! The problem is that the criteria for similarity is variable and depends on the user the computation is being done for.

It's interesting and not easy to solve in 7 minutes!

Thanks
0
 
LVL 84

Expert Comment

by:ozo
ID: 33425551
> if I have a database of 1 million users
About how many products would be in the database, and about how many would you expect each user to mark?
0
 

Author Comment

by:DJ_AM_Juicebox
ID: 33425580
The number of products in the database is large, over 10 million.

The average number of items a user marks is 50.


 
0
 
LVL 84

Assisted Solution

by:ozo
ozo earned 400 total points
ID: 33425784
So this seems to be the model you are using
http://en.wikipedia.org/wiki/Standard_Boolean_model
But for a dataset that large, speed could be a problem
0
 
LVL 12

Assisted Solution

by:PCableGuy
PCableGuy earned 400 total points
ID: 33426098

Is there any advantage of making everyone the same string length along with 'fake' characters for an item that they didn't choose? Does this make it easier or harder to compute, or roughly the same?
 
Database of Choices, Ascending Order
bottle, dog, duck, red, truck, wagon

New User, hasn't chosen anything yet.
######, ###, ####, ###, #####, #####

User 1
bottle, dog, ####, ###, truck, wagon

User A
bottle, dog, ####, red, truck, wagon

User B
bottle, dog, ####, ###, truck, #####

User C
######, ###, duck, ###, #####, #####
 
0
 

Author Comment

by:DJ_AM_Juicebox
ID: 33426161
@PCableGuy

Yeah was thinking of something like that, but the set of products is not fixed, new products are added every day. Also there are over 10 million products so the string would be huge (per user too). Also I would have to update every user's string when new products are added to keep it the 'missin' items sorted and up to date.

I can sort products as users mark a new product, I haven't thought of a way that that helps in a sql context though!

Thanks
0
 
LVL 12

Expert Comment

by:PCableGuy
ID: 33426358
0
 
LVL 27

Assisted Solution

by:tliotta
tliotta earned 400 total points
ID: 33433359
Consider also that you might find value in tracking how many times each product is chosen for a listb. If 9M products have never been chosen, there's not much reason to pay attention to them in any algorithm.

Also, you might find value in weighting related to the number of times a product is chosen. If a customer lists four products that no one else has ever chosen, that list might have a total "weight" of zero, or something along those lines. Regardless, there will clearly be no other customers with a similar list -- it's not necessary even to check.

Tom
0

Featured Post

On Demand Webinar: Networking for the Cloud Era

Did you know SD-WANs can improve network connectivity? Check out this webinar to learn how an SD-WAN simplified, one-click tool can help you migrate and manage data in the cloud.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Have you ever thought of installing a power system that generates solar electricity to power your house? Some may say yes, while others may tell me no. But have you noticed that people around you are now considering installing such systems in their …
We are taking giant steps in technological advances in the field of wireless telephony. At just 10 years since the advent of smartphones, it is crucial to examine the benefits and disadvantages that have been report to us.
This is a video describing the growing solar energy use in Utah. This is a topic that greatly interests me and so I decided to produce a video about it.
I've attached the XLSM Excel spreadsheet I used in the video and also text files containing the macros used below. https://filedb.experts-exchange.com/incoming/2017/03_w12/1151775/Permutations.txt https://filedb.experts-exchange.com/incoming/201…
Suggested Courses

722 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question