Solved

Find most common patterns with Lists of numbers

Posted on 2014-03-31
8
263 Views
Last Modified: 2014-04-07
Hi,

I'm trying to find the optimal way of identifying common patterns with groups of items (an item is identified by a number)

e.g. every transaction has a list (of size 1 - 20) of items. I can easily say which single items appear most often by keeping a count and incrementing per transaction. I guess i could use the same approach for combinations of items (2+) that appear in transactions, but i'm not sure if this is the best approach?

note that order of the item list is not important, but that i'm not looking to treat the full list as a unique transaction (I want to split each list into its sub combinations - ab,ac,bc, abc, etc and match on these)

any help appreciated,
thanks
0
Comment
Question by:basil365
  • 4
  • 3
8 Comments
 
LVL 27

Expert Comment

by:d-glitch
Comment Utility
Need more information.

What is your goal in doing this?
How much data are we talking about?  Hundreds of transactions or millions or more??
What are the items?
How many different items are there?  
Can a single transaction have multiple copies of a particular item?

You probably want to sort the list of items in each transaction for easier comparison.
Then you could compare every pair of transactions, and keep track of the number items that overlap in an array.

Then you could work through the array to find the most common single items and most common subsets.
0
 
LVL 27

Expert Comment

by:d-glitch
Comment Utility
If there is a finite number of items N, you could assign a 1xN array to each transaction, where the values in the array represent the count of particular items in each transaction.  This would also simplify the subsequent analysis.
0
 

Author Comment

by:basil365
Comment Utility
Hi d-glitch,

The amount of data is variable, but will tend towards the millions. I intended to find a solution then optimise it by adding assumptions (e.g. only allow combinations up to size 4 to limit results set)

- The items are a complex object representing product, but in this case i only care about the id
- Each transaction can only have one copy of a particular item.
- There are n different items, but ~100000 at a given point in time (again this value will be restricted by applying logic to the data set e.g. older than x)


thanks for your responses
0
 
LVL 27

Assisted Solution

by:d-glitch
d-glitch earned 250 total points
Comment Utility
Millions of transactions with up to 20 items per transaction isn't too bad.

What are you using for software??

You probably want to go through the data and make lots of lists:
  A list of unique items and the number it times they appear.

Then figure out which items are most popular and make more lists.  
  A list of the transactions that include Fav_1
  A list of the transactions that include Fav_2
  A list of the transactions that include Fav_3
  . . . .
  A list of the transactions that include Fav_M  (not N).

You can decide how deep you need to go into the Favorites.  Certainly you can eliminate any items that only appear once.  You will have to look at the data, or prototype a subsample to decide.  You would have to go through the entire data set at least twice.

Finally you would have to compare the Fav_ lists with each other to find how many transactions overlap.  This is a lot of work, but you can start at the top.  The most likely pairs (or triples or quads) of items are likely (but not guaranteed) to include the most common individual items.
0
What Security Threats Are You Missing?

Enhance your security with threat intelligence from the web. Get trending threat insights on hackers, exploits, and suspicious IP addresses delivered to your inbox with our free Cyber Daily.

 

Author Comment

by:basil365
Comment Utility
Thanks d-glitch, thats the approach i've taken:

- store all combinations of ids when a transaction occurs, as well as the list within the transactions
- for each combination (up to a size of 6 at the moment) I'm iterating through all lists within each transaction and checking if the combination appears.


I'm using .net (programming with linq mainly) and just about to create a few unit tests to check performance

thanks
0
 

Author Comment

by:basil365
Comment Utility
Speed is definitely the issue here - I had to further reduce my pattern size to combinations of 4 in order to keep up with transactions when using the above approach
0
 
LVL 84

Accepted Solution

by:
ozo earned 250 total points
Comment Utility
Suffix trees can find the most frequently occurring substrings of a minimum length in O(n) time, O(n) space
http://en.wikipedia.org/wiki/Suffix_tree
0
 

Author Comment

by:basil365
Comment Utility
Thanks ozo - i'll take a look to see if i can apply these to the problem.
0

Featured Post

Free Trending Threat Insights Every Day

Enhance your security with threat intelligence from the web. Get trending threat insights on hackers, exploits, and suspicious IP addresses delivered to your inbox with our free Cyber Daily.

Join & Write a Comment

Suggested Solutions

Okay. So what exactly is the problem here? How often have we come across situations where we need to know if two strings are 'similar' but not necessarily the same? I have, plenty of times. Until recently, I thought any functionality like that wo…
More often than not, we developers are confronted with a need: a need to make some kind of magic happen via code. Whether it is for a client, for the boss, or for our own personal projects, the need must be satisfied. Most of the time, the Framework…
In this seventh video of the Xpdf series, we discuss and demonstrate the PDFfonts utility, which lists all the fonts used in a PDF file. It does this via a command line interface, making it suitable for use in programs, scripts, batch files — any pl…
This tutorial demonstrates a quick way of adding group price to multiple Magento products.

771 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

11 Experts available now in Live!

Get 1:1 Help Now