# Find most common patterns with Lists of numbers

Hi,

I'm trying to find the optimal way of identifying common patterns with groups of items (an item is identified by a number)

e.g. every transaction has a list (of size 1 - 20) of items. I can easily say which single items appear most often by keeping a count and incrementing per transaction. I guess i could use the same approach for combinations of items (2+) that appear in transactions, but i'm not sure if this is the best approach?

note that order of the item list is not important, but that i'm not looking to treat the full list as a unique transaction (I want to split each list into its sub combinations - ab,ac,bc, abc, etc and match on these)

any help appreciated,
thanks
###### Who is Participating?

Commented:
Suffix trees can find the most frequently occurring substrings of a minimum length in O(n) time, O(n) space
http://en.wikipedia.org/wiki/Suffix_tree
0

Commented:

What is your goal in doing this?
How much data are we talking about?  Hundreds of transactions or millions or more??
What are the items?
How many different items are there?
Can a single transaction have multiple copies of a particular item?

You probably want to sort the list of items in each transaction for easier comparison.
Then you could compare every pair of transactions, and keep track of the number items that overlap in an array.

Then you could work through the array to find the most common single items and most common subsets.
0

Commented:
If there is a finite number of items N, you could assign a 1xN array to each transaction, where the values in the array represent the count of particular items in each transaction.  This would also simplify the subsequent analysis.
0

Author Commented:
Hi d-glitch,

The amount of data is variable, but will tend towards the millions. I intended to find a solution then optimise it by adding assumptions (e.g. only allow combinations up to size 4 to limit results set)

- The items are a complex object representing product, but in this case i only care about the id
- Each transaction can only have one copy of a particular item.
- There are n different items, but ~100000 at a given point in time (again this value will be restricted by applying logic to the data set e.g. older than x)

0

Commented:
Millions of transactions with up to 20 items per transaction isn't too bad.

What are you using for software??

You probably want to go through the data and make lots of lists:
A list of unique items and the number it times they appear.

Then figure out which items are most popular and make more lists.
A list of the transactions that include Fav_1
A list of the transactions that include Fav_2
A list of the transactions that include Fav_3
. . . .
A list of the transactions that include Fav_M  (not N).

You can decide how deep you need to go into the Favorites.  Certainly you can eliminate any items that only appear once.  You will have to look at the data, or prototype a subsample to decide.  You would have to go through the entire data set at least twice.

Finally you would have to compare the Fav_ lists with each other to find how many transactions overlap.  This is a lot of work, but you can start at the top.  The most likely pairs (or triples or quads) of items are likely (but not guaranteed) to include the most common individual items.
0

Author Commented:
Thanks d-glitch, thats the approach i've taken:

- store all combinations of ids when a transaction occurs, as well as the list within the transactions
- for each combination (up to a size of 6 at the moment) I'm iterating through all lists within each transaction and checking if the combination appears.

I'm using .net (programming with linq mainly) and just about to create a few unit tests to check performance

thanks
0

Author Commented:
Speed is definitely the issue here - I had to further reduce my pattern size to combinations of 4 in order to keep up with transactions when using the above approach
0

Author Commented:
Thanks ozo - i'll take a look to see if i can apply these to the problem.
0
Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.