Go Premium for a chance to win a PS4. Enter to Win

x
Solved

# Find most common patterns with Lists of numbers

Posted on 2014-03-31
Medium Priority
308 Views
Hi,

I'm trying to find the optimal way of identifying common patterns with groups of items (an item is identified by a number)

e.g. every transaction has a list (of size 1 - 20) of items. I can easily say which single items appear most often by keeping a count and incrementing per transaction. I guess i could use the same approach for combinations of items (2+) that appear in transactions, but i'm not sure if this is the best approach?

note that order of the item list is not important, but that i'm not looking to treat the full list as a unique transaction (I want to split each list into its sub combinations - ab,ac,bc, abc, etc and match on these)

any help appreciated,
thanks
0
Question by:basil365
• 4
• 3

LVL 27

Expert Comment

ID: 39966576

What is your goal in doing this?
How much data are we talking about?  Hundreds of transactions or millions or more??
What are the items?
How many different items are there?
Can a single transaction have multiple copies of a particular item?

You probably want to sort the list of items in each transaction for easier comparison.
Then you could compare every pair of transactions, and keep track of the number items that overlap in an array.

Then you could work through the array to find the most common single items and most common subsets.
0

LVL 27

Expert Comment

ID: 39966590
If there is a finite number of items N, you could assign a 1xN array to each transaction, where the values in the array represent the count of particular items in each transaction.  This would also simplify the subsequent analysis.
0

Author Comment

ID: 39966647
Hi d-glitch,

The amount of data is variable, but will tend towards the millions. I intended to find a solution then optimise it by adding assumptions (e.g. only allow combinations up to size 4 to limit results set)

- The items are a complex object representing product, but in this case i only care about the id
- Each transaction can only have one copy of a particular item.
- There are n different items, but ~100000 at a given point in time (again this value will be restricted by applying logic to the data set e.g. older than x)

0

LVL 27

Assisted Solution

d-glitch earned 1000 total points
ID: 39966751
Millions of transactions with up to 20 items per transaction isn't too bad.

What are you using for software??

You probably want to go through the data and make lots of lists:
A list of unique items and the number it times they appear.

Then figure out which items are most popular and make more lists.
A list of the transactions that include Fav_1
A list of the transactions that include Fav_2
A list of the transactions that include Fav_3
. . . .
A list of the transactions that include Fav_M  (not N).

You can decide how deep you need to go into the Favorites.  Certainly you can eliminate any items that only appear once.  You will have to look at the data, or prototype a subsample to decide.  You would have to go through the entire data set at least twice.

Finally you would have to compare the Fav_ lists with each other to find how many transactions overlap.  This is a lot of work, but you can start at the top.  The most likely pairs (or triples or quads) of items are likely (but not guaranteed) to include the most common individual items.
0

Author Comment

ID: 39966774
Thanks d-glitch, thats the approach i've taken:

- store all combinations of ids when a transaction occurs, as well as the list within the transactions
- for each combination (up to a size of 6 at the moment) I'm iterating through all lists within each transaction and checking if the combination appears.

I'm using .net (programming with linq mainly) and just about to create a few unit tests to check performance

thanks
0

Author Comment

ID: 39974671
Speed is definitely the issue here - I had to further reduce my pattern size to combinations of 4 in order to keep up with transactions when using the above approach
0

LVL 85

Accepted Solution

ozo earned 1000 total points
ID: 39974713
Suffix trees can find the most frequently occurring substrings of a minimum length in O(n) time, O(n) space
http://en.wikipedia.org/wiki/Suffix_tree
0

Author Comment

ID: 39982634
Thanks ozo - i'll take a look to see if i can apply these to the problem.
0

## Featured Post

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Linear search (searching each index in an array one by one) works almost everywhere but it is not optimal in many cases. Let's assume, we have a book which has 42949672960 pages. We also have a table of contents. Now we want to read the content on p…
This article covers the basics of data encryption, what it is, how it works, and why it's important. If you've ever wondered what goes on when you "encrypt" data, you can look here to build a good foundation for your personal learning.
Although Jacob Bernoulli (1654-1705) has been credited as the creator of "Binomial Distribution Table", Gottfried Leibniz (1646-1716) did his dissertation on the subject in 1666; Leibniz you may recall is the co-inventor of "Calculus" and beat Isaac…
Finds all prime numbers in a range requested and places them in a public primes() array. I've demostrated a template size of 30 (2 * 3 * 5) but larger templates can be built such 210  (2 * 3 * 5 * 7) or 2310  (2 * 3 * 5 * 7 * 11). The larger templa…
###### Suggested Courses
Course of the Month9 days, 22 hours left to enroll