Go Premium for a chance to win a PS4. Enter to Win

x
?
Solved

Find most common patterns with Lists of numbers

Posted on 2014-03-31
8
Medium Priority
?
308 Views
Last Modified: 2014-04-07
Hi,

I'm trying to find the optimal way of identifying common patterns with groups of items (an item is identified by a number)

e.g. every transaction has a list (of size 1 - 20) of items. I can easily say which single items appear most often by keeping a count and incrementing per transaction. I guess i could use the same approach for combinations of items (2+) that appear in transactions, but i'm not sure if this is the best approach?

note that order of the item list is not important, but that i'm not looking to treat the full list as a unique transaction (I want to split each list into its sub combinations - ab,ac,bc, abc, etc and match on these)

any help appreciated,
thanks
0
Comment
Question by:basil365
  • 4
  • 3
8 Comments
 
LVL 27

Expert Comment

by:d-glitch
ID: 39966576
Need more information.

What is your goal in doing this?
How much data are we talking about?  Hundreds of transactions or millions or more??
What are the items?
How many different items are there?  
Can a single transaction have multiple copies of a particular item?

You probably want to sort the list of items in each transaction for easier comparison.
Then you could compare every pair of transactions, and keep track of the number items that overlap in an array.

Then you could work through the array to find the most common single items and most common subsets.
0
 
LVL 27

Expert Comment

by:d-glitch
ID: 39966590
If there is a finite number of items N, you could assign a 1xN array to each transaction, where the values in the array represent the count of particular items in each transaction.  This would also simplify the subsequent analysis.
0
 

Author Comment

by:basil365
ID: 39966647
Hi d-glitch,

The amount of data is variable, but will tend towards the millions. I intended to find a solution then optimise it by adding assumptions (e.g. only allow combinations up to size 4 to limit results set)

- The items are a complex object representing product, but in this case i only care about the id
- Each transaction can only have one copy of a particular item.
- There are n different items, but ~100000 at a given point in time (again this value will be restricted by applying logic to the data set e.g. older than x)


thanks for your responses
0
What is SQL Server and how does it work?

The purpose of this paper is to provide you background on SQL Server. It’s your self-study guide for learning fundamentals. It includes both the history of SQL and its technical basics. Concepts and definitions will form the solid foundation of your future DBA expertise.

 
LVL 27

Assisted Solution

by:d-glitch
d-glitch earned 1000 total points
ID: 39966751
Millions of transactions with up to 20 items per transaction isn't too bad.

What are you using for software??

You probably want to go through the data and make lots of lists:
  A list of unique items and the number it times they appear.

Then figure out which items are most popular and make more lists.  
  A list of the transactions that include Fav_1
  A list of the transactions that include Fav_2
  A list of the transactions that include Fav_3
  . . . .
  A list of the transactions that include Fav_M  (not N).

You can decide how deep you need to go into the Favorites.  Certainly you can eliminate any items that only appear once.  You will have to look at the data, or prototype a subsample to decide.  You would have to go through the entire data set at least twice.

Finally you would have to compare the Fav_ lists with each other to find how many transactions overlap.  This is a lot of work, but you can start at the top.  The most likely pairs (or triples or quads) of items are likely (but not guaranteed) to include the most common individual items.
0
 

Author Comment

by:basil365
ID: 39966774
Thanks d-glitch, thats the approach i've taken:

- store all combinations of ids when a transaction occurs, as well as the list within the transactions
- for each combination (up to a size of 6 at the moment) I'm iterating through all lists within each transaction and checking if the combination appears.


I'm using .net (programming with linq mainly) and just about to create a few unit tests to check performance

thanks
0
 

Author Comment

by:basil365
ID: 39974671
Speed is definitely the issue here - I had to further reduce my pattern size to combinations of 4 in order to keep up with transactions when using the above approach
0
 
LVL 85

Accepted Solution

by:
ozo earned 1000 total points
ID: 39974713
Suffix trees can find the most frequently occurring substrings of a minimum length in O(n) time, O(n) space
http://en.wikipedia.org/wiki/Suffix_tree
0
 

Author Comment

by:basil365
ID: 39982634
Thanks ozo - i'll take a look to see if i can apply these to the problem.
0

Featured Post

Fill in the form and get your FREE NFR key NOW!

Veeam is happy to provide a FREE NFR server license to certified engineers, trainers, and bloggers.  It allows for the non‑production use of Veeam Agent for Microsoft Windows. This license is valid for five workstations and two servers.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Article by: Nadia
Linear search (searching each index in an array one by one) works almost everywhere but it is not optimal in many cases. Let's assume, we have a book which has 42949672960 pages. We also have a table of contents. Now we want to read the content on p…
This article covers the basics of data encryption, what it is, how it works, and why it's important. If you've ever wondered what goes on when you "encrypt" data, you can look here to build a good foundation for your personal learning.
Although Jacob Bernoulli (1654-1705) has been credited as the creator of "Binomial Distribution Table", Gottfried Leibniz (1646-1716) did his dissertation on the subject in 1666; Leibniz you may recall is the co-inventor of "Calculus" and beat Isaac…
Finds all prime numbers in a range requested and places them in a public primes() array. I've demostrated a template size of 30 (2 * 3 * 5) but larger templates can be built such 210  (2 * 3 * 5 * 7) or 2310  (2 * 3 * 5 * 7 * 11). The larger templa…
Suggested Courses

927 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question