Solved

# What is the computational complexity of this method?

Posted on 2011-03-13
Medium Priority
568 Views
Hi
I have a paper that does some statistical analysis on some data. The paper was returned by the reviewer with some comments, most of which are easy to understand, but one asks for the complexity of the method used, as below:

What is the computational complexity of this method? It does look very expensive to me.

Any idea what he meant here? I need some ideas in order to fix the paper, please.

0
Question by:jean11

LVL 3

Accepted Solution

_kiwi_ earned 252 total points
ID: 35121095
Hi,

Computational complexity is a way to represent the amount of work required by an algorithm to solve a problem. Other kinds of complexity exist (space complexity, i.e. the amount of memory required by an algorithm; complexity classes; ...).

Complexity is determined by studying how an algorithm runs over defined sets of input data, then taking the order of magnitude of the work required. (You will often find the classes O(1), O(log(n)), O(n), O(n.log(n)), O(n.m), O(n^2).)

It is given as three values:
- best case complexity: the set of input values that gives the best running time
- worst case complexity: the set of input values that gives the worst running time
- average complexity: an estimate of the running time "on average", i.e. over a statistical distribution of the input values

Let's take a simple algorithm (in pseudocode) to explain: searching for a value A in a non-empty list of values X = [X1, X2, X3, ..., Xn] sorted in ascending order, where the X* are positive integers and A >= X1, A <= Xn.

```
for (index = 1 to n)
    if A equals value at index of array X
        return index
    end if
end for
```

This is one of the simplest search algorithms.
Best case complexity: A == X1
If we run the algorithm, only one iteration executes: the first test succeeds, so the complexity is O(1).
Worst case complexity: A not in X
If we run the algorithm, we go through all n iterations of the loop, with 1 instruction per iteration, before finally answering Not Found, so the complexity is O(n).
Average case complexity: given a random value A between X1 and Xn, what is the behavior of our algorithm?
Given the number of possible values of A in the positive integer range, we most often fall into the case where A is not in X, so the complexity ends up as O(n).
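The same search can be sketched in Python (my own illustration, not from the original post), counting comparisons so the best and worst cases are visible:

```python
def linear_search(a, xs):
    """Scan xs left to right; return (index, comparisons) or (None, comparisons)."""
    comparisons = 0
    for i, x in enumerate(xs):
        comparisons += 1
        if x == a:
            return i, comparisons
    return None, comparisons

xs = [2, 5, 7, 11, 13]
print(linear_search(2, xs))   # best case, first element: (0, 1) -> O(1)
print(linear_search(4, xs))   # not in the list: (None, 5) -> all n comparisons, O(n)
```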

Small optimization: using the input constraints (X sorted in ascending order), we can change the algorithm to stop as soon as we have passed the place where A would be:
```
for (index = 1 to n)
    if A equals value at index of array X
        return index
    else if A lower than value at index of array X
        return Not Found
    end if
end for
```

Note: since A <= Xn, I will never get out of the loop without a return, so I can remove the default case.

Best case complexity: A == X1
If we run the algorithm, only one iteration executes: the first test succeeds, so the complexity is O(1).
Worst case complexity: A > X(n-1) and A < Xn (notice the worst case has changed)
If we run the algorithm, we go through all n iterations with 2 instructions each before answering Not Found, so n iterations of 2 instructions = O(2n) = O(n). Note that complexity is an order of magnitude, not an exact count: O(10n) still depends only on the variable n, and is considered equal to O(n).
Average case complexity: given a random value A between X1 and Xn
We still most often fall into the case where A is not in X, but on average randomness puts A in the middle of the X1..Xn range, so we would do only about half the tests, O(n/2), which is still O(n).
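A Python sketch of the early-exit version (again, my own names, not from the post): because X is sorted, the scan can stop as soon as it passes the value it is looking for, which is where the roughly-half-the-work average comes from.

```python
def sorted_search(a, xs):
    """Search an ascending list, stopping early once values exceed a."""
    steps = 0
    for i, x in enumerate(xs):
        steps += 1
        if x == a:
            return i, steps
        if x > a:            # xs is sorted, so a cannot appear further right
            return None, steps
    return None, steps

xs = [2, 5, 7, 11, 13]
print(sorted_search(4, xs))   # stops at 5 after 2 steps instead of scanning all 5
```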

Another example of complexity: a balanced binary tree search algorithm will run in O(log(n)).
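As an illustration of that O(log(n)) cost, Python's standard bisect module performs a binary search over a sorted list, which is the array analogue of a balanced-tree lookup:

```python
from bisect import bisect_left

def binary_search(a, xs):
    """Return the index of a in the sorted list xs, or None; O(log n) comparisons."""
    i = bisect_left(xs, a)
    if i < len(xs) and xs[i] == a:
        return i
    return None

xs = list(range(0, 1000, 2))    # 500 sorted even numbers
print(binary_search(42, xs))    # 21: found in about log2(500) ~ 9 comparisons
print(binary_search(43, xs))    # None: odd numbers are absent
```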

So, given your algorithm: identify the parts that depend on the amount of input data, simulate it running to determine the complexity, and think about the best, worst, and average cases.

Small optimizations (limiting the number of instructions inside a loop, ...) will make a "real life" difference if you run the loop a few million times, but they won't change your complexity; choosing adequate data structures and efficient algorithms to traverse them is the key.

Hope this helps,
Julien
0

LVL 28

Expert Comment

ID: 35123437
You'll need to post the algorithm for us to help tell you what its computational complexity is.

Doug
0

LVL 37

Assisted Solution

TommySzalapski earned 248 total points
ID: 35123600
What is the computational complexity of this method? It does look very expensive to me.
He's saying he thinks your method takes a long time to compute compared to the size of the input. Computational complexity is just a measure of the amount of computations required to complete the algorithm. Most people use 'big O' notation to describe the complexity. kiwi posted a good introduction to computational complexity and as dpearson noted, the more info you give us, the more specific we can be.
0

LVL 18

Expert Comment

ID: 35134513
As an example of complexity, let's consider sorting.
Brute-force sorting of N numbers compares each one with all the others, which is N x N, or quadratic.
Divide-and-conquer sorting is N log N, which is less expensive than N x N.
An ideal sorting algorithm would take N steps, which is linear and the least expensive (comparison-based sorts, though, cannot do better than N log N).
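To make the gap concrete, this small sketch (mine, purely illustrative) just evaluates the three growth formulas for increasing N:

```python
import math

# Growth of the three sorting costs mentioned above: quadratic, N log N, linear.
for n in (10, 1_000, 1_000_000):
    print(f"N={n:>9,}  N*N={n * n:>16,}  N*log2(N)={round(n * math.log2(n)):>12,}  N={n:>9,}")
```

At a million items, N*N is about fifty thousand times larger than N*log2(N), which is why the quadratic approach is considered expensive.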

What the reviewer is saying is that your algorithm could be better, i.e. the problem could be solved with less effort. It is not clear whether the comment is about CPU time (probably) or memory.

Jose
0

Author Comment

ID: 35142004

0

LVL 37

Expert Comment

ID: 35142080
So you need to calculate the support for every x,y pair. What function are you using for support? All pairs puts you at n^2 already.
If you have n nodes then there are n(n-1)/2 unique pairs so that in itself is O(n^2).
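The pair count is easy to check with a small sketch (illustrative values only):

```python
from itertools import combinations

n = 6
products = list(range(n))
pairs = list(combinations(products, 2))   # every unordered (x, y) pair once
print(len(pairs), n * (n - 1) // 2)       # both are 15, i.e. O(n^2) pairs
```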
0

Author Comment

ID: 35142098

As you see, I have no program here. The formula above calculates the Normalized Support.

Normalized support is used to define the minimum threshold value for the product combination based on the general support of the last two campaigns in order to maximize product sale. This should enable the marketing or the sales team to design future discount coupons or offers for product promotions. It can be said that Normalized support takes the general support to the next level.

Normalized Support Defined
see the formula above.

Methodology
Data Source
The paper makes use of data from a real grocery store to demonstrate the real world applicability of Normalized Support. The data was obtained from an anonymous store.

Real Data

A set of real grocery store data was used to demonstrate the scale of normalized support. It was part of the sales database of a grocery store chain. Point-of-sale data from four different stores in the area were collected in a central system, with each transaction including the store code. The point-of-sale data were collected in a flat file, and each record included the item discount coupon code (offer code), the store number and the transaction identifier.

The first step of the analysis process involved data cleaning, since some of the sales records lacked the transaction identifier while others lacked the offer code. This was followed by taking out the top nine products. The cleaning process eliminated around 49% of the total records. The outcome was a reliable dataset suitable for analysis. In this study, discount coupons were considered for a product.

In the second step, the text formatted data of transaction records was loaded into a SAS database for querying. Customer basket analysis involves finding the number of occurrences of every item to all other items to find possible relationships between them.

In order to make a practical analysis and identify the potentially important items, the study used a two-step approach to screen the data records. The structure of the sales data table is illustrated in Fig. 1.1, which shows the items that were important to the analysis and demonstrates Normalized Support.
Data attributes:

TRANSID = Transaction ID [the first three digits represent the store id; the next six digits represent the transaction code]
PRODUCTS=PRODUCT ID
OFFER CODE=DISCOUNTED COUPONS CODE

0

Author Comment

ID: 35142143
TommySzalapski:

What function are you using for support?

Thanks
0

LVL 37

Expert Comment

ID: 35142182
So supp(x,y) = (count_transactions(x)+count_transactions(y))/count_transactions(all)?
0

LVL 37

Expert Comment

ID: 35142189
Then your complexity is just O(n^2) where n is the number of products. That doesn't seem so bad to me.
0

LVL 28

Expert Comment

ID: 35142239
The complexity of the overall calculation looks to be O(n^2 * k) where n is the number of x,y values you are considering and k is the number of transaction records per x,y pair.

Doug
0

LVL 37

Expert Comment

ID: 35142312
By the way, is that how your paper actually looks? No offense, but your equations look really bad. You know that you can use Word's equation editor to make fractions and square roots and such right?
0

LVL 37

Expert Comment

ID: 35142329
Right. I was assuming the counts of transactions were known already. If they are not known and you actually loop through them counting them one at a time, then you would need to add that in there. If the counts are already computed though, there's no need.
0

LVL 37

Expert Comment

ID: 35142662
This screenshot is of an equation created in Word. You can make them look quite nice.

0

Author Comment

ID: 35142701
TommySzalapski:

No, that is not taken from the paper; the paper looks better.
So you are saying the complexity is O(n^2 * k). Then how does it not seem so bad? I think that if the complexity is O(n^2 * k), then it is bad. Can you please tell me why it is not so bad?

dpearson:
Thanks for the help.
If it is bad, how can we optimize this?

Thanks

Jean
0

Author Comment

ID: 35142722
Thanks for the image. Can I get an editable version?
0

LVL 28

Expert Comment

ID: 35142873

dpearson:
Thanks for the help.
If it is bad, how can we optimize this?

Good and bad are relative terms :)

I'd agree this looks fairly slow for large n and k.

To speed it up you should first compute sum(transactions(x)) for each x and store it as a cached value.  That step is O(n * k), but requires O(n) space to store the totals.

Then the support calc would be a further O(n^2) - because now you can look up the sum(transaction(x)) in O(1).

So the total becomes O(n*k) + O(n^2) with O(n) additional space.  If k is less than n, this is O(n^2) overall - but that depends on the data set (can't tell from the algorithm's definition).
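A minimal Python sketch of this two-step idea, using a made-up transaction log and the additive count(x)+count(y) support quoted earlier in the thread (the data layout and names are my assumptions, not the paper's):

```python
from collections import Counter

# Hypothetical flat log: (transaction id, product id) rows, as in a POS export.
rows = [(1, "a"), (1, "b"), (2, "a"), (2, "c"), (3, "a"), (3, "b"), (3, "c")]
total = len({t for t, _ in rows})        # number of distinct transactions

# Step 1: a single O(k) pass caches count(x) for every product (O(n) extra space).
counts = Counter(product for _, product in rows)

# Step 2: each support lookup is now O(1), so all n^2 pairs cost O(n^2) total.
def supp(x, y):
    return (counts[x] + counts[y]) / total

print(supp("a", "b"))   # (3 + 2) / 3
```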

Doug
0

LVL 37

Expert Comment

ID: 35142913
I've attached the Word file I used to create the equation. If you use 2003, then it won't work. But you can use the equation editor yourself and make the exact same thing.

You say Sup(X,Y) is the count of transactions of X and Y divided by the total transactions.
Are these counts already known or do you recount every time? If they are already known, then as I  mentioned before, you are only at O(n^2).
0

LVL 28

Expert Comment

ID: 35142948
The transaction data appears to be in the database as a list of transactions - so it would require work to compute the counts.

Doug
0

LVL 28

Expert Comment

ID: 35142963
I should have said that I'm assuming the database is correctly indexed.  If not, computing sum(transactions(x)) actually involves scanning all rows of the database - making it O(n*k) per summation instead of O(k).

But that's not likely the case.  Databases are good at indexing...

Doug
0

LVL 3

Expert Comment

ID: 35142966

n products
k transactions
l transaction details lines (entry of one product in a transaction)
t = average number of different products in a transaction (constant)

we agree that sup(X,Y) = sup(Y,X) [same pair = same support]

Instead of approaching the problem through pairs of items, you could take the algorithm from the transactions side.

Each transaction includes a defined number of products (t). Make a two-dimensional map, indexed by product ID X and product ID Y, whose value is the number of common occurrences of products X and Y.

Your support computation algorithm could then be:

Loop over your k transactions, and on each iteration, for each product id X in the transaction, take all other product ids Y (where Y > X) in the transaction and increment the value support[X][Y].

Going over all transactions, and on each transaction looping through its products (ask SAS to give you the data sorted by transaction id, then product id), has a complexity of k loops of t*(t-1)/2 = O(k*t^2); but t being a small constant (you rarely have 200 products in a basket), this reduces to O(k).

The space complexity of the support map is O(n^2). Use an efficient data structure for inserting values into the map and you will have a computational complexity of:
- O(k) if you use something like a two-level hash table (a hash table linking to another hash table giving the requested value); insertion into a well-designed hash table is O(1)
- O(k*log(n)) if you use something like a two-level self-balancing tree

Computing the total transaction counts can be done in the same k loop: each time you see a product ID X, you increment a binary tree entry in O(log(n)) or a hash table entry in O(1).

Hash tables are cheaper in CPU but bigger in space, and must be well designed, because automatically growing hash tables are slow to resize, while trees are more predictable even when not fine-tuned.

To build your dataset, you would have:
- with a hash table structure: O(k)
- with a binary tree: O(k*log(n))
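A minimal Python sketch of this transaction-side pass (hypothetical baskets; a flat dict keyed by ordered pairs stands in for the two-level hash table):

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical input: transactions already grouped as baskets of product ids.
transactions = [["a", "b", "c"], ["a", "c"], ["b", "c"]]

support = defaultdict(int)   # co-occurrence counts, O(1) insert per pair
totals = defaultdict(int)    # per-product transaction counts

for basket in transactions:              # one O(k) pass over the data
    items = sorted(set(basket))
    for x in items:
        totals[x] += 1
    for x, y in combinations(items, 2):  # Y > X, so each pair is counted once
        support[(x, y)] += 1

print(support[("a", "c")], totals["c"])  # "a" and "c" co-occur twice; "c" appears 3 times
```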

Hope this helped,
Julien
0

LVL 28

Expert Comment

ID: 35142998
kiwi's right that you could trade space for time like this - but usually O(n^2) space isn't feasible for large n.

Doug
0

LVL 3

Expert Comment

ID: 35143029
That's the whole point: not knowing the inputs makes it difficult to design, so we're stuck with educated guesses...

But even a 10K-product catalog would lead to something like:
- (10K*10K/2) = 50M values in the worst case (at least one pair of each product X with each product Y)
- at 50 bytes per record (structure + data + pointers) --> 2.5 GB of memory

That's affordable for a SAS like server.

If you are on a major ecommerce site and want to try a n^2 on the entire 10M products catalog, then you'll have quite a problem :)

Julien
0

LVL 3

Expert Comment

ID: 35143067
Or, if space is a problem, make the storage a SAS table or a small database. Access will be slow (way slower than memory, on the order of a few milliseconds per read/write, so you'll want to optimize them), but you can scale easily.

We do that on some statistics computation in my company, with a few million possible products and many million users, it works efficiently (for our needs).

Julien
0

LVL 28

Expert Comment

ID: 35143106

That's the whole point: not knowing the inputs makes it difficult to design, so we're stuck with educated guesses...

Fair enough :)

But usually you're best to assume O(n) space for most algorithms - since the data you start from is itself usually O(n), making extra copies of the data may be reasonable, but O(n^2) space isn't usually safe to assume.

At least that's my opinion :)

Doug
0

LVL 3

Expert Comment

ID: 35159869
And I completely agree with you Doug.

Preparing data beforehand would help a lot, and one would probably not compute all the supports between all products, but simply compute the support of one top-selling product against the other significant products, making it cheaper in both memory and CPU.

Cheers,
Julien
0

Author Comment

ID: 35214987
TommySzalapski:

>You say Sup(X,Y) is the count of transactions of X and Y divided by the total transactions.
>Are these counts already known or do you recount every time?
yes we count them every time.

_kiwi_:

I am lost. I do not understand your solution to speed this up, and I do not know why you are concerned with the space complexity here. Can someone summarize the computational complexity and the solution, please?
0

LVL 3

Expert Comment

ID: 35217618
The optimization is about reading the purchase lines only once, rather than going through all products for each product you want to compute, giving O(k) instead of O(k+n^2), i.e. a linear complexity.

The space complexity concern is that, to work, this approach has to keep up to date an array containing, for each product, the number of times it is found with every other product, giving a potential array of n^2 elements: you trade computational complexity for space complexity.

The algorithm you choose will therefore have to take into account the number of products (n) and the number of purchase lines (k) to be efficient.

If n is small and k is huge, the array won't cost much memory, and the time gain will be huge (only one loop). (Top-sellers scenario.)

If, on the other hand, n is huge, the space taken by the array can be so large that it makes no sense to store the whole thing. (Long-tail scenario.)

Cheers,
Julien
0

LVL 3

Expert Comment

ID: 35217676
To restate: the computational complexity is a measure of the number of computing operations made by your program, given the volume of data it will process (number of products, number of purchase lines, number of customers, ...).

The space complexity is the general idea of the memory amount that your program will require.
0

Author Comment

ID: 35221815
So what is the computational complexity if the counts are not known in advance and we count them every time?

And what is the computational complexity after the optimization, please?

0

LVL 3

Expert Comment

ID: 35226487
The computational complexity depends on how you implement your algorithm. What you have published so far is only a general presentation of a formula, not a real implementation, so I can't give the complexity of your exact implementation. From what you say, it would be something like O(k+n^2).

The "optimized" version has, as stated, a O(k) computational complexity and a worst case O(n^2) space complexity.
0

Author Comment

ID: 35297214

If I change the algorithm to the attached one, will the complexity change?

0

LVL 3

Expert Comment

ID: 35297324
Hello,

To me, since the complexity of the algorithm is mainly contained in the "Number of transactions done (product x and product y)" statement, restating your algorithm this way does not change the complexity much.

Julien
0

LVL 28

Expert Comment

ID: 35303319
Jean,

I think you're fundamentally confusing a formula and an algorithm.  A formula shows what the results of a calculation will be.  An algorithm shows the method for computing that result.  Algorithms include steps like loops and variables (for storage) which formulae can usually ignore.  It's a bit like the difference between a theorem and a proof - algorithms have more steps, a formula is just a summary.

As kiwi said, the question here is really how you're going to calculate "Number of transactions done" for the different product x,y.  There are different ways to do this (we've described several of them earlier in this question) and each has different computational complexity - from O(n*k) + O(n^2) using O(n) space down to O(k) using O(n^2) space.

Exactly what the complexity is depends on the precise method for computing the Sup(X,Y) value - which isn't specified in your problem description.

Hope that helps clarify things,

Doug
0

LVL 37

Expert Comment

ID: 35306644
Yes. Can you post the actual algorithm you used when running the experiments? Then we can help you much more specifically.
0

LVL 101

Expert Comment

ID: 36275166
This question has been classified as abandoned and is closed as part of the Cleanup Program. See the recommendation for more details.
0
