Solved

# O(log n) queries for subsequence problem

Posted on 2007-07-26
408 Views
Hello,

I have to solve the following problem in O(log n) for both updates and queries... the problem statement goes like this:

Given n numbers, answer m questions like this: what is the value of the highest sum subsequence in the interval [A, B], 1<= A <= B <= n?

My idea is to use an interval / segment tree, but I'm not sure how to build, query and update it...

I'd appreciate any pointers. I don't want any other code except maybe pseudo-code, I'm only after some ideas about how to solve this efficiently...
0
Question by:Dastas

LVL 58

Expert Comment

As stated, the answer is alway: Sum from i=A to i=B of xi where xi>0.

The way you incoporate A and B in the sequence 1...n, it indicates that these are indices, and not values. So [A,B] isn't an interval per se, but a subset of X.

Note that is doesn't really matter. If [A,B] does in fact represent an interval:

A <= xi <= B, where i = 1...n

This still creates a subset of k numbers, from which the largest subset sum is obtained by taking only the positive values, or none if all are negative (as a subset can be empty).

I'm probably missing the point, though.

(°v°)
0

LVL 84

Expert Comment

what was your idea for updating the interval / segment tree?
0

Author Comment

harfang:

A and B are indices. I need the maximum sum subsequence of the sequence [1, n], subsequence that is between the indices [A, B].

For example, considering the following sequence of numbers:

1 4 -2 -1 6 5 -20 4 5

The highest sum subsequence in the interval [1, 4] is 5 (the numbers 1 + 4)
The highest sum subsequence in the interval [1, 5] is 8 (sum of all numbers. note that, since this is a subsequence, all numbers composing it must have consecutive indices, so you can't leave out the negatives, unless you leave 6 out too, which would give a lower sum)

Consider that, if there are just negative numbers in the sequence, the answer is the highest one. For example, given the query [7, 7], the answer would be -20.

ozo:
Well, I thought of building it like this:
Let A be the array of the given n numbers.
It's trivial to calculate the value for the leaves of the tree, since those are intervals of the form [X, X]. That is, their value is always A[X]. So, once these are calculated, due to the recursive nature of the process, the value of their father in the tree is calculated like this:
The value is one of three things: the value of the subsequence in the interval represented by the left son, the value of the subsequence in the interval represented by the right son, or, a subsequence with elements from both of those intervals. It is obviously the maximum of these three cases. This is where I got stuck, however. If it's the one with elements in both intervals, how do I compute that?
0

LVL 58

Expert Comment

Ah, that makes sense. I misread "subset" for "subsequence".

If k is the number of elements between A and B, you have (k * (k+1)) / 2 possible subsequences. There is no elegant way to build a tree with them (in your example above, each node would have two fathers), so I can't think of a recursive algorithm.

So it's basically a two-variable nested loop problem. There might exist some borderline optimization in eliminating fringe negatives (provided there is at least one positive), but that probably irrelevant.

In pseudo code:

result = A[min]
for i = min to max {
this = 0
for j = i to max {
this += A[j]
if (this > result) then {
// record this, i, j
}
}
}

Perhaps I'm missing something else, now...
(°v°)
0

Author Comment

harfang:

You are. That is very slow when I can have M ( M <= 100000 ) such queries. You algorithm's complexity is O(N^2 * M) in the worse case, which, as you can see, is huge.

In fact, there's an easier O(N) dynamic programming algorithm, but that's too slow too when you have M = 100000 and N anything higher than 100. I need an algorithm that can solve the problem in O(M*logN) total complexity, that is, determine the value of the subsequence in logarithmic time. I'm not sure if it's possible to do this with interval trees, or if there's an easier way, it was just an idea. I'm open to any solution that can solve the problem in that complexity.
0

LVL 58

Expert Comment

OK, I see your point. The nested loops do need (n*(n+1))/2 tests at the worst (k=n). Do you have a pointer to the O(n) algorithm? Anyway, here is another overall optimization, especially good for non-random numbers. Transform:

1 4 -2 -1 6 5 -20 4 5

To alternating positive/negative sequences:

[1 4] [-2 -1] [6 5] [-20] [4 5]
+5 -3 +11 -20 +9

And consider only positive-to-positive sequences. This reduces the number of tests from 45 to 6, in this case.

There is again the borderline optimization: given that the max positive is 11, the "cost" -20 to include the trailing 9 is too high: it can be eliminated.

(°v°)
0

LVL 84

Expert Comment

You can find the max subequence in [1,n] in O(n), and you can lookup a table of [A,B] answers in O(log n), but updating the table in O(log n) when you update one of the numbers seems trickier...
0

Author Comment

harfang:

The O(n) dynamic programming algorithm for finding the maximum sum subsequence is this:

Let A be the array of the initial numbers, let B be an array such that B[i] = maximum sum subsequence ending at position i.

B[1] = A[1]
B[i, i > 1] = max(B[i-1] + A[i], A[i]) (you can either append A[i] to the already-computed maximum sum subsequence, or take A[i] alone, depending on which gives a better sum. This also works if all the numbers are negative.)

This can be used to answer each query in O(n) time, resulting in total O(n*m) time, but this is still too slow for my needs... I need O(m*logn).

ozo:

Can you explain in a little more detail how I would go about creating that lookup table? I can't just store the answer to each query in an array, since that isn't helpful either in the worst case where all queries are unique, even if I don't have to do updates (by updates I mean changing one of the initial numbers with another)...

Consider the problem of finding the sum in an interval [A, B] in O(log n). Consider updates as well. This can be done with interval trees, like this:

For every leaf node, the sum is always A[interval_associated_with_that_node]. Therefore, for every father, the answer is Interval_tree[left_son] + Interval_tree[right_son]. This is how the tree is built.

Now, for the queries: Consider the query [X, Y]. Add to a global variable S the value associated to all nodes whose associated interval is incloded in [X, Y]. Do NOT go deeper down that path in the tree (so as not to add the same values twice). At the end of the querying process, all nodes whose reunited associated intervals form the interval [X, Y] are checked, and so the variable S contains the sum of the interval [X, Y]. The worst case complexity of this is O(log n), since an interval tree has log n nodes.

Updating is done the same way. Recursively go down the paths that include (or might end up including) the indice of the updated element. Complexity is O(log n), worst case, but can be optimised since not all paths need to be checked, only those that the changed element affects. Alternatively, just call the build function again, but that has log n complexity ALL the time.

Maybe there's a similar solution to this one? I just can't come up with something that works...
0

Author Comment

Hello again.

It's been a week now since there has been no reply. If I am not going to get an answer, could I at least get a refund?

Thank you.
0

LVL 58

Expert Comment

Since you will run several queries (m = 100000) on the same set, the optimization involves creating a good abstract of your data.

Continuing from {http:#19584985}, you can first summarize your data in alternating positive and negative blocks.

[1 4] [-2 -1] [6 5] [-20] [4 5]
+5 -3 +11 -20 +9

You can then examine each negative block and, if the absolute value is smaller than both bordering blocks, summarize three blocks as a new one. This reflects the fact that the cost of the negative block is covered from both sides (for subsequences being built from both sides)

[+5 -3 +11] -20 +9
+13 -20 +9

You can also examine each positive value and do the same when the value is smaller than the absolute value of the negative blocks on both sides. It's never worth it to take that positive block *alone* when building a sequence from either side.

By repeating this process, you end up with a sort of tree, or rather a forest, with the stems having alternating positive and negative values that can no longer be compressed.

When searching the highest subsequence for [A,B], blocks not affected do not need to be recalculated. In the example above, all sequences which contain the first block will have the same answer. Also note that each block's value is the highest possible subsequence within, so you don't need to open it at all when you can discard it using the total. Likewise, you can discard negative blocks on either side (provided you have one positive block) because you cannot create a positive subsequence from it starting at either side.

You query engine will thus:

Get "stem" blocks for A,B. If it's a single block, look inside until there are several blocks. (If you end up with a pure first-level positive block, return A,B, a pure first-level negative block, return the highest between A and B)

In the list of blocks, discard negative blocks from both sides entirely, "open" positive block containing A or B, until you end up with a sequence of alternating blocks.

Then apply your O(n) strategy, using the entire blocks.

Does that help?
(°v°)
0

Author Comment

What if all the numbers are positive or negative? Or if the difference between the two is big? That won't improve it much, still O(queries*n) total...

I know for sure there's a O(queries*log n) worst case solution... you solution however can easily approach liniar complexity if the numbers negatives and positives aren't relatively the same...

And what about updates? Doing those precomputations takes liniar time, I can't afford to do any operation in liniar time (except maybe once, but queries and updates need logarithmic time)...
0

LVL 58

Expert Comment

If all numbers are positive, it's the "first-level positive block" rule: return A,B (the largest subsequence is to take all).

If all numbers are negative, it's the "first-level negative block" rule: return the largest value in the range (yes, that's linear time unless you index the block). [Formally, the solution for all negatives should be 0, but you said you didn't want that.]

The problem is interesting only with a mixture of positive and negative values.

Large differences do not affect the logic. Given three blocks (alternating signs), their absolute values can be in four forms (disregarding equals for now):

10 9 8 (decreasing); 8 9 10 (increasing); 8 10 9 (local max); 10 8 9 (local min)

Only the last case will allow a new block to be created. But that means all small blocks not on one end will be included in higher-level blocks. You will end up with a regular increase of absolute block values to a maximum, and a decrease after that. It will eliminate all local minimas.

I don't know if the allows a faster algorithm in the end. But in any case, it's independent from the absolute differences in the values.

An update potentially requires the entire tree to be recomputed, that is true. But you wanted to answer 100000 queries on the same set, didn't you? Preprocessing is thus worthwhile. But I realize that my suggestion is a form of overall indexing of the data.

(°v°)
0

Author Comment

Yes, but I can also have quite a lot of updates - which would be considerably slow, recomputing everything each time.

Anyway, I still don't think your idea approaches logarithmic time, however, I'll try it, and if I don't get any better answers in the meantime I'll award you the points.
0

LVL 58

Accepted Solution

Thank you.

> I can also have quite a lot of updates

I'm not entirely certain that a single update needs a full recalculation. You might find that it can be updated recursively starting from the new leaf, but I think it could potentially affect 1/2 of the tree in the worst case. I'd need to test it myself to make sure, though.

> I still don't think your idea approaches logarithmic time

There is the overhead of the pre-calculations, and that of the final O(k) algorithm on the resulting blocks. But the number of blocks to consider (k) is in a log relation to n.

As a matter of fact, I'm not sure there *is* a final O(k) algorithm at all.

Remember that when A or B falls in a negative block, it's skipped entirely (with a 50% probability). When it falls in a positive block, that block needs to be processed ("opened") only as long as the resulting total is larger than the next negative block in the subsequence (which has a 50% initial probability). When the total goes below, the block is skipped along with the next negative block.

That rule: "if a positive block on the end of the sequence is smaller than the absolute value of the next negative block, skip both" is especially easy to apply as the block values are locally sorted.

For example, consider this overall "stem" sequence, which no longer has any local minimas (if there were, it would become a new block):

+1 -2 +3 -4 +5 -6 +7 -6 +5 -4 +3 -2 +1

Given A and B, it's easy to find the largest block, as you know where the maximum is and where the sequence is ascending and descending. If A falls in the -2 block and B in the +7 block, you only have to consider the +5 block and explore the +7 block. That's all. In most cases, your answer will be a single block (which is intuitively satisfying). For example, whenever the range contains the +7 block, that's the answer.

> I know for sure there's a O(queries*log n) worst case solution...

If you ever find it, keep me informed; I've grown curious. If I had the time, I'd be programming already...

Good luck!
(°v°)
0

Author Comment

OK, points awarded to harfang. Your idea did speed things up a little. I still haven't found a log n solution.

Thanks!
0

LVL 58

Expert Comment

Thank you, and success with your quest!
(°v°)
0

## Featured Post

This algorithm (in C#) will resize any image down to a given size while maintaining the original aspect ratio. The maximum width and max height are both optional but if neither are given, the original image is returned. This example is designed t…
Linear search (searching each index in an array one by one) works almost everywhere but it is not optimal in many cases. Let's assume, we have a book which has 42949672960 pages. We also have a table of contents. Now we want to read the content on p…
It is a freely distributed piece of software for such tasks as photo retouching, image composition and image authoring. It works on many operating systems, in many languages.
This video gives you a great overview about bandwidth monitoring with SNMP and WMI with our network monitoring solution PRTG Network Monitor (https://www.paessler.com/prtg). If you're looking for how to monitor bandwidth using netflow or packet s…