Solved

Compressing a list of integers

Posted on 2007-03-26
518 Views
Last Modified: 2012-06-27
I'm trying to find a way to compress a list of integers (16 bit). The list has a few particularities:

- always 48 numbers in the list
- all integers (none above 65536)
- the sum of all numbers will never be above 65536 (I feel this could be key in making the compressed value take up little room)
- 0 might naturally occur more often as a value, but all should be considered random.

I want to compress these so I can decompress without loss of precision. Is there something better than a list of 96 chars (which is no compression at all)?
Question by:Gorizo
27 Comments
 
LVL 84

Expert Comment

by:ozo
ID: 18798014
The average number is less than 1366, which could fit in 11 bits.
If you used 4 bits to store the number of bits needed to store the number in binary,
followed by that many bits of the number, you would never need more than 720 bits to express all 48 numbers.
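
For illustration, a minimal C sketch of this 4-bit-length-prefix idea (the names put_bits, bit_length, and encode_list are invented here, not ozo's, and it assumes no single value needs more than 15 bits, i.e. none reaches 32768 -- ozo's later refinements handle the full range):

#include <stdint.h>
#include <string.h>

static unsigned char bitbuf[128];   /* output buffer; ozo's bound says <= 720 bits for valid input */
static int bitpos = 0;

/* append the low n bits of v, most significant bit first */
static void put_bits(unsigned v, int n) {
    for (int i = n - 1; i >= 0; i--) {
        if ((v >> i) & 1u)
            bitbuf[bitpos >> 3] |= (unsigned char)(0x80 >> (bitpos & 7));
        bitpos++;
    }
}

/* number of bits needed to hold v (v = 0 needs none) */
static int bit_length(unsigned v) {
    int n = 0;
    while (v) { n++; v >>= 1; }
    return n;
}

void encode_list(const uint16_t val[48]) {
    memset(bitbuf, 0, sizeof bitbuf);
    bitpos = 0;
    for (int i = 0; i < 48; i++) {
        int n = bit_length(val[i]);   /* assumed to be at most 15 here */
        put_bits((unsigned)n, 4);     /* 4-bit length prefix */
        put_bits(val[i], n);          /* then the value itself in n bits */
    }
}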
 
LVL 84

Expert Comment

by:ozo
ID: 18798049
An improvement may be to use 3 bits to say whether the number will need to be stored in 9-16 bits, or, if 0 is expected to be common, in 0 or 10-16 bits.
This should not take more than 688 bits to express all 48 numbers.
 
LVL 28

Expert Comment

by:pepr
ID: 18798126
What is the purpose? Do you have many such chunks? Is the reason to save space (on disk) or to save communication bandwidth (i.e. messages as short as possible)? If the chunks together form a bigger block of data, or if the data should be sent as a whole, it is probably always better not to use any optimisation of that kind and to use some general packing/unpacking algorithm instead.

The chunk is quite small and you will always need some extra info. The algorithm will be more complex and you can introduce more errors.
 
LVL 45

Expert Comment

by:Kdo
ID: 18799142
Hello Gorizo,

Another improvement may be to use a 2 bit flag/modifier with 8 or 16 bits of data.

The 2 bit flag will contain 0, 1, 2, or 3.

 0 -- The current data value is the next 8-bit value.
 1 -- The current data value is the previous data value plus the next 8-bit value
 2 -- The current data value is the previous data value minus the next 8-bit value
 3 -- The current data value is the next 16-bit value.

This will allow all values from 0 to 511, and all values within +/- 511 of the previous value, to be stored in 8 bits.  For every data value after the first item in the list, 1,023 to 1,534 values will store in 8 bits.

The total compression will vary a bit as the compression factor will be data-dependent, but the average total length should be less than 576 bits.


For speed purposes, I would limit the embedding of flags and data.  Two approaches come to mind.

1 -- A byte contains 4 2-bit flags that describe the next 4 values.
2 -- All 48 flags precede the data in 96 bits.

I'm kind of partial to method 1.


Good Luck,
Kent
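
For illustration, a rough C sketch of the flag selection described in this comment (this is one reading of the scheme, not Kent's code; note that a plain 8-bit field holds 0-255, and the flag packing itself is left out -- this only counts the encoded size):

#include <stdint.h>

/* pick a flag per value and return the total encoded size in bits
   (2-bit flag plus an 8- or 16-bit field per value) */
int flagged_size_bits(const uint16_t v[48]) {
    int bits = 0;
    uint16_t prev = 0;
    for (int i = 0; i < 48; i++) {
        int diff = (int)v[i] - (int)prev;
        if (v[i] <= 255)
            bits += 2 + 8;            /* flag 0: absolute 8-bit value */
        else if (i > 0 && diff >= 0 && diff <= 255)
            bits += 2 + 8;            /* flag 1: previous value plus 8-bit delta */
        else if (i > 0 && diff < 0 && diff >= -255)
            bits += 2 + 8;            /* flag 2: previous value minus 8-bit delta */
        else
            bits += 2 + 16;           /* flag 3: absolute 16-bit value */
        prev = v[i];
    }
    return bits;                      /* uncompressed would be 48 * 16 = 768 bits */
}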
 
LVL 84

Expert Comment

by:ozo
ID: 18799196
I think a different set of 4 options may be better.
If the data values are randomly distributed around an average value of 65536/48, then most of the values would require a flag value of 3, with the value in the next 16 bits.
 
LVL 84

Expert Comment

by:ozo
ID: 18799272
If the order of the values does not matter, then we could sort the values and store the differences, and the differences should cluster more strongly towards smaller values.
With sorted values, there couldn't be more than 22 difference values that require more than 8 bits (a difference of 256 or more that sits k-th from the top of the sorted list contributes at least 256*k to the total sum, so 256*(1+2+...+k) <= 65536 limits k to 22).
 
LVL 45

Expert Comment

by:Kdo
ID: 18799849

Hi Ozo,

> If the data values are randomly distributed around an average value of 65536/48 then most of the flag values would require a flag value of 3 with the value in the next 16 bits

This misses the key requirement that "the sum of all numbers will never be above 65536".  Every value over the average (1366) increases the likelihood of a subsequent 8-bit value occurring.  The higher the value, the more 8-bit values will be required to maintain the average.  If all the values are within a standard deviation (or so) of the average, something close to all of the values will be stored as 8-bit values.


Kent
 

Author Comment

by:Gorizo
ID: 18800741
Hi all. So far I like a lot of the ideas. I will eventually decide on which is more usable/easy to implement.

To answer the questions:

- Order of values will matter and must remain the same, or be recoverable (sorry, huge omission on my part, I know)
- This is for a database field. I'm toying with the idea of having one compressed field instead of 48 fields numbered "F1" thru "F48", always queried according to other accompanying values on the row. There is no need to query on F1 or the others, just to fetch the result and calculate from there. The table in question is already quite wide, and I expect it to grow quite long too (possibly 4000 rows per day).
 
LVL 45

Expert Comment

by:Kdo
ID: 18801065
Hi Gorizo,

You can probably squeeze a few more bytes out of my proposed solution if you incorporate an "assumed" value.

Every time a fixed value is stored (code 0 or 3), the algorithm should use a value equal to or less than the average as the basis for a relative value in the next byte.  Here's an example:

Value    Code    Stored value
    1      0          1
  800      3        800
  900      1        466        -- Note below
  850      1         50
 3000      3       3000
 1366      1          0        -- Note below
 1365      1          1

When a code 0 or 3 is stored, 1366 is assumed as the base value for a relative code in the next byte.  In line 3, 900 is stored as -466 (code 1, value 466) because line 2 is a constant (code 3).  Average (1366) - 466 = 900.  Likewise, in line 6, 1366 is stored as +0 (code 1, value 0) because line 5 is a constant (code 3).  Average (1366) + 0 = 1366.

If a value > 33,000 occurs, the next value cannot possibly be derived as a relative value.  33,000 + 33,000 (+/- 512) is greater than 65536, which would violate the rules.  Using an assumed value after a constant is put in the buffer brings the (8-bit) relative codes back into play.

Similarly, if a value from 1 to 511 is stored, relative positioning is less likely to occur in the next word, as there is an overlap between the stored and relative values.  If 0 and 14 are to be stored in consecutive words, the 14 could be produced as (0,0) (0,14) or (0,0) (1,14).  In this case there are no values greater than 511 that can be represented as a relative value for the second word.  If 0 and 1366 are to be stored in consecutive words, the assumed value lets them be stored as (0,0), (1,0), saving 8 bits.

This will likely squeeze another byte or two out of the result.  (The original list is only 96 bytes so there aren't really that many to squeeze out.)

It's really not as complicated as this might make it sound.  It'll write very easily in C or a similar language.


Kent
 
LVL 84

Expert Comment

by:ozo
ID: 18802545
Without more restrictions on the numbers, it is possible that your list will be:
1622
1108
1622
1108
1622
1108
1622
1108
1622
1108
1622
1108
1622
1108
1622
1108
1622
1108
1622
1108
1622
1108
1622
1108
1622
1108
1622
1108
1622
1108
1622
1108
1622
1108
1622
1108
1622
1108
1622
1108
1622
1108
1622
1108
1622
1108
1622
1108
which would take 864 bits using the above code

 
LVL 45

Expert Comment

by:Kdo
ID: 18802795
Hi ozo,

I acknowledge that there are certainly worst-case scenarios.  But by using the assumed average that I mentioned above, this list compresses to (3,1622), (2,258) [repeating], or 1.5 bytes (plus 2 bits) per entry for 84 total bytes -- 12.5% smaller than the original 96.

It's hard to nail down exactly how effective the compression will be since it is quite data dependent.  A single large value (or a small total sum) could result in the entire list being represented in 1 byte (plus 2 bits) per entry.  Such a list stores in 60 bytes -- 37.5% smaller than the original 96.

Intuitively, I want to say that using 1023 for the assumed average is superior to using 1366.  That results in all values from 0 to 1534 representable as an 8-bit value after any constant.  And since the appearance of a large value in the list reduces the average of the other values, this could result in more relative values being recorded.


Another idea that I've considered is a combination of your 3/4 bit length and this relative values technique.  Include other functions into the code set.  [Repeat previous value], [Repeat previous value N times], [Repeat previous N values M times], [Insert 0 N times], etc.



Still, we're trying to compress only 96 bytes.  It still seems kind of silly to me.   :)
Kent
 
LVL 84

Expert Comment

by:ozo
ID: 18802842
0 -> number is between 0 and 2047, 1 + 11 bits
10 -> number is between 2048 and 4095, 2 + 11 bits, happens at most 31 times
110 -> number is between 4096 and 8191, 3+12 bits, happens at most 15 times
1110 -> number is between 8192 and 16383, 4+13 bits, happens at most 7 times
11110 ->  number is between 16384 and 32767, 5+14 bits, happens at most 3 times
111110 -> number is between 32768 and 65535, 6+15 bits, happens at most once
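
For illustration, this table as a per-value cost function in C (code_bits is an invented name; writing the actual bits would use a small bit writer):

#include <stdint.h>

/* prefix length + payload length for one value, per the table above */
int code_bits(uint16_t v) {
    if (v < 2048)  return 1 + 11;   /* prefix 0,      payload v          */
    if (v < 4096)  return 2 + 11;   /* prefix 10,     payload v - 2048   */
    if (v < 8192)  return 3 + 12;   /* prefix 110,    payload v - 4096   */
    if (v < 16384) return 4 + 13;   /* prefix 1110,   payload v - 8192   */
    if (v < 32768) return 5 + 14;   /* prefix 11110,  payload v - 16384  */
    return 6 + 15;                  /* prefix 111110, payload v - 32768  */
}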
 
LVL 84

Expert Comment

by:ozo
ID: 18802860
258 does not fit in 8 bits
 

Author Comment

by:Gorizo
ID: 18805056
ozo, your last solution looks good. I really like its mathematical structure.

I've devised something from that.

Suppose that we reserve 4 bits for a value, n. Thus, n will range between 0 and 15.
Then, the next n bits are the remainder of the value, to be added to 2^n.

Thus:

0000 0 -> 0 (special case)
0000 1 -> 1 (special case)
0001 0 -> 2
0001 1 -> 3
0010 00 -> 4 (4+2 bits)
0010 01 -> 5
0010 10 -> 6
0111 1111111 -> 255 (4+7 bits)
1000 00000000 -> 256 (4+8 bits)
1001 000000000 -> 512 (4+9 bits)
1010 0000000000 -> 1024 (4+10 bits)
1011 00000000000 -> 2048 (4+11 bits)
1100 000000000000 -> 4096 (4+12 bits)
1111 111111111111111 -> 65535 (4+15 bits)
etc.

The formula would be 2^n + (n bits).

This makes lower values take up fewer bits, mid-range values take up more, and high values about the same.
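
For illustration, a small C sketch of this split (split_value/join_value are invented names; packing the two fields into the bit stream is left to a bit writer):

#include <stdint.h>

/* split v into the 4-bit field n and the n following bits (rem), so that v = 2^n + rem;
   the special cases 0 and 1 use n = 0 followed by a single bit */
void split_value(uint16_t v, int *n, uint16_t *rem) {
    if (v < 2) { *n = 0; *rem = v; return; }
    int b = 15;
    while (!((v >> b) & 1)) b--;        /* position of the highest set bit, 1..15 */
    *n = b;
    *rem = (uint16_t)(v - (1u << b));
}

uint16_t join_value(int n, uint16_t rem) {
    return (n == 0) ? rem : (uint16_t)((1u << n) + rem);
}

/* e.g. split_value(255, &n, &rem) gives n = 7, rem = 127 -> "0111 1111111" as in the table */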
 
LVL 84

Accepted Solution

by:
ozo earned 500 total points
ID: 18805938
generally, you want to allocate more bits to rarer events so that you can use fewer bits on more common events.
The constraints on 48 non-negative integers that sum to at most 65536 force some values to be relatively rare.  
(and I think I miscounted: if the sum can be <= 65536, then a number between 32768 and 65535 can occur at most 2 times, a number between 16384 and 32767 can occur at most 4 times, etc.  Also, if one of the numbers can itself be 65536 (the constraint was "none above 65536", not none above 65535), then we need to be able to represent one more value.
perhaps we could add something like
0 -> number is between 0 and 2047, 1 + 11 bits
10 -> number is between 2048 and 4095, 2 + 11 bits, happens at most 32 times
110 -> number is between 4096 and 8191, 3+12 bits, happens at most 16 times
1110 -> number is between 8192 and 16383, 4+13 bits, happens at most 8 times
11110 ->  number is between 16384 and 32767, 5+14 bits, happens at most 4 times
11111 -> number is between 32768 and 65535, 5+15 bits, happens at most 2 times (but the 0 after the 11111 was not necessary)
11111 000000000000000 -> 32768
...
11111 111111111111110 -> 65534 at most 1 time 5+15 bits
11111 111111111111111 0 -> 65535 at most 1 time 5+15+1 bits
11111 111111111111111 1 -> 65536 at most 1 time 5+15+1 bits
)
so we would actually do better to use more bits for larger values, if that allowed us to use fewer bits for the more common values.
But it would be hard to get below 1+11 bits for the potentially most common range unless we go to representations that allow fractional bits per number.
Which is doable, but may not be worth it.
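
For illustration, a decoder sketch for this revised table (the rd_* bit-reader helpers are invented here, not part of the accepted answer; the return type is 32-bit only so the value 65536 can be represented):

#include <stdint.h>

struct rd { const unsigned char *buf; int pos; };

static unsigned rd_bit(struct rd *r) {
    unsigned b = (r->buf[r->pos >> 3] >> (7 - (r->pos & 7))) & 1u;
    r->pos++;
    return b;
}

static unsigned rd_bits(struct rd *r, int n) {
    unsigned v = 0;
    while (n--) v = (v << 1) | rd_bit(r);
    return v;
}

/* decode one value, 0..65536 */
uint32_t decode_one(struct rd *r) {
    int ones = 0;
    while (ones < 5 && rd_bit(r))      /* count leading 1s of the prefix, at most 5 */
        ones++;
    switch (ones) {
    case 0:  return rd_bits(r, 11);                /* 0      -> 0..2047      */
    case 1:  return 2048  + rd_bits(r, 11);        /* 10     -> 2048..4095   */
    case 2:  return 4096  + rd_bits(r, 12);        /* 110    -> 4096..8191   */
    case 3:  return 8192  + rd_bits(r, 13);        /* 1110   -> 8192..16383  */
    case 4:  return 16384 + rd_bits(r, 14);        /* 11110  -> 16384..32767 */
    default: {                                     /* 11111  -> 32768..65536 */
        uint32_t v = 32768 + rd_bits(r, 15);
        if (v == 65535)                            /* all-ones payload gets one extra bit */
            v += rd_bit(r);                        /* to distinguish 65535 from 65536 */
        return v;
    }
    }
}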
 
LVL 84

Expert Comment

by:ozo
ID: 18805954
> Suppose that we reserve 4 bits to a value, n. Thus, n will range between 0 and 15.
that's close to my initial idea, but since 0-2047 could be the most common case, it's more important to minimize the number of bits used there.
 

Expert Comment

by:lauwr
ID: 18806510
Just a thought, but I can offer a solution that guarantees compression into exactly 539 bits.
Remove the maximum value. Store it (16 bits) and where it was (6 bits).
Encode the remaining 47 values, each of which is now guaranteed to be less than 1366, in 11 bits each. (47 * 11).
Naturally, you could improve on that last bit. E.g. In threes, such values could fit in 31 bits, rather than 33, so you could squeeze another improvement there, to 528 bits, if you were picky.
 
LVL 45

Expert Comment

by:Kdo
ID: 18807449
Hi ozo,

>> 258 does not fit in 8 bits

Sorry.  That's what happens when you quit designing for a living and start using application tools (SQL).  I was putting 511 into 8 bits, not 255.

Still, this assumed average has possibilities.  I'd like to get some of Gorizo's real data to prove its worth.


Regarding your compression, if the data really is random then the likelihood of storing multiple values of 0 and 1 is pretty small.  It might do well to store them in two bits (as if they were 2/3) and reserve 0 and 1 for special effects.

A 0 for a special effect could mean "repeat previous" so that a repeated value takes only 5 bits, not 5 bits + length, etc.



Hi lauwr,

I don't think that will work.  Removing the highest value guarantees that the AVERAGE of the remaining 47 values is less than 1366 -- and then only if the highest value is greater than 1366.



Kent
 
LVL 22

Expert Comment

by:NovaDenizen
ID: 18808533
Here's a variant on ozo's solution, using a Huffman code.  This involves a preliminary phase where you do some analysis on your data and use a histogram to construct an efficient code table.

struct codepoint {
    int base;
    int followingbits;
};

Then you construct a list of codepoints that look like they will cover your data reasonably efficiently.  Frequencies aren't known yet:

struct codepoint codes[] = {
    { 0, 0 },   // for zero
    { 1, 4 },   // covers numbers 1 to 1+2^4-1 = 16
    { 17, 6 },  // covers 17 to 17+2^6-1 = 80
    { 81, 10 }, // covers 81 to 81+2^10-1 = 1104
    // ... and so on.  Make sure they eventually cover all values up to 65535.
};

Then you use codes[] and a large sample of data to estimate the frequencies for each of these codepoints.  From these frequencies you can construct a Huffman code, then use this Huffman code to encode and decode the values.

Each encoded value will be a Huffman code followed by the appropriate number of following bits.
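
For illustration, the lookup step this implies, in C (find_codepoint and the catch-all last entry are invented here): given a value, find the codepoint that covers it and the offset that would follow that codepoint's Huffman code.

#include <stdint.h>

struct codepoint { int base; int followingbits; };

static const struct codepoint codes[] = {
    { 0, 0 }, { 1, 4 }, { 17, 6 }, { 81, 10 },
    { 1105, 16 },                     /* catch-all so every value up to 65535 is covered */
};
static const int ncodes = sizeof codes / sizeof codes[0];

/* returns the codepoint index; *offset is what goes in the following bits */
int find_codepoint(int value, int *offset) {
    for (int i = ncodes - 1; i >= 0; i--) {
        if (value >= codes[i].base) {
            *offset = value - codes[i].base;   /* emitted in codes[i].followingbits bits */
            return i;
        }
    }
    return -1;                                 /* unreachable when the table starts at base 0 */
}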
 

Expert Comment

by:lauwr
ID: 18808700
Sorry. You're quite right. There could be just two non-zero values, both 32765. However, I feel sure I can modify my idea only a little, to cover that...

Remove the highest 16 elements. Store these as they come. Record a bit mask of 48 bits to show which they were. Store the remaining 32 in 12 bits each.
Net saving is 32*4-48 = 80 bits.

The point I'd like to make is that, if the 48 16-bit values are totally random, other than the sum being less than 65536, you must directly use that fact in any compression you devise.

None of the earlier techniques above directly do so. They are doomed to, in general, increase the average total bits required to store the data. If there were more than 48 values, I'd be less sure of this.

I can't get too excited over my method above (I've saved 10 out of 96 bytes), but it's not just 10 bytes, of course. By being a fixed length field, it has saved anywhere between one and eight more bytes, by not needing a field length.
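
For illustration, the selection step of this 16/32 split in C (top16_mask is an invented helper); the 17th-largest of 48 values summing to at most 65536 is at most 65536/17 < 4096, which is why 12 bits always suffice for the remaining 32:

#include <stdint.h>

/* mark the 16 largest entries in a 48-bit mask; those are stored as full 16-bit
   values, the other 32 in 12 bits each: 48 + 16*16 + 32*12 = 688 bits total */
uint64_t top16_mask(const uint16_t v[48]) {
    uint64_t mask = 0;
    for (int k = 0; k < 16; k++) {          /* pick the largest unmarked entry, 16 times */
        int best = -1;
        for (int i = 0; i < 48; i++) {
            if (mask & (1ULL << i)) continue;
            if (best < 0 || v[i] > v[best]) best = i;
        }
        mask |= 1ULL << best;
    }
    return mask;
}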
 

Expert Comment

by:lauwr
ID: 18809608
Argh. OK. So Ozo does have a point ( :) ). I've just run his scheme on a million randomly generated packets, and the average bit count is 587.607.
My 16/32 split requires 688. The field length isn't going to cost 100 bits!

Hm... a re-run, but extending ozo's scheme down to the 0-1023 range as well, gives an average of 581.309 bits.

One further question... the random packets I generated all had a sum near 65535. Does the actual data do this? If it is normalised to sum to exactly 65535, one 16-bit value can be skipped completely.
 
LVL 45

Expert Comment

by:Kdo
ID: 18810034
Hi lauwr,

How did you generate your random numbers?  My guess (purely speculative) is that most of the values will cluster near the average.


Kent
 

Expert Comment

by:lauwr
ID: 18810339
Final version was to generate 48 values, each in the range 1..32768, and sum them (into "s"). Then scale them down to make the sum exactly 65535 by a loop that did:

/* data[48] holds the 48 generated values (1..32768); s is their sum, j the last index */
unsigned long s = 0, t = 65535;
int j;
for (j = 0; j < 48; j++)
    s += data[j];
j = 47;
do {
    unsigned short x = data[j];
    unsigned short y = (unsigned short)((2*t*(unsigned long)x/s + 1)/2);
    data[j] = y;
    s -= x;
    t -= y;
} while (j--);

So, NO clustering. In fact, I suppose this whole thing is flawed. I've generated a flat distribution, then scaled it. The distribution will be flat from 0 to 2*65535/48 approx, with a slight jiggle at the top end where my initial data hasn't quite averaged to max/2.

In effect, I've generated 11.5 bit data - no large values.

By the way the problem was posed, I'm assuming that some sort of normalisation like this was being done. I can't think of any real world scenario that fits the whole thing.

Rats! Really need a better description of what the distribution in the data is supposed to be.

Maybe I'll repeat the test with a Poisson distribution. Maybe I won't. :)
 
LVL 45

Expert Comment

by:Kdo
ID: 18823792
Hi guys,

I've tested a couple of the algorithms and it looks like Ozo's 4-bit example from above is as good as anything that has been suggested.  Here's a side-by-side comparison of my suggestion and Ozo's.


C1 Bitcount = 664 (Length using my method)
C2 Bitcount = 546 (Length using Ozo's method)


Row | Value | code  value  [base]  (my method) | code  value  (Ozo's method)
  0 |   130  |  0   130 [  766]  |  7   130
  1 | 10982  |  3 10982 [  766]  |  13 10982
  2 |  1090  |  3  1090 [10982]  |  10  1090
  3 | 11656  |  3 11656 [ 1090]  |  13 11656
  4 |  7117  |  3  7117 [11656]  |  12  7117
  5 |   315  |  3   315 [ 7117]  |  8   315
  6 |  6415  |  3  6415 [  315]  |  12  6415
  7 |  2077  |  3  2077 [ 6415]  |  11  2077
  8 |  5374  |  3  5374 [ 2077]  |  12  5374
  9 |  3910  |  3  3910 [ 5374]  |  11  3910
 10 |  2207  |  3  2207 [ 3910]  |  11  2207
 11 |     6  |  0     6 [ 2207]  |  2     6
 12 |  1495  |  3  1495 [  766]  |  10  1495
 13 |  2542  |  3  2542 [ 1495]  |  11  2542
 14 |  1360  |  3  1360 [ 2542]  |  10  1360
 15 |   984  |  3   984 [ 1360]  |  9   984
 16 |  1137  |  1   153 [  984]  |  10  1137
 17 |   571  |  3   571 [  984]  |  9   571
 18 |   391  |  3   391 [  571]  |  8   391
 19 |  1127  |  3  1127 [  391]  |  10  1127
 20 |    67  |  0    67 [ 1127]  |  6    67
 21 |   320  |  3   320 [  766]  |  8   320
 22 |   140  |  0   140 [  320]  |  7   140
 23 |   314  |  3   314 [  766]  |  8   314
 24 |   853  |  3   853 [  314]  |  9   853
 25 |   732  |  3   732 [  853]  |  9   732
 26 |   486  |  3   486 [  732]  |  8   486
 27 |    23  |  0    23 [  486]  |  4    23
 28 |   128  |  0   128 [  766]  |  7   128
 29 |   297  |  3   297 [  766]  |  8   297
 30 |     9  |  0     9 [  297]  |  3     9
 31 |    75  |  0    75 [  766]  |  6    75
 32 |   115  |  0   115 [  766]  |  6   115
 33 |   198  |  0   198 [  766]  |  7   198
 34 |   153  |  0   153 [  766]  |  7   153
 35 |   115  |  0   115 [  766]  |  6   115
 36 |   155  |  0   155 [  766]  |  7   155
 37 |     0  |  0     0 [  766]  |  0     0
 38 |    63  |  0    63 [  766]  |  5    63
 39 |     6  |  0     6 [  766]  |  2     6
 40 |    88  |  0    88 [  766]  |  6    88
 41 |    74  |  0    74 [  766]  |  6    74
 42 |    50  |  0    50 [  766]  |  5    50
 43 |    17  |  0    17 [  766]  |  4    17
 44 |     3  |  0     3 [  766]  |  1     3
 45 |     2  |  0     2 [  766]  |  1     2
 46 |    34  |  0    34 [  766]  |  5    34
 47 |    17  |  0    17 [  766]  |  4    17


Of course, this looks better in fixed font.

Kent
 
LVL 28

Expert Comment

by:pepr
ID: 18824579
I do not have a solution, but I suggest starting to think differently about the problem. The solutions suggested above are based on searching for rules, for an algorithm. An algorithm can sometimes be nicely replaced by data structures (duality).

This observation is used in general archiving algorithms, which build a structure of patterns that are encoded more efficiently (by shorter codes). The compression factor is better if the input data is less dense (i.e. less random). The problem is how to build the dictionary of patterns. Usually, a moving window of patterns is built over the data being packed itself.

However, I guess that it may be possible to somehow collect the patterns from a large set of the data and build a static encoding table (i.e. one that is fixed for a given "version" of the compression). That table (or tables) would not be stored inside the message, so even short sequences would pack well.
 

Author Comment

by:Gorizo
ID: 19172280
Sorry about the delay. I had to set aside this project while a colleague looked over the algorithm. The solution might not be applicable to databases (null characters aren't very popular there), but will definitely work in a file based system.

Many thanks to all participants, and congrats to ozo for his mathematical approach.