# Compressing a list of integers

I'm trying to find a way to compress a list of integers (16 bit). The list has a few particularities:

- always 48 numbers in the list
- all integers (none above 65536)
- the sum of all numbers will never be above 65536 (I feel this could be key in making the compressed value take up little room)
- 0 might naturally occur more often as a value, but all should be considered random.

I want to compress these so I can decompress without loss of precision. Is there something better than a list of 96 chars (which is no compression at all)?

Commented:
The average number is less than 1366, which fits in 11 bits.
If you used 4 bits to store the number of bits needed to represent the number in binary,
followed by that many bits of the number, you would never need more than 720 bits to express all 48 numbers.
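A minimal C sketch of this cost calculation (my own illustration, not the commenter's code; `length_prefixed_bits` is a made-up name):

```c
#include <assert.h>

/* Bits needed to write v in binary (bitlen(0) == 0). */
static int bitlen(unsigned v) {
    int n = 0;
    while (v) { n++; v >>= 1; }
    return n;
}

/* Total cost of the scheme sketched above: a 4-bit length n followed by
   the n significant bits of each value.  (A 4-bit field only reaches
   n = 15, so values of 32768 and up would need a special case in a
   real encoder.) */
int length_prefixed_bits(const unsigned short *vals, int count) {
    int total = 0;
    for (int i = 0; i < count; i++)
        total += 4 + bitlen(vals[i]);
    return total;
}
```

With 48 values of 1365 (11 bits each, the worst case allowed by the sum constraint) this comes out to exactly the 720-bit bound mentioned above.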
Commented:
An improvement may be to use 3 bits to say whether the number will need to be stored in 9-16 bits, or, if 0 is expected to be common, in 0 or 10-16 bits.
This should not take more than 688 bits to express all 48 numbers.
Commented:
What is the purpose? Do you have many such chunks? Is the reason to save space (on disk) or to save communication bandwidth (i.e. to make messages as short as possible)? If the chunks form a bigger stream, or if the data is sent as a whole, it is probably better not to use any special-purpose optimisation and to use some general packing/unpacking algorithm instead.

The chunk is quite small and you will always need some extra info. A custom algorithm will be more complex and you can introduce more errors.
DBA Commented:
Hello Gorizo,

Another improvement may be to use a 2 bit flag/modifier with 8 or 16 bits of data.

The 2 bit flag will contain 0, 1, 2, or 3.

0 -- The current data value is the next 8-bit value.
1 -- The current data value is the previous data value plus the next 8-bit value
2 -- The current data value is the previous data value minus the next 8-bit value
3 -- The current data value is the next 16-bit value.

This will allow all values from 0 to 511, and all values within +-511 of the previous value, to be stored in 8 bits.  For every data value after the first item in the list, 1,023 to 1,535 values will store in 8 bits.

The total compression will vary a bit as the compression factor will be data-dependent, but the average total length should be less than 576 bits.

For speed purposes, I would limit the embedding of flags and data.  Two approaches come to mind.

1 -- A byte contains four 2-bit flags that describe the next 4 values.
2 -- All 48 flags precede the data in 96 bits.

I'm kind of partial to method 1.
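A rough cost model for this flag scheme, as a sketch (my code, not Kent's; note it uses the 0..255 range that an 8-bit field can actually hold):

```c
#include <assert.h>

/* Bit cost of the 2-bit-flag scheme described above (illustrative only).
   An 8-bit field holds 0..255, so literals up to 255 and deltas within
   +-255 of the previous value get the short form; everything else
   takes the 16-bit literal form. */
int flag_scheme_bits(const unsigned short *vals, int count) {
    int total = 0;
    long prev = -1;                      /* no previous value yet */
    for (int i = 0; i < count; i++) {
        long v = vals[i];
        if (v <= 255)                                    /* flag 0 */
            total += 2 + 8;
        else if (prev >= 0 && v >= prev - 255 && v <= prev + 255)
            total += 2 + 8;                              /* flag 1 or 2 */
        else
            total += 2 + 16;                             /* flag 3 */
        prev = v;
    }
    return total;
}
```

For example, two consecutive equal values of 10000 cost 18 bits (flag 3 literal) plus 10 bits (flag 1 delta of 0).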

Good Luck,
Kent
Commented:
I think a different set of 4 options may be better.
If the data values are randomly distributed around an average of 65536/48, then most values would require a flag value of 3, with the value in the next 16 bits.
Commented:
If the order of the values does not matter, then we could sort the values and store the differences; the differences should cluster more strongly towards smaller values.
With sorted values, there couldn't be more than 22 difference values that require more than 8 bits.
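Assuming order really can be discarded (which the asker later rules out), the sort-and-difference step might look like this sketch:

```c
#include <stdlib.h>
#include <assert.h>

static int cmp_ushort(const void *a, const void *b) {
    return (int)*(const unsigned short *)a - (int)*(const unsigned short *)b;
}

/* Sort a copy of the values, then replace each element (except the first)
   with its difference from its predecessor.  The differences sum to the
   maximum value, so they are forced towards small numbers. */
void sorted_deltas(const unsigned short *vals, unsigned short *out, int count) {
    for (int i = 0; i < count; i++)
        out[i] = vals[i];
    qsort(out, count, sizeof out[0], cmp_ushort);
    for (int i = count - 1; i > 0; i--)
        out[i] = (unsigned short)(out[i] - out[i - 1]);
}
```

E.g. the values {5, 2, 9} become {2, 3, 4}: the first element and the running differences.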
DBA Commented:

Hi Ozo,

> If the data values are randomly distributed around an average value of 65536/48 then most of the flag values would require a flag value of 3 with the value in the next 16 bits

This misses the key requirement that "the sum of all numbers will never be above 65536".  Every value over the average (1366) increases the likelihood of a subsequent 8-bit value occurring.  The higher the value, the more 8-bit values will be required to maintain the average.  If all the values are within a standard deviation (or so) of the average, nearly all of them will be stored as 8-bit values.

Kent
Author Commented:
Hi all. So far I like a lot of the ideas. I will eventually decide on which is more usable/easy to implement.

To answer the questions:

- Order of values will matter and must remain the same, or be recoverable (sorry, huge omission on my part, I know)
- This is for a database field. I'm toying with the idea of having a compressed field instead of 48 fields numbered "F1" thru "F48", always queried according to other accompanying values in the row. There is no need to query on F1 or the others, just to fetch the result and calculate from there. The table in question is already quite wide, and I expect it to grow quite long too (possibly 4,000 rows per day).
DBA Commented:
Hi Gorizo,

You can probably squeeze a few more bytes out of my proposed solution if you incorporate an "assumed" value.

Every time a fixed value is stored (code 0 or 3), the algorithm should use a value equal to or less than the average value as the basis for a relative value in the next byte.  Here's an example:

Value   code   stored value
1         0         1
800       3       800
900       2       466            -- Note below
850       2        50
3000      3      3000
1366      1         0            -- Note below
1365      2         1

When a code 0 or 3 is stored, 1366 is assumed as the base value for a relative code in the next byte.  In line 3, 900 is stored as -466 (code 2, value 466) because line 2 is a constant (code 3).  Average (1366) - 466 = 900.  Likewise, in line 6, 1366 is stored as +0 (code 1, value 0) because line 5 is a constant (code 3).  Average (1366) + 0 = 1366.

If a value much over 33,000 occurs, the next value cannot possibly be derived as a relative value: two consecutive values near 33,000 would sum past 65,536, which would violate the rules.  Using an assumed value after a constant is put in the buffer brings the (8-bit) relative codes back into play.

Similarly, if a value from 1 to 511 is stored, relative positioning is less likely to occur in the next word, as there is an overlap between the stored and relative values.  If 0 and 14 are to be stored in consecutive words, the 14 could be produced as (0,0) (0,14) or (0,0) (1,14).  In this case there are no values greater than 511 that can be represented as a relative value for the second word.  If 0 and 1366 are to be stored in consecutive words, the assumed value lets them be stored as (0,0), (1,0), saving 8 bits.

This will likely squeeze another byte or two out of the result.  (The original list is only 96 bytes so there aren't really that many to squeeze out.)

It's really not as complicated as this might make it sound.  It'll write very easily in C or a similar language.

Kent
Commented:
without more restrictions on the numbers it is possible that your list will be
1622
1108
1622
1108
1622
1108
1622
1108
1622
1108
1622
1108
1622
1108
1622
1108
1622
1108
1622
1108
1622
1108
1622
1108
1622
1108
1622
1108
1622
1108
1622
1108
1622
1108
1622
1108
1622
1108
1622
1108
1622
1108
1622
1108
1622
1108
1622
1108
which would take 864 bits using the above code

DBA Commented:
Hi ozo,

I acknowledge that there are certainly worst-case scenarios.  But by using the assumed average that I mentioned above, this list compresses to (3,1622), (2,258) [repeating], or 1.5 bytes (plus 2 bits) per entry for 84 total bytes: 12.5% smaller than the original 96.

It's hard to nail down exactly how effective the compression will be since it is quite data-dependent.  A single large value (or a small total sum) could result in the entire list being represented in 1 byte (plus 2 bits) per entry.  Such a list stores in 60 bytes: 37.5% smaller than the original 96.

Intuitively, I want to say that using 1023 for the assumed average is superior to using 1366.  That results in all values from 0 to 1534 being representable as an 8-bit value after any constant.  And since the appearance of a large value in the list reduces the average of the other values, this could result in more relative values being recorded.

Another idea that I've considered is a combination of your 3/4-bit length and this relative-values technique.  Include other functions in the code set: [Repeat previous value], [Repeat previous value N times], [Repeat previous N values M times], [Insert 0 N times], etc.

Still, we're trying to compress only 96 bytes.  It still seems kind of silly to me.   :)
Kent
Commented:
0 -> number is between 0 and 2047, 1 + 11 bits
10 -> number is between 2048 and 4095, 2 + 11 bits, happens at most 31 times
110 -> number is between 4096 and 8191, 3+12 bits, happens at most 15 times
1110 -> number is between 8192 and 16383, 4+13 bits, happens at most 7 times
11110 ->  number is between 16384 and 32767, 5+14 bits, happens at most 3 times
111110 -> number is between 32768 and 65535, 6+15 bits, happens at most once
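The cost side of this prefix code can be tabulated directly; a sketch (my code), where the unary run of 1s terminated by a 0 selects the range and the offset into that range follows in fixed bits:

```c
#include <assert.h>

/* Encoded length in bits of one value under the prefix code above. */
int prefix_code_bits(unsigned v) {
    if (v < 2048)  return 1 + 11;   /* "0"      + 11 bits */
    if (v < 4096)  return 2 + 11;   /* "10"     + 11 bits */
    if (v < 8192)  return 3 + 12;   /* "110"    + 12 bits */
    if (v < 16384) return 4 + 13;   /* "1110"   + 13 bits */
    if (v < 32768) return 5 + 14;   /* "11110"  + 14 bits */
    return 6 + 15;                  /* "111110" + 15 bits */
}
```

The common 0-2047 range costs 12 bits, and the rarest range costs 21.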
Commented:
258 does not fit in 8 bits
Author Commented:
ozo, your last solution looks good. I really like its mathematical structure.

I've devised something from that.

Suppose that we reserve 4 bits to a value, n. Thus, n will range between 0 and 15.
Then, the next n bits are the remainder of the value to be added to 2^n.

Thus:

0000 0 -> 0 (special case)
0000 1 -> 1 (special case)
0001 0 -> 2
0001 1 -> 3
0010 00 -> 4 (4+2 bits)
0010 01 -> 5
0010 10 -> 6
0111 1111111 -> 255 (4+7 bits)
1000 00000000 -> 256 (4+8 bits)
1001 000000000 -> 512 (4+9 bits)
1010 0000000000 -> 1024 (4+10 bits)
1011 00000000000 -> 2048 (4+11 bits)
1100 000000000000 -> 4096 (4+12 bits)
1111 111111111111111 -> 65535 (4+15 bits)
etc.

The formula would be 2^n + (n bits).

This makes lower values take up fewer bits, mid-range values take up more, and high values about the same.
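A runnable sketch of this scheme (my code, handling 0..65535; for clarity, bits are stored one per byte rather than packed):

```c
#include <assert.h>

/* The scheme above: 4 bits hold n, then n bits hold v - 2^n, except that
   n = 0 is followed by a single bit choosing between 0 and 1. */

static int floor_log2(unsigned v) {      /* position of the top set bit */
    int n = -1;
    while (v) { n++; v >>= 1; }
    return n;
}

/* Append one value; each element of bits[] holds a single 0 or 1.
   Returns the new bit position. */
int put_value(unsigned char *bits, int pos, unsigned v) {
    int n = (v < 2) ? 0 : floor_log2(v);
    unsigned r = (v < 2) ? v : v - (1u << n);
    int payload = (n == 0) ? 1 : n;      /* the 0/1 special case */
    for (int i = 3; i >= 0; i--)
        bits[pos++] = (unsigned char)((n >> i) & 1);
    for (int i = payload - 1; i >= 0; i--)
        bits[pos++] = (unsigned char)((r >> i) & 1);
    return pos;
}

/* Read one value at *pos, advancing *pos past it. */
unsigned get_value(const unsigned char *bits, int *pos) {
    int n = 0;
    for (int i = 0; i < 4; i++) n = (n << 1) | bits[(*pos)++];
    int payload = (n == 0) ? 1 : n;
    unsigned r = 0;
    for (int i = 0; i < payload; i++) r = (r << 1) | bits[(*pos)++];
    return (n == 0) ? r : (1u << n) + r;
}
```

Encoding 4 produces "0010 00" and 65535 produces "1111" plus fifteen 1s, matching the table above.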
Commented:
generally, you want to allocate more bits to rarer events so that you can use fewer bits on more common events.
The constraints on 48 non-negative integers that sum to at most 65536 force some values to be relatively rare.
(and I think I miscounted: if the sum can be <= 65536 then a number between 32768 and 65535 can occur at most 2 times, a number between 16384 and 32767 can occur at most 4 times, etc.  Also, if one of the numbers can itself be 65536 ("not going over" rather than strictly below 65536), then we need to be able to represent one more value.
perhaps we could add something like
0 -> number is between 0 and 2047, 1 + 11 bits
10 -> number is between 2048 and 4095, 2 + 11 bits, happens at most 32 times
110 -> number is between 4096 and 8191, 3+12 bits, happens at most 16 times
1110 -> number is between 8192 and 16383, 4+13 bits, happens at most 8 times
11110 ->  number is between 16384 and 32767, 5+14 bits, happens at most 4 times
11111 -> number is between 32768 and 65535, 5+15 bits, happens at most 2 times (but the 0 after the 11111 was not necessary)
11111 000000000000000 -> 32768
...
11111 111111111111110 -> 65534 at most 1 time 5+15 bits
11111 111111111111111 0 -> 65535 at most 1 time 5+15+1 bits
11111 111111111111111 1 -> 65536 at most 1 time 5+15+1 bits
)
so we would actually do better to use more bits for larger values, if that allowed us to use fewer bits for the more common values.
But it would be hard to get below 1+11 bits for the potentially most common range unless we go to representations that allow fractional bits per number (i.e. arithmetic coding).
Which is doable, but may not be worth it.


Commented:
> Suppose that we reserve 4 bits to a value, n. Thus, n will range between 0 and 15.
that's close to my initial idea, but since 0-2047 could be the most common case, it's more important to minimize the number of bits used there.
Commented:
Just a thought, but I can offer a solution that guarantees compression into exactly 539 bits.
Remove the maximum value. Store it (16 bits) and where it was (6 bits).
Encode the remaining 47 values, each of which is now guaranteed to be less than 1366, in 11 bits each. (47 * 11).
Naturally, you could improve on that last bit. E.g. In threes, such values could fit in 31 bits, rather than 33, so you could squeeze another improvement there, to 528 bits, if you were picky.
DBA Commented:
Hi ozo,

>> 258 does not fit in 8 bits

Sorry.  That's what happens when you quit designing for a living and start using application tools (SQL).  I was putting 511 into 8 bits instead of 255.

Still, this assumed average has possibilities.  I'd like to get some of Gorizo's real data to prove its worth.

Regarding your compression, if the data really is random then the likelihood of storing multiple values of 0 and 1 is pretty small.  It might do well to store them in two bits (as if they were 2/3) and reserve 0 and 1 for special effects.

A 0 for a special effect could mean "repeat previous" so that a repeated value takes only 5 bits, not 5 bits + length, etc.

Hi lauwr,

I don't think that will work.  Removing the highest value only guarantees that the AVERAGE of the remaining 47 values is less than 1366 -- and then only if the highest value is greater than 1366.

Kent
Commented:
Here's a variant on ozo's solution, using a Huffman code.  This involves a preliminary phase where you do some analysis on your data and use a histogram to construct an efficient code table.

struct codepoint {
    int base;
    int followingbits;
};

Then you construct a list of codepoints that look like they will reasonably efficiently cover your data.  Frequencies aren't known yet:

struct codepoint codes[] = {
    { 0, 0 },    // for zero
    { 1, 4 },    // covers numbers 1 to 1+2^4-1 = 16
    { 17, 6 },   // covers 17 to 17+2^6-1 = 80
    { 81, 10 },  // covers 81 to 81+2^10-1 = 1104
    // ... and so on; make sure they eventually cover all values up to 65535
};

Then you use codes[] and a large sample of data to estimate the frequencies for each of these codepoints.  From these frequencies you can construct a huffman code, then use this huffman code to encode and decode the values.

Each encoded value will be a huffman code followed by the appropriate number of following bits.
Commented:
Sorry. You're quite right. There could be just two non-zero values, both 32765. However, I feel sure I can modify my idea only a little, to cover that...

Remove the highest 16 elements. Store these as they come. Record a bit mask of 48 bits to show which they were. Store the remaining 32 in 12 bits each.
Net saving is 32*4-48 = 80 bits.

The point I'd like to make is that, if the 48 16-bit values are totally random, other than the sum being less than 65536, you must directly use that fact in any compression you devise.

None of the earlier techniques above directly do so. They are doomed to, in general, increase the average total bits required to store the data. If there were more than 48 values, I'd be less sure of this.

I can't get too excited over my method above (I've saved 10 out of 96 bytes), but it's not just 10 bytes, of course. By being a fixed length field, it has saved anywhere between one and eight more bytes, by not needing a field length.
Commented:
Argh. OK. So Ozo does have a point ( :) ). I've just run his scheme on a million randomly generated packets, and the average bit count is 587.607.
My 16/32 split requires 688. The field length isn't going to cost 100 bits!

Hm... a re-run, but extending ozo's scheme down to the 0-1023 range as well, gives an average of 581.309 bits.

One further question... the random packets I generated all had a sum near 65535. Does the actual data do this? If it is normalised to sum to exactly 65535, one 16-bit value can be skipped completely.
DBA Commented:
Hi lauwr,

How did you generate your random numbers?  My guess (purely speculative) is that most of the values will cluster near the average.

Kent
Commented:
Final version was to generate 48 values, each in the range 1..32768, and sum them (into "s"). Then scale them down to make the sum exactly 65535 with a loop (j starts at the last index, 47):

t = 65535;
j = 47;
do {
    unsigned short x = data[j];
    unsigned short y = (2*t*(unsigned long)x/s + 1)/2;
    data[j] = y;
    s -= x;
    t -= y;
} while (j--);

So, NO clustering. In fact, I suppose this whole thing is flawed. I've generated a flat distribution, then scaled it. The distribution will be flat from 0 to 2*65535/48 approx, with a slight jiggle at the top end where my initial data hasn't quite averaged to max/2.

In effect, I've generated 11.5 bit data - no large values.

By the way the problem was posed, I'm assuming that some sort of normalisation like this was being done. I can't think of any real world scenario that fits the whole thing.

Rats! Really need a better description of what the distribution in the data is supposed to be.

Maybe I'll repeat the test with a Poisson distribution. Maybe I won't. :)
DBA Commented:
Hi guys,

I've tested a couple of the algorithms and it looks like Ozo's 4-bit example from above is as good as anything that has been suggested.  Here's a side-by-side comparison of my suggestion and Ozo's.

C1 Bitcount = 664 (Length using my method)
C2 Bitcount = 546 (Length using Ozo's method)

Row | Value | C1: code value [base] | C2: n value
0 |   130  |  0   130 [  766]  |  7   130
1 | 10982  |  3 10982 [  766]  |  13 10982
2 |  1090  |  3  1090 [10982]  |  10  1090
3 | 11656  |  3 11656 [ 1090]  |  13 11656
4 |  7117  |  3  7117 [11656]  |  12  7117
5 |   315  |  3   315 [ 7117]  |  8   315
6 |  6415  |  3  6415 [  315]  |  12  6415
7 |  2077  |  3  2077 [ 6415]  |  11  2077
8 |  5374  |  3  5374 [ 2077]  |  12  5374
9 |  3910  |  3  3910 [ 5374]  |  11  3910
10 |  2207  |  3  2207 [ 3910]  |  11  2207
11 |     6  |  0     6 [ 2207]  |  2     6
12 |  1495  |  3  1495 [  766]  |  10  1495
13 |  2542  |  3  2542 [ 1495]  |  11  2542
14 |  1360  |  3  1360 [ 2542]  |  10  1360
15 |   984  |  3   984 [ 1360]  |  9   984
16 |  1137  |  1   153 [  984]  |  10  1137
17 |   571  |  3   571 [  984]  |  9   571
18 |   391  |  3   391 [  571]  |  8   391
19 |  1127  |  3  1127 [  391]  |  10  1127
20 |    67  |  0    67 [ 1127]  |  6    67
21 |   320  |  3   320 [  766]  |  8   320
22 |   140  |  0   140 [  320]  |  7   140
23 |   314  |  3   314 [  766]  |  8   314
24 |   853  |  3   853 [  314]  |  9   853
25 |   732  |  3   732 [  853]  |  9   732
26 |   486  |  3   486 [  732]  |  8   486
27 |    23  |  0    23 [  486]  |  4    23
28 |   128  |  0   128 [  766]  |  7   128
29 |   297  |  3   297 [  766]  |  8   297
30 |     9  |  0     9 [  297]  |  3     9
31 |    75  |  0    75 [  766]  |  6    75
32 |   115  |  0   115 [  766]  |  6   115
33 |   198  |  0   198 [  766]  |  7   198
34 |   153  |  0   153 [  766]  |  7   153
35 |   115  |  0   115 [  766]  |  6   115
36 |   155  |  0   155 [  766]  |  7   155
37 |     0  |  0     0 [  766]  |  0     0
38 |    63  |  0    63 [  766]  |  5    63
39 |     6  |  0     6 [  766]  |  2     6
40 |    88  |  0    88 [  766]  |  6    88
41 |    74  |  0    74 [  766]  |  6    74
42 |    50  |  0    50 [  766]  |  5    50
43 |    17  |  0    17 [  766]  |  4    17
44 |     3  |  0     3 [  766]  |  1     3
45 |     2  |  0     2 [  766]  |  1     2
46 |    34  |  0    34 [  766]  |  5    34
47 |    17  |  0    17 [  766]  |  4    17

Of course, this looks better in fixed font.

Kent
Commented:
I do not have a solution, but I suggest starting to think differently about the problem. The solutions suggested above are based on searching for rules, for an algorithm. An algorithm can sometimes be nicely replaced by data structures (duality).

That observation is used in general archiving algorithms, which build a structure of patterns that are encoded more efficiently (by shorter codes). The compression factor is better if the input data is less dense (i.e. less random). The problem is how to build the dictionary of patterns. Usually, a moving window of patterns is built from the packed data itself.

However, I guess it may be possible to collect the patterns from a large set of the data and build a static encoding table (i.e. one that is fixed for some "version" of the compression). That table would not be stored inside the message, making even short sequences pack well.
Author Commented:
Sorry about the delay. I had to set aside this project while a colleague looked over the algorithm. The solution might not be applicable to databases (null characters aren't very popular there), but it will definitely work in a file-based system.

Many thanks to all participants, and congrats to ozo for his mathematical approach.