
I'm trying to find a way to compress a list of integers (16 bit). The list has a few particularities:

- always 48 numbers in the list

- all integers (none above 65536)

- the sum of all numbers will never be above 65536 (I feel this could be key in making the compressed value take up little room)

- 0 might naturally occur more often as a value, but all should be considered random.

I want to compress these so I can decompress without loss of precision. Is there something better than a list of 96 chars (which is no compression at all)?



This should not take more than 688 bits to express all 48 numbers.

The chunk is quite small, and you will always need some extra info. A more complex algorithm also gives you more opportunities to introduce errors.

Another improvement may be to use a 2 bit flag/modifier with 8 or 16 bits of data.

The 2 bit flag will contain 0, 1, 2, or 3.

0 -- The current data value is the next 8-bit value.

1 -- The current data value is the previous data value plus the next 8-bit value

2 -- The current data value is the previous data value minus the next 8-bit value

3 -- The current data value is the next 16-bit value.

This will allow all values of 0 to 511 and all values within +-511 of the previous value to be stored in 8 bits. For every data value after the first item in the list, 1,023 to 1,534 values will store in 8 bits.

The total compression will vary a bit as the compression factor will be data-dependent, but the average total length should be less than 576 bits.

For speed purposes, I would limit the embedding of flags and data. Two approaches come to mind.

1 -- A byte contains 4 2-bit flags that describe the next 4 values.

2 -- All 48 flags precede the data in 96 bits.

I'm kind of partial to method 1.
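A minimal sketch of the flag selection step described above; `choose_flag` is a hypothetical helper name, and the 8-bit payloads here hold 0-255 (the "0 to 511" figure above is corrected later in the thread):

```c
/* Pick the 2-bit flag and payload for value v, given the previous value.
   Flags 0-2 are followed by an 8-bit payload; flag 3 by a 16-bit payload. */
int choose_flag(unsigned v, unsigned prev, unsigned *payload)
{
    if (v < 256)                     { *payload = v;        return 0; } /* absolute, 8 bits  */
    if (v >= prev && v - prev < 256) { *payload = v - prev; return 1; } /* previous + delta  */
    if (prev >= v && prev - v < 256) { *payload = prev - v; return 2; } /* previous - delta  */
    *payload = v;                      return 3;                        /* absolute, 16 bits */
}
```

Per value the cost is then 2 + 8 bits for flags 0 through 2 and 2 + 16 bits for flag 3.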

Good Luck,

Kent

If the data values are randomly distributed around an average value of 65536/48 then most of the flag values would require a flag value of 3 with the value in the next 16 bits

with sorted values, there couldn't be more than 22 difference values that require more than 8 bits.

Hi Ozo,

> If the data values are randomly distributed around an average value of 65536/48 then most of the flag values would require a flag value of 3 with the value in the next 16 bits

This misses the key requirement that "the sum of all numbers will never be above 65536". Every value over the average (1366) increases the likelihood of a subsequent 8-bit value occurring. The higher the value, the more 8-bit values will be required to maintain the average. If all the values are within a standard deviation (or so) of average, something close to all the values will be stored as 8-bit values.

Kent

To answer the questions:

- Order of values matters and must be preserved, or at least recoverable (sorry, huge omission on my part, I know)

- This is for a database field. I'm toying with the idea of having a compressed field instead of 48 fields numbered "F1" thru "F48", always queried according to other accompanying values on the row. There is no need to query according to F1 or the others, just to get the result and calculate from there. The table in question is already quite wide, and I expect it to grow quite long too (possibly 4000 rows per day).

You can probably squeeze a few more bytes out of my proposed solution if you incorporate an "assumed" value.

Every time a fixed value is stored (key 0 or 3) the algorithm should use a value equal to or less than the average value as the basis for a relative value in the next byte. Here's an example:

Value | Code | Stored value
    1 |   0  |    1
  800 |   3  |  800
  900 |   1  |  466   -- note below
  850 |   1  |   50
 3000 |   3  | 3000
 1366 |   1  |    0   -- note below
 1365 |   1  |    1

When a code 0 or 3 is stored, 1366 is assumed as the base value for a relative code in the next byte. In line 3, 900 is stored as -466 (code 1, value 466) because line 2 is a constant (code 3). Average (1366) - 466 = 900. Likewise, in line 6, 1366 is stored as +0 (code 1, value 0) because line 5 is a constant (code 3). Average (1366) + 0 = 1366.

If a value > 33,000 occurs, the next value cannot possibly be derived as a relative value: 33,000 + 33,000 (± 512) is greater than 65,536, which would violate the rules. Using an assumed value after a constant is put in the buffer brings the (8-bit) relative codes back into play.

Similarly, if a value from 1 -> 511 is stored, relative positioning is less likely to occur in the next word as there is an overlap between the stored and relative values. If 0 and 14 are to be stored in consecutive words, the 14 could be produced as (0,0) (0,14) or (0,0) (1,14). In this case there are no values greater than 511 that can be represented as a relative value for the second word. If 0 and 1366 are to be stored in consecutive words, the assumed value lets them be stored as (0,0), (1,0), saving 8 bits.

This will likely squeeze another byte or two out of the result. (The original list is only 96 bytes so there aren't really that many to squeeze out.)

It's really not as complicated as this might make it sound. It'll write very easily in C or a similar language.
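The assumed-base rule really is tiny in code. A sketch, with `next_base` as a hypothetical helper name:

```c
#define ASSUMED_BASE 1366  /* the average value, 65536 / 48 */

/* Base for coding the NEXT value as a relative (flag 1 or 2) delta:
   after an absolute code (flag 0 or 3) the average is assumed,
   otherwise the value just stored is the base. */
unsigned next_base(int flag, unsigned stored_value)
{
    return (flag == 0 || flag == 3) ? ASSUMED_BASE : stored_value;
}
```

In the example table, 800 is stored with code 3, so the following 900 is coded relative to 1366, not 800.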

Kent

1622 1108 1622 1108 1622 1108 1622 1108 1622 1108 1622 1108
1622 1108 1622 1108 1622 1108 1622 1108 1622 1108 1622 1108
1622 1108 1622 1108 1622 1108 1622 1108 1622 1108 1622 1108
1622 1108 1622 1108 1622 1108 1622 1108 1622 1108 1622 1108

(the pair 1622, 1108 repeated 24 times; 48 values, sum 65520)

which would take 864 bits using the above code

I acknowledge that there are certainly worst-case scenarios. But by using the assumed-average that I mentioned above, this list compresses to (3,1622), (2,258) [repeating]. Or 1.5 bytes (plus 2 bits) per entry for 84 total bytes. 12.5% smaller than the original 96.

It's hard to nail down exactly how effective the compression will be since it is quite data dependent. A single large value (or a small total sum) could result in the entire list being represented in 1 byte (plus 2 bits) per entry. Such a list stores in 60 bytes. 38.5% smaller than the original 96.

Intuitively, I want to say that using 1023 for the assumed average is superior to using 1366. That results in all values from 0 to 1534 representable as an 8-bit value after any constant. And since the appearance of a large value in the list reduces the average of the other values, this could result in more relative values being recorded.

Another idea that I've considered is a combination of your 3/4 bit length and this relative values technique. Include other functions into the code set. [Repeat previous value], [Repeat previous value N times], [Repeat previous N values M times], [Insert 0 N times], etc.

Still, we're trying to compress only 96 bytes. It still seems kind of silly to me. :)

Kent

0 -> number is between 0 and 2047, 1 + 11 bits

10 -> number is between 2048 and 4095, 2 + 11 bits, happens at most 31 times

110 -> number is between 4096 and 8191, 3+12 bits, happens at most 15 times

1110 -> number is between 8192 and 16383, 4+13 bits, happens at most 7 times

11110 -> number is between 16384 and 32767, 5+14 bits, happens at most 3 times

111110 -> number is between 32768 and 65535, 6+15 bits, happens at most once

I've devised something from that.

Suppose that we reserve 4 bits for a value, n. Thus, n will range between 0 and 15.

Then, the next n bits are the remainder of the value to be added to 2^n.

Thus:

0000 0 -> 0 (special case)

0000 1 -> 1 (special case)

0001 0 -> 2

0001 1 -> 3

0010 00 -> 4 (4+2 bits)

0010 01 -> 5

0010 10 -> 6

0111 1111111 -> 255 (4+7 bits)

1000 00000000 -> 256 (4+8 bits)

1001 000000000 -> 512 (4+9 bits)

1010 0000000000 -> 1024 (4+10 bits)

1011 00000000000 -> 2048 (4+11 bits)

1100 000000000000 -> 4096 (4+12 bits)

1111 111111111111111 -> 65535 (4+15 bits)

etc.

The formula would be 2^n + (n bits).

This makes lower values take up fewer bits, mid-range values take up more, and high values about the same.

The constraints on 48 non-negative integers that sum to at most 65536 force some values to be relatively rare.
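A sketch of that encoding, splitting a value into the 4-bit field n and the n following bits (`split` and `join` are hypothetical names; the n = 0 special case carries one literal bit for 0 and 1, as in the table above):

```c
/* Split v into the 4-bit exponent n and the remainder sent in the next n bits. */
void split(unsigned v, unsigned *n, unsigned *rem)
{
    if (v <= 1) {                 /* special case: n = 0 is followed by 1 bit */
        *n = 0;
        *rem = v;
        return;
    }
    unsigned e = 0, p = 1;
    while (p * 2 <= v) {          /* find the largest power of two p = 2^e <= v */
        p *= 2;
        e++;
    }
    *n = e;
    *rem = v - p;                 /* fits in e bits, since v < 2^(e+1) */
}

/* Inverse: rebuild the value from n and the remainder. */
unsigned join(unsigned n, unsigned rem)
{
    return n == 0 ? rem : (1u << n) + rem;
}
```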

(and I think I miscounted: if the sum can be <= 65536, then a number between 32768 and 65535 can occur at most 2 times, a number between 16384 and 32767 can occur at most 4 times, etc. Also, if one of the numbers can itself be 65536 ("none above 65536" doesn't rule that out), then we need to be able to represent one more value.

perhaps we could add something like

0 -> number is between 0 and 2047, 1 + 11 bits

10 -> number is between 2048 and 4095, 2 + 11 bits, happens at most 32 times

110 -> number is between 4096 and 8191, 3+12 bits, happens at most 16 times

1110 -> number is between 8192 and 16383, 4+13 bits, happens at most 8 times

11110 -> number is between 16384 and 32767, 5+14 bits, happens at most 4 times

11111 -> number is between 32768 and 65535, 5+15 bits, happens at most 2 times (but the 0 after the 11111 was not necessary)

11111 000000000000000 -> 32768

...

11111 111111111111110 -> 65534 at most 1 time 5+15 bits

11111 111111111111111 0 -> 65535 at most 1 time 5+15+1 bits

11111 111111111111111 1 -> 65536 at most 1 time 5+15+1 bits

)

so we would actually do better to use more bits for larger values, if that allowed us to use fewer bits for the more common values.

But it would be hard to get below 1+11 bits for the potentially most common range unless we go to representations that allow fractional bits per number.

Which is doable, but may not be worth it.

That's close to my initial idea, but since 0-2047 could be the most common case, it's more important to minimize the number of bits used there.

Remove the maximum value. Store it (16 bits) and where it was (6 bits).

Encode the remaining 47 values, each of which is now guaranteed to be less than 1366, in 11 bits each. (47 * 11).

Naturally, you could improve on that last bit. E.g. In threes, such values could fit in 31 bits, rather than 33, so you could squeeze another improvement there, to 528 bits, if you were picky.

>> 258 does not fit in 8 bits

Sorry. That's what happens when you quit designing for a living and start using application tools (SQL). I was putting 511 into 8 bits, not 255.

Still, this assumed average has possibilities. I'd like to get some of Gorizo's real data to prove its worth.

Regarding your compression, if the data really is random then the likelihood of storing multiple values of 0 and 1 is pretty small. It might do well to store them in two bits (as if they were 2/3) and reserve 0 and 1 for special effects.

A 0 for a special effect could mean "repeat previous" so that a repeated value takes only 5 bits, not 5 bits + length, etc.

Hi aluwr,

I don't think that will work. Removing the highest value guarantees that the AVERAGE of the remaining 47 values is less than 1366 -- and even then only if the highest value is greater than 1366.

Kent

struct codepoint {
    int base;
    int followingbits;
};

Then you construct a list of codepoints that look like they will reasonably efficiently cover your data. Frequencies aren't known yet:

struct codepoint codes[] = {
    { 0, 0 },    // for zero
    { 1, 4 },    // covers numbers 1 to 1+2^4-1 = 16
    { 17, 6 },   // covers 17 to 17+2^6-1 = 80
    { 81, 10 },  // covers 81 to 81+2^10-1 = 1104
    // ... and so on. Make sure they eventually cover all values up to 65535
};

Then you use codes[] and a large sample of data to estimate the frequencies for each of these codepoints. From these frequencies you can construct a huffman code, then use this huffman code to encode and decode the values.

Each encoded value will be a huffman code followed by the appropriate number of following bits.
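A sketch of the value-to-codepoint lookup that encoding needs; `codepoint_for` is a hypothetical name, and the last table entry here is an assumption added so every value up to 65535 is covered:

```c
struct codepoint { int base; int followingbits; };

struct codepoint codes[] = {
    { 0,    0 },   /* zero */
    { 1,    4 },   /* 1 .. 16 */
    { 17,   6 },   /* 17 .. 80 */
    { 81,   10 },  /* 81 .. 1104 */
    { 1105, 16 },  /* assumed tail entry covering everything up to 65535 */
};

/* Find the codepoint for v; *rem is the payload emitted after the huffman code. */
int codepoint_for(unsigned v, unsigned *rem)
{
    int i = 4;
    while (i > 0 && v < (unsigned)codes[i].base)   /* scan down to the matching range */
        i--;
    *rem = v - codes[i].base;                      /* fits in codes[i].followingbits bits */
    return i;
}
```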

Remove the highest 16 elements. Store these as they come. Record a bit mask of 48 bits to show which they were. Store the remaining 32 in 12 bits each.

Net saving is 32*4-48 = 80 bits.
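The arithmetic behind that saving, as a sketch (`packed_bits` is a hypothetical name):

```c
/* Bit budget for the "remove the highest 16" layout described above. */
int packed_bits(void)
{
    int high16 = 16 * 16;   /* the 16 largest values, stored verbatim      */
    int mask   = 48;        /* one bit per position marking where they sat */
    int low32  = 32 * 12;   /* each remaining value fits in 12 bits        */
    return high16 + mask + low32;
}
```

That is 688 bits against the plain 48 * 16 = 768, hence the 80-bit net saving quoted.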

The point I'd like to make is that, if the 48 16-bit values are totally random, other than the sum being less than 65536, you must directly use that fact in any compression you devise.

None of the earlier techniques above directly do so. They are doomed to, in general, increase the average total bits required to store the data. If there were more than 48 values, I'd be less sure of this.

I can't get too excited over my method above (I've saved 10 out of 96 bytes), but it's not just 10 bytes, of course. By being a fixed length field, it has saved anywhere between one and eight more bytes, by not needing a field length.

My 16/32 split requires 688. The field length isn't going to cost 100 bits!

Hm... a re-run, but extending ozo's scheme down to the 0-1023 range as well, gives an average of 581.309 bits.

One further question... the random packets I generated all had a sum of near 65535. Does the actual data do this? If it is normalised to sum to exactly 65535, one 16-bit value can be skipped completely.

How did you generate your random numbers? My guess (purely speculative) is that most of the values will cluster near the average.

Kent

/* rescale data[0..47] so that it sums to exactly 65535;
   s starts as the current sum of the raw data */
unsigned long s = 0, t = 65535;
int j;

for (j = 0; j < 48; j++)
    s += data[j];

j = 47;
do {
    unsigned short x = data[j];
    unsigned short y = (2*t*(unsigned long)x/s + 1)/2;   /* y = round(t * x / s) */
    data[j] = y;
    s -= x;   /* raw sum still to be processed   */
    t -= y;   /* target sum still to be allotted */
} while (j--);

So, NO clustering. In fact, I suppose this whole thing is flawed. I've generated a flat distribution, then scaled it. The distribution will be flat from 0 to 2*65535/48 approx, with a slight jiggle at the top end where my initial data hasn't quite averaged to max/2.

In effect, I've generated 11.5 bit data - no large values.

By the way the problem was posed, I'm assuming that some sort of normalisation like this was being done. I can't think of any real world scenario that fits the whole thing.

Rats! Really need a better description of what the distribution in the data is supposed to be.

Maybe I'll repeat the test with a Poisson distribution. Maybe I won't. :)

I've tested a couple of the algorithms and it looks like Ozo's 4-bit example from above is as good as anything that has been suggested. Here's a side-by-side comparison of my suggestion and Ozo's.

C1 Bitcount = 664 (Length using my method)

C2 Bitcount = 546 (Length using Ozo's method)

Row | Value | C1: code, value, [base] | C2: code, value

0 | 130 | 0 130 [ 766] | 7 130

1 | 10982 | 3 10982 [ 766] | 13 10982

2 | 1090 | 3 1090 [10982] | 10 1090

3 | 11656 | 3 11656 [ 1090] | 13 11656

4 | 7117 | 3 7117 [11656] | 12 7117

5 | 315 | 3 315 [ 7117] | 8 315

6 | 6415 | 3 6415 [ 315] | 12 6415

7 | 2077 | 3 2077 [ 6415] | 11 2077

8 | 5374 | 3 5374 [ 2077] | 12 5374

9 | 3910 | 3 3910 [ 5374] | 11 3910

10 | 2207 | 3 2207 [ 3910] | 11 2207

11 | 6 | 0 6 [ 2207] | 2 6

12 | 1495 | 3 1495 [ 766] | 10 1495

13 | 2542 | 3 2542 [ 1495] | 11 2542

14 | 1360 | 3 1360 [ 2542] | 10 1360

15 | 984 | 3 984 [ 1360] | 9 984

16 | 1137 | 1 153 [ 984] | 10 1137

17 | 571 | 3 571 [ 984] | 9 571

18 | 391 | 3 391 [ 571] | 8 391

19 | 1127 | 3 1127 [ 391] | 10 1127

20 | 67 | 0 67 [ 1127] | 6 67

21 | 320 | 3 320 [ 766] | 8 320

22 | 140 | 0 140 [ 320] | 7 140

23 | 314 | 3 314 [ 766] | 8 314

24 | 853 | 3 853 [ 314] | 9 853

25 | 732 | 3 732 [ 853] | 9 732

26 | 486 | 3 486 [ 732] | 8 486

27 | 23 | 0 23 [ 486] | 4 23

28 | 128 | 0 128 [ 766] | 7 128

29 | 297 | 3 297 [ 766] | 8 297

30 | 9 | 0 9 [ 297] | 3 9

31 | 75 | 0 75 [ 766] | 6 75

32 | 115 | 0 115 [ 766] | 6 115

33 | 198 | 0 198 [ 766] | 7 198

34 | 153 | 0 153 [ 766] | 7 153

35 | 115 | 0 115 [ 766] | 6 115

36 | 155 | 0 155 [ 766] | 7 155

37 | 0 | 0 0 [ 766] | 0 0

38 | 63 | 0 63 [ 766] | 5 63

39 | 6 | 0 6 [ 766] | 2 6

40 | 88 | 0 88 [ 766] | 6 88

41 | 74 | 0 74 [ 766] | 6 74

42 | 50 | 0 50 [ 766] | 5 50

43 | 17 | 0 17 [ 766] | 4 17

44 | 3 | 0 3 [ 766] | 1 3

45 | 2 | 0 2 [ 766] | 1 2

46 | 34 | 0 34 [ 766] | 5 34

47 | 17 | 0 17 [ 766] | 4 17

Of course, this looks better in fixed font.

Kent

The observation is used in general archiving algorithms, which build a structure of patterns that are encoded more efficiently (by shorter codes). The compression factor is better if the input data is less dense (i.e. if it is less random). The problem is how to build the dictionary of patterns. Usually, a moving window of patterns is built from the packed data itself.

However, I guess that it may be possible to collect the patterns from a large set of the data and build a static encoding table (i.e. one that is fixed for some "version" of the compression). Such tables would not be stored inside the message, making even short sequences pack well.

Many thanks to all participants, and congrats to ozo for his mathematical approach.


If you used 4 bits to store the number of bits needed to store the number in binary, followed by that many bits of the number, you would never need more than 720 bits to express all 48 numbers.
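A sketch of the per-value cost under that scheme (`cost` is a hypothetical name); the 720-bit figure corresponds to 48 values of at most 11 bits each, i.e. values up to 2047, which is what the sum constraint makes typical:

```c
/* Bits to store one value: a 4-bit length field, then that many bits. */
int cost(unsigned v)
{
    int n = 0;
    while (v) {        /* n = bit length of v (0 for v == 0) */
        n++;
        v >>= 1;
    }
    return 4 + n;
}
```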