Lossless compression from n to (n-1) bytes

Posted on 1998-06-26
Last Modified: 2010-04-16
I need a lossless compression algorithm/program that turns any n-byte sequence into an (n-1)-byte sequence. For example:
  5-bytes sequence represented by: ABCDE,
  compressed to 4-bytes sequence, represented by: KLMN.
  KLMN must be able to be uncompressed back to ABCDE.

Or to compress a n-bits sequence to be a (n-1) bits sequence, for example 01010101 compressed to 1101101.

It doesn't need to be exactly (n-1) bits/bytes, it can be (n-2), (n-3) or less, the less the better.
For n, it can be any number (e.g. 100 bytes compressed to 99 bytes).

And every possible byte/bit sequence must be compressible with this method, in any combination. This is the limitation I know of in the most-used compression techniques such as RLE, LZW and Huffman.

I've tried to use a XOR and several boolean logic method to do it:
For example, I have 4 bytes: A,B,C,D, and I compressed it to 3 bytes:
  A xor B = AB
  A xor C = AC
  A xor D = AD
To uncompress it:
 AB xor AC xor AD = ABCD
 ABCD xor AB = CD
But then I keep ending up with a cyclic XOR operation, which cannot recover the uncompressed bytes. I've also tried other combinations, but they didn't give any result.
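The cyclic dead end is not a missing trick; the scheme genuinely loses information. A quick sketch (Python, purely illustrative, not from the thread) shows two different 4-byte inputs producing the same 3-byte output, so no decompressor can tell them apart:

```python
# The question's scheme: map 4 bytes (A, B, C, D) to 3 bytes
# (A xor B, A xor C, A xor D).
def compress_xor(a, b, c, d):
    return (a ^ b, a ^ c, a ^ d)

# Two different inputs...
x = compress_xor(0x00, 0x11, 0x22, 0x33)
y = compress_xor(0xFF, 0xEE, 0xDD, 0xCC)

# ...collide on the same 3-byte output: lossless recovery is impossible.
assert x == y == (0x11, 0x22, 0x33)
```

Any A'=A^k, B'=B^k, C'=C^k, D'=D^k (here k=0xFF) compresses identically, so each output actually stands for 256 different inputs.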

I know this isn't a general algorithms topic area (I can't find one), but since Pascal is widely used to teach algorithms, I think I can get good answers here. For a start I offered 30 points, but for a solution (preferably a good one) I will raise it to 300 or more points.
Question by:gete

Expert Comment

by:nils pipenbrinck
ID: 1217128
that's not possible.

There are byte sequences that can't be compressed any further (statistically uniform random data, for example).

You can use ordinary compression algorithms for most data. These will, however, produce a larger compressed file if the data itself is not compressible.

If the algorithm you ask for existed, you could apply it 99 times to a 100-byte chunk and get a 1-byte compressed file. All storage problems would be solved... unfortunately, no such algorithm exists.


LVL 84

Expert Comment

ID: 1217129
What byte values do you want to allow in your n-byte sequence?
If you're restricting it to just uppercase letters, you could compress 3 bytes into 2, or 8 bytes into 5.

Author Comment

ID: 1217130
nils, I think you misinterpreted my problem. I don't mean the algorithm should make:
100 into 99, then 99 into 98, then 98 into 97, ...
but just 100 into 99; the compressed 99 bytes don't need to be compressible again.
Still, every possible 100-byte sequence must be compressible.

And what you said about 'All storage problems would be solved...' is not completely true, even if such an algorithm really existed. You must consider the time needed to (de)compress 100 bytes down to 1 byte (multi-pass); it absolutely wouldn't suit networking or real-time disk compression such as DriveSpace.

Actually, I have seen an article discussing this issue in an old local (Indonesian) computer magazine, which unfortunately I couldn't get another copy of. If I'm not mistaken, the author presented an algorithm to make any 64 KB sequence 1 byte smaller.
So I thought maybe I could get a similar or better answer from the wider world through this medium.

To be honest, I'm quite disappointed that you posted that as an answer instead of merely a comment, since someone else might still give me a solution. With all respect, please don't post an answer which is not an answer yet; I really appreciate and thank everyone who comments on my question.

I myself have also been trying hard to find a solution since I posted the question. In my research, I've found this fact (which I don't know whether it is useful or not):
  3 bytes uncompressed: A, B, C.
  A xor B = AB (1 byte)  --> stored
  A xor C = AC (1 byte)  --> stored
  B xor C = BC (1 byte)
If we only store the AB and AC (total = 2 bytes), we can get BC from:
  AB xor AC = BC (1 byte)
At first, I sensed that maybe I could extract one of the uncompressed bytes from this information, but again I got trapped in a cyclic XOR operation (well, as I said, it may not be useful).

I desperately want the algorithm (if any exists), even if it only compresses 100 KB to be 1 byte smaller, as long as it works on any combination. I will give 400 points or more (right now I have 420 points) for a SOLUTION (not a so-called ANSWER). The reason I don't offer the 400 right away is to avoid the engine automatically awarding them to an unsatisfactory answer.

ozo, I want ALL/ANY kind of combination sequence (binary files) to be compressible.

LVL 84

Expert Comment

ID: 1217131
As nils pipenbrinck said, it's impossible to losslessly compress ALL sequences.
There are 256^100=6668014432879854274079851790721257797144758322315908160396257811764037237817632071521432200871554290742929910593433240445888801654119365080363356052330830046095157579514014558463078285911814024728965016135886601981690748037476461291163877376 possible 100-byte sequences, and only 256^99=26046931378436930758124421057504913270096712196546516251547882077203270460225125279380594534654508948214569963255598595491753131461403769845169359579417304867559209294976619368996399554343023534097519594280807038990979484521392426918608896 possible 99-byte sequences.
That leaves 256^100 - 256^99 = 6641967501501417343321727369663752883874661610119361644144709929686833967357406946242051606336899781794715340630177641850397048522657961310518186692751412741227598370219037939094081886357471001194867496541605794942699768552955068864245268480 sequences which can't be represented in 99 bytes.

Compression only works when you expect most of those sequences to be less likely to occur than others, so that you can compress the few common sequences you are really interested in.
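The same pigeonhole argument can be checked exhaustively at a toy scale. The sketch below (Python, illustrative, not from the thread) enumerates every conceivable "compressor" from 2-bit strings down to 1-bit strings and verifies that none is injective, i.e. none can be losslessly decompressed; the 256^100 vs 256^99 case is the identical argument with bigger numbers:

```python
from itertools import product

# All 2-bit inputs (4 of them) and all 1-bit outputs (2 of them).
inputs = ['00', '01', '10', '11']
outputs = ['0', '1']

# Every candidate compressor is one choice of output per input.
# With 4 inputs and 2 outputs there are 2**4 = 16 such mappings;
# none of them can be injective, so some two inputs always collide.
for mapping in product(outputs, repeat=len(inputs)):
    assert len(set(mapping)) < len(inputs)  # a collision exists
```

Since injectivity is exactly what "KLMN must be able to be uncompressed back to ABCDE" demands, no mapping from n bytes to n-1 bytes can satisfy it for all inputs.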

Accepted Solution

nils pipenbrinck earned 30 total points
ID: 1217132

An answer that says "It's impossible" is an answer (in my opinion).

Anyway, I'll try to explain why it's not possible.

Let's say we have an algorithm that is able to compress a chunk of data from n bytes down to n-1 bytes.

If you took a book and compressed it until it was 1 byte long (which would be possible just by calling the compression over and over again), the entire information of that book would be compressed into one single byte (never mind the time it takes).

Since compression without decompression is useless, the entire information of that book must be reconstructable from that byte (we're talking about lossless compression here).

That means:

 * we have 8 bit
 * 256 different possible values
 * it's only possible to write 256 different books.

(since each possible byte can only reconstruct one book.. it's lossless compression!)

I hope you see that this proves that it's impossible to write an algorithm that will compress from n to n-1 under any circumstances.

Of course there are other algorithms that compress data chunks most of the time, but these can effectively only be applied once to a file.

There are two different kinds of algorithms:

The first kind works on individual bytes: it tries to code often-used bytes with fewer bits and seldom-used bytes with more bits. Huffman and arithmetic coding are the best-known algorithms for this.

The other algorithm family removes bits from the data stream by grouping similar bytes and writing them in a more compact way.
The simplest algorithm of that family is run-length encoding.

for example:

  "aaaabbbbbcde" (12 bytes) could be saved as:

  "4a5b1c1d1e" (5 count/byte pairs = 10 bytes)

which saves 2 bytes. (ok, there are better ways to compress, but it shows the trick).
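A minimal run-length encoder/decoder along these lines (Python sketch, illustrative; the thread itself contains no code) confirms the 2-byte saving on that example:

```python
def rle_encode(data):
    """Encode a string as (run length, symbol) pairs, runs capped at 255."""
    out = []
    i = 0
    while i < len(data):
        j = i
        while j < len(data) and data[j] == data[i] and j - i < 255:
            j += 1
        out.append((j - i, data[i]))
        i = j
    return out

def rle_decode(pairs):
    """Reverse the encoding: expand each (count, symbol) pair."""
    return ''.join(ch * n for n, ch in pairs)

pairs = rle_encode("aaaabbbbbcde")
assert pairs == [(4, 'a'), (5, 'b'), (1, 'c'), (1, 'd'), (1, 'e')]
assert rle_decode(pairs) == "aaaabbbbbcde"
assert 2 * len(pairs) == 10  # 10 bytes vs 12 input bytes: saves 2
```

Note that a string with no runs, like "abcdef", would encode to 6 pairs = 12 bytes, growing rather than shrinking, which is exactly the worst-case caveat discussed in this thread.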

Common algorithms used are RLE, LZW, LZ77 and LZ78.

I can give you a couple of URLs where you can find the algorithms and maybe even some source code.

Nils Pipenbrinck


Expert Comment

ID: 1217133
The only thing that may be true is that in almost any case you can compress a large data block by one byte.
However, you WON'T find a compression algorithm that will ALWAYS reduce your data. As ozo mentioned, there are some sequences that cannot be made any smaller by a given algorithm.

There is another method, "lossy" compression, that can always reduce the size of the data, but it does not (always) give you back the original data (something like 99% of it...).

And a third point:
What you found with the XORing is like this with numbers:
Let x, y, z have unknown values.
We do know, however, that:
   x + y = 30
   x - z = 20
so now we can find out that:
   y + z = 10
But this is not new information.
XORing is like adding/subtracting bits.
The third equation just repeats the previous two, and therefore you cannot get anything new from it.
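The same redundancy holds for the XOR identities from the question. A short check (Python, illustrative) confirms that (A xor B) xor (A xor C) equals (B xor C) for every byte, so storing B^C adds nothing that A^B and A^C don't already determine:

```python
# Verify the dependency for all byte pairs (the fixed third byte is
# arbitrary; the identity holds regardless of its value).
c = 0x5A
for a in range(256):
    for b in range(256):
        assert (a ^ b) ^ (a ^ c) == b ^ c
```

This is the XOR analogue of deriving y + z = 10 from the two equations above: a linear combination of known equations, not new information.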

PS. Binary files which are not already compressed can usually be compressed to about 65% at least.
LVL 10

Expert Comment

ID: 1217134
Yes, this problem is impossible to solve. The simple induction nils pipenbrinck gave is the proof. Give him the points or delete the question; you'll get no better answer.
LVL 84

Expert Comment

ID: 1217135
LVL 10

Expert Comment

ID: 1217136

This is so stupid.

It is like trying to make a perpetuum mobile with four interconnected gears.

Expert Comment

ID: 1217137
You're damn right.
LVL 10

Expert Comment

ID: 1217138
Last comment on this issue from me:

Try reading a book on data entropy:
Entropy is something like the average randomness of your data. Compressing a file means squeezing out redundancy; you can never get below the entropy of the data.

In your case you want to reduce 3 bytes to 2 bytes by XORing.
Can't you see that by XORing you lose data unless you also store what you are XORing with? Your compression doesn't reduce the entropy of the data. If you want to get 3 bytes down to two, you need bitwise tricks that exploit structure in the data, and even then it will not work on ALL data. For example, you could run-length encode the following 3 bytes

11111111 00000000 11110000    (24 bits)


1000 1 1000 0 0100 1 0100 0  (20 bits)

meaning 8x1 8x0 4x1 4x0

but the overhead of the repeat counter becomes very important when you have chaotic data. For example, alternating bits such as

10101010    (8 bits)

would become

0001 1 0001 0 0001 1 0001 0 ...    (40 bits)

You could try LZW, which is a very good compression algorithm. But here as well: worst case, your data will not be compressed (it might even grow in size).
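The blow-up in the bitwise RLE above is easy to quantify. A small sketch (Python, illustrative; counter capped at 15 to fit the 4-bit field) charges each run 5 bits (4-bit count + 1 run bit), so long runs shrink while alternating bits grow fivefold:

```python
def bit_rle_size(bits):
    """Size in bits of the 4-bit-counter RLE from the example above."""
    runs, i = 0, 0
    while i < len(bits):
        j = i
        # Extend the run; a 4-bit counter can hold at most 15.
        while j < len(bits) and bits[j] == bits[i] and j - i < 15:
            j += 1
        runs += 1
        i = j
    return 5 * runs  # 4-bit count + 1 data bit per run

# Runs of 8, 8, 4, 4 -> 4 runs: 24 bits shrink to 20.
assert bit_rle_size("11111111" "00000000" "11110000") == 20
# Alternating bits -> 8 runs of length 1: 8 bits balloon to 40.
assert bit_rle_size("10101010") == 40
```

This is the general trade-off in this thread: any scheme that shortens some inputs must lengthen others.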

Good luck.

P.S. If you find a way to do what you want. Patent it !! It will make you a lot of money!!

Expert Comment

by:nils pipenbrinck
ID: 1217139
hehe. If I ever find such an algorithm I'll never have to work again. If so, everyone here is invited to visit me at my house somewhere in the Caribbean.


Author Comment

ID: 1217140
Cheap points. Really cheap.

