rdktsj

asked on

compression of text, .exe and graphics ALL IN ONE

Our project is about enhancing the Huffman algorithm so that it can compress text, .exe and graphics files and surpass the existing compression techniques for those kinds of files. The Huffman algorithm is (a rough sketch of these steps follows the list):

 1. For each symbol, create a tree with a single root node, and order all trees according to the probabilities in the symbol table;

 2. While more than one tree is left, take the two trees t1, t2 with the lowest frequencies f1, f2 (f1 <= f2) and create a tree with t1 and t2 as its children and with the frequency in the new root equal to f1 + f2;

 3. Associate 0 with each left branch and 1 with each right branch.

 4. Create a unique code for each symbol by traversing the tree from the root to the leaf corresponding to that symbol and putting all encountered 0s and 1s together.
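
A rough Pascal sketch of those four steps (the sample text, the plain linear search for the two lowest-frequency trees and the string-based codes are simplifications for illustration only):

program HuffmanSketch;
{ A rough sketch of the four steps above: count byte frequencies,
  merge the two lowest-frequency trees until one tree is left, then
  read off a 0/1 code per symbol. }

type
  PNode = ^TNode;
  TNode = record
    Symbol      : Byte;
    Freq        : LongInt;
    Left, Right : PNode;
  end;

const
  Sample = 'this is an example of a huffman tree';

var
  Freq  : array[0..255] of LongInt;
  Trees : array[0..255] of PNode;      { the forest of step 1 }
  Codes : array[0..255] of string;
  Count : Integer;

{ Step 2: while more than one tree is left, merge the two trees with
  the lowest frequencies f1 <= f2 into a new tree of frequency f1+f2. }
function BuildTree: PNode;
var
  a, b, i, t : Integer;
  n : PNode;
begin
  while Count > 1 do
  begin
    if Trees[0]^.Freq <= Trees[1]^.Freq then begin a := 0; b := 1; end
    else begin a := 1; b := 0; end;
    for i := 2 to Count - 1 do
      if Trees[i]^.Freq < Trees[a]^.Freq then begin b := a; a := i; end
      else if Trees[i]^.Freq < Trees[b]^.Freq then b := i;
    New(n);
    n^.Symbol := 0;
    n^.Freq   := Trees[a]^.Freq + Trees[b]^.Freq;
    n^.Left   := Trees[a];               { t1, the lower frequency }
    n^.Right  := Trees[b];               { t2 }
    if a > b then begin t := a; a := b; b := t; end;
    Trees[a] := n;                       { merged tree takes the lower slot }
    Trees[b] := Trees[Count - 1];        { last tree fills the freed slot }
    Dec(Count);
  end;
  BuildTree := Trees[0];
end;

{ Steps 3 and 4: 0 for each left branch, 1 for each right branch; the
  code of a symbol is the path of branch labels from root to its leaf. }
procedure AssignCodes(n: PNode; const Prefix: string);
begin
  if (n^.Left = nil) and (n^.Right = nil) then
    Codes[n^.Symbol] := Prefix
  else
  begin
    AssignCodes(n^.Left,  Prefix + '0');
    AssignCodes(n^.Right, Prefix + '1');
  end;
end;

var
  i : Integer;
begin
  { Step 1: one single-node tree per symbol that occurs in the input. }
  for i := 1 to Length(Sample) do
    Inc(Freq[Ord(Sample[i])]);
  Count := 0;
  for i := 0 to 255 do
    if Freq[i] > 0 then
    begin
      New(Trees[Count]);
      Trees[Count]^.Symbol := i;
      Trees[Count]^.Freq   := Freq[i];
      Trees[Count]^.Left   := nil;
      Trees[Count]^.Right  := nil;
      Inc(Count);
    end;
  AssignCodes(BuildTree, '');
  for i := 0 to 255 do
    if Codes[i] <> '' then
      WriteLn('''', Chr(i), ''' -> ', Codes[i]);
end.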

    thanks in advance.


ozo

So how would you enhance that algorithm to surpass the existing compression techniques?
vikiing

PKZIP, RAR, ARJ, LZ and other well-known data compressors make no distinction between an executable file, a text document or an image; they simply analyze the data and compress it.

What do you need, exactly?
RAR (somehow) seems to have a bit of an edge in efficiency when it comes to multimedia.
Please, Doggy: what's a "multimedia" file?
multimedia = graphics, sound, etc., as opposed to programs, text, etc.
rdktsj

ASKER

What I really need is a generic algorithm that would surpass the existing Huffman algorithm. This generic algo may be a combination of some of the existing algos or an enhanced version of one. Thank you so much.
I think the only way to enhance this algorithm (for generic use) is to increase its speed (in a particular implementation). I don't think there is a way to improve the compression ratio unless you were to start from scratch and devise a new algorithm.
Before you compress your files with the Huffman algorithm, you should compress them with an LZ-like algorithm. Here's how it works:

1. Find combinations of bytes that occur often in your file.
2. For every repeated combination, store a link back to its last occurrence.

At the end, your file consists of a few combinations and many links. After this LZ pass you can compress it with Huffman and decrease the file size again.

If you want higher compression, you should apply an RLE (run-length encoding) pass before the LZ and Huffman compression. I achieve roughly 50% compression with all these algorithms combined.
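
A minimal sketch of that back-reference idea (steps 1 and 2) could look like this in Pascal; the whole-file match search, the minimum match length of 3 and the textual output are simplifications, and a real coder would emit compact binary tokens and then Huffman-code them:

program LzSketch;
{ For each position, look back for the longest earlier occurrence of
  the upcoming bytes and emit a (distance, length) link instead of the
  bytes themselves; anything without a long enough match is a literal. }

const
  MinMatch = 3;

procedure LzEncode(const S: string);
var
  i, j, len, bestLen, bestDist : Integer;
begin
  i := 1;
  while i <= Length(S) do
  begin
    bestLen := 0;
    bestDist := 0;
    { search all earlier positions for the longest match with position i }
    for j := 1 to i - 1 do
    begin
      len := 0;
      while (i + len <= Length(S)) and (j + len < i) and
            (S[j + len] = S[i + len]) do
        Inc(len);
      if len > bestLen then
      begin
        bestLen := len;
        bestDist := i - j;
      end;
    end;
    if bestLen >= MinMatch then
    begin
      WriteLn('link: distance=', bestDist, ' length=', bestLen);
      Inc(i, bestLen);               { skip the bytes covered by the link }
    end
    else
    begin
      WriteLn('literal: ', S[i]);
      Inc(i);
    end;
  end;
end;

begin
  LzEncode('abcabcabcabd');
end.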

greets
columbo666
Arithmetic coding might be considered an enhancement of Huffman coding.
LZ77 or LZ78 are some other generic algorithms that can surpass Huffman coding.
One additional step could be to use several different algorithms inside one program to pack the same file, write the compressed output to several files, then compare the ratios and choose the best one...
but it will probably give no more than a 10% difference.
The principal difference is lossy compression, as in .MP3 and .JPG files, so you could use it on these 'multimedia' ;) files, but it is not worth it... I mean, what's the use of a program that checks the file extension and then uses a JPEG scheme for photos, LZW for .txt, etc.?
The best computer is you ;-), so you can choose which archiver is better for a particular file.
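
For what it's worth, a sketch of that "pack with several methods, keep the smallest result" idea could look like the following. The two "compressors" are deliberately trivial dummies (a plain copy and a non-reversible zero-dropper) just to make the selection logic runnable; in a real program they would be the Huffman, LZ and RLE back-ends, and PackBest, the one-byte method tag and all other names are illustrative assumptions:

{$mode objfpc}
program PickBestSketch;

uses
  SysUtils;

type
  TCompressFunc = function(const Src: TBytes): TBytes;

function CopyOnly(const Src: TBytes): TBytes;
begin
  Result := Copy(Src, 0, Length(Src));
end;

function DropZeros(const Src: TBytes): TBytes;
var
  i, n : Integer;
begin
  { dummy "compressor": drops zero bytes so the two methods produce
    different sizes; it is not a real, reversible transform }
  SetLength(Result, Length(Src));
  n := 0;
  for i := 0 to High(Src) do
    if Src[i] <> 0 then
    begin
      Result[n] := Src[i];
      Inc(n);
    end;
  SetLength(Result, n);
end;

{ Run every method, keep the smallest output, and prefix a one-byte tag
  so the decompressor knows which method it has to undo. }
function PackBest(const Src: TBytes;
                  const Methods: array of TCompressFunc): TBytes;
var
  i, Best : Integer;
  BestBuf, Cur : TBytes;
begin
  Best := 0;
  BestBuf := Methods[0](Src);
  for i := 1 to High(Methods) do
  begin
    Cur := Methods[i](Src);
    if Length(Cur) < Length(BestBuf) then
    begin
      BestBuf := Cur;
      Best := i;
    end;
  end;
  SetLength(Result, Length(BestBuf) + 1);
  Result[0] := Best;
  if Length(BestBuf) > 0 then
    Move(BestBuf[0], Result[1], Length(BestBuf));
end;

var
  Methods : array[0..1] of TCompressFunc;
  Data, Chosen : TBytes;
  i : Integer;
begin
  Methods[0] := @CopyOnly;
  Methods[1] := @DropZeros;
  SetLength(Data, 100);
  for i := 0 to 99 do
    Data[i] := i mod 4;                 { plenty of zero bytes }
  Chosen := PackBest(Data, Methods);
  WriteLn('chose method ', Chosen[0], ', ', Length(Chosen), ' bytes');
end.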
You can expand Huffman to handle series, that is, to store the symbol "aa" as something different from "a". That needs an additional field in the node record, but it compresses the string "aaaaaaaaaaaaaaaaaaaaaaaa" to 1 bit instead of three bytes.
 
                        Hobbit123

Damn, that hobbit is smart. Anyway, I'll soon mail you some source of mine with the improved Huffman (YES, we of UtopiC believe in free source, so you will have all of it - haha).
rdktsj

ASKER

hobbit, how can we achieve this method of compressing "aa" instead of "a"?
Huffman tree is a set of symbols. Each symbol has its code. Right?
If your file contains lots of 'a', the code for 'a' will be the shortest code in the tree. Right?
If the compressor meets 'a', and then 'a' again, it recognizes this as two 'a' symbols. You can improve it to recognize them as one symbol. A usual declaration for a Huffman node looks like this:
 THuffmanNode = record
   Symbol   : Char;     { the character this node stands for }
   Code     : LongInt;  { the bit pattern assigned to it }
 end;

our modification:
 THuffmanNode = record
   Symbol    : Char;
   Code      : LongInt;
   RepCount  : Word;    { how many times Symbol repeats in a row }
 end;

When building the tree, the compressor meets 10000 'a' letters. It creates ONE node for them (RepCount = 10000) and compresses them to as little as ONE BIT. In the best case, normal Huffman would compress them to 1250 bytes. But our version requires more data in the header.
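
A sketch of that run-collapsing scan might look like this; TToken carries just the Symbol/RepCount part of the extended record above, the resulting token list is what the tree would then be built from and what would be encoded, and all names are illustrative:

program RunTokenSketch;
{ Instead of feeding every single character to the Huffman coder,
  each run of equal characters is collapsed into one
  (Symbol, RepCount) token. }

type
  TToken = record
    Symbol   : Char;
    RepCount : Word;
  end;

procedure Tokenize(const S: string; var Tokens: array of TToken;
                   var Count: Integer);
var
  i : Integer;
begin
  Count := 0;
  i := 1;
  while i <= Length(S) do
  begin
    Tokens[Count].Symbol   := S[i];
    Tokens[Count].RepCount := 1;
    { extend the current run as far as it goes (Word caps it at 65535) }
    while (i < Length(S)) and (S[i + 1] = S[i]) and
          (Tokens[Count].RepCount < 65535) do
    begin
      Inc(Tokens[Count].RepCount);
      Inc(i);
    end;
    Inc(Count);
    Inc(i);
  end;
end;

var
  Tokens : array[0..255] of TToken;
  n, i : Integer;
begin
  Tokenize('aaaaaaaaaaaaaaaaaaaaaaaabbbc', Tokens, n);
  for i := 0 to n - 1 do
    WriteLn(Tokens[i].Symbol, ' x ', Tokens[i].RepCount);
end.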


                   Hobbit123
First, about the Huffman algorithm: it generates a tree for each file, not for each symbol.

Note that ZIP coding uses the Huffman algorithm.

The only thing you can do to enhance it is to generate a code for each word (not for symbols, as in the Huffman algorithm).

I think that's the idea behind RAR, but I'm not sure.

If you want to code .txt files that contain words, try using the Huffman algorithm to generate a code for each word -
or maybe only for the words that occur a lot in the file, and for the rest just codes for the symbols.
You can do two passes: the first checks the words, and the second handles the remaining symbols.
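
A rough sketch of the word-counting pass such a scheme would need; the threshold of three occurrences, the missing overflow checks and all names are arbitrary illustration, and the resulting word/frequency table is what would feed the Huffman tree builder:

program WordFreqSketch;

const
  MaxWords = 1000;

var
  Words : array[0..MaxWords - 1] of string;
  Freq  : array[0..MaxWords - 1] of Integer;
  Count : Integer;

{ add W to the table, or bump its count if it is already there }
procedure AddWord(const W: string);
var
  i : Integer;
begin
  for i := 0 to Count - 1 do
    if Words[i] = W then
    begin
      Inc(Freq[i]);
      Exit;
    end;
  Words[Count] := W;
  Freq[Count]  := 1;
  Inc(Count);
end;

{ split S into alphabetic words and count each one }
procedure CountWords(const S: string);
var
  i : Integer;
  W : string;
begin
  W := '';
  for i := 1 to Length(S) do
    if S[i] in ['a'..'z', 'A'..'Z'] then
      W := W + S[i]
    else
    begin
      if W <> '' then AddWord(W);
      W := '';
    end;
  if W <> '' then AddWord(W);
end;

var
  i : Integer;
begin
  Count := 0;
  CountWords('the cat and the dog and the cat and the bird');
  for i := 0 to Count - 1 do
    if Freq[i] >= 3 then
      WriteLn('"', Words[i], '" occurs ', Freq[i], ' times -> give it its own code')
    else
      WriteLn('"', Words[i], '" occurs ', Freq[i], ' times -> code it per symbol');
end.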

If you know a little bit more about the Huffman algorithm, you'll know that it generates the most optimized code for each symbol, so there is no way to change the algorithm itself to make it better.

For the one who suggested increasing the speed: speed is not the problem.

So I don't know if there is anything else to do except what I told you.
ASKER CERTIFIED SOLUTION

God_Ares

This solution is only available to Experts Exchange members.