rdktsj
asked on
compression of text, .exe and graphics ALL IN ONE
Our project is about enhancing the Huffman algorithm so that it can compress text, .exe and graphics files, and surpass the existing compression techniques for those file types. The Huffman algorithm is:
1. For each symbol, create a tree with a single root node, and order all trees according to the probability table of the symbols;
2. While more than one tree is left, take the two trees t1, t2 with the lowest frequencies f1, f2 (f1 less than or equal to f2) and create a tree with t1 and t2 as its children, with the frequency in the new root equal to f1 + f2;
3. Associate 0 with each left branch and 1 with each right branch;
4. Create a unique code for each symbol by traversing the tree from the root to the leaf containing that symbol, concatenating all the 0s and 1s encountered along the way.
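For concreteness, the four steps above can be sketched in Python (an illustrative sketch, not the asker's implementation; `huffman_codes` is a made-up name):

```python
import heapq
from collections import Counter

def huffman_codes(data: bytes) -> dict:
    """Build a Huffman code table for the symbols in `data`."""
    freq = Counter(data)
    # Step 1: one single-node tree per symbol, ordered by frequency.
    # Heap entries are (frequency, tiebreaker, tree); a leaf tree is the symbol.
    heap = [(f, i, sym) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    count = len(heap)
    # Step 2: repeatedly merge the two lowest-frequency trees t1, t2
    # into a new tree whose root frequency is f1 + f2.
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)
        f2, _, t2 = heapq.heappop(heap)
        count += 1
        heapq.heappush(heap, (f1 + f2, count, (t1, t2)))
    # Steps 3-4: walk the tree, appending '0' for left and '1' for right.
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"   # single-symbol edge case
    _, _, root = heap[0]
    walk(root, "")
    return codes
```

Frequent symbols end up near the root and therefore receive the shortest codes, and the resulting code set is prefix-free by construction.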
thanks in advance.
So how would you enhance that algorithm to surpass the existing compression techniques?
PKZIP, RAR, ARJ, LZ and other well-known data compressors make no distinction between an executable file, a text document or an image; they simply analyse the data and compress it.
What do you need, exactly?
RAR (somehow) seems to have a bit of an edge in efficiency when it comes to multimedia.
Please, Doggy: what's a "multimedia" file?
multimedia = graphics, sound, etc., as opposed to programs, text, etc.
ASKER
What I really need is a generic algorithm that would surpass the existing Huffman algorithm. This generic algo may be a combination of some of the existing algos or an enhanced version of one. Thank you so much.
I think the only way to enhance this algorithm (for generic use) is to increase its speed (in a particular implementation). I don't think there is a way to increase the compression advantage unless you were to start from scratch and devise a new algorithm.
Before you compress your files with the Huffman algorithm, you should compress them with an LZ-like algorithm. Here's how it works:
1. Find combinations of bytes that occur often in your file;
2. For each repeated combination, store a link (a back-reference) to its previous occurrence.
At the end your file consists of a few combinations and many links. You can then compress the result of this LZ pass with Huffman and decrease the file size again.
If you want higher compression, you should also run an RLE (run-length encoding) pass before the LZ and Huffman compression. I achieve roughly 50% compression with all these algorithms combined.
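The LZ step described above might be sketched like this (a naive back-reference scan for illustration, not columbo666's actual code; `lz_compress` and `lz_decompress` are made-up names):

```python
def lz_compress(data: bytes, window: int = 4096, min_len: int = 3):
    """Emit a token list: literal bytes, or ('link', offset, length) back-references."""
    out = []
    i = 0
    while i < len(data):
        best_len, best_off = 0, 0
        # Search the sliding window for the longest match starting at i.
        for j in range(max(0, i - window), i):
            length = 0
            while (i + length < len(data)
                   and data[j + length] == data[i + length]
                   and length < 255):
                length += 1
            if length > best_len:
                best_len, best_off = length, i - j
        if best_len >= min_len:
            out.append(('link', best_off, best_len))
            i += best_len
        else:
            out.append(data[i])   # no worthwhile match: emit a literal
            i += 1
    return out

def lz_decompress(tokens) -> bytes:
    buf = bytearray()
    for t in tokens:
        if isinstance(t, tuple):
            _, off, length = t
            for _ in range(length):      # byte-by-byte so overlapping
                buf.append(buf[-off])    # matches expand correctly
        else:
            buf.append(t)
    return bytes(buf)
```

The token stream (few distinct combinations, many links) is then a good input for a Huffman pass, which is essentially the LZ+Huffman layering that DEFLATE-style compressors use.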
greets
columbo666
Arithmetic coding might be considered an enhancement of Huffman coding.
LZ77 or LZ78 are some other generic algorithms that can surpass Huffman coding.
One additional step could be using several different algorithms inside one program to pack the same file, writing the compressed output to several files, then comparing the ratios and choosing the best one...
but that will probably give no more than a 10% difference.
The principal difference is lossy compression, as in .MP3 and .JPG files; you could use it on these 'multimedia' ;) files, but it is not worth it... I mean, what's the use of a program that checks the file extension, then uses a JPEG scheme for photos, LZW for .txt, etc.?
The best computer is you ;-), so you can choose which archiver is best for a particular file.
You can extend Huffman to handle runs. That is, store the pair "aa" as something different from a single "a". That needs an additional field in the node record, but it compresses the string "aaaaaaaaaaaaaaaaaaaaaaaa" to a single code instead of 24 one-bit codes (three bytes).
Hobbit123
Damn, that hobbit is smart. Anyway, I'll soon mail you some source of mine with the improved Huffman (YES, we of UtopiC believe in free source, so you will have all of it - haha)
ASKER
hobbit, how can we achieve this method of compressing "aa" instead of "a"?
A Huffman tree is a set of symbols, and each symbol has its code. Right?
If your file contains lots of 'a's, the code for 'a' will be the shortest code in the tree. Right?
When the compressor meets 'a', and then 'a' again, it recognizes this as two 'a' symbols. You can improve it to recognize them as one symbol. The usual declaration for a Huffman node looks like this:
THuffmanNode = record
  Symbol : Char;
  Code   : LongInt;
end;
Our modification:
THuffmanNode = record
  Symbol   : Char;
  Code     : LongInt;
  RepCount : Word;
end;
When building the tree, suppose the compressor meets 10000 'a' letters. It creates ONE node for them (RepCount = 10000) and compresses them down to as little as ONE BIT. In the best case, normal Huffman would need 1250 bytes (10000 one-bit codes). Our version does require more data in the header, though.
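The run-collapsing idea above might be sketched like this (illustrative only; the function names are assumptions, and a real implementation would still Huffman-code the symbols and store the repeat counts compactly in the output):

```python
from itertools import groupby

def collapse_runs(data: bytes):
    """bytes -> list of (symbol, rep_count) tokens, one token per run."""
    return [(sym, len(list(run))) for sym, run in groupby(data)]

def expand_runs(tokens) -> bytes:
    """Inverse of collapse_runs."""
    return bytes(b for sym, n in tokens for b in [sym] * n)
```

A run of 10000 'a's becomes the single token (97, 10000): one Huffman code for the symbol plus one stored count, instead of 10000 separate codes.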
Hobbit123
First, about the Huffman algorithm: it generates one tree per file, not one per symbol.
Note that ZIP coding uses the Huffman algorithm.
The only thing you can do to enhance it is to generate a code for each word (not for individual symbols as in the plain Huffman algorithm).
I think that is the idea behind RAR, but I'm not sure.
If you want to code .txt files that contain words, try using the Huffman algorithm to generate a code for each word -
or, better, only for the words that occur a lot in the file, and plain symbol codes for the rest.
You can do two passes: the first checks for the words, and the second handles the remaining symbols.
And if you know a bit more about the Huffman algorithm, you'll know that it generates the optimal code for each symbol (given the symbol frequencies), so there is no way to change the algorithm itself to make it better.
For the one who suggested increasing the speed: speed is not the problem.
So I don't know if there is anything else to do except what I told you.
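The word-level idea might be sketched as follows (an illustrative frequency-table builder; `word_symbol_table`, the regexes and the `top_n` cutoff are assumptions, not anyone's actual code). The resulting table can be fed straight into an ordinary Huffman tree builder:

```python
import re
from collections import Counter

def word_symbol_table(text: str, top_n: int = 100):
    """Frequencies over an alphabet of frequent whole words plus fallback characters."""
    words = re.findall(r"\w+", text)
    # First pass: find the words that occur a lot in the file.
    common = {w for w, _ in Counter(words).most_common(top_n)}
    # Second pass: common words count as single symbols,
    # everything else falls back to per-character symbols.
    symbols = Counter()
    for token in re.findall(r"\w+|\W", text):
        if token in common:
            symbols[token] += 1
        else:
            symbols.update(token)
    return symbols
```

Each frequent word then gets one short Huffman code instead of one code per letter, at the cost of a larger code table in the header.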