Solved

ziv lempel compress?

Posted on 2003-12-03
6
434 Views
Last Modified: 2010-04-02
hi , how can i compress a large txt file ( like 10 MB). Some one know Ziv Lempel algorithm will see that if i use 16 bit CodeWord for every entry data i input from a txt file(large file) into a hash table. As a result of 16 bits(unsigned short), i only compressed a part of that txt file. I wonder that how can i compress the whole file?
A newbie  
0
Comment
Question by:therock_80
  • 4
6 Comments
 
LVL 4

Expert Comment

by:dhyanesh
Comment Utility
Hi

You can either increase the CodeWord size or increase the number of characters you consider at a time. If you are considering one character at a time you can increase it to more than one to get the entire file in the hash table.

Dhyanesh
 
0
 
LVL 4

Expert Comment

by:dhyanesh
Comment Utility
Also you can use other compression techniques like huffman coding but this will not be as efficient as Lempel-Ziv
0
 
LVL 4

Expert Comment

by:dhyanesh
Comment Utility
Hi

This is excerpt from an article:

"Another problem on long files is that frequently the compression ratio begins to degrade as more of the file is read in. The reason for this is simple. Since the string table is of finite size, after a certain number of strings have been defined, no more can be added. But the string table is only good for the portion of the file that was read in while it was built. Later sections of the file may have different characteristics, and really need a different string table.

The conventional way to solve this problem is to monitor the compression ratio. After the string table is full, the compressor watches to see if the compression ratio degrades. After a certain amount of degradation, the entire table is flushed, and gets rebuilt from scratch. The expansion code is flagged when this happens by seeing a special code from the compression routine.

An alternative method would be to keep track of how frequently strings are used, and to periodically flush values that are rarely used. An adaptive technique like this may be too difficult to implement in a reasonably sized program.

One final technique for compressing the data is to take the LZW codes and run them through an adaptive Huffman coding filter. This will generally exploit a few more percentage points of compression, but at the cost of considerable more complexity in the code, as well as quite a bit more run time. "

The link for entire article is:
http://dogma.net/markn/articles/lzw/lzw.htm

It also has working implementation.

Dhyanesh

0
Better Security Awareness With Threat Intelligence

See how one of the leading financial services organizations uses Recorded Future as part of a holistic threat intelligence program to promote security awareness and proactively and efficiently identify threats.

 
LVL 1

Author Comment

by:therock_80
Comment Utility
hi Dhyanesh,
 i tried change unsigned short to unsigned int, then the hash table can contain 2^32 entries codeword. But when i compress, my computer slow down, and got low virtual memory.
I think that because i declared 2^32 size of array and struct{} in my code.?
 Can you show me any algorithm or method people use to compress big data except Ziv Lempel???
0
 
LVL 4

Expert Comment

by:skypalae
Comment Utility
Hi
Lempel-Ziv uses simple technique (I don't know if it is used in standard PKZIP, but i think it is). When you exceed the capacity of your hash table, just save everything, clear all your tables and continue like if you doing new file.
S.
0
 
LVL 4

Accepted Solution

by:
dhyanesh earned 125 total points
Comment Utility
Hi

I think Lempel Ziv is the best compression method

As I have posted above:

"The conventional way to solve this problem is to monitor the compression ratio. After the string table is full, the compressor watches to see if the compression ratio degrades. After a certain amount of degradation, the entire table is flushed, and gets rebuilt from scratch. The expansion code is flagged when this happens by seeing a special code from the compression routine.
"

You can use this technique to handle data when the table is full. i.e. by flushing the entire table.

However if you do want other techniques you can use Huffman Coding. It uses the fixed character set which are encoded on basis of their probability of occurence. This will not require more than 256 characters i.e. the ASCII set. Each character is assigned a code. More frequently occuring characters are given less bits and less frequently occuring ones more bits.

Here is are links to various versions of Huffman coding:

http://www.gamedev.net/reference/articles/article286.asp

http://www.xcf.berkeley.edu/~ali/K0D/Algorithms/huff/

http://www.cs.mu.oz.au/~alistair/inplace.c



However Huffman coding requires that the dictionary created i.e. character and codeword be sent along as well. The order of most frequently occuring to least frequently occuring symbol is also sufficient. Huffman coding does not perform very well for small files as the overhead of sending the order of frequency. It can handle large files without any problem since max characters are 256. Also compression ratios will be less compared to Lempel - Ziv but not in all cases

 
Dhyanesh
0

Featured Post

Find Ransomware Secrets With All-Source Analysis

Ransomware has become a major concern for organizations; its prevalence has grown due to past successes achieved by threat actors. While each ransomware variant is different, we’ve seen some common tactics and trends used among the authors of the malware.

Join & Write a Comment

When writing generic code, using template meta-programming techniques, it is sometimes useful to know if a type is convertible to another type. A good example of when this might be is if you are writing diagnostic instrumentation for code to generat…
Container Orchestration platforms empower organizations to scale their apps at an exceptional rate. This is the reason numerous innovation-driven companies are moving apps to an appropriated datacenter wide platform that empowers them to scale at a …
The viewer will learn how to clear a vector as well as how to detect empty vectors in C++.
The viewer will be introduced to the member functions push_back and pop_back of the vector class. The video will teach the difference between the two as well as how to use each one along with its functionality.

763 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

13 Experts available now in Live!

Get 1:1 Help Now