ziv lempel compress?

hi , how can i compress a large txt file ( like 10 MB). Some one know Ziv Lempel algorithm will see that if i use 16 bit CodeWord for every entry data i input from a txt file(large file) into a hash table. As a result of 16 bits(unsigned short), i only compressed a part of that txt file. I wonder that how can i compress the whole file?
A newbie  
LVL 1
therock_80Asked:
Who is Participating?
I wear a lot of hats...

"The solutions and answers provided on Experts Exchange have been extremely helpful to me over the last few years. I wear a lot of hats - Developer, Database Administrator, Help Desk, etc., so I know a lot of things but not a lot about one thing. Experts Exchange gives me answers from people who do know a lot about one thing, in a easy to use platform." -Todd S.

dhyaneshCommented:
Hi

You can either increase the CodeWord size or increase the number of characters you consider at a time. If you are considering one character at a time you can increase it to more than one to get the entire file in the hash table.

Dhyanesh
 
0
dhyaneshCommented:
Also you can use other compression techniques like huffman coding but this will not be as efficient as Lempel-Ziv
0
dhyaneshCommented:
Hi

This is excerpt from an article:

"Another problem on long files is that frequently the compression ratio begins to degrade as more of the file is read in. The reason for this is simple. Since the string table is of finite size, after a certain number of strings have been defined, no more can be added. But the string table is only good for the portion of the file that was read in while it was built. Later sections of the file may have different characteristics, and really need a different string table.

The conventional way to solve this problem is to monitor the compression ratio. After the string table is full, the compressor watches to see if the compression ratio degrades. After a certain amount of degradation, the entire table is flushed, and gets rebuilt from scratch. The expansion code is flagged when this happens by seeing a special code from the compression routine.

An alternative method would be to keep track of how frequently strings are used, and to periodically flush values that are rarely used. An adaptive technique like this may be too difficult to implement in a reasonably sized program.

One final technique for compressing the data is to take the LZW codes and run them through an adaptive Huffman coding filter. This will generally exploit a few more percentage points of compression, but at the cost of considerable more complexity in the code, as well as quite a bit more run time. "

The link for entire article is:
http://dogma.net/markn/articles/lzw/lzw.htm

It also has working implementation.

Dhyanesh

0
Cloud Class® Course: CompTIA Healthcare IT Tech

This course will help prep you to earn the CompTIA Healthcare IT Technician certification showing that you have the knowledge and skills needed to succeed in installing, managing, and troubleshooting IT systems in medical and clinical settings.

therock_80Author Commented:
hi Dhyanesh,
 i tried change unsigned short to unsigned int, then the hash table can contain 2^32 entries codeword. But when i compress, my computer slow down, and got low virtual memory.
I think that because i declared 2^32 size of array and struct{} in my code.?
 Can you show me any algorithm or method people use to compress big data except Ziv Lempel???
0
skypalaeCommented:
Hi
Lempel-Ziv uses simple technique (I don't know if it is used in standard PKZIP, but i think it is). When you exceed the capacity of your hash table, just save everything, clear all your tables and continue like if you doing new file.
S.
0
dhyaneshCommented:
Hi

I think Lempel Ziv is the best compression method

As I have posted above:

"The conventional way to solve this problem is to monitor the compression ratio. After the string table is full, the compressor watches to see if the compression ratio degrades. After a certain amount of degradation, the entire table is flushed, and gets rebuilt from scratch. The expansion code is flagged when this happens by seeing a special code from the compression routine.
"

You can use this technique to handle data when the table is full. i.e. by flushing the entire table.

However if you do want other techniques you can use Huffman Coding. It uses the fixed character set which are encoded on basis of their probability of occurence. This will not require more than 256 characters i.e. the ASCII set. Each character is assigned a code. More frequently occuring characters are given less bits and less frequently occuring ones more bits.

Here is are links to various versions of Huffman coding:

http://www.gamedev.net/reference/articles/article286.asp

http://www.xcf.berkeley.edu/~ali/K0D/Algorithms/huff/

http://www.cs.mu.oz.au/~alistair/inplace.c



However Huffman coding requires that the dictionary created i.e. character and codeword be sent along as well. The order of most frequently occuring to least frequently occuring symbol is also sufficient. Huffman coding does not perform very well for small files as the overhead of sending the order of frequency. It can handle large files without any problem since max characters are 256. Also compression ratios will be less compared to Lempel - Ziv but not in all cases

 
Dhyanesh
0

Experts Exchange Solution brought to you by

Your issues matter to us.

Facing a tech roadblock? Get the help and guidance you need from experienced professionals who care. Ask your question anytime, anywhere, with no hassle.

Start your 7-day free trial
It's more than this solution.Get answers and train to solve all your tech problems - anytime, anywhere.Try it for free Edge Out The Competitionfor your dream job with proven skills and certifications.Get started today Stand Outas the employee with proven skills.Start learning today for free Move Your Career Forwardwith certification training in the latest technologies.Start your trial today
C++

From novice to tech pro — start learning today.

Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.