?
Solved

LZW compression - converting a string binary into bits

Posted on 2003-03-06
25
Medium Priority
?
1,296 Views
Last Modified: 2008-03-10
I am using LZW as the compression algorithm, that part works fine, but I am trying to convert a binary string of a codeword into bits as to compress the file further.
Here is an example:
Input from file:  ababababababbbabac
codeword ( eg ):  12344634
binary for the codeword ( not correct just example )
1          0000001
2          0000010
3          0000011
4          0000101
4          0000101
...

Lets assume that the  alphabet consists of the 127  ASCII characters from decimal code 0 to  decimal code 126 (tilde), inclusive.  Therefore, I will need at least 7 bits per code initially.  So how can i convert each element in the binary representation to a bit, and lets say enter into a 7 bit stream?  Also how can I read it back, i.e how do i know what each integer bit was entered at and that it doesnt come back as one big long number that cant be broken up into one byte when decompressing?

Thanks any help would be greatly appreciated

John

i have seen this done in C.
0
Comment
Question by:imrama
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 11
  • 8
  • 3
  • +3
25 Comments
 
LVL 35

Expert Comment

by:TimYates
ID: 8081378
I'm confused...

Are you trying to get chars into a 7bit numeric representation?
0
 
LVL 35

Expert Comment

by:TimYates
ID: 8081414
ahhh...then read 7 bits at a time from a stream, and convert that back into the chars?

hmmmm...not gonna be pretty in Java :-(
0
 

Author Comment

by:imrama
ID: 8081419
Ya im trying to convert an integer -> binary -> 7bit representation.

input   codeword   binary     bit representation
ab      4          000101     ?
0
Get 15 Days FREE Full-Featured Trial

Benefit from a mission critical IT monitoring with Monitis Premium or get it FREE for your entry level monitoring needs.
-Over 200,000 users
-More than 300,000 websites monitored
-Used in 197 countries
-Recommended by 98% of users

 

Author Comment

by:imrama
ID: 8081438
Ya thats the problem, its not, have you any source code to help this, I was trying to do this but its not working and now I have got all confused!
0
 

Author Comment

by:imrama
ID: 8081458
IN java binary is represented in strings so every 1 or 0 in the string representation takes up 1 byte ( 8 bits ) so how do i represent each 1 or 0 ( in the string form ) as a bit instead of 1 byte.  get me? confusing I know!!
0
 
LVL 35

Expert Comment

by:TimYates
ID: 8081505
byte setBits( String bitRepresentation )
{
  if( bitRepresentation.length() > 8 )
    return 0 ;
  byte ret = 0 ;
  for( int i = bitRepresentation.length - 1 ; i >= 0 ; i-- )
  {
    if( bitRepresentation.charAt( i ) == '1' )
      ret |= ( 1 << i ) ;
  }
  return ret ;
}

.....

byte b = setBits( "01010101" ) ;

----------------

Should set b to 85
0
 
LVL 35

Expert Comment

by:TimYates
ID: 8081512
of course, byte is signed too, so

byte b = setBits( "11010101" ) ;

would be negative...

ho hum!
0
 

Author Comment

by:imrama
ID: 8081525
IN java binary is represented in strings so every 1 or 0 in the string representation takes up 1 byte ( 8 bits ) so how do i represent each 1 or 0 ( in the string form ) as a bit instead of 1 byte.  get me? confusing I know!!
0
 
LVL 35

Expert Comment

by:TimYates
ID: 8081542
oooh...an echo ;-)
0
 
LVL 35

Expert Comment

by:TimYates
ID: 8081696
I think you may be going a little off track though...

Ok, you have a char:

A

which has an ASCII value of (64) I think...

doing

Integer.toBinaryString( (int)'A' ) ;

will give you the string

"01000000"

(or something similar)

however, doing

byte b = (byte)'A' ;

will give you the byte that represents the number 64

Are you sure you need to use the string method?  The byte is stored inside the computer in binary...  it's just the representation that's a string...
0
 

Author Comment

by:imrama
ID: 8081722
Ok, got that I have that part but I should have stated that b ( in ur case byte b ) has to be of varying length i.e it could be 9 bits.  I want to write to a file 1 byte ( 8 bits ) so that if one binary is 7 bits and a second binary is 9 bits the first entry on the second binary (9 bits) will be added to the 7 bits of the first binary number and outputted to a file as one byte ( 8 bits ).  This is becoming very confusing.  So when reading it back from the file ( decompressing ) how do i know where one number finishes and another one starts? This is the coding I cant fathom!

here is an example:

       
1010101         first binary string
11111111        second binary string
11111101
110101011


8 bits written to a file as one byte:

10101011           First binary string
11111111           second binary string afterwards

Thanks a lot for the help, I know this is confusing I should have been clearer from the start!  I understand the concept of masking every byte before writing it but coding this is the problem.

John
0
 

Author Comment

by:imrama
ID: 8081740
Ok, got that I have that part but I should have stated that b ( in ur case byte b ) has to be of varying length i.e it could be 9 bits.  I want to write to a file 1 byte ( 8 bits ) so that if one binary is 7 bits and a second binary is 9 bits the first entry on the second binary (9 bits) will be added to the 7 bits of the first binary number and outputted to a file as one byte ( 8 bits ).  This is becoming very confusing.  So when reading it back from the file ( decompressing ) how do i know where one number finishes and another one starts? This is the coding I cant fathom!

here is an example:

       
1010101         first binary string
11111111        second binary string
11111101
110101011


8 bits written to a file as one byte:

10101011           First binary string
11111111           second binary string afterwards

Thanks a lot for the help, I know this is confusing I should have been clearer from the start!  I understand the concept of masking every byte before writing it but coding this is the problem.

John
0
 
LVL 35

Expert Comment

by:TimYates
ID: 8081832
you _could_ have:

First binary string:

[ length ] [ string ]
00000111   1010101

second binary string:

[ length ] [ string ]
00001001   111111111

So that the length is always 8bits long, and the data afterwards is variable, then unpacking it *shouldn't* be that hard...  

You will either need to know the length of the bits, or have the bits fit into a definite header structure before you will be able to seperate them out again :-(

Without some method of indexing, or without knowing where all your bytes start as they follow a set patter, you won't be able to extract them out :-(
0
 
LVL 2

Expert Comment

by:karlika
ID: 8081838
The only possibility I know is to write a table where you define the length of each binary.
Are you sure you saw this done in c?

Marko
0
 
LVL 35

Expert Comment

by:TimYates
ID: 8081862
> The only possibility I know is to write a table where you define the length of each binary.

or to precede each binary block with a byte saying how long it is like in my example ;-)
0
 
LVL 14

Expert Comment

by:Tommy Braas
ID: 8081863
Actually, in Java each character in a String takes up 2 bytes.

The way I see it, you have three options:

Common to both options: Read in the data into a byte[].

Option 1:
In a loop, you want to 'push' the bits from what you have read, into a new array of bytes, 7 bits at a time. The hard part will be to keep track of which byte you are at in the source array and in the destination array AND which bit you're transferring.

Option 2:
Instantiate a java.util.BitSet.
In a loop iterating the array of bytes read in, for each bit in a byte that you want (7 out of 8), call set(index) on the BitSet instance, with the index of the bit your reading times the byte you're reading minus 1.

public BitSet convert(byte[] inBytes)
{
   // the order of the values in the bit mask is determined
   // by the order of the bits written
   byte[] bitMasks = { 64, 32, 16, 8, 4, 2, 1 };

   // instantiate the BitSet with the right length
   BitSet bitSet = new BitSet(inBytes.length * 8 - inBytes.length);

   for( int i=0; i<inBytes.length; i++ )
   {
      for(int j=6; j>-1; j-- )
      {
         if( (inBytes[i] & bitMask[bitMasks.length - j]) != 0 )
         {
            bitSet.set(i*j);
         }
      }
   }
}

Option 3:
Pretty much the same as option 2. Read all the bytes and their respective bit values in to the BitSet. Then upon reading, sift out the 7 bits you want from every byte represented in the bitset.
0
 

Author Comment

by:imrama
ID: 8082158
The problem is not reading the bits from the input stream, its knowing which
bits correspond to which codewords. This is because when writing bits to the outputstream codewords are represented in binary strings of different
lengths. For example:

1101101   7 bits
11100010  8 bits
111000011 9 bits
etc.

I write these to the outputstream in 1 byte chucks(ie 8 bits). To save space
I will write the first bit of the second codeword in the last space of the
first codeword and so on. For example the 3 codewords above would be written to the
outputstream in 3 bytes as follows

11011011 11000101 11000011

I will continue to do this until all bits have been written to the outputstream

Then when it comes to reading the bits all I have is one long stream of
bits. I can easily read this and break this into 7, 8, 9 or any number of
bits length. The problem is that I want to distinguish where one codeword
ends and another one begins. As the lengths will differ I dont know how to
solve this


0
 

Author Comment

by:imrama
ID: 8082206
The problem is not reading the bits from the input stream, its knowing which
bits correspond to which codewords. This is because when writing bits to the outputstream codewords are represented in binary strings of different
lengths. For example:

1101101   7 bits
11100010  8 bits
111000011 9 bits
etc.

I write these to the outputstream in 1 byte chucks(ie 8 bits). To save space
I will write the first bit of the second codeword in the last space of the
first codeword and so on. For example the 3 codewords above would be written to the
outputstream in 3 bytes as follows

11011011 11000101 11000011

I will continue to do this until all bits have been written to the outputstream

Then when it comes to reading the bits all I have is one long stream of
bits. I can easily read this and break this into 7, 8, 9 or any number of
bits length. The problem is that I want to distinguish where one codeword
ends and another one begins. As the lengths will differ I dont know how to
solve this


0
 
LVL 14

Expert Comment

by:Tommy Braas
ID: 8082235
Aha, that wasn't obvious from your initial example. I thought the problem you were having was AFTER decompressing.

Am I correct in the assumption that you're problem lies in the actual decompression of your data?
0
 
LVL 86

Expert Comment

by:CEHJ
ID: 8082472
Let me get this clear: you're saying you've found redundancy in the internal metadata in the compression routines you're using? Can you prove this?
0
 

Author Comment

by:imrama
ID: 8082485
Essentially it is both a compression and decompression problem, so that when I decompresse the file I will be able to read and distinguish each codeword as it was compressed.

If my codewords were 127, 128, 130, 124 this would give me the following binary representations of the codewords:

127: 1111111
128: 10000000
130: 10000010
124: 1111100

Now in compression I write this to the output stream in 8 bit chunks(1 byte) at a time as follows:

1st byte: 11111111
2nd byte: 00000001
3rd byte: 00000101
4th byte: 111100

I will pad the final byte with exra 0's as to make it a full byte so

4th byte = 11110000

These bytes represent a compressed version of the original codewords and are written to the outputstream. Essentially the file will hold 11111111000000010000010111110000, which is the 4 bytes one after another with padding of 0's at the end

Then when reading from the compressed file I can read bytes and break them up into their 8 bit values. If the original codewords were simply 8 bit values I could easily work out their original Integer value. However since the lengths can vary I cannot distinguish the end of one codeword and the start of another.

I need to find a way to do this.... preferably without increasing the size of the compressed file with extra information.  I cant code this and am hoping someone can provide code to do this!!



0
 
LVL 14

Expert Comment

by:Tommy Braas
ID: 8082907
Since you can't encode 'symbol' boundaries without adversly affecting file size, you have two different options:

Option 1: Save your dictionary in the file

Option 2: Use a static dictionary.

I don't know if you dynamically generate your dictionary based on the occurence of symbols in the 'text' to compress. If that is the case, you will HAVE to save that dictionary in the compressed file.

Using your example above, (127, 128, 130, 124), compressed that should take up more than about 12 bits.
0
 

Author Comment

by:imrama
ID: 8110993
thanks for the help, i have been trying that I have it working the easy way i.e using a bitinpustream calss to write a bit to a file.  But I want to be able do it the way I was saying earlier, for the ideas people have given can they back/provide some sample code, making sure that the compressed file is any bigger.  Whoever can provide this code will get the points.  Thanks again.
0
 
LVL 1

Accepted Solution

by:
pronane earned 225 total points
ID: 8219000
the easiest way to do this ( that i know of ) is to changed the lenght of the bits you are writing to a file everytime the next codeword you read in ( in binary format ) is longer than the current length of the bits you are reading in , therefore when decompressin you can do it like this:

when the dictionary size is greater than 127 you can increase the length of bits to be reading in and so on.
0
 

Author Comment

by:imrama
ID: 8219009
I have actually tried to do that and its sort of working, but I dont know where to go from here, seeing as you were hte last to respond I will award you the points.  Thanks for the help.
0

Featured Post

Get 15 Days FREE Full-Featured Trial

Benefit from a mission critical IT monitoring with Monitis Premium or get it FREE for your entry level monitoring needs.
-Over 200,000 users
-More than 300,000 websites monitored
-Used in 197 countries
-Recommended by 98% of users

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

INTRODUCTION Working with files is a moderately common task in Java.  For most projects hard coding the file names, using parameters in configuration files, or using command-line arguments is sufficient.   However, when your application has vi…
Go is an acronym of golang, is a programming language developed Google in 2007. Go is a new language that is mostly in the C family, with significant input from Pascal/Modula/Oberon family. Hence Go arisen as low-level language with fast compilation…
Viewers will learn about the different types of variables in Java and how to declare them. Decide the type of variable desired: Put the keyword corresponding to the type of variable in front of the variable name: Use the equal sign to assign a v…
How to fix incompatible JVM issue while installing Eclipse While installing Eclipse in windows, got one error like above and unable to proceed with the installation. This video describes how to successfully install Eclipse. How to solve incompa…
Suggested Courses
Course of the Month13 days, 4 hours left to enroll

777 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question