Link to home
Start Free TrialLog in
Avatar of imrama
imrama

asked on

LZW compression - converting a string binary into bits

I am using LZW as the compression algorithm, that part works fine, but I am trying to convert a binary string of a codeword into bits as to compress the file further.
Here is an example:
Input from file:  ababababababbbabac
codeword ( eg ):  12344634
binary for the codeword ( not correct just example )
1          0000001
2          0000010
3          0000011
4          0000101
4          0000101
...

Lets assume that the  alphabet consists of the 127  ASCII characters from decimal code 0 to  decimal code 126 (tilde), inclusive.  Therefore, I will need at least 7 bits per code initially.  So how can i convert each element in the binary representation to a bit, and lets say enter into a 7 bit stream?  Also how can I read it back, i.e how do i know what each integer bit was entered at and that it doesnt come back as one big long number that cant be broken up into one byte when decompressing?

Thanks any help would be greatly appreciated

John

i have seen this done in C.
Avatar of TimYates
TimYates
Flag of United Kingdom of Great Britain and Northern Ireland image

I'm confused...

Are you trying to get chars into a 7bit numeric representation?
ahhh...then read 7 bits at a time from a stream, and convert that back into the chars?

hmmmm...not gonna be pretty in Java :-(
Avatar of imrama
imrama

ASKER

Ya im trying to convert an integer -> binary -> 7bit representation.

input   codeword   binary     bit representation
ab      4          000101     ?
Avatar of imrama

ASKER

Ya thats the problem, its not, have you any source code to help this, I was trying to do this but its not working and now I have got all confused!
Avatar of imrama

ASKER

IN java binary is represented in strings so every 1 or 0 in the string representation takes up 1 byte ( 8 bits ) so how do i represent each 1 or 0 ( in the string form ) as a bit instead of 1 byte.  get me? confusing I know!!
byte setBits( String bitRepresentation )
{
  if( bitRepresentation.length() > 8 )
    return 0 ;
  byte ret = 0 ;
  for( int i = bitRepresentation.length - 1 ; i >= 0 ; i-- )
  {
    if( bitRepresentation.charAt( i ) == '1' )
      ret |= ( 1 << i ) ;
  }
  return ret ;
}

.....

byte b = setBits( "01010101" ) ;

----------------

Should set b to 85
of course, byte is signed too, so

byte b = setBits( "11010101" ) ;

would be negative...

ho hum!
Avatar of imrama

ASKER

IN java binary is represented in strings so every 1 or 0 in the string representation takes up 1 byte ( 8 bits ) so how do i represent each 1 or 0 ( in the string form ) as a bit instead of 1 byte.  get me? confusing I know!!
oooh...an echo ;-)
I think you may be going a little off track though...

Ok, you have a char:

A

which has an ASCII value of (64) I think...

doing

Integer.toBinaryString( (int)'A' ) ;

will give you the string

"01000000"

(or something similar)

however, doing

byte b = (byte)'A' ;

will give you the byte that represents the number 64

Are you sure you need to use the string method?  The byte is stored inside the computer in binary...  it's just the representation that's a string...
Avatar of imrama

ASKER

Ok, got that I have that part but I should have stated that b ( in ur case byte b ) has to be of varying length i.e it could be 9 bits.  I want to write to a file 1 byte ( 8 bits ) so that if one binary is 7 bits and a second binary is 9 bits the first entry on the second binary (9 bits) will be added to the 7 bits of the first binary number and outputted to a file as one byte ( 8 bits ).  This is becoming very confusing.  So when reading it back from the file ( decompressing ) how do i know where one number finishes and another one starts? This is the coding I cant fathom!

here is an example:

       
1010101         first binary string
11111111        second binary string
11111101
110101011


8 bits written to a file as one byte:

10101011           First binary string
11111111           second binary string afterwards

Thanks a lot for the help, I know this is confusing I should have been clearer from the start!  I understand the concept of masking every byte before writing it but coding this is the problem.

John
Avatar of imrama

ASKER

Ok, got that I have that part but I should have stated that b ( in ur case byte b ) has to be of varying length i.e it could be 9 bits.  I want to write to a file 1 byte ( 8 bits ) so that if one binary is 7 bits and a second binary is 9 bits the first entry on the second binary (9 bits) will be added to the 7 bits of the first binary number and outputted to a file as one byte ( 8 bits ).  This is becoming very confusing.  So when reading it back from the file ( decompressing ) how do i know where one number finishes and another one starts? This is the coding I cant fathom!

here is an example:

       
1010101         first binary string
11111111        second binary string
11111101
110101011


8 bits written to a file as one byte:

10101011           First binary string
11111111           second binary string afterwards

Thanks a lot for the help, I know this is confusing I should have been clearer from the start!  I understand the concept of masking every byte before writing it but coding this is the problem.

John
you _could_ have:

First binary string:

[ length ] [ string ]
00000111   1010101

second binary string:

[ length ] [ string ]
00001001   111111111

So that the length is always 8bits long, and the data afterwards is variable, then unpacking it *shouldn't* be that hard...  

You will either need to know the length of the bits, or have the bits fit into a definite header structure before you will be able to seperate them out again :-(

Without some method of indexing, or without knowing where all your bytes start as they follow a set patter, you won't be able to extract them out :-(
The only possibility I know is to write a table where you define the length of each binary.
Are you sure you saw this done in c?

Marko
> The only possibility I know is to write a table where you define the length of each binary.

or to precede each binary block with a byte saying how long it is like in my example ;-)
Actually, in Java each character in a String takes up 2 bytes.

The way I see it, you have three options:

Common to both options: Read in the data into a byte[].

Option 1:
In a loop, you want to 'push' the bits from what you have read, into a new array of bytes, 7 bits at a time. The hard part will be to keep track of which byte you are at in the source array and in the destination array AND which bit you're transferring.

Option 2:
Instantiate a java.util.BitSet.
In a loop iterating the array of bytes read in, for each bit in a byte that you want (7 out of 8), call set(index) on the BitSet instance, with the index of the bit your reading times the byte you're reading minus 1.

public BitSet convert(byte[] inBytes)
{
   // the order of the values in the bit mask is determined
   // by the order of the bits written
   byte[] bitMasks = { 64, 32, 16, 8, 4, 2, 1 };

   // instantiate the BitSet with the right length
   BitSet bitSet = new BitSet(inBytes.length * 8 - inBytes.length);

   for( int i=0; i<inBytes.length; i++ )
   {
      for(int j=6; j>-1; j-- )
      {
         if( (inBytes[i] & bitMask[bitMasks.length - j]) != 0 )
         {
            bitSet.set(i*j);
         }
      }
   }
}

Option 3:
Pretty much the same as option 2. Read all the bytes and their respective bit values in to the BitSet. Then upon reading, sift out the 7 bits you want from every byte represented in the bitset.
Avatar of imrama

ASKER

The problem is not reading the bits from the input stream, its knowing which
bits correspond to which codewords. This is because when writing bits to the outputstream codewords are represented in binary strings of different
lengths. For example:

1101101   7 bits
11100010  8 bits
111000011 9 bits
etc.

I write these to the outputstream in 1 byte chucks(ie 8 bits). To save space
I will write the first bit of the second codeword in the last space of the
first codeword and so on. For example the 3 codewords above would be written to the
outputstream in 3 bytes as follows

11011011 11000101 11000011

I will continue to do this until all bits have been written to the outputstream

Then when it comes to reading the bits all I have is one long stream of
bits. I can easily read this and break this into 7, 8, 9 or any number of
bits length. The problem is that I want to distinguish where one codeword
ends and another one begins. As the lengths will differ I dont know how to
solve this


Avatar of imrama

ASKER

The problem is not reading the bits from the input stream, its knowing which
bits correspond to which codewords. This is because when writing bits to the outputstream codewords are represented in binary strings of different
lengths. For example:

1101101   7 bits
11100010  8 bits
111000011 9 bits
etc.

I write these to the outputstream in 1 byte chucks(ie 8 bits). To save space
I will write the first bit of the second codeword in the last space of the
first codeword and so on. For example the 3 codewords above would be written to the
outputstream in 3 bytes as follows

11011011 11000101 11000011

I will continue to do this until all bits have been written to the outputstream

Then when it comes to reading the bits all I have is one long stream of
bits. I can easily read this and break this into 7, 8, 9 or any number of
bits length. The problem is that I want to distinguish where one codeword
ends and another one begins. As the lengths will differ I dont know how to
solve this


Aha, that wasn't obvious from your initial example. I thought the problem you were having was AFTER decompressing.

Am I correct in the assumption that you're problem lies in the actual decompression of your data?
Let me get this clear: you're saying you've found redundancy in the internal metadata in the compression routines you're using? Can you prove this?
Avatar of imrama

ASKER

Essentially it is both a compression and decompression problem, so that when I decompresse the file I will be able to read and distinguish each codeword as it was compressed.

If my codewords were 127, 128, 130, 124 this would give me the following binary representations of the codewords:

127: 1111111
128: 10000000
130: 10000010
124: 1111100

Now in compression I write this to the output stream in 8 bit chunks(1 byte) at a time as follows:

1st byte: 11111111
2nd byte: 00000001
3rd byte: 00000101
4th byte: 111100

I will pad the final byte with exra 0's as to make it a full byte so

4th byte = 11110000

These bytes represent a compressed version of the original codewords and are written to the outputstream. Essentially the file will hold 11111111000000010000010111110000, which is the 4 bytes one after another with padding of 0's at the end

Then when reading from the compressed file I can read bytes and break them up into their 8 bit values. If the original codewords were simply 8 bit values I could easily work out their original Integer value. However since the lengths can vary I cannot distinguish the end of one codeword and the start of another.

I need to find a way to do this.... preferably without increasing the size of the compressed file with extra information.  I cant code this and am hoping someone can provide code to do this!!



Since you can't encode 'symbol' boundaries without adversly affecting file size, you have two different options:

Option 1: Save your dictionary in the file

Option 2: Use a static dictionary.

I don't know if you dynamically generate your dictionary based on the occurence of symbols in the 'text' to compress. If that is the case, you will HAVE to save that dictionary in the compressed file.

Using your example above, (127, 128, 130, 124), compressed that should take up more than about 12 bits.
Avatar of imrama

ASKER

thanks for the help, i have been trying that I have it working the easy way i.e using a bitinpustream calss to write a bit to a file.  But I want to be able do it the way I was saying earlier, for the ideas people have given can they back/provide some sample code, making sure that the compressed file is any bigger.  Whoever can provide this code will get the points.  Thanks again.
ASKER CERTIFIED SOLUTION
Avatar of pronane
pronane

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Avatar of imrama

ASKER

I have actually tried to do that and its sort of working, but I dont know where to go from here, seeing as you were hte last to respond I will award you the points.  Thanks for the help.