Link to home
Start Free TrialLog in
Avatar of DSOM
DSOMFlag for United States of America

asked on

Using bzip2 to compress strings

I'm looking for a way to use bzip2 or even something better that doesn't require an external dll.  I have tried using zlib but the "compressed" string comes out longer than the original.  I'd appreciate any help and/or insight, thanks!
Avatar of aikimark
aikimark
Flag of United States of America image

What kind of strings are you compressing?
What volume of data are you compressing? (how many strings)
Here's a Delphi unit for implementing bzip2 data compression...
BZIP2 Data Compression Interface Unit
Avatar of DSOM

ASKER

aikimark: It's about 4 million strings like these:


83852C953D0A8AFE0BEA1D9794382B52
30CC848AE998222B945DAB2DFA619727

Thommy: Can you make a compress and decompress string function that uses that unit?

sinisav: thanks, I am reviewing those links
Avatar of DSOM

ASKER

Smaz and Snappy both look interesting but I can't find a Delphi port for them
Now I can see your strings. Seems that your strings are hex numbers, representing with 16 states. You can use Base64 encoding to shrink your strings. Base64 have 64 possible states.
Translate your hex number to bit array  and use 6 instead of 4 bits to look in base64 table.
This way you can shrink from 42 chars to 42*4/6=28 chars.

www.koders.com
http://www.swissdelphicenter.ch/torry/showcode.php?id=1524
Avatar of DSOM

ASKER

I am not sure I follow.  If I base64 encode the string I get
83852C953D0A8AFE0BEA1D9794382B52 ->
E3CuDJ93EJKpH311E456HJ12HK4nH3atEJGpE392DJ8

Which takes up more space
@DSOM

You're trying to compress GUIDs?!?

When encoding in Base64, you go back to the byte level, not at the (character) hex digit representation of the data.

You can get a 50% size reduction by converting your hex string into a record of four long integer values.
Example:
'83852C953D0A8AFE0BEA1D9794382B52'
Produces these long integer values:
-2088424299
1024101118
199892375
-1808258222

If you sorted your hex strings, you might be able to get more compression.  In such a scenario, each value is the difference to the prior value in the sequence or you use a short integer for the first part of the numeric values record and have a separate bit of data that has the first two bytes of the numeric value along with the number of records that share those first two bytes.
Avatar of DSOM

ASKER

Ok, the only problem I see storing the data like that is that my database uses it for the primary key.  If I were able to compress the string and store that it would still work but if I store 4 integers I don't think I could use the field as primary or unique.
you can't compress or encrypt a key field unless that is a function within the database.

Is this a static or dynamic table?  If static, then convert the GUID keys to an autonumber (=long integer) column.  You will get a 75% reduction in column size.  The current size of your table is .1% of the usable long integer key range.  (1/10 of one percent)

If such GUID values are used in foreign key relationships, then you can get more than 75% reduction.
Avatar of DSOM

ASKER

I would be storing the compressed string as the data in the key field and using that to index instead.
Once you've done such a conversion, I don't see any need to save the GUID value.  If you did need to save the GUID values, then use the 4-tuple of long integer values or a byte array.  When a new GUID value comes in, you would convert that hex string to your stored format and search your table of existing values.  If found, you would use the corresponding long integer key value or add the new GUID value (in converted format) to your table.
I made my example:

const
  Codes64 = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz+/';
....
function EncodeHexToBase64(sInStr: String): String;
const
  HEX_BIN_TABLE : array[0..15, 0..3] of Byte =
    (
    (0,0,0,0),  //0
    (0,0,0,1),  //1
    (0,0,1,0),  //2
    (0,0,1,1),  //3
    (0,1,0,0),  //4
    (0,1,0,1),  //5
    (0,1,1,0),  //6
    (0,1,1,1),  //7
    (1,0,0,0),  //8
    (1,0,0,1),  //9
    (1,0,1,0),  //a
    (1,0,1,1),  //b
    (1,1,0,0),  //c
    (1,1,0,1),  //d
    (1,1,1,0),  //e
    (1,1,1,1)   //f
    );
var
  i, j, b, n: Integer;
  arBin: Array of Byte;
begin
  Result := '';
  SetLength(arBin, Length(sInStr)*4); //allocate space for binary

  //make binary array
  for i:=1 to Length(sInStr) do
  begin
    j := 0;
    case sInStr[i] of
      'A'..'F':
        j := (Ord(sInStr[i]) - 55);
      '0'..'9':
        j := (Ord(sInStr[i]) - 48);
    end;

    arBin[(i-1)*4+0] := HEX_BIN_TABLE[j, 0];
    arBin[(i-1)*4+1] := HEX_BIN_TABLE[j, 1];
    arBin[(i-1)*4+2] := HEX_BIN_TABLE[j, 2];
    arBin[(i-1)*4+3] := HEX_BIN_TABLE[j, 3];
  end;

  b := 0;
  n := 0;
  for i:=Low(arBin) to High(arBin) do
  begin
    Inc(b);

    n := n * 2 + arBin[i];
    if b=6 then
    begin
      Result := Result + Codes64[n+1];
      b := 0;
      n := 0;
    end;
  end;

  if b>0 then
  begin
    Result := Result + Codes64[n+1];
  end;
end;

function DecodeHexToBase64(sInStr: String): String;
const
  HEXSTR : String = '0123456789ABCDEF';
var
  i, j, b, n, m: Integer;
  arBin: Array of Byte;
begin
  Result := '';
  SetLength(arBin, (6 * Length(sInStr))); //allocate space

  //make binary array
  b := 0;
  for i:=1 to Length(sInStr) do
  begin
    n := Pos(sInStr[i], codes64) - 1;

    m :=6;
    if i=Length(sInStr) then //last char
      m := Trunc((Math.Log10(n)/Math.Log10(2)) + 1);

    for j:=1 to m do
    begin
      arBin[b+m-j] := (n mod 2);
      n := n div 2;
    end;

    Inc(b, m);
  end;

  m := 0;
  n := 0;
  for i:=Low(arBin) to b-1 do
  begin
    Inc(m);
    n := n * 2 + arBin[i];
    if m=4 then
    begin
      Result := Result + HEXSTR[n+1];
      m := 0;
      n := 0;
    end;
  end;

  if m>0 then
  begin
    Result := Result + HEXSTR[n+1];
  end;
end;
...

Open in new window


This could be done more fancy, but for show my idea is fine. Please
note that best compress ratio is for input string which length is divisible by 6 like 36, 42,48...
@DSOM

What kind of database are you using?
Avatar of DSOM

ASKER

it's called Absolute Database by ComponentAce.  It's a drop-in, standalone replacement for BDE.  

http://www.componentace.com/bde_replacement_database_delphi_absolute_database.htm

Over the years, the database has grown to around 3.5 million records and takes up almost 500Mb which is getting to be a problem for distribution.  So I am looking for ways to cut every bit and byte I can to save space.
Are the GUID values being generated by the database or your application?
If you're using an aftGuid field type, then the database is probably already storing the values in 16-byte chunks -- no savings possible.

Are your concerns the distribution footprint of the RAM footprint at run time?
Avatar of DSOM

ASKER

It's not a GUID, it's an MD5 so it's just stored as a fixed-length string.  The MD5s are generated by an app that I use to produce the database.
what data are you hashing?
Avatar of DSOM

ASKER

The hash is from data files.  Could be any file really.

I tried converting the string into 4 largints in the database but that caused it grow from 500Mb to 700Mb.
what is the field type you are storing the MD5 hash values
Avatar of DSOM

ASKER

The field type is string
ASKER CERTIFIED SOLUTION
Avatar of aikimark
aikimark
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial