• Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 686
  • Last Modified:

Using bzip2 to compress strings

I'm looking for a way to use bzip2 or even something better that doesn't require an external dll.  I have tried using zlib but the "compressed" string comes out longer than the original.  I'd appreciate any help and/or insight, thanks!
0
DSOM
Asked:
DSOM
  • 10
  • 9
  • 3
  • +1
1 Solution
 
aikimarkCommented:
What kind of strings are you compressing?
What volume of data are you compressing? (how many strings)
0
 
ThommyCommented:
Here's a Delphi unit for implementing bzip2 data compression...
BZIP2 Data Compression Interface Unit
0
 
ThommyCommented:
0
VIDEO: THE CONCERTO CLOUD FOR HEALTHCARE

Modern healthcare requires a modern cloud. View this brief video to understand how the Concerto Cloud for Healthcare can help your organization.

 
DSOMAuthor Commented:
aikimark: It's about 4 million strings like these:


83852C953D0A8AFE0BEA1D9794382B52
30CC848AE998222B945DAB2DFA619727

Thommy: Can you make a compress and decompress string function that uses that unit?

sinisav: thanks, I am reviewing those links
0
 
DSOMAuthor Commented:
Smaz and Snappy both look interesting but I can't find a Delphi port for them
0
 
Sinisa VukCommented:
Now I can see your strings. Seems that your strings are hex numbers, representing with 16 states. You can use Base64 encoding to shrink your strings. Base64 have 64 possible states.
Translate your hex number to bit array  and use 6 instead of 4 bits to look in base64 table.
This way you can shrink from 42 chars to 42*4/6=28 chars.

www.koders.com
http://www.swissdelphicenter.ch/torry/showcode.php?id=1524
0
 
DSOMAuthor Commented:
I am not sure I follow.  If I base64 encode the string I get
83852C953D0A8AFE0BEA1D9794382B52 ->
E3CuDJ93EJKpH311E456HJ12HK4nH3atEJGpE392DJ8

Which takes up more space
0
 
aikimarkCommented:
@DSOM

You're trying to compress GUIDs?!?

When encoding in Base64, you go back to the byte level, not at the (character) hex digit representation of the data.

You can get a 50% size reduction by converting your hex string into a record of four long integer values.
Example:
'83852C953D0A8AFE0BEA1D9794382B52'
Produces these long integer values:
-2088424299
1024101118
199892375
-1808258222

If you sorted your hex strings, you might be able to get more compression.  In such a scenario, each value is the difference to the prior value in the sequence or you use a short integer for the first part of the numeric values record and have a separate bit of data that has the first two bytes of the numeric value along with the number of records that share those first two bytes.
0
 
DSOMAuthor Commented:
Ok, the only problem I see storing the data like that is that my database uses it for the primary key.  If I were able to compress the string and store that it would still work but if I store 4 integers I don't think I could use the field as primary or unique.
0
 
aikimarkCommented:
you can't compress or encrypt a key field unless that is a function within the database.

Is this a static or dynamic table?  If static, then convert the GUID keys to an autonumber (=long integer) column.  You will get a 75% reduction in column size.  The current size of your table is .1% of the usable long integer key range.  (1/10 of one percent)

If such GUID values are used in foreign key relationships, then you can get more than 75% reduction.
0
 
DSOMAuthor Commented:
I would be storing the compressed string as the data in the key field and using that to index instead.
0
 
aikimarkCommented:
Once you've done such a conversion, I don't see any need to save the GUID value.  If you did need to save the GUID values, then use the 4-tuple of long integer values or a byte array.  When a new GUID value comes in, you would convert that hex string to your stored format and search your table of existing values.  If found, you would use the corresponding long integer key value or add the new GUID value (in converted format) to your table.
0
 
Sinisa VukCommented:
I made my example:

const
  Codes64 = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz+/';
....
function EncodeHexToBase64(sInStr: String): String;
const
  HEX_BIN_TABLE : array[0..15, 0..3] of Byte =
    (
    (0,0,0,0),  //0
    (0,0,0,1),  //1
    (0,0,1,0),  //2
    (0,0,1,1),  //3
    (0,1,0,0),  //4
    (0,1,0,1),  //5
    (0,1,1,0),  //6
    (0,1,1,1),  //7
    (1,0,0,0),  //8
    (1,0,0,1),  //9
    (1,0,1,0),  //a
    (1,0,1,1),  //b
    (1,1,0,0),  //c
    (1,1,0,1),  //d
    (1,1,1,0),  //e
    (1,1,1,1)   //f
    );
var
  i, j, b, n: Integer;
  arBin: Array of Byte;
begin
  Result := '';
  SetLength(arBin, Length(sInStr)*4); //allocate space for binary

  //make binary array
  for i:=1 to Length(sInStr) do
  begin
    j := 0;
    case sInStr[i] of
      'A'..'F':
        j := (Ord(sInStr[i]) - 55);
      '0'..'9':
        j := (Ord(sInStr[i]) - 48);
    end;

    arBin[(i-1)*4+0] := HEX_BIN_TABLE[j, 0];
    arBin[(i-1)*4+1] := HEX_BIN_TABLE[j, 1];
    arBin[(i-1)*4+2] := HEX_BIN_TABLE[j, 2];
    arBin[(i-1)*4+3] := HEX_BIN_TABLE[j, 3];
  end;

  b := 0;
  n := 0;
  for i:=Low(arBin) to High(arBin) do
  begin
    Inc(b);

    n := n * 2 + arBin[i];
    if b=6 then
    begin
      Result := Result + Codes64[n+1];
      b := 0;
      n := 0;
    end;
  end;

  if b>0 then
  begin
    Result := Result + Codes64[n+1];
  end;
end;

function DecodeHexToBase64(sInStr: String): String;
const
  HEXSTR : String = '0123456789ABCDEF';
var
  i, j, b, n, m: Integer;
  arBin: Array of Byte;
begin
  Result := '';
  SetLength(arBin, (6 * Length(sInStr))); //allocate space

  //make binary array
  b := 0;
  for i:=1 to Length(sInStr) do
  begin
    n := Pos(sInStr[i], codes64) - 1;

    m :=6;
    if i=Length(sInStr) then //last char
      m := Trunc((Math.Log10(n)/Math.Log10(2)) + 1);

    for j:=1 to m do
    begin
      arBin[b+m-j] := (n mod 2);
      n := n div 2;
    end;

    Inc(b, m);
  end;

  m := 0;
  n := 0;
  for i:=Low(arBin) to b-1 do
  begin
    Inc(m);
    n := n * 2 + arBin[i];
    if m=4 then
    begin
      Result := Result + HEXSTR[n+1];
      m := 0;
      n := 0;
    end;
  end;

  if m>0 then
  begin
    Result := Result + HEXSTR[n+1];
  end;
end;
...

Open in new window


This could be done more fancy, but for show my idea is fine. Please
note that best compress ratio is for input string which length is divisible by 6 like 36, 42,48...
0
 
aikimarkCommented:
@DSOM

What kind of database are you using?
0
 
DSOMAuthor Commented:
it's called Absolute Database by ComponentAce.  It's a drop-in, standalone replacement for BDE.  

http://www.componentace.com/bde_replacement_database_delphi_absolute_database.htm

Over the years, the database has grown to around 3.5 million records and takes up almost 500Mb which is getting to be a problem for distribution.  So I am looking for ways to cut every bit and byte I can to save space.
0
 
aikimarkCommented:
Are the GUID values being generated by the database or your application?
0
 
aikimarkCommented:
If you're using an aftGuid field type, then the database is probably already storing the values in 16-byte chunks -- no savings possible.

Are your concerns the distribution footprint of the RAM footprint at run time?
0
 
DSOMAuthor Commented:
It's not a GUID, it's an MD5 so it's just stored as a fixed-length string.  The MD5s are generated by an app that I use to produce the database.
0
 
aikimarkCommented:
what data are you hashing?
0
 
DSOMAuthor Commented:
The hash is from data files.  Could be any file really.

I tried converting the string into 4 largints in the database but that caused it grow from 500Mb to 700Mb.
0
 
aikimarkCommented:
what is the field type you are storing the MD5 hash values
0
 
DSOMAuthor Commented:
The field type is string
0
 
aikimarkCommented:
You should be able to store the (128 bit) MD5 hash value into a GUID field type or two large int or (as I already suggested) four integer fields.  You want to save the MD5 hash in its binary form, not in its hex character form.  Doing this will give you a 50% memory/footprint savings.
0

Featured Post

Concerto's Cloud Advisory Services

Want to avoid the missteps to gaining all the benefits of the cloud? Learn more about the different assessment options from our Cloud Advisory team.

  • 10
  • 9
  • 3
  • +1
Tackle projects and never again get stuck behind a technical roadblock.
Join Now