DSOM
asked on
Using bzip2 to compress strings
I'm looking for a way to use bzip2 or even something better that doesn't require an external dll. I have tried using zlib but the "compressed" string comes out longer than the original. I'd appreciate any help and/or insight, thanks!
Here's a Delphi unit for implementing bzip2 data compression...
BZIP2 Data Compression Interface Unit
BZIP2 Data Compression Interface Unit
For verzy short strings you cannot have "shorter" strings. This is a fact. Solution is to use come coding algorithm.
http://answers.google.com/ answers/th readview?i d=498051
http://stackoverflow.com/q uestions/4 79218/how- to-compres s-small-st rings
http://stackoverflow.com/q uestions/1 138345/bes t-compress ion-algori thm-for-sh ort-text-s trings
http://www.pal-blog.de/ent wicklung/p erl/2012/c ompressing -test-for- short-stri ngs.html
experts-exchange.com old answer
http://answers.google.com/
http://stackoverflow.com/q
http://stackoverflow.com/q
http://www.pal-blog.de/ent
experts-exchange.com old answer
ASKER
aikimark: It's about 4 million strings like these:
83852C953D0A8AFE0BEA1D9794 382B52
30CC848AE998222B945DAB2DFA 619727
Thommy: Can you make a compress and decompress string function that uses that unit?
sinisav: thanks, I am reviewing those links
83852C953D0A8AFE0BEA1D9794
30CC848AE998222B945DAB2DFA
Thommy: Can you make a compress and decompress string function that uses that unit?
sinisav: thanks, I am reviewing those links
ASKER
Smaz and Snappy both look interesting but I can't find a Delphi port for them
Now I can see your strings. Seems that your strings are hex numbers, representing with 16 states. You can use Base64 encoding to shrink your strings. Base64 have 64 possible states.
Translate your hex number to bit array and use 6 instead of 4 bits to look in base64 table.
This way you can shrink from 42 chars to 42*4/6=28 chars.
www.koders.com
http://www.swissdelphicent er.ch/torr y/showcode .php?id=15 24
Translate your hex number to bit array and use 6 instead of 4 bits to look in base64 table.
This way you can shrink from 42 chars to 42*4/6=28 chars.
www.koders.com
http://www.swissdelphicent
ASKER
I am not sure I follow. If I base64 encode the string I get
83852C953D0A8AFE0BEA1D9794 382B52 ->
E3CuDJ93EJKpH311E456HJ12HK 4nH3atEJGp E392DJ8
Which takes up more space
83852C953D0A8AFE0BEA1D9794
E3CuDJ93EJKpH311E456HJ12HK
Which takes up more space
@DSOM
You're trying to compress GUIDs?!?
When encoding in Base64, you go back to the byte level, not at the (character) hex digit representation of the data.
You can get a 50% size reduction by converting your hex string into a record of four long integer values.
Example:
'83852C953D0A8AFE0BEA1D979 4382B52'
Produces these long integer values:
-2088424299
1024101118
199892375
-1808258222
If you sorted your hex strings, you might be able to get more compression. In such a scenario, each value is the difference to the prior value in the sequence or you use a short integer for the first part of the numeric values record and have a separate bit of data that has the first two bytes of the numeric value along with the number of records that share those first two bytes.
You're trying to compress GUIDs?!?
When encoding in Base64, you go back to the byte level, not at the (character) hex digit representation of the data.
You can get a 50% size reduction by converting your hex string into a record of four long integer values.
Example:
'83852C953D0A8AFE0BEA1D979
Produces these long integer values:
-2088424299
1024101118
199892375
-1808258222
If you sorted your hex strings, you might be able to get more compression. In such a scenario, each value is the difference to the prior value in the sequence or you use a short integer for the first part of the numeric values record and have a separate bit of data that has the first two bytes of the numeric value along with the number of records that share those first two bytes.
ASKER
Ok, the only problem I see storing the data like that is that my database uses it for the primary key. If I were able to compress the string and store that it would still work but if I store 4 integers I don't think I could use the field as primary or unique.
you can't compress or encrypt a key field unless that is a function within the database.
Is this a static or dynamic table? If static, then convert the GUID keys to an autonumber (=long integer) column. You will get a 75% reduction in column size. The current size of your table is .1% of the usable long integer key range. (1/10 of one percent)
If such GUID values are used in foreign key relationships, then you can get more than 75% reduction.
Is this a static or dynamic table? If static, then convert the GUID keys to an autonumber (=long integer) column. You will get a 75% reduction in column size. The current size of your table is .1% of the usable long integer key range. (1/10 of one percent)
If such GUID values are used in foreign key relationships, then you can get more than 75% reduction.
ASKER
I would be storing the compressed string as the data in the key field and using that to index instead.
Once you've done such a conversion, I don't see any need to save the GUID value. If you did need to save the GUID values, then use the 4-tuple of long integer values or a byte array. When a new GUID value comes in, you would convert that hex string to your stored format and search your table of existing values. If found, you would use the corresponding long integer key value or add the new GUID value (in converted format) to your table.
I made my example:
This could be done more fancy, but for show my idea is fine. Please
note that best compress ratio is for input string which length is divisible by 6 like 36, 42,48...
const
Codes64 = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz+/';
....
function EncodeHexToBase64(sInStr: String): String;
const
HEX_BIN_TABLE : array[0..15, 0..3] of Byte =
(
(0,0,0,0), //0
(0,0,0,1), //1
(0,0,1,0), //2
(0,0,1,1), //3
(0,1,0,0), //4
(0,1,0,1), //5
(0,1,1,0), //6
(0,1,1,1), //7
(1,0,0,0), //8
(1,0,0,1), //9
(1,0,1,0), //a
(1,0,1,1), //b
(1,1,0,0), //c
(1,1,0,1), //d
(1,1,1,0), //e
(1,1,1,1) //f
);
var
i, j, b, n: Integer;
arBin: Array of Byte;
begin
Result := '';
SetLength(arBin, Length(sInStr)*4); //allocate space for binary
//make binary array
for i:=1 to Length(sInStr) do
begin
j := 0;
case sInStr[i] of
'A'..'F':
j := (Ord(sInStr[i]) - 55);
'0'..'9':
j := (Ord(sInStr[i]) - 48);
end;
arBin[(i-1)*4+0] := HEX_BIN_TABLE[j, 0];
arBin[(i-1)*4+1] := HEX_BIN_TABLE[j, 1];
arBin[(i-1)*4+2] := HEX_BIN_TABLE[j, 2];
arBin[(i-1)*4+3] := HEX_BIN_TABLE[j, 3];
end;
b := 0;
n := 0;
for i:=Low(arBin) to High(arBin) do
begin
Inc(b);
n := n * 2 + arBin[i];
if b=6 then
begin
Result := Result + Codes64[n+1];
b := 0;
n := 0;
end;
end;
if b>0 then
begin
Result := Result + Codes64[n+1];
end;
end;
function DecodeHexToBase64(sInStr: String): String;
const
HEXSTR : String = '0123456789ABCDEF';
var
i, j, b, n, m: Integer;
arBin: Array of Byte;
begin
Result := '';
SetLength(arBin, (6 * Length(sInStr))); //allocate space
//make binary array
b := 0;
for i:=1 to Length(sInStr) do
begin
n := Pos(sInStr[i], codes64) - 1;
m :=6;
if i=Length(sInStr) then //last char
m := Trunc((Math.Log10(n)/Math.Log10(2)) + 1);
for j:=1 to m do
begin
arBin[b+m-j] := (n mod 2);
n := n div 2;
end;
Inc(b, m);
end;
m := 0;
n := 0;
for i:=Low(arBin) to b-1 do
begin
Inc(m);
n := n * 2 + arBin[i];
if m=4 then
begin
Result := Result + HEXSTR[n+1];
m := 0;
n := 0;
end;
end;
if m>0 then
begin
Result := Result + HEXSTR[n+1];
end;
end;
...
This could be done more fancy, but for show my idea is fine. Please
note that best compress ratio is for input string which length is divisible by 6 like 36, 42,48...
@DSOM
What kind of database are you using?
What kind of database are you using?
ASKER
it's called Absolute Database by ComponentAce. It's a drop-in, standalone replacement for BDE.
http://www.componentace.com/bde_replacement_database_delphi_absolute_database.htm
Over the years, the database has grown to around 3.5 million records and takes up almost 500Mb which is getting to be a problem for distribution. So I am looking for ways to cut every bit and byte I can to save space.
http://www.componentace.com/bde_replacement_database_delphi_absolute_database.htm
Over the years, the database has grown to around 3.5 million records and takes up almost 500Mb which is getting to be a problem for distribution. So I am looking for ways to cut every bit and byte I can to save space.
Are the GUID values being generated by the database or your application?
If you're using an aftGuid field type, then the database is probably already storing the values in 16-byte chunks -- no savings possible.
Are your concerns the distribution footprint of the RAM footprint at run time?
Are your concerns the distribution footprint of the RAM footprint at run time?
ASKER
It's not a GUID, it's an MD5 so it's just stored as a fixed-length string. The MD5s are generated by an app that I use to produce the database.
what data are you hashing?
ASKER
The hash is from data files. Could be any file really.
I tried converting the string into 4 largints in the database but that caused it grow from 500Mb to 700Mb.
I tried converting the string into 4 largints in the database but that caused it grow from 500Mb to 700Mb.
what is the field type you are storing the MD5 hash values
ASKER
The field type is string
ASKER CERTIFIED SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
What volume of data are you compressing? (how many strings)