• Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 546
  • Last Modified:

Determining a Compression Algorithm to create a decompression routine

I have a block of data that is stored by a database application.  I can access the data in the native application environment with no issues.  The application allows me to store the data in either compressed or uncompressed format.  Over the years, I have become intricately familiar with the uncompressed format that this application uses, and have written several file-recovery tools to help repair corrupted files.  This is the easy part.

What I am now trying to do is to interpret the raw data stream for a compressed file -- so that I can update the repair tool and be able to repair and recover compressed data whenever possible.  I have attached two files (ignore the txt extension -- both are binary files and should be viewed with a hex viewer), one which shows the original, uncompressed data, and one which shows the compressed data.  I am now trying to determine HOW this data was compressed, and to determine a decompression algorithm if possible.

I originally assumed a simple RLE compression, but this seems not to be the case.  Note that the "t" at the end of "Vincent" is present, but the "t" at the end of "Street" is missing, so it's got to me more complicated than this.  Further, after the zip code we see the string "31 32 30 31 00 00" gets actually increased in length from 6 to 7 bytes as "24 09 00 05 31 00 00" before the subsequent field.
uncompressed.txt
compressed.txt
0
Bill Bach
Asked:
Bill Bach
  • 11
  • 9
2 Solutions
 
thehagmanCommented:
It appears that the compression consists of sequences of either literal text (that is what is still readable) or backreferences to previous strings from where the next few characters should be copied.
As of now, I am unsure
- how the switching between literal and copy mode are encoded
- how the offset and length od the string to copy are encoded.
Probably there is also some "repaeat" command as for RLE encodng
After "Vincent", we see a backreference to a previous long strnig of 00's or such a repeat command.
The missing "t" in "Street" is in fact part of a backref to "t followed by many 00's" as previously seen at the end of "Vincent".
The compression may sometimes be unnecessarily aggressive, e.g. the three char sequence "120" in "78726-1201" seems to have been replaced by a backref into "11204 Antler Lane", but the reference takes up 4 bytes.
Also, " St" as in "United States" seems to have been replaced by a backref into "7th Street".
On the other hand, other backreferences seem to be very efficient (e.g. "th Street plus many 00's")
You may try to determine the magic relation between the bytes replacing such a backreference and the original offset of the first occurance of the longest match
0
 
Bill BachPresidentAuthor Commented:
Much of what I had already discerned, though I admit that I didn't catch the " St" backreference in my first cut.  

Any further suggestions as to how to decompress the data, then?  After my original post, I tried stripping out the  header data amd using 7ZIp to archive the original data block in a number of different methods, but none seemed to work.  I am hoping to find some sort of algorithm, preferably public domain, that can be used to decompress this.  I can certainly build my own, if I need to, but was hoping to avoid reinventing the wheel if I could.
0
 
aikimarkCommented:
What is the database application?
0
Get quick recovery of individual SharePoint items

Free tool – Veeam Explorer for Microsoft SharePoint, enables fast, easy restores of SharePoint sites, documents, libraries and lists — all with no agents to manage and no additional licenses to buy.

 
Bill BachPresidentAuthor Commented:
Shouldn't really matter, but it is PSQLv10.
0
 
aikimarkCommented:
so...Pervasive PSQL v10

I've done some research with their online documentation, and it doesn't reveal the algorithm.
0
 
Bill BachPresidentAuthor Commented:
Honestly, I am considered one of the foremost authorities on PSQL.  I can assure you that it is not yet documented.
I have requested info from Pervasive, and they have not been able to provide anything to date.
That is why I have posted here. I am updating my file recovery tools to work with page compressed files, which is why I am looking for a way to find the algorithm.
My guess was that they'd used a PD or open-source algorithm, but I coulf be wrong.
 
0
 
aikimarkCommented:
Can you tell us if both row and page compression are enabled?

In addition to RLE, reference to strings might account for the missing "t", since it precedes space characters.
0
 
Bill BachPresidentAuthor Commented:
Only page compression. No record compression enabled. I thinl
We determined earlier that a string copy backreference was being used.
Is this a standard or public domain compression that we can figure out?

0
 
aikimarkCommented:
@BillBach

Back in 2003, when you wrote:
"Often, a thorough understanding of the data structures and the
compression algorithm can lead to a way to repair the data."
In http://www.dbmonster.com/Uwe/Forum.aspx/pervasive/154/Error-54-The-variable-length-portion-of-the-record-is-corrupt

I assumed that you had already figured this out.

==========
I'm leaning toward this being a version of the Deflate algorithm.  This algorithm definition allows for back references as far back as 32k.  Part of the Deflate algorithm implements Lipel-Ziv.  I don't think that this Pervasive algorithm implements any of the Huffman or Welch compression techniques (gut feeling).  Increases in length may be the result of the variable-length segmented format.

This is consistent with the Pervasive documentation that page compression works best with mostly read activity.

Have you seen any back references that extend outside of the current block?
0
 
Bill BachPresidentAuthor Commented:
Page compression was added with PSQLv10, which was released only in 2007, so this is a new development. Until recently, developers hadn't started using page compression, so it is only recently that server crashes have resulted in corrupted data.

I will start by finding all of the L-Z treatises that I can and reading up on them until I understand the entire algorithm. What I do know is that each page is its own compression block. As such, since the maximum logical page si2 is currently 16K, we don't have to worry about back-references farther than that. This info may lead to an understanding of the number of bits required for a backref, and may yield some new clues.

Give me a week to work on this some more, unless you can see a shortcut.  
0
 
Bill BachPresidentAuthor Commented:
Making some progress, perhaps?  Would like another opinion on these latest developments.

Here's three data buffers.  The first is the uncompressed data.  The second is the compressed data for the first block, as handled by the unknown compression format.  The third is the compressed data, compressed by "FastLZ", and open source tool found here: http://www.fastlz.org/build.htm

They are VERY close, but not quite exact yet.  Obviously, it is possible that the developer put his own spin on the code in some way, and I'm just looking for another pair of eyes to see if anything jumps to mind.  

Also useful would be if anyone has any other LZ compression systems that might be good candidates to check this against.


Uncompressed:

00022000: 13 00 00 00 44 00 00 00 08 80 01 00 09 03 CF 07   ....D.........O.
00022010: 00 00 00 00 00 56 69 6E 63 65 6E 74 00 00 00 00   .....Vincent....
00022020: 00 00 00 00 00 00 43 69 76 61 72 65 6C 6C 69 00   ......Civarelli.
00022030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ................
00022040: 00 34 31 30 30 20 4E 57 20 37 74 68 20 53 74 72   .4100 NW 7th Str
00022050: 65 65 74 00 00 00 00 00 00 00 00 00 00 00 00 00   eet.............
00022060: 00 48 69 61 6C 65 61 68 00 00 00 00 00 00 00 00   .Hialeah........
00022070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ................
00022080: 00 46 4C 00 00 33 33 30 31 32 00 00 00 00 00 00   .FL..33012......
00022090: 00 55 53 41 00 00 00 00 00 00 00 00 00 00 00 00   .USA............
000220A0: 00 00 00 00 00 00 00 31 31 32 30 34 20 41 6E 74   .......11204 Ant
000220B0: 6C 65 72 20 4C 61 6E 65 00 00 00 00 00 00 00 00   ler Lane........
000220C0: 00 00 00 00 00 00 00 41 75 73 74 69 6E 00 00 00   .......Austin...
000220D0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ................
000220E0: 00 00 00 00 00 00 00 54 58 00 00 37 38 37 32 36   .......TX..78726
000220F0: 2D 31 32 30 31 00 00 05 12 55 52 82 0F 00 33 30   -1201....UR...30
00022100: 35 2D 35 35 35 2D 34 37 31 36 20 20 20 20 20 20   5-555-4716
00022110: 20 20 01 00 05 09 B5 07 00 56 43 49 56 41 52 45     ....5..VCIVARE
00022120: 4C 20 40 42 54 55 2E 45 44 55 00 00 00 00 00 00   L @BTU.EDU......
00022130: 00 00 00 00 00 00 00 00 01 00 55 6E 69 74 65 64   ..........United
00022140: 20 53 74 61 74 65 73 00 00 00 00 00 00 00 00 00    States.........
00022150: 00 00 00 00 00 01 00 00 00 00 00 00 00 00 00 00   ................
00022160: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ................
00022170: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ................
00022180: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ................
00022190: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ................
000221A0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ................
000221B0: 00 00 00 00 00 01 00 39 46 D2 07 00 00 00 00 00   .......9FR......
000221C0: 43 68 61 00 00 00 00 00 00 00 00 00 00 00 00 00   Cha.............
000221D0: 00 4B 69 6D 00 00 00 00 00 00 00 00 00 00 00 00   .Kim............
000221E0: 00 00 00 00 00 00 00 00 00 00 00 00 36 32 33 33   ............6233
000221F0: 20 4E 20 31 36 74 68 20 53 74 72 65 65 74 00 00    N 16th Street..
00022200: 00 00 00 00 00 00 00 00 00 00 00 00 42 6F 73 74   ............Bost
00022210: 6F 6E 00 00 00 00 00 00 00 00 00 00 00 00 00 00   on..............
00022220: 00 00 00 00 00 00 00 00 00 00 00 00 4D 41 00 00   ............MA..
00022230: 30 32 31 31 38 2D 31 36 30 39 00 00 55 53 41 00   02118-1609..USA.
00022240: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ................
00022250: 00 00 32 31 31 36 20 57 20 31 32 74 68 20 53 74   ..2116 W 12th St
00022260: 72 65 65 74 00 00 00 00 00 00 00 00 00 00 00 00   reet............
00022270: 00 00 41 75 73 74 69 6E 00 00 00 00 00 00 00 00   ..Austin........
00022280: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ................
00022290: 00 00 54 58 00 00 37 38 37 30 33 2D 33 38 35 34   ..TX..78703-3854
000222A0: 00 00 05 12 55 55 56 9F 00 36 31 37 2D 35 35 35   ....UUV..617-555
000222B0: 2D 39 38 32 38 20 20 20 20 20 20 20 20 00 00 06   -9828        ...
000222C0: 0A 94 07 00 43 4B 49 4D 20 40 42 54 55 2E 45 44   ....CKIM @BTU.ED
000222D0: 55 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   U...............
000222E0: 00 00 00 01 00 54 61 69 77 61 6E 00 00 00 00 00   .....Taiwan.....
000222F0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ................
00022300: 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ................
00022310: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ................
00022320: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ................
00022330: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ................
00022340: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ................
00022350: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ................
00022360: 01 00 42 AA D3 07 00 00 00 00 00 50 68 69 6C 00   ..B*S......Phil.
00022370: 00 00 00 00 00 00 00 00 00 00 00 00 42 75 74 74   ............Butt
00022380: 61 66 75 6F 63 6F 00 00 00 00 00 00 00 00 00 00   afuoco..........
00022390: 00 00 00 00 00 00 00 31 31 35 31 33 20 41 6E 74   .......11513 Ant
000223A0: 69 67 75 61 20 44 72 69 76 65 00 00 00 00 00 00   igua Drive......
000223B0: 00 00 00 00 00 00 00 41 75 64 75 62 6F 6E 00 00   .......Audubon..
000223C0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ................
000223D0: 00 00 00 00 00 00 00 49 41 00 00 35 30 30 32 35   .......IA..50025
000223E0: 2D 31 34 31 31 00 00 55 53 41 00 00 00 00 00 00   -1411..USA......
000223F0: 00 00 00 00 00 00 00 00 00 00 00 00 00 31 36 30   .............160


Compressed with Unknown:

00013400: 13 00 00 00 44 00 01 00 35 49 FE E9 63 F1 79 C0   ....D...5I~icqy@
00013410: F3 05 0E 13 00 00 00 44 00 00 00 08 80 01 00 09   s......D........
00013420: 03 CF 07 00 44 00 04 56 69 6E 63 65 6E 74 4C 01   .O..D..VincentL.
00013430: 90 00 06 43 69 76 61 72 65 6C 6C 69 9C 01 13 18   ...Civarelli....
00013440: 00 0E 34 31 30 30 20 4E 57 20 37 74 68 20 53 74   ..4100 NW 7th St
00013450: 72 65 65 12 DC 00 40 05 04 48 69 61 6C 65 61 68   ree.\.@..Hialeah
00013460: 4C 01 1C 10 00 06 46 4C 00 00 33 33 30 31 32 BB   L.....FL..33012;
00013470: 03 55 53 41 A8 01 13 1C 00 0E 31 31 32 30 34 20   .USA(.....11204
00013480: 41 6E 74 6C 65 72 20 4C 61 6E 65 13 74 00 30 01   Antler Lane.t.0.
00013490: 03 41 75 73 74 69 6E 24 01 1E 0C 00 07 54 58 00   .Austin$.....TX.
000134A0: 00 37 38 37 32 36 2D 24 09 00 05 31 00 00 05 12   .78726-$...1....
000134B0: 55 52 82 0F 00 33 30 35 2D 35 35 35 2D 34 37 31   UR...305-555-471
000134C0: 36 20 A4 00 00 06 01 00 05 09 B5 07 00 56 43 49   6 $.......5..VCI
000134D0: 56 41 52 45 4C 20 40 42 54 55 2E 45 44 55 15 68   VAREL @BTU.EDU.h
000134E0: 01 05 01 00 55 6E 69 74 65 64 30 1E 01 61 74 65   ....United0..ate
000134F0: 73 17 74 00 15 40 00 10 38 38 00 03 01 00 39 46   s.t..@..88....9F
00013500: D2 07 7B 0A 43 68 61 60 01 F7 00 4B 69 6D F0 01   R.{.Cha`.w.Kimp.
00013510: 16 24 00 06 36 32 33 33 20 4E 20 31 36 1E AC 06   .$..6233 N 16.,.
00013520: 02 42 6F 73 74 6F 10 03 15 05 4D 28 33 09 30 32   .Bosto....M(3.02
00013530: 31 31 38 2D 31 36 30 39 00 00 1D AC 06 06 32 31   118-1609...,..21
00013540: 31 36 20 57 20 31 32 1E 98 01 10 0F AC 06 04 30   16 W 12.....,..0
00013550: 33 2D 33 38 35 34 6C 35 04 55 56 9F 00 36 31 37   3-3854l5.UV..617
00013560: 6C 35 01 39 38 32 38 A8 35 09 20 00 00 06 0A 94   l5.9828(5. .....
00013570: 07 00 43 4B 49 4D 1E 9D 06 00 24 00 04 01 00 54   ..CKIM....$....T
00013580: 61 69 77 61 1D 65 03 01 24 04 10 44 0C 00 02 01   aiwa.e..$..D....
00013590: 00 42 AA D3 8C 35 01 50 68 69 6C 14 AC 01 07 42   .B*S.5.Phil.,..B
000135A0: 75 74 74 61 66 75 6F 63 6F 14 5C 00 54 01 02 31   uttafuoco.\.T..1
000135B0: 31 35 31 33 40 5E 06 69 67 75 61 20 44 72 69 76   1513@^.igua Driv
000135C0: 15 C8 0B 02 41 75 64 75 62 10 03 B1 06 49 2C 35   .H..Audub..1.I,5
000135D0: 06 35 30 30 32 35 2D 31 34 31 20 5E 1D AC 06 08   .50025-141 ^.,..
000135E0: 31 36 30                                          160


Compressed with FastLZ:

00013412: 0E 13 00 00 00 44 00 00 00 08 80 01 00 09 03 CF   .....D.........O
00013422: 07 00 44 00 04 56 69 6E 63 65 6E 74 4C 01 90 00   ..D..VincentL...
00013432: 06 43 69 76 61 72 65 6C 6C 69 9C 01 13 18 00 0E   .Civarelli......
00013442: 34 31 30 30 20 4E 57 20 37 74 68 20 53 74 72 65   4100 NW 7th Stre
00013452: 65 12 DC 00 40 05 04 48 69 61 6C 65 61 68 4C 01   e.\.@..HialeahL.
00013462: 1C 10 00 06 46 4C 00 00 33 33 30 31 32 BB 03 55   ....FL..33012;.U
00013472: 53 41 A8 01 13 1C 00 0E 31 31 32 30 34 20 41 6E   SA(.....11204 An
00013482: 74 6C 65 72 20 4C 61 6E 65 13 74 00 30 01 03 41   tler Lane.t.0..A
00013492: 75 73 74 69 6E 24 01 1E 0C 00 07 54 58 00 00 37   ustin$.....TX..7
000134A2: 38 37 32 36 2D 24 09 00 05 31 00 00 05 12 55 52   8726-$...1....UR
000134B2: 82 0F 00 33 30 35 2D 35 35 35 2D 34 37 31 36 20   ...305-555-4716
000134C2: A4 00 00 06 01 00 05 09 B5 07 00 56 43 49 56 41   $.......5..VCIVA
000134D2: 52 45 4C 20 40 42 54 55 2E 45 44 55 15 68 01 05   REL @BTU.EDU.h..
000134E2: 01 00 55 6E 69 74 65 64 30 1E 01 61 74 65 73 17   ..United0..ates.
000134F2: 74 00 15 40 00 10 38 38 00 03 01 00 39 46 D2 07   t..@..88....9FR.
00013502: 7B 0A 43 68 61 60 01 F7 00 4B 69 6D F0 01 16 24   {.Cha`.w.Kimp..$
00013512: 00 06 36 32 33 33 20 4E 20 31 36 1E AC 06 02 42   ..6233 N 16.,..B
00013522: 6F 73 74 6F 10 03 15 05 4D 28 33 09 30 32 31 31   osto....M(3.0211
00013532: 38 2D 31 36 30 39 00 00 1D AC 06 06 32 31 31 36   8-1609...,..2116
00013542: 20 57 20 31 32 1E 98 01 10 0F AC 06 04 30 33 2D    W 12.....,..03-
00013552: 33 38 35 34 6C 35 04 55 56 9F 00 36 31 37 6C 35   3854l5.UV..617l5
00013562: 01 39 38 32 38 A8 35 09 20 00 00 06 0A 94 07 00   .9828(5. .......
00013572: 43 4B 49 4D 1E 9D 06 00 24 00 04 01 00 54 61 69   CKIM....$....Tai
00013582: 77 61 1D 65 03 01 24 04 10 44 0C 00 02 01 00 42   wa.e..$..D.....B
00013592: AA D3 8C 35 01 50 68 69 6C 14 AC 01 07 42 75 74   *S.5.Phil.,..But
000135A2: 74 61 66 75 6F 63 6F 14 5C 00 54 01 02 31 31 35   tafuoco.\.T..115
000135B2: 31 33 40 5E 06 69 67 75 61 20 44 72 69 76 15 C8   13@^.igua Driv.H
000135C2: 0B 02 41 75 64 75 62 10 03 B1 06 49 2C 35 06 35   ..Audub..1.I,5.5
000135D2: 30 30 32 35 2D 31 34 31 20 5E 1D AC 06 08 31 36   0025-141 ^.,..16
000135E2: 30                                                0

Open in new window

0
 
Bill BachPresidentAuthor Commented:
Correction -- I messed up the PASTE on that.  Instead, here's the three examples as text files.

CompressedFastLZ.txt
CompressedUnknown.txt
Uncompressed.txt
0
 
aikimarkCommented:
Damned near close enough to be twins.  I think your hunch is correct.  This is probably a custom implementation of LZ.

I've got some guesses that might start the process.

00013412: 0E
Length of uncompressed 14 byte "header"

00013452: 65 12 DC 00 40
65: last 'e' in Street
12: back reference
DC: first three bits are unknown masked with 1C, which is the backwards distance to the 't' in Vincent.
00 4: add four zeros to compliment the first name zero pad
0
 
Bill BachPresidentAuthor Commented:
Actually, the "header" is really just 0x12 bytes of uncompressed data.  Seems that this algorithm is storing the "bytes - 3" value -- but not in every case.  

Ugh.  I've now tried open source versions of FastLZ and QuickLZ (both found with Google), as well as two separate LZ77 algorithms (one from the SourceForge Basic COmpression Library).  Although all are close, none provide the exact same compression buffer.  Further, there is actually a block of 10 bytes in the beginning of the compression buffer (look at the original two data files posted) that is still unidentified.  (What's funny is that these open source algorithms usually provide a smaller result set than the one that they are using.)

Do you know of any more open source algorithms that might be worth trying out?  If not, then the only answer is likely to be rolling up my sleeves and digging into the open-source LZ implementations to such a level that I fully understand it and to where I can make better interpretations as to what they are doing, then engineer a suitable decompression function...  I don't relish THAT idea...
0
 
aikimarkCommented:
I've looked at some of the LZ specs and don't see any other path for you to follow than to roll up your sleeves.  I'll see if I might tweak the zone list for this question.

The one thing I don't see in the three sets of data you posted is Huffman coding.
0
 
Bill BachPresidentAuthor Commented:
Right -- no Huffman or any other coding to reduce the symbol set.  This makes it easier.  

I have started hacking the FastLZ algorithm (which seems to offer the closest starting point).  If you don't mind, I'll keep this open a bit longer to see if anything else develops.
0
 
Bill BachPresidentAuthor Commented:
This was likely an impossible question, and I wanted to award points to those who helped, even though a final solution still eludes us.
0
 
aikimarkCommented:
I would prepare a nice 'gift' and visit the Pervasive offices, asking for help on this.

Good luck
0
 
Bill BachPresidentAuthor Commented:
Already tried. They claimed IP and turned it down.
0
 
aikimarkCommented:
What if you signed an NDA and offered to compensate them for the IP?
0
 
aikimarkCommented:
Here's an idea...if they decline that offer, then offer that amount as a prize to the winner of a contest to define this compression algorithm.  It would be your version of an X-prize
0

Featured Post

Concerto's Cloud Advisory Services

Want to avoid the missteps to gaining all the benefits of the cloud? Learn more about the different assessment options from our Cloud Advisory team.

  • 11
  • 9
Tackle projects and never again get stuck behind a technical roadblock.
Join Now