Link to home
Start Free TrialLog in
Avatar of Bill Bach
Bill BachFlag for United States of America

asked on

Determining a Compression Algorithm to create a decompression routine

I have a block of data that is stored by a database application.  I can access the data in the native application environment with no issues.  The application allows me to store the data in either compressed or uncompressed format.  Over the years, I have become intricately familiar with the uncompressed format that this application uses, and have written several file-recovery tools to help repair corrupted files.  This is the easy part.

What I am now trying to do is to interpret the raw data stream for a compressed file -- so that I can update the repair tool and be able to repair and recover compressed data whenever possible.  I have attached two files (ignore the txt extension -- both are binary files and should be viewed with a hex viewer), one which shows the original, uncompressed data, and one which shows the compressed data.  I am now trying to determine HOW this data was compressed, and to determine a decompression algorithm if possible.

I originally assumed a simple RLE compression, but this seems not to be the case.  Note that the "t" at the end of "Vincent" is present, but the "t" at the end of "Street" is missing, so it's got to me more complicated than this.  Further, after the zip code we see the string "31 32 30 31 00 00" gets actually increased in length from 6 to 7 bytes as "24 09 00 05 31 00 00" before the subsequent field.
uncompressed.txt
compressed.txt
ASKER CERTIFIED SOLUTION
Avatar of Member_2_4694817
Member_2_4694817

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Avatar of Bill Bach

ASKER

Much of what I had already discerned, though I admit that I didn't catch the " St" backreference in my first cut.  

Any further suggestions as to how to decompress the data, then?  After my original post, I tried stripping out the  header data amd using 7ZIp to archive the original data block in a number of different methods, but none seemed to work.  I am hoping to find some sort of algorithm, preferably public domain, that can be used to decompress this.  I can certainly build my own, if I need to, but was hoping to avoid reinventing the wheel if I could.
What is the database application?
Shouldn't really matter, but it is PSQLv10.
so...Pervasive PSQL v10

I've done some research with their online documentation, and it doesn't reveal the algorithm.
Honestly, I am considered one of the foremost authorities on PSQL.  I can assure you that it is not yet documented.
I have requested info from Pervasive, and they have not been able to provide anything to date.
That is why I have posted here. I am updating my file recovery tools to work with page compressed files, which is why I am looking for a way to find the algorithm.
My guess was that they'd used a PD or open-source algorithm, but I coulf be wrong.
 
Can you tell us if both row and page compression are enabled?

In addition to RLE, reference to strings might account for the missing "t", since it precedes space characters.
Only page compression. No record compression enabled. I thinl
We determined earlier that a string copy backreference was being used.
Is this a standard or public domain compression that we can figure out?

@BillBach

Back in 2003, when you wrote:
"Often, a thorough understanding of the data structures and the
compression algorithm can lead to a way to repair the data."
In http://www.dbmonster.com/Uwe/Forum.aspx/pervasive/154/Error-54-The-variable-length-portion-of-the-record-is-corrupt

I assumed that you had already figured this out.

==========
I'm leaning toward this being a version of the Deflate algorithm.  This algorithm definition allows for back references as far back as 32k.  Part of the Deflate algorithm implements Lipel-Ziv.  I don't think that this Pervasive algorithm implements any of the Huffman or Welch compression techniques (gut feeling).  Increases in length may be the result of the variable-length segmented format.

This is consistent with the Pervasive documentation that page compression works best with mostly read activity.

Have you seen any back references that extend outside of the current block?
Page compression was added with PSQLv10, which was released only in 2007, so this is a new development. Until recently, developers hadn't started using page compression, so it is only recently that server crashes have resulted in corrupted data.

I will start by finding all of the L-Z treatises that I can and reading up on them until I understand the entire algorithm. What I do know is that each page is its own compression block. As such, since the maximum logical page si2 is currently 16K, we don't have to worry about back-references farther than that. This info may lead to an understanding of the number of bits required for a backref, and may yield some new clues.

Give me a week to work on this some more, unless you can see a shortcut.  
Making some progress, perhaps?  Would like another opinion on these latest developments.

Here's three data buffers.  The first is the uncompressed data.  The second is the compressed data for the first block, as handled by the unknown compression format.  The third is the compressed data, compressed by "FastLZ", and open source tool found here: http://www.fastlz.org/build.htm

They are VERY close, but not quite exact yet.  Obviously, it is possible that the developer put his own spin on the code in some way, and I'm just looking for another pair of eyes to see if anything jumps to mind.  

Also useful would be if anyone has any other LZ compression systems that might be good candidates to check this against.


Uncompressed:

00022000: 13 00 00 00 44 00 00 00 08 80 01 00 09 03 CF 07   ....D.........O.
00022010: 00 00 00 00 00 56 69 6E 63 65 6E 74 00 00 00 00   .....Vincent....
00022020: 00 00 00 00 00 00 43 69 76 61 72 65 6C 6C 69 00   ......Civarelli.
00022030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ................
00022040: 00 34 31 30 30 20 4E 57 20 37 74 68 20 53 74 72   .4100 NW 7th Str
00022050: 65 65 74 00 00 00 00 00 00 00 00 00 00 00 00 00   eet.............
00022060: 00 48 69 61 6C 65 61 68 00 00 00 00 00 00 00 00   .Hialeah........
00022070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ................
00022080: 00 46 4C 00 00 33 33 30 31 32 00 00 00 00 00 00   .FL..33012......
00022090: 00 55 53 41 00 00 00 00 00 00 00 00 00 00 00 00   .USA............
000220A0: 00 00 00 00 00 00 00 31 31 32 30 34 20 41 6E 74   .......11204 Ant
000220B0: 6C 65 72 20 4C 61 6E 65 00 00 00 00 00 00 00 00   ler Lane........
000220C0: 00 00 00 00 00 00 00 41 75 73 74 69 6E 00 00 00   .......Austin...
000220D0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ................
000220E0: 00 00 00 00 00 00 00 54 58 00 00 37 38 37 32 36   .......TX..78726
000220F0: 2D 31 32 30 31 00 00 05 12 55 52 82 0F 00 33 30   -1201....UR...30
00022100: 35 2D 35 35 35 2D 34 37 31 36 20 20 20 20 20 20   5-555-4716
00022110: 20 20 01 00 05 09 B5 07 00 56 43 49 56 41 52 45     ....5..VCIVARE
00022120: 4C 20 40 42 54 55 2E 45 44 55 00 00 00 00 00 00   L @BTU.EDU......
00022130: 00 00 00 00 00 00 00 00 01 00 55 6E 69 74 65 64   ..........United
00022140: 20 53 74 61 74 65 73 00 00 00 00 00 00 00 00 00    States.........
00022150: 00 00 00 00 00 01 00 00 00 00 00 00 00 00 00 00   ................
00022160: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ................
00022170: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ................
00022180: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ................
00022190: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ................
000221A0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ................
000221B0: 00 00 00 00 00 01 00 39 46 D2 07 00 00 00 00 00   .......9FR......
000221C0: 43 68 61 00 00 00 00 00 00 00 00 00 00 00 00 00   Cha.............
000221D0: 00 4B 69 6D 00 00 00 00 00 00 00 00 00 00 00 00   .Kim............
000221E0: 00 00 00 00 00 00 00 00 00 00 00 00 36 32 33 33   ............6233
000221F0: 20 4E 20 31 36 74 68 20 53 74 72 65 65 74 00 00    N 16th Street..
00022200: 00 00 00 00 00 00 00 00 00 00 00 00 42 6F 73 74   ............Bost
00022210: 6F 6E 00 00 00 00 00 00 00 00 00 00 00 00 00 00   on..............
00022220: 00 00 00 00 00 00 00 00 00 00 00 00 4D 41 00 00   ............MA..
00022230: 30 32 31 31 38 2D 31 36 30 39 00 00 55 53 41 00   02118-1609..USA.
00022240: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ................
00022250: 00 00 32 31 31 36 20 57 20 31 32 74 68 20 53 74   ..2116 W 12th St
00022260: 72 65 65 74 00 00 00 00 00 00 00 00 00 00 00 00   reet............
00022270: 00 00 41 75 73 74 69 6E 00 00 00 00 00 00 00 00   ..Austin........
00022280: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ................
00022290: 00 00 54 58 00 00 37 38 37 30 33 2D 33 38 35 34   ..TX..78703-3854
000222A0: 00 00 05 12 55 55 56 9F 00 36 31 37 2D 35 35 35   ....UUV..617-555
000222B0: 2D 39 38 32 38 20 20 20 20 20 20 20 20 00 00 06   -9828        ...
000222C0: 0A 94 07 00 43 4B 49 4D 20 40 42 54 55 2E 45 44   ....CKIM @BTU.ED
000222D0: 55 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   U...............
000222E0: 00 00 00 01 00 54 61 69 77 61 6E 00 00 00 00 00   .....Taiwan.....
000222F0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ................
00022300: 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ................
00022310: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ................
00022320: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ................
00022330: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ................
00022340: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ................
00022350: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ................
00022360: 01 00 42 AA D3 07 00 00 00 00 00 50 68 69 6C 00   ..B*S......Phil.
00022370: 00 00 00 00 00 00 00 00 00 00 00 00 42 75 74 74   ............Butt
00022380: 61 66 75 6F 63 6F 00 00 00 00 00 00 00 00 00 00   afuoco..........
00022390: 00 00 00 00 00 00 00 31 31 35 31 33 20 41 6E 74   .......11513 Ant
000223A0: 69 67 75 61 20 44 72 69 76 65 00 00 00 00 00 00   igua Drive......
000223B0: 00 00 00 00 00 00 00 41 75 64 75 62 6F 6E 00 00   .......Audubon..
000223C0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ................
000223D0: 00 00 00 00 00 00 00 49 41 00 00 35 30 30 32 35   .......IA..50025
000223E0: 2D 31 34 31 31 00 00 55 53 41 00 00 00 00 00 00   -1411..USA......
000223F0: 00 00 00 00 00 00 00 00 00 00 00 00 00 31 36 30   .............160


Compressed with Unknown:

00013400: 13 00 00 00 44 00 01 00 35 49 FE E9 63 F1 79 C0   ....D...5I~icqy@
00013410: F3 05 0E 13 00 00 00 44 00 00 00 08 80 01 00 09   s......D........
00013420: 03 CF 07 00 44 00 04 56 69 6E 63 65 6E 74 4C 01   .O..D..VincentL.
00013430: 90 00 06 43 69 76 61 72 65 6C 6C 69 9C 01 13 18   ...Civarelli....
00013440: 00 0E 34 31 30 30 20 4E 57 20 37 74 68 20 53 74   ..4100 NW 7th St
00013450: 72 65 65 12 DC 00 40 05 04 48 69 61 6C 65 61 68   ree.\.@..Hialeah
00013460: 4C 01 1C 10 00 06 46 4C 00 00 33 33 30 31 32 BB   L.....FL..33012;
00013470: 03 55 53 41 A8 01 13 1C 00 0E 31 31 32 30 34 20   .USA(.....11204
00013480: 41 6E 74 6C 65 72 20 4C 61 6E 65 13 74 00 30 01   Antler Lane.t.0.
00013490: 03 41 75 73 74 69 6E 24 01 1E 0C 00 07 54 58 00   .Austin$.....TX.
000134A0: 00 37 38 37 32 36 2D 24 09 00 05 31 00 00 05 12   .78726-$...1....
000134B0: 55 52 82 0F 00 33 30 35 2D 35 35 35 2D 34 37 31   UR...305-555-471
000134C0: 36 20 A4 00 00 06 01 00 05 09 B5 07 00 56 43 49   6 $.......5..VCI
000134D0: 56 41 52 45 4C 20 40 42 54 55 2E 45 44 55 15 68   VAREL @BTU.EDU.h
000134E0: 01 05 01 00 55 6E 69 74 65 64 30 1E 01 61 74 65   ....United0..ate
000134F0: 73 17 74 00 15 40 00 10 38 38 00 03 01 00 39 46   s.t..@..88....9F
00013500: D2 07 7B 0A 43 68 61 60 01 F7 00 4B 69 6D F0 01   R.{.Cha`.w.Kimp.
00013510: 16 24 00 06 36 32 33 33 20 4E 20 31 36 1E AC 06   .$..6233 N 16.,.
00013520: 02 42 6F 73 74 6F 10 03 15 05 4D 28 33 09 30 32   .Bosto....M(3.02
00013530: 31 31 38 2D 31 36 30 39 00 00 1D AC 06 06 32 31   118-1609...,..21
00013540: 31 36 20 57 20 31 32 1E 98 01 10 0F AC 06 04 30   16 W 12.....,..0
00013550: 33 2D 33 38 35 34 6C 35 04 55 56 9F 00 36 31 37   3-3854l5.UV..617
00013560: 6C 35 01 39 38 32 38 A8 35 09 20 00 00 06 0A 94   l5.9828(5. .....
00013570: 07 00 43 4B 49 4D 1E 9D 06 00 24 00 04 01 00 54   ..CKIM....$....T
00013580: 61 69 77 61 1D 65 03 01 24 04 10 44 0C 00 02 01   aiwa.e..$..D....
00013590: 00 42 AA D3 8C 35 01 50 68 69 6C 14 AC 01 07 42   .B*S.5.Phil.,..B
000135A0: 75 74 74 61 66 75 6F 63 6F 14 5C 00 54 01 02 31   uttafuoco.\.T..1
000135B0: 31 35 31 33 40 5E 06 69 67 75 61 20 44 72 69 76   1513@^.igua Driv
000135C0: 15 C8 0B 02 41 75 64 75 62 10 03 B1 06 49 2C 35   .H..Audub..1.I,5
000135D0: 06 35 30 30 32 35 2D 31 34 31 20 5E 1D AC 06 08   .50025-141 ^.,..
000135E0: 31 36 30                                          160


Compressed with FastLZ:

00013412: 0E 13 00 00 00 44 00 00 00 08 80 01 00 09 03 CF   .....D.........O
00013422: 07 00 44 00 04 56 69 6E 63 65 6E 74 4C 01 90 00   ..D..VincentL...
00013432: 06 43 69 76 61 72 65 6C 6C 69 9C 01 13 18 00 0E   .Civarelli......
00013442: 34 31 30 30 20 4E 57 20 37 74 68 20 53 74 72 65   4100 NW 7th Stre
00013452: 65 12 DC 00 40 05 04 48 69 61 6C 65 61 68 4C 01   e.\.@..HialeahL.
00013462: 1C 10 00 06 46 4C 00 00 33 33 30 31 32 BB 03 55   ....FL..33012;.U
00013472: 53 41 A8 01 13 1C 00 0E 31 31 32 30 34 20 41 6E   SA(.....11204 An
00013482: 74 6C 65 72 20 4C 61 6E 65 13 74 00 30 01 03 41   tler Lane.t.0..A
00013492: 75 73 74 69 6E 24 01 1E 0C 00 07 54 58 00 00 37   ustin$.....TX..7
000134A2: 38 37 32 36 2D 24 09 00 05 31 00 00 05 12 55 52   8726-$...1....UR
000134B2: 82 0F 00 33 30 35 2D 35 35 35 2D 34 37 31 36 20   ...305-555-4716
000134C2: A4 00 00 06 01 00 05 09 B5 07 00 56 43 49 56 41   $.......5..VCIVA
000134D2: 52 45 4C 20 40 42 54 55 2E 45 44 55 15 68 01 05   REL @BTU.EDU.h..
000134E2: 01 00 55 6E 69 74 65 64 30 1E 01 61 74 65 73 17   ..United0..ates.
000134F2: 74 00 15 40 00 10 38 38 00 03 01 00 39 46 D2 07   t..@..88....9FR.
00013502: 7B 0A 43 68 61 60 01 F7 00 4B 69 6D F0 01 16 24   {.Cha`.w.Kimp..$
00013512: 00 06 36 32 33 33 20 4E 20 31 36 1E AC 06 02 42   ..6233 N 16.,..B
00013522: 6F 73 74 6F 10 03 15 05 4D 28 33 09 30 32 31 31   osto....M(3.0211
00013532: 38 2D 31 36 30 39 00 00 1D AC 06 06 32 31 31 36   8-1609...,..2116
00013542: 20 57 20 31 32 1E 98 01 10 0F AC 06 04 30 33 2D    W 12.....,..03-
00013552: 33 38 35 34 6C 35 04 55 56 9F 00 36 31 37 6C 35   3854l5.UV..617l5
00013562: 01 39 38 32 38 A8 35 09 20 00 00 06 0A 94 07 00   .9828(5. .......
00013572: 43 4B 49 4D 1E 9D 06 00 24 00 04 01 00 54 61 69   CKIM....$....Tai
00013582: 77 61 1D 65 03 01 24 04 10 44 0C 00 02 01 00 42   wa.e..$..D.....B
00013592: AA D3 8C 35 01 50 68 69 6C 14 AC 01 07 42 75 74   *S.5.Phil.,..But
000135A2: 74 61 66 75 6F 63 6F 14 5C 00 54 01 02 31 31 35   tafuoco.\.T..115
000135B2: 31 33 40 5E 06 69 67 75 61 20 44 72 69 76 15 C8   13@^.igua Driv.H
000135C2: 0B 02 41 75 64 75 62 10 03 B1 06 49 2C 35 06 35   ..Audub..1.I,5.5
000135D2: 30 30 32 35 2D 31 34 31 20 5E 1D AC 06 08 31 36   0025-141 ^.,..16
000135E2: 30                                                0

Open in new window

Correction -- I messed up the PASTE on that.  Instead, here's the three examples as text files.

CompressedFastLZ.txt
CompressedUnknown.txt
Uncompressed.txt
SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Actually, the "header" is really just 0x12 bytes of uncompressed data.  Seems that this algorithm is storing the "bytes - 3" value -- but not in every case.  

Ugh.  I've now tried open source versions of FastLZ and QuickLZ (both found with Google), as well as two separate LZ77 algorithms (one from the SourceForge Basic COmpression Library).  Although all are close, none provide the exact same compression buffer.  Further, there is actually a block of 10 bytes in the beginning of the compression buffer (look at the original two data files posted) that is still unidentified.  (What's funny is that these open source algorithms usually provide a smaller result set than the one that they are using.)

Do you know of any more open source algorithms that might be worth trying out?  If not, then the only answer is likely to be rolling up my sleeves and digging into the open-source LZ implementations to such a level that I fully understand it and to where I can make better interpretations as to what they are doing, then engineer a suitable decompression function...  I don't relish THAT idea...
I've looked at some of the LZ specs and don't see any other path for you to follow than to roll up your sleeves.  I'll see if I might tweak the zone list for this question.

The one thing I don't see in the three sets of data you posted is Huffman coding.
Right -- no Huffman or any other coding to reduce the symbol set.  This makes it easier.  

I have started hacking the FastLZ algorithm (which seems to offer the closest starting point).  If you don't mind, I'll keep this open a bit longer to see if anything else develops.
This was likely an impossible question, and I wanted to award points to those who helped, even though a final solution still eludes us.
I would prepare a nice 'gift' and visit the Pervasive offices, asking for help on this.

Good luck
Already tried. They claimed IP and turned it down.
What if you signed an NDA and offered to compensate them for the IP?
Here's an idea...if they decline that offer, then offer that amount as a prize to the winner of a contest to define this compression algorithm.  It would be your version of an X-prize