Compressing Similar Directory Structures

I'm trying to compress multiple directory structures that hold the same data with small changes. When I compress them together using 7-Zip's Ultra compression, the size of the archive is only slightly larger than the archive of any one directory structure. But if I add another directory structure to an existing archive, the size of the resulting archive is almost double, not just slightly larger. Why is this happening? Is there a setting I can change or a program I can use to avoid this?
zzzy Asked:
 
hathehariken Commented:
Deflation-style algorithms do not work well with data that is already encapsulated (compressed); their pattern-matching stage will find little to match, in my opinion.

If you can disclose the information, I would like to know the final result you are trying to achieve.
 
BillDL Commented:
Are you doing this from the command line (e.g. a batch file) or using the program's interface?

If it is a batch file, could you perhaps provide a sample of it so we can try to see what is happening?
 
zzzy Author Commented:
I'm eventually planning to do it through a batch file, but right now I just open the 7-Zip File Manager and manually create or add to an archive. For settings I use Ultra, LZMA2 and 6 CPU threads.
 
hathehariken Commented:
Try using the WinRAR command line.

For parameters, use update, replace, and best compression.
 
BillDL Commented:
Hi zzzy. I am really sorry that I never returned to this question. I did a massive reorganisation of my emails and unfortunately the one I flagged to return here was filed away and I forgot.

I did try to replicate your scenario using 7-Zip shortly after you came back with your follow-up, but I was not able to get the same results as you were getting. When I added new content to my archive, it used the Ultra compression again and didn't enlarge the existing file any more than I expected.

One thing that I did notice in the 7-Zip help file (File Manager > Plugins > 7-Zip > "Add to archive dialog box") was that it states this for the "LZMA2" compression method:

"It provides better multithreading support than LZMA. But compression ratio can be worse in some cases. For best compression ratio with LZMA2 use 1 or 2 CPU threads. If you use LZMA2 with more than 2 threads, 7-zip splits data to chunks and compresses these chunks independently (2 threads per each chunk)."

I tried various permutations of dictionary size, etc., but could still not replicate your issue.
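The chunk-splitting behaviour described in that quote can be illustrated with Python's stdlib `lzma` module (a minimal sketch on synthetic data, not 7-Zip itself): compressing two nearly identical halves in one stream lets the matcher reference the first half while encoding the second, whereas compressing them as independent chunks, as LZMA2 does with more than 2 threads, loses the cross-chunk similarity.

```python
import lzma
import os

# Two nearly identical 256 KiB halves of synthetic, incompressible data.
base = os.urandom(256 * 1024)
half_a = base
half_b = base[:-16] + os.urandom(16)  # same as half_a apart from 16 bytes

# One stream: while encoding half_b the matcher can still "see" half_a,
# so the second half costs almost nothing.
together = len(lzma.compress(half_a + half_b))

# Independent chunks (what LZMA2 does with >2 threads): each chunk is
# compressed with no knowledge of the other, so the similarity is lost.
split = len(lzma.compress(half_a)) + len(lzma.compress(half_b))

print(together, split)
```

On a typical run `together` comes out at roughly half of `split`, mirroring the near-doubling described in the original question.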

Bill
 
zzzy Author Commented:
Hi Bill,

The folder size is approximately 900 MB (500 MB compressed). I tried using just 1 thread but I'm getting the same result. I have around ten such directories. When compressed together, the resulting archive is just 700 MB. But if added one by one, each adds approximately 500 MB to the archive.

Thanks
SG
 
zzzy Author Commented:
I've repeated the same test with smaller directory structures of 100 MB with a compression ratio of 8% and still had the same problem. I tried combinations of LZMA and LZMA2, 1-6 threads, Default to Max dictionary size, and 4 GB & 8 GB block size.
 
hathehariken Commented:
"But if I add another directory structure to an existing archive then the size of the resulting archive is almost double and not just slightly larger."

"Why is this happening?"
It is because the files are being added again, and the archive is now holding two sets of data.

"Is there a setting I can change or a program I can use to avoid this?"
Definitely. Instead of adding to the existing archive, you have to update the archive...

A better way to do this would be to use the WinRAR command line with the update switch ("u").
 
hathehariken Commented:
I meant to click Object, but I clicked Submit.
 
zzzy Author Commented:
Hi hathehariken,

These are different directories, so update is not an option. Let's say 'A' and 'B' are two directories with 50 MB of data each. Their data is very similar (there are some differences). Say the compression ratio for each of these two directories is 50%. When you compress 'A' and 'B' together, you get a 26 MB archive. But if you compress 'A' first, the resulting archive is 25 MB, and then if you add 'B' to that archive, the resulting archive is 50 MB.
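A miniature version of this experiment can be run with Python's stdlib (`tarfile` plus xz standing in for a solid 7z archive; the directory names, file counts and sizes here are made up): two directories sharing a common payload compress together to roughly the size of one, but compressed separately the total is about double.

```python
import os
import tarfile
import tempfile
from pathlib import Path

def make_dir(root: Path, name: str, blob: bytes) -> Path:
    """Create a directory of files that mostly share the same content."""
    d = root / name
    d.mkdir()
    for i in range(5):
        # Each file is the shared payload plus a tiny per-file difference.
        (d / f"file{i}.bin").write_bytes(blob + name.encode() + bytes([i]))
    return d

def solid_size(root: Path, label: str, *dirs: Path) -> int:
    """Tar the directories into one stream and xz-compress it ('solid')."""
    archive = root / f"{label}.tar.xz"
    with tarfile.open(archive, "w:xz") as tf:
        for d in dirs:
            tf.add(d, arcname=d.name)
    return archive.stat().st_size

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    blob = os.urandom(64 * 1024)  # shared, incompressible payload
    a = make_dir(root, "A", blob)
    b = make_dir(root, "B", blob)

    separately = solid_size(root, "a", a) + solid_size(root, "b", b)
    combined = solid_size(root, "ab", a, b)
    print(separately, combined)
```

Here `combined` lands near the size of one directory's archive, while `separately` is roughly double; adding 'B' to an already-finished archive behaves like the "separately" case.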

I haven't tried WinRAR. I'll try it and let you know what happens.

Thanks
SG
 
zzzy Author Commented:
Hi hathehariken,

I've tested WinRAR (GUI), without any positive result.

Thanks
SG
 
hathehariken Commented:
Hmm, interesting...

"'A' and 'B' are two directories with 50 MB data each"
"Each", so the total is 100 MB. So when you compress them together in a single archive, you have 100 MB * 50% = 50 MB plus overheads.

Now let me tell you an interesting thing about compression. The algorithms that 7-Zip, WinRAR and the like are based on have a sub-algorithm that looks for repeating patterns. You say that the files and directory structure are more or less identical with small differences. This sub-algorithm (for the life of me, I can't remember the name) sees two identical files, compresses only one, and makes a note to decompress it into two copies.

Test this out. Download a largish PDF file (about 2 MB) from anywhere on the net. Copy the file into the same folder under different names, for example 001.pdf, 002.pdf, and so on, until you have about 20 files. We know for certain that the contents are identical. Make an archive containing 1 file, another with 2 files and another with 20 files, then compare the resulting sizes of the archives. Now add a few files to the 20-file archive and compare again.

When you are adding the B directory on top of the already-created archive of the A directory, the similarities between the contents of the directories are overlooked and the B directory is just appended.
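The duplicate-file experiment above can be sketched with the stdlib (random bytes standing in for the PDF): a zip archive compresses every member independently, so 20 identical copies cost about 20x one copy, while a solid stream (tar piped through xz here, playing the role of a solid 7z block) collapses the 19 repeats.

```python
import io
import os
import tarfile
import zipfile

payload = os.urandom(128 * 1024)  # stand-in for the downloaded PDF
copies = {f"{i:03d}.pdf": payload for i in range(1, 21)}  # 001.pdf .. 020.pdf

# Zip: each member is deflated on its own, so the duplicates don't help.
zbuf = io.BytesIO()
with zipfile.ZipFile(zbuf, "w", zipfile.ZIP_DEFLATED) as zf:
    for name, data in copies.items():
        zf.writestr(name, data)

# Solid stream: all members go through one compressor, so the matcher
# spots that files 2..20 repeat file 1 and emits them as back-references.
tbuf = io.BytesIO()
with tarfile.open(fileobj=tbuf, mode="w:xz") as tf:
    for name, data in copies.items():
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tf.addfile(info, io.BytesIO(data))

print(len(zbuf.getvalue()), len(tbuf.getvalue()))
```

The solid archive comes out near the size of a single copy; the zip comes out near twenty copies.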
 
zzzyAuthor Commented:
A & B both are identical with few differences. So they have same compression ratios of 50%. But when compressed together total compression ratio ideally should be 25% or slightly higher. And I am getting 25% or so when I compressed them together. I get 50% when I add them individually.
0
 
hathehariken Commented:
And that's what I told you in the last post. It's normal.

Only if you compress them together in the same execution will the duplicate files be compressed once. Doing an "addition" will not look for duplicates (pardon the English).

I hope I have addressed your queries in a suitable manner. Please let me know.
 
zzzy Author Commented:
Hi hathehariken,

Thanks for your reply! I am just curious how BillDL is not getting the same result.

BillDL, could you please describe exactly what you are doing to not get results like mine? It would be helpful if you could tell me what format you are storing your archive in, how big the directories are, how many files, the types of files, compression ratios, etc.
In my case, each directory has thousands of files in various formats, approximately 900 MB. I am creating the archive in 7z format using either LZMA or LZMA2 compression without any encryption; the compression ratio for each directory is around 50%.

Thanks
SG
 
zzzy Author Commented:
Another observation:
If 'A' and 'B' are individually archived with the "Store" setting (no compression), and you then compress both archives together, the compression ratio is around 50% and not around 25% as one might expect.
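A plausible explanation (an assumption, sketched here with stdlib `lzma`): the LZMA dictionary limits how far back the compressor can look for matches. Concatenating two stored archives puts the similar data a whole archive's length apart, and when that distance exceeds the dictionary size the similarity becomes invisible.

```python
import lzma
import os

blob = os.urandom(512 * 1024)  # 512 KiB of incompressible data
data = blob + blob             # "archive A" followed by an identical "archive B"

def size_with_dict(dict_size: int) -> int:
    # Custom LZMA2 filter chain with an explicit dictionary size.
    filters = [{"id": lzma.FILTER_LZMA2, "dict_size": dict_size}]
    return len(lzma.compress(data, format=lzma.FORMAT_XZ, filters=filters))

small = size_with_dict(64 * 1024)    # 64 KiB window: the repeat is out of reach
large = size_with_dict(1024 * 1024)  # 1 MiB window: the repeat is found
print(small, large)
```

With the small window the output stays near the full 1 MiB; with a window covering the whole distance the second half collapses to almost nothing. Scaled up, a 900 MB gap between the stored archives would likewise exceed any practical dictionary.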
 
zzzy Author Commented:
Thanks hathehariken!

Here is the conclusion:
No matter what settings you use, if you separately add similar directories to an archive you won't get any benefit from the "similarity" of the directories. You need to archive them together.
To get the best compression ratio with 7-Zip, use LZMA2 with a solid block (no limit), the biggest word size, and the biggest dictionary size your computer's memory allows (LZMA needs about 11 * dictionary size of memory). With LZMA2, use only 2 CPU threads, because the data is divided into chunks, 1 chunk for every 2 threads; if the data is divided, you won't get any benefit from similarities between two different chunks (less compression).
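For the batch file mentioned earlier, these settings map onto 7-Zip command-line switches roughly as follows (a sketch: `backup.7z`, `dirA` and `dirB` are placeholder names, and the 256 MB dictionary is just an example value).

```python
import os
import shutil
import subprocess

# -mx=9      Ultra compression
# -m0=lzma2  LZMA2 method
# -md=256m   dictionary size (example value; as large as RAM allows)
# -mfb=273   maximum word size
# -ms=on     one solid block across all files
# -mmt=2     2 CPU threads, so the data stays in a single chunk
cmd = [
    "7z", "a", "-t7z",
    "-mx=9", "-m0=lzma2", "-md=256m", "-mfb=273", "-ms=on", "-mmt=2",
    "backup.7z", "dirA", "dirB",  # archive both directories together
]

# Only run when 7z and the placeholder directories actually exist.
if shutil.which("7z") and all(os.path.isdir(d) for d in ("dirA", "dirB")):
    subprocess.run(cmd, check=True)
else:
    print("sketch only:", " ".join(cmd))
```

The key point is the last line of the command: both directories are passed in one invocation, so they land in the same solid block.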
 
hathehariken Commented:
And that's the way the cookie crumbles.

I am glad that we could help you come to understand it. We all gained valuable knowledge.
Question has a verified solution.
