Solved

Compressing Similar Directory Structure

Posted on 2011-03-04
Last Modified: 2012-06-27
I'm trying to compress multiple directory structures that hold the same data with small changes. When I compress them together using 7-Zip ultra compression, the archive is only slightly larger than the archive of any one directory structure. But if I add another directory structure to an existing archive, the resulting archive is almost double the size rather than just slightly larger. Why is this happening? Is there a setting I can change or a program I can use to avoid this?
Question by:zzzy
18 Comments
 
LVL 38

Expert Comment

by:BillDL
ID: 35043115
Are you doing this from the command line (eg. a batch file) or using the program's interface?

If it is a batch file, could you perhaps provide a sample of the batch file so we can try and see what is happening.
 

Author Comment

by:zzzy
ID: 35045011
I'm eventually planning to do it through a batch file. But right now I just open the 7-Zip File Manager and manually create or add to an archive. For settings I use Ultra, LZMA2 and 6 CPU threads.
 
LVL 12

Expert Comment

by:hathehariken
ID: 35053354
try using winRAR command line.

for parameters, use update and replace, and best compression.
 
LVL 38

Expert Comment

by:BillDL
ID: 35244382
Hi zzzy.  I am really sorry that I never returned to this question.  I did a massive reorganising of my emails and unfortunately the one I flagged to return here was filed away and I forgot.

I did try and replicate your scenario using 7-Zip shortly after you came back with your follow-up, but I was not able to get the same results as you were getting.  When I added new content to my Zip file, it used the Ultra compression again and didn't enlarge the existing Zip file any more than I expected.

One thing that I did notice in the 7-Zip help file (File Manager > Plugins > 7-Zip > "Add to archive dialog box") was that it states this for the "LZMA2" compression method:

"It provides better multithreading support than LZMA. But compression ratio can be worse in some cases. For best compression ratio with LZMA2 use 1 or 2 CPU threads. If you use LZMA2 with more than 2 threads, 7-zip splits data to chunks and compresses these chunks independently (2 threads per each chunk)."

I tried various permutations of Dictionary size, etc, but could still not replicate your issue.

Bill
 

Author Comment

by:zzzy
ID: 35266042
Hi Bill,

The folder size is approximately 900 MB (500 MB compressed). I tried using just 1 thread but I'm getting the same result. I have around ten such directories. When compressed together, the resulting archive is just 700 MB. But if they are added one by one, each adds approximately 500 MB to the archive.

Thanks
SG
 

Author Comment

by:zzzy
ID: 35282752
I've repeated the same test with smaller directory structures of 100 MB each, with a compression ratio of 8%, and still had the same problem. I tried combinations of LZMA and LZMA2, 1 to 6 threads, default to maximum dictionary size, and 4 GB and 8 GB solid block sizes.
 
LVL 12

Expert Comment

by:hathehariken
ID: 35294001
But if I add another directory structure to an existing archive then the size of the resulting archive is almost double and not just slightly larger.

Why is this happening?
it is because the files are being added again, and the archive is now holding two sets of data.

Is there a setting I can change or a program I can use to avoid this?
definitely.
instead of adding to the existing archive, you have to update the archive...

a better way to do this would be to use WinRAR command line with the switch "U"
 
LVL 12

Expert Comment

by:hathehariken
ID: 35294025
i meant to object, but i clicked on submit.
 

Author Comment

by:zzzy
ID: 35298181
hi hathehariken,

These are different directories, so update is not an option. Let's say 'A' and 'B' are two directories with 50 MB of data each. Their data is very similar (there are some differences). Let's say the compression ratio for each of these two directories is 50%. When you compress 'A' and 'B' together, you get a 26 MB archive. But if you compress 'A' first, the resulting archive is 25 MB, and if you then add 'B' to that archive, the resulting archive is 50 MB.
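
The 'A' and 'B' scenario can be reproduced on a small scale. The following is only a sketch, assuming Python's standard `lzma` module as a stand-in for 7-Zip; the data is synthetic (hex text of pseudo-random bytes, chosen because it compresses to roughly half its size) and the helper `make_data` is a hypothetical name introduced for illustration.

```python
import lzma
import random


def make_data(n_bytes, seed):
    """Hex text of pseudo-random bytes: compresses to roughly half its size."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(n_bytes)).hex().encode("ascii")


# Stand-in for directory 'A': about 400 KB of data with a ~50% compression ratio.
a = make_data(200_000, seed=1)

# Stand-in for directory 'B': the same data with a few small differences.
b_mutable = bytearray(a)
for offset in range(0, len(b_mutable), 50_000):
    b_mutable[offset:offset + 16] = b"changed-content!"
b = bytes(b_mutable)

# One compression run over both: the second copy can refer back to the first.
together = len(lzma.compress(a + b))

# Two independent runs: neither stream can see the other's data.
separately = len(lzma.compress(a)) + len(lzma.compress(b))

print(f"A and B in one stream : {together} bytes")
print(f"A and B separately    : {separately} bytes")
```

In this sketch the single-stream result should come out close to the size of one compressed copy, while the two independent streams add up to about twice that, which mirrors the 26 MB versus 50 MB figures in the example.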

I haven't tried WinRAR. I'll try it and let you know what happens.

Thanks
SG

 

Author Comment

by:zzzy
ID: 35298772
hi hathehariken,

I've tested WinRAR (GUI), without any positive result.

Thanks
SG
 
LVL 12

Expert Comment

by:hathehariken
ID: 35299427
hmm, interesting...

'A' and 'B' are two directories with 50 MB data each
"each". so total is 100 mb

so when you compress them together in a single archive, you have 100 *50% = 50 mb + overheads.



now let me tell you an interesting thing about compression.
the algorithms on which 7-Zip, WinRAR and the like are based have a sub-algorithm which looks for repeating patterns.
you say that the files and directory structure are more or less identical, with small differences.

this sub-algorithm (for the life of me, i can't remember the name) sees two identical files, compresses only one, and makes a note to decompress it and make two copies....

test this out.
download a largish PDF file (like 2 mb approx) from anywhere in the net.
copy the file in the same folder with different names.
example, 001.pdf, 002.pdf, and so on and so forth

have about 20 files. we definitely know that the contents are identical.
make an archive containing 1 file, another with 2 files and another with 20 files, notice and compare the resultant size of the archives. now add a few files to the 20 file archive, and compare again

when you are adding the B directory on top of the already created archive of the A directory, the similarities between the contents of the directories are overlooked and the B directory is just added.
 

Author Comment

by:zzzy
ID: 35300101
A and B are identical apart from a few differences, so each has the same compression ratio of 50%. When they are compressed together, the overall compression ratio should ideally be 25% or slightly higher, and I do get about 25% when I compress them together. I get 50% when I add them individually.
 
LVL 12

Expert Comment

by:hathehariken
ID: 35301688
and that's what i told you in the last post. it's normal.

only if you compress them together in the same execution will the duplicate files be compressed once.
doing "addition" will not look for duplicates (pardon the English)

i hope i have addressed your queries in a suitable manner.
please let me know.
 

Author Comment

by:zzzy
ID: 35306592
Hi Hathehariken,

Thanks for your reply! I am just curious how BillDL is not getting same result.

BillDL, could you please describe exactly what you are doing that avoids a result like mine? It would be helpful if you could tell me what format you are storing your archive in, how big the directories are, how many files there are, the types of files, the compression ratios, etc.
In my case, the directory has thousands of files in various formats, approximately 900 MB in total. I am creating the archive in "7z" format using either "LZMA" or "LZMA2" compression without any encryption, and the compression ratio for each directory is around 50%.

Thanks
SG
 

Author Comment

by:zzzy
ID: 35306623
Another observation:
If 'A' and 'B' are individually archived with the "Store" setting (no compression), and you then try to compress both archives together, the compression ratio is around 50% and not around 25% as one might expect.
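
One general property of LZMA-style compressors is relevant to experiments like this: repeated data is only found if the earlier copy is still inside the dictionary window. Whether that explains the observation above is not established in this thread. The sketch below only illustrates the window effect, assuming Python's standard `lzma` module rather than 7-Zip itself; the data is synthetic and `compressed_size` is a hypothetical helper.

```python
import lzma
import random


def compressed_size(data, dict_size):
    """Size of `data` after LZMA2 compression with the given dictionary size."""
    filters = [{"id": lzma.FILTER_LZMA2, "preset": 6, "dict_size": dict_size}]
    return len(lzma.compress(data, format=lzma.FORMAT_XZ, filters=filters))


# A 600 KB block of hex text (about 50% compressible), followed by an exact copy.
rng = random.Random(7)
block = bytes(rng.getrandbits(8) for _ in range(300_000)).hex().encode("ascii")
data = block + block

# 64 KiB dictionary: the first copy is out of reach by the time the second starts.
small = compressed_size(data, dict_size=64 * 1024)

# 4 MiB dictionary: the whole first copy is still visible, so the repeat is cheap.
large = compressed_size(data, dict_size=4 * 1024 * 1024)

print(f"64 KiB dictionary: {small} bytes")
print(f"4 MiB dictionary : {large} bytes")
```

With the small dictionary the duplicate costs about as much as the original; with the large one it costs almost nothing. For directories of several hundred MB, the same reasoning applies at a larger scale.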
 
LVL 12

Accepted Solution

by:
hathehariken earned 500 total points
ID: 35308010
deflation algorithms do not work well with pre-encapsulated data. pattern matching algorithms will not work at all, in my opinion.

if you can disclose the information, i would like to know the final end result you are trying to achieve.
 

Author Closing Comment

by:zzzy
ID: 35309932
Thanks hathehariken!

Here is conclusion:
No matter what settings you use, if you add similar directories to an archive separately, you won't get any benefit from the "similarity" of the directories. You need to archive them together.
To get the best results with 7-Zip in terms of compression ratio, use LZMA2 with a solid block (no limit), and use the biggest word size and the biggest dictionary size that the memory available on your computer allows. LZMA needs about 11 times the dictionary size in memory. With LZMA2, use only 2 CPU threads, because with more threads the data is divided into chunks, one chunk for every 2 threads. If the data is divided, you won't get any benefit from the similarities between two different chunks, which means less compression.
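
Since the eventual goal is a batch job, the settings above could be expressed on the 7-Zip command line roughly as follows. This is a sketch, not a tested recipe: the switches (`-t7z`, `-m0=lzma2`, `-mx=9`, `-ms=on`, `-md`, `-mfb`, `-mmt`) are the ones documented for the `7z` command-line tool as far as I know, but should be checked against the installed version; the archive name, directory names, the 64m dictionary value and the function `build_7z_command` are placeholders chosen for illustration.

```python
import shlex


def build_7z_command(archive, directories, dictionary="64m", threads=2):
    """Build a 7z command that adds all directories in ONE run (hypothetical helper)."""
    return [
        "7z", "a",            # "a" = add to archive
        "-t7z",               # 7z archive format
        "-m0=lzma2",          # LZMA2 compression method
        "-mx=9",              # Ultra compression level
        "-ms=on",             # solid archive, no block size limit
        f"-md={dictionary}",  # dictionary size; raise it as far as memory allows
        "-mfb=273",           # largest word size (fast bytes)
        f"-mmt={threads}",    # 2 threads, so LZMA2 does not split data into chunks
        archive,
        *directories,         # every similar directory goes into the same run
    ]


command = build_7z_command("backup.7z", ["A", "B"])
print(shlex.join(command))
# To actually run it: subprocess.run(command, check=True)
```

The important part is that all the similar directories are passed in a single invocation, rather than added to the archive one at a time.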
 
LVL 12

Expert Comment

by:hathehariken
ID: 35313036
and that's the way the cookie crumbles.

i am glad that we could help you come to understand it.
we all gained valuable knowledge