Solved

Compressing Similar Directory Structure

Posted on 2011-03-04
Medium Priority
677 Views
Last Modified: 2012-06-27
I'm trying to compress multiple directory structures that hold the same data with minor changes. When I compress them together using 7-Zip with Ultra compression, the archive is only slightly larger than the archive of any single directory structure. But if I add another directory structure to an existing archive, the resulting archive is almost double the size rather than just slightly larger. Why is this happening? Is there a setting I can change, or a program I can use, to avoid this?
Question by:zzzy
18 Comments
 
LVL 39

Expert Comment

by:BillDL
ID: 35043115
Are you doing this from the command line (e.g. a batch file) or using the program's interface?

If it is a batch file, could you provide a sample of it so we can try to see what is happening?
 

Author Comment

by:zzzy
ID: 35045011
I'm eventually planning to do it through a batch file, but right now I just open the 7-Zip File Manager and manually create or add to the archive. For settings I use Ultra, LZMA2 and 6 CPU threads.
 
LVL 12

Expert Comment

by:hathehariken
ID: 35053354
Try using the WinRAR command line.

For parameters, use update and replace, and best compression.

 
LVL 39

Expert Comment

by:BillDL
ID: 35244382
Hi zzzy.  I am really sorry that I never returned to this question.  I did a massive reorganising of my emails and unfortunately the one I flagged to return here was filed away and I forgot.

I did try and replicate your scenario using 7-Zip shortly after you came back with your follow-up, but I was not able to get the same results as you were getting.  When I added new content to my Zip file, it used the Ultra compression again and didn't enlarge the existing Zip file any more than I expected.

One thing that I did notice in the 7-Zip help file (File Manager > Plugins > 7-Zip > "Add to archive dialog box") was that it states this for the "LZMA2" compression method:

"It provides better multithreading support than LZMA. But compression ratio can be worse in some cases. For best compression ratio with LZMA2 use 1 or 2 CPU threads. If you use LZMA2 with more than 2 threads, 7-zip splits data to chunks and compresses these chunks independently (2 threads per each chunk)."

I tried various permutations of Dictionary size, etc, but could still not replicate your issue.

Bill
 

Author Comment

by:zzzy
ID: 35266042
Hi Bill,

The folder size is approximately 900 MB (500 MB compressed). I tried using just 1 thread, but I'm getting the same result. I have around ten such directories. When compressed together, the resulting archive is just 700 MB. But if they are added one by one, each adds approximately 500 MB to the archive.

Thanks
SG
 

Author Comment

by:zzzy
ID: 35282752
I've repeated the same test with smaller directory structures of 100 MB each, with a compression ratio of 8%, and still had the same problem. I tried combinations of LZMA and LZMA2, 1 to 6 threads, default to maximum dictionary size, and 4 GB and 8 GB block sizes.
 
LVL 12

Expert Comment

by:hathehariken
ID: 35294001
"But if I add another directory structure to an existing archive then the size of the resulting archive is almost double and not just slightly larger. Why is this happening?"

It is because the files are being added again, so the archive now holds two sets of data.

"Is there a setting I can change or a program I can use to avoid this?"

Definitely. Instead of adding to the existing archive, you have to update the archive.

A better way to do this would be to use the WinRAR command line with the "U" switch.
 
LVL 12

Expert Comment

by:hathehariken
ID: 35294025
I meant to object, but I clicked on Submit.
 

Author Comment

by:zzzy
ID: 35298181
hi hathehariken,

These are different directories, so update is not an option. Say 'A' and 'B' are two directories with 50 MB of data each. Their data is very similar, with only some differences, and each directory compresses to about 50% on its own. When you compress 'A' and 'B' together, you get a 26 MB archive. But if you compress 'A' first, the resulting archive is 25 MB, and if you then add 'B' to that archive, the result is 50 MB.
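
The A/B scenario can be reproduced on a small scale. The following is a minimal Python sketch, not 7-Zip itself: it assumes Python's standard `lzma` module as a stand-in compressor and uses random bytes as a stand-in for the directory contents (so each half is essentially incompressible on its own, unlike the 50% ratio in the example). The helper names `make_similar_pair` and `compare` are hypothetical.

```python
import lzma
import random


def make_similar_pair(size=200_000, changes=50, seed=0):
    """Return two byte strings that are identical except for a few bytes."""
    rng = random.Random(seed)
    a = rng.getrandbits(size * 8).to_bytes(size, "little")
    b = bytearray(a)
    for _ in range(changes):
        # Flip one byte so that B differs slightly from A.
        b[rng.randrange(size)] ^= 0xFF
    return a, bytes(b)


def compare(a, b):
    """Return (size when compressed as one stream, size as two streams)."""
    # One stream: while compressing B, the compressor can refer back to A.
    together = len(lzma.compress(a + b))
    # Two independent streams: B is compressed with no knowledge of A.
    separately = len(lzma.compress(a)) + len(lzma.compress(b))
    return together, separately


a, b = make_similar_pair()
together, separately = compare(a, b)
print(f"compressed together:   {together} bytes")
print(f"compressed separately: {separately} bytes")
```

In this sketch the single stream should come out close to the size of one directory, while the two independent streams add up to roughly double, which mirrors the 26 MB versus 50 MB figures above.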

I haven't tried WinRAR. I'll try it and let you know what happens.

Thanks
SG
 

Author Comment

by:zzzy
ID: 35298772
hi hathehariken,

I've tested WinRAR (GUI), without any positive result.

Thanks
SG
 
LVL 12

Expert Comment

by:hathehariken
ID: 35299427
hmm, interesting...

"'A' and 'B' are two directories with 50 MB data each"
"Each", so the total is 100 MB.

So when you compress them together in a single archive, you have 100 MB * 50% = 50 MB, plus overheads.



Now let me tell you an interesting thing about compression.
The algorithms on which 7-Zip, WinRAR and the like are based have a sub-algorithm that looks for repeating patterns.
You say that the files and directory structure are more or less identical, with small differences.

This sub-algorithm (for the life of me, I can't remember the name) sees two identical files, compresses only one, and makes a note to decompress it and make two copies.

Test this out.
Download a largish PDF file (approximately 2 MB) from anywhere on the net.
Copy the file several times in the same folder under different names,
for example 001.pdf, 002.pdf, and so on.

Make about 20 copies. We definitely know that the contents are identical.
Make one archive containing 1 file, another with 2 files and another with 20 files, then compare the resulting archive sizes. Now add a few more files to the 20-file archive and compare again.

When you add the B directory on top of the already created archive of the A directory, the similarities between the contents of the directories are overlooked and the B directory is simply added.
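
The copy experiment can also be scripted. This is a rough Python sketch under stated assumptions: a tar stream compressed as a whole with the standard `lzma` module stands in for a solid archive, a block of random bytes stands in for the PDF, and a later "add" is modelled as a second, independent compressed stream. None of this is taken from 7-Zip or WinRAR; `solid_size` is a hypothetical helper.

```python
import io
import lzma
import random
import tarfile


def solid_size(payload, copies):
    """Pack `copies` identical files into one tar stream, compress it whole."""
    raw = io.BytesIO()
    with tarfile.open(fileobj=raw, mode="w") as tar:
        for i in range(1, copies + 1):
            info = tarfile.TarInfo(name=f"{i:03d}.pdf")
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return len(lzma.compress(raw.getvalue()))


# Stand-in for the downloaded PDF: 100,000 bytes of incompressible data.
payload = random.Random(1).getrandbits(100_000 * 8).to_bytes(100_000, "little")

sizes = {n: solid_size(payload, n) for n in (1, 2, 20)}
for n, size in sizes.items():
    print(f"{n:2d} file(s) compressed together: {size} bytes")

# Assumption: adding 3 more copies later creates a separate compressed stream.
appended = sizes[20] + solid_size(payload, 3)
rebuilt = solid_size(payload, 23)
print(f"20 files, then 3 added later: {appended} bytes")
print(f"23 files compressed together: {rebuilt} bytes")
```

Under these assumptions the 20-file archive should be barely larger than the 1-file archive, while the later addition costs about one more full copy.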
 

Author Comment

by:zzzy
ID: 35300101
A and B are identical apart from a few differences, so each has the same compression ratio of 50%. But when they are compressed together, the total compression ratio should ideally be 25% or slightly higher. I do get about 25% when I compress them together, but 50% when I add them individually.
 
LVL 12

Expert Comment

by:hathehariken
ID: 35301688
And that's what I told you in the last post. It's normal.

The duplicate files are compressed once only if you compress them together in the same execution.
An "addition" will not look for duplicates (pardon the English).

I hope I have addressed your queries in a suitable manner.
Please let me know.
 

Author Comment

by:zzzy
ID: 35306592
Hi Hathehariken,

Thanks for your reply! I am just curious why BillDL is not getting the same result.

BillDL, could you please describe exactly what you are doing that does not produce a result like mine? It would be helpful if you could tell me what format you are storing your archive in, how big the directories are, how many files there are, the types of files, the compression ratios, etc.
In my case, the directory has thousands of files in various formats, approximately 900 MB in total. I am creating the archive in "7z" format using either "LZMA" or "LZMA2" compression without any encryption, and the compression ratio for each directory is around 50%.

Thanks
SG
 

Author Comment

by:zzzy
ID: 35306623
Another observation:
If 'A' and 'B' are individually archived with the "Store" setting (no compression) and you then try to compress both archives together, the compression ratio is around 50%, not around 25% as one might expect.
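
One factor that can produce a result like this, offered here as an assumption rather than something the thread confirms, is the dictionary size: an LZMA-style compressor can only refer back to earlier data that still lies within its dictionary window, so a second copy that starts further back than the window reaches gets no benefit from the first. The Python sketch below illustrates that effect with the standard `lzma` module; the sizes are arbitrary and `compressed_size` is a hypothetical helper.

```python
import lzma
import random


def compressed_size(data, dict_size):
    """Compress with LZMA2 using an explicit dictionary size in bytes."""
    filters = [{"id": lzma.FILTER_LZMA2, "preset": 6, "dict_size": dict_size}]
    return len(lzma.compress(data, format=lzma.FORMAT_XZ, filters=filters))


# Two identical 500,000-byte blocks back to back: the second copy begins
# 500,000 bytes after the first.
block = random.Random(2).getrandbits(500_000 * 8).to_bytes(500_000, "little")
data = block + block

# Dictionary shorter than the distance between the two copies.
small = compressed_size(data, 64 * 1024)
# Dictionary long enough to reach back to the first copy.
large = compressed_size(data, 4 * 1024 * 1024)

print(f"64 KiB dictionary: {small} bytes")
print(f"4 MiB dictionary:  {large} bytes")
```

With the small dictionary the output should be about the size of both blocks, and with the large one about the size of a single block.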
 
LVL 12

Accepted Solution

by:
hathehariken earned 2000 total points
ID: 35308010
Deflation algorithms do not work well with pre-encapsulated data. Pattern-matching algorithms will not work at all, in my opinion.

If you can disclose the information, I would like to know the final end result you are trying to achieve.
 

Author Closing Comment

by:zzzy
ID: 35309932
Thanks hathehariken!

Here is the conclusion:
No matter what settings you use, if you add similar directories to an archive separately, you won't get any benefit from the "similarity" of the directories. You need to archive them together.
To get the best compression ratio with 7-Zip, use LZMA2 with the solid block size set to no limit, and use the biggest word size and the biggest dictionary size that the memory available on your computer allows. LZMA needs 11 * dictionary size of memory. With LZMA2, use only 2 CPU threads, because the data is divided into chunks, 1 chunk for every 2 threads. If the data is divided, you won't get any benefit from the similarities between two different chunks (less compression).
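
Since the plan was to run this from a batch file eventually, the settings above could be expressed as a 7-Zip command line. The Python sketch below only builds the argument list. It assumes the `7z` executable is on the PATH; the archive name, directory names and the `256m` dictionary size are placeholders to adjust to the available memory, and `build_7z_command` is a hypothetical helper.

```python
def build_7z_command(archive, directories, dictionary="256m", threads=2):
    """Build a 7z command that archives all directories in one solid run."""
    if not directories:
        raise ValueError("at least one directory is required")
    return [
        "7z", "a",            # add to archive
        "-t7z",               # 7z archive format
        "-m0=lzma2",          # LZMA2 compression method
        "-mx=9",              # Ultra compression level
        f"-md={dictionary}",  # dictionary size (placeholder value)
        "-mfb=273",           # largest word size (fast bytes)
        "-ms=on",             # solid archive
        f"-mmt={threads}",    # number of CPU threads
        archive,
        *directories,         # all similar directories in the same run
    ]


# Hypothetical archive and directory names.
cmd = build_7z_command("backup.7z", ["A", "B"])
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. The important part is that all the similar directories are named in the same invocation rather than added one at a time.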
 
LVL 12

Expert Comment

by:hathehariken
ID: 35313036
And that's the way the cookie crumbles.

I am glad that we could help you come to understand it.
We all gained valuable knowledge.