texasreddog

asked on

perl solution to splitting a zip file before it reaches 2 Mb in size

I have a complex issue: a scripted job is failing when it compresses a zip file, because the file gets over 2Mb in size and zip cannot handle more than that. The client we work with will not accept gzip files or a newer (64-bit) version of zip that can handle more than 2Mb.

So has anyone ever tried to develop a Perl script that can detect when a zip file is about to reach 2Mb and then automatically start another zip file, without corruption? I don't know if it can really be done, but I thought I would ask. Any feedback is appreciated.
Adam314

Which zip version are you using?

Something you could do: create the zip with one file in it, then continue adding one file at a time until its size is too big, at which point you start a new zip file.
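For illustration, here is a rough, untested sketch of that approach using the stock zip command. The part-file names are made up, and it assumes relative file paths so the stored entry name matches when backing a file out with zip -d:

#!/usr/bin/perl
# Sketch: add files one at a time; if the archive grows past the limit,
# back the last file out with "zip -d" and start a new part.
use strict;
use warnings;

my $limit = 2_000_000_000;
my $part  = 1;
foreach my $file (@ARGV) {
    system('zip', '-q', "part$part.zip", $file) == 0
        or die "zip failed on $file\n";
    if (-s "part$part.zip" > $limit) {
        # Too big with this file: remove it and put it in a fresh archive.
        # (A single file larger than $limit would still end up in an
        # oversize part of its own.)
        system('zip', '-q', '-d', "part$part.zip", $file) == 0
            or die "zip -d failed on $file\n";
        $part++;
        system('zip', '-q', "part$part.zip", $file) == 0
            or die "zip failed on $file\n";
    }
}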
texasreddog

ASKER

The zip version we are running on Unix is zip v2.3. I'm sure there is a much better version out there, but I don't know how soon we would be able to upgrade, or if our client would be willing to let us.
Not sure if the problem is the actual amount of data, or the zip program having trouble with the size of the file.  You could try zipping to stdout, then redirecting that to a file.  To send output to stdout, use - as the zip file name:
    zip - files > output.zip


hmmm...that's interesting. I looked at the homepage of the command (http://www.info-zip.org/FAQ.html#limits), and there are limitations. We seem to get between 14 and 15 thousand files (images) and then it dies. And I was incorrect about the size of the zip file; it is 2Gb, not 2Mb :)

here's the limitation, as stated in their documentation:

In practice, the real limit may be 2 GB on many systems, due to UnZip's use of the fseek() function to jump around within an archive. Because fseek's offset argument is usually a signed long integer, on 32-bit systems UnZip will not find any file that is more than 2 GB from the beginning of the archive. And on 64-bit systems, UnZip won't find any file that's more than 4 GB from the beginning (since the zipfile format can only store offsets that big). So the last file in the archive can potentially be arbitrarily large (in theory, anyway--we haven't tested this), but the combined total of all the rest must be less than 2 GB or 4 GB, respectively.

I don't think upgrading this command is going to work. What's worse is that we can't change to gzip or some other compression method for this client. I don't think coming up with a Perl solution to determine when to split the zip file before it gets to 2 GB will be easy. Feedback... suggestions?
Just thought of something... did you mean 2Gb instead of 2Mb?
If so, this could be the filesystem limit, not the zip program.

If that is the case, the above redirection won't help.
You could try this, using perl to capture the data, and print it to files in 2Gb blocks:

#!/usr/bin/perl
use strict;
use warnings;
 
##### Set these
my $zip = '/usr/local/bin/zip';
my $output = '/path/to/output';
my $limit = 2_000_000_000;
 
 
my $FileCount = 0;
my $TotalBytes = $limit + 1;    # forces a new output file on the first read
open(IN, "$zip - files|") or die "Could not start zip: $!\n";    # "files" = what to zip, as above
binmode(IN);    # the zip stream is binary data
 
my $OUT;
my ($buffer, $bytesread);
while($bytesread = read(IN, $buffer, 1024*1024)) {
	$TotalBytes += $bytesread;
	if($TotalBytes > $limit) {
		# This chunk would push the current piece past the limit:
		# close it and start the next numbered piece.
		close($OUT) if $OUT;
		$FileCount++;
		open($OUT, ">$output$FileCount.zip") or die "Could not open output: $!\n";
		binmode($OUT);
		$TotalBytes = $bytesread;
	}
	print $OUT $buffer;
}
close($OUT);
close(IN);


hmmm...interesting.  I might have to give that a try.  thanks, Adam314!
Made my last post before reading your previous post... Not sure if the above will help. I don't think zip would use fseek when writing to stdout, so it might behave differently. If so, the Perl will keep the files under 2G (or whatever you set $limit to). The limit is probably actually 2^31, not 2e9 as I have in the code above.

If it does work, you'll have to do something similar on the unzipping side (having it read from stdin, instead of from a file).  If you don't have access to change the unzipping side, I'm not sure how you'll get the original data back.
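If you can change that side, a hypothetical reassembly step might look like the sketch below. The part names match the splitting script above, and note the recombined file is over 2G again, so the receiving side still has to be able to handle a file that size:

#!/usr/bin/perl
# Hypothetical reassembly: the pieces are raw byte-splits of one zip
# stream, so concatenating them in numeric order restores the archive.
use strict;
use warnings;

open(my $out, '>', 'combined.zip') or die "Can't write combined.zip: $!\n";
binmode($out);
for (my $i = 1; -e "output$i.zip"; $i++) {
    open(my $in, '<', "output$i.zip") or die "Can't read output$i.zip: $!\n";
    binmode($in);
    my $buffer;
    print $out $buffer while read($in, $buffer, 1024*1024);
    close($in);
}
close($out);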

Another solution would be to get a list of all the files you have, and split them into sets such that each set is no larger than 2G (or if you are confident on a compression ratio, then 2G*compression_ratio).  Then use zip like normal to create a zip file of each set.  Then each should unzip without problems also.
OK, I looked into this a little more, and before I zip up the directory and files underneath, I wonder if I can determine the directory size to be zipped, and somehow split the directory into multiple smaller directories that are no larger than say 1.5 Gb each, then I know they will zip up.  Is it possible to run the Unix split command on one directory and split it into two or more?
If you want to keep your current directory structure, but need to keep the zip's smaller than 2G, this is what I'd do (if the above code doesn't work...)

Using perl:
1) find all files that you want in the zip, and their file sizes, and store in an array-of-array, like:
    ([filename1, filesize1], [filename2, filesize2], .....);
2) Sort this by size
3) group these into 2G groups:
while(there are files left) {
    remove the largest file that will fit in the current group from the array, and add it to the group
    if there were none, zip the group, and clear the group
}

As long as there is no single file over 2G (there shouldn't be... if there were, it'd mean your filesystem could support them), and zipping doesn't make any files bigger (it is rare, but could be possible), this will work.
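Here is an untested sketch of that grouping idea. The directory path, output names, and the 1.9G safety margin are placeholders you'd need to adjust:

#!/usr/bin/perl
# Untested sketch: collect files and sizes, sort largest-first, then
# greedily fill groups so each group's total stays under the limit.
use strict;
use warnings;
use File::Find;

my $InitialDir = '/path/to/files';    # directory to archive (placeholder)
my $limit      = 1_900_000_000;       # leave some headroom under 2G

# 1) Find all files and their sizes, stored as [name, size] pairs
my @AllFiles;
find(sub { push @AllFiles, [$File::Find::name, -s _] if -f $_ }, $InitialDir);

# 2) Sort by size, largest first
@AllFiles = sort { $b->[1] <=> $a->[1] } @AllFiles;

# 3) Greedy grouping: take the largest remaining file that still fits
my $GroupNum = 0;
while (@AllFiles) {
    $GroupNum++;
    my $size = 0;
    my @group;
    for (my $i = 0; $i < @AllFiles; ) {
        if ($size + $AllFiles[$i][1] <= $limit) {
            $size += $AllFiles[$i][1];
            push @group, $AllFiles[$i][0];
            splice(@AllFiles, $i, 1);
        }
        else {
            $i++;
        }
    }
    die "A single file is larger than the limit\n" unless @group;
    # (For very large groups you might need zip's -@ option, which reads
    # the file names from stdin, to avoid the command-line length limit.)
    system('zip', '-q', "group$GroupNum.zip", @group) == 0
        or die "zip failed for group $GroupNum\n";
}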
can you show me a coding solution that might work in this manner?
ASKER CERTIFIED SOLUTION
Adam314

In that code, I had a few things in there just for testing....

Change $limit to 2_000_000_000, and $InitialDir to the directory you want zipped.
And change line 14 to:
find(sub { return unless -f $File::Find::name; push @AllFiles, [$File::Find::name] }, $InitialDir);
As I'm sure you know, some of the more current archive utilities can do this for you (split an archive into chunks of a defined size). Can you use a newer utility just to create the zip files, if it still maintains compatibility (open, read, extract) with your client's preferred but old utility?

Or you might consider the Perl module Archive::Zip (see http://search.cpan.org/dist/Archive-Zip/lib/Archive/Zip.pm). You might be able to do a pure Perl solution, keeping a running total of the zip file's size and checking each potential addition with Archive::Zip's compressedSize() method to decide whether to add the file to the existing zip or start a new one.
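A very rough, untested sketch of that idea follows. One caveat: a member's compressedSize() is only populated once the archive has actually been compressed and written, so rather than querying each member up front, this version writes a scratch copy to measure the size on disk and uses removeMember() to back a file out if it pushed the part over the limit (the part and scratch names are made up):

#!/usr/bin/perl
# Untested Archive::Zip sketch. Inefficient (it rewrites the archive
# after every added file) but simple enough to show the idea.
use strict;
use warnings;
use Archive::Zip qw(:ERROR_CODES);

my $limit = 2_000_000_000;
my $part  = 0;
my $zip   = Archive::Zip->new();

foreach my $file (@ARGV) {
    my $member = $zip->addFile($file) or die "Can't add $file\n";
    # Compressed sizes are only known after writing, so write a scratch
    # copy and check its size on disk.
    $zip->writeToFileNamed('scratch.zip') == AZ_OK or die "write failed\n";
    if (-s 'scratch.zip' > $limit) {
        # Over the limit with this file: back it out, finalize the
        # current part, and start a new archive holding just this file.
        $zip->removeMember($member);
        $part++;
        $zip->writeToFileNamed("part$part.zip") == AZ_OK
            or die "write failed\n";
        $zip = Archive::Zip->new();
        $zip->addFile($file) or die "Can't add $file\n";
    }
}
$part++;
$zip->writeToFileNamed("part$part.zip") == AZ_OK or die "write failed\n";
unlink 'scratch.zip';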