Solved

Threading

Posted on 2008-10-27
206 Views
Last Modified: 2013-11-13
I need a little assistance with the included routine.  The routine takes an array of filenames, where the files are gzipped text (log) files.  Each file is unzipped into memory and searched for a list of terms.  When there are 5 files this is fairly quick, but when there are 100 files I run into resource issues.  A file can be on the order of 20M uncompressed.

It appears that this routine opens and searches all of the files in the array in parallel.  What I need is to control the processing: I'm looking for help using a configurable value that sets the number of files processed at a time.  If it is configured at 5, then 5 files are searched at once, and when one completes, the next file is taken from the array and searched.  In this way there are always 5 files being processed until the array is empty.
my $in_num = "23";    # A constant
my @terms;
$terms[4] = qr/(1\.1\.1\.1|help|perl)/;    # precompiled search pattern
my ($pid, $tfile, @childs);
my $Tmpfolder = "./tmp/";
my $jobs2run  = scalar @datfile;
for my $i (0..($jobs2run-1)) {
    $pid = fork();
    if ($pid) {                 # parent: remember the child's pid
        push(@childs, $pid);
    }
    elsif (defined $pid) {      # child ($pid == 0)
        print LOG "Forking child for IN num: $in_num - $i\n";
        $tfile = $in_num."_".$i.".tmp";
        open(TMPRES, ">$Tmpfolder/$tfile") or die "Can't write $Tmpfolder/$tfile: $!";
        open(F, "gunzip -c $datfile[$i] |") or die "Can't gunzip $datfile[$i]: $!";
        foreach my $line (<F>) {
            if ($line =~ m/$terms[4]/) { print TMPRES $datfile[$i], "::", $line, "\n" }
        }
        close F;
        print LOG "closed $datfile[$i]\t closing $Tmpfolder/$tfile\t$i\n";
        close TMPRES;
        exit 0;                 # end child
    }
    else {                      # fork failed
        print LOG "couldn't fork: $!\n";
    }
}
foreach (@childs) {             # wait for all children to finish
    waitpid($_, 0);
}
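The kind of throttling described above is essentially what the CPAN module Parallel::ForkManager provides; a minimal sketch against the same variables (@datfile, $in_num, $Tmpfolder, $terms[4]), assuming the module is installed:

use Parallel::ForkManager;

my $pm = Parallel::ForkManager->new(5);    # at most 5 children at a time

for my $i (0 .. $#datfile) {
    $pm->start and next;                   # parent gets the child's pid and moves on
    # ----- child -----
    my $tfile = $in_num . "_" . $i . ".tmp";
    open(my $tmpres, '>',  "$Tmpfolder/$tfile")      or die "Can't write $tfile: $!";
    open(my $fh,     '-|', "gunzip -c $datfile[$i]") or die "Can't gunzip $datfile[$i]: $!";
    while (my $line = <$fh>) {             # one line at a time, not the whole file
        print $tmpres $datfile[$i], "::", $line if $line =~ $terms[4];
    }
    close $fh;
    close $tmpres;
    $pm->finish;                           # child exits here
}
$pm->wait_all_children;                    # block until every child has been reaped

start() returns the new child's pid in the parent (so "and next" sends the parent to the next file) and 0 in the child; finish() ends the child, and wait_all_children() blocks until every outstanding child has exited.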


Question by:mouse050297

Accepted Solution

by: Adam314 (earned 125 total points)
ID: 22817361

my $MaxAllowedInParrallel = 5;

my $in_num = "23";  #A constant
my @terms;
$terms[4] = qr/(1\.1\.1\.1|help|perl)/;
my ($pid, $tfile);
my $Tmpfolder = "./tmp/";
my $jobs2run  = scalar @datfile;
my %childs;    # pid => 1 for every child still running
for my $i (0..($jobs2run-1)) {
    # Throttle: once the limit is reached, wait for one child to
    # finish before forking the next one
    if (keys %childs >= $MaxAllowedInParrallel) {
        my $finishedpid = wait();
        delete $childs{$finishedpid};
    }
    $pid = fork();
    if ($pid) {  #parent
        $childs{$pid} = 1;
    }
    elsif (defined $pid) {  #child ($pid == 0)
        print LOG "Forking child for IN num: $in_num - $i\n";
        $tfile = $in_num."_".$i.".tmp";
        open(TMPRES, ">$Tmpfolder/$tfile") or die "Can't write $Tmpfolder/$tfile: $!";
        open(F, "gunzip -c $datfile[$i] |") or die "Can't gunzip $datfile[$i]: $!";
        foreach my $line (<F>) {
            if ($line =~ m/$terms[4]/) { print TMPRES $datfile[$i], "::", $line, "\n" }
        }
        close F;
        print LOG "closed $datfile[$i]\t closing $Tmpfolder/$tfile\t$i\n";
        close TMPRES;
        exit 0;   #end child
    }
    else {
        print LOG "couldn't fork: $!\n";
    }
}

# Reap any children still running after the last fork
while (keys %childs > 0) {
    my $finishedpid = wait();
    delete $childs{$finishedpid};
}

Author Comment

by:mouse050297
ID: 22833705
It appears that when I run this with a large number of large files, system memory becomes exhausted.  Any tips to alleviate this situation?  Setting $MaxAllowedInParrallel=2 has the same result; it just takes longer to exhaust memory.

Expert Comment

by:Adam314
ID: 22836694
Can you confirm if it is a large number of files, or large files, that cause the problem?

Author Comment

by:mouse050297
ID: 22846402
The routine makes a tmp log file for each archive file it searches.  The routine exhausts all system memory at the same point whether I use $MaxAllowedInParrallel = 1 or $MaxAllowedInParrallel = 5.  Say, for example, that it processes 32 files out of 100; that 32nd file, when uncompressed, is 30M.  The system has 8G of memory.  During processing, memory use only increments; I never see a decrement.  I can confirm that it is the number of files, along with their additive size, that causes the problem.
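One way to see which process is holding the memory, assuming this runs on Linux, is to log each process's resident set size from /proc during the run; a small sketch (the helper name and the use of the LOG handle are only illustrative):

# Illustrative helper: report the current process's resident set size
# (VmRSS), read from the Linux /proc filesystem.
sub log_rss {
    my ($label) = @_;
    open(my $st, '<', "/proc/$$/status") or return;
    while (my $line = <$st>) {
        if ($line =~ /^VmRSS:\s*(\d+)\s*kB/) {
            print LOG "$label: pid $$ RSS $1 kB\n";
            last;
        }
    }
    close $st;
}

Calling it in the parent just before each fork, and in a child just before it exits, would show whether it is the parent or the children that keep growing.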

Expert Comment

by:Adam314
ID: 22847554
Your lines 23-25:
    foreach my $line (<F>) {
        if ($line =~ m/$terms[4]/) {print TMPRES $datfile[$i],"::",$line,"\n"}
    }
might be better as a while loop (foreach evaluates <F> in list context, so the entire uncompressed file is read into memory before the loop starts, whereas while reads one line at a time):
    while(my $line = <F>) {
        if ($line =~ m/$terms[4]/) {print TMPRES $datfile[$i],"::",$line,"\n"}
    }

Another possibility is that both the parent and the children are using the LOG filehandle.  I don't think this is a problem, but you could try not having the children use it; if a child needs to log, have it open its own log file (maybe use flock so multiple children aren't writing at the same time).
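A rough sketch of that suggestion, assuming a shared log at the purely illustrative path ./search.log: each child opens the file in append mode and takes an exclusive flock around every write.

use Fcntl qw(:flock);

# Hypothetical per-child logger: each child appends to a shared log file
# under an exclusive lock so lines from concurrent children don't interleave.
sub child_log {
    my ($msg) = @_;
    open(my $log, '>>', './search.log') or die "Can't open log: $!";
    flock($log, LOCK_EX)                or die "Can't lock log: $!";
    seek($log, 0, 2);      # re-seek to end of file after acquiring the lock
    print $log "[$$] $msg\n";
    close $log;            # closing the handle releases the lock
}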

Author Comment

by:mouse050297
ID: 22852483
Changing the processing loop from a 'foreach' to a 'while' statement was a great success.