Solved

Threading

Posted on 2008-10-27
199 Views
Last Modified: 2013-11-13
I need a little assistance with the included routine. This routine takes an array of filenames, where each file is a gzipped text (log) file. Each file is unzipped into memory and searched for a list of terms. With 5 files this is fairly quick, but with 100 files I run into resource issues. A file can be on the order of 20 MB uncompressed.

It appears that this routine opens and searches all files in the array in parallel. What I need is to control the processing: I'm looking for help using a configurable value to set the number of files processed at a time. So that if it is configured at 5, then 5 files will be searched, and when one completes, a new file is taken from the array and searched. In this way there will always be 5 files being processed until the array is empty.
my $in_num = "23";    # A constant

my @terms;
$terms[4] = qr/(1\.1\.1\.1|help|perl)/;    # precompiled search pattern

my ($pid, $tfile, @childs);

my $Tmpfolder = "./tmp";

my $jobs2run = scalar @datfile;

for my $i (0 .. $jobs2run - 1) {

    $pid = fork();

    if ($pid) {    # parent: remember the child's pid
        push(@childs, $pid);

    } elsif ($pid == 0) {    # child: search one file, write hits to a tmp file
        print LOG "Forking child for IN num: $in_num - $i\n";

        $tfile = $in_num . "_" . $i . ".tmp";

        open(TMPRES, ">$Tmpfolder/$tfile") or die "can't write $Tmpfolder/$tfile: $!";
        open(F, "gunzip -c $datfile[$i] |") or die "can't gunzip $datfile[$i]: $!";

        foreach my $line (<F>) {
            if ($line =~ m/$terms[4]/) { print TMPRES $datfile[$i], "::", $line, "\n" }
        }
        close F;

        print LOG "closed $datfile[$i]\t closing $Tmpfolder/$tfile\t$i\n";
        close TMPRES;

        exit 0;    # end child

    } else {
        print LOG "couldn't fork: $!\n";
    }
}

foreach (@childs) {    # wait for every child to finish
    waitpid($_, 0);
}
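
(For reference: the CPAN module Parallel::ForkManager implements exactly this "at most N children at a time" pattern. A minimal sketch, assuming the same @datfile, $terms[4], $in_num and $Tmpfolder as in the code above:)

    use Parallel::ForkManager;

    my $pm = Parallel::ForkManager->new(5);    # cap: at most 5 children at once

    for my $i (0 .. $#datfile) {
        $pm->start and next;    # parent: start() returns the child's pid, so skip ahead
        # --- child ---
        my $tfile = $in_num . "_" . $i . ".tmp";
        open(TMPRES, ">$Tmpfolder/$tfile") or die "can't write $Tmpfolder/$tfile: $!";
        open(F, "gunzip -c $datfile[$i] |") or die "can't gunzip $datfile[$i]: $!";
        while (my $line = <F>) {    # one line at a time, not the whole file
            print TMPRES $datfile[$i], "::", $line, "\n" if $line =~ m/$terms[4]/;
        }
        close F;
        close TMPRES;
        $pm->finish;    # child exits; its slot becomes free
    }

    $pm->wait_all_children;    # block until every child has been reaped

start() forks internally and blocks whenever 5 children are already running, which gives the "always 5 in flight" behaviour the question asks for.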


Question by:mouse050297
6 Comments
 

Accepted Solution

by: Adam314 (earned 125 total points)
ID: 22817361

my $MaxAllowedInParrallel = 5;

my $in_num = "23";    # A constant

my @terms;
$terms[4] = qr/(1\.1\.1\.1|help|perl)/;

my ($pid, $tfile);

my $Tmpfolder = "./tmp";

my $jobs2run = scalar @datfile;

my %childs;    # pids of currently running children

for my $i (0 .. $jobs2run - 1) {

	# Gate: once the limit is reached, block until one child finishes
	if (keys %childs >= $MaxAllowedInParrallel) {
		my $finishedpid = wait();
		delete $childs{$finishedpid};
	}

	$pid = fork();

	if ($pid) {    # parent: track the running child
		$childs{$pid} = 1;
	}
	elsif ($pid == 0) {    # child
		print LOG "Forking child for IN num: $in_num - $i\n";

		$tfile = $in_num . "_" . $i . ".tmp";

		open(TMPRES, ">$Tmpfolder/$tfile") or die "can't write $Tmpfolder/$tfile: $!";
		open(F, "gunzip -c $datfile[$i] |") or die "can't gunzip $datfile[$i]: $!";

		foreach my $line (<F>) {
			if ($line =~ m/$terms[4]/) { print TMPRES $datfile[$i], "::", $line, "\n" }
		}
		close F;

		print LOG "closed $datfile[$i]\t closing $Tmpfolder/$tfile\t$i\n";
		close TMPRES;

		exit 0;    # end child
	}
	else {
		print LOG "couldn't fork: $!\n";
	}
}

# Reap any children still running when the loop ends
while (keys %childs) {
	my $finishedpid = wait();
	delete $childs{$finishedpid};
}



Author Comment

by:mouse050297
ID: 22833705
It appears that when I run this with a large number of large files, system memory becomes exhausted. Any tips to alleviate this situation? Setting $MaxAllowedInParrallel = 2 has the same result; it just takes longer to exhaust memory.

Expert Comment

by:Adam314
ID: 22836694
Can you confirm whether it is the large number of files, or the size of the individual files, that causes the problem?

Author Comment

by:mouse050297
ID: 22846402
The routine makes a tmp log file for each archive file it searches. The routine exhausts all system memory at the same point whether I use $MaxAllowedInParrallel = 1 or $MaxAllowedInParrallel = 5. Say, for example, that it processes 32 files out of 100; that 32nd file is 30 MB uncompressed. The system has 8 GB of memory. During processing, memory use only increments; I never see a decrement. I can confirm that it is the number of files, along with their additive size, that causes the problem.

Expert Comment

by:Adam314
ID: 22847554
The loop in your child that reads the file:
    foreach my $line (<F>) {
        if ($line =~ m/$terms[4]/) {print TMPRES $datfile[$i],"::",$line,"\n"}
    }
might be better as:
    while (my $line = <F>) {
        if ($line =~ m/$terms[4]/) {print TMPRES $datfile[$i],"::",$line,"\n"}
    }
The foreach evaluates <F> in list context, so the entire uncompressed file is read into memory before the loop even starts; the while reads one line at a time.

Another possibility is that both the parent and child are using the LOG filehandle. I don't think this is a problem, but you could try not having the child use it - if you need the child to log, have it open its own log file (maybe use flock so multiple children aren't writing at the same time).
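
(A minimal sketch of that flock idea - the child_log() helper and the ./tmp/children.log path are made-up names for illustration, not part of the code above:)

    use Fcntl qw(:flock SEEK_END);

    # Append one line to a log shared by all children, holding an
    # exclusive lock so concurrent writes don't interleave.
    sub child_log {
        my ($msg) = @_;
        open(my $fh, '>>', './tmp/children.log') or die "can't open log: $!";
        flock($fh, LOCK_EX) or die "can't lock log: $!";
        seek($fh, 0, SEEK_END);    # move to the real end of file after waiting for the lock
        print $fh $msg;
        close($fh);                # closing the handle releases the lock
    }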

Author Comment

by:mouse050297
ID: 22852483
Changing the file-reading loop from a 'foreach' to a 'while' statement was a great success.
