• Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 677

Perl regex optimization

I'm filtering out strings from large (20-30 MB) text files using regular expressions. However, some of the regexes I use seem to be very slow and take nonlinear time to complete, especially those with word boundaries (\b).

Basically, my script does this:
open the file, read its content into $text, apply several regex substitutions.

A regex subst such as

$text =~ s/hello//igm;

completes very quickly (<1 second) on a 30 MB file. Time increases linearly with file size. However, a word-boundary substitution such as

$text =~ s/hello\b//igm;

takes several seconds (10-15) to complete, and time increases nonlinearly with file size. For a 100 MB file it took so long (and used so much RAM) that I gave up and split the file into smaller chunks.

using "study $text" doesn't change much the outcome.

See below for a simplified version of the script.

regex.txt is a list of quite simple regular expressions, usually simple words with \b boundaries:
#######
hello\b
domain\.com
^Received\s*:.*\n
#######

Now the question: why is s/hello\b//igm this slow compared to s/hello//igm, and how can I optimize it so that it completes in linear time (and with linear memory usage)?

# Read the whole input file into $text
open(FILE, $ARGV[0]) or die "can't open file";
my @lines = <FILE>;
close FILE;
my $text = join "", @lines;

# Read the list of regular expressions, one per line
open(REGEX, "regex.txt") or die "can't open regex.txt";
my @regexes = <REGEX>;
close REGEX;

#study $text;

# Apply every substitution to the whole text at once
foreach my $regex (@regexes)
{
   $text =~ s/$regex//igm;
}

# Write the filtered result next to the original file
open(OUT, ">" . $ARGV[0] . ".filtered") or die "can't open output";
print OUT $text;
close OUT;

Asked by: Niversoft
2 Solutions
 
ozo commented:
Try precompiling the regular expressions:
my @regexes = map qr/$_/, <REGEX>;

Did you mean to have the \n at the end of each line as part of the regexes in
my @regexes = <REGEX>;
?
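
A slightly fuller sketch of that idea, with the trailing newlines stripped before compiling (the chomp and the /i flag are assumptions based on the original script), could be:

open(REGEX, "regex.txt") or die "can't open regex.txt";
my @regexes = map { chomp; qr/$_/i } <REGEX>;   # compile each pattern once, case-insensitively
close REGEX;

foreach my $regex (@regexes) {
    $text =~ s/$regex//gm;                      # /i is already carried by each qr//
}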
 
Niversoft (Author) commented:
In the real script, the \n is removed from each element of @regexes before matching. I forgot that part in the simplified script.

So forget the script above; the question focuses on this case:

$text =~ s/hello\b//igm;
being incredibly slow compared to
$text =~ s/hello//igm;

when $text contains several MB of text.

The time saved by precompiling the regexes would be minimal compared to the time taken to match them. I would precompile them if each regex were used more than once, but that isn't the case here.
 
vikaskhoria commented:
I am not sure about the \b (boundary) case, but one thing I can suggest about the script: don't read the entire file at once. That is usually not a good idea, especially with huge files.
Better to process the file line by line, something like this pseudocode (a concrete Perl sketch follows below):

open input file
open output file
read line from input till EOF
substitute for this line
write new line in output file
end loop
close files.
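
In concrete Perl, that might look roughly like this (filenames and error handling are illustrative; @regexes is the pattern list read earlier, ideally precompiled):

open(IN,  $ARGV[0])                     or die "can't open input";
open(OUT, ">" . $ARGV[0] . ".filtered") or die "can't open output";

while (my $line = <IN>) {               # read one line at a time
    foreach my $regex (@regexes) {
        $line =~ s/$regex//ig;          # apply every substitution to this line only
    }
    print OUT $line;                    # write the filtered line immediately
}

close IN;
close OUT;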

This way your RAM usage will also stay low, since the entire file is never held in memory at once.
Try this and let me know if it helps.

 
vikaskhoria commented:
Also, as suggested by ozo, precompiling will really help.
In my experience, it greatly improves performance.
 
Niversoft (Author) commented:
This was my first approach, but unfortunately the regexes sometimes have to match across several lines.

I added precompilation to my script, and it actually reduces performance.
Precompiling helps when you reuse a regex several times (as would happen with your line-by-line approach).

The question is really about the reason for the \b performance issue and how to optimize it, not about the example script's approach or anything else.
 
ozo commented:
I have tried this, and cannot measure any difference in time between
$text =~ s/hello//igm;
and
$text =~ s/hello\b//igm;
on the same text
(unless there are a lot of strings matching /hello/ but not /hello\b/, in which case $text =~ s/hello\b//igm is a little faster simply because it has fewer substitutions to make).
 
Niversoft (Author) commented:
The time difference begins to appear when the size of $text reaches 8-10 MB.
 
Niversoft (Author) commented:
Well, I now realize the whole question is flawed. vikaskhoria's first answer is probably correct, though in my case it is not lines but larger, logical parts that should be fed to the precompiled regexes (thanks to ozo for the tip), instead of the whole big file at once (but that wasn't part of the question).

The \b performance question remains, but only as a rhetorical one. I will split and award points; if someone finds the reason why \b matching time and memory use become nonlinear as the text size increases, the answer would still be interesting to know.
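
For the record, one way to feed larger logical parts to the precompiled regexes is to read the input in paragraph mode; this is only a sketch, and it assumes the logical parts are separated by blank lines, which may not fit every input:

open(IN,  $ARGV[0])                     or die "can't open input";
open(OUT, ">" . $ARGV[0] . ".filtered") or die "can't open output";

{
    local $/ = "";                      # paragraph mode: read blank-line-separated records
    while (my $chunk = <IN>) {
        foreach my $regex (@regexes) {  # precompiled qr// patterns, as above
            $chunk =~ s/$regex//igm;
        }
        print OUT $chunk;
    }
}

close IN;
close OUT;
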
Question has a verified solution.
