Perl regex optimization

Asked by Niversoft

I'm filtering strings out of large (20-30 MB) text files using regular expressions. However, some of the regexes I use are very slow and take nonlinear time to complete, especially those with word-boundary assertions (\b).

Basically, my script does this: open a file, read its contents into $text, and apply several regex substitutions.

A substitution such as

$text =~ s/hello//igm;

completes very quickly (under 1 second) on a 30 MB file, and the time increases linearly with file size. However, a word-boundary substitution such as

$text =~ s/hello\b//igm;

takes several seconds (10-15) to complete, and the time increases nonlinearly with file size. For a 100 MB file it took so long (and used so much RAM) that I gave up and split the file into smaller chunks.

using "study $text" doesn't change much the outcome.

See below for a simplified version of the script.

regex.txt is a list of fairly simple regular expressions, usually plain words with \b boundaries:
#######
hello\b
domain\.com
^Received\s*:.*\n
#######

Now the question: why is s/hello\b//igm so slow compared to s/hello//igm, and how can I optimize it so that it completes in linear time (and with linear memory usage)?
open my $fh, '<', $ARGV[0] or die "can't open $ARGV[0]: $!";
my $text = do { local $/; <$fh> };   # slurp the whole file into one string
close $fh;

open my $rfh, '<', 'regex.txt' or die "can't open regex.txt: $!";
my @regexes = <$rfh>;                # note: trailing newlines are kept here
close $rfh;

#study $text;

foreach my $regex (@regexes)
{
    $text =~ s/$regex//igm;
}

open my $out, '>', "$ARGV[0].filtered" or die "can't write $ARGV[0].filtered: $!";
print $out $text;
close $out;


ASKER CERTIFIED SOLUTION
ozo (United States)

[solution text available to members only]
Niversoft (asker):
In the real script, the trailing \n is removed from each element of @regexes before matching. I forgot that part in the simplified script.
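That step is just the one-liner the simplified script is missing:

chomp @regexes;   # strip the trailing newline from each pattern read from regex.txt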

So forget the script above; the question focuses on this case:

$text =~ s/hello\b//igm;
being incredibly slow compared to
$text =~ s/hello//igm;

when $text contains several MB of text.

The time saved by pre-compiling the regexes would be minimal compared to the time taken to match them. I would pre-compile them if each regex were used more than once, but that isn't the case here.
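For reference, pre-compiling would look something like this (a sketch; the /i and /m flags are baked into each qr// pattern, so only /g remains on the substitution):

my @compiled = map { chomp(my $p = $_); qr/$p/im } @regexes;

foreach my $re (@compiled) {
    $text =~ s/$re//g;   # /i and /m are already carried by the compiled pattern
}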
SOLUTION

[solution text available to members only]
Also, as ozo suggested, pre-compiling will really help. In my experience, it greatly improves performance.
This was my first approach, but unfortunately some of the regexes have to match across several lines.

I added pre-compilation to my script, and it actually reduced performance.
Pre-compiling helps when you reuse a regex several times (such as with your line-by-line approach).
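For completeness, that line-by-line variant would look roughly like this (a sketch, assuming no pattern has to match across lines, and with @compiled built via qr// as above):

open my $in,  '<', $ARGV[0]            or die "can't open $ARGV[0]: $!";
open my $out, '>', "$ARGV[0].filtered" or die "can't write output: $!";
while (my $line = <$in>) {
    foreach my $re (@compiled) {
        $line =~ s/$re//g;             # each compiled pattern is reused on every line
    }
    print $out $line;
}
close $in;
close $out;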

The question is really about the reason for the \b performance issue and how to optimize it, not about the example script's approach or anything else.
I have tried this, and I cannot measure a difference in time between
$text =~ s/hello//igm;
and
$text =~ s/hello\b//igm;
on the same text (unless there are a lot of strings matching /hello/ but not /hello\b/, in which case $text =~ s/hello\b//igm is a little faster simply because it has fewer substitutions to make).
The time difference begins to appear when the size of $text reaches 8-10 MB.
Well, I now realize the whole question is flawed. vikaskhoria's first answer is probably correct, though in my case larger logical parts, rather than single lines, should be fed to the precompiled regexes (thanks to ozo for the tip) instead of the whole big file at once (but that wasn't part of the question).
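For example, if blank lines separate those logical parts (an assumption about my data), Perl's paragraph mode would feed them to the precompiled regexes one at a time (a sketch):

open my $in,  '<', $ARGV[0]            or die "can't open $ARGV[0]: $!";
open my $out, '>', "$ARGV[0].filtered" or die "can't write output: $!";
{
    local $/ = '';                     # paragraph mode: read blank-line-separated parts
    while (my $part = <$in>) {
        foreach my $re (@compiled) {
            $part =~ s/$re//g;
        }
        print $out $part;
    }
}
close $in;
close $out;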

The \b performance question remains, but only as a rhetorical one. I will split and award points; if someone finds the reason why \b matching time and memory use become nonlinear as the text size increases, the answer would still be interesting to know.