[Last Call] Learn how to a build a cloud-first strategyRegister Now


Perl regex optimization

Posted on 2010-01-03
Medium Priority
Last Modified: 2012-05-08
I'm filtering-out strings from large (20-30Mb) text files using regular expressions. However, some of the regex I use seems to be very slow and uses nonlinear times to complete, especially those with non-word boundaries (\b).

Basically, my script does this:
open file, read the file content in $text, apply several regex substitutions.

A regex subst such as

$text =~ s/hello//igm;

completes very quickly (<1 second) on a 30Mb file. Time increases linearly with file size. However, a non-word boundary substitution such as

$text =~ s/hello\b//igm;

takes several seconds (10-15) to complete, and time increases nonlinearly with file size. For a 100Mb file it took so long (and took so much RAM) that I gave up and split the file in smaller chunks.

using "study $text" doesn't change much the outcome.

See below for a simplified version of the script.

regex.txt is a list of quite simple regular expressions, usually simple words with \b boundaries:

Now the question: Why is s/hello\b//igm this slow compared to s/hello//igm, and how can I optimize that so it completes in linear time (and linear memory usage)?
open (FILE, $ARGV[0]) or die "can't open file";
my @lines = <FILE>;
close FILE;
my $text = join "", @lines;

open REGEX, "regex.txt";
my @regexes = <REGEX>;
close REGEX;

#study $text;

foreach my $regex (@regexes)
   $text =~ s/$regex//igm;

open OUT, ">".$ARGV[0].".filtered";
print OUT $text;
close OUT;

Open in new window

Question by:Niversoft
  • 4
  • 2
  • 2
LVL 85

Accepted Solution

ozo earned 500 total points
ID: 26167073
try pre compiling the regular expressions
my @regexes = map qr/$_/,<REGEX>;

Did you mean you have the \n at the end of each line as part of the regexes in
my @regexes = <REGEX>;

Author Comment

ID: 26167102
In the real script, the \n is removed from each @regexes before being matched. I forgot that part in the simplified script.

So forget the script above, the question focuses on the case:

$text =~ s/hello\b//igm;
being incredibly slow compared to
$text =~ s/hello//igm;

when $text contains several Mb of text.

The time saved by pre-compiling the regex would be minimal compared to the time taken to match them. I would likely pre-compile them if each regex would be used more than once, but this isn't the case here.

Assisted Solution

vikaskhoria earned 500 total points
ID: 26169023
I am not very sure about the \b (boundary case), but one thing I can suggest you in the script is that don't read entire file at once. That is usually not good, specially when you have huge files.
Better process the file line by line. Something like this pseudo code:

open input file
open output file
read line from input till EOF
substitute for this line
write new line in output file
end loop
close files.

This way your RAM usage will also be low, as it will not store the entire file in a location.
Try this if it helps and do let me know.
Independent Software Vendors: We Want Your Opinion

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!


Expert Comment

ID: 26169038
Also as suggested by ozo, pre-compiling will really help.
from my experiences, it will greatly improve performance.

Author Comment

ID: 26169050
This was my first approach, but unfortunately the regex sometimes has to match several lines.

I added pre-compliation to my script, and it actually reduces the performance.
Pre-compiling helps when you re-use a regex several times (such as if I used your line-by-line approach).

The question is really about the reason behind the \b performance issue, and how to optimize it, and not about the example script approach or anything else.
LVL 85

Expert Comment

ID: 26169212
I have tried this, and cannot measure the difference in time between
$text =~ s/hello//igm;
$text =~ s/hello\b//igm;
on the same text
(unless there are a lot if strings matching /hello/ but not /hello\b/
in which case $text =~ s/hello\b//igm is a little faster just because it has fewer substitutions to make)

Author Comment

ID: 26169230
The time difference begins to appear when the size of $text reaches 8-10Mb

Author Comment

ID: 26169463
Well, I know realizes the whole question is flawed - vikaskhoria's first answer is probably correct, though, in my case, not lines but larger, logical parts should be fed to the precompiled regexes (thanks to ozo for the tip) instead of the whole big file at once (but that wasn't part of the question)

The \b performance question remains, but only as a rhetorical question. I will split and award points, and if someone find a reason about why \b performance time and memory use becomes nonlinear when the text size increase, the answer would still be interesting to know.

Featured Post

What does it mean to be "Always On"?

Is your cloud always on? With an Always On cloud you won't have to worry about downtime for maintenance or software application code updates, ensuring that your bottom line isn't affected.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

I have been reconstructing a PHP-based application that has grown into a full blown interface system over the last ten years by a developer that has now gone into business for himself building websites. I am not incredibly fond of writing PHP code o…
A year or so back I was asked to have a play with MongoDB; within half an hour I had downloaded (http://www.mongodb.org/downloads),  installed and started the daemon, and had a console window open. After an hour or two of playing at the command …
Learn how to match and substitute tagged data using PHP regular expressions. Demonstrated on Windows 7, but also applies to other operating systems. Demonstrated technique applies to PHP (all versions) and Firefox, but very similar techniques will w…
Explain concepts important to validation of email addresses with regular expressions. Applies to most languages/tools that uses regular expressions. Consider email address RFCs: Look at HTML5 form input element (with type=email) regex pattern: T…
Suggested Courses

830 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question