I'm filtering out strings from large (20-30 MB) text files using regular expressions. However, some of the regexes I use seem to be very slow and take nonlinear time to complete, especially those with word boundaries (\b).
Basically, my script does this:
open the file, read its contents into $text, and apply several regex substitutions.
A regex substitution such as
$text =~ s/hello//igm;
completes very quickly (under 1 second) on a 30 MB file, and the time increases linearly with file size. However, a word-boundary substitution such as
$text =~ s/hello\b//igm;
takes several seconds (10-15) to complete, and the time increases nonlinearly with file size. For a 100 MB file it took so long (and used so much RAM) that I gave up and split the file into smaller chunks.
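For what it's worth, this is roughly how I timed it (Time::HiRes is core Perl; the synthetic input here is made-up word soup, not my real data):

use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# Synthetic ~30 MB body of text (assumption: repeated word soup is
# representative enough to show the timing gap).
my $text = "hello world foo bar baz " x 1_250_000;

for my $pattern ('hello', 'hello\b') {
    my $copy = $text;                 # substitute on a fresh copy each time
    my $t0   = [gettimeofday];
    $copy =~ s/$pattern//igm;
    printf "s/%s//igm: %.2f s\n", $pattern, tv_interval($t0);
}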
using "study $text" doesn't change much the outcome.
See below for a simplified version of the script.
regex.txt is a list of quite simple regular expressions, usually simple words with \b boundaries.
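For illustration, a few made-up but representative entries:

hello\b
world\b
goodbye\b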
Now the question: why is s/hello\b//igm so slow compared to s/hello//igm, and how can I optimize it so that it completes in linear time (and with linear memory usage)?
open my $fh, '<', $ARGV[0] or die "can't open $ARGV[0]: $!";
my $text = do { local $/; <$fh> };    # slurp the whole file
close $fh;

open my $regex_fh, '<', 'regex.txt' or die "can't open regex.txt: $!";
chomp(my @regexes = <$regex_fh>);     # drop trailing newlines from the patterns
close $regex_fh;

foreach my $regex (@regexes) {
    $text =~ s/$regex//igm;
}

open my $out, '>', "$ARGV[0].filtered" or die "can't open $ARGV[0].filtered: $!";
print $out $text;
close $out;
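For completeness, this is the shape of the splitting workaround mentioned above, taken to the per-line extreme. It keeps memory usage flat, but it assumes no pattern ever needs to match across a line boundary (true for my word lists):

use strict;
use warnings;

# Load and precompile the patterns once (chomp strips the trailing
# newline that would otherwise become part of each pattern).
open my $regex_fh, '<', 'regex.txt' or die "can't open regex.txt: $!";
chomp(my @regexes = <$regex_fh>);
close $regex_fh;
my @compiled = map { qr/$_/i } @regexes;

open my $in,  '<', $ARGV[0]            or die "can't open $ARGV[0]: $!";
open my $out, '>', "$ARGV[0].filtered" or die "can't open output: $!";
while (my $line = <$in>) {
    $line =~ s/$_//g for @compiled;    # /i is baked into the qr// above
    print $out $line;
}
close $in;
close $out;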