sprockston
asked on
Simple Perl regular expression help: Searching across a new line?
I am designing an HTML optimizer program, and I want to remove all redundant <b> and </b> tags.
When $html =
<b>foo</b> <b>bar</b>
<b>test</b><b>cool</b>
<b>foo2</b>
<b>bar2</b>
And my code snippet is:
while ($html =~ s!</b>(|\s+)<b>! !i)
{
    print $html;
}
My output looks like:
<b>foo bar</b>
<b>test cool</b>
... but why was "<b>foo2</b>
<b>bar2</b>" ignored? Doesn't the \s+ mean that it matches any whitespace: blank spaces, new lines, etc.?
ASKER
...is the way I am reading the file into a variable the reason why it doesn't work?
Assuming you left $/ at its default value of "\n" (a newline), each read from <FILE> returns a single line, so
$html will not contain
<b>foo</b> <b>bar</b>
<b>test</b><b>cool</b>
<b>foo2</b>
<b>bar2</b>
When you read the first line of <FILE>, $html will contain
<b>foo</b> <b>bar</b>
then after
$html =~ s!</b>(|\s+)<b>! !i
it will become
<b>foo bar</b>
When you read the second line of <FILE>, $html will contain
<b>test</b><b>cool</b>
then after
$html =~ s!</b>(|\s+)<b>! !i
it will become
<b>test cool</b>
When you read the third line of <FILE>, $html will contain
<b>foo2</b>
and
$html =~ s!</b>(|\s+)<b>! !i
will fail, because nothing in that line follows the </b> except the newline.
And on the last line of <FILE>, $html will contain
<b>bar2</b>
and again
$html =~ s!</b>(|\s+)<b>! !i
will fail, for the same reason.
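To see the difference for yourself, here is a minimal sketch that runs the same substitution both ways on the four lines from the question. It assumes the text is held in a string and read through an in-memory filehandle instead of a real file; the variable names are only illustrative. The pattern is written as \s* with /g, which should behave the same as looping over (|\s+).

```perl
use strict;
use warnings;

my $text = "<b>foo</b> <b>bar</b>\n<b>test</b><b>cool</b>\n<b>foo2</b>\n<b>bar2</b>\n";

# Line by line: with the default $/ each record ends at the first "\n",
# so no record ever holds a "</b>" together with the "<b>" of the next line.
open(my $in, '<', \$text) or die "Could not open in-memory file <$!>";
my $by_line = '';
while (my $line = <$in>) {
    $line =~ s!</b>\s*<b>! !gi;
    $by_line .= $line;
}
close($in);

# Whole string at once: \s* is now free to match the newline between tags.
(my $whole = $text) =~ s!</b>\s*<b>! !gi;

print "Line by line:\n$by_line\nWhole string:\n$whole";
```

In the line-by-line result the foo2 and bar2 lines come back unchanged, while in the whole-string result every </b>...<b> pair is merged, including the ones separated only by a newline.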
ASKER
open (FILE, "testfile.html") || die "Could not open the file <$!>";
while (my $html = <FILE>)
{
    # Remove all blank (empty) lines from the html file.
    $html =~ s/(^|\n)[\n\s]*/$1/g;
    while ($html =~ s!</b>(|\s+)<b>! !i)
    {
        print $html;
    }
}
close (FILE);
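One possible way around this, as a sketch only: read the whole file into $html in a single read by localizing $/ to undef, then run the substitution once over the full text. The helper names slurp_file and merge_bold are hypothetical, and a temporary file stands in for testfile.html so the example runs on its own.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: read an entire file into one scalar.
sub slurp_file {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Could not open the file <$!>";
    local $/;                 # undef record separator: one read returns the whole file
    my $content = <$fh>;
    close($fh);
    return $content;
}

# Hypothetical helper: replace "</b>", optional whitespace, "<b>" with one space.
sub merge_bold {
    my ($html) = @_;
    $html =~ s!</b>\s*<b>! !gi;
    return $html;
}

# Stand-in for testfile.html (an assumption, so the sketch is self-contained).
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
print {$tmp_fh} "<b>foo</b> <b>bar</b>\n<b>test</b><b>cool</b>\n<b>foo2</b>\n<b>bar2</b>\n";
close($tmp_fh);

my $result = merge_bold(slurp_file($tmp_path));
print $result;
```

Note that with the replacement being a single space, the newlines between adjacent tags disappear too, so all four lines end up inside one <b>...</b> pair. If the line breaks should survive, the replacement would need to keep the captured whitespace instead.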