Pattern matching question

I have a script that does not quite work the way I want. Here is the part of the code that is a problem:

my $file = `prg TEST.XML`;
$file =~ s/.*(Content.*<\/html>).*/$1/gsi;

I want $file to have everything from the first occurrence of Content onward, including Content itself. What is the correct syntax?

Example:
start of data:

This is some garbage at the top of the data

Content-type: text/html

<html><head></head><body>This is a test</body></html>

end of data

The problem with $file =~ s/.*(Content.*<\/html>).*/$1/gsi; is that it fails in the following situations:

Example:
start of data:

This is some garbage at the top of the data

Content-type: text/html

<html><head></head><body>This is a test of Content-type: text/html</body></html>

end of data

This example only returns everything after the second Content.

The other problem is if the </html> is not in the file, it returns everything including the garbage.

Thanks, Troy

manav_mathur

To your first problem,

$file =~ s/.*?(Content.*<\/html>).*/$1/gsi;

I think a better solution would be

$file =~ s/.*?(Content.*)$/$1/;

Manav

manav_mathur

The above is assuming that you have no content after the </HTML>

A few points:
- A regex quantifier such as * is greedy: it always matches the longest string it can grab. To override this, put a ? after the quantifier (.*? instead of .*).
- In your example, I haven't seen anything after </HTML>. That's why I've assumed the above. The above regex will give everything from Content to the end of the data.
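To make the greedy versus non-greedy point concrete, here is a minimal sketch run against data modelled on the second example in the question. The variable names ($data, $greedy, $lazy) are only for illustration and are not from the original script.

```perl
use strict;
use warnings;

# Sample data modelled on the second example in the question.
my $data = join "\n",
    'This is some garbage at the top of the data',
    '',
    'Content-type: text/html',
    '',
    '<html><head></head><body>This is a test of Content-type: text/html</body></html>',
    '';

# Greedy: the leading .* grabs as much as it can, so the capture starts
# at the LAST "Content" that is still followed by </html>.
my ($greedy) = $data =~ /.*(Content.*<\/html>).*/si;

# Non-greedy: the leading .*? grabs as little as it can, so the capture
# starts at the FIRST "Content".
my ($lazy) = $data =~ /.*?(Content.*<\/html>).*/si;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

The only difference between the two patterns is the ? after the first .* quantifier.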

Manav

troyd1

ASKER

Will your change return everything after the first occurrence of 'Content'? The other one would always return from the last occurrence to the </html>. Also, your assumption about the </html> being last is correct.

troyd1

ASKER

I tried your solution and it does not work; it returns everything, even the stuff before the first Content. Any ideas?

troyd1

ASKER

I did this and it gave me everything after the line that Content is in.

$file =~ s/.*?(Content.*)$/$1/gsi;

inq123

Hi troyd1,

The problem with your original regex is that regex matching is greedy and matches as much as possible; that's why manav added a "?", which makes the match non-greedy. The regex I would suggest is $file =~ s/.*?(Content)/$1/si; which is a little faster and more correct. (Note that manav's second regex would fail in most situations because it is missing /s, which lets . match newlines so the match can span several lines. Missing /i could also be fatal.)
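As a rough illustration of what each flag contributes, the sketch below applies the suggested substitution with both flags, then without /s, then without /i. The sample string (with a lowercase header) and the variable names are made up for this demonstration.

```perl
use strict;
use warnings;

# Hypothetical input: a lowercase header, then a capitalised "Content" in the body.
my $data = "garbage line\n\ncontent-type: text/html\n\n"
         . "<html><body>Content here</body></html>\n";

# With /s and /i: everything before the first "content" (any case) is removed.
(my $both = $data) =~ s/.*?(Content)/$1/si;

# Without /s, "." cannot cross a newline, so the garbage on earlier lines is
# never consumed; the match is just the word itself and nothing changes.
(my $no_s = $data) =~ s/.*?(Content)/$1/i;

# Without /i, the lowercase header is skipped and the match lands on the
# capitalised "Content" inside the body instead.
(my $no_i = $data) =~ s/.*?(Content)/$1/s;

print "both flags:\n$both\n";
print "no /s:\n$no_s\n";
print "no /i:\n$no_i\n";
```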

Cheers!

manav_mathur

Guys,
Sorry, completely missed the s. But I don't think the i is really nice, because I can see the author wants to parse through standard TCP/IP headers.

Manav

manav_mathur

Actually,
when you are under half a bottle of scotch, you tend to forget the s. ;)

Manav

ozo

TCP/IP headers?

MIME Content-Type: headers are case insensitive.

manav_mathur

Right Ozo. Thanx for the info.

manav_mathur

troyd1,

If you have no concerns about preserving the blank lines in your file,
try this at the command prompt.

sed '/^$/d' < file5.dat | perl -00 -ne '$_ =~ s/.*?(Content)/$1/si ; print ;'
where file5.dat is your input file.
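If the blank lines do matter, the same substitution can be done inside one Perl script by reading the whole input as a single string. This is only a sketch: the in-memory filehandle and the sample text stand in for file5.dat so the snippet runs on its own.

```perl
use strict;
use warnings;

# Stand-in for the contents of file5.dat (hypothetical sample data).
my $raw = "garbage\n\nContent-type: text/html\n\n<html><body>test</body></html>\n";

# An in-memory filehandle keeps the sketch self-contained; with a real file
# you would open its path here instead of \$raw.
open my $fh, '<', \$raw or die "cannot open in-memory handle: $!";

# Undefining $/ (the input record separator) makes <$fh> return the whole
# input as one string, blank lines included.
my $file = do { local $/; <$fh> };
close $fh;

# Drop everything before the first "Content", case-insensitively.
$file =~ s/.*?(Content)/$1/si;

print $file;
```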

Manav

ASKER CERTIFIED SOLUTION

manav_mathur

SOLUTION

psr1729

troyd1

ASKER

I have tried more than a few of these answers and many seem to work. I will have to dig a little further. Does anyone have a good online reference for the syntax of this and the different flag settings?

SOLUTION

ozo

godspropy

I just tried this using the examples and it returned the correct text:

/^.*?(Content.*(<\/html>)?).*$/si

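First, as stated before, the first .*? must be non-greedy so that it stops at the first Content. Second, the parentheses followed by ? around </html> make it optional. Be aware that because the .* in front of it is greedy and /s lets . match newlines, that .* already runs to the end of the string and the optional group matches nothing, so $1 is everything from the first occurrence of Content to the end of the string. That is the right text here because </html> is the last thing in the data. If you only want to get to the first </html> (or to the end of the string when there is none), an optional group after a non-greedy .*? will not do it, because .*? would then match nothing and $1 would be just Content. Use an alternation instead:

/^.*?(Content.*?(?:<\/html>|\z)).*$/si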
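Here is a minimal sketch of the "first Content up to the first </html>, or to the end of the string when there is no closing tag" behaviour. It spells the optional ending as an alternation, (?:<\/html>|\z). The sample strings and variable names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical inputs: one with closing tags, one without any.
my $with_close = "junk\nContent-type: text/html\n\n<html>a</html>\ntrailer</html>\n";
my $no_close   = "junk\nContent-type: text/html\n\n<html>a\n";

# Both .*? are non-greedy. The alternation means "stop at the first </html>,
# or failing that at the very end of the string (\z)".
my $re = qr/^.*?(Content.*?(?:<\/html>|\z))/si;

my ($first)  = $with_close =~ $re;   # ends at the first </html>
my ($to_end) = $no_close   =~ $re;   # no </html>, so runs to the end

print "with </html>:\n$first\n";
print "without </html>:\n$to_end";
```

To keep only the captured text in $file, the same pattern can be used in a substitution with a trailing .* and $1 as the replacement.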

troyd1

ASKER

While searching for perlre, I found this nice quick reference for re.
http://www.erudil.com/preqr.pdf