troyd1


Pattern matching question

I have a script that does not quite work the way I want. Here is the part of the code that is a problem:

  my $file = `prg TEST.XML`;
  $file =~ s/.*(Content.*<\/html>).*/$1/gsi;

I want $file to hold everything from the first occurrence of Content onward, including Content itself. What is the correct syntax?


Example:
start of data:

This is some garbage at the top of the data

Content-type: text/html

<html><head></head><body>This is a test</body></html>

end of data

The problem with  $file =~ s/.*(Content.*<\/html>).*/$1/gsi; is that it fails in the following situations:

Example:
start of data:

This is some garbage at the top of the data

Content-type: text/html

<html><head></head><body>This is a test of Content-type: text/html</body></html>

end of data

In this example it returns only the text from the second Content onward.

The other problem is that if the </html> is not in the file, it returns everything, including the garbage.

Thanks, Troy
manav_mathur

For your first problem:

$file =~ s/.*?(Content.*<\/html>).*/$1/gsi;

I think a better solution would be

$file =~ s/.*?(Content.*)$/$1/;

Manav
The above assumes that you have no content after the </HTML>.

A few points:
- Regex quantifiers such as * are greedy by default: they match as much text as they can. To make one match as little as possible, put a ? after the quantifier (for example .*?).
- In your example I haven't seen anything after </HTML>, which is why I assumed the above. The regex above gives everything from Content to the end of the data.

Manav
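A minimal sketch of the greedy versus non-greedy difference described above. The sample data is an assumption modeled on the second example in the question, and the variable names are made up for illustration.

```perl
use strict;
use warnings;

# Sample data modeled on the second example in the question (assumed layout).
my $data = join "\n",
    'This is some garbage at the top of the data',
    '',
    'Content-type: text/html',
    '',
    '<html><head></head><body>This is a test of Content-type: text/html</body></html>',
    '';

# Greedy: the leading .* takes as much as it can, so the match backs off
# only as far as the LAST "Content".
(my $greedy = $data) =~ s/.*(Content.*<\/html>).*/$1/si;

# Non-greedy: the leading .*? takes as little as it can, so it stops at
# the FIRST "Content".
(my $lazy = $data) =~ s/.*?(Content.*<\/html>).*/$1/si;

print "greedy:\n$greedy\n\n";
print "non-greedy:\n$lazy\n";
```

The only difference between the two substitutions is the ? after the first .*, which is what decides which Content the capture starts at.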
troyd1

ASKER

Will your change return everything after the first occurrence of 'Content'? The other one would always return from the last occurrence to the </html>. Also, your assumption about the </html> being last is correct.
troyd1

ASKER

I tried your solution and it does not work: it returns everything, even the text before the first Content. Any ideas?
troyd1

ASKER

I did this and it gave me everything after the line that Content is in.

$file =~ s/.*?(Content.*)$/$1/gsi;
Hi troyd1,

The problem with your original regex is that matching is greedy: the leading .* matches as much as possible, so it runs all the way to the last Content. That is why manav added a "?", which makes the match non-greedy. The regex I would suggest is $file =~ s/.*?(Content)/$1/si; which is a little faster and more correct. (Note that manav's second regex would fail in most situations because it lacks /s, the modifier that lets . match newlines. Omitting /i could also be fatal.)

Cheers!
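A small sketch of why the missing /s matters, using made-up sample data in the same shape as the question's examples. Without /s the dot cannot cross a newline, so a pattern anchored with $ has to fit inside the last line and the substitution does nothing.

```perl
use strict;
use warnings;

# Assumed sample data: junk, a header line, then the HTML.
my $data = "garbage at the top\n\nContent-type: text/html\n\n<html><body>hi</body></html>\n";

# No /s: "." stops at newlines, so nothing matches and the copy is unchanged.
(my $without_s = $data) =~ s/.*?(Content.*)$/$1/;

# With /s: "." also matches newlines, so everything before the first
# "Content" is removed.
(my $with_s = $data) =~ s/.*?(Content)/$1/si;

print $without_s eq $data ? "without /s: unchanged\n" : "without /s: changed\n";
print "with /s:\n$with_s";
```

This is consistent with the report earlier in the thread that the version without /s returned everything, including the text before the first Content.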
Guys,
Sorry, I completely missed the s. But I don't think the i is really a good idea, because I can see the author wants to parse through standard TCP/IP headers.

Manav
Actually,
when you are under half a bottle of scotch, you tend to forget the s.  ;)

Manav

ozo
TCP/IP headers?

MIME Content-Type: headers are case insensitive.
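As a quick illustration of why /i is worth keeping, here is a sketch with a lower-case header. The sample string is an assumption, not data from the thread.

```perl
use strict;
use warnings;

# Assumed sample: the header name is written in lower case.
my $lower = "garbage\n\ncontent-type: text/html\n\n<html></html>\n";

# Without /i the literal "Content" does not match "content", so nothing changes.
(my $case_sensitive = $lower) =~ s/.*?(Content)/$1/s;

# With /i the header is found regardless of case.
(my $case_insensitive = $lower) =~ s/.*?(Content)/$1/si;

print $case_sensitive eq $lower ? "without /i: unchanged\n" : "without /i: changed\n";
print "with /i:\n$case_insensitive";
```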

Right, ozo. Thanks for the info.
troyd1,

If you have no concerns about preserving the blank lines in your file,
try this at the command prompt.

sed '/^$/d' < file5.dat | perl -00 -ne '$_ =~ s/.*?(Content)/$1/si ; print ;'
where file5.dat is your input file.

Manav
ASKER CERTIFIED SOLUTION
manav_mathur

troyd1

ASKER

I have tried more than a few of these answers and many seem to work. I will have to dig a little further. Does anyone have a good online reference for this syntax and the different flag settings?
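I just tried this using the examples and it returned the correct text:

/^.*?(Content.*(<\/html>)?).*$/si

First, as stated before, the first .*? must not be greedy, so that it catches the first Content. Second, the parentheses followed by ? around </html> make the tag optional. Note, though, that because the tag is optional, the greedy .* in front of it is free to run to the end of the string, so $1 holds everything from the first occurrence of Content to the end of the data, whether or not </html> is present. That is correct for the examples, where nothing follows </html>. If you want to stop at the first </html>, making both .*? and the tag optional will not work, because the capture can then end right after Content. An alternation that requires either the tag or the end of the string does the job:

/^.*?(Content.*?(?:<\/html>|$)).*$/si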
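One way to express "from the first Content up to the first </html>, or to the end of the data if there is no </html>" is an alternation between the tag and the end of the string. The sketch below uses a hypothetical helper named extract_content and made-up sample strings; it uses \z (absolute end of string) rather than $ so that a trailing newline is kept when no tag is found.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text from the first "Content" up to and
# including the first </html>, or up to the end of the data when no </html>
# is present. Returns undef if "Content" never appears.
sub extract_content {
    my ($text) = @_;
    return $text =~ /(Content.*?(?:<\/html>|\z))/si ? $1 : undef;
}

# Assumed sample inputs, one with a closing tag and one without.
my $with_tag    = "junk\n\nContent-type: text/html\n\n<html><body>a test of Content-type</body></html>\ntrailer\n";
my $without_tag = "junk\nContent-type: text/html\n\nno closing tag\n";

print extract_content($with_tag), "\n---\n";
print extract_content($without_tag);
```

Because the engine looks for the leftmost match, no leading .*? is needed when the pattern is used as a plain match instead of a substitution.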
troyd1

ASKER

While searching for perlre, I found this nice quick reference for re.
http://www.erudil.com/preqr.pdf