  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 262

Pattern matching question

I have a script that does not quite work the way I want. Here is the part of the code that is a problem:

  my $file = `prg TEST.XML`;
  $file =~ s/.*(Content.*<\/html>).*/$1/gsi;

I want $file to have everything after the first occurrence of Content, including Content itself. What is the correct syntax?


Example:
start of data:

This is some garbage at the top of the data

Content-type: text/html

<html><head></head><body>This is a test</body></html>

end of data

The problem with  $file =~ s/.*(Content.*<\/html>).*/$1/gsi; is that it fails in the following situations:

Example:
start of data:

This is some garbage at the top of the data

Content-type: text/html

<html><head></head><body>This is a test of Content-type: text/html</body></html>

end of data

This example only returns everything after the second Content.

The other problem is if the </html> is not in the file, it returns everything including the garbage.

Thanks, Troy
0
troyd1
3 Solutions
 
manav_mathurCommented:
For your first problem:

$file =~ s/.*?(Content.*<\/html>).*/$1/gsi;

I think a better solution would be

$file =~ s/.*?(Content.*)$/$1/;

Manav
0
 
manav_mathurCommented:
The above assumes that you have no content after the </HTML>.

A few points:
- Quantifiers such as * and + are greedy by default: they match as much as they can while still letting the whole regex succeed. To make one non-greedy, put a ? right after the quantifier (.*? instead of .*).
- In your example, I haven't seen anything after </HTML>. That's why I've assumed the above. The above regex will give everything from Content to the end of the data.
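To illustrate the greedy versus non-greedy point, here is a minimal sketch. The sample string and the two sub names are hypothetical, modeled on the second example in the question; only the leading quantifier differs between the two patterns.

```perl
use strict;
use warnings;

# Hypothetical sample data modeled on the second example in the question.
my $sample = "garbage at the top\n\n"
           . "Content-type: text/html\n\n"
           . "<html><body>This is a test of Content-type: text/html</body></html>\n";

# Greedy: the leading .* runs to the end of the string and backtracks only
# as far as the LAST "Content", so $1 starts at the second occurrence.
sub from_last_content {
    my ($text) = @_;
    return $text =~ /.*(Content.*)/s ? $1 : undef;
}

# Non-greedy: the leading .*? grows one character at a time, so the
# capture starts at the FIRST "Content".
sub from_first_content {
    my ($text) = @_;
    return $text =~ /.*?(Content.*)/s ? $1 : undef;
}

print "greedy:     ", from_last_content($sample);
print "non-greedy: ", from_first_content($sample);
```

Both subs return undef when the text contains no "Content" at all, which makes the "no match" case easy to tell apart from a successful one.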

Manav
0
 
troyd1Author Commented:
Will your change return everything after the first occurrence of 'Content'? The other one would always return from the last occurrence to the </html>. Also, your assumption about the </html> being last is correct.
0
 
troyd1Author Commented:
I tried your solution and it does not work: it returns everything, even the stuff before the first Content. Any ideas?
0
 
troyd1Author Commented:
I did this and it gave me everything after the line that Content is in.

$file =~ s/.*?(Content.*)$/$1/gsi;
0
 
inq123Commented:
Hi troyd1,

The problem with your original regex is that .* is greedy and matches as much as possible; that's why manav added a "?", which makes the match non-greedy. The regex I would suggest is $file =~ s/.*?(Content)/$1/si; It is a little faster and more correct. (Note that manav's second regex would fail in most situations because it is missing the /s modifier, which lets . match newlines. Missing /i could also be fatal.)
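A small sketch of what the missing /s does in practice. The sub names and the sample string are hypothetical; the two patterns are the ones quoted in this thread. Without /s the dot cannot cross a newline and $ only matches at the end of the string, so on data like the first example the substitution never matches and the text comes back unchanged.

```perl
use strict;
use warnings;

# Applies the regex without /s to a copy of the text and returns the copy.
sub strip_without_s {
    my ($text) = @_;
    $text =~ s/.*?(Content.*)$/$1/;      # . stops at every newline
    return $text;
}

# Applies the suggested regex with /s and /i to a copy of the text.
sub strip_with_s {
    my ($text) = @_;
    $text =~ s/.*?(Content)/$1/si;       # . may cross newlines, case ignored
    return $text;
}

# Hypothetical sample modeled on the first example in the question.
my $sample = "garbage\n\nContent-type: text/html\n\n<html>x</html>\n";

print "without /s:\n", strip_without_s($sample), "\n";
print "with /s:\n",    strip_with_s($sample);
```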

Cheers!
0
 
manav_mathurCommented:
Guys,
Sorry, completely missed the s. But I don't think the i is really needed, because I can see the author wants to parse through standard TCP/IP headers.

Manav
0
 
manav_mathurCommented:
Actually,
when you are under half a bottle of scotch, you tend to forget the s.  ;)

Manav

0
 
ozoCommented:
TCP/IP headers?

MIME Content-Type: headers are case insensitive.
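Since the header name can arrive in any letter case, one possible sketch matches it with /i and also anchors it to the start of a line, so that a "Content-type:" mentioned in the middle of body text is skipped. That the header always begins a line is an assumption about the data, and the sub name is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: return everything from the first line that STARTS
# with a Content-Type header, matching the header name in any letter case.
sub from_content_type_header {
    my ($text) = @_;
    # /m lets ^ match at the start of any line, /s lets . cross newlines,
    # /i ignores case in the header name.
    return $text =~ /^(content-type:.*)/msi ? $1 : undef;
}

# Hypothetical sample: the first line only mentions the header mid-line.
my $sample = "junk that mentions Content-Type: in passing\n"
           . "content-type: text/html\n\n<html></html>\n";
print from_content_type_header($sample);
```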

0
 
manav_mathurCommented:
Right Ozo. Thanx for the info.
0
 
manav_mathurCommented:
troyd1,

If you have no concerns about preserving the blank lines in your file,
try this at the command prompt.

sed '/^$/d' < file5.dat | perl -00 -ne '$_ =~ s/.*?(Content)/$1/si ; print ;'
where file5.dat is your input file.

Manav
0
 
manav_mathurCommented:
troyd1,

Just found this one out: -0777 slurps entire files.

perl -0777 -ne 's/.*?(Content)/$1/si ; print ;'

should work for you.
Otherwise use
$file =~ s/.*?(Content)/$1/si;
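For reference, the in-script counterpart of -0777 is to undefine the input record separator $/ before reading. The sketch below is only an illustration: the slurp sub is a hypothetical helper, and an in-memory filehandle stands in for a real file. Backticks in scalar context, as in the original script, already return the whole output as one string, so this only matters when reading from a filehandle.

```perl
use strict;
use warnings;

# Hypothetical helper: read everything left on a filehandle into one string.
sub slurp {
    my ($fh) = @_;
    local $/;              # undefined record separator: one read returns it all
    return scalar <$fh>;
}

# Stand-in for real input: an in-memory filehandle over sample data.
my $data = "garbage\n\nContent-type: text/html\n\n<html></html>\n";
open my $fh, '<', \$data or die "cannot open in-memory handle: $!";
my $file = slurp($fh);
close $fh;

# Same substitution as above, now applied to the slurped string.
$file =~ s/.*?(Content)/$1/si;
print $file;
```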

Manav
0
 
psr1729Commented:
This should do what you are looking for.


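 my $file = `prg TEST.XML`;
 # An unanchored match starts at the leftmost position where it can succeed,
 # so this captures from the first occurrence of 'Content' through the end
 # of the data. (A character class such as [^Content] matches one character,
 # not a word, and Content? only makes the final 't' optional.)
 if ($file =~ /(Content.*)/si) {
     $file = $1;    # replace $file only when the match succeeded
 }
 print $file;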


OUTPUT:
Content-type: text/html

<html><head></head><body>This is a test of Content-type: text/html</body></html>


end of data
0
 
troyd1Author Commented:
I have tried more than a few of these answers and many seem to work. I will have to dig a little further. Does anyone have a good online reference for the syntax of this and the different flag settings?
0
 
ozoCommented:
perldoc perlre
0
 
godspropyCommented:
I just tried this using the examples and it returned the correct text:

/^.*?(Content.*(<\/html>)?).*$/si

First, as stated before, the first .*? must not be greedy so that it catches the first Content. Secondly, putting parentheses followed by ? around </html> makes it optional. Note that because the .* inside the group is greedy, $1 runs from the first occurrence of Content to the end of the string, whether or not </html> is present. If you want to stop at the first </html> (or at the end of the string when there is none), you can try:

/^.*?(Content.*?(?:<\/html>|\z)).*$/si
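One way to cover both situations from the original question (no </html> in the data, or no Content at all) is to try the stricter pattern first and fall back to a looser one. This is only a sketch: extract_content is a hypothetical name, and it assumes that "through the last </html>" is the wanted cut-off when the tag is present.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text from the first "Content" through the
# last </html> when the tag is present, otherwise through the end of the
# string. Returns undef when there is no "Content" at all, so the caller
# never gets the leading garbage back by accident.
sub extract_content {
    my ($text) = @_;
    return $1 if $text =~ /(Content.*<\/html>)/si;   # tag present
    return $1 if $text =~ /(Content.*)/si;           # tag missing
    return;
}

my $file   = "garbage\n\nContent-type: text/html\n\n<html>test</html>\n\ntrailer\n";
my $result = extract_content($file);
print defined $result ? $result : "no Content found", "\n";
```

Checking the return value with defined keeps a failed match from silently passing the unmodified input along, which is what happens when s/// does not match.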
0
 
troyd1Author Commented:
While searching for perlre, I found this nice quick reference for re.
http://www.erudil.com/preqr.pdf
0
