Pattern matching question

I have a script that does not quite work the way I want. Here is the part of the code that is a problem:

  my $file = `prg TEST.XML`;
  $file =~ s/.*(Content.*<\/html>).*/$1/gsi;

I want $file to contain everything from the first occurrence of Content onward, including Content itself. What is the correct syntax?


Example:
start of data:

This is some garbage at the top of the data

Content-type: text/html

<html><head></head><body>This is a test</body></html>

end of data

The problem with  $file =~ s/.*(Content.*<\/html>).*/$1/gsi; is that it fails in the following situations:

Example:
start of data:

This is some garbage at the top of the data

Content-type: text/html

<html><head></head><body>This is a test of Content-type: text/html</body></html>

end of data

In this example, it returns only the text from the second Content onward.

The other problem is if the </html> is not in the file, it returns everything including the garbage.

Thanks, Troy
troyd1Asked:
manav_mathurCommented:
To  your first problem,

$file =~ s/.*?(Content.*<\/html>).*/$1/gsi;

I think a better solution would be

$file =~ s/.*?(Content.*)$/$1/;

Manav
0
manav_mathurCommented:
The above assumes that you have no content after the </HTML>.

A few points:
- Quantifiers such as * are greedy by default: they match the longest string they can. To override this, put a ? after the quantifier (for example .*?).
- In your example, I haven't seen anything after </HTML>. That's why I've assumed the above. The above regex will give everything from Content to the end of the data.
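
A minimal sketch of the greedy versus non-greedy difference described in the points above. The sample string and variable names are made up for illustration and are not from the original script.

```perl
use strict;
use warnings;

# Hypothetical sample with two occurrences of "Content".
my $text = "junk Content-one junk Content-two";

# Greedy: .* grabs as much as it can, so the capture starts at the LAST "Content".
(my $greedy = $text) =~ s/.*(Content.*)/$1/;

# Non-greedy: .*? grabs as little as it can, so the capture starts at the FIRST "Content".
(my $lazy = $text) =~ s/.*?(Content.*)/$1/;

print "greedy: $greedy\n";   # greedy: Content-two
print "lazy:   $lazy\n";     # lazy:   Content-one junk Content-two
```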

Manav
0
troyd1Author Commented:
Will your change return everything after the first occurrence of 'Content'? The other one would always return from the last occurrence to the </html>. Also, your assumption about the </html> being last is correct.
0
troyd1Author Commented:
I tried your solution and it does not work; it returns everything, even the stuff before the first Content. Any ideas?
0
troyd1Author Commented:
I did this and it gave me everything after the line that Content is in.

$file =~ s/.*?(Content.*)$/$1/gsi;
0
inq123Commented:
Hi troyd1,

The problem with your original regex is that the match is greedy and takes as much as possible; that's why manav added a "?", which makes the match non-greedy. The regex I would suggest is $file =~ s/.*?(Content)/$1/si; It is a little faster and more correct. (Note that manav's second regex would fail in most situations because it lacks /s, which lets . match across newlines. Missing /i could also be fatal.)
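
To illustrate what /s and /i change, here is a small sketch. The sample data is modelled on the question, with the header lower-cased on purpose; the variable names are assumptions made for this example.

```perl
use strict;
use warnings;

# Sample data modelled on the question; the header is lower-cased on purpose.
my $data = "garbage at the top\n\ncontent-type: text/html\n\n"
         . "<html>test of Content-type</html>\n";

# Without /s, . cannot cross a newline, and without /i "content" is not matched,
# so only part of the last line is rewritten and the garbage stays.
(my $no_flags = $data) =~ s/.*?(Content)/$1/;

# With /s and /i the match crosses lines and ignores case,
# so everything before the first header is removed.
(my $with_flags = $data) =~ s/.*?(Content)/$1/si;

print $no_flags;
print "----\n";
print $with_flags;
```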

Cheers!
0
manav_mathurCommented:
Guys,
Sorry, completely missed the s. But I don't think the i is really appropriate, because I can see the author wants to parse through standard TCP/IP headers.

Manav
0
manav_mathurCommented:
Actually,
when you are under half a bottle of scotch, you tend to forget the s.  ;)

Manav

0
ozoCommented:
TCP/IP headers?

MIME Content-Type: headers are case insensitive.

0
manav_mathurCommented:
Right Ozo. Thanx for the info.
0
manav_mathurCommented:
troyd1,

If you have no concerns about preserving the blank lines in your file,
try this at the command prompt.

sed '/^$/d' < file5.dat | perl -00 -ne '$_ =~ s/.*?(Content)/$1/si ; print ;'
where file5.dat is your input file.

Manav
0
manav_mathurCommented:
troyd1,

Just found this one: -0777 slurps entire files.

perl -0777 -ne 's/.*?(Content)/$1/si ; print ;'

should work for you.
Otherwise use
$file =~ s/.*?(Content)/$1/si;
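
For use inside a script rather than on the command line, the same slurping can be sketched as follows. The in-memory filehandle and sample text are stand-ins for the real input and are assumptions made for this example.

```perl
use strict;
use warnings;

# Stand-in for the real input; an in-memory filehandle keeps the sketch self-contained.
my $raw = "garbage\n\nContent-type: text/html\n\n<html>body</html>\n";
open my $fh, '<', \$raw or die "cannot open in-memory file: $!";

# Localizing $/ (the input record separator) leaves it undefined inside the block,
# so <$fh> returns the whole file at once, as -0777 does on the command line.
my $file = do { local $/; <$fh> };
close $fh;

# Drop everything before the first "Content" (any case, across lines).
$file =~ s/.*?(Content)/$1/si;
print $file;
```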

Manav
0

psr1729Commented:
This should do what you are looking for.


 my $file = `prg TEST.XML`;
 # Capture everything from the first occurrence of 'Content' to the end of the data.
 # The non-greedy .*? stops at the first 'Content'; /s lets . match newlines, /i ignores case.
 if ($file =~ /.*?(Content.*)/si) {
     # Replace $file with the captured text only when the match succeeded
     $file = $1;
 }
 print $file;


OUTPUT:
Content-type: text/html

<html><head></head><body>This is a test of Content-type: text/html</body></html>


end of data
0
troyd1Author Commented:
I have tried more than a few of these answers and many seem to work. I will have to dig a little further. Does anyone have a good online reference for this syntax and the different flag settings?
0
ozoCommented:
perldoc perlre
0
godspropyCommented:
I just tried this using the examples and it returned the correct text:

/^.*?(Content.*(<\/html>)?).*$/si

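First, as stated before, the leading .*? must be non-greedy so that it stops at the first Content. Second, the parentheses followed by ? around </html> make it optional. Note that because the .* before it is greedy and /s lets . match newlines, $1 in practice runs from the first occurrence of Content to the end of the string, whether or not </html> is present. If you only want to go up to the first </html> (or to the end of the string when there is none), an optional group is not enough, because a non-greedy .*? followed by an optional group can match nothing at all and leave just Content in $1. Use an alternation instead:

/^.*?(Content.*?(?:<\/html>|\z))/si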
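
As a rough sketch of stopping at the first </html> while still coping with input that has no closing tag, one option is an alternation between </html> and \z (end of string). The helper name from_content and the sample strings are hypothetical, chosen only for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text from the first "Content" up to the
# first </html>, or up to the end of the string when </html> is absent.
# Returns undef when "Content" does not occur at all.
sub from_content {
    my ($text) = @_;
    return $text =~ /(Content.*?(?:<\/html>|\z))/si ? $1 : undef;
}

my $with_html    = "junk\nContent-type: text/html\n\n<html>a Content b</html>\ntrailer\n";
my $without_html = "junk\nContent-type: text/html\n\nno closing tag\n";

print from_content($with_html), "\n";
print from_content($without_html);
```

In the first call the trailing "trailer" line is dropped; in the second the capture simply runs to the end of the data.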
0
troyd1Author Commented:
While searching for perlre, I found this nice quick reference for re.
http://www.erudil.com/preqr.pdf
0
Question has a verified solution.