troyd1
asked on
Pattern matching question
I have a script that does not quite work the way I want. Here is the part of the code that is a problem:
my $file = `prg TEST.XML`;
$file =~ s/.*(Content.*<\/html>).*/$1/gsi;
I want $file to have everything after the first occurrence of Content, including Content itself. What is the correct syntax?
Example:
start of data:
This is some garbage at the top of the data
Content-type: text/html
<html><head></head><body>This is a test</body></html>
end of data
The problem with $file =~ s/.*(Content.*<\/html>).*/$1/gsi; is that it fails in the following situations:
Example:
start of data:
This is some garbage at the top of the data
Content-type: text/html
<html><head></head><body>This is a test of Content-type: text/html</body></html>
end of data
This example only returns everything after the second Content.
The other problem is if the </html> is not in the file, it returns everything including the garbage.
Thanks, Troy
The above is assuming that you have no content after the </HTML>
A few points:
- Regex quantifiers such as * are greedy by default: they match the longest string they can. To override this, put a ? after the quantifier (for example .*?).
- In your example, I haven't seen anything after </HTML>. That's why I've assumed the above. The above regex will give everything from Content to the end of the data.
Manav
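To illustrate the greedy versus non-greedy point, here is a minimal sketch. The sub names from_last_content and from_first_content and the sample string are made up for illustration; they are not from the original script.

```perl
use strict;
use warnings;

# Greedy .* swallows as much as possible, so the match backs up only
# to the LAST "Content" in the string.
sub from_last_content {
    my ($text) = @_;
    (my $copy = $text) =~ s/.*(Content.*)/$1/s;
    return $copy;
}

# Non-greedy .*? stops as early as possible, i.e. at the FIRST "Content".
sub from_first_content {
    my ($text) = @_;
    (my $copy = $text) =~ s/.*?(Content.*)/$1/s;
    return $copy;
}

# Hypothetical sample data with two occurrences of "Content".
my $sample = "garbage\nContent-type: text/html\n<body>test of Content-type</body>\n";

print "greedy:     ", from_last_content($sample);
print "non-greedy: ", from_first_content($sample);
```

The only difference between the two subs is the ? after the first .* quantifier.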
ASKER
Will your change return everything after the first occurrence of 'Content'? The other one would always return from the last occurrence to the </html>. Also, your assumption about the </html> being last is correct.
ASKER
I tried your solution and it does not work: it returns everything, even the stuff before the first Content. Any ideas?
ASKER
I did this and it gave me everything after the line that Content is in.
$file =~ s/.*?(Content.*)$/$1/gsi;
Hi troyd1,
The problem with your original regex is that regex matching is greedy and matches as much as possible; that's why manav added a "?", which stops the greedy match. The regex I would suggest is $file =~ s/.*?(Content)/$1/si; It is a little faster and more correct. (Note that manav's second regex would fail in most situations because it lacks /s, which lets . match newlines so the match can span multiple lines. Missing /i could also be fatal.)
Cheers!
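As a rough check of the s/.*?(Content)/$1/si suggestion, the sketch below applies it to data shaped like the second example in the question. The sub name strip_before_content is hypothetical, and the sample text is assumed to end with a newline after </html>.

```perl
use strict;
use warnings;

# Drop everything before the first "Content" (any case). Input that
# contains no "Content" at all is returned unchanged.
sub strip_before_content {
    my ($text) = @_;
    $text =~ s/.*?(Content)/$1/si;
    return $text;
}

my $example = join "\n",
    "This is some garbage at the top of the data",
    "Content-type: text/html",
    "<html><head></head><body>This is a test of Content-type: text/html</body></html>",
    "";

print strip_before_content($example);

# Without /s the dot cannot cross a newline, so the match can only
# begin at "Content" itself and nothing in front of it is removed.
(my $no_s = $example) =~ s/.*?(Content)/$1/i;
print $no_s;
```

The second print shows the garbage line still in place, which is the failure mode described for a pattern without /s.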
Guys,
Sorry, completely missed the s. But I don't think the i is really a good idea, because I can see the author wants to parse through standard TCP/IP headers.
Manav
Actually,
when you are under half a bottle of scotch, you tend to forget the s. ;)
Manav
TCP/IP headers?
MIME Content-Type: headers are case insensitive.
Right, Ozo. Thanks for the info.
troyd1,
If you have no concerns about preserving the blank lines in your file,
try this at the command prompt.
sed '/^$/d' < file5.dat | perl -00 -ne '$_ =~ s/.*?(Content)/$1/si ; print ;'
where file5.dat is your input file.
Manav
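If you would rather stay inside Perl, something along these lines should behave roughly like the sed and perl -00 pipeline. This is a sketch under assumptions: content_from_handle is a made-up name, and an in-memory filehandle stands in for file5.dat.

```perl
use strict;
use warnings;

# Read a whole filehandle as one string, drop empty lines, then strip
# everything before the first "Content" (any case).
sub content_from_handle {
    my ($fh) = @_;
    local $/;                      # undef record separator: slurp mode
    my $data = <$fh>;
    return '' unless defined $data;
    $data =~ s/^\n//mg;            # remove empty lines, like sed '/^$/d'
    $data =~ s/.*?(Content)/$1/si;
    return $data;
}

# Hypothetical input; replace the in-memory handle with a real file as needed.
my $input = "garbage\n\nContent-type: text/html\n\n<html></html>\n";
open my $fh, '<', \$input or die "cannot open in-memory handle: $!";
print content_from_handle($fh);
close $fh;
```

Slurping the input in one read means the /s flag can see across line boundaries without relying on paragraph mode.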
ASKER
I have tried more than a few of these answers and many seem to work. I will have to dig a little further. Does anyone have a good online reference for this syntax and the different flag settings?
I just tried this using the examples and it returned the correct text:
/^.*?(Content.*(<\/html>)?).*$/si
First, as stated before, the first .*? must not be greedy, so that it catches the first Content. Secondly, adding parentheses followed by ? around </html> makes it optional while still including it in $1. This will get everything after the first occurrence of Content up to the last occurrence of </html> or the end of the string. If you want to only get to the first </html> you can try:
/^.*?(Content.*?(<\/html>)?).*$/si
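It may be worth checking what $1 actually holds for these two patterns when there is text after </html>. In the sketch below (capture, the sample data, and the third pattern are illustrative assumptions, not from the thread), the greedy inner .* runs to the end of the string, so the optional group matches nothing; with the lazy inner .*? the optional group can also match nothing, which leaves $1 as just "Content". One possible alternative is to spell out the end condition with an alternation.

```perl
use strict;
use warnings;

# Return $1 for a pattern applied to $text, or undef when it fails.
sub capture {
    my ($pattern, $text) = @_;
    return $text =~ $pattern ? $1 : undef;
}

# Hypothetical data with a trailer line after </html>.
my $data = "garbage\nContent-type: text/html\n<html>a</html>\ntrailer\n";

my $greedy_inner = qr/^.*?(Content.*(<\/html>)?).*$/si;
my $lazy_inner   = qr/^.*?(Content.*?(<\/html>)?).*$/si;

# Assumed alternative: stop at the first </html>, or at the end of the
# string (\z) when no </html> is present.
my $explicit_end = qr/^.*?(Content.*?(?:<\/html>|\z))/si;

for my $pattern ($greedy_inner, $lazy_inner, $explicit_end) {
    my $got = capture($pattern, $data);
    print(defined($got) ? "[$got]\n" : "[no match]\n");
}
```

When </html> really is the last thing in the data, the first pattern and the alternation differ only in trailing whitespace, which matches the asker's stated assumption.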
ASKER
While searching for perlre, I found this nice quick reference for re.
http://www.erudil.com/preqr.pdf
$file =~ s/.*?(Content.*<\/html>).*
I think a better solution would be
$file =~ s/.*?(Content.*)$/$1/;
Manav