Xponex
asked on
Regexp help - getting content between two html tags
I need to get the content between an opening and a closing H3 tag. Here is an example source:
<h3 class="post-title">
<a href="http://www.example.com">
Legal Support Services
</a>
</h3>
I was using this regexp: <h3 class="post-title">([^<]*?)</h3>
It worked fine until I realized that there were occasionally HTML tags within the H3 tags. I have changed the regexp to: <h3 class="post-title">(.*?)</h3>
But it never works.
:'(
or
<h3 class="post-title".*</h3>
Simple Perl to find center tag content
$foo = <<EOF;
<h3 class="post-title">
<a href="http://www.example.com">
Legal Support Services
</a>
</h3>
EOF
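# Repeatedly match the first <tag ...>...</tag> pair and replace $foo with
# its contents, until no tag pair is left.
while ($foo =~ /\<(\w+)(.*?)\>(.*?)\<(\/\1)\>/si)
{
    $foo = $3;    # $3 is the text between the opening and closing tag
}
print "Center match is \"$foo\"\r\n";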
ASKER
This is being coded in ASP. Thanks for the code, beezleinc, but there will be a lot more text before (and after) the H3 tag. I need to isolate just the text between the opening and closing H3 tags.
do you have nested <h3> tags?
ASKER
Never. It will always be:
Lots of html
<h3 class="post-title">
some text and MAYBE an <a> tag
</h3>
lots more html
Then "<h3 .+?>(.+?)</h3>" should work.
Make sure your regex call is case-insensitive if need be and that the match can span multiple lines. Not too familiar with ASP syntax but regex expressions are pretty universal.
You may have to escape the "/", "<" and ">" characters in the regex string, e.g. "\<h3.+?\>(.+?)\<\/h3\>"
It is not perfect: if you have multiple <h3></h3> tag sets in the input, it will just return the first match (or should).
Also watch out for additional whitespace that can break the regex if it is not accounted for, e.g. a pattern ending in "</h3>" will not match "</h3 >".
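To make the case and whitespace points concrete, here is a rough sketch in Python rather than ASP (the thread has no ASP code to build on, so treat the flag names as Python's, not VBScript's). The sample input, including the upper-case tags and the stray space in "</H3 >", is made up for illustration.

```python
import re

# Hypothetical input: upper-case tags and a stray space before the final ">".
html = 'before <H3 class="post-title">\nLegal Support Services\n</H3 > after'

# The suggested pattern as written: case-sensitive, and "." stops at newlines.
strict = re.compile(r"<h3 .+?>(.+?)</h3>")

# Same idea, but case-insensitive, dot-matches-newline, and tolerant of
# whitespace before the ">" of the closing tag.
tolerant = re.compile(r"<h3\s[^>]*>(.+?)</h3\s*>", re.IGNORECASE | re.DOTALL)

strict_match = strict.search(html)
tolerant_match = tolerant.search(html)

print(strict_match)                     # None
print(tolerant_match.group(1).strip())  # Legal Support Services
```

Escaping "<", ">" and "/" is harmless but not needed in Python; whether ASP needs it depends on how the pattern string is quoted there.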
@beezleinc
>> ASP syntax but regex expressions are pretty universal.
LOL. In which universe? ;)
You have to enable single-line mode for the dot to match newlines. Also, your pattern will not catch tags like:
<h3>...</h3>
This would be a better option:
(?s)<h3[^>]*>.+?</h3>
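A small sketch of the difference, in Python since its regex engine accepts the inline (?s) flag (whether the ASP engine does is not confirmed in this thread). The sample HTML is invented, and a capture group has been added to the pattern above so the inner text can be pulled out.

```python
import re

# Hypothetical input with a bare <h3> (no attributes) spanning several lines.
html = "<p>intro</p>\n<h3>\nPlain heading\n</h3>\n<p>outro</p>"

# "<h3 .+?>" insists on a space after "h3", so a bare <h3> is never matched.
with_space = re.search(r"<h3 .+?>(.+?)</h3>", html)

# (?s) lets "." match newlines; [^>]* accepts zero or more attribute characters.
any_h3 = re.search(r"(?s)<h3[^>]*>(.+?)</h3>", html)

print(with_space)                # None
print(any_h3.group(1).strip())   # Plain heading
```

Note that [^>]* matches every h3 regardless of its attributes, which may be broader than wanted.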
ASKER
@kaufmed
So should I have multiline on or off? And... I don't want to match ALL h3's, just the one with the class="post-title" attribute and nothing more. So the [^>]* would be counterproductive, I think...
ASKER
Ah ha! That's what I was looking for: [\s\S]
That did the trick! Thanks!
NP. Glad to help :)
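For anyone landing here later: the accepted answer is not visible, so the following is only a guess at the shape of the fix, based on the asker's mention of [\s\S]. It is written in Python rather than ASP, and the surrounding HTML is made up. The idea is that [\s\S] matches any character including newlines, so no single-line flag is needed, and the literal class="post-title" keeps other h3 tags out.

```python
import re

# Hypothetical page: one unrelated h3, then the post-title h3 from the question.
html = """<div>Lots of html</div>
<h3>Other heading</h3>
<h3 class="post-title">
<a href="http://www.example.com">
Legal Support Services
</a>
</h3>
<p>lots more html</p>"""

# No flags set: "." would stop at the first newline, but [\s\S] does not.
pattern = re.compile(r'<h3 class="post-title">([\s\S]*?)</h3>')
inner = pattern.search(html).group(1).strip()

# The asker's second attempt, for comparison: "." cannot cross the newline.
dot_only = re.search(r'<h3 class="post-title">(.*?)</h3>', html)

# Optional: drop any tags inside the heading to leave just the text.
text_only = re.sub(r"<[^>]+>", "", inner).strip()

print(dot_only)    # None
print(text_only)   # Legal Support Services
```

The lazy *? matters here: a greedy [\s\S]* would run to the last </h3> on the page instead of the first one after the opening tag.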