Xponex

Regexp help - getting content between two html tags

I need to get the content between an opening and closing H3 tag. Here is an example source:

 <h3 class="post-title">
       <a href="http://www.example.com">
       Legal Support Services
       </a>
</h3>

I was using this regexp: <h3 class="post-title">([^<]*?)</h3>

It worked fine until I realized that there were occasionally HTML tags within the H3 tags. I have changed the regexp to: <h3 class="post-title">(.*?)</h3>

But it never works.

:'(
hprasad123

This would work for you:
<h3 .*</h3>
or
<h3 class="post-title".*</h3>
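
One thing to watch with the greedy `.*` in the patterns above: if the page has more than one H3, the match can run all the way to the last `</h3>`. Here is a rough sketch of the difference between `.*` and `.*?`, written in Python rather than ASP purely for illustration. The sample HTML is made up, and `re.DOTALL` is switched on so that the dot can cross line breaks at all.

```python
import re

# Hypothetical input with two H3 blocks (not taken from the asker's page).
html = (
    '<h3 class="post-title">\n  <a href="http://www.example.com">First</a>\n</h3>\n'
    '<p>filler</p>\n'
    '<h3 class="post-title">Second</h3>\n'
)

# re.DOTALL lets "." match newlines so both patterns can span lines.
greedy = re.search(r'<h3 .*</h3>', html, re.DOTALL)
lazy = re.search(r'<h3 .*?</h3>', html, re.DOTALL)

# Count how many closing tags each match swallowed.
print(greedy.group(0).count('</h3>'))  # greedy runs to the last </h3>
print(lazy.group(0).count('</h3>'))    # lazy stops at the first </h3>
```

If there will only ever be one H3 in the input, the greedy form may be fine, but the lazy form is the safer default.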
Simple Perl to find center tag content
$foo = <<EOF;
<h3 class="post-title">
       <a href="http://www.example.com">
       Legal Support Services
       </a>
</h3>
EOF

while ($foo =~ /\<(\w+)(.*?)\>(.*?)\<(\/\1)\>/si)
{
        $foo = $3;
}

print "Center match is \"$foo\"\r\n";

Xponex

ASKER

This is being coded in ASP. Thanks for the code beezleinc but there will be a lot more text before (and after) the H3 tag. I need to isolate the text just between the opening H3 and closing H3 tag.

Do you have nested <h3> tags?
Xponex

ASKER

Never. It will always be:

Lots of html
<h3 class="post-title">
some text and MAYBE an <a> tag
</h3>
lots more html

Then "<h3 .+?>(.+?)</h3>" should work.

Make sure your regex call is case-insensitive if need be and can span multiple lines. I'm not too familiar with ASP syntax but regex expressions are pretty universal.

You may have to escape the "/", "<" and ">" characters in the regex string, e.g. "\<h3.+?\>(.+?)\<\/h3\>"

It is not perfect, and if you have multiple <h3></h3> tag sets in the input it will just return the first match (or should).

Also watch out for additional whitespace that can screw up the regex if it is not accounted for. For example, "</h3>" will not match "</h3  >".
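
To make the case and whitespace points concrete, here is a small sketch in Python (not ASP, so the flag names will differ there). The input string is invented, and the `\s*` before the closing `>` is just one possible way to tolerate stray whitespace, not the only one.

```python
import re

# Hypothetical input: upper-case opening tag, extra spaces in the closing tag.
html = 'Lots of html\n<H3 class="post-title">\n  Some text\n</h3  >\nlots more html'

# The strict pattern needs a literal "</h3>", which this input does not contain.
strict = re.search(r'<h3 .+?>(.+?)</h3>', html, re.IGNORECASE | re.DOTALL)

# \s* before ">" tolerates whitespace in the closing tag;
# IGNORECASE handles <H3> vs <h3>; DOTALL lets ".+?" cross newlines.
tolerant = re.compile(r'<h3\s[^>]*>(.+?)</h3\s*>', re.IGNORECASE | re.DOTALL)

match = tolerant.search(html)  # search() returns only the first match
title = match.group(1).strip() if match else None

print(strict)
print(title)
```

With this input the strict pattern finds nothing, while the tolerant one captures the text between the tags.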



kaufmed
@beezleinc

>>  ASP syntax but regex expressions are pretty universal.

LOL.  In which universe?  ;)


You have to enable single-line mode for the dot to match newlines. Also, your pattern will not catch tags like:

    <h3>...</h3>

This would be a better option:
(?s)<h3[^>]*>.+?</h3>
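
A quick way to see the effect of single-line mode, sketched in Python for illustration only. Python happens to accept the inline `(?s)` flag, but not every engine does, so check what your ASP regex object supports. The sample string is based on the HTML from the question, with a bare `<h3>`.

```python
import re

# Bare <h3> with the content spread over several lines.
html = '<h3>\n  <a href="http://www.example.com">Legal Support Services</a>\n</h3>'

# Without single-line mode, "." cannot match the newline right after <h3>.
no_flag = re.search(r'<h3[^>]*>.+?</h3>', html)

# With (?s), "." matches newlines too, and [^>]* allows zero attributes.
with_flag = re.search(r'(?s)<h3[^>]*>.+?</h3>', html)

print(no_flag)
print(with_flag.group(0) == html)
```

The first search comes back empty and the second matches the whole block, which is the behaviour described above.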


Xponex

ASKER

@kaufmed

So should I have multiline on or off? And... I don't want to match ALL h3's, just the one with the class="post-title" attribute and nothing more. So the "[^>]*" would be counter-productive, I think...
ASKER CERTIFIED SOLUTION
kaufmed
This solution is only available to members.
Xponex

ASKER

Ah ha! That's what I was looking for: [\s\S]

That did the trick! Thanks!
NP. Glad to help  :)
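
For anyone landing here later: the accepted pattern itself is not shown in this thread, so the following is only a guess at its general shape, built around the `[\s\S]` trick mentioned above. It is written in Python rather than ASP, the sample HTML is assembled from the snippets in the question, and the tag-stripping step at the end is an optional extra that nobody in the thread asked for.

```python
import re

# Sample page pieced together from the question; the first <h3> is hypothetical.
html = (
    'Lots of html\n'
    '<h3>Not this one</h3>\n'
    '<h3 class="post-title">\n'
    '       <a href="http://www.example.com">\n'
    '       Legal Support Services\n'
    '       </a>\n'
    '</h3>\n'
    'lots more html'
)

# [\s\S] means "whitespace or non-whitespace", i.e. any character including
# newlines, so no single-line/dotall flag is needed.
match = re.search(r'<h3 class="post-title">([\s\S]*?)</h3>', html)
inner = match.group(1) if match else ''

# Optional: drop inner tags such as <a ...> to leave just the text.
text = re.sub(r'<[^>]+>', '', inner).strip()
print(text)
```

Because the opening tag is spelled out in full, only the H3 with class="post-title" is matched, and the lazy `*?` stops at the first closing `</h3>`.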