  • Status: Solved
  • Priority: Medium
  • Security: Public

Regexp help - getting content between two html tags

I need to get the content between an opening and a closing H3 tag. Here is an example source:

 <h3 class="post-title">
       <a href="http://www.example.com">
       Legal Support Services
       </a>
</h3>

I was using this regexp: <h3 class="post-title">([^<]*?)</h3>

It worked fine until I realized that there were occasionally HTML tags within the H3 tags. I have changed the regexp to: <h3 class="post-title">(.*?)</h3>

But it never works.

:'(
Xponex
Asked:
1 Solution
 
hprasad123 Commented:
This would work for you:
<h3 .*</h3>
 
hprasad123 Commented:
or
<h3 class="post-title".*</h3>
 
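beezleinc Commented:
Simple Perl to find the innermost tag content:
use strict;
use warnings;

my $foo = <<EOF;
<h3 class="post-title">
       <a href="http://www.example.com">
       Legal Support Services
       </a>
</h3>
EOF

# Peel off the outermost matching tag pair on each pass until none is left.
# /s lets . match newlines; /i ignores the case of the tag names.
while ($foo =~ /<(\w+)(.*?)>(.*?)<\/\1>/si)
{
        $foo = $3;
}

print "Center match is \"$foo\"\r\n";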

 
Xponex (Author) Commented:
This is being coded in ASP. Thanks for the code, beezleinc, but there will be a lot more text before (and after) the H3 tag. I need to isolate just the text between the opening and closing H3 tags.
 
beezleinc Commented:
Do you have nested <h3> tags?
 
Xponex (Author) Commented:
Never. It will always be:

Lots of html
<h3 class="post-title">
some text and MAYBE an <a> tag
</h3>
lots more html
 
beezleinc Commented:
Then "<h3 .+?>(.+?)</h3>" should work.

Make sure your regex call is case-insensitive if need be and can span multiple lines. Not too familiar with ASP syntax but regex expressions are pretty universal.

You may have to escape the "/", "<" and ">" characters in the regex string, i.e. "\<h3 .+?\>(.+?)\<\/h3\>"

It is not perfect: if you have multiple <h3></h3> tag sets in the input, it will just return the first match (or should).

Also watch out for additional whitespace that can break the regex if it is not accounted for, e.g. "</h3>" will not match "</h3  >"
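
The caveats above (case, line spanning, first match only, stray whitespace) can be sketched roughly as follows. This uses Python's `re` module as a stand-in for the ASP regex engine, which is an assumption; the sample `html` string and the variable names are hypothetical and only illustrate the idea.

```python
import re

# Hypothetical sample: two h3 blocks, the second in upper case and with
# stray whitespace inside its closing tag.
html = (
    '<p>intro</p>\n'
    '<h3 class="post-title">\n  First\n</h3>\n'
    '<H3 class="post-title">\n  Second\n</H3  >\n'
)

flags = re.IGNORECASE | re.DOTALL  # ignore case; let . cross newlines

# Strict closing tag: the second block is missed because of "</H3  >".
strict = re.findall(r'<h3 .+?>(.+?)</h3>', html, flags)

# \s* before the final > tolerates the extra whitespace.
tolerant = re.compile(r'<h3 .+?>(.+?)</h3\s*>', flags)

first_text = tolerant.search(html).group(1).strip()      # first match only
all_texts = [m.strip() for m in tolerant.findall(html)]  # every match
```

Here `search` stops at the first block, while `findall` collects both once the closing tag allows optional whitespace.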



 
käµfm³d 👽 Commented:
@beezleinc

>>  ASP syntax but regex expressions are pretty universal.

LOL.  In which universe?  ;)


You have to enable single-line mode for the dot to match newlines. Also, your pattern will not catch tags like:

    <h3>...</h3>

This would be a better option:
(?s)<h3[^>]*>.+?</h3>
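
A minimal sketch of the difference single-line mode makes, written with Python's `re` module as an assumed stand-in (the ASP engine may not accept the `(?s)` inline flag). The `html` string is the example from the question; the variable names are hypothetical.

```python
import re

html = (
    '<h3 class="post-title">\n'
    '  <a href="http://www.example.com">\n'
    '  Legal Support Services\n'
    '  </a>\n'
    '</h3>'
)

# Without single-line mode the dot stops at the first newline, so no match.
no_flag = re.search(r'<h3[^>]*>.+?</h3>', html)

# With (?s) the dot also matches newlines and the whole element is found.
with_flag = re.search(r'(?s)<h3[^>]*>.+?</h3>', html)

# [^>]* also accepts an h3 tag that has no attributes at all.
bare = re.search(r'(?s)<h3[^>]*>.+?</h3>', '<h3>plain</h3>')
```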

 
Xponex (Author) Commented:
@kaufmed

So should I have multiline on or off? And... I don't want to match ALL h3's, just the one with the class="post-title" attribute and nothing more. So the "[^>]*" would be counter-productive, I think...
 
käµfm³d 👽 Commented:
I believe you will need to set Global to true.

Multiline affects the behavior of ^ and $ in a regex. It will not benefit you here.

I think you are using the 5.5 regex library in your code. There is no dot-matches-newline option that I can see. You can work around this with the pattern below, where [\s\S] matches any character including a newline (the doubled quotes are the VBScript escape for a quote inside a string literal):
<h3\s+class=""post-title"">[\s\S]+?</h3>
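
As a rough illustration of the [\s\S] workaround, here is a sketch in Python's `re` module, assumed here as a stand-in for the ASP engine. The capture group, the tag-stripping step, and the sample `html` string are additions for illustration and are not part of the pattern above.

```python
import re

# Hypothetical page: surrounding markup, an unrelated h3, then the target h3.
html = (
    'lots of html\n'
    '<h3>Other heading</h3>\n'
    '<h3 class="post-title">\n'
    '  <a href="http://www.example.com">\n'
    '  Legal Support Services\n'
    '  </a>\n'
    '</h3>\n'
    'lots more html'
)

# No flags needed: [\s\S] matches any character, newlines included.
# The parentheses capture only what sits between the two tags.
pattern = re.compile(r'<h3\s+class="post-title">([\s\S]+?)</h3>')

match = pattern.search(html)
inner = match.group(1).strip()  # still contains the <a> element

# Optional: drop any remaining tags to keep just the text.
text_only = re.sub(r'<[^>]+>', '', inner).strip()
```

The unrelated `<h3>` is skipped because the pattern requires the class attribute, and the lazy `+?` stops at the first closing tag.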

 
Xponex (Author) Commented:
Ah ha! That's what I was looking for: [\s\S]

That did the trick! Thanks!
 
käµfm³d 👽 Commented:
NP. Glad to help  :)