Solved

Regexp help - getting content between two html tags

Posted on 2010-09-22
Last Modified: 2012-05-10
I need to get the content between an opening and a closing H3 tag. Here is an example source:

 <h3 class="post-title">
       <a href="http://www.example.com">
       Legal Support Services
       </a>
</h3>

I was using this regexp: <h3 class="post-title">([^<]*?)</h3>

It worked fine until I realized that there were occasionally HTML tags within the H3 tags. I have changed the regexp to: <h3 class="post-title">(.*?)</h3>

But it never works.

:'(
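
To see why both attempts come up empty, here is a minimal sketch. It uses Python's `re` module purely for illustration (an assumption; the question's code is ASP), and the variable names are made up for the example. The first pattern cannot cross the nested `<a>` tag, and the second fails because `.` does not match a newline by default.

```python
import re

html = """<h3 class="post-title">
       <a href="http://www.example.com">
       Legal Support Services
       </a>
</h3>"""

# First attempt: [^<]*? stops at the "<" of the nested <a> tag,
# so the closing </h3> is never reached and there is no match.
no_tags = re.search(r'<h3 class="post-title">([^<]*?)</h3>', html)

# Second attempt: by default "." does not match a newline, and the
# heading content spans several lines, so this fails as well.
dot_only = re.search(r'<h3 class="post-title">(.*?)</h3>', html)

print(no_tags, dot_only)
```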
Question by:Xponex
12 Comments
 

Expert Comment

by:hprasad123
ID: 33736879
This would work for you:
<h3 .*</h3>
 

Expert Comment

by:hprasad123
ID: 33736896
or
<h3 class="post-title".*</h3>
 
LVL 3

Expert Comment

by:beezleinc
ID: 33736960
Simple Perl to find center tag content
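use strict;
use warnings;

my $foo = <<EOF;
<h3 class="post-title">
       <a href="http://www.example.com">
       Legal Support Services
       </a>
</h3>
EOF

# Peel off one tag pair per pass until no tags remain.
# /s lets "." match newlines; /i ignores case; \1 is the opening tag name.
while ($foo =~ /<(\w+)(.*?)>(.*?)<(\/\1)>/si)
{
        $foo = $3;
}

# The result still carries the surrounding whitespace from the source.
print "Center match is \"$foo\"\n";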

 

Author Comment

by:Xponex
ID: 33737035
This is being coded in ASP. Thanks for the code beezleinc but there will be a lot more text before (and after) the H3 tag. I need to isolate the text just between the opening H3 and closing H3 tag.
 
LVL 3

Expert Comment

by:beezleinc
ID: 33737105
Do you have nested <h3> tags?
 

Author Comment

by:Xponex
ID: 33737125
Never. It will always be:

Lots of html
<h3 class="post-title">
some text and MAYBE an <a> tag
</h3>
lots more html
 
LVL 3

Expert Comment

by:beezleinc
ID: 33737253
Then "<h3 .+?>(.+?)</h3>" should work.

Make sure your regex call is case-insensitive, if need be, and can span multiple lines. Not too familiar with ASP syntax but regex expressions are pretty universal.

You may have to escape the "/", "<" and ">" characters in the regex string, i.e. "\<h3.+?\>(.+?)\<\/h3\>"

It is not perfect, and if you have multiple <h3></h3> tag sets in the input it will just return the first match (or should).

Also watch out for additional whitespace that can break the regex if it is not accounted for, e.g. "</h3>" will not match "</h3  >".
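
The advice above can be sketched as follows. This is only an illustration in Python's `re` module (an assumption; it is not the ASP regex engine), with a made-up sample string. The flags stand in for "case-insensitive" and "can span multiple lines", and `\s*` before the final `>` is one possible way to tolerate stray whitespace in the closing tag.

```python
import re

html = """<p>Lots of html</p>
<H3 class="post-title">
  some text and MAYBE an <a href="#">tag</a>
</H3  >
<p>lots more html</p>"""

# re.IGNORECASE handles <H3> vs <h3>; re.DOTALL lets "." cross newlines;
# \s* before ">" tolerates whitespace such as "</h3  >".
pattern = re.compile(r'<h3 .+?>(.+?)</h3\s*>', re.IGNORECASE | re.DOTALL)

# search() returns only the first match, as the comment above expects.
match = pattern.search(html)
content = match.group(1).strip() if match else None
print(content)
```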



 
LVL 74

Expert Comment

by:käµfm³d 👽
ID: 33738018
@beezleinc

>>  ASP syntax but regex expressions are pretty universal.

LOL.  In which universe?  ;)


You have to enable single-line mode for the dot to match newlines. Also, your pattern will not catch tags like:

    <h3>...</h3>

This would be a better option:
(?s)<h3[^>]*>.+?</h3>
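
As a rough illustration of this pattern, here is a sketch in Python's `re` module (an assumption; Python happens to support the `(?s)` inline flag, which not every engine does). The capture group around `.+?` and the sample string are additions for the example and are not part of the pattern above.

```python
import re

html = """<h3>Plain heading</h3>
<h3 class="post-title">
  <a href="http://www.example.com">Legal Support Services</a>
</h3>"""

# (?s) at the start of the pattern turns on single-line mode,
# so "." also matches newlines.
# <h3[^>]*> accepts an h3 tag with or without attributes.
pattern = r'(?s)<h3[^>]*>(.+?)</h3>'

# findall() returns the captured group for every h3 in the input.
headings = [text.strip() for text in re.findall(pattern, html)]
print(headings)
```

Note that this matches every h3, including ones without the class attribute.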

 

Author Comment

by:Xponex
ID: 33738068
@kaufmed

So should I have multiline on or off? And... I don't want to match ALL h3's, just the one with the class="post-title" attribute and nothing more. So the "[^>]*" would be counterproductive, I think...
 
LVL 74

Accepted Solution

by:käµfm³d 👽 earned 500 total points
ID: 33738202
I believe you will need to set Global to true.

Multiline affects the behavior of ^ and $ in a regex. It will not benefit you here.

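I think you are using the VBScript Regular Expressions 5.5 library in your code. There is no dot-matches-newline option that I can see. You can circumvent this by using the pattern below, where [\s\S] matches any character, newlines included (the doubled quotes are how a literal quote is written inside a VBScript string):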
<h3\s+class=""post-title"">[\s\S]+?</h3>
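
A minimal sketch of the same idea, written in Python's `re` module for illustration only (an assumption; the accepted pattern targets the VBScript engine, and the quotes are single here because Python does not need them doubled). The capture group and sample string are additions for the example. `findall()` plays roughly the role of Global set to true, and no dot-matches-newline flag is used anywhere.

```python
import re

html = """Lots of html
<h3>Not this one</h3>
<h3 class="post-title">
  <a href="http://www.example.com">Legal Support Services</a>
</h3>
lots more html"""

# No DOTALL flag: [\s\S] matches any character, newlines included,
# because every character is either whitespace or non-whitespace.
pattern = re.compile(r'<h3\s+class="post-title">([\s\S]+?)</h3>')

# re.MULTILINE only changes ^ and $, so it does not help "." cross lines.
dot_multiline = re.search(r'<h3\s+class="post-title">(.+?)</h3>', html, re.MULTILINE)

titles = [t.strip() for t in pattern.findall(html)]
print(titles)
print(dot_multiline)
```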

 

Author Comment

by:Xponex
ID: 33738281
Ah ha! That's what I was looking for: [\s\S]

That did the trick! Thanks!
 
LVL 74

Expert Comment

by:käµfm³d 👽
ID: 33738383
NP. Glad to help  :)