Solved

Regexp help - getting content between two html tags

Posted on 2010-09-22
Last Modified: 2012-05-10
I need to get the content between an opening and closing H3 tag. Here is an example source:

 <h3 class="post-title">
       <a href="http://www.example.com">
       Legal Support Services
       </a>
</h3>

I was using this regexp: <h3 class="post-title">([^<]*?)</h3>

It worked fine until I realized that there were occasionally HTML tags within the H3 tags. I have changed the regexp to: <h3 class="post-title">(.*?)</h3>

But it never works.

:'(
Question by:Xponex
 

Expert Comment

by:hprasad123
ID: 33736879
This would work for you:
<h3 .*</h3>
 

Expert Comment

by:hprasad123
ID: 33736896
or
<h3 class="post-title".*</h3>
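A quick way to see how the greedy `.*` in these two patterns behaves is to run it against a small sample. The sketch below uses Python rather than ASP, and the `html` and `sample` strings are made up for illustration. It suggests two things: on a single line, greedy `.*` runs to the last `</h3>` while lazy `.*?` stops at the first, and without a dot-matches-newline option neither form matches a heading spread over several lines.

```python
import re

# Hypothetical input: two post titles, one sharing a line with another <h3>.
html = (
    '<h3 class="post-title">First</h3>\n'
    '<p>filler</p>\n'
    '<h3 class="post-title">Second</h3> <h3>Third</h3>\n'
)

# Greedy .* runs to the LAST </h3> on the line (the dot stops at newlines).
greedy = re.findall(r'<h3 class="post-title".*</h3>', html)

# Lazy .*? stops at the first </h3> it can reach.
lazy = re.findall(r'<h3 class="post-title".*?</h3>', html)

# A heading spread over several lines, like the one in the question.
sample = '<h3 class="post-title">\n  Legal Support Services\n</h3>'
multi_line_hit = re.search(r'<h3 .*</h3>', sample)

print(greedy)
print(lazy)
print(multi_line_hit)
```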
 
LVL 3

Expert Comment

by:beezleinc
ID: 33736960
Simple Perl to find center tag content
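use strict;
use warnings;

my $foo = <<'EOF';
<h3 class="post-title">
       <a href="http://www.example.com">
       Legal Support Services
       </a>
</h3>
EOF

# Repeatedly replace $foo with the content of the first open/close tag pair
# found, until no pair is left; /s lets "." match newlines, /i ignores case.
while ($foo =~ /<(\w+)(.*?)>(.*?)<(\/\1)>/si)
{
        $foo = $3;
}

print "Center match is \"$foo\"\r\n";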


Author Comment

by:Xponex
ID: 33737035
This is being coded in ASP. Thanks for the code beezleinc but there will be a lot more text before (and after) the H3 tag. I need to isolate the text just between the opening H3 and closing H3 tag.
 
LVL 3

Expert Comment

by:beezleinc
ID: 33737105
Do you have nested <h3> tags?
 

Author Comment

by:Xponex
ID: 33737125
Never. It will always be:

Lots of html
<h3 class="post-title">
some text and MAYBE an <a> tag
</h3>
lots more html
 
LVL 3

Expert Comment

by:beezleinc
ID: 33737253
Then "<h3 .+?>(.+?)</h3>" should work.

Make sure your regex call is case-insensitive if need be and can span multiple lines. I'm not too familiar with ASP syntax, but regex expressions are pretty universal.

You may have to escape the "/", "<" and ">" characters in the regex string, i.e. "\<h3.+?\>(.+?)\<\/h3\>"

It is not perfect: if you have multiple <h3></h3> tag sets in the input, it will (or should) return just the first match.

Also watch out for additional whitespace, which can break the regex if it is not accounted for, e.g. "</h3>" will not match "</h3  >".
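To make those caveats concrete, here is a minimal sketch in Python (not ASP, so flag names will differ in your environment). The `page` string is invented for illustration, and the `\s*` before the final `>` is an addition to the pattern above, there only to tolerate stray whitespace in the closing tag.

```python
import re

# Hypothetical page: upper-case tag name and a sloppy closing tag.
page = """<p>lots of html</p>
<H3 class="post-title">
    <a href="http://www.example.com">
    Legal Support Services
    </a>
</H3  >
<p>lots more html</p>"""

# IGNORECASE handles <H3>; DOTALL lets "." cross line breaks.
# \s* before the final ">" tolerates "</h3  >".
pattern = re.compile(r"<h3 .+?>(.+?)</h3\s*>", re.IGNORECASE | re.DOTALL)

match = pattern.search(page)  # search() returns only the first match
inner = match.group(1).strip() if match else None
print(inner)
```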



 
LVL 75

Expert Comment

by:käµfm³d 👽
ID: 33738018
@beezleinc

>>  ASP syntax but regex expressions are pretty universal.

LOL.  In which universe?  ;)


You have to enable single-line mode for the dot to match newlines. Also, your pattern will not catch tags like:

    <h3>...</h3>

This would be a better option:
(?s)<h3[^>]*>.+?</h3>
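As a rough check of that pattern, the Python sketch below (Python is an assumption here; the thread's target is ASP) runs it against two made-up headings, one with attributes and one bare. It also shows the same pattern without `(?s)` failing on multi-line input.

```python
import re

PATTERN = r"(?s)<h3[^>]*>.+?</h3>"          # (?s) = dot matches newlines
NO_FLAG = r"<h3[^>]*>.+?</h3>"              # same pattern without the flag

# Hypothetical inputs for illustration.
with_attr = '<h3 class="post-title">\n  Legal Support Services\n</h3>'
bare = "<h3>\n  Plain heading\n</h3>"

hit_attr = re.search(PATTERN, with_attr)    # matches the whole element
hit_bare = re.search(PATTERN, bare)         # [^>]* also allows no attributes
miss = re.search(NO_FLAG, with_attr)        # dot cannot cross the newline

print(hit_attr.group(0))
print(hit_bare.group(0))
print(miss)
```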

 

Author Comment

by:Xponex
ID: 33738068
@kaufmed

So should I have multiline on or off? And... I don't want to match ALL h3's, just the one with the class="post-title" attribute and nothing more. So the "[^>]*" would be counter-productive, I think...
 
LVL 75

Accepted Solution

by:
käµfm³d   👽 earned 500 total points
ID: 33738202
I believe you will need to set Global to true.

Multiline affects the behavior of ^ and $ in a regex. It will not benefit you here.
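For what it's worth, the same distinction can be seen in Python, used here only as a stand-in since ASP's option names differ. In this small sketch with a made-up `text` value, the multiline flag changes where `^` can match but does nothing for the dot.

```python
import re

text = "first line\nsecond line"

# Without MULTILINE, ^ anchors only at the very start of the string.
start_only = re.findall(r"^\w+", text)

# With MULTILINE, ^ also anchors after every newline.
every_line = re.findall(r"^\w+", text, re.MULTILINE)

# MULTILINE does not let the dot cross a line break.
dot_across = re.search(r"first.+second", text, re.MULTILINE)

print(start_only)
print(every_line)
print(dot_across)
```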

I think you are using the 5.5 regex library in your code. There is no dot-matches-newline option that I can see. You can circumvent this by using the pattern below:
<h3\s+class=""post-title"">[\s\S]+?</h3>
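Here is a minimal sketch of the `[\s\S]` workaround, written in Python rather than ASP. The doubled quotes in the pattern above appear to be string-literal escaping for the ASP code, so they become plain quotes here. The capturing parentheses are an addition for this sketch, so that only the inner content is returned, and the `page` string is made up.

```python
import re

# Hypothetical page with one plain <h3> and one post-title <h3>.
page = """<h3>Not this one</h3>
<h3 class="post-title">
    <a href="http://www.example.com">
    Legal Support Services
    </a>
</h3>
<p>lots more html</p>"""

# No dot-matches-newline flag: [\s\S] matches any character, newlines included.
# The parentheses (added for this sketch) capture just the inner content.
pattern = re.compile(r'<h3\s+class="post-title">([\s\S]+?)</h3>')

titles = [m.strip() for m in pattern.findall(page)]
print(titles)

# For comparison, a plain dot fails because it stops at line breaks.
dot_version = re.search(r'<h3\s+class="post-title">(.+?)</h3>', page)
print(dot_version)
```

Because the pattern spells out the class attribute, the plain `<h3>` at the top of the sample is skipped.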

 

Author Comment

by:Xponex
ID: 33738281
Ah ha! That's what I was looking for: [\s\S]

That did the trick! Thanks!
 
LVL 75

Expert Comment

by:käµfm³d 👽
ID: 33738383
NP. Glad to help  :)