Solved

Simple RegEx to get only content between tags over multiple lines

Posted on 2012-09-06
Last Modified: 2012-09-06
Hi,

I'm new to regular expressions and believe this is really easy for someone who is good at them =)


I'm searching an HTML document for a title that's inside <h1> tags.

It could look something like this:
text text text
text <h1>Here is the title I want
and it could span over multiple lines</h1>
more text



I tried this simple RegEx: <h1>(.*)</h1>

But it doesn't work when the title spans multiple lines. I'm also worried that, if the document has several headings, it will match from the first <h1> all the way to the last </h1>.

And I only want the result to be the text itself: "Here is the title I want and it could span over multiple lines", without the <h1> and </h1> tags.

My code is:
var titleMatch = new Regex("<h1>(.*)</h1>", RegexOptions.IgnoreCase).Match(htmlInput);



Thanks for any help.
Question by:jimmieandersson
9 Comments
 
LVL 31

Expert Comment

by:GwynforWeb
ID: 38372750
Escape the symbols that have a special meaning in JS or HTML with a \, i.e.

new Regex("\<h1\>(.*)\<\/h1\>")
 
LVL 93

Expert Comment

by:Patrick Matthews
ID: 38372775
How about:

"<h1>[^<]*</h1>"
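
Two things to watch with this pattern: `[^<]` matches line breaks without any extra option, but it stops at the first `<`, so a heading that contains inline markup such as `<em>` is not matched at all. It also has no parentheses, so there is no group holding just the text. A minimal JavaScript sketch of both points, in the style of the client-side examples in this thread; the sample strings and the name `negatedClass` are made up for illustration:

```javascript
// Same pattern with a capture group added around the negated class.
const negatedClass = /<h1>([^<]*)<\/h1>/i;

// A negated character class matches line breaks too, so no extra flag is needed.
const plain = "text <h1>Here is the title I want\nand it could span over multiple lines</h1> more text";
const plainMatch = plain.match(negatedClass);
console.log(plainMatch[1]);

// Inline markup inside the heading contains "<", so the pattern finds nothing.
const nested = "<h1>Title with <em>emphasis</em></h1>";
const nestedMatch = nested.match(negatedClass);
console.log(nestedMatch); // null
```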
 
LVL 31

Expert Comment

by:GwynforWeb
ID: 38372797
Actually, I got this to work; escaping is not needed. I am testing against the text in the div below.

Use "i" for case insensitivity, e.g.:

<script>
var titleMatch = new RegExp("<h1>(.*)</h1>","i");
</script>

<div id="myDiv">
<p>some text </p>
<h1>Some text in the tag
 and some more and
 some more</h1>
<p><b>Some text outside</b> </p>
<h1>Some more text in the tag and some more and some more</h1>
<p>Some randomstuff </p>
</div>

<script>
// Read the div's markup and show the result of matching it against the pattern.
var str = document.getElementById("myDiv").innerHTML;
alert(str.match(titleMatch));
</script>
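
Note that `.` does not match a line break in JavaScript either, so the pattern above skips the first, multi-line heading and reports the second one. A minimal sketch of two common workarounds, assuming a runtime that supports the `s` (dotAll) flag, which plays the role of .NET's `RegexOptions.Singleline`; the `html` sample string is invented for the example:

```javascript
const html = "<p>some text</p>\n<h1>Some text in the tag\n and some more</h1>\n<h1>Second heading</h1>";

// Without dotAll, "." stops at "\n", so the first match is the single-line heading.
const withoutDotAll = html.match(/<h1>(.*?)<\/h1>/i);
console.log(withoutDotAll[1]); // Second heading

// With the "s" flag, "." also matches line breaks.
const withDotAll = html.match(/<h1>(.*?)<\/h1>/is);
console.log(withDotAll[1]);

// [\s\S] matches any character and works on engines without the "s" flag.
const withClass = html.match(/<h1>([\s\S]*?)<\/h1>/i);
console.log(withClass[1] === withDotAll[1]); // true
```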

 
LVL 75

Assisted Solution

by:käµfm³d 👽
käµfm³d   👽 earned 2000 total points
ID: 38373242
Turn on single-line mode:

var titleMatch = new Regex("<h1>(.*?)</h1>", RegexOptions.IgnoreCase | RegexOptions.Singleline).Match(htmlInput);
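
The `?` after `.*` matters as much as the option: it makes the quantifier lazy, so the match ends at the first `</h1>` rather than the last one in the document. The effect is easy to check in JavaScript, used here only because the thread already has client-side examples; its `s` flag stands in for `RegexOptions.Singleline`, and the `page` string is an invented sample:

```javascript
const page = "<h1>First\ntitle</h1>\n<p>body</p>\n<h1>Second title</h1>";

// Greedy: ".*" runs to the end of the input and backs up to the LAST </h1>.
const greedy = page.match(/<h1>(.*)<\/h1>/is)[1];
console.log(greedy);

// Lazy: ".*?" stops at the FIRST </h1>.
const lazy = page.match(/<h1>(.*?)<\/h1>/is)[1];
console.log(lazy);
```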

 
LVL 75

Expert Comment

by:käµfm³d 👽
ID: 38373261
P.S.

Do note that my suggestion will only work as intended if the <h1> tags are balanced, meaning each <h1> has a corresponding </h1>. If they are not balanced, you will get unexpected results. Regex is generally a poor fit for parsing HTML.
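
To make the caveat concrete, here is a small JavaScript sketch (the `s` flag corresponds to `RegexOptions.Singleline`; the input strings and variable names are invented) showing what the lazy pattern returns when a closing tag is missing, plus one more case where a literal `<h1>` falls short:

```javascript
const h1Pattern = /<h1>(.*?)<\/h1>/is;

// The first <h1> is never closed, so the match runs on to the next </h1>.
const unbalanced = "<h1>Broken title\n<p>text</p>\n<h1>Real title</h1>";
const captured = unbalanced.match(h1Pattern)[1];
console.log(captured);

// An <h1> with attributes is not matched by the literal "<h1>" at all.
const withAttribute = '<h1 class="top">Styled title</h1>';
const attributeMatch = withAttribute.match(h1Pattern);
console.log(attributeMatch); // null
```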
 
LVL 31

Expert Comment

by:GwynforWeb
ID: 38373858
You are far better off using document.getElementsByTagName("h1") and then pulling out the innerHTML of each tag occurrence, e.g. something like this:

<body>
  <p>some text </p>
  <h1>Some first text in the tag</h1>
  <p><b>Some text outside</b> </p>
  <h1>Some second text in the tag </h1>
  <p>Some randomstuff outside tags </p>
  <h1>Some third text in the tag </h1>

  <script>
    // Collect every <h1> element and show the markup inside each one.
    var strings = document.getElementsByTagName("h1");
    for (var i = 0; i < strings.length; i++)
       alert(strings[i].innerHTML);
  </script>

</body>

 

Author Comment

by:jimmieandersson
ID: 38374086
Thank you everyone.

I need to do this server-side in .NET, so I haven't tried the JavaScript methods.

But I tried the following suggestions:
\<h1\>(.*)\<\/h1\> (NOT WORKING, ENDS AT LAST </h1>)
<h1>[^<]*</h1> (WORKING)

and this one with Singleline:
var titleMatch = new Regex("<h1>(.*?)</h1>", RegexOptions.IgnoreCase | RegexOptions.Singleline).Match(htmlInput); (WORKING)



However, the ones I marked WORKING give me the whole match, tags included:
<h1>Here is the title I want and it could span over multiple lines</h1>

I would like to get only:
Here is the title I want and it could span over multiple lines

This must be possible, right? (I do not want to replace or strip anything from the result afterwards.)
 
LVL 75

Accepted Solution

by:käµfm³d 👽
käµfm³d 👽 earned 2000 total points
ID: 38374339
I don't like method chaining in this particular instance, but as a test you can try:

var titleMatch = new Regex("<h1>(.*?)</h1>", RegexOptions.IgnoreCase | RegexOptions.Singleline).Match(htmlInput).Groups[1].Value;
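
Group 0 is always the whole match, tags included; group 1 is the first parenthesized part, which is why `Groups[1]` gives only the title text. The same indexing applies in JavaScript, sketched below with an explicit check for the no-match case instead of chaining. `getTitle` is a hypothetical helper, and one difference is worth noting: JavaScript's `match` returns `null` when nothing matches, whereas .NET's `Match` returns an object whose `Success` property is false.

```javascript
// Hypothetical helper: returns the text of the first <h1>, or null if there is none.
function getTitle(htmlInput) {
  const match = htmlInput.match(/<h1>(.*?)<\/h1>/is);
  if (match === null) {
    return null; // no <h1>...</h1> pair found
  }
  // match[0] is the whole match including the tags; match[1] is the first group.
  return match[1];
}

const sample = "text <h1>Here is the title I want\nand it could span over multiple lines</h1>\nmore text";
console.log(getTitle(sample));
console.log(getTitle("no heading here")); // null
```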

 

Author Closing Comment

by:jimmieandersson
ID: 38374998
It works! Thank you


Question has a verified solution.
