jimmieandersson

Simple RegEx to get only content between tags over multiple lines

Hi,

I'm new to regular expressions and believe this is really easy to someone that is good at it =)


I'm searching an HTML document for a title that's inside <h1> tags.

It could look something like this:
text text text
text <h1>Here is the title I want
and it could span over multiple lines</h1>
more text


I tried this simple RegEx: <h1>(.*)</h1>

But it doesn't work over multiple lines. And I'm also worried that it will match from the first <h1> tag in document to the last (if multiple) </h1>?

And, I only want the result to be: Here is the title I want and it could span over multiple lines, not including the <h1> and </h1>

My code is:
var titleMatch = new Regex("<h1>(.*)</h1>", RegexOptions.IgnoreCase).Match(htmlInput);


Thanks for any help.
GwynforWeb

Escape the symbols that have special meaning in JS or HTML using a \, i.e.

new Regex("\<h1\>(.*)\<\/h1\>"
Patrick Matthews
How about:

"<h1>[^<]*</h1>"
Actually I got this to work, escaping is not needed. I am testing the text in the Div:-

E.g. use "i" for case insensitivity:

<script>
var titleMatch = new RegExp("<h1>(.*)</h1>","i");
</script>

<div id="myDiv">
<p>some text </p>
<h1>Some text in the tag
 and some more and
 some more</h1>
<b>
<p>Some text outside</b> </p>
<h1>Some more text in the tag and some more and some more</h1>
<p>Some randomstuff </p>
</div>

<script>
var str = document.getElementById("myDiv").innerHTML;
alert(str.match(titleMatch));
</script>
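
One caveat with the test above, as a hedged sketch: in JavaScript, . does not match line terminators unless the s (dotAll) flag is set, so <h1>(.*)</h1> skips the multi-line heading and matches the single-line one further down instead. Swapping . for [\s\S] and making the quantifier lazy is one common workaround. The html string below is only a stand-in for the div's innerHTML, not the exact value a browser would return, and the variable names are illustrative.

```javascript
// Stand-in for the markup inside the div (an assumption, simplified).
const html = [
  "<p>some text </p>",
  "<h1>Some text in the tag",
  " and some more and",
  " some more</h1>",
  "<p>Some text outside</p>",
  "<h1>Some more text in the tag and some more and some more</h1>",
].join("\n");

// Original pattern: "." stops at every newline.
const dotPattern = new RegExp("<h1>(.*)</h1>", "i");

// Workaround: [\s\S] matches any character, newlines included,
// and the lazy "*?" stops at the first closing tag.
const anyCharPattern = new RegExp("<h1>([\\s\\S]*?)</h1>", "i");

const dotMatch = html.match(dotPattern);
const anyCharMatch = html.match(anyCharPattern);

console.log(dotMatch[1]);     // text of the second, single-line heading
console.log(anyCharMatch[1]); // text of the first, multi-line heading
```

Also note that match() returns an array here (whole match first, then the groups), which is why the alert shows more than just the heading text.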
SOLUTION
kaufmed

This solution is only available to members.
P.S.

Do note that my suggestion is only going to work as intended if the <h1>'s are balanced--meaning each <h1> has a corresponding </h1>. If they are not balanced, then you will get unexpected results. Regex is not really good for parsing HTML.
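
To illustrate the caveat, here is a small sketch that assumes the lazy pattern <h1>(.*?)</h1> quoted later in the thread (written with [\s\S] so that it spans lines in JavaScript). The unbalanced input is invented for the example: the first <h1> is never closed, so the match runs on to the next </h1> and swallows everything in between.

```javascript
// Lazy pattern; [\s\S] stands in for "." with single-line behaviour.
const lazyHeading = /<h1>([\s\S]*?)<\/h1>/i;

// Hypothetical unbalanced markup: the first <h1> has no closing tag.
const unbalanced = "<h1>First title\n<p>A paragraph</p>\n<h1>Second title</h1>";

// The match starts at the FIRST <h1> and ends at the only </h1>.
const captured = unbalanced.match(lazyHeading)[1];

console.log(JSON.stringify(captured));
```

The captured text includes the stray paragraph and the second opening tag, which is the kind of unexpected result the note warns about.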
You are far better off using document.getElementsByTagName("h1"), then pulling out the innerHTML of each tag occurrence, e.g. something like this:

<body>
  <p>some text </p>
  <h1>Some first text in the tag</h1>
  <p><b>Some text outside</b> </p>
  <h1>Some second text in the tag </h1>
  <p>Some randomstuff outside tags </p>
  <h1>Some third text in the tag </h1>

  <script>
    // Collect every <h1> element and show the markup inside each one.
    var headings = document.getElementsByTagName("h1");
    for (var i = 0; i < headings.length; i++) {
      alert(headings[i].innerHTML);
    }
  </script>

</body>

jimmieandersson

ASKER

Thank you everyone.

I need to do this server-side in .NET, so I haven't tried the JavaScript methods.

But I tried the following suggestions:
\<h1\>(.*)\<\/h1\> (NOT WORKING, ENDS AT LAST </h1>)
<h1>[^<]*</h1> (WORKING)

and this one with RegexOptions.Singleline:
var titleMatch = new Regex("<h1>(.*?)</h1>", RegexOptions.IgnoreCase | RegexOptions.Singleline).Match(htmlInput); (WORKING)


However, the ones I marked WORKING give me the whole match, tags included:
<h1>Here is the title I want and it could span over multiple lines</h1>

I would only like to get
Here is the title I want and it could span over multiple lines

This must be possible, right? (I do not want to replace or remove parts of the result afterwards.)
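
The usual way to get only the inner text is to read the first capture group rather than the whole match. In .NET that is titleMatch.Groups[1].Value, while titleMatch.Value holds the full match including the tags. The same idea as a runnable JavaScript sketch, where index 0 is the whole match and index 1 is the group (the input string and variable names are illustrative, not taken from the thread):

```javascript
// Sample input modelled on the question, plus a second heading.
const htmlInput =
  "text text text\n" +
  "text <h1>Here is the title I want\n" +
  "and it could span over multiple lines</h1>\n" +
  "more text <H1>Second</H1>";

// First heading only: index 0 is the whole match, index 1 the group.
const firstMatch = /<h1>([\s\S]*?)<\/h1>/i.exec(htmlInput);
const wholeMatch = firstMatch[0]; // includes <h1> and </h1>
const innerText = firstMatch[1];  // only the text between the tags

// Every heading: matchAll needs the g flag; keep group 1 of each match.
const headingPattern = /<h1>([\s\S]*?)<\/h1>/gi;
const allTitles = Array.from(htmlInput.matchAll(headingPattern), (m) => m[1]);

console.log(innerText);
console.log(allTitles.length);
```

The captured text keeps the line break that was in the source, so it still spans two lines.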
ASKER CERTIFIED SOLUTION
This solution is only available to members.
It works! Thank you