stephenwilde
.NET C# regex for extracting H1-H5 header tag content from an HTML string
I'm looking to extract all the H1 to H5 header tags and their inner text from an HTML string, using Regex in .NET C#.
The H1 tags can look like any of these:
<h1>normal h1 heading </h1>
<H1>upper case h1 heading </H1>
< h1>heading with space before h1</h1>
<h1 class=etc>heading with class reference or other string</h1>
<h1 >with space after h1</h1>
The regex would have to cope with extracting all h1, h2, h3, h4 and h5 tags.
Any help would be appreciated
ASKER
Thanks. I'm trying regex first because it can deal with the content of all header tags (h1, h2, h3, h4, h5) in one line of code, rather than several more complex lines, and I want to list the header content in the order it appears in the HTML string.
Plus, regex will deal with the malformed tags and variations outlined in my original question, if it is well formed.
I just don't have the knowledge of regex syntax to achieve the desired result.
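For reference, here is a minimal sketch of one pattern that covers the variations listed in the question. It is not the accepted solution from this thread, and the names `HeadingExtractor`, `HeadingPattern` and `Extract` are made up for illustration. It assumes each heading has a matching closing tag of the same level.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class HeadingExtractor
{
    // <\s*h([1-5])\b   opening tag, optional space after '<', level captured in group 1
    // [^>]*>           optional attributes or trailing space, then '>'
    // (.*?)            inner content, lazy, captured in group 2
    // <\s*/\s*h\1\s*>  closing tag of the same level (backreference to group 1)
    private static readonly Regex HeadingPattern = new Regex(
        @"<\s*h([1-5])\b[^>]*>(.*?)<\s*/\s*h\1\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns (level, inner text) pairs in the order they occur in the input.
    public static List<KeyValuePair<int, string>> Extract(string html)
    {
        var results = new List<KeyValuePair<int, string>>();
        foreach (Match m in HeadingPattern.Matches(html))
        {
            int level = int.Parse(m.Groups[1].Value);
            string text = m.Groups[2].Value.Trim();
            results.Add(new KeyValuePair<int, string>(level, text));
        }
        return results;
    }
}
```

`Regex.Matches` returns matches in the order they occur, so the headings come back in document order. Any markup nested inside a heading (for example a `<span>`) stays in the captured text, and a heading without a closing tag will either be skipped or paired with a later closing tag of the same level.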
ASKER CERTIFIED SOLUTION
ASKER
Thanks very much, it looks like it has worked on all variations.
Not a problem, glad to help.
There are HTML parsers such as HtmlAgilityPack that do this job for you (http://htmlagilitypack.codeplex.com).
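A rough sketch of the parser route, assuming the HtmlAgilityPack NuGet package is referenced. The class and method names `HeadingLister` and `PrintHeadings` are made up for illustration, and how the parser treats the malformed `< h1>` variant has not been checked here.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;
using HtmlAgilityPack; // third-party package, assumed to be installed

// Hypothetical helper, not part of HtmlAgilityPack.
public static class HeadingLister
{
    public static void PrintHeadings(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Descendants() walks the tree in document order, so h1-h5 nodes
        // are listed in the order they appear in the HTML string.
        var headings = doc.DocumentNode.Descendants()
            .Where(n => Regex.IsMatch(n.Name, "^h[1-5]$", RegexOptions.IgnoreCase));

        foreach (var h in headings)
        {
            // InnerText drops nested markup and keeps only the text content.
            Console.WriteLine(h.Name + ": " + h.InnerText.Trim());
        }
    }
}
```

Compared with the regex approach, this leaves tag matching to the parser, at the cost of an extra dependency.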