Using the Regex class to eliminate HTML tags

hi,

I want to eliminate image and iframe tags from HTML (which is stored in a string). I have been trying to do it for the past 3 days but I fail every time. Along with that, please tell me the proper way to use the Regex class.

Eggpatch

What scripting language are you using? And please show some code...

Yogesh_Agarwal

ASKER

I am using VB.NET.

(<p class=.+>) - I used this to remove p tags that have a class. I want to remove iframe, image, and script tags too.

ASKER CERTIFIED SOLUTION

Eggpatch

Yogesh_Agarwal

ASKER

I have already read that page. It was only after reading it that I tried this by myself, but I am failing.

oobayly

You might get some unwanted side effects when using
"(<p class=.+>)": because .+ is greedy, it matches everything up to the last > on the line, which isn't desirable. For example, it swallows all of this:

<p class='foo'>Some Text</p><a href='./index.aspx'>Home</a><p>More Text</p>

Try something like this. It uses the lazy quantifier (.+?), so the match stops at the first "p>" after the opening tag, which here is the end of the closing </p>:
(<(?<tag>p) class=.+?(\k<tag>)>)

Given this string "<p class='foo'>Some Text</p><a href='./index.aspx'>Home</a><p>More Text</p>", it matches this:
<p class='foo'>Some Text</p>
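To see the difference for yourself, here is a minimal C# sketch that runs both patterns from this post against the sample string. The class and method names (GreedyVsLazy, FirstMatch) are made up for illustration and are not part of any library.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyVsLazy
{
    // The sample string from the post above.
    public const string Sample =
        "<p class='foo'>Some Text</p><a href='./index.aspx'>Home</a><p>More Text</p>";

    // Returns the text of the first match of the pattern in Sample.
    public static string FirstMatch(string pattern)
    {
        return Regex.Match(Sample, pattern).Value;
    }

    public static void Main()
    {
        // Greedy: .+ runs to the last '>' on the line, so the whole sample is matched.
        Console.WriteLine(FirstMatch("(<p class=.+>)"));

        // Lazy: .+? stops at the first "p>" it can reach, here the end of "</p>".
        Console.WriteLine(FirstMatch(@"(<(?<tag>p) class=.+?(\k<tag>)>)"));
    }
}
```

The first line printed is the entire sample string, the second is only <p class='foo'>Some Text</p>.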

Self closing tags like img are far simpler:
(<img .+?>)
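Putting this together for the original question (removing img, iframe and script tags), a rough C# sketch could look like the one below. It is only one possible approach, under these assumptions: the TagStripper class and Strip method are hypothetical names, the IgnoreCase and Singleline options are my own choice, and [^>]* assumes that no attribute value contains a > character. A regex is not a full HTML parser, so unusual markup can still slip through.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagStripper
{
    // Paired tags: match lazily from the opening tag to its own closing tag.
    // Singleline lets '.' cross line breaks inside a script or iframe body.
    private static readonly Regex PairedTags = new Regex(
        @"<(?<tag>iframe|script)\b[^>]*>.*?</\k<tag>\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // img has no closing tag, so matching up to the next '>' is enough.
    private static readonly Regex ImgTags = new Regex(
        @"<img\b[^>]*>",
        RegexOptions.IgnoreCase);

    // Returns the HTML with iframe, script and img tags removed.
    public static string Strip(string html)
    {
        string withoutPaired = PairedTags.Replace(html, string.Empty);
        return ImgTags.Replace(withoutPaired, string.Empty);
    }

    public static void Main()
    {
        string html = "<p>Hi</p><img src='a.png'><iframe src='x.html'></iframe>"
                    + "<script>alert(1);</script><p>Bye</p>";
        Console.WriteLine(Strip(html)); // prints <p>Hi</p><p>Bye</p>
    }
}
```

Regex.Replace returns a new string and leaves the input untouched, so assign the result back if you want to keep it.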

Yogesh_Agarwal

ASKER

Perfect, I got it. But what shall I do if I want to extract the href links from an HTML page?

example:

str= "<html>
<h3 class=r><a href="http://en.wikipedia.org/wiki/Hello_world_program" class=l>
<h3 class=r><a href="http://en.wikipedia.org/wiki/Hello_world_program" class=l>
<h3 class=r><a href="http://en.wikipedia.org/wiki/Hello_world_program" class=l>
</html>

Now I want the values in href="http://www.xxxxxxxxx.com". I also want to specify that the link must be under <h3 class=r>. There are 26 classes (a-z), so I need to say that it should extract only the links in h3 headings with class r, and I want to store them in an array. Can you please guide me?

oobayly

Use the Groups Luke :-)

<a.+?href=["'](?<href>.+?)["'].+?>

A couple of things to note:
I haven't assumed that the href attribute immediately follows the tag name.
I've allowed both single and double quotes around the href value.

Yogesh_Agarwal

ASKER

There are lots of unwanted href attributes in the page. As I said, there are 26 classes and 6 heading tags. Of those I want to extract only the links in h3 headings with class=r. Will the above code work for that? I don't think so.. :-) I am a noob at this, please help me out.

oobayly

I appreciate that this is new to you, but it doesn't take a great deal of thought to extend this to what you need:

<h3 class=r><a.+?href=["'](?<href>.+?)["'].+?></h3>

See what I've done here?

Yogesh_Agarwal

ASKER

First it will check <h3 class=r>,
then <a, then: . -> previous operator, + -> previous thing plus, ? -> if it matches the previous expression, [""] -> match anything between " "..

Yogesh_Agarwal

ASKER

Can you give me the full code? :-) Like how to use the match group?

oobayly

This is where you can find all the information about the Regex class:
http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regex.aspx

using System.Text.RegularExpressions;

// The named group "href" captures the URL between the quotes.
Regex re = new Regex("<h3 class=r><a.+?href=[\"'](?<href>.+?)[\"'].+?></h3>");
Match m = re.Match("<h3 class=r><a href=\"http://en.wikipedia.org/wiki/Hello_world_program\" class=l></h3>");
string href = m.Groups["href"].Value;

Yogesh_Agarwal

ASKER

re.Match("<h3 class=r><a href=\"http://en.wikipedia.org/wiki/Hello_world_program\" class=l></h3>");

Will the above statement extract only that particular URL, or will it extract all URLs?

oobayly

If you had looked through the documentation, you'd have seen that the Regex object doesn't just have a Match method, but also (amongst others) a Matches method, which returns a MatchCollection:
http://msdn.microsoft.com/en-us/library/e7sf90t3.aspx

Yogesh_Agarwal

ASKER

Yeah, I was successful in extracting

<h3 class=r><a href="http://en.wikipedia.org/wiki/Hello_world_program" class=l>

the one above. I know I have to use a loop to extract the links, but I don't know how to proceed.

oobayly

Regex re = new Regex("Your Regular Expression");
foreach (Match m in re.Matches("Text to run Regex on")) {
    // Do stuff with each match, e.g. read m.Groups["href"].Value
}
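For the specific case discussed in this thread (links inside <h3 class=r>, stored in an array), the loop could be filled in roughly as follows. This is a sketch only: LinkExtractor and ExtractLinks are hypothetical names, the example.com URLs are invented test data, and the pattern is the one posted earlier, so it assumes each heading sits on one line and is closed by </h3>.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkExtractor
{
    // Pattern from earlier in the thread; the named group "href" holds the URL.
    private static readonly Regex HeadingLink = new Regex(
        "<h3 class=r><a.+?href=[\"'](?<href>.+?)[\"'].+?></h3>");

    // Collects the href of every <h3 class=r> link into an array.
    public static string[] ExtractLinks(string html)
    {
        var links = new List<string>();
        foreach (Match m in HeadingLink.Matches(html))
        {
            links.Add(m.Groups["href"].Value);
        }
        return links.ToArray();
    }

    public static void Main()
    {
        string html =
            "<h3 class=r><a href=\"http://example.com/one\" class=l>One</a></h3>\n" +
            "<h2 class=r><a href=\"http://example.com/skip-h2\" class=l>Skip</a></h2>\n" +
            "<h3 class=s><a href=\"http://example.com/skip-s\" class=l>Skip</a></h3>\n" +
            "<h3 class=r><a href='http://example.com/two' class=l>Two</a></h3>";

        // Only the two h3 class=r links are printed.
        foreach (string link in ExtractLinks(html))
        {
            Console.WriteLine(link);
        }
    }
}
```

A List<string> is used while collecting because the number of matches is not known up front; ToArray() converts it at the end.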

Yogesh_Agarwal

ASKER

nice..