Using the Regex class to eliminate HTML tags

Hi,

I want to eliminate image and iframe tags from HTML (which is stored in a string). I have been trying to do it for the past 3 days, but I keep failing. Along with that, please tell me the proper way to use the Regex class.
Yogesh_AgarwalAsked:
EggpatchCommented:
What scripting language are you using? And please show some code.
 
Yogesh_AgarwalAuthor Commented:
I am using VB.NET.

(<p class=.+>)  - I used this to remove p class. I want to remove iframe, image, and script tags.
Yogesh_AgarwalAuthor Commented:
I have already read that page. Only after reading it did I decide to try it myself, but I am failing.
 
oobaylyCommented:
You might get some unwanted side effects when using
"(<p class=.+>)", because the greedy .+ keeps matching up to the last > in the text, which isn't desirable. With this input it matches the whole string:

<p class='foo'>Some Text</p><a href='./index.aspx'>Home</a><p>More Text</p>

Try something like this. It uses the lazy (non-greedy) quantifier (.+?) and a back-reference to the tag name, so the match stops at the closing tag:
(<(?<tag>p) class=.+?(\k<tag>)>)

Given this string "<p class='foo'>Some Text</p><a href='./index.aspx'>Home</a><p>More Text</p>", it matches this:
<p class='foo'>Some Text</p>
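
To see the difference for yourself, here is a minimal sketch that runs both patterns against the sample string above. The class name GreedyVersusLazy and the helper FirstMatch are made up for illustration; only Regex.Match comes from the framework.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper class, not part of the original answer.
public static class GreedyVersusLazy
{
    // Sample input taken from the post above.
    public const string Html =
        "<p class='foo'>Some Text</p><a href='./index.aspx'>Home</a><p>More Text</p>";

    // Returns the text of the first match, or an empty string if nothing matches.
    public static string FirstMatch(string pattern)
    {
        return Regex.Match(Html, pattern).Value;
    }

    // Greedy version: .+ runs on to the last '>' in the input.
    public static string Greedy()
    {
        return FirstMatch("(<p class=.+>)");
    }

    // Lazy version with a back-reference to the captured tag name.
    public static string Lazy()
    {
        return FirstMatch(@"(<(?<tag>p) class=.+?(\k<tag>)>)");
    }
}
```

Greedy() returns the entire sample string, while Lazy() returns only <p class='foo'>Some Text</p>.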

Self-closing tags like img are far simpler:
(<img .+?>)
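
Putting the two ideas together for the original question (img, iframe and script), a rough sketch could look like the following. The class TagStripper and the method Strip are illustrative names, and the pattern is my own assumption about what "remove" should mean here: paired tags are dropped together with their content, img tags are dropped on their own. Regex-based HTML handling is fragile, so treat this as a starting point for well-behaved markup rather than a general HTML cleaner.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper class, not part of the original answer.
public static class TagStripper
{
    // First alternative: <iframe>...</iframe> or <script>...</script>, content included.
    // Second alternative: a single <img ...> tag.
    // Singleline lets '.' cross line breaks; IgnoreCase also accepts <IMG>, <Script>, etc.
    private static readonly Regex Unwanted = new Regex(
        @"<(?<tag>iframe|script)\b[^>]*>.*?</\k<tag>\s*>|<img\b[^>]*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns the input with every match of the pattern removed.
    public static string Strip(string html)
    {
        return Unwanted.Replace(html, string.Empty);
    }
}
```

Calling TagStripper.Strip on the HTML string returns a new string; the original is left untouched. An iframe written without a closing tag would not be caught by this pattern.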
 
Yogesh_AgarwalAuthor Commented:
Perfect, I got it. But what should I do if I want to extract href links from an HTML page?

example:

str= "<html>
<h3 class=r><a href="http://en.wikipedia.org/wiki/Hello_world_program" class=l>
<h3 class=r><a href="http://en.wikipedia.org/wiki/Hello_world_program" class=l>
<h3 class=r><a href="http://en.wikipedia.org/wiki/Hello_world_program" class=l>
</html>

Now I want the values present in href="http://www.xxxxxxxxx.com". I also want to specify that the link should be under <h3 class=r>. There are 26 classes (a-z), so I need to tell it to extract only the links in h3 headings with class r, and I want to store them in an array. Can you please guide me?
 
oobaylyCommented:
Use the Groups Luke :-)

<a.+?href=["'](?<href>.+?)["'].+?>

A couple of things to note:
I haven't assumed that the href attribute immediately follows the tag name.
I've allowed use of single & double quotes for the href attribute.
 
Yogesh_AgarwalAuthor Commented:
There are lots of unwanted href attributes in the page. As I said, there are 26 classes with 6 heading tags, and of those I want to extract only the links under an h3 heading with class=r. Will the above code work for that? I don't think so. :-) I am a noob at this, please help me out.
 
oobaylyCommented:
I appreciate that this is new to you, but it doesn't take a great deal of thought to extend this to what you need:

<h3 class=r><a.+?href=["'](?<href>.+?)["'].+?></h3>

See what I've done here?
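
If the pattern is hard to read in one line, it can be written out piece by piece. This is only a sketch: it uses RegexOptions.IgnorePatternWhitespace so that the pattern can carry # comments, and because unescaped spaces are ignored in that mode, the space in "<h3 class=r>" is written as [ ]. The names AnnotatedPattern, HeadingLink and ExtractHref are made up for the example.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper class, not part of the original answer.
public static class AnnotatedPattern
{
    // Same pattern as above, spread over several lines with comments.
    // In a verbatim string a double quote is written as "".
    public static readonly Regex HeadingLink = new Regex(@"
        <h3[ ]class=r>     # the literal opening tag of the heading we care about
        <a                 # start of the anchor tag
        .+?                # one or more of any character, as few as possible
        href=              # the literal attribute name and equals sign
        [""']              # an opening quote, single or double
        (?<href>.+?)       # named group href: the URL, as short as possible
        [""']              # the closing quote
        .+?                # whatever else is in the tag or the link text
        ></h3>             # the > that directly precedes the closing h3 tag
        ", RegexOptions.IgnorePatternWhitespace);

    // Returns the captured URL, or an empty string if the input does not match.
    public static string ExtractHref(string html)
    {
        Match m = HeadingLink.Match(html);
        return m.Success ? m.Groups["href"].Value : string.Empty;
    }
}
```

The part in (?<href>...) is the named group; everything around it only decides where a match is allowed to start and end.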
 
Yogesh_AgarwalAuthor Commented:
So first it checks for <h3 class=r>, then <a. After that:
. -> any character, + -> one or more of the previous item, ? -> make the previous + match as little as possible, ["'] -> match either a single or a double quote. Is that right?
 
Yogesh_AgarwalAuthor Commented:
Can you give me the full code? :-) Like how to use match groups?
 
oobaylyCommented:
This is where you can find all the information about the Regex class:
http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regex.aspx
System.Text.RegularExpressions.Regex re = new System.Text.RegularExpressions.Regex("<h3 class=r><a.+?href=[\"'](?<href>.+?)[\"'].+?></h3>");
System.Text.RegularExpressions.Match m = re.Match("<h3 class=r><a href=\"http://en.wikipedia.org/wiki/Hello_world_program\" class=l></h3>");
string href = m.Groups["href"].Value;

 
Yogesh_AgarwalAuthor Commented:
re.Match("<h3 class=r><a href=\"http://en.wikipedia.org/wiki/Hello_world_program\" class=l></h3>");

In the above statement, will it extract only that particular URL, or will it extract all URLs?
 
oobaylyCommented:
If you had looked through the documentation, you'd have seen that the Regex object doesn't just have a Match method, but also (amongst others) a Matches method, which returns a MatchCollection:
http://msdn.microsoft.com/en-us/library/e7sf90t3.aspx
 
Yogesh_AgarwalAuthor Commented:
Yeah, I am successful in extracting this one:

 <h3 class=r><a href="http://en.wikipedia.org/wiki/Hello_world_program" class=l>

I know I have to use a loop to extract all the links, but I don't know how to proceed.
 
oobaylyCommented:
Regex re = new Regex("Your Regular Expression");
foreach (Match m in re.Matches("Text to run Regex on")) {
  // Do stuff with matches
}
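
Filled in for the h3 class=r case, the loop could look roughly like this. It is a sketch under a few assumptions: LinkExtractor and ExtractLinks are made-up names, and the trailing ".+?></h3>" part of the earlier pattern is left off because the sample HTML in the question has no closing </h3>. The links are collected in a List<string> and returned as an array, since that is what was asked for.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class, not part of the original answer.
public static class LinkExtractor
{
    // Matches an anchor directly inside <h3 class=r> and captures its href value.
    private static readonly Regex HeadingLink = new Regex(
        "<h3 class=r><a.+?href=[\"'](?<href>.+?)[\"']");

    // Returns the href of every matching heading link, in document order.
    public static string[] ExtractLinks(string html)
    {
        var links = new List<string>();
        foreach (Match m in HeadingLink.Matches(html))
        {
            // Groups["href"] is the named group from the pattern.
            links.Add(m.Groups["href"].Value);
        }
        return links.ToArray();
    }
}
```

Headings with another tag or class simply produce no match, so they never reach the loop body.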
 
Yogesh_AgarwalAuthor Commented:
nice..
Question has a verified solution.