Yogesh_Agarwal

Using the Regex class to eliminate HTML tags

Hi,

I want to remove image and iframe tags from HTML that is stored in a string. I have been trying for the past 3 days without success. Along with that, please show me the proper way to use the Regex class.
Eggpatch

Which scripting language are you using? Please show some code.
Yogesh_Agarwal

ASKER

I am using VB.NET.

(<p class=.+>)  - I used this to remove p tags that have a class attribute. I also want to remove iframe, image, and script tags.
ASKER CERTIFIED SOLUTION
Eggpatch

This solution is only available to members.
I have already read that page. It was only after reading it that I decided to try this myself, but I keep failing.
oobayly
You might get some unwanted side effects when using
"(<p class=.+>)", because the greedy .+ will match all the following text up to the last > on the line, which isn't desirable. Take this input, for example:

<p class='foo'>Some Text</p><a href='./index.aspx'>Home</a><p>More Text</p>

Try something like this. It uses the lazy (non-greedy) quantifier (.+?), and a backreference to the tag name so that the match ends at the closing tag:
(<(?<tag>p) class=.+?(\k<tag>)>)

Given this string "<p class='foo'>Some Text</p><a href='./index.aspx'>Home</a><p>More Text</p>", it matches this:
<p class='foo'>Some Text</p>
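
To see the difference between the greedy and the lazy pattern side by side, here is a minimal C# sketch run against the same sample string. The class and method names (GreedyVsLazy, Greedy, Lazy) are made up for illustration; only Regex.Match and Match.Value come from the .NET library.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyVsLazy
{
    // Sample string taken from the post above.
    public const string Html =
        "<p class='foo'>Some Text</p><a href='./index.aspx'>Home</a><p>More Text</p>";

    // Greedy: .+ grabs as much as it can and backtracks only to the LAST '>'.
    public static string Greedy()
    {
        return Regex.Match(Html, "(<p class=.+>)").Value;
    }

    // Lazy: .+? grows one character at a time and stops at the first place
    // where the rest of the pattern (the tag name followed by '>') matches.
    public static string Lazy()
    {
        return Regex.Match(Html, @"(<(?<tag>p) class=.+?(\k<tag>)>)").Value;
    }

    public static void Main()
    {
        Console.WriteLine(Greedy()); // the whole sample string
        Console.WriteLine(Lazy());   // <p class='foo'>Some Text</p>
    }
}
```

Note that \k<tag>> only requires the text p> to appear, so it would also stop at something like </sup>. Writing </\k<tag>> instead is a stricter variant if that matters for your input.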

Self closing tags like img are far simpler:
(<img .+?>)
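
Putting this together for the original question (removing img, iframe and script tags), a rough sketch with Regex.Replace could look like the following. The names TagStripper and Strip are hypothetical, and the patterns are one possible variation on the ones above, not the only way to write them. It assumes attribute values do not contain a > character and that tags are not nested inside each other; for messy real-world HTML a proper parser is usually safer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagStripper
{
    // Paired tags: opening tag, content, and the matching closing tag.
    // Singleline lets '.' cross line breaks; IgnoreCase also accepts <IFRAME>.
    private static readonly Regex Paired = new Regex(
        @"<(?<tag>iframe|script)\b[^>]*>.*?</\k<tag>\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // img has no closing tag, so matching up to the first '>' is enough.
    private static readonly Regex Img = new Regex(
        @"<img\b[^>]*>", RegexOptions.IgnoreCase);

    public static string Strip(string html)
    {
        string withoutPaired = Paired.Replace(html, "");
        return Img.Replace(withoutPaired, "");
    }

    public static void Main()
    {
        string html = "<p>Hi</p><img src='a.png'><iframe src='x.html'></iframe>"
                    + "<script>alert(1);</script><p>Bye</p>";
        Console.WriteLine(Strip(html)); // <p>Hi</p><p>Bye</p>
    }
}
```

The same calls are available from VB.NET, since Regex lives in System.Text.RegularExpressions for both languages.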
Perfect, I got it. But what should I do if I want to extract href links from an HTML page?

example:

str= "<html>
<h3 class=r><a href="http://en.wikipedia.org/wiki/Hello_world_program" class=l>
<h3 class=r><a href="http://en.wikipedia.org/wiki/Hello_world_program" class=l>
<h3 class=r><a href="http://en.wikipedia.org/wiki/Hello_world_program" class=l>
</html>

Now I want the values present in href="http://www.xxxxxxxxx.com". I also want to specify that the link must be under <h3 class=r>. There are 26 classes (a-z), so I need to say that it should extract only the links in h3 headings with class r, and I want to store them in an array. Can you please guide me?
Use the Groups Luke :-)

<a.+?href=["'](?<href>.+?)["'].+?>

A couple of things to note:
I haven't assumed that the href attribute immediately follows the tag name.
I've allowed use of single & double quotes for the href attribute.
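
As a small illustration of reading the named group, the sketch below applies the pattern above to a single anchor tag. HrefGroup and FirstHref are made-up names and the sample URL is a placeholder; Regex.Match, Match.Success and Match.Groups are the real .NET members.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HrefGroup
{
    // Returns the href of the first <a> tag, or an empty string if none is found.
    public static string FirstHref(string html)
    {
        Match m = Regex.Match(html, "<a.+?href=[\"'](?<href>.+?)[\"'].+?>");
        // The named group (?<href>...) is read back by name from m.Groups.
        return m.Success ? m.Groups["href"].Value : "";
    }

    public static void Main()
    {
        // href is not the first attribute and uses single quotes.
        string tag = "<a class=l href='http://example.com/page' id=x>";
        Console.WriteLine(FirstHref(tag)); // http://example.com/page
    }
}
```

One thing to be aware of: the pattern accepts either quote character at each end independently, so it does not check that the opening and closing quotes are the same kind.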
There are lots of unwanted href attributes in the page. As I said, there are 26 classes and 6 heading tags, and out of those I want to extract only the links under an h3 heading with class=r. Will the above pattern work for that? I don't think so. :-) I am a noob at this, please help me out.
I appreciate that this is new to you, but it doesn't take a great deal of thought to extend this to what you need:

<h3 class=r><a.+?href=["'](?<href>.+?)["'].+?></h3>

See what I've done here?
So first it checks for <h3 class=r>,
then <a, then . matches any character, + means one or more of the previous item, ? makes that match lazy, and ["'] matches either a single or a double quote. Is that right?
Can you give me the full code? :-) For example, how do I use the match group?
This is where you can find all the information about the Regex class:
http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regex.aspx
// Requires: using System.Text.RegularExpressions;
Regex re = new Regex("<h3 class=r><a.+?href=[\"'](?<href>.+?)[\"'].+?></h3>");
Match m = re.Match("<h3 class=r><a href=\"http://en.wikipedia.org/wiki/Hello_world_program\" class=l></h3>");
// If nothing matched, m.Success is false and the group value is an empty string.
string href = m.Groups["href"].Value;

re.Match("<h3 class=r><a href=\"http://en.wikipedia.org/wiki/Hello_world_program\" class=l></h3>");

In the above statement, will it extract only that particular URL, or will it extract all URLs?
If you had looked through the documentation, you'd have seen that the Regex object doesn't just have a Match method, but also (amongst others) a Matches method, which returns a MatchCollection:
http://msdn.microsoft.com/en-us/library/e7sf90t3.aspx
Yeah, I am successful in extracting the

 <h3 class=r><a href="http://en.wikipedia.org/wiki/Hello_world_program" class=l>

above one. I know I have to use a loop to extract the links, but I don't know how to proceed.
Regex re = new Regex("Your Regular Expression");
foreach (Match m in re.Matches("Text to run Regex on")) {
    // Do stuff with matches
}
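
For the array part of the question, one way to fill in that loop is sketched below. HrefCollector and Collect are hypothetical names and the URLs are placeholders. The pattern is cut off after the closing quote of the href, which is my assumption, because the sample HTML posted earlier has no </h3> after the link.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HrefCollector
{
    // Only links that directly follow <h3 class=r> are considered.
    private static readonly Regex H3Link = new Regex(
        "<h3 class=r><a.+?href=[\"'](?<href>.+?)[\"']");

    public static string[] Collect(string html)
    {
        var links = new List<string>();
        // Matches returns every non-overlapping match, not just the first one.
        foreach (Match m in H3Link.Matches(html))
        {
            links.Add(m.Groups["href"].Value);
        }
        return links.ToArray();
    }

    public static void Main()
    {
        string html =
            "<h3 class=r><a href=\"http://a.example/1\" class=l>One</a></h3>\n" +
            "<h2 class=r><a href=\"http://skip.example/\">Skip</a></h2>\n" +
            "<h3 class=r><a href='http://b.example/2' class=l>Two</a></h3>";

        foreach (string link in Collect(html))
        {
            Console.WriteLine(link); // the two h3 links; the h2 link is skipped
        }
    }
}
```

Collecting into a List<string> first and converting at the end avoids having to know the number of matches up front.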
nice..