Yogesh_Agarwal
asked on
Using Regex Class to eliminate html tags
hi,
I want to eliminate image and iframe tags from HTML (which is stored in a string). I have been trying for the past 3 days but I keep failing. Along with that, please tell me the proper way to use it.
What scripting language are you using? And please show some code.
ASKER
I am using VB.NET.
(<p class=.+>) - I used this to remove p class. I want to remove iframe, image, and script tags.
ASKER CERTIFIED SOLUTION
ASKER
I have already read that page. Only after reading it did I try it by myself, but I am failing.
You might get some unwanted side effects when using
"(<p class=.+>)" as it will match all the following text, which isn't desirable:
<p class='foo'>Some Text</p><a href='./index.aspx'>Home</a><p>More Text</p>
Try something like this; it uses the lazy (non-greedy) quantifier (.+?) and a backreference (\k<tag>) so that the match ends at the closing tag:
(<(?<tag>p) class=.+?(\k<tag>)>)
Given this string "<p class='foo'>Some Text</p><a href='./index.aspx'>Home</a><p>More Text</p>", it matches this:
<p class='foo'>Some Text</p>
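To see the difference between the two patterns side by side, here is a minimal C# sketch. The class and member names (GreedyVsLazyDemo, GreedyMatch, LazyMatch) are made up for illustration; only Regex.Match and the two patterns come from this thread.

```csharp
using System.Text.RegularExpressions;

public static class GreedyVsLazyDemo
{
    // Sample input from the post above.
    public const string Html =
        "<p class='foo'>Some Text</p><a href='./index.aspx'>Home</a><p>More Text</p>";

    // Greedy: .+ grabs as much as possible, so the match runs to the last '>' in the input.
    public static string GreedyMatch() =>
        Regex.Match(Html, "(<p class=.+>)").Value;

    // Lazy: .+? grabs as little as possible, so the match stops at the first "p>" it can reach.
    public static string LazyMatch() =>
        Regex.Match(Html, @"(<(?<tag>p) class=.+?(\k<tag>)>)").Value;
}
```

With this input, GreedyMatch() returns the whole string, while LazyMatch() returns only the first paragraph element.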
Self closing tags like img are far simpler:
(<img .+?>)
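Putting this together for the original question (removing img, iframe and script tags), a rough sketch with Regex.Replace could look like the following. TagStripper and StripTags are hypothetical names, and the patterns are my own variations on the ones above: I've added \b so that "<img" doesn't match longer tag names, and the Singleline option so that "." also matches line breaks inside a script block. It assumes reasonably well-formed HTML; an attribute value containing ">" or an iframe without a closing tag will trip it up.

```csharp
using System.Text.RegularExpressions;

public static class TagStripper
{
    public static string StripTags(string html)
    {
        const RegexOptions opts = RegexOptions.IgnoreCase | RegexOptions.Singleline;

        // Paired tags: remove everything from the opening tag through the
        // closing tag with the same name (backreference to the "tag" group).
        html = Regex.Replace(html, @"<(?<tag>iframe|script)\b.*?</\k<tag>\s*>", "", opts);

        // img has no closing tag, so removing up to the first '>' is enough.
        html = Regex.Replace(html, @"<img\b.*?>", "", opts);

        return html;
    }
}
```

Usage would be something like: string clean = TagStripper.StripTags(html);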
ASKER
Perfect, I got it. But what should I do if I want to extract href links from an HTML page?
example:
str= "<html>
<h3 class=r><a href="http://en.wikipedia.org/wiki/Hello_world_program" class=l>
<h3 class=r><a href="http://en.wikipedia.org/wiki/Hello_world_program" class=l>
<h3 class=r><a href="http://en.wikipedia.org/wiki/Hello_world_program" class=l>
</html>
Now I want the values present in href="http://www.xxxxxxxxx.com". I also want to specify that the link should be under <h3 class=r>. There are 26 classes (a-z), so I need to say that it should extract only the links in h3 headings with class r, and I want to store them in an array. Can you please guide me?
Use the Groups Luke :-)
<a.+?href=["'](?<href>.+?)["'].+?>
A couple of things to note:
I haven't assumed that the href attribute immediately follows the tag name.
I've allowed use of single & double quotes for the href attribute.
ASKER
There are lots of unwanted href links in the page. As I said, there are 26 classes with 6 heading tags. Of those, I want to extract only class=r with an h3 heading. Will the above code work for that? I don't think so. :-) I am a noob at this, please help me out.
I appreciate that this is new to you, but it doesn't take a great deal of thought to extend this to what you need:
<h3 class=r><a.+?href=["'](?<href>.+?)["'].+?></h3>
See what I've done here?
ASKER
first it will check <h3 class=r>
then <a . -> Previous operator + -> previous thing plus ? -> if it matches previous exp. [""] -> match anything between " "..
ASKER
Can you give me the full code? :-) Like how to use a match group?
This is where you can find all the information about the Regex class:
http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regex.aspx
System.Text.RegularExpressions.Regex re = new System.Text.RegularExpressions.Regex("<h3 class=r><a.+?href=[\"'](?<href>.+?)[\"'].+?></h3>");
System.Text.RegularExpressions.Match m = re.Match("<h3 class=r><a href=\"http://en.wikipedia.org/wiki/Hello_world_program\" class=l></h3>");
string href = m.Groups["href"].Value;
ASKER
re.Match("<h3 class=r><a href=\"http://en.wikipedia.org/wiki/Hello_world_program\" class=l></h3>");
In the above statement, will it extract only that particular URL, or will it extract all URLs?
If you had looked through the documentation, you'd have seen that the Regex object doesn't just have a Match method, but also (amongst others) a Matches method, which returns a MatchCollection:
http://msdn.microsoft.com/en-us/library/e7sf90t3.aspx
ASKER
Yeah, I am successful in extracting the
<h3 class=r><a href="http://en.wikipedia.org/wiki/Hello_world_program" class=l>
the above one. I know I have to use a loop to extract the links, but I don't know what to do or how to proceed.
Regex re = new Regex("Your Regular Expression");
foreach (Match m in re.Matches("Text to run Regex on")) {
// Do stuff with matches
}
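To fill in that loop for the h3 case and end up with an array, a minimal sketch might look like this. LinkExtractor and ExtractLinks are made-up names for illustration. The pattern is the one from earlier in this thread, cut off after the closing quote, on the assumption that only the URL is needed and that the h3 may not be closed directly after the link.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkExtractor
{
    // Only links that sit directly inside <h3 class=r> are considered.
    private static readonly Regex HrefInH3 = new Regex(
        "<h3 class=r><a.+?href=[\"'](?<href>.+?)[\"']",
        RegexOptions.IgnoreCase);

    public static string[] ExtractLinks(string html)
    {
        var links = new List<string>();

        // Matches returns every non-overlapping match, not just the first one.
        foreach (Match m in HrefInH3.Matches(html))
        {
            // The named group "href" holds just the URL between the quotes.
            links.Add(m.Groups["href"].Value);
        }

        return links.ToArray();
    }
}
```

Calling LinkExtractor.ExtractLinks(str) on the page source should then give one array entry per matching heading, while links under other headings or classes are skipped.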
ASKER
nice..