Solved

Using Regex Class to eliminate html tags

Posted on 2009-04-17
Last Modified: 2013-12-19
Hi,

I want to eliminate image and iframe tags from HTML (which is stored in a string). I have been trying for the past 3 days but I keep failing. Along with that, please tell me the proper way to use the Regex class.
Question by:Yogesh_Agarwal
17 Comments
 
LVL 1

Expert Comment

by:Eggpatch
ID: 24174100
What scripting language are you using? And please show some code.
 

Author Comment

by:Yogesh_Agarwal
ID: 24174104
I am using VB.NET.

I used (<p class=.+>) to remove p tags that have a class. I want to remove iframe, image, and script tags.
 
LVL 1

Accepted Solution

by:
Eggpatch earned 1500 total points
ID: 24174146

 

Author Comment

by:Yogesh_Agarwal
ID: 24174396
I have already read that page. It was only after reading it that I decided to try it myself, but I keep failing.
 
LVL 15

Expert Comment

by:oobayly
ID: 24174925
You might get some unwanted side effects when using
"(<p class=.+>)": the greedy .+ runs on to the last > on the line, so the match swallows all the following text, which isn't desirable. For example, it matches this whole string:

<p class='foo'>Some Text</p><a href='./index.aspx'>Home</a><p>More Text</p>

Try something like this instead. It uses the lazy (non-greedy) quantifier (.+?) and makes sure the match ends at the closing tag:
(<(?<tag>p) class=.+?(\k<tag>)>)

Given this string "<p class='foo'>Some Text</p><a href='./index.aspx'>Home</a><p>More Text</p>", it matches this:
<p class='foo'>Some Text</p>

Tags with no closing tag, like img, are far simpler:
(<img .+?>)
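
For the original question (stripping image and iframe tags from a string), here is a minimal sketch of how patterns like these could be applied with Regex.Replace. The sample HTML and the variable names are made up for illustration, and the iframe pattern is my own assumption built on the same lazy-quantifier idea rather than something posted above. Regex-based stripping will not cope with every real-world page, so treat this as a starting point only.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical sample input; any HTML string would do.
string html = "<p>Intro</p><img src='a.png' alt='A'><iframe src='ad.html'>fallback</iframe><p>End</p>";

// Singleline lets '.' match newlines, so tags spanning several lines are caught;
// IgnoreCase also catches <IMG ...> and <IFRAME ...>.
RegexOptions options = RegexOptions.IgnoreCase | RegexOptions.Singleline;

// img has no closing tag, so lazily matching up to the first '>' is enough.
string withoutImages = Regex.Replace(html, @"<img\s.+?>", "", options);

// iframe has a closing tag: lazily match everything up to the first </iframe>.
string cleaned = Regex.Replace(withoutImages, @"<iframe\b.*?</iframe>", "", options);

Console.WriteLine(cleaned); // <p>Intro</p><p>End</p>
```

The same calls exist in VB.NET (Regex.Replace(html, pattern, "", options)); only the string-literal syntax differs.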
 

Author Comment

by:Yogesh_Agarwal
ID: 24175379
Perfect, I got it. But what should I do if I want to extract href links from an HTML page?

example:

str= "<html>
<h3 class=r><a href="http://en.wikipedia.org/wiki/Hello_world_program" class=l>
<h3 class=r><a href="http://en.wikipedia.org/wiki/Hello_world_program" class=l>
<h3 class=r><a href="http://en.wikipedia.org/wiki/Hello_world_program" class=l>
</html>

Now I want the values present in href="http://www.xxxxxxxxx.com". I also want to specify that the link should be under <h3 class=r>. There are 26 classes (a-z), so I need to say that it should extract only the links in h3 headings with class r, and I want to store them in an array. Can you please guide me?
 
LVL 15

Expert Comment

by:oobayly
ID: 24175412
Use the Groups Luke :-)

<a.+?href=["'](?<href>.+?)["'].+?>

A couple of things to note:
I haven't assumed that the href attribute immediately follows the tag name.
I've allowed use of single & double quotes for the href attribute.
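
A short sketch of those two points in practice, assuming the pattern is used through the Regex class. The input strings and variable names are hypothetical; the pattern is the one given above.

```csharp
using System;
using System.Text.RegularExpressions;

// The pattern from the comment above, written as a C# verbatim string
// (inside @"..." a double quote is escaped by doubling it).
Regex hrefPattern = new Regex(@"<a.+?href=[""'](?<href>.+?)[""'].+?>");

// Hypothetical inputs: one with single quotes, one with an attribute before href.
Match single = hrefPattern.Match("<a href='./index.aspx' class='nav'>Home</a>");
Match reordered = hrefPattern.Match("<a class=\"l\" href=\"http://example.com/\" id=\"x\">Link</a>");

// Groups["href"] is the named group; its Value is empty if the match failed,
// so Success is worth checking on real input.
string singleHref = single.Groups["href"].Value;
string reorderedHref = reordered.Groups["href"].Value;

Console.WriteLine(singleHref);    // ./index.aspx
Console.WriteLine(reorderedHref); // http://example.com/
```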
 

Author Comment

by:Yogesh_Agarwal
ID: 24175431
There are lots of unwanted href attributes in the page. As I said, there are 26 classes and 6 heading tags. Of those, I want to extract links only from h3 headings with class=r. Will the above pattern work for that? I don't think so. :-) I am a noob at this, please help me out.
 
LVL 15

Expert Comment

by:oobayly
ID: 24175982
I appreciate that this is new to you, but it doesn't take a great deal of thought to extend this to what you need:

<h3 class=r><a.+?href=["'](?<href>.+?)["'].+?></h3>

See what I've done here?
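
The pattern above can be read piece by piece. One way to do that, sketched here under the assumption that you are using the .NET Regex class, is the IgnorePatternWhitespace option, which lets the same pattern be split over several lines with a comment on each part. The variable names are only illustrative, and the test string is a made-up h3 heading.

```csharp
using System;
using System.Text.RegularExpressions;

// Same pattern as above, split over several lines. With IgnorePatternWhitespace
// the engine skips unescaped whitespace and treats '#' to end of line as a comment.
Regex annotated = new Regex(@"
    <h3[ ]class=r>      # literal opening tag; [ ] is a literal space
    <a                  # start of the anchor tag
    .+?                 # one or more of any character, as few as possible (lazy)
    href=[""']          # href= followed by a double or single quote
    (?<href>.+?)        # named group href: the URL, again as few characters as possible
    [""']               # the closing quote
    .+?>                # the rest of the anchor tag, up to a closing angle bracket
    </h3>               # literal closing tag
", RegexOptions.IgnorePatternWhitespace);

Match m = annotated.Match("<h3 class=r><a href=\"http://en.wikipedia.org/wiki/Hello_world_program\" class=l></h3>");
Console.WriteLine(m.Groups["href"].Value); // http://en.wikipedia.org/wiki/Hello_world_program
```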
 

Author Comment

by:Yogesh_Agarwal
ID: 24176023
Is this how it reads? First it checks for <h3 class=r>, then <a.
. -> previous operator, + -> the previous thing plus, ? -> if it matches the previous expression, [""] -> match anything between " ".
 

Author Comment

by:Yogesh_Agarwal
ID: 24176068
Can you give me the full code? :-) For example, how do I use a match group?
 
LVL 15

Expert Comment

by:oobayly
ID: 24179993
This is where you can find all the information about the Regex class:
http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regex.aspx
System.Text.RegularExpressions.Regex re = new System.Text.RegularExpressions.Regex("<h3 class=r><a.+?href=[\"'](?<href>.+?)[\"'].+?></h3>");
System.Text.RegularExpressions.Match m = re.Match("<h3 class=r><a href=\"http://en.wikipedia.org/wiki/Hello_world_program\" class=l></h3>");
string href = m.Groups["href"].Value;

 

Author Comment

by:Yogesh_Agarwal
ID: 24185602
re.Match("<h3 class=r><a href=\"http://en.wikipedia.org/wiki/Hello_world_program\" class=l></h3>");

In the above statement, will it extract only that particular URL, or will it extract all URLs?
 
LVL 15

Expert Comment

by:oobayly
ID: 24185708
If you had looked through the documentation, you'd have seen that the Regex object doesn't just have a Match method, but also (amongst others) a Matches method, which returns a MatchCollection:
http://msdn.microsoft.com/en-us/library/e7sf90t3.aspx
 

Author Comment

by:Yogesh_Agarwal
ID: 24185751
Yes, I have succeeded in extracting

 <h3 class=r><a href="http://en.wikipedia.org/wiki/Hello_world_program" class=l>

the line above. I know I have to use a loop to extract the links, but I don't know what to do or how to proceed.
 
LVL 15

Expert Comment

by:oobayly
ID: 24185872
Regex re = new Regex("Your Regular Expression");
foreach (Match m in re.Matches("Text to run Regex on")) {
  // Do stuff with matches
}
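
Filling in that skeleton for the case discussed in this thread might look roughly like the sketch below. The page fragment and the variable names are hypothetical; the pattern is the h3 class=r one from earlier, and the results are copied into an array as asked. Note that '.' does not match a newline by default, so each heading is assumed to sit on one line.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical page fragment: two wanted links and one link under a different heading.
string page =
    "<h3 class=r><a href=\"http://example.com/one\" class=l>One</a></h3>\n" +
    "<h2 class=a><a href=\"http://example.com/skip\" class=l>Skip</a></h2>\n" +
    "<h3 class=r><a href=\"http://example.com/two\" class=l>Two</a></h3>";

Regex re = new Regex("<h3 class=r><a.+?href=[\"'](?<href>.+?)[\"'].+?></h3>");

List<string> found = new List<string>();
foreach (Match match in re.Matches(page))
{
    // Each Match exposes the named group captured for that occurrence.
    found.Add(match.Groups["href"].Value);
}

string[] links = found.ToArray();
Console.WriteLine(string.Join(", ", links)); // http://example.com/one, http://example.com/two
```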
 

Author Closing Comment

by:Yogesh_Agarwal
ID: 31571740
nice..