

How do I remove all HTML tags in a string using a regular expression?

$mystring contains:

<div align="center"><a href="http://www.domain.com/">DOMAIN NAME</a></div>Some random text.<br><a href="http://www.domain2.com/">ANOTHER DOMAIN NAME</a>

I want to use Perl and a regular expression to manipulate $mystring so that all the HTML elements and hyperlinks are removed and $mystring contains only "Some random text."

How can this be done?
1 Solution
perldoc -q "How do I remove HTML from a string"
Found in perlfaq9.pod
       How do I remove HTML from a string?

       The most correct way (albeit not the fastest) is to use HTML::Parser
       from CPAN.  Another mostly correct way is to use HTML::FormatText which
       not only removes HTML but also attempts to do a little simple
       formatting of the resulting plain text.
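
As a rough illustration of the HTML::Parser route mentioned above, here is a minimal sketch. It assumes HTML::Parser is installed from CPAN; the helper name html_to_text and the sample input are made up for this example and are not part of the FAQ.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: collect the text of a document, skipping the
# contents of <script> and <style> elements.
sub html_to_text {
    my ($html) = @_;
    my $text = '';
    my $parser = HTML::Parser->new(
        api_version => 3,
        # 'dtext' hands over text with entities such as &lt; already decoded.
        text_h => [ sub { $text .= shift }, 'dtext' ],
    );
    $parser->ignore_elements(qw(script style));
    $parser->parse($html);
    $parser->eof;
    return $text;
}

# Sample input with a quoted '>' in an attribute, an entity, and a comment.
print html_to_text('<p title="a > b">1 &lt; 2</p><!-- <b>hidden</b> -->'), "\n";
```

Because no comment handler is registered, comment contents are simply not collected, and the parser itself deals with quoted attribute values.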

       Many folks attempt a simple-minded regular expression approach, like
       "s/<.*?>//g", but that fails in many cases because the tags may
       continue over line breaks, they may contain quoted angle-brackets, or
       HTML comments may be present.  Plus, folks forget to convert
       entities--like "&lt;" for example.
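
To make those failure modes concrete, here is a small sketch that runs the naive substitution on three inputs. The helper name strip_naive and the sample strings are invented for illustration.

```perl
use strict;
use warnings;

# The naive approach quoted above: non-greedy match up to the next '>'.
sub strip_naive {
    my ($html) = @_;
    $html =~ s/<.*?>//g;
    return $html;
}

# A quoted '>' inside an attribute ends the match too early,
# leaving the tail of the tag behind.
print strip_naive('<IMG SRC="foo.gif" ALT="A > B">text'), "\n";

# Without /s, '.' does not match a newline, so the opening tag that
# spans two lines is left in place (only </a> is removed).
print strip_naive("<a\nhref=\"x\">link</a>"), "\n";

# Entities are passed through undecoded.
print strip_naive('<b>1 &lt; 2</b>'), "\n";
```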

       Here's one "simple-minded" approach, that works for most files:

           #!/usr/bin/perl -p0777
           s/<(?:[^>'"]*|(['"]).*?\1)*>//gs

       If you want a more complete solution, see the 3-stage striphtml program
       in http://www.cpan.org/authors/Tom_Christiansen/scripts/striphtml.gz .

       Here are some tricky cases that you should think about when picking a
       solution:

           <IMG SRC = "foo.gif" ALT = "A > B">

           <IMG SRC = "foo.gif"
                ALT = "A > B">

           <!-- <A comment> -->

           <script>if (a<b && a>c)</script>

           <# Just data #>

           <![INCLUDE CDATA [ >>>>>>>>>>>> ]]>

       If HTML comments include other tags, those solutions would also break
       on text like this:

           <!-- This section commented out.
               <B>You can't see me!</B>
           -->
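
For a feel of how the quote-aware pattern s/<(?:[^>'"]*|(['"]).*?\1)*>//gs behaves on some of the tricky cases listed above, the following sketch may help. The helper name strip_tags is an assumption made for this example.

```perl
use strict;
use warnings;

# Quote-aware pattern: inside a tag, a quoted attribute value is consumed
# as a unit, so a '>' inside quotes does not end the tag.
sub strip_tags {
    my ($html) = @_;
    $html =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;
    return $html;
}

my @cases = (
    q{<IMG SRC = "foo.gif" ALT = "A > B">},           # removed completely
    qq{<IMG SRC = "foo.gif"\n     ALT = "A > B">},    # removed completely
    q{<!-- <A comment> -->},                          # leaves ' -->' behind
    q{<script>if (a<b && a>c)</script>},              # mangles the script text
);

for my $case (@cases) {
    print '[', strip_tags($case), "]\n";
}
```

The first two cases come out empty, while the comment and script cases show why the FAQ calls this approach only mostly correct.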

jay28lee (Author) commented:
I found a piece of code in my original script (written by my previous programmer), and I suspect it is what's causing an error in my current HTML removal.

s/\G($C*?)(?:  +|($X)(-)|(-)(?=$X)|($X)(?=[+=\w(])|([+=\w)])(?=$X)|(\))(?=\S)|(\S)(?=\())/$1$2$4$5$6$7$8$s[!$3]/g;

Can you tell me if there's something wrong with the above code?  And what does it do?

Should I replace it with what you mentioned?


What are the values of $C and $X?
jay28lee (Author) commented:
There's also the following code before the regular expression:

   @s=(' - ',' ');

The above code was commented as handling for the Chinese Big-5 charset.
jay28lee (Author) commented:
btw, ozo, could you help me with another of my questions, linked below? It's related to one you answered back in 2005.


btw, the solution works for me using: s/<(?:[^>'"]*|(['"]).*?\1)*>//gs

I'll simply ignore what was previously written by the original programmer of my script.
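
For anyone landing here later, a rough sketch of that pattern applied to the $mystring from the question. Note that stripping tags alone keeps the link text; the extra substitution that drops each <a>...</a> element together with its contents is my own assumption about how to end up with only "Some random text.", not something from the FAQ.

```perl
use strict;
use warnings;

my $mystring = '<div align="center"><a href="http://www.domain.com/">DOMAIN NAME</a></div>'
             . 'Some random text.<br><a href="http://www.domain2.com/">ANOTHER DOMAIN NAME</a>';

# Tag stripping alone: the text between <a> and </a> survives.
(my $tags_only = $mystring) =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;
print "$tags_only\n";    # DOMAIN NAMESome random text.ANOTHER DOMAIN NAME

# Assumption: remove whole anchor elements (tags plus link text) first,
# then strip whatever tags remain.
(my $no_links = $mystring) =~ s{<a\b[^>]*>.*?</a>}{}gis;
$no_links =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;
print "$no_links\n";     # Some random text.
```

The anchor-removal step is still a regex and shares the weaknesses listed in the FAQ (nested or malformed markup, '>' inside attribute values), so it is only a fit for simple, well-formed input like the sample above.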
