How do I remove all HTML tags from a string using a regular expression?

Posted on 2011-05-08
Last Modified: 2013-12-25
$mystring contains:

<div align="center"><a href="">DOMAIN NAME</a></div>Some random text.<br><a href="">ANOTHER DOMAIN NAME</a>

I want to use Perl and a regular expression to strip all the HTML elements and hyperlinks from $mystring, so that it contains only "Some random text."

How can this be done?
Question by: jay28lee

    Accepted Solution

    perldoc -q "How do I remove HTML from a string"
    Found in perlfaq9.pod
           How do I remove HTML from a string?

           The most correct way (albeit not the fastest) is to use HTML::Parser
           from CPAN.  Another mostly correct way is to use HTML::FormatText which
           not only removes HTML but also attempts to do a little simple
           formatting of the resulting plain text.
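
A minimal sketch of the HTML::Parser route mentioned above, assuming the module is installed from CPAN (it is not part of core Perl). The helper name `html_to_text` is made up for illustration. It collects text events only and asks the parser to skip `script` and `style` content.

```perl
use strict;
use warnings;
use HTML::Parser;    # CPAN module, assumed to be installed

# Hypothetical helper: return only the text content of an HTML string.
sub html_to_text {
    my ($html) = @_;
    my $text = '';
    my $parser = HTML::Parser->new(
        api_version => 3,
        # "dtext" passes the callback text with entities already decoded.
        text_h => [ sub { $text .= $_[0] }, 'dtext' ],
    );
    # Drop the tags and the content of these elements.
    $parser->ignore_elements(qw(script style));
    $parser->parse($html);
    $parser->eof;    # flush any text still buffered by the parser
    return $text;
}

print html_to_text('<p>1 &lt; 2</p>'), "\n";
```

Because a real parser tracks quoting and element boundaries, the quoted `>` and the `<script>` body from the tricky cases in this FAQ entry should not leak into the result.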

           Many folks attempt a simple-minded regular expression approach, like
           "s/<.*?>//g", but that fails in many cases because the tags may
           continue over line breaks, they may contain quoted angle-brackets, or
           HTML comment may be present.  Plus, folks forget to convert
           entities--like "&lt;" for example.
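
The failure modes listed above are easy to reproduce. The sketch below wraps the `s/<.*?>//g` substitution in a helper named `naive_strip` (a name invented here) and runs it on three small inputs.

```perl
use strict;
use warnings;

# Hypothetical helper around the "simple-minded" substitution quoted above.
sub naive_strip {
    my ($html) = @_;
    $html =~ s/<.*?>//g;
    return $html;
}

# A quoted ">" ends the match too early, so the tail of the tag survives.
my $quoted = naive_strip('<IMG SRC = "foo.gif" ALT = "A > B">');

# Without /s, "." does not match a newline, so a tag that spans
# two lines is never matched and the input comes back unchanged.
my $multiline = naive_strip(qq{<IMG SRC = "foo.gif"\n     ALT = "A > B">});

# The tags are removed, but the entity is left undecoded.
my $entity = naive_strip('<b>1 &lt; 2</b>');

print "[$quoted]\n[$multiline]\n[$entity]\n";
```

The first case leaves ` B">` behind, the second leaves the whole tag in place, and the third still contains `&lt;`.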

           Here's one "simple-minded" approach, that works for most files:

               #!/usr/bin/perl -p0777
               s/<(?:[^>'"]*|(['"]).*?\1)*>//gs

           If you want a more complete solution, see the 3-stage striphtml program
           in .

           Here are some tricky cases that you should think about when picking a
           solution:

               <IMG SRC = "foo.gif" ALT = "A > B">

               <IMG SRC = "foo.gif"
                    ALT = "A > B">

               <!-- <A comment> -->

               <script>if (a<b && a>c)</script>

               <# Just data #>

               <![INCLUDE CDATA [ >>>>>>>>>>>> ]]>

           If HTML comments include other tags, those solutions would also break
           on text like this:

               <!-- This section commented out.
                   <B>You can't see me!</B>
               -->
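
One way to cope with tags inside comments is to remove whole comments before removing tags. The following is only a sketch: both helper names are made up, and the comment pattern assumes comments are well formed and end at the first `-->`.

```perl
use strict;
use warnings;

# Quote-aware tag removal only, for comparison.
sub strip_tags_only {
    my ($html) = @_;
    $html =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;
    return $html;
}

# Assumption: every comment is closed by the first "-->" that follows it.
sub strip_comments_then_tags {
    my ($html) = @_;
    $html =~ s/<!--.*?-->//gs;    # drop whole comments first
    return strip_tags_only($html);
}

my $html = "<!-- This section commented out.\n"
         . "    <B>You can't see me!</B>\n"
         . "-->Visible text";

print strip_tags_only($html), "\n";           # comment body leaks through
print strip_comments_then_tags($html), "\n";  # only "Visible text" remains
```

With tags-only removal, the match that starts at `<!--` stops at the `>` of `<B>`, so the commented-out text and the trailing `-->` end up in the output.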

    Author Comment

    I found a piece of code in my original script (written by my previous programmer), and I suspect it is what's causing the error in my HTML removal.

    s/\G($C*?)(?:  +|($X)(-)|(-)(?=$X)|($X)(?=[+=\w(])|([+=\w)])(?=$X)|(\))(?=\S)|(\S)(?=\())/$1$2$4$5$6$7$8$s[!$3]/g;

    Can you tell me if there's something wrong with the above code?  And what does it do?

    Should I replace it with what you mentioned?



    Expert Comment

    What are the values of $C and $X?

    Author Comment

    there's also the following code before the regular expression

       @s=(' - ',' ');

    the above code was commented as handling the Chinese Big5 charset.

    Author Comment

    btw, ozo, could you take a look at another question of mine? It's related to one you answered back in 2005.

    btw, the solution works for me using: s/<(?:[^>'"]*|(['"]).*?\1)*>//gs

    i'll simply ignore what was previously written by the original programmer of my script.
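
For reference, here is that substitution applied to the $mystring from the question. It removes the tags but keeps the link text, so the second helper (`strip_links_and_tags`, a hypothetical addition that is not part of the accepted answer) first drops each `<a>` element together with its content. It assumes anchors are not nested.

```perl
use strict;
use warnings;

# The quote-aware substitution reported to work in this thread.
sub strip_tags {
    my ($html) = @_;
    $html =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;
    return $html;
}

# Assumption: link text should go too, and <a> elements are not nested.
sub strip_links_and_tags {
    my ($html) = @_;
    $html =~ s{<a\b[^>]*>.*?</a>}{}gis;    # drop each anchor and its content
    return strip_tags($html);
}

my $mystring = '<div align="center"><a href="">DOMAIN NAME</a></div>'
             . 'Some random text.<br><a href="">ANOTHER DOMAIN NAME</a>';

print strip_tags($mystring), "\n";
# prints "DOMAIN NAMESome random text.ANOTHER DOMAIN NAME"
print strip_links_and_tags($mystring), "\n";
# prints "Some random text."
```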
