
How do I remove all HTML tags from a string using a regular expression?

$mystring contains:

<div align="center"><a href="http://www.domain.com/">DOMAIN NAME</a></div>Some random text.<br><a href="http://www.domain2.com/">ANOTHER DOMAIN NAME</a>

I want to use Perl and a regular expression to modify $mystring so that all the HTML elements and hyperlinks are removed and $mystring contains only "Some random text."

How can this be done?

ASKER CERTIFIED SOLUTION

ozo


jay28lee

ASKER

I found a piece of code in my original script (written by my previous programmer), and I suspect it is what's causing the error in my current HTML removal.

s/\G($C*?)(?: +|($X)(-)|(-)(?=$X)|($X)(?=[+=\w(])|([+=\w)])(?=$X)|(\))(?=\S)|(\S)(?=\())/$1$2$4$5$6$7$8$s[!$3]/g;

Can you tell me if there's something wrong with the above code? And what does it do?

Should I replace it with what you mentioned?

s/<(?:[^>'"]*|(['"]).*?\1)*>//gs

Thanks.

ozo

What are the values of $C and $X?

jay28lee

ASKER

The following code also appears before the regular expression:

$X=qr/[\x81-\xFE][\x40-\x7E\x80-\xFE]/;
$C=qr/$X|[^\x81-\xFE]/;
@s=(' - ',' ');

The above code was commented as handling the Chinese Big-5 charset.
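
For readers following along, here is a minimal sketch of what the two patterns appear to match, based only on the byte ranges quoted above: $X is a lead byte in \x81-\xFE followed by a trail byte in \x40-\x7E or \x80-\xFE (one double-byte character), and $C is either such a pair or any single byte outside \x81-\xFE. The helper name split_chars and the sample bytes are assumptions made for illustration; they are not from the original script.

```perl
use strict;
use warnings;

# Patterns exactly as quoted in the thread.
my $X = qr/[\x81-\xFE][\x40-\x7E\x80-\xFE]/;   # one double-byte character
my $C = qr/$X|[^\x81-\xFE]/;                   # double-byte pair or one single byte

# Hypothetical helper: split a byte string into characters as $C sees them.
# \G makes each match start where the previous one ended, so a trail byte
# is never re-read as the start of a new character.
sub split_chars {
    my ($bytes) = @_;
    return $bytes =~ /\G($C)/g;
}

# "\xA4\x40" is a sample pair in the double-byte range; its second byte has
# the same value as ASCII "@", which is why byte-wise matching can misfire.
my @chars = split_chars("a\xA4\x40b");
printf "%d characters\n", scalar @chars;   # prints "3 characters"
```

Going by those ranges, a trail byte can never be `<`, `>`, `'` or `"` (all below \x40), so a byte-wise tag-stripping regex should not be confused by the second byte of a double-byte character, at least under this definition of the charset.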

jay28lee

ASKER

By the way, ozo, could you take a look at another of my questions? It is related to one you answered back in 2005.

https://www.experts-exchange.com/questions/27024825/string-manupulation-big5-characters-now-needs-HTML-Entity-support.html

By the way, the solution works for me, using: s/<(?:[^>'"]*|(['"]).*?\1)*>//gs

I'll simply ignore what the original programmer of my script wrote.
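
As a rough sketch of how the accepted regex behaves on the sample string from the question: on its own it removes only the tags, so the link text stays. If the goal is to end up with only "Some random text.", one possible approach (an assumption, not something stated in the thread) is to remove each <a>...</a> element together with its contents first. The helper names strip_tags and strip_links_and_tags are hypothetical, and the simple anchor pattern does not handle nested links or a ">" inside an attribute value.

```perl
use strict;
use warnings;

# Tag-stripping regex quoted in the thread, wrapped in a hypothetical helper.
sub strip_tags {
    my ($html) = @_;
    # A tag is "<", then any mix of unquoted characters and quoted strings, then ">".
    $html =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;
    return $html;
}

# Assumption: to drop link text as well, remove whole <a>...</a> elements first.
sub strip_links_and_tags {
    my ($html) = @_;
    $html =~ s/<a\b[^>]*>.*?<\/a>//gis;
    return strip_tags($html);
}

my $mystring = '<div align="center"><a href="http://www.domain.com/">DOMAIN NAME</a></div>'
             . 'Some random text.<br><a href="http://www.domain2.com/">ANOTHER DOMAIN NAME</a>';

print strip_tags($mystring), "\n";            # DOMAIN NAMESome random text.ANOTHER DOMAIN NAME
print strip_links_and_tags($mystring), "\n";  # Some random text.
```

For HTML that is not as regular as this sample, a real parser such as the CPAN module HTML::Parser is usually a safer choice than a regex.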