jay28lee

How do I remove all HTML tags in a string using regular expression?

$mystring contains:

<div align="center"><a href="http://www.domain.com/">DOMAIN NAME</a></div>Some random text.<br><a href="http://www.domain2.com/">ANOTHER DOMAIN NAME</a>

I want to use Perl and a regular expression to remove all the HTML elements and hyperlinks from $mystring, so that $mystring contains only "Some random text."

How can this be done?
ASKER CERTIFIED SOLUTION
ozo
jay28lee

ASKER

I found a piece of code in my original script (which my previous programmer wrote), and I suspect it is what is causing an error in my current HTML-removal situation.

s/\G($C*?)(?:  +|($X)(-)|(-)(?=$X)|($X)(?=[+=\w(])|([+=\w)])(?=$X)|(\))(?=\S)|(\S)(?=\())/$1$2$4$5$6$7$8$s[!$3]/g;

Can you tell me if there's something wrong with the above code?  And what does it do?

Should I replace it with what you mentioned?

s/<(?:[^>'"]*|(['"]).*?\1)*>//gs

Thanks.
What are the values of $C and $X?
There's also the following code before the regular expression:

   $X=qr/[\x81-\xFE][\x40-\x7E\x80-\xFE]/;
   $C=qr/$X|[^\x81-\xFE]/;
   @s=(' - ',' ');

The above code was commented as handling the Chinese Big-5 charset.
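
For anyone reading along, here is a small sketch of what those two patterns appear to do, based only on the code quoted above. $X matches one double-byte Big-5 character (lead byte 0x81-0xFE, trail byte 0x40-0x7E or 0x80-0xFE), and $C matches one whole character of either width, so \G($C*?) keeps the match aligned on character boundaries. The substitution itself looks like a spacing normalizer: it collapses runs of two or more spaces and inserts a space between Big-5 characters and adjacent ASCII word characters or parentheses. The sample bytes and the normalize() wrapper are hypothetical, added only for illustration.

```perl
use strict;
use warnings;

# Character classes and separator list copied from the thread.
my $X = qr/[\x81-\xFE][\x40-\x7E\x80-\xFE]/;  # one double-byte Big-5 character
my $C = qr/$X|[^\x81-\xFE]/;                  # one character, double-byte or single-byte
my @s = (' - ', ' ');

# Hypothetical sample: bytes A5 5C form one double-byte character whose
# trail byte has the same value as an ASCII backslash.
my $bytes = "\xA5\x5Cab";
my @chars = $bytes =~ /\G($C)/g;              # walk the string one whole character at a time
printf "%d characters\n", scalar @chars;

# Hypothetical wrapper around the substitution quoted in the thread.
sub normalize {
    my ($text) = @_;
    no warnings 'uninitialized';              # capture groups that did not match are undef
    $text =~ s/\G($C*?)(?:  +|($X)(-)|(-)(?=$X)|($X)(?=[+=\w(])|([+=\w)])(?=$X)|(\))(?=\S)|(\S)(?=\())/$1$2$4$5$6$7$8$s[!$3]/g;
    return $text;
}

my $collapsed = normalize("a  b");            # run of spaces becomes a single space
my $spaced    = normalize("\xA5\x5Cabc");     # space inserted after the Big-5 character
print "$collapsed\n";
```

Nothing in this substitution touches `<` or `>`, so on this reading it is unrelated to tag removal. It only changes spacing around Big-5 text.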
By the way, ozo, could you help me with another of my questions, linked below? It is related to what you answered back in 2005.

https://www.experts-exchange.com/questions/27024825/string-manupulation-big5-characters-now-needs-HTML-Entity-support.html

By the way, the solution works for me using: s/<(?:[^>'"]*|(['"]).*?\1)*>//gs

I'll simply ignore what was previously written by the original programmer of my script.
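
A minimal sketch of the accepted substitution applied to the sample string from the question. Note that it removes only the tags themselves, so the link text between `<a ...>` and `</a>` stays in the result. The extra first pass that drops whole anchor elements is not from the thread. It is an assumption about how to reach the "Some random text." result the question asks for, and it assumes anchors are not nested and have no `>` inside attribute values.

```perl
use strict;
use warnings;

my $mystring = '<div align="center"><a href="http://www.domain.com/">DOMAIN NAME</a></div>Some random text.<br><a href="http://www.domain2.com/">ANOTHER DOMAIN NAME</a>';

# Substitution from the thread: strips tags, honoring quoted attribute values.
(my $tags_stripped = $mystring) =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;

# Hypothetical extra pass (not from the thread): remove each <a>...</a>
# element together with its text, then strip the remaining tags.
(my $no_links = $mystring) =~ s/<a\b[^>]*>.*?<\/a>//gis;
$no_links =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;

print "$tags_stripped\n";   # DOMAIN NAMESome random text.ANOTHER DOMAIN NAME
print "$no_links\n";        # Some random text.
```

For anything beyond simple, well-formed snippets like this one, a real HTML parser is usually a safer choice than a regular expression.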