Link to home
Start Free TrialLog in
Avatar of killer455
killer455

asked on

strip tags preg expression (regular expressions)

Hi,

Im using this:
      $allowed = "b|i|u|a";

      $str = preg_replace ( "/<((?!\/?($allowed)\b)[^>]*>)/xis", "", $str );
      $str = preg_replace ( "/<($allowed).*?>/i", "<\\1>", $str );

However allowing the link tag (a href) does not work because its not recognizing the href part.

I want this to filter out EVERYTHING but $allowed tags.
Avatar of BogoJoker
BogoJoker

Have you seen strip_tags() in the php docs?
http://www.php.net/manual/en/function.strip-tags.php

The optional parameter is allowable tags.  It looks like this is what you would want:
strip_tags($str, '<b><i><u><a>');

Joe P
There is a slight flaw to the strip_tags() function...if you have a greater-than symbol (>) inside of the ALT attribute of an IMG tag, it will not strip it properly. Also, if you want to remove embedded JavaScript or CSS, you should do so separately using preg_replace(). But after handling these three cases, go ahead and use strip_tags() to handle everything else as BogoJoker suggested, it would be most efficient.

Article about stripping the IMG tag with regex:
http://haacked.com/archive/2004/10/25/1471.aspx

What you should do, probably:

$str = preg_replace('@<script[^>]*?>.*?</script\s*>@si', '', $str);
$str = preg_replace('@<style[^>]*?>.*?</style\s*>@si', '', $str);
$str = preg_replace('@</?img((\s+\w+(\s*=\s*(?:".*?"|\'.*?\'|[^\'">\s]+))?)+\s*|\s*)/?>@i', '', $str);
$str = strip_tags($str, '<b><i><u><a>');
Avatar of killer455

ASKER

Will strip tags correctly handle the a href issue though? What about other attributes to links such as target?
<?php
// Get Google into a string
$google = file_get_contents("http://www.google.com");
// Strip all tags but <a> tags
$google = strip_tags($google, '<a>');
// Print
echo $google;
?>

Worked for <a> tags, the href still worked.  The only issue is if there is a > sign in a tag in double/single quotes.  And that should never happen in a url (to the best of my knowledge).  I don't even think that is a legal character in a url!  It would probably be some unicode trick like %20 for space...  If you have any concerns read the article that soapergem posted, it is very informative on the subject.

I feel strip_tags will work perfectly for you.
Joe P
Yes, Joe is right, strip_tags() will not run into any problems just because of the href-attributes. It strips all tags, except the ones you allow. And this function only focuses on the tag itself--not the attributes, so it will leave all of the attributes intact on the allowable tags.
Ok thanks.  I noticed soapergem you are removing <script> and <style> tags seperately.  Why doesnt the strip_tags function get these?  Are there others strip_tags doesnt cover? such as <applet> <iframe>, etc.

Script tags and Style tags would be removed perfectly by the strip_tags() function, but any content between the tags would not. This may or may not be important to you. Example: let's suppose this is the HTML that you are filtering:

<script type="text/javascript">
function say_hi()
{
    alert('Hi!');
}
window.onload = function(){say_hi();}
</script>

Using the strip_tags() function will remove the script tags, but none of the content in between, so you would end up with this:

function say_hi()
{
    alert('Hi!');
}
window.onload = function(){say_hi();}

Now that would be interpreted as plain text by any browser at all, so it wouldn't actually execute it as JavaScript--but I'd assume that you wouldn't want that text to be there at all. Same thing goes for the <style> tag, that will often have CSS content in between the opening and closing tag.
With that in mind what about t he other tags I mentioned such as <applet>?

here is my own function

you can get an idea
function VISION_TO_PRINT_FILTER($content)
{
      $content = str_replace(array('<code>', '</code>'), array('<span>', '</span>'), $content);
      $content = str_replace(array('<pre>', '</pre>'), array('<span>', '</span>'), $content);
      $fonts = str_replace(array('<font ', '</font>'), array('<span ', '</span>'), $content);
      $content = preg_replace('#color="(.*?)"#', 'style="color: \\1"', $fonts);
/*
      $content = eregi_replace( "<div[^>]*>", "", $content );
      $content = eregi_replace( "</div>", "", $content );
*/
      $content = eregi_replace( "javascript[^>]*>", "", $content );
      $content = preg_replace("/\<applet[^>]*\>(.+?)\<\/applet\>/is","",$content);
      $content = preg_replace("/\<embed[^>]*\>(.+?)\<\/embed\>/is","",$content);
      $content = preg_replace("/\<object[^>]*\>(.+?)\<\/object\>/is","",$content);
      $content = preg_replace("/\<form[^>]*\>(.+?)\<\/form\>/is","",$content);
      $content = preg_replace("/\<script[^>]*\>(.+?)\<\/script\>/is","",$content);

return $content;
}

i use it as filter but you can reverse and use it for allowed ...
feha
 www.vision.to

killer455, sorry I didn't get back to you earlier. The <applet> tag and all those others are fine; the code embedded there is just stored in the attributes.

e.g.
<applet codebase="blah" blahblah="somethingsomething" etc >

So you needn't worry about that. feha also posted a lot of stuff that really isn't necessary for the same reason. <object>s and <embed>s are, likewise, self-contained, so you don't need to handle them specially, either. Just strip use the code I originally posted, stripping the JavaScript, CSS, and image tags, and then using strip_tags() to wipe out all the rest (except the allowed ones, of course).

feha, I noticed you also modified my code to strip everything in between <form> and </form>. Even if the user is trying to maliciously create a form, this is probably still a bad idea, since *a lot* of content can be contained between <form> tags, much of which could be perfectly legitimate. So I think it'd be fine just to use strip_tags()--it would eliminate the opening and closing tags of the <form> as well as all the elements (<input>s, <button>s, <select>s, and <textarea>s, all automatically), while leaving behind any text that should have been there. So I vote to use strip_tags() for as much as you can. Just my two cents.
Let me say what we are trying to do and then if there are any closing comments let me know.   I have a form which submits a description (textarea field) to a database.  It would be nice to let the user submit a lot of valid HTML such as text formatting, images, and maybe some tables.  However the idea is to not let this be a security hole.

Okay. Two ways to do it. I removed the line to strip images, though, since it seems you'll allow them.

METHOD 1: Strip everything except allowed tags

<?php

$description = trim($_POST['description']);
$description = preg_replace('@<script[^>]*?>.*?</script\s*>@si', '', $description);
$description = preg_replace('@<style[^>]*?>.*?</style\s*>@si', '', $description);

$allowed = '<p><div><b><i><u><a><s><table><tr><td><th><colspan><col><img><strong><em>';
$str = strip_tags($description, $allowed);

?>

METHOD 2: Strip only the disallowed tags

<?php

$description = trim($_POST['description']);
$description = preg_replace('@<script[^>]*?>.*?</script\s*>@si', '', $description);
$description = preg_replace('@<style[^>]*?>.*?</style\s*>@si', '', $description);

$disallowed = array('applet', 'bgsound', 'blink', 'body', 'button', 'embed', 'form', 'frame', 'frameset', 'head',
    'html', 'iframe', 'ilayer', 'img', 'input', 'keygen', 'label', 'layer', 'link', 'object', 'optgroup', 'option', 'marquee',
    'meta', 'noframes', 'nolayer', 'noscript', 'param', 'script', 'select');

foreach ($disallowed as $tag)
{
    $description = preg_replace('@</?' . $tag . '[^>]*>@i', '', $description);
}

?>
Whoops, I used $str instead of $description toward the end of the first method. (Change it for consistency.)
Oh, and remove 'img' from the list of disallowed tags.
"feha, I noticed you also modified my code to strip everything in between "
sorry boy i did not modified "your code" this function is being in used in my CMS project long time ago ...
http://www.vision.to/CMS_BE/Home/ 

This is supposed to prevent entering form code within form ...

using
foreach ($disallowed as $tag)
{
    $description = preg_replace('@</?' . $tag . '[^>]*>@i', '', $description);
}

in your code defeats the role of reg_ex  :-(


Here is more info about ...

http://se2.php.net/strip_tags
> sorry boy i did not modified "your code"

"Boy"? Let's show a little respect, please. But getting back to the topic at hand, the whole point of what I was saying is that it's probably a bad idea to strip *absolutely everything* between <form> tags--there can be lots of legitimate content there! (text and whatnot)

So rather than stripping the content between the <form> tags, it's better, in practice, to just strip the <form> tags themselves, along with any <input> tags, <textarea>s, and so on. You could argue that by stripping the content between <form>s, this problem would be solved.....but you'd be wrong. There is no guarantee that a malicious user would write valid, well-formed code--in fact, they probably won't! They could simply omit the <form> tags altogether and include only the <input>s, in which case the code you posted would fail (you have not accounted for <input>s that fall outside of <form>s!).

Thus, the only safe catch-all is to actually strip these tags specifically: this includes <input>, <select>, <textarea>, <button>, <option>, and <optgroup>. You also want to handle the other malicious ones, like <iframe> and <ilayer>, which can enable a user to include pretty much anything they want. Plus all the others as well.

> ...defeats the role of reg_ex

I disagree. But to avoid argument on the subject, here's a revision to my "method 2," to appease the masses:

<?php

$description = trim($_POST['description']);
$description = preg_replace('@<script[^>]*?>.*?</script\s*>@si', '', $description);
$description = preg_replace('@<style[^>]*?>.*?</style\s*>@si', '', $description);

$regex  = '@</?(';
$regex .= 'applet|bgsound|blink|body|button|embed|form|frame|frameset|head|';
$regex .= 'html|iframe|ilayer|input|keygen|label|layer|link|object|optgroup|option|marquee|';
$regex .= 'meta|noframes|nolayer|noscript|param|select';
$regex .= ')[^>]*>@i';

$description = preg_replace($regex, '', $description);

?>
Thanks for the detailed response.  Then after that im assuming htmlentities() would be best before insertion into the DB?  Would you still want to addslashes() or would htmlentities() be all you need?

Bumped points up a small bit for this extra question.

here i found simple function for your needs:

function my_remove_slashes($st) {
  if (get_magic_quotes_gpc()) { return stripslashes($st); }
  else { return $st; }
}

function security_filter($st) {
   $attribs = 'javascript:|onclick|ondblclick|onmousedown|onmouseup|onmouseover|'.
              'onmousemove|onmouseout|onkeypress|onkeydown|onkeyup';

   $st = my_remove_slashes($st);
   $st = stripslashes(preg_replace("/$attribs/i", 'forbidden', $st));
   $st = strip_tags($st);
   $st = htmlentities($st);
   return $st;
}

hope this helps
:-)
you can add your own attributes
feha,

You are using stripslashes, shouldnt this be addslashes before putting it into DB? to escape and single/double quotes?


hi killer
stripslashes is conditional by

if (get_magic_quotes_gpc()) { return stripslashes($st); }

just try it you will see if it works for you :-)
I know but you are removing the slashes from the string?  I thought you wanted to escape quotes before database insertion?
that is what magic quotes does automatically. Magic isnt it =)
feha, I want to give you fair warning about one of your replacements. You used this:

    preg_replace("/$attribs/i", 'forbidden', $st)

...to strip things like "javascript:", "onclick", "ondblclick", etc. However, with such a simple regex you are simply stripping any occurences of those pieces of text, regardless of whether they appear inside an HTML tag or not. This may be a very effective method for thwarting would-be hackers, but it may also be overkill. What if the user needed to write a post that does not actually include any JavaScript, but *talks about* JavaScript? This would then filter to much. Here's an example. Let's say the user included a post like this:

    To add Javascript to a button, you would use the onclick event, like this:
    &lt;input type="button" onclick="custom_function();" value="Go!" /&gt;

This is a totally legitimate use of those pieces of text; however, your filter would change it into this:

    To add Javascript to a button, you would use the forbidden event, like this:
    &lt;input type="button" forbidden="custom_function();" value="Go!" /&gt;

This is clearly not desired. Thus it is better, in practice, to use a regular expression that is a little more complex to not remove what could be legitimate content. Something like this should work fine:

    $description = preg_replace("/(<\w+.*?) ($attribs)(=['\"]?.*?['\"]?)?(.*?/?>)/i", '$1$3', $description);

This would force the condition that these attributes be contained inside of an actual HTML tag, and not just intersparsed in the text (where they are harmless). Oh, and feel free to revise my regex usage there, experts, if you can think of a more efficient method to do it.
Whoops, change the replacement from $1$3 to $1$4
feha,

Yes I understand what magic quotes does.  But in your function, if magic quotes is off, you are not adding slashes anywhere.

----

soapergem,

What is your opinion on the add slashes above?  Also in your last post you have one line
$description = preg_replace("/(<\w+.*?) ($attribs)(=['\"]?.*?['\"]?)?(.*?/?>)/i", '$1$4', $description);

Which is not in your "method 1" or "method 2" above, should it be added?

thnks
Methods 1 and 2 are for stripping the tags you don't want, but there's still the possibility of a malicious user adding dangerous attributes to the allowed tags, so that would be for stripping the dangerous attributes from the allowed tags. feha offered a solution that would work, but I pointed out that it might work "too much" and strip some legitimate content as well. However, to be honest, I'm not 100% that my solution right there is *perfect*, which is why I said, "Oh, and feel free to revise my regex usage there, experts..." I am about 90% sure it will work, though, since it requires that the bad attributes be contained inside an HTML tag.

But yes, assuming it's perfect, you would add it to whichever method you were going with (1 or 2).

Remember that $attribs would be a string that looks like this (each separated by a pipe character), as given by feha:
    $attribs  = 'javascript:|onclick|ondblclick|onmousedown|onmouseup|onmouseover|';
    $attribs .= 'onmousemove|onmouseout|onkeypress|onkeydown|onkeyup';
Okay, I ran some tests on that attribute-stripping regex and found a better way to do it. The one I gave you will strip the attributes, but will only handle each tag once. In other words, any tag that contains more than one of the "dangerous" attributes will have one of them stripped, but the rest left intact. This is a bug (and why I said I wasn't 100%). feha's solution would never run into this problem, but like I said, it was overkill. So I revised my code and came up with a method that *will* work; this will completely strip all dangerous attributes from the allowed tags.

$regex = "@(<\w+.*?)\s+(?:$attribs)(?:=\"[^\"]*\"|='[^']*'|=[^>'\"\s]*)?(.*?/?>)@i";
while ( preg_match($regex, $description) )
{
      $description = preg_replace($regex, '$1$2', $description);
}
Oh, and we need to handle "hidden" javascript specially, as well with much of the same method employed by the attribute stripping. So, to summarize, here's a complete solution.

<?php

$description = trim($_POST['description']);

//  disallowed tags
$tags  = 'applet|bgsound|blink|body|button|embed|form|frame|frameset|head|';
$tags .= 'html|iframe|ilayer|input|keygen|label|layer|link|object|optgroup|option|marquee|';
$tags .= 'meta|noframes|nolayer|noscript|param|select';

//  disallowed attributes
$attribs  = 'onclick|ondblclick|onmousedown|onmouseup|onmouseover|';
$attribs .= 'onmousemove|onmouseout|onkeypress|onkeydown|onkeyup';

//  strip javascript, css, and tags
$regex = array('@<script[^>]*?>.*?</script\s*>@si', '@<style[^>]*?>.*?</style\s*>@si', "@</?($tags)[^>]*>@i");
$description = preg_replace($regex, '', $description);

//  strip attributes
$regex = "@(<\w+.*?)\s+(?:$attribs)(?:=\"[^\"]*\"|='[^']*'|=[^>'\"\s]*)?(.*?/?>)@i";
while ( preg_match($regex, $description) )
{
     $description = preg_replace($regex, '$1$2', $description);
}

//  strip hidden javascript
$regex = '@(<\w+.*?)(?:\s+\w+=(?:[^>\'"\s]*javascript:[^>\'"\s]*|\'[^\']*javascript:[^\']*\'|"[^"]*javascript:[^"]*"))(.*?/?>)@i';
while ( preg_match($regex, $html) )
{
      $html = preg_replace($regex, '$1$2', $html);
}

?>
Whoops, change any occurences of $html to $description, that was just left over from my testing.
Hi... points increased.  So for your last solution we are moving away from stripping all tags but the defined ones to stripping only the tags defined?  Is there an easy way to through this into a function to allow either/or?

ASKER CERTIFIED SOLUTION
Avatar of soapergem
soapergem
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
So you still have to define what attributes to strip?  I thought maybe a regex would be used to find these between the < and > of a tag.



Yes, and yes. You still need to define which attributes to strip. And a regex *is* used to find them! The first set of regex's and the strip_tags() command will eliminate any tags that you do not want, leaving only what you allow. However, there still may be some bad attributes hiding in the good tags, and you need to handle these separately. The best way to handle this is to simply remove any "bad attributes" directly, as above. I really can't think of any reasonably efficient way to strip "all attributes except good ones"--really the only way you can do it is to just strip the bad ones specifically.
soapergem,

Ok then last question once everything is stripped from the $description should i be using addslashes or even better html_entities before adding to the database?

You may not need to use addslashes(), as most installations of PHP have magic_quotes_sybase turned on by default, which means that if this data is coming from either a GET form or a POST form, then it will be automatically escaped for you already. So you can just check if it's turned on, and if *not*, then use addslashes, like this:

    if ( !get_magic_quotes_gpc() )
    {
        $description = addslashes($description);
    }

More reading on that here:
http://www.php.net/addslashes

As far as htmlentities, if you're embedding HTML in this, then you probably *do not* want to convert < and > to &lt; and &gt; because this would mess up your output. So I'd recommend something like this:

    $description = preg_replace(array('/&(\s)/', '/"/'), array('&amp;$1', '&quot;'), $description);
    $description = preg_replace('/[^\x09\x0A\x0D\x20-\x7F]/e', '"&#".ord("$0").";"', $description);

That should handle & and ", and would also take care of any non-standard characters (such as é, ç, ö, etc.).



Right im wanting to allow only certain HTML to be used.  

I justed noticed this: http://directory.fsf.org/all/SafeHTML.html

Will this do the same thing?  I like the code you have given me.  I am just paranoid on security.  I upped the points again for you if you could perhaps compare this code to see if there is an overall solution.

I took a glance at that code; it does a lot of extra stuff, too, like closing unclosed <p> tags and whatnot. It probably handles the security stuff just as fine as mine, but it still has the overall method of grabbing predefined "bad" attributes; just set those properly and my code will function just the same. You might want to duplicate the hidden javascript: line I gave you to handle vbscript: as well.

On another topic--listen, with all due respect, this question is worth much more than 200 points at this point. This is about the equivalent of a 1,500-point question; unfortunately 500 is the max. (I think it's about time you close the question, and ask another one, separately, if you still need more.)
Sorry, ill give you my last 30 points also.  I guess im still a bit new here and did not realize point value :(  I apologize.

If this is what you mean... I hope is correct -- So I will add this then for vbscript.

//  strip hidden vbscript
$regex = '@(<\w+.*?)(?:\s+\w+=(?:[^>\'"\s]*vbscript:[^>\'"\s]*|\'[^\']*vbscript:[^\']*\'|"[^"]*vbscript:[^"]*"))(.*?/?>)@i';
while ( preg_match($regex, $description) )
{
     $description = preg_replace($regex, '$1$2', $description);
}

Exactly right.
Are these taking care of the single quotes also seen in htmlspecialchars()?

    $description = preg_replace(array('/&(\s)/', '/"/'), array('&amp;$1', '&quot;'), $description);
    $description = preg_replace('/[^\x09\x0A\x0D\x20-\x7F]/e', '"&#".ord("$0").";"', $description);

Also I guess you might not want to strip the double quotes because of attributes? such as width="100%"?
Is there anyway to use a regex to determine if it is a attribute " or not?