Fernanditos

Function to REMOVE any special character except LETTERS and NUMBERS.

Hi,

I have a search form and I save all search terms.

I need a function CleanTerm() to clean my string and definitively REMOVE all non-alphanumeric and special characters, with no exceptions, as well as some predefined words.

For example:

$badwords = "http,www,cache,death";

$str = "http://www.this-is-a-test/ácido )  .    --´    (?¿¿!><\"''''_:::`+`+[]¨{mañana";

Result should be:

echo CleanTerm($str);

Result: "This is a test ácido mañana"

So any character other than letters and numbers should be removed.

Please help me out with a good function to handle this.
ienaxxx

<?php

function Evil2Good($string){
 $badchars="[^\w\d]*";
 $replaceWith="";
 return preg_replace($badchars,$replaceWith,$string);
}

?>


It reads:

What are the bad chars? Everything that is NOT a word character (\w) or a digit (\d).
Replace with what? An empty string ("").
GO


You can add any additional characters you want to ALLOW to the negated class to modify the behaviour.

Marco Gasi
I had to use 3 steps:

$subject = "http://www.this-is-a-test/ácido )  .    --´    (?¿¿!><\"''''_:::`+`+[]¨{mañana";

$result1 = preg_replace('/\W+|_/', ' ', $subject); //this gives "http www this is a test ácido   mañana"

$result2 = preg_replace('/\s{2,}/', ' ', $result1); //this gives "http www this is a test ácido mañana"

$result3 = preg_replace('/http|www|cache|death/', '', $result2); //this gives "  this is a test ácido mañana"

As you can see, the last result has 2 spaces before the first word, so you can add

$result4 = trim($result3);

Cheers
Oh, sorry...
I didn't notice the www part.

Well, you can add additional behaviours using arrays like this:
<?php

function Evil2Good($string){
 $badchars[]="[^\w\d]*";
$badchars[]="cache";
$badchars[]="www";
$badchars[]="http";
$badchars[]="death";
$badchars[]="[\s]{2,}";
 $replaceWith=array_fill(0, 5, " ");
 $replaceWith[]=" ";
 
 return preg_replace($badchars,$replaceWith,$string);
}

?>



It reads:
Replace everything I don't like with a space, then replace multiple spaces with a single one.
No, sorry...
Here it is:
<?php

function Evil2Good($string){
 $badchars[]="[^\w\d]*";
$badchars[]="cache";
$badchars[]="www";
$badchars[]="http";
$badchars[]="death";
 $replaceWith=array_fill(0, 5, " ");
$res1 = preg_replace($badchars,$replaceWith,$string);
$multispace="[\s]{2,}";
 $replaceWith=" ";

 return preg_replace($multispace,$replaceWith,$string);
}

?>


Fernanditos

ASKER

@Lena, I get this error:
Warning: preg_replace() [function.preg-replace]: Unknown modifier '*'
Warning: preg_replace() [function.preg-replace]: Unknown modifier '{'
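Both warnings most likely come from the missing pattern delimiters. PHP's preg_* functions expect the pattern to be wrapped in delimiters such as /.../. Without them, the leading [ is taken as the opening delimiter and ] as the closing one, so the * or { that follows is read as a pattern modifier. The final call in the last version above also reads from $string instead of $res1, so the earlier replacements are discarded. A minimal sketch of the same idea with both points addressed (Evil2GoodFixed is a hypothetical name, and it does not yet deal with the accented letters):

```php
<?php
// Sketch only: Evil2GoodFixed is a hypothetical name, not code posted in the thread.
function Evil2GoodFixed(string $string): string
{
    // Every pattern is wrapped in "/" delimiters; patterns are applied in order.
    $patterns = [
        '/[^\w\d]+/',             // runs of anything that is not a word character or digit
        '/cache|www|http|death/', // the unwanted words (matched anywhere, even inside longer words)
    ];
    $result = preg_replace($patterns, ' ', $string);

    // Collapse the whitespace left behind, working on $result rather than the raw input.
    $result = preg_replace('/\s{2,}/', ' ', $result);

    return trim($result);
}

echo Evil2GoodFixed('http://www.this-is-a-test'), "\n"; // this is a test
```

Note that \w keeps the underscore, just as the original pattern would have.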

@marqusG: Your solution is close, but it is removing the alphabetical characters ñ and á.
Yes, I saw that when testing the code, and I thought it was something to do with character encoding: do you use UTF-8?
I have to go now, but I'll work on it later. For now, here is my solution wrapped in a function; I'll try to solve the remaining problems later:

function CleanString($str){
    $str = utf8_decode($str);
    $str = preg_replace('/\W+|_/', ' ', $str);
    $str = preg_replace('/\s{2,}/', ' ', $str);
    $str = preg_replace('/http|www|cache|death/', '', $str); //this gives "  this is a test ácido mañana"
     return trim($str);
}

Cheers
Yes, I use UTF-8. Thank you for wrapping it in a function as I requested. Now we only need to allow letters like áÑ, etc.

Thank you
IIRC, you can use letters like áÑ in the regex string - just put them in there.  But beware of UTF-8 collisions.
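
One way to read "just put them in there" is to list the accented letters inside a negated character class. The sketch below is an illustration, not code from the thread: keepListedLetters is a hypothetical name, the letter list is only a Spanish sample, and it assumes both this source file and the input are UTF-8. The /u modifier is what makes PCRE treat each accented letter as one character instead of two separate bytes.

```php
<?php
// Assumption: this file and the input string are both saved/encoded as UTF-8.
// keepListedLetters is a hypothetical helper name used only for this sketch.
function keepListedLetters(string $str): string
{
    // Whitelist: ASCII letters, digits, and a hand-picked set of accented letters.
    // Every run of characters outside the list becomes a single space.
    $cleaned = preg_replace('/[^a-zA-Z0-9áéíóúüñÁÉÍÓÚÜÑ]+/u', ' ', $str);

    // preg_replace() returns null if the input is not valid UTF-8.
    return trim($cleaned ?? '');
}

echo keepListedLetters('ácido ) . -- (?¿!>< mañana'), "\n"; // ácido mañana
```

The drawback of an explicit list is that any letter you forget (ç, ß, ö, ...) is silently removed.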

This, on the other hand, is a fool's errand.
$badwords = "http,www,cache,death";

Let me try to explain why this strategy does not work.  Let's say you do not want some jerk to post "penis enlargement" ads in your web page.  So you look for the words and exclude them.  But the obvious variants on the theme are too many for you to stop: pen1s, p3nis, member, manhood, johnson, the list goes on and on.  You can never get rid of them all.  The design strategy that most sites use today is something with a button that says, "report this post."  When it is clicked, the report button logs the id of the post in a database table, where it can be checked manually.

So you can get rid of http,www,cache,death but you probably cannot think of everything that you will ever want to get rid of.  This is not like blocking an IP address.  There are just too many ways for people to be unpleasant.  While most people will not post noxious stuff into your pages, some will, and certainly the 'bots will do this in abundance.  A good line of defense goes like this...

1. Use CAPTCHA
2. Log the IP address with every post
3. Implement the "report this post" strategy
4. Review the report logs regularly.
5. Ban each IP address that causes trouble

Best of luck with it, ~Ray
Ray, thank you for your reply. Yes, I understand what you mean about the $badwords errand. It is not intended to exclude spam words; I just need to exclude a few words that I really don't want inserted into the database, for particular reasons.

At this point, @Marqus's solution is close, but it is removing á, Ñ and any other accented alphabetical characters used in Spanish, German and Italian.

Could you please provide your function too? I really do not understand what you mean by "you can use letters like áÑ in the regex string".

Thank you for your help!
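
The accepted answer that follows is not visible in this copy of the thread, so here is an independent sketch of what the question asks for, not Ray's code. It assumes the input is valid UTF-8 and uses PCRE's Unicode properties (\p{L} for letters, \p{N} for numbers) with the /u modifier, so letters from Spanish, German, Italian and other scripts survive without being listed one by one. Treating the bad words as whole words, case-insensitively, is a design choice made here, not something the thread specifies.

```php
<?php
// Hypothetical sketch, not the accepted answer. Assumes $str is valid UTF-8.
function CleanTerm(string $str, array $badwords = ['http', 'www', 'cache', 'death']): string
{
    // 1. Replace every run of characters that are neither letters nor digits with a space.
    $str = preg_replace('/[^\p{L}\p{N}]+/u', ' ', $str) ?? '';

    // 2. Remove the bad words, but only when they stand alone as whole words.
    $alternatives = implode('|', array_map(fn($w) => preg_quote($w, '/'), $badwords));
    if ($alternatives !== '') {
        $pattern = '/(?<![\p{L}\p{N}])(?:' . $alternatives . ')(?![\p{L}\p{N}])/iu';
        $str = preg_replace($pattern, ' ', $str) ?? '';
    }

    // 3. Collapse the whitespace left behind and trim both ends.
    $str = preg_replace('/\s+/u', ' ', $str) ?? '';

    return trim($str);
}

echo CleanTerm('http://www.this-is-a-test/ácido ) . --´ (?¿¿!><_:::`+`+[]¨{mañana'), "\n";
// this is a test ácido mañana
```

The result starts with a lowercase "this" because nothing in the function changes case; wrap the return value yourself if a capital letter is wanted.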
ASKER CERTIFIED SOLUTION
Ray Paseur
This solution is only available to members.
I get this output: This is a test à cido     maà ana

Why does it work on your server and not on mine? I tried on my localhost and on my live server, and I always get "This is a test à cido     maà ana".

Any idea why?
Possibly a UTF-8 collision of some sort.  PHP and UTF-8 do not work very well together, yet.  And you would not want to be using Unicode for western letters anyway.  See the first sentence on this page.
http://php.net/manual/en/language.types.string.php
See also the note here:
http://php.net/manual/en/intro.strings.php

This article might be interesting to you.
http://www.joelonsoftware.com/articles/Unicode.html
Try this, Fernanditos:

function CleanString($str){
    $str = utf8_decode($str);
    $str = preg_replace('/\W+|_|á|Ñ/', ' ', $str);
    $str = preg_replace('/\s{2,}/', ' ', $str);
    $str = preg_replace('/http|www|cache|death/', '', $str); //this gives "  this is a test ácido mañana"
     return trim($str);
}

I added letters á and Ñ to regex in the first preg_replace. See if it works...
Thank you Ray, I will read the articles. I thought this wouldn't be so complicated. :(

I tried your last solution but still not working, still getting: "this is a test cido ma ana"

@Ray Why a UTF-8 collision? Why does your file work and mine does not? Do you have any idea how to correct this?
You'll have a better feel for it when you read Joel's article.  You might also find this helpful:
http://www.unicode.org/
http://www.utf-8.com/
http://en.wikipedia.org/wiki/UTF-8
http://rfc-ref.org/RFC-TEXTS/2279/index.html

There are over a hundred different encodings of the character sets above #127.  And that is where all the accented characters reside.

Sorry, but if you want  to use UTF-8 you gotta know this stuff.  I don't use UTF-8, so I avoid the problems.
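
To make the point about characters above #127 concrete, the sketch below shows the same letter, ñ, in two encodings. It is an illustration added here, not code from the thread, and the bytes are written as escapes so the result does not depend on how the file is saved. In UTF-8 the letter is two bytes; without the /u modifier PCRE walks the string byte by byte, which is a plausible reason why \W split "mañana" into "ma ana".

```php
<?php
// The letter "ñ" as explicit bytes in two different encodings.
$utf8   = "\xC3\xB1"; // UTF-8: two bytes
$latin1 = "\xF1";     // ISO-8859-1: one byte

$utf8Bytes   = strlen($utf8);   // 2, strlen() counts bytes, not characters
$latin1Bytes = strlen($latin1); // 1

// Without /u, "." matches exactly one byte, so a two-byte letter is not "one character".
$byteMode = preg_match('/^.$/', $utf8);       // 0

// With /u, the subject is decoded as UTF-8 and "." matches one code point.
$utf8Mode = preg_match('/^.$/u', $utf8);      // 1
$isLetter = preg_match('/^\p{L}$/u', $utf8);  // 1

// With /u, a lone ISO-8859-1 byte is not valid UTF-8, so the match fails outright.
$invalid = preg_match('/^.$/u', $latin1);     // false

var_dump($utf8Bytes, $latin1Bytes, $byteMode, $utf8Mode, $isLetter, $invalid);
```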
SOLUTION
This solution is only available to members.
Thank you marqus! I will give up on this, since I can't change to ISO-8859-1 right now.

I will simply change the idea:

I will ask for a less strict function: one that removes the following characters from the string:
":/?¿\&%"!()=¡*.<>; and remove $badwords too.

That will solve my problem.

I thank Marqus and Ray for the great support, and I will split the points since both solutions were good and I learned something new.

Thank you.
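
The less strict variant described above was not answered in the thread. A minimal sketch of it could look like the following, assuming UTF-8 input; CleanTermLoose is a hypothetical name, and the character list is copied from the post. Because only the listed characters are removed, accented letters are never touched, although /u is still needed so that ¿ and ¡ are handled as single characters.

```php
<?php
// Hypothetical sketch of the "less strict" request. Assumes $str is valid UTF-8.
function CleanTermLoose(string $str, array $badwords = ['http', 'www', 'cache', 'death']): string
{
    // The characters listed in the post; preg_quote() escapes the ones PCRE treats specially.
    $chars = '":/?¿\\&%!()=¡*.<>;';
    $class = preg_quote($chars, '/');

    // 1. Replace runs of the listed characters with a space.
    $str = preg_replace('/[' . $class . ']+/u', ' ', $str) ?? '';

    // 2. Remove the bad words wherever they occur, ignoring case.
    $str = str_ireplace($badwords, ' ', $str);

    // 3. Collapse whitespace and trim.
    $str = preg_replace('/\s+/u', ' ', $str) ?? '';

    return trim($str);
}

echo CleanTermLoose('http://www.this-is-a-test/ácido (?¿!) mañana'), "\n";
// this-is-a-test ácido mañana
```

Hyphens stay in the output because they are not in the list of characters to remove.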