Solved

Function to REMOVE any special character except LETTERS and NUMBERS.

Posted on 2011-10-26
Last Modified: 2012-06-27
Hi,

I have a search form and I save all search terms.

I need a function CleanTerm() that cleans my string and REMOVES every character that is not a letter or a number, with no exceptions, as well as some predefined words.

For example:

$badwords = "http,www,cache,death";

$str = "http://www.this-is-a-test/ácido )  .    --´    (?¿¿!><\"''''_:::`+`+[]¨{mañana";

Result should be:

echo CleanTerm($str);

Result: "This is a test ácido mañana"

So every special character should be removed; only letters and numbers remain.

Please help me out with a good function to handle this.
Question by:Fernanditos
18 Comments
 
LVL 10

Expert Comment

by:ienaxxx
ID: 37032063
<?php

function Evil2Good($string){
 $badchars="[^\w\d]*";
 $replaceWith="";
 return preg_replace($badchars,$replaceWith,$string);
}

?>


It reads:

What are the bad chars? Everything that is neither a word character (\w) nor a digit (\d).
Replace with what? An empty string ("").
GO


You can add any additional character you want to ALLOW to the negated class to modify the behaviour.

 
LVL 31

Expert Comment

by:Marco Gasi
ID: 37032117
I had to use 3 steps:

$subject = "http://www.this-is-a-test/ácido )  .    --´    (?¿¿!><\"''''_:::`+`+[]¨{mañana";

$result1 = preg_replace('/\W+|_/', ' ', $subject); //this gives "http www this is a test ácido   mañana"

$result2 = preg_replace('/\s{2,}/', ' ', $result1); //this gives "http www this is a test ácido mañana"

$result3 = preg_replace('/http|www|cache|death/', '', $result2); //this gives "  this is a test ácido mañana"

As you can see, the last result has two spaces before the first word, so you can add

$result4 = trim($result3);

Cheers
 
LVL 10

Expert Comment

by:ienaxxx
ID: 37032127
Oh, sorry...
I didn't notice the www part.

Well, you can add additional behaviours using arrays, like this:
<?php

function Evil2Good($string){
 $badchars[]="[^\w\d]*";
$badchars[]="cache";
$badchars[]="www";
$badchars[]="http";
$badchars[]="death";
$badchars[]="[\s]{2,}";
 $replaceWith=array_fill(0, 5, " ");
 $replaceWith[]=" ";
 
 return preg_replace($badchars,$replaceWith,$string);
}

?>


It reads:
Replace everything I don't like with a space, then replace multiple spaces with a single one.
 
LVL 10

Expert Comment

by:ienaxxx
ID: 37032163
No, sorry...
Here it is:
<?php

function Evil2Good($string){
 $badchars[]="[^\w\d]*";
$badchars[]="cache";
$badchars[]="www";
$badchars[]="http";
$badchars[]="death";
 $replaceWith=array_fill(0, 5, " ");
$res1 = preg_replace($badchars,$replaceWith,$string);
$multispace="[\s]{2,}";
 $replaceWith=" ";

 return preg_replace($multispace,$replaceWith,$string);
}

?>

 

Author Comment

by:Fernanditos
ID: 37032356
@ienaxxx, I get these errors:
Warning: preg_replace() [function.preg-replace]: Unknown modifier '*'
Warning: preg_replace() [function.preg-replace]: Unknown modifier '{'

@marqusG: Your solution is close, but it removes the alphabetical characters ñ and á.
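
Both warnings most likely come from missing pattern delimiters. PHP's preg functions treat the first character of a pattern as its delimiter, so in "[^\w\d]*" the brackets are read as the delimiter pair and the trailing * as an unknown modifier; the same happens with the { in "[\s]{2,}". Below is a minimal sketch of the array approach with delimiters added. The name stripBadParts is hypothetical, and the first pattern is deliberately ASCII-only, so it still strips accented letters, which is a separate encoding issue.

```php
<?php
// Hypothetical helper: same idea as Evil2Good above, but every pattern
// is wrapped in delimiters (here "/"), which preg_replace() requires.
function stripBadParts($string)
{
    $patterns = array(
        '/[^A-Za-z0-9]+/',          // anything that is not an ASCII letter or digit
        '/http|www|cache|death/i',  // the predefined words
        '/\s{2,}/',                 // runs of whitespace left behind
    );
    // Patterns are applied in order, each to the result of the previous one.
    return trim(preg_replace($patterns, ' ', $string));
}

echo stripBadParts('http://www.this-is-a-test/');  // this is a test
```

Passing a single replacement string with an array of patterns applies that string to every pattern, so array_fill() is not needed here.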
 
LVL 31

Expert Comment

by:Marco Gasi
ID: 37032367
Yes, I saw that while testing the code, and I think it has to do with character encoding: do you use UTF-8?
 
LVL 31

Expert Comment

by:Marco Gasi
ID: 37032392
I have to go now, but I'll work on it later. For now I'm posting my solution wrapped in a function; later I'll try to solve the remaining problems:

function CleanString($str){
    $str = utf8_decode($str);
    $str = preg_replace('/\W+|_/', ' ', $str);
    $str = preg_replace('/\s{2,}/', ' ', $str);
    $str = preg_replace('/http|www|cache|death/', '', $str); //this gives "  this is a test ácido mañana"
     return trim($str);
}

Cheers
 

Author Comment

by:Fernanditos
ID: 37032580
Yes, I use UTF-8. Thank you for wrapping it in a function as I requested. Now we only need to allow letters like á, Ñ, etc.
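
Since the input is UTF-8, one option is to keep it as UTF-8 and let PCRE work on characters instead of bytes. The following is only a sketch under stated assumptions: the name cleanTermUtf8 is hypothetical, the input is assumed to be valid UTF-8, and the bad words are removed as whole words, which is a design choice rather than something the thread specifies.

```php
<?php
// Sketch only: cleanTermUtf8 is a hypothetical name, and the input is
// assumed to be valid UTF-8.
function cleanTermUtf8($str, array $badwords = array('http', 'www', 'cache', 'death'))
{
    // With the "u" modifier, pattern and subject are treated as UTF-8;
    // \p{L} matches any Unicode letter and \p{N} any Unicode number.
    $str = preg_replace('/[^\p{L}\p{N}]+/u', ' ', $str);
    if ($str === null) {
        return '';  // preg_replace() returns null when the input is not valid UTF-8
    }

    // Remove the predefined words as whole words, case-insensitively.
    $quoted = array_map(function ($w) { return preg_quote($w, '/'); }, $badwords);
    $str = preg_replace('/\b(?:' . implode('|', $quoted) . ')\b/iu', ' ', $str);

    // Collapse the whitespace left behind.
    return trim(preg_replace('/\s+/u', ' ', $str));
}

echo cleanTermUtf8('http://www.this-is-a-test/ácido ) . (?¿¿!_[]{mañana');
// this is a test ácido mañana
```

The sketch does not capitalize the first letter; wrapping the return value in ucfirst() would do that for an ASCII first character.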

Thank you
 
LVL 111

Expert Comment

by:Ray Paseur
ID: 37032696
IIRC, you can use letters like áÑ in the regex string - just put them in there.  But beware of UTF-8 collisions.
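
To illustrate what putting the letters into the regex string could look like, here is a hedged sketch. It assumes that both the script file and the input are UTF-8 and uses the "u" modifier so that á or Ñ count as one character each; without it, the class would be built from the individual bytes of those letters. The function name keepAllowed and the list of accented letters are illustrative only.

```php
<?php
// Assumption: this file and the input string are both UTF-8 encoded.
// keepAllowed is a hypothetical name used for illustration.
function keepAllowed($str)
{
    // Illustrative allow-list; extend it with whatever letters are needed.
    $allowed = 'a-z0-9áéíóúüñ';

    // "u": treat pattern and subject as UTF-8 characters, not bytes.
    // "i": match upper and lower case, including Ñ for ñ.
    // The cast guards against null, which is returned for invalid UTF-8.
    return trim((string) preg_replace('/[^' . $allowed . ']+/iu', ' ', $str));
}

echo keepAllowed('MAÑANA, ¿ácido?');  // MAÑANA ácido
```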

This, on the other hand, is a fool's errand.
$badwords = "http,www,cache,death"

Let me try to explain why this strategy does not work.  Let's say you do not want some jerk to post "penis enlargement" ads in your web page.  So you look for the words and exclude them.  But the obvious variants on the theme are too many for you to stop: pen1s, p3nis, member, manhood, johnson, the list goes on and on.  You can never get rid of them all.  The design strategy that most sites use today is something with a button that says, "report this post."  When it is fired, the report button logs the id of the post in a database table, where it can be checked manually.

So you can get rid of http,www,cache,death but you probably cannot think of everything that you will ever want to get rid of.  This is not like blocking an IP address.  There are just too many ways for people to be unpleasant.  While most people will not post noxious stuff into your pages, some will, and certainly the 'bots will do this in abundance.  A good line of defense goes like this...

1. Use CAPTCHA
2. Log the IP address with every post
3. Implement the "report this post" strategy
4. Review the report logs regularly.
5. Ban each IP address that causes trouble

Best of luck with it, ~Ray
 

Author Comment

by:Fernanditos
ID: 37032798
Ray, thank you for your reply. Yes, I understand what you mean about the $badwords errand. It is not intended to exclude spam words; I just need to exclude a few words that I really don't want inserted into the database, for particular reasons.

At this point, @Marqus's solution is close, but it removes á, Ñ and the other accented letters used in Spanish, German and Italian.

Could you please provide your function too? I really do not understand what you mean by "you can use letters like áÑ in the regex string".

Thank you for your help!
 
LVL 111

Accepted Solution

by:
Ray Paseur earned 1000 total points
ID: 37032877
http://www.laprbass.com/RAY_temp_fernanditos.php
Outputs: SUCCESS
<?php // RAY_temp_fernanditos.php
error_reporting(E_ALL);

// THE TEST DATA
$str = "http://www.this-is-a-test/ácido )  .    --´    (?¿¿!><\"''''_:::`+`+[]¨{mañana";

// THE DESIRED RESULT
$out = 'This is a test ácido mañana';

// MAKE A TEST
$new = cleanTerm($str);

if ($new == $out) echo "SUCCESS";

// THE FUNCTION
function cleanterm($str)
{
    // CREATE THE REGEX ON THE FIRST ENTRY TO THE FUNCTION
    static $regexp;
    if (empty($regexp))
    {
        $english = range('A', 'Z');
        $numbers = range('0', '9');
        $accents = range(chr(192), chr(255));
        $merged  = array_merge($english, $numbers, $accents);
        $regexp  = '#[^' . implode('', $merged) . ']#i';
    }

    // GET RID OF BAD CHARACTERS
    $new = preg_replace($regexp, ' ', $str);

    // GET RID OF THE BAD WORDS (A FOOLS ERRAND BUT WE WILL SHOW IT ANYWAY)
    $badwords = "http,www,cache,death";
    $words    = explode(',', $badwords);
    foreach ($words as $w)
    {
        $new = preg_replace('#' . preg_quote($w, '#') . '#i', '', $new);
    }

    // GET RID OF EXCESS WHITESPACE
    $new = trim(preg_replace('/\s+/', ' ', $new));

    // CAPITALIZE THE FIRST LETTER
    return ucfirst($new);
}

 

Author Comment

by:Fernanditos
ID: 37032995
I get this output: This is a test à cido     maà ana

Why does it work on your server and not on mine? I tried on my localhost and on my live server, and I always get "This is a test à cido     maà ana".

Any idea why?
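
A likely explanation is the byte layout. The accepted function allows the single bytes 192 to 255, which are the accented letters in ISO-8859-1. In UTF-8, á is the two bytes 0xC3 0xA1: the first falls inside the allowed range and survives, the second does not and becomes a space. The sketch below writes the bytes explicitly so the result does not depend on how the file is saved; the variable names are only for illustration.

```php
<?php
// "ácido" spelled out as bytes in both encodings.
$utf8   = "\xC3\xA1cido";  // UTF-8: á takes two bytes
$latin1 = "\xE1cido";      // ISO-8859-1: á takes one byte

// Same kind of byte-oriented class as in the accepted solution:
// A-Z, 0-9 and the single bytes 192-255 (0xC0-0xFF), no "u" modifier.
$regexp = '#[^A-Z0-9\xC0-\xFF]#i';

echo bin2hex(preg_replace($regexp, ' ', $utf8)), "\n";    // c3206369646f
echo bin2hex(preg_replace($regexp, ' ', $latin1)), "\n";  // e16369646f
```

In the first result the byte 0xA1 has turned into 0x20 (a space) while 0xC3 is left over, which a browser shows as a stray accented character followed by a gap. The second result is unchanged, which would fit a test file saved as ISO-8859-1.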
 
LVL 111

Expert Comment

by:Ray Paseur
ID: 37033098
Possibly a UTF-8 collision of some sort.  PHP and UTF-8 do not work very well together, yet.  And you would not want to be using Unicode for western letters anyway.  See the first sentence on this page.
http://php.net/manual/en/language.types.string.php
See also the note here:
http://php.net/manual/en/intro.strings.php

This article might be interesting to you.
http://www.joelonsoftware.com/articles/Unicode.html
 
LVL 31

Expert Comment

by:Marco Gasi
ID: 37033135
Try this, Fernanditos:

function CleanString($str){
    $str = utf8_decode($str);
    $str = preg_replace('/\W+|_|á|Ñ/', ' ', $str);
    $str = preg_replace('/\s{2,}/', ' ', $str);
    $str = preg_replace('/http|www|cache|death/', '', $str); //this gives "  this is a test ácido mañana"
     return trim($str);
}

I added letters á and Ñ to regex in the first preg_replace. See if it works...
 

Author Comment

by:Fernanditos
ID: 37033202
Thank you Ray, I will read the articles. I thought this wouldn't be so complicated. :(

I tried your last solution but it is still not working; I still get: "this is a test cido ma ana"

@Ray Why a UTF-8 collision? Why does your file work and mine not? Do you have any idea how to correct this?
 
LVL 111

Expert Comment

by:Ray Paseur
ID: 37033640
You'll have a better feel for it when you read Joel's article.  You might also find this helpful:
http://www.unicode.org/
http://www.utf-8.com/
http://en.wikipedia.org/wiki/UTF-8
http://rfc-ref.org/RFC-TEXTS/2279/index.html

There are over a hundred different encodings of the character sets above #127.  And that is where all the accented characters reside.

Sorry, but if you want  to use UTF-8 you gotta know this stuff.  I don't use UTF-8, so I avoid the problems.
 
LVL 31

Assisted Solution

by:Marco Gasi
Marco Gasi earned 1000 total points
ID: 37033783
I'm not Ray, Fernanditos, I'm Marco.
I got the right output, but to do so I had to change the default charset of the HTML page to ISO-8859-1. Having done this, I always have to use the utf8_decode() function to get the correct output. Here is the whole page I used for testing this code:

<?php
error_reporting(E_ALL);
ini_set('display_errors', 'On');
function CleanString($str){
    $str = utf8_decode($str);
    $str = preg_replace('/\W+|_|á|ñ/', ' ', $str);
    $str = preg_replace('/\s{2,}/', ' ', $str);
    $str = preg_replace('/http|www|cache|death/', '', $str); //this gives "  this is a test ácido mañana"
     return trim($str);
}
?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>AVM - Amministrazione Vendite e Magazzino</title>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />
<meta name="description" content="" />
<meta name="keywords" content="" />
</head>
<body>
<div>
<?php
$str= "http://www.this-is-a-test/ácido )  .    --´    (?¿¿!><\"''''_:::`+`+[]¨{mañana";
echo CleanString($str);
echo "<br>baño"; //wrong output
echo "<br>" . utf8_decode('baño'); //right output
?>
</div>
</body>
</html>

Cheers
 

Author Comment

by:Fernanditos
ID: 37034249
Thank you marqus! I will give up on this, since I can't change to ISO-8859-1 right now.

I will simply change the idea:

I will ask for a less strict function, one that removes the following characters from the string:
":/?¿\&%"!()=¡*.<>; and removes $badwords too.

That will solve my problem.

I thank Marqus and Ray for the great support, and I will split the points since both solutions were good and I learned something new.

Thank you.
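
For the less strict variant described above, a deny-list can be handled without any regex on the special characters. This is a minimal sketch, assuming the file and the input are UTF-8; the name removeListed is hypothetical, and the character list is copied from the request. str_replace() compares byte sequences, and in valid UTF-8 a complete character never matches part of another one, so the two-byte characters ¿ and ¡ are removed safely.

```php
<?php
// Hypothetical helper sketching the less strict variant requested above.
function removeListed($str, array $badwords = array('http', 'www', 'cache', 'death'))
{
    // Characters to drop, taken from the request.
    $chars = array(':', '/', '?', '¿', '\\', '&', '%', '"', '!',
                   '(', ')', '=', '¡', '*', '.', '<', '>', ';');
    $str = str_replace($chars, ' ', $str);

    // Drop the predefined words (case-insensitive, anywhere in the string).
    $str = str_ireplace($badwords, ' ', $str);

    // Collapse the whitespace left behind.
    return trim(preg_replace('/[ \t\r\n]+/', ' ', $str));
}

echo removeListed('http://www.mañana.com/?q=¡hola!');  // mañana com q hola
```

Note that this removes the bad words even inside longer words, which may or may not be wanted.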