
PHP Function to Clean a String

Trying to make sense of the encoding functions in PHP, and failing.

What I need in my PHP script is this: given any string, convert it to a string containing only [A-Za-z0-9_]. BUT! If there are any diacriticals in the string, I don't want to lose the base character, so "é" should be retained as "e", "ç" as "c" etc.

Is there a standard function for this?
<?php
   $dirtyString = "á è_î ö u !!!";
   $cleanString = function_im_looking_for($dirtyString);
   echo $cleanString; // displays "ae_iou";
?>

Asked by: FlorisMK
 
tillgeffken commented:
Maybe iconv() is what you're looking for.

<?php
   $dirtyString = "á è_î ö u !!!";
   $cleanString = iconv("UTF-8", 'US-ASCII//TRANSLIT', $dirtyString);
   echo $cleanString; // intended: "ae_iou" (iconv leaves the spaces and "!" in place, so they still need filtering)
?>

 
gizmola commented:
Assuming you know the encoding of the original string, mb_convert_encoding() or iconv() may be what you're looking for.
 
FlorisMK (author) commented:
Hmmm... so there's no way around getting into this whole encoding business? In that case, is there a way to determine the encoding of the original string?

I'm guessing that determining the encoding depends on the source of the string. The source can be either an HTML form or a MySQL database.

And given that I know the source encoding, how do I then go about converting the string to the desired "clean" encoding?
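
On determining the source encoding, here is a minimal sketch. It assumes the mbstring extension is loaded and that the input can only be UTF-8 or ISO-8859-1; that candidate list is an assumption, so adjust it to your form and database settings. The helper name to_utf8() is made up for illustration. Telling single-byte encodings apart is always guesswork, so where possible it is safer to fix the encoding at the source (the form's accept-charset, the MySQL connection charset) than to detect it afterwards.

```php
<?php
// Hypothetical helper: normalise input to UTF-8 before any cleaning step.
function to_utf8(string $input): string
{
    // Valid UTF-8 is passed through untouched.
    if (mb_check_encoding($input, 'UTF-8')) {
        return $input;
    }
    // Assumption: anything that is not valid UTF-8 is ISO-8859-1.
    return mb_convert_encoding($input, 'UTF-8', 'ISO-8859-1');
}

$latin1 = "caf\xE9";      // "café" encoded as ISO-8859-1
$utf8   = "caf\xC3\xA9";  // "café" encoded as UTF-8

var_dump(to_utf8($latin1) === $utf8); // bool(true)
var_dump(to_utf8($utf8) === $utf8);   // bool(true)
```

Once everything is UTF-8, the conversion step can use a fixed source encoding instead of a detected one.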
 
tillgeffken commented:
Have a look at the code snippet I posted.
 
FlorisMK (author) commented:
Strike that last question; tillgeffken and the PHP help both provide clear instructions on the how.
 
drakeshe commented:
You can add as many other characters as you want to the array in this function:
<?php
// Replace accented letters (and any other listed characters) via a lookup table.
function conv($var){
 $chars = array(
  "é" => "e",
  "è" => "e",
  "ê" => "e",
  "à" => "a",
  "á" => "a",
  "â" => "a",
  "ì" => "i",
  "í" => "i",
  "î" => "i",
  "ò" => "o",
  "ó" => "o",
  "ô" => "o",
  "ö" => "o",
  "!" => "",
  " " => "");
  // array_keys() supplies the search strings; the array's values are the replacements.
  return str_replace(array_keys($chars),$chars,$var);
}
$dirty = "héllo some others í and ô";
echo conv($dirty); // displays "hellosomeothersiando"
?>
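
To get all the way to the [A-Za-z0-9_] output the question asks for, the lookup table above can be followed by a whitelist filter. A rough sketch, assuming both the script file and the input are UTF-8; the function name clean_string() and the (partial) table are made up for illustration:

```php
<?php
// Hypothetical helper combining a lookup table with a final whitelist filter.
function clean_string(string $dirty): string
{
    // Partial table for illustration; extend it with the letters you expect.
    $map = [
        'á' => 'a', 'à' => 'a', 'â' => 'a', 'ä' => 'a',
        'é' => 'e', 'è' => 'e', 'ê' => 'e', 'ë' => 'e',
        'í' => 'i', 'ì' => 'i', 'î' => 'i', 'ï' => 'i',
        'ó' => 'o', 'ò' => 'o', 'ô' => 'o', 'ö' => 'o',
        'ú' => 'u', 'ù' => 'u', 'û' => 'u', 'ü' => 'u',
        'ç' => 'c', 'ñ' => 'n',
    ];
    // strtr() with an array replaces each key in a single pass over the input.
    $ascii = strtr($dirty, $map);
    // Drop everything outside [A-Za-z0-9_], including spaces and punctuation.
    return preg_replace('/[^A-Za-z0-9_]/', '', $ascii);
}

echo clean_string("á è_î ö u !!!"), "\n"; // ae_iou
```

With the filter in place, the table only has to list letters you want to keep in unaccented form; anything not listed is simply removed.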

 
FlorisMK (author) commented:
Thanks everyone, for the help so far.

drakeshe, of course that would work, but that kind of brute force approach would be my last resort.

tillgeffken, gizmola, this must be the right approach. Unfortunately, mb_convert_encoding just throws out the accented characters entirely. And for some reason, the //TRANSLIT parameter does nothing in iconv. With or without it, iconv breaks my string at the first illegal character. The code below, with or without //TRANSLIT, results in:

Dirty string: aëiöu$%
Encoding: UTF-8
Clean string (iconv): a
Clean string (mb_convert_encoding): aiu$%
<?php
   $dirtyString = $_POST['dirtystring'];
   $encoding = mb_detect_encoding($dirtyString);
   $cleanString = iconv($encoding, 'US-ASCII//TRANSLIT', $dirtyString);
   $cleanString2 = mb_convert_encoding($dirtyString,'US-ASCII',$encoding);
?>
<html>
<head>
<title>Untitled</title>
</head>
<body>
<p>Dirty string: <?=$dirtyString;?><br>
Encoding: <?=$encoding;?><br>
Clean string (iconv): <?=$cleanString;?><br>
Clean string (mb_convert_encoding): <?=$cleanString2;?></p>
</body>
</html>

 
FlorisMK (author) commented:
PS: With both //TRANSLIT and //IGNORE, the iconv function yields the exact same result as the mb_convert_encoding function.
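
One commonly reported reason for //TRANSLIT appearing to do nothing is that glibc's iconv takes its transliteration rules from the LC_CTYPE locale, and under the default "C" locale accented letters are typically replaced by "?" rather than by their base letter. Other iconv implementations behave differently again, so this may or may not be what happened here. The sketch below sets a UTF-8 locale first and then filters the result. The locale name en_US.UTF-8 is an assumption (it has to be installed on the server), the helper name is made up, and because the transliterated output varies between platforms, no expected output is shown.

```php
<?php
// Hypothetical helper: transliterate with iconv, then whitelist-filter.
function iconv_clean(string $dirty, string $fromEncoding = 'UTF-8'): string
{
    // Assumption: this locale exists on the machine; setlocale() returns
    // false (and changes nothing) if it does not. Note that it affects
    // the whole process, not just this function.
    setlocale(LC_CTYPE, 'en_US.UTF-8');

    // iconv() returns false on failure; @ suppresses the accompanying notice.
    $ascii = @iconv($fromEncoding, 'US-ASCII//TRANSLIT', $dirty);
    if ($ascii === false) {
        $ascii = $dirty; // fall back to filtering the raw input
    }

    // Whatever iconv produced, keep only [A-Za-z0-9_].
    return preg_replace('/[^A-Za-z0-9_]/', '', $ascii);
}

echo iconv_clean('aëiöu$%'), "\n";
```

Either way the result contains only [A-Za-z0-9_]; whether the accented letters survive as base letters depends on the iconv build and the locale.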
 
drakeshe commented:
There is no built-in PHP function that does what you are asking. You should just do it the brute-force way and go through each character.

If you would rather just get rid of the unwanted characters instead of replacing them with their unaccented counterparts, you could use preg_replace() as below.

The code below will return "hwryu?".
Put any other characters you want to keep inside the [ and ]. So if you don't want to remove spaces, you can use:
[^0-9a-zA-Z? ]
If you don't want to keep question marks, just remove the question mark like so:
[^0-9a-zA-Z ]
Anyway, hope you find what you're looking for...
<?php
$dirtyString = "hôw árè yöu?";
$cleanstring = preg_replace ("/[^0-9a-zA-Z?]/", "", $dirtyString);
echo($cleanstring);
?>
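
If the intl extension is available, there is a middle road between deleting accented letters and maintaining a table by hand: decompose each letter into a base letter plus a combining mark, then strip the marks. This is a different technique from anything posted in this thread, sketched here on the assumption of UTF-8 input and a loaded intl extension; the function name is made up. Letters that have no decomposition (for example "ø" or "ß") are simply dropped by the last step.

```php
<?php
// Hypothetical helper; requires the intl extension (Normalizer class).
function strip_diacritics(string $dirty): string
{
    // NFD splits "é" into "e" followed by U+0301 (combining acute accent).
    $decomposed = Normalizer::normalize($dirty, Normalizer::FORM_D);
    if ($decomposed === false) {
        $decomposed = $dirty; // invalid UTF-8: fall through to the filter
    }
    // Remove combining marks (Unicode category Mn); /u enables UTF-8 mode.
    // preg_replace() returns null on invalid UTF-8, so keep the input then.
    $base = preg_replace('/\p{Mn}+/u', '', $decomposed) ?? $decomposed;
    // Finally keep only [A-Za-z0-9_].
    return preg_replace('/[^A-Za-z0-9_]/', '', $base);
}

echo strip_diacritics("hôw árè yöu?"), "\n"; // howareyou
```

Unlike the byte-level preg_replace() above, this keeps the "o", "a" and "e" of the accented letters, and it does not depend on the server's locale.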

 
FlorisMK (author) commented:
I'm going to hold off on your conclusion, drakeshe, until I get confirmation that the //TRANSLIT parameter does not do what I want. According to online sources and PHP documentation, iconv with //TRANSLIT does *exactly* what I want, except for the fact that it doesn't.
 
drakeshe commented:
Have you installed the translit extension from http://pecl.php.net/package/translit ?
 
FlorisMK (author) commented:
Nope. And I don't think I need it; target data will always be Latin with diacritics. The iconv //TRANSLIT parameter should be enough, if it works.
 
FlorisMK (author) commented:
tillgeffken, thanks for the advice. Your solution would have worked according to all sources, except for the fact that it doesn't. Hence only 62 points.

The other 63 (one more!) are for you, drakeshe, for forcing me to accept that we have to go with the brute force approach, and for pointing me to the trick with array_keys.