Solved

Help with cleaning string function

Posted on 2011-09-29
Last Modified: 2012-05-12
Hi

I have the attached function, which strips unwanted characters from some titles.

The problem is that Spanish/Italian/German characters like áéáèéë are being removed too.

How can I modify this function to keep those characters?
function betterTitle($str){
	$str = strip_tags($str);
	$str = preg_replace('/[^a-zA-Z0-9\s\']/','',$str);
	return $str;
}

Question by:Fernanditos
 
LVL 16

Expert Comment

by:sjklein42
ID: 36818768
Try this:

function betterTitle($str){
	$str = strip_tags($str);
	$str = preg_replace('/[^\w0-9\s\']/','',$str);
	return $str;
} 



\w    any "word" character
\W    any "non-word" character


A "word" character is any letter or digit or the underscore character, that is, any character which can be part of a Perl "word". The definition of letters and digits is controlled by PCRE's character tables, and may vary if locale-specific matching is taking place. For example, in the "fr" (French) locale, some character codes greater than 128 are used for accented letters, and these are matched by \w.

http://www.php.net/manual/en/regexp.reference.escape.php
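The locale caveat in that manual excerpt can be seen in a small sketch. It assumes the title is UTF-8 encoded and that the locale is the plain "C" locale; the thread states neither, so both are assumptions. Under those assumptions PCRE compares single bytes, and neither byte of "é" counts as a word character.

```php
<?php
// Assumption: the title is UTF-8, so "é" is the two bytes 0xC3 0xA9.
$e = "\xC3\xA9";

// Assumption: force the plain "C" locale so \w uses PCRE's built-in tables.
setlocale(LC_CTYPE, 'C');

// Without the /u modifier PCRE works byte by byte. Neither byte of "é"
// is a "word" byte in the "C" locale, so both bytes are stripped.
$bytewise = preg_replace('/[^\w\s\']/', '', "caf{$e}!");

echo bin2hex($e), PHP_EOL;   // the raw bytes of the character
echo $bytewise, PHP_EOL;     // the title after cleaning
```

In a single-byte locale such as an ISO-8859-1 French locale the result can differ, which is the behaviour the manual text describes.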
 

Author Comment

by:Fernanditos
ID: 36890143
Thank you sjklein42.

It does not work; it still removes characters like áéí.

Any idea?
<?php function betterTitle($str){
	$str = strip_tags($str);
	$str = preg_replace('/[^\w0-9\s\']/','',$str);
	return $str;
}

$string="estó éspada es un%··5%%%/()=? prueba";
echo betterTitle($string);
?>

 
LVL 16

Expert Comment

by:sjklein42
ID: 36890209
Try adding the /u switch.

<?php function betterTitle($str){
	$str = strip_tags($str);
	$str = preg_replace('/[^\w0-9\s\']/u','',$str);
	return $str;
}

$string="estó éspada es un%··5%%%/()=? prueba";
echo betterTitle($string);
?>


 
LVL 16

Expert Comment

by:sjklein42
ID: 36890231
Sorry, that did not work either.  It does not appear to be easy.  Other solutions I found were all brute-force, enumerating all the allowed accented characters.
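For readers whose titles are valid UTF-8, a possible alternative to enumerating characters is PCRE's Unicode property classes. The sketch below is not from the thread: `betterTitleUtf8` is a hypothetical name, and the UTF-8 input is an assumption. It also checks for the NULL that `preg_replace()` returns when the `/u` modifier meets bytes that are not valid UTF-8, which may be one reason a `/u` attempt appears to do nothing.

```php
<?php
// Hypothetical variant of betterTitle(); assumes $str is valid UTF-8.
function betterTitleUtf8($str)
{
    $str = strip_tags($str);

    // \p{L} = any Unicode letter, \p{N} = any Unicode number.
    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    $clean = preg_replace('/[^\p{L}\p{N}\s\']/u', '', $str);

    // With /u, preg_replace() returns NULL when the subject is not valid
    // UTF-8 (for example ISO-8859-1 bytes), so report that explicitly.
    if ($clean === null) {
        return false;
    }
    return $clean;
}

// "estó éspada es un%5%/()=? prueba", written with explicit UTF-8 bytes
$title = "est\xC3\xB3 \xC3\xA9spada es un%5%/()=? prueba";
echo betterTitleUtf8($title), PHP_EOL;

// A lone 0xE9 byte is "é" in ISO-8859-1 but is invalid as UTF-8
var_dump(betterTitleUtf8("\xE9spada"));
```

If the function returns false, the data is probably in a single-byte encoding and would need converting to UTF-8 first, or a byte-oriented pattern instead.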
 
LVL 109

Expert Comment

by:Ray Paseur
ID: 36891588
"The problem is that Spanish/Italian/German characters like áéáèéë are being removed too."

Of course they are removed - they are not part of your REGEX character class.  Try adding them to the class, something like this.  You will need to find all the characters you want to allow and put them into the regex string.  You might do that by adding more lines next to the one that holds the accented characters.

See it in action here:
http://www.laprbass.com/RAY_temp_fernanditos.php
<?php // RAY_temp_fernanditos.php
error_reporting(E_ALL);
echo "<pre>";

// SOME TEST DATA
$chars = 'the spanish/italian/genrman characters like áéáèéë.. are being removed too.';

// A REGULAR EXPRESSION TO SANITIZE THE TEST DATA
$regex
= '/'         // REGEX DELIMITER
. '['         // START OF CHARACTER CLASS
. '^'         // NEGATION - MATCH ANYTHING NOT HERE
. 'A-Z0-9'    // LETTERS AND NUMBERS
. '\s'        // WHITE SPACE
. "'"         // THE APOSTROPHE
. 'áéáèéë'    // SOME ACCENTED CHARACTERS
. ']'         // END CHARACTER CLASS
. '/'         // REGEX DELIMITER
. 'i'         // CASE-INSENSITIVE
;

// SHOW THE REGEX
echo PHP_EOL . $regex;

// SHOW THE WORK PRODUCT
$new = preg_replace($regex, '', $chars);
echo PHP_EOL . $chars;
echo PHP_EOL . $new;

 
LVL 109

Expert Comment

by:Ray Paseur
ID: 36891623
Read this over and see if it gives you any ideas.  I think the letters you may want to keep include those from #192 to #255.  I'll try to show you how I might generate a regex string to include those.  Back in a moment...
<?php // RAY_entitize_western_letters.php
error_reporting(E_ALL);


// DEMONSTRATE HOW TO TRANSLATE SOME WESTERN CHARACTERS INTO ENGLISH-PRINTABLE OR ENTITIES
// SEE http://www.joelonsoftware.com/articles/Unicode.html


// TEST CASES
$arr
= array
( 'Françoise'
, 'ßeta or Beta?'
, 'ENCYCLOPÆDIA'
, 'ça va! mon élève mi niña?'
, 'A stealthy ƒart'
, 'Jean "Ðango" Reinhardt of Pont-à-Celles'
)
;

// DISPLAY EACH TEST CASE
foreach ($arr as $str)
{
    echo PHP_EOL
    . '<br/>'
    . $str
    . ' = '
    . '<strong>'
    . mungstring($str)
    . '</strong>'
    ;
}


// EXAMPLE SHOWING HOW TO TURN A PORTUGUESE NAME INTO PART OF A URL STRING
$str = 'Armação de Pêra';
$new = mungString($str);
$new = strtolower($new);
$new = str_replace(' ', '-', $new);

// SHOW THE URL STRING
echo PHP_EOL
. '<br/>'
. '<strong>'
. '<a target="blank" href="http://lmgtfy.com?q='
. htmlentities(mungstring($new))
. '">'
. $str
. '</a>'
. '</strong>'
;


// EXAMPLE SHOWING HOW TO TURN A STRING INTO A NUMERICALLY ENTITIZED STRING
$str = 'Armação de Pêra';
$new = mungString($str, 'ENTITIES');
echo "<pre>";
echo PHP_EOL
. $new
. ' = '
. '<strong>'
. htmlentities($new)
. '</strong>'
;


// A FUNCTION TO RETURN THE WESTERNIZED/ENTITIZED STRING
function mungString($str, $return='TEXT')
{
    // OUR REPLACEMENT ARRAY OF ENTITIES
    static
    $entity
    = array();

    // OUR REPLACEMENT ARRAY OF CHARACTERS (YOU MAY WANT SOME CHANGES HERE)
    static
    $normal
    = array
    ( 'ƒ' => 'f'  // http://en.wikipedia.org/wiki/%C6%91 florin
    , 'Š' => 'S'  // http://en.wikipedia.org/wiki/%C5%A0 S-caron (voiceless postalveolar fricative)
    , 'š' => 's'  // http://en.wikipedia.org/wiki/%C5%A0 s-caron
    , 'Ð' => 'Dj' // http://en.wikipedia.org/wiki/Eth (voiced dental fricative)
    , 'Ž' => 'Z'  // http://en.wikipedia.org/wiki/%C5%BD Z-caron (voiced postalveolar fricative)
    , 'ž' => 'z'  // http://en.wikipedia.org/wiki/%C5%BD z-caron
    , 'À' => 'A'
    , 'Á' => 'A'
    , 'Â' => 'A'
    , 'Ã' => 'A'
    , 'Ä' => 'A'
    , 'Å' => 'A'
    , 'Æ' => 'E'
    , 'Ç' => 'C'
    , 'È' => 'E'
    , 'É' => 'E'
    , 'Ê' => 'E'
    , 'Ë' => 'E'
    , 'Ì' => 'I'
    , 'Í' => 'I'
    , 'Î' => 'I'
    , 'Ï' => 'I'
    , 'Ñ' => 'N'
    , 'Ò' => 'O'
    , 'Ó' => 'O'
    , 'Ô' => 'O'
    , 'Õ' => 'O'
    , 'Ö' => 'O'
    , 'Ø' => 'O'
    , 'Ù' => 'U'
    , 'Ú' => 'U'
    , 'Û' => 'U'
    , 'Ü' => 'U'
    , 'Ý' => 'Y'
    , 'Þ' => 'B'
    , 'ß' => 'Ss'
    , 'à' => 'a'
    , 'á' => 'a'
    , 'â' => 'a'
    , 'ã' => 'a'
    , 'ä' => 'a'
    , 'å' => 'a'
    , 'æ' => 'e'
    , 'ç' => 'c'
    , 'è' => 'e'
    , 'é' => 'e'
    , 'ê' => 'e'
    , 'ë' => 'e'
    , 'ì' => 'i'
    , 'í' => 'i'
    , 'î' => 'i'
    , 'ï' => 'i'
    , 'ð' => 'o'
    , 'ñ' => 'n'
    , 'ò' => 'o'
    , 'ó' => 'o'
    , 'ô' => 'o'
    , 'õ' => 'o'
    , 'ö' => 'o'
    , 'ø' => 'o'
    , 'ù' => 'u'
    , 'ú' => 'u'
    , 'û' => 'u'
    , 'ü' => 'u'
    , 'ý' => 'y'
    , 'þ' => 'b'
    , 'ÿ' => 'y'
    )
    ;
    // RETURN THE "TRANSLATED" TEXT
    if (substr(strtoupper($return),0,1) == 'T') return strtr($str, $normal);

    // RETURN THE "ENTITIZED" TEXT
    if (substr(strtoupper($return),0,1) == 'E')
    {
        if (empty($entity))
        {
            foreach ($normal as $key => $nothing)
            {
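                // NOTE: ord() READS ONLY THE FIRST BYTE, SO THIS ASSUMES A SINGLE-BYTE ENCODING SUCH AS ISO-8859-1
                $entity[$key] = '&#' . ord($key) . ';';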
            }
        }
        return strtr($str, $entity);
    }

    // MIGHT BE USEFUL TO GET THE LIST OF ORIGINAL LETTERS
    return array_keys($normal);
}

 
LVL 109

Accepted Solution

by:
Ray Paseur earned 500 total points
ID: 36891678
This seems to work fairly well.  It outputs:

/[^A-Z0-9\s'ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ]/i
Françoise = Françoise
ßeta or Beta? = ßeta or Beta
ENCYCLOPÆDIA = ENCYCLOPÆDIA
ça va! mon élève mi niña? = ça va mon élève mi niña
<?php // RAY_temp_fernanditos.php
error_reporting(E_ALL);
echo "<pre>";

// TEST CASES
$arr
= array
( 'Françoise'
, 'ßeta or Beta?'
, 'ENCYCLOPÆDIA'
, 'ça va! mon élève mi niña?'
)
;

// A REGULAR EXPRESSION TO SANITIZE THE TEST DATA
$rgx
= '/'         // REGEX DELIMITER
. '['         // START OF CHARACTER CLASS
. '^'         // NEGATION - MATCH ANYTHING NOT HERE
. 'A-Z0-9'    // LETTERS AND NUMBERS
. '\s'        // WHITE SPACE
. "'"         // THE APOSTROPHE
. 'XXX'       // A PLACE HOLDER
. ']'         // END CHARACTER CLASS
. '/'         // REGEX DELIMITER
. 'i'         // CASE-INSENSITIVE
;

// MODIFY THE REGEX TO ADD THE CHARACTERS AT #192-255 (SINGLE BYTES, I.E. À-ÿ IN ISO-8859-1)
$num = range(192, 255);
$chs = NULL;
foreach ($num as $ord)
{
    $chs .= chr($ord);
}
$rgx = str_replace('XXX', $chs, $rgx);

// SHOW THE REGEX
echo PHP_EOL . $rgx;

// SHOW THE WORK PRODUCT
// DISPLAY EACH TEST CASE
foreach ($arr as $str)
{
    echo PHP_EOL
    . $str
    . ' = '
    . '<strong>'
    . preg_replace($rgx, '', $str)
    . '</strong>'
    ;
}

 
LVL 109

Expert Comment

by:Ray Paseur
ID: 36891693
Here is the regex string, using a range of characters.  As you decide you need to keep more characters like the dash or question mark you can add them to this string.

Best of luck with your project, ~Ray
/[^A-Z0-9\s'À-ÿ]/i
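The range pattern above matches single bytes, so it fits ISO-8859-1 data. As a hedged sketch for UTF-8 data (an assumption, since the thread does not say which encoding the titles use), the same range can be written with code points and the `/u` modifier. The variable names are illustrative only.

```php
<?php
// Assumption: pattern and subject are both UTF-8, hence the /u modifier.
// \x{C0}-\x{FF} is the same À-ÿ range, written as code points so the
// pattern does not depend on the encoding of the source file.
$rgx = '/[^A-Z0-9\s\'\x{C0}-\x{FF}]/iu';

// The À-ÿ range also contains two non-letters, × (U+00D7) and ÷ (U+00F7).
// This stricter variant splits the range to leave them out.
$strict = '/[^A-Z0-9\s\'\x{C0}-\x{D6}\x{D8}-\x{F6}\x{F8}-\x{FF}]/iu';

// "ça va! mon élève mi niña?", written with explicit UTF-8 bytes
$in  = "\xC3\xA7a va! mon \xC3\xA9l\xC3\xA8ve mi ni\xC3\xB1a?";
$out = preg_replace($rgx, '', $in);
echo $out, PHP_EOL;

// "3 × 4": the multiplication sign survives $rgx but not $strict
$times = "3 \xC3\x97 4";
echo preg_replace($rgx, '', $times), PHP_EOL;
echo preg_replace($strict, '', $times), PHP_EOL;
```

Whether × and ÷ should be kept depends on the titles, so either pattern may be the better fit.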

