  • Status: Solved

Help with cleaning string function

Hi

I have the attached function to strip unwanted characters from titles.

The problem is that Spanish/Italian/German characters like áéáèéë are being removed too.

How can I modify this function to keep those characters?
function betterTitle($str){
	$str = strip_tags($str);
	$str = preg_replace('/[^a-zA-Z0-9\s\']/','',$str);
	return $str;
}

Asked by: Fernanditos

sjklein42 Commented:
Try this:

function betterTitle($str){
	$str = strip_tags($str);
	$str = preg_replace('/[^\w0-9\s\']/','',$str);
	return $str;
} 


\w    any "word" character
\W    any "non-word" character


A "word" character is any letter or digit or the underscore character, that is, any character which can be part of a Perl "word". The definition of letters and digits is controlled by PCRE's character tables, and may vary if locale-specific matching is taking place. For example, in the "fr" (French) locale, some character codes greater than 128 are used for accented letters, and these are matched by \w.

http://www.php.net/manual/en/regexp.reference.escape.php

Fernanditos (Author) Commented:
Thank you sjklein42.

It does not work; it still removes characters like áéí.

Any idea?
<?php function betterTitle($str){
	$str = strip_tags($str);
	$str = preg_replace('/[^\w0-9\s\']/','',$str);
	return $str;
}

$string="estó éspada es un%··5%%%/()=? prueba";
echo betterTitle($string);
?>

sjklein42 Commented:
Try adding the /u switch.

<?php function betterTitle($str){
	$str = strip_tags($str);
	$str = preg_replace('/[^\w0-9\s\']/u','',$str);
	return $str;
}

$string="estó éspada es un%··5%%%/()=? prueba";
echo betterTitle($string);
?>

sjklein42 Commented:
Sorry, that did not work either.  It does not appear to be easy.  Other solutions I found were all brute-force, enumerating all the allowed accented characters.
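
One possible reason the /u attempt fails is that /u requires the subject to be valid UTF-8. If the script or the data is saved in a single-byte encoding such as ISO-8859-1, preg_replace() returns NULL instead of a cleaned string. The following is a minimal sketch, not a confirmed fix for the original setup: it assumes the input is UTF-8 and that PCRE has Unicode property support, and the name betterTitleUtf8 is hypothetical.

```php
<?php
// Hypothetical UTF-8 variant of betterTitle(); assumes $str is valid UTF-8.
function betterTitleUtf8($str) {
    $str = strip_tags($str);
    // \p{L} matches any Unicode letter, \p{N} any Unicode number.
    // The /u modifier makes PCRE read the pattern and the subject as UTF-8.
    $clean = preg_replace('/[^\p{L}\p{N}\s\']/u', '', $str);
    // preg_replace() returns null on error, e.g. when $str is not valid UTF-8.
    return $clean === null ? '' : $clean;
}

// Assumes this file itself is saved as UTF-8.
echo betterTitleUtf8("estó éspada es un%··5%%%/()=? prueba");
// prints: estó éspada es un5 prueba
```

Returning an empty string on failure is only one choice; checking preg_last_error() and converting the input to UTF-8 first would be another.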

Ray Paseur Commented:
"The problem is that Spanish/Italian/German characters like áéáèéë are being removed too."

Of course they are removed - they are not part of your REGEX character class.  Try adding them to the class, something like this.  You will need to find all the characters you want to allow and put them into the regex string.  You might do that by adding more lines around line 16 of the script below (the line that lists the accented characters).

See it in action here:
http://www.laprbass.com/RAY_temp_fernanditos.php
<?php // RAY_temp_fernanditos.php
error_reporting(E_ALL);
echo "<pre>";

// SOME TEST DATA
$chars = 'the spanish/italian/genrman characters like áéáèéë.. are being removed too.';

// A REGULAR EXPRESSION TO SANITIZE THE TEST DATA
$regex
= '/'         // REGEX DELIMITER
. '['         // START OF CHARACTER CLASS
. '^'         // NEGATION - MATCH ANYTHING NOT HERE
. 'A-Z0-9'    // LETTERS AND NUMBERS
. '\s'        // WHITE SPACE
. "'"         // THE APOSTROPHE
. 'áéáèéë'    // SOME ACCENTED CHARACTERS
. ']'         // END CHARACTER CLASS
. '/'         // REGEX DELIMITER
. 'i'         // CASE-INSENSITIVE
;

// SHOW THE REGEX
echo PHP_EOL . $regex;

// SHOW THE WORK PRODUCT
$new = preg_replace($regex, '', $chars);
echo PHP_EOL . $chars;
echo PHP_EOL . $new;

Ray Paseur Commented:
Read this over and see if it gives you any ideas.  I think the letters you may want to keep include those from #192 to #255.  I'll try to show you how I might generate a regex string to include those.  Back in a moment...
<?php // RAY_entitize_western_letters.php
error_reporting(E_ALL);


// DEMONSTRATE HOW TO TRANSLATE SOME WESTERN CHARACTERS INTO ENGLISH-PRINTABLE OR ENTITIES
// SEE http://www.joelonsoftware.com/articles/Unicode.html


// TEST CASES
$arr
= array
( 'Françoise'
, 'ßeta or Beta?'
, 'ENCYCLOPÆDIA'
, 'ça va! mon élève mi niña?'
, 'A stealthy ƒart'
, 'Jean "Ðango" Reinhardt of Pont-à-Celles'
)
;

// DISPLAY EACH TEST CASE
foreach ($arr as $str)
{
    echo PHP_EOL
    . '<br/>'
    . $str
    . ' = '
    . '<strong>'
    . mungstring($str)
    . '</strong>'
    ;
}


// EXAMPLE SHOWING HOW TO TURN A PORTUGUESE NAME INTO PART OF A URL STRING
$str = 'Armação de Pêra';
$new = mungString($str);
$new = strtolower($new);
$new = str_replace(' ', '-', $new);

// SHOW THE URL STRING
echo PHP_EOL
. '<br/>'
. '<strong>'
. '<a target="_blank" href="http://lmgtfy.com?q='
. htmlentities(mungstring($new))
. '">'
. $str
. '</a>'
. '</strong>'
;


// EXAMPLE SHOWING HOW TO TURN A STRING INTO A NUMERICALLY ENTITIZED STRING
$str = 'Armação de Pêra';
$new = mungString($str, 'ENTITIES');
echo "<pre>";
echo PHP_EOL
. $new
. ' = '
. '<strong>'
. htmlentities($new)
. '</strong>'
;


// A FUNCTION TO RETURN THE WESTERNIZED/ENTITIZED STRING
function mungString($str, $return='TEXT')
{
    // OUR REPLACEMENT ARRAY OF ENTITIES
    static
    $entity
    = array();

    // OUR REPLACEMENT ARRAY OF CHARACTERS (YOU MAY WANT SOME CHANGES HERE)
    static
    $normal
    = array
    ( 'ƒ' => 'f'  // http://en.wikipedia.org/wiki/%C6%91 florin
    , 'Š' => 'S'  // http://en.wikipedia.org/wiki/%C5%A0 S-caron (voiceless postalveolar fricative)
    , 'š' => 's'  // http://en.wikipedia.org/wiki/%C5%A0 s-caron
    , 'Ð' => 'Dj' // http://en.wikipedia.org/wiki/Eth (voiced dental fricative)
    , 'Ž' => 'Z'  // http://en.wikipedia.org/wiki/%C5%BD Z-caron (voiced postalveolar fricative)
    , 'ž' => 'z'  // http://en.wikipedia.org/wiki/%C5%BD z-caron
    , 'À' => 'A'
    , 'Á' => 'A'
    , 'Â' => 'A'
    , 'Ã' => 'A'
    , 'Ä' => 'A'
    , 'Å' => 'A'
    , 'Æ' => 'E'
    , 'Ç' => 'C'
    , 'È' => 'E'
    , 'É' => 'E'
    , 'Ê' => 'E'
    , 'Ë' => 'E'
    , 'Ì' => 'I'
    , 'Í' => 'I'
    , 'Î' => 'I'
    , 'Ï' => 'I'
    , 'Ñ' => 'N'
    , 'Ò' => 'O'
    , 'Ó' => 'O'
    , 'Ô' => 'O'
    , 'Õ' => 'O'
    , 'Ö' => 'O'
    , 'Ø' => 'O'
    , 'Ù' => 'U'
    , 'Ú' => 'U'
    , 'Û' => 'U'
    , 'Ü' => 'U'
    , 'Ý' => 'Y'
    , 'Þ' => 'B'
    , 'ß' => 'Ss'
    , 'à' => 'a'
    , 'á' => 'a'
    , 'â' => 'a'
    , 'ã' => 'a'
    , 'ä' => 'a'
    , 'å' => 'a'
    , 'æ' => 'e'
    , 'ç' => 'c'
    , 'è' => 'e'
    , 'é' => 'e'
    , 'ê' => 'e'
    , 'ë' => 'e'
    , 'ì' => 'i'
    , 'í' => 'i'
    , 'î' => 'i'
    , 'ï' => 'i'
    , 'ð' => 'o'
    , 'ñ' => 'n'
    , 'ò' => 'o'
    , 'ó' => 'o'
    , 'ô' => 'o'
    , 'õ' => 'o'
    , 'ö' => 'o'
    , 'ø' => 'o'
    , 'ù' => 'u'
    , 'ú' => 'u'
    , 'û' => 'u'
    , 'ü' => 'u'
    , 'ý' => 'y'
    , 'þ' => 'b'
    , 'ÿ' => 'y'
    )
    ;
    // RETURN THE "TRANSLATED" TEXT
    if (substr(strtoupper($return),0,1) == 'T') return strtr($str, $normal);

    // RETURN THE "ENTITIZED" TEXT
    if (substr(strtoupper($return),0,1) == 'E')
    {
        if (empty($entity))
        {
            foreach ($normal as $key => $nothing)
            {
                $entity[$key] = '&#' . ord($key) . ';';
            }
        }
        return strtr($str, $entity);
    }

    // MIGHT BE USEFUL TO GET THE LIST OF ORIGINAL LETTERS
    return array_keys($normal);
}
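
Note that ord() returns only the first byte of its argument, so the ENTITIES branch above produces correct code points only when the script and data use a single-byte encoding such as ISO-8859-1. For UTF-8 data, something along these lines may work instead. This is a sketch that assumes PHP 7.2 or later with the mbstring extension; the helper name entitizeUtf8 is hypothetical and not part of the script above.

```php
<?php
// Hypothetical helper: replace every non-ASCII character in a UTF-8 string
// with its decimal numeric character reference.
function entitizeUtf8($str) {
    return preg_replace_callback(
        '/[^\x00-\x7F]/u',
        function ($m) {
            // mb_ord() returns the Unicode code point of the matched character.
            return '&#' . mb_ord($m[0], 'UTF-8') . ';';
        },
        $str
    );
}

// Assumes this file itself is saved as UTF-8.
echo entitizeUtf8('Armação de Pêra');
// prints: Arma&#231;&#227;o de P&#234;ra
```

Like preg_replace(), preg_replace_callback() returns NULL when the subject is not valid UTF-8, so callers may want to check for that.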

Ray Paseur Commented:
This seems to work fairly well.  It outputs:

/[^A-Z0-9\s'ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ]/i
Françoise = Françoise
ßeta or Beta? = ßeta or Beta
ENCYCLOPÆDIA = ENCYCLOPÆDIA
ça va! mon élève mi niña? = ça va mon élève mi niña

<?php // RAY_temp_fernanditos.php
error_reporting(E_ALL);
echo "<pre>";

// TEST CASES
$arr
= array
( 'Françoise'
, 'ßeta or Beta?'
, 'ENCYCLOPÆDIA'
, 'ça va! mon élève mi niña?'
)
;

// A REGULAR EXPRESSION TO SANITIZE THE TEST DATA
$rgx
= '/'         // REGEX DELIMITER
. '['         // START OF CHARACTER CLASS
. '^'         // NEGATION - MATCH ANYTHING NOT HERE
. 'A-Z0-9'    // LETTERS AND NUMBERS
. '\s'        // WHITE SPACE
. "'"         // THE APOSTROPHE
. 'XXX'       // A PLACE HOLDER
. ']'         // END CHARACTER CLASS
. '/'         // REGEX DELIMITER
. 'i'         // CASE-INSENSITIVE
;

// MODIFY THE REGEX TO ADD THE CHARACTERS AT #192-255
$num = range(192, 255);
$chs = '';
foreach ($num as $ord)
{
    $chs .= chr($ord);
}
$rgx = str_replace('XXX', $chs, $rgx);

// SHOW THE REGEX
echo PHP_EOL . $rgx;

// SHOW THE WORK PRODUCT
// DISPLAY EACH TEST CASE
foreach ($arr as $str)
{
    echo PHP_EOL
    . $str
    . ' = '
    . '<strong>'
    . preg_replace($rgx, '', $str)
    . '</strong>'
    ;
}

Ray Paseur Commented:
Here is the regex string, using a range of characters.  As you decide you need to keep more characters like the dash or question mark you can add them to this string.

Best of luck with your project, ~Ray
/[^A-Z0-9\s'À-ÿ]/i
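
The range 192-255 also contains the multiplication sign × (215) and the division sign ÷ (247), which are not letters. If the data is UTF-8 rather than a single-byte encoding, the range also needs the /u modifier so that À-ÿ is read as code points U+00C0 to U+00FF rather than as raw bytes. A sketch under those assumptions, with the hypothetical function name keepLatin1Letters:

```php
<?php
// Hypothetical helper; assumes this file and $str are UTF-8.
function keepLatin1Letters($str) {
    // À-Ö, Ø-ö and ø-ÿ cover U+00C0..U+00FF while skipping × and ÷.
    // /i = case-insensitive, /u = treat pattern and subject as UTF-8.
    return preg_replace('/[^A-Z0-9\s\'À-ÖØ-öø-ÿ]/iu', '', $str);
}

echo keepLatin1Letters('ça va! mon élève mi niña?') . PHP_EOL;
// prints: ça va mon élève mi niña
echo keepLatin1Letters('3 × 4 ÷ 2') . PHP_EOL;
// prints: 3  4  2
```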
