Solved

Help with cleaning string function

Posted on 2011-09-29
Last Modified: 2012-05-12
Hi

I have the attached function to clean some titles from useless characters.

The problem is that Spanish/Italian/German characters like áéáèéë are being removed too.

How can I modify this function to keep those characters?
function betterTitle($str){
	$str = strip_tags($str);
	$str = preg_replace('/[^a-zA-Z0-9\s\']/','',$str);
	return $str;
}

Question by:Fernanditos
LVL 16

Expert Comment

by:sjklein42
ID: 36818768
Try this:

function betterTitle($str){
	$str = strip_tags($str);
	$str = preg_replace('/[^\w0-9\s\']/','',$str);
	return $str;
} 


\w    any "word" character
\W    any "non-word" character


A "word" character is any letter or digit or the underscore character, that is, any character which can be part of a Perl "word". The definition of letters and digits is controlled by PCRE's character tables, and may vary if locale-specific matching is taking place. For example, in the "fr" (French) locale, some character codes greater than 128 are used for accented letters, and these are matched by \w.

http://www.php.net/manual/en/regexp.reference.escape.php

Author Comment

by:Fernanditos
ID: 36890143
Thank you sjklein42.

It does not work, still removing the characters like: áéí

Any idea?
<?php function betterTitle($str){
	$str = strip_tags($str);
	$str = preg_replace('/[^\w0-9\s\']/','',$str);
	return $str;
}

$string="estó éspada es un%··5%%%/()=? prueba";
echo betterTitle($string);
?>

LVL 16

Expert Comment

by:sjklein42
ID: 36890209
Try adding the /u switch.

<?php function betterTitle($str){
	$str = strip_tags($str);
	$str = preg_replace('/[^\w0-9\s\']/u','',$str);
	return $str;
}

$string="estó éspada es un%··5%%%/()=? prueba";
echo betterTitle($string);
?>

LVL 16

Expert Comment

by:sjklein42
ID: 36890231
Sorry, that did not work either.  It does not appear to be easy.  Other solutions I found were all brute-force, enumerating all the allowed accented characters.
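
A likely cause of the /u failure is encoding: the /u modifier requires the subject string to be valid UTF-8, and preg_replace() returns NULL when it is not (for instance when the script file is saved as ISO-8859-1). If the titles are UTF-8, a Unicode property class may avoid enumerating characters. The following is a minimal sketch under that assumption; the function name betterTitleUnicode is hypothetical and not part of the original thread.

```php
<?php
// Hypothetical variant of betterTitle(); assumes $str is valid UTF-8.
function betterTitleUnicode($str)
{
    $str = strip_tags($str);
    // \p{L} = any letter, \p{N} = any number.
    // The /u modifier makes PCRE read pattern and subject as UTF-8.
    $clean = preg_replace('/[^\p{L}\p{N}\s\']/u', '', $str);
    // preg_replace() returns NULL on error, e.g. an invalid UTF-8 subject.
    return $clean === null ? '' : $clean;
}

echo betterTitleUnicode("estó éspada es un%··5%%%/()=? prueba");
```

Note that \p{L} matches letters from any script, so this keeps far more than the Western European accents; narrow the class if that is too permissive.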
LVL 108

Expert Comment

by:Ray Paseur
ID: 36891588
"The problem is that Spanish/Italian/German characters like áéáèéë are being removed too."

Of course they are removed: they are not part of your regex character class. Try adding them to the class, something like this. You will need to find all the characters you want to allow and put them into the regex string. You might do that by adding more lines next to the "SOME ACCENTED CHARACTERS" line in the script.

See it in action here:
http://www.laprbass.com/RAY_temp_fernanditos.php
<?php // RAY_temp_fernanditos.php
error_reporting(E_ALL);
echo "<pre>";

// SOME TEST DATA
$chars = 'the spanish/italian/genrman characters like áéáèéë.. are being removed too.';

// A REGULAR EXPRESSION TO SANITIZE THE TEST DATA
$regex
= '/'         // REGEX DELIMITER
. '['         // START OF CHARACTER CLASS
. '^'         // NEGATION - MATCH ANYTHING NOT HERE
. 'A-Z0-9'    // LETTERS AND NUMBERS
. '\s'        // WHITE SPACE
. "'"         // THE APOSTROPHE
. 'áéáèéë'    // SOME ACCENTED CHARACTERS
. ']'         // END CHARACTER CLASS
. '/'         // REGEX DELIMITER
. 'i'         // CASE-INSENSITIVE
;

// SHOW THE REGEX
echo PHP_EOL . $regex;

// SHOW THE WORK PRODUCT
$new = preg_replace($regex, '', $chars);
echo PHP_EOL . $chars;
echo PHP_EOL . $new;

LVL 108

Expert Comment

by:Ray Paseur
ID: 36891623
Read this over and see if it gives you any ideas.  I think the letters you may want to keep include those from #192 to #255.  I'll try to show you how I might generate a regex string to include those.  Back in a moment...
<?php // RAY_entitize_western_letters.php
error_reporting(E_ALL);


// DEMONSTRATE HOW TO TRANSLATE SOME WESTERN CHARACTERS INTO ENGLISH-PRINTABLE OR ENTITIES
// SEE http://www.joelonsoftware.com/articles/Unicode.html


// TEST CASES
$arr
= array
( 'Françoise'
, 'ßeta or Beta?'
, 'ENCYCLOPÆDIA'
, 'ça va! mon élève mi niña?'
, 'A stealthy ƒart'
, 'Jean "Ðango" Reinhardt of Pont-à-Celles'
)
;

// DISPLAY EACH TEST CASE
foreach ($arr as $str)
{
    echo PHP_EOL
    . '<br/>'
    . $str
    . ' = '
    . '<strong>'
    . mungstring($str)
    . '</strong>'
    ;
}


// EXAMPLE SHOWING HOW TO TURN A PORTUGUESE NAME INTO PART OF A URL STRING
$str = 'Armação de Pêra';
$new = mungString($str);
$new = strtolower($new);
$new = str_replace(' ', '-', $new);

// SHOW THE URL STRING
echo PHP_EOL
. '<br/>'
. '<strong>'
. '<a target="_blank" href="http://lmgtfy.com/?q='
. htmlentities(mungstring($new))
. '">'
. $str
. '</a>'
. '</strong>'
;


// EXAMPLE SHOWING HOW TO TURN A STRING INTO A NUMERICALLY ENTITIZED STRING
$str = 'Armação de Pêra';
$new = mungString($str, 'ENTITIES');
echo "<pre>";
echo PHP_EOL
. $new
. ' = '
. '<strong>'
. htmlentities($new)
. '</strong>'
;


// A FUNCTION TO RETURN THE WESTERNIZED/ENTITIZED STRING
function mungString($str, $return='TEXT')
{
    // OUR REPLACEMENT ARRAY OF ENTITIES
    static
    $entity
    = array();

    // OUR REPLACEMENT ARRAY OF CHARACTERS (YOU MAY WANT SOME CHANGES HERE)
    static
    $normal
    = array
    ( 'ƒ' => 'f'  // http://en.wikipedia.org/wiki/%C6%91 florin
    , 'Š' => 'S'  // http://en.wikipedia.org/wiki/%C5%A0 S-caron (voiceless postalveolar fricative)
    , 'š' => 's'  // http://en.wikipedia.org/wiki/%C5%A0 s-caron
    , 'Ð' => 'Dj' // http://en.wikipedia.org/wiki/Eth (voiced dental fricative)
    , 'Ž' => 'Z'  // http://en.wikipedia.org/wiki/%C5%BD Z-caron (voiced postalveolar fricative)
    , 'ž' => 'z'  // http://en.wikipedia.org/wiki/%C5%BD z-caron
    , 'À' => 'A'
    , 'Á' => 'A'
    , 'Â' => 'A'
    , 'Ã' => 'A'
    , 'Ä' => 'A'
    , 'Å' => 'A'
    , 'Æ' => 'E'
    , 'Ç' => 'C'
    , 'È' => 'E'
    , 'É' => 'E'
    , 'Ê' => 'E'
    , 'Ë' => 'E'
    , 'Ì' => 'I'
    , 'Í' => 'I'
    , 'Î' => 'I'
    , 'Ï' => 'I'
    , 'Ñ' => 'N'
    , 'Ò' => 'O'
    , 'Ó' => 'O'
    , 'Ô' => 'O'
    , 'Õ' => 'O'
    , 'Ö' => 'O'
    , 'Ø' => 'O'
    , 'Ù' => 'U'
    , 'Ú' => 'U'
    , 'Û' => 'U'
    , 'Ü' => 'U'
    , 'Ý' => 'Y'
    , 'Þ' => 'B'
    , 'ß' => 'Ss'
    , 'à' => 'a'
    , 'á' => 'a'
    , 'â' => 'a'
    , 'ã' => 'a'
    , 'ä' => 'a'
    , 'å' => 'a'
    , 'æ' => 'e'
    , 'ç' => 'c'
    , 'è' => 'e'
    , 'é' => 'e'
    , 'ê' => 'e'
    , 'ë' => 'e'
    , 'ì' => 'i'
    , 'í' => 'i'
    , 'î' => 'i'
    , 'ï' => 'i'
    , 'ð' => 'o'
    , 'ñ' => 'n'
    , 'ò' => 'o'
    , 'ó' => 'o'
    , 'ô' => 'o'
    , 'õ' => 'o'
    , 'ö' => 'o'
    , 'ø' => 'o'
    , 'ù' => 'u'
    , 'ú' => 'u'
    , 'û' => 'u'
    , 'ü' => 'u'
    , 'ý' => 'y'
    , 'þ' => 'b'
    , 'ÿ' => 'y'
    )
    ;
    // RETURN THE "TRANSLATED" TEXT
    if (substr(strtoupper($return),0,1) == 'T') return strtr($str, $normal);

    // RETURN THE "ENTITIZED" TEXT
    if (substr(strtoupper($return),0,1) == 'E')
    {
        if (empty($entity))
        {
            foreach ($normal as $key => $nothing)
            {
                // ord() READS ONLY THE FIRST BYTE, SO THIS ASSUMES A SINGLE-BYTE ENCODING SUCH AS ISO-8859-1
                $entity[$key] = '&#' . ord($key) . ';';
            }
        }
        return strtr($str, $entity);
    }

    // MIGHT BE USEFUL TO GET THE LIST OF ORIGINAL LETTERS
    return array_keys($normal);
}

LVL 108

Accepted Solution

by:Ray Paseur (earned 500 total points)
ID: 36891678
This seems to work fairly well. It outputs:

/[^A-Z0-9\s'ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ]/i
Françoise = Françoise
ßeta or Beta? = ßeta or Beta
ENCYCLOPÆDIA = ENCYCLOPÆDIA
ça va! mon élève mi niña? = ça va mon élève mi niña
<?php // RAY_temp_fernanditos.php
error_reporting(E_ALL);
echo "<pre>";

// TEST CASES
$arr
= array
( 'Françoise'
, 'ßeta or Beta?'
, 'ENCYCLOPÆDIA'
, 'ça va! mon élève mi niña?'
)
;

// A REGULAR EXPRESSION TO SANITIZE THE TEST DATA
$rgx
= '/'         // REGEX DELIMITER
. '['         // START OF CHARACTER CLASS
. '^'         // NEGATION - MATCH ANYTHING NOT HERE
. 'A-Z0-9'    // LETTERS AND NUMBERS
. '\s'        // WHITE SPACE
. "'"         // THE APOSTROPHE
. 'XXX'       // A PLACE HOLDER
. ']'         // END CHARACTER CLASS
. '/'         // REGEX DELIMITER
. 'i'         // CASE-INSENSITIVE
;

// MODIFY THE REGEX TO ADD THE CHARACTERS AT #192-255
$num = range(192, 255);
$chs = NULL;
foreach ($num as $ord)
{
    $chs .= chr($ord);
}
$rgx = str_replace('XXX', $chs, $rgx);

// SHOW THE REGEX
echo PHP_EOL . $rgx;

// SHOW THE WORK PRODUCT
// DISPLAY EACH TEST CASE
foreach ($arr as $str)
{
    echo PHP_EOL
    . $str
    . ' = '
    . '<strong>'
    . preg_replace($rgx, '', $str)
    . '</strong>'
    ;
}
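
Note that chr() produces single bytes, so the loop above builds a class that fits ISO-8859-1 text. For UTF-8 titles, a comparable list could be built as sketched below. This assumes UTF-8 input and adds the /u modifier, neither of which is in the original script; html_entity_decode() is used only as a portable way to turn a code point into a UTF-8 character (mb_chr() is an alternative on PHP 7.2+ with mbstring).

```php
<?php
// Sketch only: assumes the strings are UTF-8 encoded.
$chs = '';
foreach (range(192, 255) as $ord) {
    // Decode the numeric entity (e.g. &#233;) into the UTF-8 character (é).
    $chs .= html_entity_decode('&#' . $ord . ';', ENT_QUOTES, 'UTF-8');
}

// The /u modifier makes PCRE treat both pattern and subject as UTF-8.
$rgx = '/[^A-Z0-9\s\'' . $chs . ']/iu';

echo preg_replace($rgx, '', 'ça va! mon élève mi niña?');
```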

LVL 108

Expert Comment

by:Ray Paseur
ID: 36891693
Here is the regex string, using a range of characters.  As you decide you need to keep more characters like the dash or question mark you can add them to this string.

Best of luck with your project, ~Ray
/[^A-Z0-9\s'À-ÿ]/i
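
Dropped back into the original function, the range could be used as sketched below. Two things here are assumptions rather than part of Ray's answer: the titles are taken to be UTF-8, so the /u modifier is added and the range is written with \x{...} code points; and × (U+00D7) and ÷ (U+00F7), which sit inside À-ÿ but are not letters, are excluded. The function name betterTitleLatin1Range is hypothetical.

```php
<?php
// Sketch only: assumes $str is valid UTF-8.
function betterTitleLatin1Range($str)
{
    $str = strip_tags($str);
    // U+00C0-U+00FF is the À-ÿ range; U+00D7 (×) and U+00F7 (÷) are
    // skipped because they are symbols, not letters.
    $pattern = '/[^A-Z0-9\s\'\x{C0}-\x{D6}\x{D8}-\x{F6}\x{F8}-\x{FF}]/iu';
    return preg_replace($pattern, '', $str);
}

echo betterTitleLatin1Range('ça va! mon élève mi niña?');
```

As with the original pattern, further characters such as the dash or question mark can be added inside the brackets when they need to be kept.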
