Solved

Help with cleaning string function

Posted on 2011-09-29
Last Modified: 2012-05-12
Hi

I have the attached function to strip unwanted characters from some titles.

The problem is that Spanish/Italian/German characters like áéáèéë are being removed too.

How can I modify this function to keep those characters?
function betterTitle($str){
	$str = strip_tags($str);
	$str = preg_replace('/[^a-zA-Z0-9\s\']/','',$str);
	return $str;
}

Question by:Fernanditos
LVL 16

Expert Comment

by:sjklein42
ID: 36818768
Try this:

function betterTitle($str){
	$str = strip_tags($str);
	$str = preg_replace('/[^\w0-9\s\']/','',$str);
	return $str;
} 



\w
any "word" character
\W
any "non-word" character


A "word" character is any letter or digit or the underscore character, that is, any character which can be part of a Perl "word". The definition of letters and digits is controlled by PCRE's character tables, and may vary if locale-specific matching is taking place. For example, in the "fr" (French) locale, some character codes greater than 128 are used for accented letters, and these are matched by \w.

http://www.php.net/manual/en/regexp.reference.escape.php
 

Author Comment

by:Fernanditos
ID: 36890143
Thank you sjklein42.

It does not work, still removing the characters like: áéí

Any idea?
<?php function betterTitle($str){
	$str = strip_tags($str);
	$str = preg_replace('/[^\w0-9\s\']/','',$str);
	return $str;
}

$string="estó éspada es un%··5%%%/()=? prueba";
echo betterTitle($string);
?>

 
LVL 16

Expert Comment

by:sjklein42
ID: 36890209
Try adding the /u modifier, which makes PCRE treat the pattern and the subject string as UTF-8.

<?php function betterTitle($str){
	$str = strip_tags($str);
	$str = preg_replace('/[^\w0-9\s\']/u','',$str);
	return $str;
}

$string="estó éspada es un%··5%%%/()=? prueba";
echo betterTitle($string);
?>

 
LVL 16

Expert Comment

by:sjklein42
ID: 36890231
Sorry, that did not work either.  It does not appear to be easy.  Other solutions I found were all brute-force, enumerating all the allowed accented characters.
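
One possible explanation, offered as an assumption rather than something confirmed in this thread: with /u, preg_replace() returns NULL when the subject is not valid UTF-8, for example when the script file is saved as ISO-8859-1. The sketch below checks for that case first and then whitelists Unicode letters and digits by property. The name betterTitleUtf8 is hypothetical, and the example assumes the file itself is saved as UTF-8.

```php
<?php
// Hypothetical UTF-8 variant of betterTitle(); the name is illustrative only.
function betterTitleUtf8($str)
{
    $str = strip_tags($str);

    // With /u, preg_* functions fail on invalid UTF-8.
    // Matching the empty pattern is a cheap validity check.
    if (preg_match('//u', $str) !== 1) {
        return null; // not valid UTF-8; the caller decides how to convert it
    }

    // Keep Unicode letters (\p{L}), numbers (\p{N}), whitespace and the apostrophe.
    return preg_replace('/[^\p{L}\p{N}\s\']/u', '', $str);
}

// Assumes this file is saved as UTF-8.
echo betterTitleUtf8("estó éspada es un%··5%%%/()=? prueba"), PHP_EOL;
```

If the function returns NULL, the input would need converting to UTF-8 before cleaning; which conversion is appropriate depends on where the titles come from.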
 
LVL 110

Expert Comment

by:Ray Paseur
ID: 36891588
"The problem is that Spanish/Italian/German characters like áéáèéë are being removed too."

Of course they are removed: they are not part of your regex character class.  Try adding them to the class, something like this.  You will need to find all the characters you want to allow and put them into the regex string.  You might do that by adding more lines around line 16 of the script below (the accented-characters line).

See it in action here:
http://www.laprbass.com/RAY_temp_fernanditos.php
<?php // RAY_temp_fernanditos.php
error_reporting(E_ALL);
echo "<pre>";

// SOME TEST DATA
$chars = 'the spanish/italian/genrman characters like áéáèéë.. are being removed too.';

// A REGULAR EXPRESSION TO SANITIZE THE TEST DATA
$regex
= '/'         // REGEX DELIMITER
. '['         // START OF CHARACTER CLASS
. '^'         // NEGATION - MATCH ANYTHING NOT HERE
. 'A-Z0-9'    // LETTERS AND NUMBERS
. '\s'        // WHITE SPACE
. "'"         // THE APOSTROPHE
. 'áéáèéë'    // SOME ACCENTED CHARACTERS
. ']'         // END CHARACTER CLASS
. '/'         // REGEX DELIMITER
. 'i'         // CASE-INSENSITIVE
;

// SHOW THE REGEX
echo PHP_EOL . $regex;

// SHOW THE WORK PRODUCT
$new = preg_replace($regex, '', $chars);
echo PHP_EOL . $chars;
echo PHP_EOL . $new;

 
LVL 110

Expert Comment

by:Ray Paseur
ID: 36891623
Read this over and see if it gives you any ideas.  I think the letters you may want to keep include those at character codes #192 to #255 (À through ÿ in ISO-8859-1).  I'll try to show you how I might generate a regex string to include those.  Back in a moment...
<?php // RAY_entitize_western_letters.php
error_reporting(E_ALL);


// DEMONSTRATE HOW TO TRANSLATE SOME WESTERN CHARACTERS INTO ENGLISH-PRINTABLE OR ENTITIES
// SEE http://www.joelonsoftware.com/articles/Unicode.html


// TEST CASES
$arr
= array
( 'Françoise'
, 'ßeta or Beta?'
, 'ENCYCLOPÆDIA'
, 'ça va! mon élève mi niña?'
, 'A stealthy ƒart'
, 'Jean "Ðango" Reinhardt of Pont-à-Celles'
)
;

// DISPLAY EACH TEST CASE
foreach ($arr as $str)
{
    echo PHP_EOL
    . '<br/>'
    . $str
    . ' = '
    . '<strong>'
    . mungstring($str)
    . '</strong>'
    ;
}


// EXAMPLE SHOWING HOW TO TURN A PORTUGUESE NAME INTO PART OF A URL STRING
$str = 'Armação de Pêra';
$new = mungString($str);
$new = strtolower($new);
$new = str_replace(' ', '-', $new);

// SHOW THE URL STRING
echo PHP_EOL
. '<br/>'
. '<strong>'
. '<a target="_blank" href="http://lmgtfy.com?q='
. htmlentities(mungstring($new))
. '">'
. $str
. '</a>'
. '</strong>'
;


// EXAMPLE SHOWING HOW TO TURN A STRING INTO A NUMERICALLY ENTITIZED STRING
$str = 'Armação de Pêra';
$new = mungString($str, 'ENTITIES');
echo "<pre>";
echo PHP_EOL
. $new
. ' = '
. '<strong>'
. htmlentities($new)
. '</strong>'
;


// A FUNCTION TO RETURN THE WESTERNIZED/ENTITIZED STRING
function mungString($str, $return='TEXT')
{
    // OUR REPLACEMENT ARRAY OF ENTITIES
    static
    $entity
    = array();

    // OUR REPLACEMENT ARRAY OF CHARACTERS (YOU MAY WANT SOME CHANGES HERE)
    static
    $normal
    = array
    ( 'ƒ' => 'f'  // http://en.wikipedia.org/wiki/%C6%91 florin
    , 'Š' => 'S'  // http://en.wikipedia.org/wiki/%C5%A0 S-caron (voiceless postalveolar fricative)
    , 'š' => 's'  // http://en.wikipedia.org/wiki/%C5%A0 s-caron
    , 'Ð' => 'Dj' // http://en.wikipedia.org/wiki/Eth (voiced dental fricative)
    , 'Ž' => 'Z'  // http://en.wikipedia.org/wiki/%C5%BD Z-caron (voiced postalveolar fricative)
    , 'ž' => 'z'  // http://en.wikipedia.org/wiki/%C5%BD z-caron
    , 'À' => 'A'
    , 'Á' => 'A'
    , 'Â' => 'A'
    , 'Ã' => 'A'
    , 'Ä' => 'A'
    , 'Å' => 'A'
    , 'Æ' => 'E'
    , 'Ç' => 'C'
    , 'È' => 'E'
    , 'É' => 'E'
    , 'Ê' => 'E'
    , 'Ë' => 'E'
    , 'Ì' => 'I'
    , 'Í' => 'I'
    , 'Î' => 'I'
    , 'Ï' => 'I'
    , 'Ñ' => 'N'
    , 'Ò' => 'O'
    , 'Ó' => 'O'
    , 'Ô' => 'O'
    , 'Õ' => 'O'
    , 'Ö' => 'O'
    , 'Ø' => 'O'
    , 'Ù' => 'U'
    , 'Ú' => 'U'
    , 'Û' => 'U'
    , 'Ü' => 'U'
    , 'Ý' => 'Y'
    , 'Þ' => 'B'
    , 'ß' => 'Ss'
    , 'à' => 'a'
    , 'á' => 'a'
    , 'â' => 'a'
    , 'ã' => 'a'
    , 'ä' => 'a'
    , 'å' => 'a'
    , 'æ' => 'e'
    , 'ç' => 'c'
    , 'è' => 'e'
    , 'é' => 'e'
    , 'ê' => 'e'
    , 'ë' => 'e'
    , 'ì' => 'i'
    , 'í' => 'i'
    , 'î' => 'i'
    , 'ï' => 'i'
    , 'ð' => 'o'
    , 'ñ' => 'n'
    , 'ò' => 'o'
    , 'ó' => 'o'
    , 'ô' => 'o'
    , 'õ' => 'o'
    , 'ö' => 'o'
    , 'ø' => 'o'
    , 'ù' => 'u'
    , 'ú' => 'u'
    , 'û' => 'u'
    , 'ü' => 'u'
    , 'ý' => 'y'
    , 'þ' => 'b'
    , 'ÿ' => 'y'
    )
    ;
    // RETURN THE "TRANSLATED" TEXT
    if (substr(strtoupper($return),0,1) == 'T') return strtr($str, $normal);

    // RETURN THE "ENTITIZED" TEXT
    if (substr(strtoupper($return),0,1) == 'E')
    {
        if (empty($entity))
        {
            foreach ($normal as $key => $nothing)
            {
                // ord() READS ONE BYTE, SO THIS ASSUMES A SINGLE-BYTE ENCODING SUCH AS ISO-8859-1
                $entity[$key] = '&#' . ord($key) . ';';
            }
        }
        return strtr($str, $entity);
    }

    // MIGHT BE USEFUL TO GET THE LIST OF ORIGINAL LETTERS
    return array_keys($normal);
}
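
The last branch of mungString() returns the list of accented letters, and a list like that could feed the whitelist regex directly. A minimal sketch of that idea follows, assuming UTF-8 input and a UTF-8 source file. The helper name buildWhitelistRegex and the short $allowed list are hypothetical, not part of the code above.

```php
<?php
// Hypothetical helper: build a "remove everything except..." regex
// from a list of extra letters to keep. Assumes UTF-8 strings.
function buildWhitelistRegex(array $letters)
{
    $class = '';
    foreach ($letters as $letter) {
        // preg_quote() escapes anything that is special inside a pattern.
        $class .= preg_quote($letter, '/');
    }
    // i = case-insensitive, u = treat pattern and subject as UTF-8.
    return '/[^A-Z0-9\s\'' . $class . ']/iu';
}

// Illustrative list only; extend it with whatever letters the titles need.
$allowed = array('á', 'é', 'í', 'ó', 'ú', 'ñ', 'ü');
echo preg_replace(buildWhitelistRegex($allowed), '', 'mi niña, ¿qué tal?'), PHP_EOL;
```

Because of the /i modifier, listing the lowercase letters is enough to keep their uppercase forms as well.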

 
LVL 110

Accepted Solution

by:
Ray Paseur earned 500 total points
ID: 36891678
This seems to work fairly well.  Outputs

/[^A-Z0-9\s'ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ]/i
Françoise = Françoise
ßeta or Beta? = ßeta or Beta
ENCYCLOPÆDIA = ENCYCLOPÆDIA
ça va! mon élève mi niña? = ça va mon élève mi niña
<?php // RAY_temp_fernanditos.php
error_reporting(E_ALL);
echo "<pre>";

// TEST CASES
$arr
= array
( 'Françoise'
, 'ßeta or Beta?'
, 'ENCYCLOPÆDIA'
, 'ça va! mon élève mi niña?'
)
;

// A REGULAR EXPRESSION TO SANITIZE THE TEST DATA
$rgx
= '/'         // REGEX DELIMITER
. '['         // START OF CHARACTER CLASS
. '^'         // NEGATION - MATCH ANYTHING NOT HERE
. 'A-Z0-9'    // LETTERS AND NUMBERS
. '\s'        // WHITE SPACE
. "'"         // THE APOSTROPHE
. 'XXX'       // A PLACE HOLDER
. ']'         // END CHARACTER CLASS
. '/'         // REGEX DELIMITER
. 'i'         // CASE-INSENSITIVE
;

// MODIFY THE REGEX TO ADD THE CHARACTERS AT #192-255
// (chr() YIELDS SINGLE BYTES, SO THIS ASSUMES ISO-8859-1 DATA, NOT UTF-8)
$num = range(192, 255);
$chs = NULL;
foreach ($num as $ord)
{
    $chs .= chr($ord);
}
$rgx = str_replace('XXX', $chs, $rgx);

// SHOW THE REGEX
echo PHP_EOL . $rgx;

// SHOW THE WORK PRODUCT
// DISPLAY EACH TEST CASE
foreach ($arr as $str)
{
    echo PHP_EOL
    . $str
    . ' = '
    . '<strong>'
    . preg_replace($rgx, '', $str)
    . '</strong>'
    ;
}

 
LVL 110

Expert Comment

by:Ray Paseur
ID: 36891693
Here is the regex string, using a range of characters.  As you decide you need to keep more characters like the dash or question mark you can add them to this string.

Best of luck with your project, ~Ray
/[^A-Z0-9\s'À-ÿ]/i
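
For UTF-8 data the same range can be written with code points, which avoids depending on how the script file itself is encoded. This is a sketch under that assumption, not the code used in the thread; the function name betterTitleRange is hypothetical. Note that the range À-ÿ also covers × and ÷, as the generated class in the accepted solution shows.

```php
<?php
// Hypothetical variant of betterTitle() for UTF-8 input.
function betterTitleRange($str)
{
    $str = strip_tags($str);
    // \x{C0}-\x{FF} is the code-point range À-ÿ.
    // The u modifier makes PCRE read UTF-8 characters instead of single bytes.
    return preg_replace('/[^A-Z0-9\s\'\x{C0}-\x{FF}]/iu', '', $str);
}

// Assumes this file is saved as UTF-8.
echo betterTitleRange('ça va! mon élève mi niña?'), PHP_EOL;
```

As with the original string, further characters such as the dash or question mark can be added inside the brackets.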
