Solved

Help with cleaning string function

Posted on 2011-09-29
Last Modified: 2012-05-12
Hi

I have the attached function to clean some titles from useless characters.

The problem is that Spanish/Italian/German characters like áéáèéë are being removed too.

How can I modify this function to keep those characters?
function betterTitle($str){
	$str = strip_tags($str);
	$str = preg_replace('/[^a-zA-Z0-9\s\']/','',$str);
	return $str;
}

Question by:Fernanditos
8 Comments
 
LVL 16

Expert Comment

by:sjklein42
ID: 36818768
Try this:

function betterTitle($str){
	$str = strip_tags($str);
	$str = preg_replace('/[^\w0-9\s\']/','',$str);
	return $str;
} 



\w    any "word" character
\W    any "non-word" character


A "word" character is any letter or digit or the underscore character, that is, any character which can be part of a Perl "word". The definition of letters and digits is controlled by PCRE's character tables, and may vary if locale-specific matching is taking place. For example, in the "fr" (French) locale, some character codes greater than 128 are used for accented letters, and these are matched by \w.

http://www.php.net/manual/en/regexp.reference.escape.php
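
A small sketch of what the \w suggestion changes, assuming default byte-oriented matching (no /u modifier, no special locale): \w already covers digits, so listing 0-9 again is redundant, and it also lets the underscore through. The function name betterTitleWord is hypothetical and used only for this illustration.

```php
<?php
// Hypothetical variant of betterTitle(), for illustration only.
function betterTitleWord($str)
{
    $str = strip_tags($str);
    // \w = letters, digits and underscore, so 0-9 need not be listed again.
    return preg_replace('/[^\w\s\']/', '', $str);
}

echo betterTitleWord("my_title: it's #1!"), PHP_EOL; // my_title it's 1
```

Note that the underscore survives, which the original a-zA-Z0-9 class would have removed.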
 

Author Comment

by:Fernanditos
ID: 36890143
Thank you sjklein42.

It does not work; it is still removing characters like áéí.

Any idea?
<?php
function betterTitle($str){
	$str = strip_tags($str);
	$str = preg_replace('/[^\w0-9\s\']/','',$str);
	return $str;
}

$string="estó éspada es un%··5%%%/()=? prueba";
echo betterTitle($string);
?>

 
LVL 16

Expert Comment

by:sjklein42
ID: 36890209
Try adding the /u switch.

<?php
function betterTitle($str){
	$str = strip_tags($str);
	$str = preg_replace('/[^\w0-9\s\']/u','',$str);
	return $str;
}

$string="estó éspada es un%··5%%%/()=? prueba";
echo betterTitle($string);
?>

 
LVL 16

Expert Comment

by:sjklein42
ID: 36890231
Sorry, that did not work either.  It does not appear to be easy.  Other solutions I found were all brute-force, enumerating all the allowed accented characters.
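
One possible reason the /u attempt appears to fail is input (or a script file) that is not saved as UTF-8: with /u, preg_replace() returns NULL when the subject is not valid UTF-8. The sketch below assumes the text really is UTF-8, checks that first, and then uses Unicode property classes instead of enumerating accented letters. The name betterTitleUnicode is hypothetical.

```php
<?php
// Hypothetical Unicode-aware variant; assumes $str is meant to be UTF-8.
function betterTitleUnicode(string $str): string
{
    $str = strip_tags($str);

    // With /u, PCRE rejects invalid UTF-8: preg_match() returns false.
    if (preg_match('//u', $str) !== 1) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    // Keep letters (\p{L}), combining marks (\p{M}), numbers (\p{N}),
    // whitespace and the apostrophe; remove everything else.
    return preg_replace('/[^\p{L}\p{M}\p{N}\s\']/u', '', $str);
}

echo betterTitleUnicode("estó éspada es un%··5%%%/()=? prueba"), PHP_EOL;
// estó éspada es un5 prueba
```

If the data is actually ISO-8859-1, it would need converting to UTF-8 before calling this, or a byte-oriented class would have to be used instead.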

 
LVL 108

Expert Comment

by:Ray Paseur
ID: 36891588
"The problem is that Spanish/Italian/German characters like áéáèéë are being removed too."

Of course they are removed - they are not part of your REGEX character class.  Try adding them to the class, something like this.  You will need to find all the characters you want to allow and put them into the regex string.  You might do that by adding more lines around line 16.

See it in action here:
http://www.laprbass.com/RAY_temp_fernanditos.php
<?php // RAY_temp_fernanditos.php
error_reporting(E_ALL);
echo "<pre>";

// SOME TEST DATA
$chars = 'the spanish/italian/genrman characters like áéáèéë.. are being removed too.';

// A REGULAR EXPRESSION TO SANITIZE THE TEST DATA
$regex
= '/'         // REGEX DELIMITER
. '['         // START OF CHARACTER CLASS
. '^'         // NEGATION - MATCH ANYTHING NOT HERE
. 'A-Z0-9'    // LETTERS AND NUMBERS
. '\s'        // WHITE SPACE
. "'"         // THE APOSTROPHE
. 'áéáèéë'    // SOME ACCENTED CHARACTERS
. ']'         // END CHARACTER CLASS
. '/'         // REGEX DELIMITER
. 'i'         // CASE-INSENSITIVE
;

// SHOW THE REGEX
echo PHP_EOL . $regex;

// SHOW THE WORK PRODUCT
$new = preg_replace($regex, '', $chars);
echo PHP_EOL . $chars;
echo PHP_EOL . $new;

 
LVL 108

Expert Comment

by:Ray Paseur
ID: 36891623
Read this over and see if it gives you any ideas.  I think the letters you may want to keep include those from #192 to #255.  I'll try to show you how I might generate a regex string to include those.  Back in a moment...
<?php // RAY_entitize_western_letters.php
error_reporting(E_ALL);


// DEMONSTRATE HOW TO TRANSLATE SOME WESTERN CHARACTERS INTO ENGLISH-PRINTABLE OR ENTITIES
// SEE http://www.joelonsoftware.com/articles/Unicode.html


// TEST CASES
$arr
= array
( 'Françoise'
, 'ßeta or Beta?'
, 'ENCYCLOPÆDIA'
, 'ça va! mon élève mi niña?'
, 'A stealthy ƒart'
, 'Jean "Ðango" Reinhardt of Pont-à-Celles'
)
;

// DISPLAY EACH TEST CASE
foreach ($arr as $str)
{
    echo PHP_EOL
    . '<br/>'
    . $str
    . ' = '
    . '<strong>'
    . mungstring($str)
    . '</strong>'
    ;
}


// EXAMPLE SHOWING HOW TO TURN A PORTUGUESE NAME INTO PART OF A URL STRING
$str = 'Armação de Pêra';
$new = mungString($str);
$new = strtolower($new);
$new = str_replace(' ', '-', $new);

// SHOW THE URL STRING
echo PHP_EOL
. '<br/>'
. '<strong>'
. '<a target="blank" href="http://lmgtfy.com?q='
. htmlentities(mungstring($new))
. '">'
. $str
. '</a>'
. '</strong>'
;


// EXAMPLE SHOWING HOW TO TURN A STRING INTO A NUMERICALLY ENTITIZED STRING
$str = 'Armação de Pêra';
$new = mungString($str, 'ENTITIES');
echo "<pre>";
echo PHP_EOL
. $new
. ' = '
. '<strong>'
. htmlentities($new)
. '</strong>'
;


// A FUNCTION TO RETURN THE WESTERNIZED/ENTITIZED STRING
function mungString($str, $return='TEXT')
{
    // OUR REPLACEMENT ARRAY OF ENTITIES
    static
    $entity
    = array();

    // OUR REPLACEMENT ARRAY OF CHARACTERS (YOU MAY WANT SOME CHANGES HERE)
    static
    $normal
    = array
    ( 'ƒ' => 'f'  // http://en.wikipedia.org/wiki/%C6%91 florin
    , 'Š' => 'S'  // http://en.wikipedia.org/wiki/%C5%A0 S-caron (voiceless postalveolar fricative)
    , 'š' => 's'  // http://en.wikipedia.org/wiki/%C5%A0 s-caron
    , 'Ð' => 'Dj' // http://en.wikipedia.org/wiki/Eth (voiced dental fricative)
    , 'Ž' => 'Z'  // http://en.wikipedia.org/wiki/%C5%BD Z-caron (voiced postalveolar fricative)
    , 'ž' => 'z'  // http://en.wikipedia.org/wiki/%C5%BD z-caron
    , 'À' => 'A'
    , 'Á' => 'A'
    , 'Â' => 'A'
    , 'Ã' => 'A'
    , 'Ä' => 'A'
    , 'Å' => 'A'
    , 'Æ' => 'E'
    , 'Ç' => 'C'
    , 'È' => 'E'
    , 'É' => 'E'
    , 'Ê' => 'E'
    , 'Ë' => 'E'
    , 'Ì' => 'I'
    , 'Í' => 'I'
    , 'Î' => 'I'
    , 'Ï' => 'I'
    , 'Ñ' => 'N'
    , 'Ò' => 'O'
    , 'Ó' => 'O'
    , 'Ô' => 'O'
    , 'Õ' => 'O'
    , 'Ö' => 'O'
    , 'Ø' => 'O'
    , 'Ù' => 'U'
    , 'Ú' => 'U'
    , 'Û' => 'U'
    , 'Ü' => 'U'
    , 'Ý' => 'Y'
    , 'Þ' => 'B'
    , 'ß' => 'Ss'
    , 'à' => 'a'
    , 'á' => 'a'
    , 'â' => 'a'
    , 'ã' => 'a'
    , 'ä' => 'a'
    , 'å' => 'a'
    , 'æ' => 'e'
    , 'ç' => 'c'
    , 'è' => 'e'
    , 'é' => 'e'
    , 'ê' => 'e'
    , 'ë' => 'e'
    , 'ì' => 'i'
    , 'í' => 'i'
    , 'î' => 'i'
    , 'ï' => 'i'
    , 'ð' => 'dj' // lowercase eth, kept consistent with 'Ð' => 'Dj'
    , 'ñ' => 'n'
    , 'ò' => 'o'
    , 'ó' => 'o'
    , 'ô' => 'o'
    , 'õ' => 'o'
    , 'ö' => 'o'
    , 'ø' => 'o'
    , 'ù' => 'u'
    , 'ú' => 'u'
    , 'û' => 'u'
    , 'ü' => 'u'
    , 'ý' => 'y'
    , 'þ' => 'b'
    , 'ÿ' => 'y'
    )
    ;
    // RETURN THE "TRANSLATED" TEXT
    if (substr(strtoupper($return),0,1) == 'T') return strtr($str, $normal);

    // RETURN THE "ENTITIZED" TEXT
    if (substr(strtoupper($return),0,1) == 'E')
    {
        if (empty($entity))
        {
            foreach ($normal as $key => $nothing)
            {
                $entity[$key] = '&#' . ord($key) . ';';
            }
        }
        return strtr($str, $entity);
    }

    // MIGHT BE USEFUL TO GET THE LIST OF ORIGINAL LETTERS
    return array_keys($normal);
}
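
A caveat on the ENTITIES branch: ord() returns only the first byte of a string, so it yields the character number only when each key is a single byte (for example in ISO-8859-1). A minimal sketch for UTF-8 text follows; it decodes each non-ASCII character to its code point by hand. Both function names are hypothetical and not part of the script above.

```php
<?php
// Hypothetical helpers, not part of the original script.

// Decode one UTF-8 encoded character (1 to 4 bytes) to its code point.
function utf8CodePoint(string $ch): int
{
    $bytes = array_values(unpack('C*', $ch));
    $n = count($bytes);
    if ($n === 1) {
        return $bytes[0];
    }
    // Payload bits of the lead byte: 0x1F, 0x0F or 0x07 for 2, 3 or 4 bytes.
    $cp = $bytes[0] & (0xFF >> ($n + 1));
    for ($i = 1; $i < $n; $i++) {
        // Each continuation byte contributes its low 6 bits.
        $cp = ($cp << 6) | ($bytes[$i] & 0x3F);
    }
    return $cp;
}

// Replace every non-ASCII character with a decimal numeric entity.
// Returns NULL if $str is not valid UTF-8 (preg failure under /u).
function entitizeUtf8(string $str): ?string
{
    return preg_replace_callback(
        '/[^\x00-\x7F]/u',
        function (array $m): string {
            return '&#' . utf8CodePoint($m[0]) . ';';
        },
        $str
    );
}

echo entitizeUtf8('Armação de Pêra'), PHP_EOL; // Arma&#231;&#227;o de P&#234;ra
```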

 
LVL 108

Accepted Solution

by:
Ray Paseur earned 500 total points
ID: 36891678
This seems to work fairly well.  It outputs:

/[^A-Z0-9\s'ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ]/i
Françoise = Françoise
ßeta or Beta? = ßeta or Beta
ENCYCLOPÆDIA = ENCYCLOPÆDIA
ça va! mon élève mi niña? = ça va mon élève mi niña
<?php // RAY_temp_fernanditos.php
error_reporting(E_ALL);
echo "<pre>";

// TEST CASES
$arr
= array
( 'Françoise'
, 'ßeta or Beta?'
, 'ENCYCLOPÆDIA'
, 'ça va! mon élève mi niña?'
)
;

// A REGULAR EXPRESSION TO SANITIZE THE TEST DATA
$rgx
= '/'         // REGEX DELIMITER
. '['         // START OF CHARACTER CLASS
. '^'         // NEGATION - MATCH ANYTHING NOT HERE
. 'A-Z0-9'    // LETTERS AND NUMBERS
. '\s'        // WHITE SPACE
. "'"         // THE APOSTROPHE
. 'XXX'       // A PLACE HOLDER
. ']'         // END CHARACTER CLASS
. '/'         // REGEX DELIMITER
. 'i'         // CASE-INSENSITIVE
;

// MODIFY THE REGEX TO ADD THE CHARACTERS AT #192-255
$num = range(192, 255);
$chs = '';
foreach ($num as $ord)
{
    $chs .= chr($ord);
}
$rgx = str_replace('XXX', $chs, $rgx);

// SHOW THE REGEX
echo PHP_EOL . $rgx;

// SHOW THE WORK PRODUCT
// DISPLAY EACH TEST CASE
foreach ($arr as $str)
{
    echo PHP_EOL
    . $str
    . ' = '
    . '<strong>'
    . preg_replace($rgx, '', $str)
    . '</strong>'
    ;
}
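
Worth noting: chr() produces single bytes, so the class generated above matches single-byte text such as ISO-8859-1. In UTF-8 the same letters take two bytes each, and a byte-based class would not match them as whole characters. Below is a hedged sketch of the same idea for UTF-8 input, using code point escapes with the /u modifier. It also leaves out × (215) and ÷ (247), which sit inside the 192-255 range but are not letters. The function name keepLatin1Letters is hypothetical.

```php
<?php
// Hypothetical helper: same idea as the chr(192..255) loop, for UTF-8 input.
// Returns NULL if $str is not valid UTF-8.
function keepLatin1Letters(string $str): ?string
{
    // Code points U+00C0..U+00FF, split to skip U+00D7 (×) and U+00F7 (÷).
    $letters = '\x{C0}-\x{D6}\x{D8}-\x{F6}\x{F8}-\x{FF}';
    $rgx = '/[^A-Za-z0-9\s\'' . $letters . ']/u';
    return preg_replace($rgx, '', $str);
}

echo keepLatin1Letters('ça va! mon élève mi niña?'), PHP_EOL;
// ça va mon élève mi niña
```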

 
LVL 108

Expert Comment

by:Ray Paseur
ID: 36891693
Here is the regex string, using a range of characters.  As you decide you need to keep more characters like the dash or question mark you can add them to this string.

Best of luck with your project, ~Ray
/[^A-Z0-9\s'À-ÿ]/i
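
As a rough sketch of how extra characters such as the dash or question mark might be added to this string: the wrapper below takes them as a parameter and escapes them with preg_quote() before appending them to the class. It assumes UTF-8 input and a UTF-8 source file, which is why the /u modifier is added to the original regex. The name cleanTitle and its $extra parameter are hypothetical.

```php
<?php
// Hypothetical wrapper around the range-based regex; assumes UTF-8 input
// and a UTF-8 source file (hence the added /u modifier).
// Returns NULL if $str is not valid UTF-8.
function cleanTitle(string $str, string $extra = ''): ?string
{
    // preg_quote() escapes characters such as "-" and "]" that would
    // otherwise have a special meaning inside the character class.
    $rgx = "/[^A-Z0-9\\s'À-ÿ" . preg_quote($extra, '/') . "]/iu";
    return preg_replace($rgx, '', $str);
}

echo cleanTitle('ßeta or Beta?'), PHP_EOL;              // ßeta or Beta
echo cleanTitle('¿Qué tal? - "niña"', '-?'), PHP_EOL;   // Qué tal? - niña
```

Escaping matters here because an unescaped dash in the middle of a character class would be read as a range operator.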
