worldofwires
Special Characters in UTF8
Hi there,
I'm having some issues converting 'html'-style UTF-8 characters to 'xml' style. For instance, I want to convert &aacute; to &#225; because at present I'm getting the error 'Entity 'aacute' not defined in Entity' from the DOMDocument loadXML function.
When I do a simple str_replace("&aacute;", "&#225;", $xml) it works, no errors. So I found a list of special characters and their codes (here) and built them into a MySQL table.
From there, I built two arrays and populated them like so:
$srch=array();
$fnd=array();
$qry=$db->Execute("SELECT * FROM cfg_utf8");
while($utf=$qry->FetchRow()) {
$srch[]=htmlspecialchars($utf['html'], ENT_QUOTES);
$fnd[]=htmlspecialchars($utf['xml'], ENT_QUOTES);
}
From there, according to the examples given on php.net, I should be able to do the same str_replace with the $srch and $fnd arrays to achieve the same result I get when I do it for a one-off. However, it doesn't work. I get the same message as usual, which makes me think that the str_replace isn't working (as it still mentions 'aacute', which should have been translated).
Can anyone spot where I'm going wrong?
Thanks,
John
There's a missing ";" in the XML character in the first row. I'm working on the rest.
How do you use your str_replace?
This seems to be working:
// your code
while($utf=$qry->FetchRow()) {
$srch[]=htmlspecialchars($utf['html'], ENT_QUOTES);
$fnd[]=htmlspecialchars($utf['xml'], ENT_QUOTES);
}
$test = str_replace($srch,$fnd,$srch);
print_r($test);
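A test closer to the real use case would run the arrays against an XML fragment rather than against $srch itself. Here is a minimal sketch of that; the hard-coded arrays are stand-ins for the cfg_utf8 rows (assumed to hold raw entity strings) and the XML fragment is made up for illustration.

```php
<?php
// Hard-coded stand-ins for the rows of cfg_utf8 (assumption: raw entity strings).
$srch = array('&aacute;', '&eacute;', '&ntilde;');
$fnd  = array('&#225;',   '&#233;',   '&#241;');

// Hypothetical XML fragment containing HTML-only named entities.
$xml = '<name>Jos&eacute; Pe&ntilde;a de M&aacute;laga</name>';

// The fourth argument receives the number of replacements performed,
// which makes it easy to see whether anything matched at all.
$fixed = str_replace($srch, $fnd, $xml, $count);

echo $fixed, PHP_EOL;
echo 'replacements: ', $count, PHP_EOL;
```

If the count comes back as 0 with the database-driven arrays, the search strings are probably not what they appear to be, and a var_dump() of $srch would be the next thing to look at.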
You might want to read up on this issue here. The article is old, but the problem seems to be an enduring one!
http://www.joelonsoftware.com/articles/Unicode.html
I have used this to let "westernized" characters survive in the UTF-8 environment. Maybe it will help you with your thinking about how to solve the problem. You could change the $normal array to use numeric entities instead of my pidgin-language character set.
HTH, ~Ray
<?php // RAY_westernize_letters.php
error_reporting(E_ALL);
// DEMONSTRATE HOW TO TRANSLATE SOME WESTERN CHARACTERS INTO ENGLISH-PRINTABLE
// TEST CASES
$arr
= array
( 'Françoise'
, 'ßeta or Beta?'
, 'ENCYCLOPÆDIA'
, 'ça va! mon élève mi niña?'
, 'A stealthy ƒart'
, 'Jean "Ðango" Reinhardt of Pont-à-Celles'
)
;
// DISPLAY EACH TEST CASE
foreach ($arr as $str)
{
echo PHP_EOL
. '<br/>'
. $str
. ' = '
. '<strong>'
. mungstring($str)
. '</strong>'
;
}
// EXAMPLE SHOWING HOW TO TURN A PORTUGUESE NAME INTO PART OF A URL STRING
$str = 'Armação de Pêra';
$new = mungString($str);
$new = strtolower($new);
$new = str_replace(' ', '-', $new);
// SHOW THE URL STRING
echo PHP_EOL
. '<br/>'
. '<strong>'
. '<a target="blank" href="http://lmgtfy.com?q='
. htmlentities(mungstring($new))
. '">'
. $str
. '</a>'
. '</strong>'
;
// A FUNCTION TO RETURN THE WESTERNIZED STRING
function mungString($str, $return='TEXT')
{
// OUR REPLACEMENT ARRAY (MAY WANT SOME CHANGES HERE)
static
$normal
= array
( 'ƒ' => 'f' // http://en.wikipedia.org/wiki/%C6%91 florin
, 'Š' => 'S' // http://en.wikipedia.org/wiki/%C5%A0 S-caron (voiceless postalveolar fricative)
, 'š' => 's' // http://en.wikipedia.org/wiki/%C5%A0 s-caron
, 'Ð' => 'Dj' // http://en.wikipedia.org/wiki/Eth (voiced dental fricative)
, 'Ž' => 'Z' // http://en.wikipedia.org/wiki/%C5%BD Z-caron (voiced postalveolar fricative)
, 'ž' => 'z' // http://en.wikipedia.org/wiki/%C5%BD z-caron
, 'À' => 'A'
, 'Á' => 'A'
, 'Â' => 'A'
, 'Ã' => 'A'
, 'Ä' => 'A'
, 'Å' => 'A'
, 'Æ' => 'E'
, 'Ç' => 'C'
, 'È' => 'E'
, 'É' => 'E'
, 'Ê' => 'E'
, 'Ë' => 'E'
, 'Ì' => 'I'
, 'Í' => 'I'
, 'Î' => 'I'
, 'Ï' => 'I'
, 'Ñ' => 'N'
, 'Ò' => 'O'
, 'Ó' => 'O'
, 'Ô' => 'O'
, 'Õ' => 'O'
, 'Ö' => 'O'
, 'Ø' => 'O'
, 'Ù' => 'U'
, 'Ú' => 'U'
, 'Û' => 'U'
, 'Ü' => 'U'
, 'Ý' => 'Y'
, 'Þ' => 'B'
, 'ß' => 'Ss'
, 'à' => 'a'
, 'á' => 'a'
, 'â' => 'a'
, 'ã' => 'a'
, 'ä' => 'a'
, 'å' => 'a'
, 'æ' => 'e'
, 'ç' => 'c'
, 'è' => 'e'
, 'é' => 'e'
, 'ê' => 'e'
, 'ë' => 'e'
, 'ì' => 'i'
, 'í' => 'i'
, 'î' => 'i'
, 'ï' => 'i'
, 'ð' => 'o'
, 'ñ' => 'n'
, 'ò' => 'o'
, 'ó' => 'o'
, 'ô' => 'o'
, 'õ' => 'o'
, 'ö' => 'o'
, 'ø' => 'o'
, 'ù' => 'u'
, 'ú' => 'u'
, 'û' => 'u'
, 'ü' => 'u'
, 'ý' => 'y'
, 'þ' => 'b'
, 'ÿ' => 'y'
)
;
// RETURN THE "TRANSLATED" TEXT
if ($return == 'TEXT') return strtr($str, $normal);
// MIGHT BE USEFUL TO GET THE LIST OF ORIGINAL LETTERS
return array_keys($normal);
}
ASKER
Thank you both for your responses, and apologies for my tardy reply. Roads, I've got it working with the arrays when the arrays are declared in the PHP script. It's when I pull the values from the SQL table that things go awry. In case anyone wants to use the arrays, I've included them below:
You're right about the missing semi-colon, thanks for that. I've corrected it but it wasn't causing an issue.
Ray, I'm not really wanting to westernise the text, I just want the special formatting to survive into the XML. That array that you pasted will be very useful in testing the str_replace which uses the above arrays. Thanks for your post.
$arr=array("html"=>array(), "xml"=>array());
$arr['html'][]="&quot;";
$arr['html'][]="&amp;";
$arr['html'][]="&lt;";
$arr['html'][]="&gt;";
$arr['html'][]="&nbsp;";
$arr['html'][]="&iexcl;";
$arr['html'][]="&cent;";
$arr['html'][]="&pound;";
$arr['html'][]="&curren;";
$arr['html'][]="&yen;";
$arr['html'][]="&brvbar;";
$arr['html'][]="&sect;";
$arr['html'][]="&uml;";
$arr['html'][]="&copy;";
$arr['html'][]="&ordf;";
$arr['html'][]="&laquo;";
$arr['html'][]="&not;";
$arr['html'][]="&shy;";
$arr['html'][]="&reg;";
$arr['html'][]="&macr;";
$arr['html'][]="&deg;";
$arr['html'][]="&plusmn;";
$arr['html'][]="&sup2;";
$arr['html'][]="&sup3;";
$arr['html'][]="&acute;";
$arr['html'][]="&micro;";
$arr['html'][]="&para;";
$arr['html'][]="&middot;";
$arr['html'][]="&cedil;";
$arr['html'][]="&sup1;";
$arr['html'][]="&ordm;";
$arr['html'][]="&raquo;";
$arr['html'][]="&frac14;";
$arr['html'][]="&frac12;";
$arr['html'][]="&frac34;";
$arr['html'][]="&iquest;";
$arr['html'][]="&Agrave;";
$arr['html'][]="&Aacute;";
$arr['html'][]="&Acirc;";
$arr['html'][]="&Atilde;";
$arr['html'][]="&Auml;";
$arr['html'][]="&Aring;";
$arr['html'][]="&AElig;";
$arr['html'][]="&Ccedil;";
$arr['html'][]="&Egrave;";
$arr['html'][]="&Eacute;";
$arr['html'][]="&Ecirc;";
$arr['html'][]="&Euml;";
$arr['html'][]="&Igrave;";
$arr['html'][]="&Iacute;";
$arr['html'][]="&Icirc;";
$arr['html'][]="&Iuml;";
$arr['html'][]="&ETH;";
$arr['html'][]="&Ntilde;";
$arr['html'][]="&Ograve;";
$arr['html'][]="&Oacute;";
$arr['html'][]="&Ocirc;";
$arr['html'][]="&Otilde;";
$arr['html'][]="&Ouml;";
$arr['html'][]="&times;";
$arr['html'][]="&Oslash;";
$arr['html'][]="&Ugrave;";
$arr['html'][]="&Uacute;";
$arr['html'][]="&Ucirc;";
$arr['html'][]="&Uuml;";
$arr['html'][]="&Yacute;";
$arr['html'][]="&THORN;";
$arr['html'][]="&szlig;";
$arr['html'][]="&agrave;";
$arr['html'][]="&aacute;";
$arr['html'][]="&acirc;";
$arr['html'][]="&atilde;";
$arr['html'][]="&auml;";
$arr['html'][]="&aring;";
$arr['html'][]="&aelig;";
$arr['html'][]="&ccedil;";
$arr['html'][]="&egrave;";
$arr['html'][]="&eacute;";
$arr['html'][]="&ecirc;";
$arr['html'][]="&euml;";
$arr['html'][]="&igrave;";
$arr['html'][]="&iacute;";
$arr['html'][]="&icirc;";
$arr['html'][]="&iuml;";
$arr['html'][]="&eth;";
$arr['html'][]="&ntilde;";
$arr['html'][]="&ograve;";
$arr['html'][]="&oacute;";
$arr['html'][]="&ocirc;";
$arr['html'][]="&otilde;";
$arr['html'][]="&ouml;";
$arr['html'][]="&divide;";
$arr['html'][]="&oslash;";
$arr['html'][]="&ugrave;";
$arr['html'][]="&uacute;";
$arr['html'][]="&ucirc;";
$arr['html'][]="&uuml;";
$arr['html'][]="&yacute;";
$arr['html'][]="&thorn;";
$arr['html'][]="&yuml;";
$arr['html'][]="&euro;";
$arr['xml'][]="&#34;";
$arr['xml'][]="&#38;";
$arr['xml'][]="&#60;";
$arr['xml'][]="&#62;";
$arr['xml'][]="&#160;";
$arr['xml'][]="&#161;";
$arr['xml'][]="&#162;";
$arr['xml'][]="&#163;";
$arr['xml'][]="&#164;";
$arr['xml'][]="&#165;";
$arr['xml'][]="&#166;";
$arr['xml'][]="&#167;";
$arr['xml'][]="&#168;";
$arr['xml'][]="&#169;";
$arr['xml'][]="&#170;";
$arr['xml'][]="&#171;";
$arr['xml'][]="&#172;";
$arr['xml'][]="&#173;";
$arr['xml'][]="&#174;";
$arr['xml'][]="&#175;";
$arr['xml'][]="&#176;";
$arr['xml'][]="&#177;";
$arr['xml'][]="&#178;";
$arr['xml'][]="&#179;";
$arr['xml'][]="&#180;";
$arr['xml'][]="&#181;";
$arr['xml'][]="&#182;";
$arr['xml'][]="&#183;";
$arr['xml'][]="&#184;";
$arr['xml'][]="&#185;";
$arr['xml'][]="&#186;";
$arr['xml'][]="&#187;";
$arr['xml'][]="&#188;";
$arr['xml'][]="&#189;";
$arr['xml'][]="&#190;";
$arr['xml'][]="&#191;";
$arr['xml'][]="&#192;";
$arr['xml'][]="&#193;";
$arr['xml'][]="&#194;";
$arr['xml'][]="&#195;";
$arr['xml'][]="&#196;";
$arr['xml'][]="&#197;";
$arr['xml'][]="&#198;";
$arr['xml'][]="&#199;";
$arr['xml'][]="&#200;";
$arr['xml'][]="&#201;";
$arr['xml'][]="&#202;";
$arr['xml'][]="&#203;";
$arr['xml'][]="&#204;";
$arr['xml'][]="&#205;";
$arr['xml'][]="&#206;";
$arr['xml'][]="&#207;";
$arr['xml'][]="&#208;";
$arr['xml'][]="&#209;";
$arr['xml'][]="&#210;";
$arr['xml'][]="&#211;";
$arr['xml'][]="&#212;";
$arr['xml'][]="&#213;";
$arr['xml'][]="&#214;";
$arr['xml'][]="&#215;";
$arr['xml'][]="&#216;";
$arr['xml'][]="&#217;";
$arr['xml'][]="&#218;";
$arr['xml'][]="&#219;";
$arr['xml'][]="&#220;";
$arr['xml'][]="&#221;";
$arr['xml'][]="&#222;";
$arr['xml'][]="&#223;";
$arr['xml'][]="&#224;";
$arr['xml'][]="&#225;";
$arr['xml'][]="&#226;";
$arr['xml'][]="&#227;";
$arr['xml'][]="&#228;";
$arr['xml'][]="&#229;";
$arr['xml'][]="&#230;";
$arr['xml'][]="&#231;";
$arr['xml'][]="&#232;";
$arr['xml'][]="&#233;";
$arr['xml'][]="&#234;";
$arr['xml'][]="&#235;";
$arr['xml'][]="&#236;";
$arr['xml'][]="&#237;";
$arr['xml'][]="&#238;";
$arr['xml'][]="&#239;";
$arr['xml'][]="&#240;";
$arr['xml'][]="&#241;";
$arr['xml'][]="&#242;";
$arr['xml'][]="&#243;";
$arr['xml'][]="&#244;";
$arr['xml'][]="&#245;";
$arr['xml'][]="&#246;";
$arr['xml'][]="&#247;";
$arr['xml'][]="&#248;";
$arr['xml'][]="&#249;";
$arr['xml'][]="&#250;";
$arr['xml'][]="&#251;";
$arr['xml'][]="&#252;";
$arr['xml'][]="&#253;";
$arr['xml'][]="&#254;";
$arr['xml'][]="&#255;";
$arr['xml'][]="&#8364;";
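A named-entity to numeric-entity map like the one above could also be generated from PHP's own entity table instead of being typed in or stored in a database. The following is only a sketch under a few assumptions: the helper utf8_codepoint() and the sample XML string are made up for illustration, and the entities that XML already predefines (&amp;, &lt;, &gt;, &quot;) are deliberately left alone.

```php
<?php
// Hypothetical helper: decode one UTF-8 encoded character into its
// Unicode code point without relying on the mbstring extension.
function utf8_codepoint($char)
{
    $bytes = array_values(unpack('C*', $char));
    $lead  = $bytes[0];
    if ($lead < 0x80) {
        return $lead;               // plain ASCII
    }
    // Work out the sequence length and the payload bits of the lead byte.
    if (($lead & 0xE0) === 0xC0) {
        $len = 2; $cp = $lead & 0x1F;
    } elseif (($lead & 0xF0) === 0xE0) {
        $len = 3; $cp = $lead & 0x0F;
    } else {
        $len = 4; $cp = $lead & 0x07;
    }
    // Each continuation byte contributes six more bits.
    for ($i = 1; $i < $len; $i++) {
        $cp = ($cp << 6) | ($bytes[$i] & 0x3F);
    }
    return $cp;
}

// Character => named entity, for the HTML 4.01 entity set, keyed by UTF-8 characters.
$table = get_html_translation_table(HTML_ENTITIES, ENT_QUOTES | ENT_HTML401, 'UTF-8');

$map = array();
foreach ($table as $char => $entity) {
    // Skip entities XML already understands, and entries that are already numeric.
    if (in_array($entity, array('&amp;', '&lt;', '&gt;', '&quot;'), true) || $entity[1] === '#') {
        continue;
    }
    $map[$entity] = '&#' . utf8_codepoint($char) . ';';
}

// Made-up sample input.
$xml = '<town>M&aacute;laga &amp; C&oacute;rdoba, 5&euro;</town>';
echo strtr($xml, $map), PHP_EOL;
```

strtr() with an array does all replacements in a single pass, so a replacement can never be picked up again by a later search string, which is a small advantage over str_replace() with two parallel arrays.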
ASKER
Excellent link too Ray, thanks for that.
Check this post. I will try to come up with a code snippet that would be helpful to you.
http://us3.php.net/manual/en/function.ord.php#103277
Install this and run it, then do "view source" (or just look at the source from my site here).
http://www.laprbass.com/RAY_entitize_western_letters.php
You might find a more efficient way to handle this issue. For example, you might just test each character to see if its ord() was 128 or above, and entitize those with numerical values. Or you might find that UTF-8 is not the encoding you want to use.
Best regards, ~Ray
<?php // RAY_entitize_western_letters.php
error_reporting(E_ALL);
// DEMONSTRATE HOW TO TRANSLATE SOME WESTERN CHARACTERS INTO ENGLISH-PRINTABLE OR ENTITIES
// TEST CASES
$arr
= array
( 'Françoise'
, 'ßeta or Beta?'
, 'ENCYCLOPÆDIA'
, 'ça va! mon élève mi niña?'
, 'A stealthy ƒart'
, 'Jean "Ðango" Reinhardt of Pont-à-Celles'
)
;
// DISPLAY EACH TEST CASE
foreach ($arr as $str)
{
echo PHP_EOL
. '<br/>'
. $str
. ' = '
. '<strong>'
. mungstring($str)
. '</strong>'
;
}
// EXAMPLE SHOWING HOW TO TURN A PORTUGUESE NAME INTO PART OF A URL STRING
$str = 'Armação de Pêra';
$new = mungString($str);
$new = strtolower($new);
$new = str_replace(' ', '-', $new);
// SHOW THE URL STRING
echo PHP_EOL
. '<br/>'
. '<strong>'
. '<a target="blank" href="http://lmgtfy.com?q='
. htmlentities(mungstring($new))
. '">'
. $str
. '</a>'
. '</strong>'
;
$str = 'Armação de Pêra';
$new = mungString($str, 'FOO');
echo "<pre>";
foreach ($new as $chr)
{
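// ord() ONLY SEES THE FIRST BYTE OF A MULTI-BYTE UTF-8 CHARACTER; mb_ord() (PHP 7.2+, MBSTRING) RETURNS THE FULL CODE POINT
echo PHP_EOL . $chr . '=' . '&#' . mb_ord($chr, 'UTF-8') . ';' ;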
}
// A FUNCTION TO RETURN THE WESTERNIZED STRING
function mungString($str, $return='TEXT')
{
// OUR REPLACEMENT ARRAY (MAY WANT SOME CHANGES HERE)
static
$normal
= array
( 'ƒ' => 'f' // http://en.wikipedia.org/wiki/%C6%91 florin
, 'Š' => 'S' // http://en.wikipedia.org/wiki/%C5%A0 S-caron (voiceless postalveolar fricative)
, 'š' => 's' // http://en.wikipedia.org/wiki/%C5%A0 s-caron
, 'Ð' => 'Dj' // http://en.wikipedia.org/wiki/Eth (voiced dental fricative)
, 'Ž' => 'Z' // http://en.wikipedia.org/wiki/%C5%BD Z-caron (voiced postalveolar fricative)
, 'ž' => 'z' // http://en.wikipedia.org/wiki/%C5%BD z-caron
, 'À' => 'A'
, 'Á' => 'A'
, 'Â' => 'A'
, 'Ã' => 'A'
, 'Ä' => 'A'
, 'Å' => 'A'
, 'Æ' => 'E'
, 'Ç' => 'C'
, 'È' => 'E'
, 'É' => 'E'
, 'Ê' => 'E'
, 'Ë' => 'E'
, 'Ì' => 'I'
, 'Í' => 'I'
, 'Î' => 'I'
, 'Ï' => 'I'
, 'Ñ' => 'N'
, 'Ò' => 'O'
, 'Ó' => 'O'
, 'Ô' => 'O'
, 'Õ' => 'O'
, 'Ö' => 'O'
, 'Ø' => 'O'
, 'Ù' => 'U'
, 'Ú' => 'U'
, 'Û' => 'U'
, 'Ü' => 'U'
, 'Ý' => 'Y'
, 'Þ' => 'B'
, 'ß' => 'Ss'
, 'à' => 'a'
, 'á' => 'a'
, 'â' => 'a'
, 'ã' => 'a'
, 'ä' => 'a'
, 'å' => 'a'
, 'æ' => 'e'
, 'ç' => 'c'
, 'è' => 'e'
, 'é' => 'e'
, 'ê' => 'e'
, 'ë' => 'e'
, 'ì' => 'i'
, 'í' => 'i'
, 'î' => 'i'
, 'ï' => 'i'
, 'ð' => 'o'
, 'ñ' => 'n'
, 'ò' => 'o'
, 'ó' => 'o'
, 'ô' => 'o'
, 'õ' => 'o'
, 'ö' => 'o'
, 'ø' => 'o'
, 'ù' => 'u'
, 'ú' => 'u'
, 'û' => 'u'
, 'ü' => 'u'
, 'ý' => 'y'
, 'þ' => 'b'
, 'ÿ' => 'y'
)
;
// RETURN THE "TRANSLATED" TEXT
if ($return == 'TEXT') return strtr($str, $normal);
// MIGHT BE USEFUL TO GET THE LIST OF ORIGINAL LETTERS
return array_keys($normal);
}
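Ray's suggestion of entitizing every character whose value is 128 or above, rather than maintaining a lookup table, can be sketched roughly as follows. The function name entitize_non_ascii() is made up for this example, the input is assumed to be valid UTF-8, and the code point is decoded by hand so that no extension beyond PCRE is needed.

```php
<?php
// Hypothetical helper: replace every non-ASCII character in a UTF-8 string
// with its numeric character reference (e.g. "é" becomes "&#233;").
function entitize_non_ascii($str)
{
    // The /u modifier makes PCRE treat the subject as UTF-8, so each match
    // is one whole character rather than a single byte.
    return preg_replace_callback(
        '/[^\x00-\x7F]/u',
        function ($m) {
            $bytes = array_values(unpack('C*', $m[0]));
            $lead  = $bytes[0];
            // Payload bits of the lead byte depend on the sequence length.
            if (($lead & 0xE0) === 0xC0) {
                $cp = $lead & 0x1F;         // 2-byte sequence
            } elseif (($lead & 0xF0) === 0xE0) {
                $cp = $lead & 0x0F;         // 3-byte sequence
            } else {
                $cp = $lead & 0x07;         // 4-byte sequence
            }
            // Each continuation byte contributes six more bits.
            for ($i = 1; $i < count($bytes); $i++) {
                $cp = ($cp << 6) | ($bytes[$i] & 0x3F);
            }
            return '&#' . $cp . ';';
        },
        $str
    );
}

echo entitize_non_ascii('ça va! mon élève mi niña?'), PHP_EOL;
```

This only handles literal non-ASCII characters. Named entities such as &aacute; that are already present in the text are plain ASCII and would pass through untouched, so they still need a separate replacement step.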
ASKER
Hi Ray,
Yes, using Entities through the mungstring function provides the right output. Now I'll see if I can implement it into my project. Thanks for your help on this one, I've learnt a lot about unicode!
John
Thanks for the points - it's a great question, ~Ray
ASKER
I'll leave it open to see if anyone can spot the issue with the DB method.
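One possible explanation for the DB method failing, offered as a guess since the table contents are not shown: if cfg_utf8 stores the entity strings themselves, the htmlspecialchars() calls in the loop escape the leading ampersand, so the search string becomes "&amp;aacute;" and never occurs in the XML. A minimal sketch of that effect, with a hard-coded row standing in for the database result:

```php
<?php
// Assumption: a cfg_utf8 row holds entity strings like these.
$row = array('html' => '&aacute;', 'xml' => '&#225;');

// What the loop in the question stores in the search/replace arrays.
$srch = htmlspecialchars($row['html'], ENT_QUOTES);   // "&amp;aacute;"
$fnd  = htmlspecialchars($row['xml'],  ENT_QUOTES);   // "&amp;#225;"

// Made-up XML fragment.
$xml = '<p>M&aacute;laga</p>';

// The escaped search string is not present in the XML, so nothing changes.
$escaped = str_replace($srch, $fnd, $xml, $countEscaped);

// Using the raw column values matches as expected.
$raw = str_replace($row['html'], $row['xml'], $xml, $countRaw);

echo $countEscaped, ' vs ', $countRaw, PHP_EOL;
echo $raw, PHP_EOL;
```

If this is what is happening, it would also explain why the test that ran str_replace() over $srch itself appeared to work: the subject had been escaped in exactly the same way as the search strings. Dropping the htmlspecialchars() calls when filling the arrays would be the first thing to try.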