Asked by worldofwires

Special Characters in UTF8

Hi there,

I'm having some issues with converting 'html' utf8 characters to 'xml' style. So, for instance, I want to convert &aacute; to &#225; because at present, I'm getting the error 'Entity 'aacute' not defined in Entity' from the DOMDocument loadXML function.

When I do a simple str_replace("&aacute;", "&#225;", $xml) it works, no errors. So, I found a list of special characters and their codes (here) and built them into a MySQL table.
[User-generated image]
From there, I built two arrays and populated them like so:
	$srch=array();
	$fnd=array();
	$qry=$db->Execute("SELECT * FROM cfg_utf8");
	while($utf=$qry->FetchRow()) {
		$srch[]=htmlspecialchars($utf['html'], ENT_QUOTES);
		$fnd[]=htmlspecialchars($utf['xml'], ENT_QUOTES);
	}


From there, according to the examples given on php.net, I should be able to do the same str_replace with the $srch and $fnd arrays to achieve the same result I get when I do it for a one-off. However, it doesn't work. I get the same message as usual, which makes me think that the str_replace isn't working (as it still mentions 'aacute', which should have been translated).

Can anyone spot where I'm going wrong?

Thanks,
John
worldofwires (ASKER)

I've got a workaround which will do me for now. I created the two columns of data from the link in the post in an array without using the database table. That works fine, so it's something to do with the way it retrieves the data from the database (so probably the htmlspecialchars function).

I'll leave it open to see if anyone can spot the issue with the DB method.
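
A possible explanation for the DB method, sketched under the assumption that the cfg_utf8 table stores the entity strings literally (for example &aacute; and &#225;): htmlspecialchars() escapes the leading ampersand, so the search string no longer matches anything in the XML. The $row array and the sample $xml string below are made up for illustration.

```php
<?php
// Assumption: one row of cfg_utf8 holds the entity strings literally.
$row = array('html' => '&aacute;', 'xml' => '&#225;');

// htmlspecialchars() escapes the ampersand, which changes the search string.
$escaped = htmlspecialchars($row['html'], ENT_QUOTES);
echo $escaped, PHP_EOL;   // &amp;aacute;

// Invented sample input.
$xml = '<name>Jos&aacute;</name>';

// The escaped search string never occurs in $xml, so nothing is replaced.
$unchanged = str_replace($escaped, htmlspecialchars($row['xml'], ENT_QUOTES), $xml);

// Using the raw column values performs the replacement.
$fixed = str_replace($row['html'], $row['xml'], $xml);
echo $fixed, PHP_EOL;     // <name>Jos&#225;</name>
```

If that assumption holds, dropping the htmlspecialchars() calls inside the while loop may be all that is needed.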
There's a missing ";" in the xml's character in the first row. I'm working on the rest.
How do you use your str_replace?
This seems to be working:

	// your code
	while($utf=$qry->FetchRow()) {
		$srch[]=htmlspecialchars($utf['html'], ENT_QUOTES);
		$fnd[]=htmlspecialchars($utf['xml'], ENT_QUOTES);
	}

	$test = str_replace($srch,$fnd,$srch);
	print_r($test);

You might want to read up on this issue here.   The article is old, but the problem seems to be an enduring one!
http://www.joelonsoftware.com/articles/Unicode.html

I have used this to let "westernized" characters survive in the UTF-8 environment.  Maybe it will help you with your thinking about how to solve the problem.  You could change the $normal array to use numeric entities instead of my pidgin-language character set.

HTH, ~Ray
<?php // RAY_westernize_letters.php
error_reporting(E_ALL);


// DEMONSTRATE HOW TO TRANSLATE SOME WESTERN CHARACTERS INTO ENGLISH-PRINTABLE


// TEST CASES
$arr
= array
( 'Françoise'
, 'ßeta or Beta?'
, 'ENCYCLOPÆDIA'
, 'ça va! mon élève mi niña?'
, 'A stealthy ƒart'
, 'Jean "Ðango" Reinhardt of Pont-à-Celles'
)
;

// DISPLAY EACH TEST CASE
foreach ($arr as $str)
{
    echo PHP_EOL
    . '<br/>'
    . $str
    . ' = '
    . '<strong>'
    . mungstring($str)
    . '</strong>'
    ;
}

// EXAMPLE SHOWING HOW TO TURN A PORTUGUESE NAME INTO PART OF A URL STRING
$str = 'Armação de Pêra';
$new = mungString($str);
$new = strtolower($new);
$new = str_replace(' ', '-', $new);

// SHOW THE URL STRING
echo PHP_EOL
. '<br/>'
. '<strong>'
. '<a target="blank" href="http://lmgtfy.com?q='
. htmlentities(mungstring($new))
. '">'
. $str
. '</a>'
. '</strong>'
;

// A FUNCTION TO RETURN THE WESTERNIZED STRING
function mungString($str, $return='TEXT')
{
    // OUR REPLACEMENT ARRAY (MAY WANT SOME CHANGES HERE)
    static
    $normal
    = array
    ( 'ƒ' => 'f'  // http://en.wikipedia.org/wiki/%C6%91 florin
    , 'Š' => 'S'  // http://en.wikipedia.org/wiki/%C5%A0 S-caron (voiceless postalveolar fricative)
    , 'š' => 's'  // http://en.wikipedia.org/wiki/%C5%A0 s-caron
    , 'Ð' => 'Dj' // http://en.wikipedia.org/wiki/Eth (voiced dental fricative)
    , 'Ž' => 'Z'  // http://en.wikipedia.org/wiki/%C5%BD Z-caron (voiced postalveolar fricative)
    , 'ž' => 'z'  // http://en.wikipedia.org/wiki/%C5%BD z-caron
    , 'À' => 'A'
    , 'Á' => 'A'
    , 'Â' => 'A'
    , 'Ã' => 'A'
    , 'Ä' => 'A'
    , 'Å' => 'A'
    , 'Æ' => 'E'
    , 'Ç' => 'C'
    , 'È' => 'E'
    , 'É' => 'E'
    , 'Ê' => 'E'
    , 'Ë' => 'E'
    , 'Ì' => 'I'
    , 'Í' => 'I'
    , 'Î' => 'I'
    , 'Ï' => 'I'
    , 'Ñ' => 'N'
    , 'Ò' => 'O'
    , 'Ó' => 'O'
    , 'Ô' => 'O'
    , 'Õ' => 'O'
    , 'Ö' => 'O'
    , 'Ø' => 'O'
    , 'Ù' => 'U'
    , 'Ú' => 'U'
    , 'Û' => 'U'
    , 'Ü' => 'U'
    , 'Ý' => 'Y'
    , 'Þ' => 'B'
    , 'ß' => 'Ss'
    , 'à' => 'a'
    , 'á' => 'a'
    , 'â' => 'a'
    , 'ã' => 'a'
    , 'ä' => 'a'
    , 'å' => 'a'
    , 'æ' => 'e'
    , 'ç' => 'c'
    , 'è' => 'e'
    , 'é' => 'e'
    , 'ê' => 'e'
    , 'ë' => 'e'
    , 'ì' => 'i'
    , 'í' => 'i'
    , 'î' => 'i'
    , 'ï' => 'i'
    , 'ð' => 'o'
    , 'ñ' => 'n'
    , 'ò' => 'o'
    , 'ó' => 'o'
    , 'ô' => 'o'
    , 'õ' => 'o'
    , 'ö' => 'o'
    , 'ø' => 'o'
    , 'ù' => 'u'
    , 'ú' => 'u'
    , 'û' => 'u'
    , 'ü' => 'u'
    , 'ý' => 'y'
    , 'þ' => 'b'
    , 'ÿ' => 'y'
    )
    ;
    // RETURN THE "TRANSLATED" TEXT
    if ($return == 'TEXT') return strtr($str, $normal);

    // MIGHT BE USEFUL TO GET THE LIST OF ORIGINAL LETTERS
    return array_keys($normal);
}

Thank you both for your responses, apologies for my tardy reply. Roads, I've got it working with the arrays when the arrays are declared in the PHP script. It's when I drag the values from the SQL table that things go awry. In case anyone wants to use the arrays, I've included them below:
$arr=array("html"=>array(), "xml"=>array());

	$arr['html'][]="&quot;";
	$arr['html'][]="&amp;";
	$arr['html'][]="&lt;";
	$arr['html'][]="&gt;";
	$arr['html'][]="&nbsp;";
	$arr['html'][]="&iexcl;";
	$arr['html'][]="&cent;";
	$arr['html'][]="&pound;";
	$arr['html'][]="&curren;";
	$arr['html'][]="&yen;";
	$arr['html'][]="&brvbar;";
	$arr['html'][]="&sect;";
	$arr['html'][]="&uml;";
	$arr['html'][]="&copy;";
	$arr['html'][]="&ordf;";
	$arr['html'][]="&laquo;";
	$arr['html'][]="&not;";
	$arr['html'][]="&shy;";
	$arr['html'][]="&reg;";
	$arr['html'][]="&macr;";
	$arr['html'][]="&deg;";
	$arr['html'][]="&plusmn;";
	$arr['html'][]="&sup2;";
	$arr['html'][]="&sup3;";
	$arr['html'][]="&acute;";
	$arr['html'][]="&micro;";
	$arr['html'][]="&para;";
	$arr['html'][]="&middot;";
	$arr['html'][]="&cedil;";
	$arr['html'][]="&sup1;";
	$arr['html'][]="&ordm;";
	$arr['html'][]="&raquo;";
	$arr['html'][]="&frac14;";
	$arr['html'][]="&frac12;";
	$arr['html'][]="&frac34;";
	$arr['html'][]="&iquest;";
	$arr['html'][]="&Agrave;";
	$arr['html'][]="&Aacute;";
	$arr['html'][]="&Acirc;";
	$arr['html'][]="&Atilde;";
	$arr['html'][]="&Auml;";
	$arr['html'][]="&Aring;";
	$arr['html'][]="&AElig;";
	$arr['html'][]="&Ccedil;";
	$arr['html'][]="&Egrave;";
	$arr['html'][]="&Eacute;";
	$arr['html'][]="&Ecirc;";
	$arr['html'][]="&Euml;";
	$arr['html'][]="&Igrave;";
	$arr['html'][]="&Iacute;";
	$arr['html'][]="&Icirc;";
	$arr['html'][]="&Iuml;";
	$arr['html'][]="&ETH;";
	$arr['html'][]="&Ntilde;";
	$arr['html'][]="&Ograve;";
	$arr['html'][]="&Oacute;";
	$arr['html'][]="&Ocirc;";
	$arr['html'][]="&Otilde;";
	$arr['html'][]="&Ouml;";
	$arr['html'][]="&times;";
	$arr['html'][]="&Oslash;";
	$arr['html'][]="&Ugrave;";
	$arr['html'][]="&Uacute;";
	$arr['html'][]="&Ucirc;";
	$arr['html'][]="&Uuml;";
	$arr['html'][]="&Yacute;";
	$arr['html'][]="&THORN;";
	$arr['html'][]="&szlig;";
	$arr['html'][]="&agrave;";
	$arr['html'][]="&aacute;";
	$arr['html'][]="&acirc;";
	$arr['html'][]="&atilde;";
	$arr['html'][]="&auml;";
	$arr['html'][]="&aring;";
	$arr['html'][]="&aelig;";
	$arr['html'][]="&ccedil;";
	$arr['html'][]="&egrave;";
	$arr['html'][]="&eacute;";
	$arr['html'][]="&ecirc;";
	$arr['html'][]="&euml;";
	$arr['html'][]="&igrave;";
	$arr['html'][]="&iacute;";
	$arr['html'][]="&icirc;";
	$arr['html'][]="&iuml;";
	$arr['html'][]="&eth;";
	$arr['html'][]="&ntilde;";
	$arr['html'][]="&ograve;";
	$arr['html'][]="&oacute;";
	$arr['html'][]="&ocirc;";
	$arr['html'][]="&otilde;";
	$arr['html'][]="&ouml;";
	$arr['html'][]="&divide;";
	$arr['html'][]="&oslash;";
	$arr['html'][]="&ugrave;";
	$arr['html'][]="&uacute;";
	$arr['html'][]="&ucirc;";
	$arr['html'][]="&uuml;";
	$arr['html'][]="&yacute;";
	$arr['html'][]="&thorn;";
	$arr['html'][]="&yuml;";
	$arr['html'][]="&euro;";

	$arr['xml'][]="&#34;";
	$arr['xml'][]="&#38;";
	$arr['xml'][]="&#60;";
	$arr['xml'][]="&#62;";
	$arr['xml'][]="&#160;";
	$arr['xml'][]="&#161;";
	$arr['xml'][]="&#162;";
	$arr['xml'][]="&#163;";
	$arr['xml'][]="&#164;";
	$arr['xml'][]="&#165;";
	$arr['xml'][]="&#166;";
	$arr['xml'][]="&#167;";
	$arr['xml'][]="&#168;";
	$arr['xml'][]="&#169;";
	$arr['xml'][]="&#170;";
	$arr['xml'][]="&#171;";
	$arr['xml'][]="&#172;";
	$arr['xml'][]="&#173;";
	$arr['xml'][]="&#174;";
	$arr['xml'][]="&#175;";
	$arr['xml'][]="&#176;";
	$arr['xml'][]="&#177;";
	$arr['xml'][]="&#178;";
	$arr['xml'][]="&#179;";
	$arr['xml'][]="&#180;";
	$arr['xml'][]="&#181;";
	$arr['xml'][]="&#182;";
	$arr['xml'][]="&#183;";
	$arr['xml'][]="&#184;";
	$arr['xml'][]="&#185;";
	$arr['xml'][]="&#186;";
	$arr['xml'][]="&#187;";
	$arr['xml'][]="&#188;";
	$arr['xml'][]="&#189;";
	$arr['xml'][]="&#190;";
	$arr['xml'][]="&#191;";
	$arr['xml'][]="&#192;";
	$arr['xml'][]="&#193;";
	$arr['xml'][]="&#194;";
	$arr['xml'][]="&#195;";
	$arr['xml'][]="&#196;";
	$arr['xml'][]="&#197;";
	$arr['xml'][]="&#198;";
	$arr['xml'][]="&#199;";
	$arr['xml'][]="&#200;";
	$arr['xml'][]="&#201;";
	$arr['xml'][]="&#202;";
	$arr['xml'][]="&#203;";
	$arr['xml'][]="&#204;";
	$arr['xml'][]="&#205;";
	$arr['xml'][]="&#206;";
	$arr['xml'][]="&#207;";
	$arr['xml'][]="&#208;";
	$arr['xml'][]="&#209;";
	$arr['xml'][]="&#210;";
	$arr['xml'][]="&#211;";
	$arr['xml'][]="&#212;";
	$arr['xml'][]="&#213;";
	$arr['xml'][]="&#214;";
	$arr['xml'][]="&#215;";
	$arr['xml'][]="&#216;";
	$arr['xml'][]="&#217;";
	$arr['xml'][]="&#218;";
	$arr['xml'][]="&#219;";
	$arr['xml'][]="&#220;";
	$arr['xml'][]="&#221;";
	$arr['xml'][]="&#222;";
	$arr['xml'][]="&#223;";
	$arr['xml'][]="&#224;";
	$arr['xml'][]="&#225;";
	$arr['xml'][]="&#226;";
	$arr['xml'][]="&#227;";
	$arr['xml'][]="&#228;";
	$arr['xml'][]="&#229;";
	$arr['xml'][]="&#230;";
	$arr['xml'][]="&#231;";
	$arr['xml'][]="&#232;";
	$arr['xml'][]="&#233;";
	$arr['xml'][]="&#234;";
	$arr['xml'][]="&#235;";
	$arr['xml'][]="&#236;";
	$arr['xml'][]="&#237;";
	$arr['xml'][]="&#238;";
	$arr['xml'][]="&#239;";
	$arr['xml'][]="&#240;";
	$arr['xml'][]="&#241;";
	$arr['xml'][]="&#242;";
	$arr['xml'][]="&#243;";
	$arr['xml'][]="&#244;";
	$arr['xml'][]="&#245;";
	$arr['xml'][]="&#246;";
	$arr['xml'][]="&#247;";
	$arr['xml'][]="&#248;";
	$arr['xml'][]="&#249;";
	$arr['xml'][]="&#250;";
	$arr['xml'][]="&#251;";
	$arr['xml'][]="&#252;";
	$arr['xml'][]="&#253;";
	$arr['xml'][]="&#254;";
	$arr['xml'][]="&#255;";
	$arr['xml'][]="&#8364;";

You're right about the missing semi-colon, thanks for that. I've corrected it but it wasn't causing an issue.

Ray, I'm not really wanting to westernise the text, I just want the special formatting to survive into the XML. That array that you pasted will be very useful in testing the str_replace which uses the above arrays. Thanks for your post.
Excellent link too Ray, thanks for that.
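
For anyone reusing the arrays, here is a minimal sketch of how they might be fed to str_replace() before loadXML(). Only two entries are shown to keep it short, and the sample $xml string is invented for the example.

```php
<?php
// Two entries only; the full lists are in the arrays above.
$arr = array(
    'html' => array('&aacute;', '&eacute;'),
    'xml'  => array('&#225;', '&#233;'),
);

// Invented sample input containing HTML named entities.
$xml = '<p>caf&eacute; ol&aacute;</p>';

// Each entry of $arr['html'] is replaced by the entry at the same
// index of $arr['xml'].
$clean = str_replace($arr['html'], $arr['xml'], $xml);
echo $clean, PHP_EOL;   // <p>caf&#233; ol&#225;</p>

// Numeric character references need no DTD, so loadXML() should accept
// them. The guard skips this step if the DOM extension is not installed.
if (class_exists('DOMDocument')) {
    $doc = new DOMDocument();
    var_dump($doc->loadXML($clean));
}
```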
Check this post.  I will try to come up with a code snippet that would be helpful to you.
http://us3.php.net/manual/en/function.ord.php#103277
Install this and run it, then do "view source" (or just look at the source from my site here).
http://www.laprbass.com/RAY_entitize_western_letters.php

You might find a more efficient way to handle this issue.  For example, you might just test each character to see if its ord() was 128 or above, and entitize those with numerical values.  Or you might find that UTF-8 is not the encoding you want to use.
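
A rough sketch of the "128 or above" test described above, for UTF-8 input. It assumes PHP 7.2 or later with the mbstring extension (for mb_ord()); the function name entitizeNonAscii is made up for this illustration and is not part of the script below.

```php
<?php
// Hypothetical helper: replace every non-ASCII character in a UTF-8
// string with its decimal numeric entity, e.g. U+00E1 becomes &#225;.
function entitizeNonAscii(string $utf8): string
{
    $result = preg_replace_callback(
        '/[^\x00-\x7F]/u',   // any character above code point 127
        function (array $m): string {
            // mb_ord() returns the full Unicode code point, unlike ord(),
            // which returns only the first byte of a multi-byte sequence.
            return '&#' . mb_ord($m[0], 'UTF-8') . ';';
        },
        $utf8
    );

    // preg_replace_callback() returns null if the input is not valid
    // UTF-8; in that case the input is handed back unchanged.
    return $result ?? $utf8;
}

// "\u{...}" escapes keep this example independent of the file's encoding.
echo entitizeNonAscii("Arma\u{E7}\u{E3}o de P\u{EA}ra \u{20AC}5"), PHP_EOL;
// Arma&#231;&#227;o de P&#234;ra &#8364;5
```

Plain ASCII passes through untouched, so markup such as tags and existing entities is left as it is.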

Best regards, ~Ray
<?php // RAY_entitize_western_letters.php
error_reporting(E_ALL);


// DEMONSTRATE HOW TO TRANSLATE SOME WESTERN CHARACTERS INTO ENGLISH-PRINTABLE OR ENTITIES


// TEST CASES
$arr
= array
( 'Françoise'
, 'ßeta or Beta?'
, 'ENCYCLOPÆDIA'
, 'ça va! mon élève mi niña?'
, 'A stealthy ƒart'
, 'Jean "Ðango" Reinhardt of Pont-à-Celles'
)
;

// DISPLAY EACH TEST CASE
foreach ($arr as $str)
{
    echo PHP_EOL
    . '<br/>'
    . $str
    . ' = '
    . '<strong>'
    . mungstring($str)
    . '</strong>'
    ;
}

// EXAMPLE SHOWING HOW TO TURN A PORTUGUESE NAME INTO PART OF A URL STRING
$str = 'Armação de Pêra';
$new = mungString($str);
$new = strtolower($new);
$new = str_replace(' ', '-', $new);

// SHOW THE URL STRING
echo PHP_EOL
. '<br/>'
. '<strong>'
. '<a target="blank" href="http://lmgtfy.com?q='
. htmlentities(mungstring($new))
. '">'
. $str
. '</a>'
. '</strong>'
;


$str = 'Armação de Pêra';
$new = mungString($str, 'FOO');
echo "<pre>";
foreach ($new as $chr)
{
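    // NOTE: ord() GIVES THE UNICODE CODE POINT ONLY FOR SINGLE-BYTE ISO-8859-1 CHARACTERS; IN A UTF-8 FILE IT RETURNS JUST THE FIRST BYTE
    echo PHP_EOL . $chr . '=' . '&#' . ord($chr) . ';' ;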
}

// A FUNCTION TO RETURN THE WESTERNIZED STRING
function mungString($str, $return='TEXT')
{
    // OUR REPLACEMENT ARRAY (MAY WANT SOME CHANGES HERE)
    static
    $normal
    = array
    ( 'ƒ' => 'f'  // http://en.wikipedia.org/wiki/%C6%91 florin
    , 'Š' => 'S'  // http://en.wikipedia.org/wiki/%C5%A0 S-caron (voiceless postalveolar fricative)
    , 'š' => 's'  // http://en.wikipedia.org/wiki/%C5%A0 s-caron
    , 'Ð' => 'Dj' // http://en.wikipedia.org/wiki/Eth (voiced dental fricative)
    , 'Ž' => 'Z'  // http://en.wikipedia.org/wiki/%C5%BD Z-caron (voiced postalveolar fricative)
    , 'ž' => 'z'  // http://en.wikipedia.org/wiki/%C5%BD z-caron
    , 'À' => 'A'
    , 'Á' => 'A'
    , 'Â' => 'A'
    , 'Ã' => 'A'
    , 'Ä' => 'A'
    , 'Å' => 'A'
    , 'Æ' => 'E'
    , 'Ç' => 'C'
    , 'È' => 'E'
    , 'É' => 'E'
    , 'Ê' => 'E'
    , 'Ë' => 'E'
    , 'Ì' => 'I'
    , 'Í' => 'I'
    , 'Î' => 'I'
    , 'Ï' => 'I'
    , 'Ñ' => 'N'
    , 'Ò' => 'O'
    , 'Ó' => 'O'
    , 'Ô' => 'O'
    , 'Õ' => 'O'
    , 'Ö' => 'O'
    , 'Ø' => 'O'
    , 'Ù' => 'U'
    , 'Ú' => 'U'
    , 'Û' => 'U'
    , 'Ü' => 'U'
    , 'Ý' => 'Y'
    , 'Þ' => 'B'
    , 'ß' => 'Ss'
    , 'à' => 'a'
    , 'á' => 'a'
    , 'â' => 'a'
    , 'ã' => 'a'
    , 'ä' => 'a'
    , 'å' => 'a'
    , 'æ' => 'e'
    , 'ç' => 'c'
    , 'è' => 'e'
    , 'é' => 'e'
    , 'ê' => 'e'
    , 'ë' => 'e'
    , 'ì' => 'i'
    , 'í' => 'i'
    , 'î' => 'i'
    , 'ï' => 'i'
    , 'ð' => 'o'
    , 'ñ' => 'n'
    , 'ò' => 'o'
    , 'ó' => 'o'
    , 'ô' => 'o'
    , 'õ' => 'o'
    , 'ö' => 'o'
    , 'ø' => 'o'
    , 'ù' => 'u'
    , 'ú' => 'u'
    , 'û' => 'u'
    , 'ü' => 'u'
    , 'ý' => 'y'
    , 'þ' => 'b'
    , 'ÿ' => 'y'
    )
    ;
    // RETURN THE "TRANSLATED" TEXT
    if ($return == 'TEXT') return strtr($str, $normal);

    // MIGHT BE USEFUL TO GET THE LIST OF ORIGINAL LETTERS
    return array_keys($normal);
}

ASKER CERTIFIED SOLUTION
Ray Paseur

Hi Ray,

Yes, using entities through the mungstring function provides the right output. Now I'll see if I can implement it into my project. Thanks for your help on this one, I've learnt a lot about Unicode!

John
Thanks for the points - it's a great question, ~Ray