Solved

Character encoding problem

Posted on 2014-02-03
Last Modified: 2014-04-01
I am trying to write some foreign characters into an XML file, but some of them are not encoded correctly. The XML header declares UTF-8.
The following characters work fine:

ÀÁÂÃÄÇÈÉÊËÌÍÎÏÑÒÓÔÕÖÙÚÛÜàáâãäåæç

But the following characters: œ c a e l n s z z
are converted into these HTML entities when written into the XML:
&#263 &#261 &#281 &#322 &#324 &#347 &#378 &#380

How can I write them exactly as they are?

Thanks.
Question by:Herci
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 39829190
You may find some ideas in this article.
http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/A_11880-Unicode-PHP-and-Character-Collisions.html

For us to offer any specific help, we would need to see the test data set and see how it interacts with the program code that creates the XML document.  The numeric character entities would seem to be good "visually" when the XML is rendered by a browser, but there is nothing that inherently changes the UTF-8 characters into numeric entities without a specific programmatic step.
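
As one illustration of that point, here is a minimal sketch of writing such characters to XML with PHP's DOM extension. It assumes the input strings are already valid UTF-8 and that ext-dom is available; the function name `build_utf8_xml` and the element names `words` and `word` are made up for the example and are not taken from the asker's code. With the document encoding set to UTF-8, the serializer should write the characters literally rather than as numeric references.

```php
<?php
// Hypothetical helper: builds a small XML document from UTF-8 strings.
// Assumes ext-dom is loaded and every input string is valid UTF-8.
function build_utf8_xml(array $words): string
{
    // Declaring the encoding here is what keeps non-ASCII text literal
    // in the serialized output.
    $doc  = new DOMDocument('1.0', 'UTF-8');
    $root = $doc->createElement('words');
    $doc->appendChild($root);

    foreach ($words as $word) {
        $node = $doc->createElement('word');
        // createTextNode() escapes markup characters such as & and <.
        $node->appendChild($doc->createTextNode($word));
        $root->appendChild($node);
    }

    return $doc->saveXML();
}

// "\u{107}" is c-acute (decimal 263), "\u{142}" is l-stroke (decimal 322).
$xml = build_utf8_xml(["\u{107}", "\u{142}"]);
echo $xml;
```

If the output of a script like this still shows `&#...;` references, the conversion is probably happening earlier, before the data reaches the XML writer.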
 
LVL 33

Accepted Solution

by:
Slick812 earned 250 total points
ID: 39830057
Greetings Herci. Unfortunately, this problem with your "foreign characters" may not be something you can solve without some awkward changes to how you use PHP. In my opinion, just having the "UTF-8 header" in a document almost never solves problems with "foreign characters" if they are multi-byte ("TWOBYTE") characters. PHP's ordinary string functions work on single bytes, although there are the PHP multi-byte string functions; you can see some of them in the manual here:
      http://php.net/manual/en/ref.mbstring.php
At the top of that page it says "Multibyte character encoding schemes and their related issues are fairly complicated", so these issues are often difficult to deal with.
When I view this question in my browser, it says that the last two characters are:
&#378 &#380

and yet my browser shows them as two English "z" characters, so I see this:
       but the following characters  œ c a e l n s z z
The numbers in &#378 &#380 show me that these are multi-byte characters, as a single byte can NOT go above 255 (&#255).
I would think that these:
    &#263 &#261 &#281 &#322 &#324 &#347 &#378 &#380
were sent from a form post, and that the post translated the multi-byte characters to their decimal HTML equivalents.
Either way, HTML entities such as &#347 can NOT be stored in single-byte character sets (English, French).
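
If the entities really do arrive from a form post, one possible workaround is to turn them back into real UTF-8 characters before writing the XML. This is only a minimal sketch: it assumes the references arrive with their terminating semicolons (for example `&#263;`) and that the rest of the string is already UTF-8. The name `decode_numeric_entities` is invented for this example.

```php
<?php
// Hypothetical helper: converts numeric character references such as
// &#263; back into UTF-8 characters. Named entities (&amp;, &lt;) are
// decoded too, so escape the result again before hand-building XML.
function decode_numeric_entities(string $text): string
{
    return html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

// The eight references from the question, with semicolons added.
$posted  = '&#263; &#261; &#281; &#322; &#324; &#347; &#378; &#380;';
$decoded = decode_numeric_entities($posted);

echo $decoded, PHP_EOL;         // the eight letters as UTF-8 text
echo strlen($decoded), PHP_EOL; // byte count: 8 two-byte letters + 7 spaces = 23
```

The byte count illustrates the point about code points above 255: each of these letters takes two bytes in UTF-8.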
 
LVL 108

Assisted Solution

by:Ray Paseur
Ray Paseur earned 250 total points
ID: 39830216
To try to shed a little more light on it, here are links to two scripts.  The scripts are identical, except that one is stored in ANSI and the other in UTF-8.  As you can see, they produce different output.  A single byte with a value above 127 is not valid UTF-8 on its own, and all of these "special" characters sit above that value in ANSI.  So in UTF-8 each of them has to be represented by a multi-byte sequence.

http://www.laprbass.com/RAY_temp_herci_ansi.php
http://www.laprbass.com/RAY_temp_herci_utf8.php

You might try copying the UTF-8 version of the script and adding a meta tag to tell the browser that you've got UTF-8 output.
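
A rough sketch of that suggestion follows. It assumes the script file itself is saved as UTF-8; the function name `utf8_page` and the form field `word` are invented for the example. Declaring UTF-8 also matters for forms: browsers typically submit form data in the page's encoding and replace characters that encoding cannot represent with decimal references such as `&#263;`.

```php
<?php
// Hypothetical helper: wraps body markup in a page that declares UTF-8.
// Assumes $body is already valid UTF-8 and safe to output as HTML.
function utf8_page(string $body): string
{
    return "<!DOCTYPE html>\n"
         . "<html>\n<head>\n"
         . "<meta charset=\"utf-8\">\n"
         . "<title>UTF-8 test</title>\n"
         . "</head>\n<body>\n"
         . $body . "\n"
         . "</body>\n</html>\n";
}

// When served over HTTP, send the matching header before any output:
// header('Content-Type: text/html; charset=utf-8');

$form = '<form method="post" accept-charset="utf-8">'
      . '<input name="word"><input type="submit">'
      . '</form>';

$page = utf8_page($form);
echo $page;
```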

<?php // RAY_temp_herci_ansi.php
ini_set('display_errors', TRUE);
error_reporting(E_ALL);
echo '<pre>';


// SEE http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/Q_28354854.html
// THIS VERSION OF THE SCRIPT IS CREATED IN ANSI AND STORED IN ANSI


$str = 'ÀÁÂÃÄÇÈÉÊËÌÍÎÏÑÒÓÔÕÖÙÚÛÜàáâãäåæç';

hexdump($str);

// SHOW A SHORT HEX STRING BYTE-BY-BYTE
function hexdump($str, $br=PHP_EOL)
{
    if (empty($str)) return FALSE;

    // GET THE HEX BYTE VALUES IN A STRING
    $hex = str_split(implode('', unpack('H*', $str)));

    // ALLOCATE BYTES INTO HI AND LO NIBBLES
    $hi  = NULL;
    $lo  = NULL;
    $mod = 0;
    foreach ($hex as $nib)
    {
        $mod++;
        $mod = $mod % 2;
        if ($mod)
        {
            $hi .= $nib;
        }
        else
        {
            $lo .= $nib;
        }
    }

    // SHOW THE SCALE, THE STRING AND THE HEX
    $num = substr('1...5...10...15...20...25...30...35...40...45...50...55...60...65...70...75...80...85...90...95..100..105..110..115..120..125..130', 0, strlen($str));
    echo $br . $num;
    echo $br . $str;
    echo $br . $hi;
    echo $br . $lo;
    echo $br;
}


<?php // RAY_temp_herci_utf8.php
ini_set('display_errors', TRUE);
error_reporting(E_ALL);
echo '<pre>';


// SEE http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/Q_28354854.html
// THIS VERSION OF THE SCRIPT IS CREATED IN UTF8 AND STORED IN UTF8


$str = 'ÀÁÂÃÄÇÈÉÊËÌÍÎÏÑÒÓÔÕÖÙÚÛÜàáâãäåæç';

hexdump($str);

// SHOW A SHORT HEX STRING BYTE-BY-BYTE
function hexdump($str, $br=PHP_EOL)
{
    if (empty($str)) return FALSE;

    // GET THE HEX BYTE VALUES IN A STRING
    $hex = str_split(implode('', unpack('H*', $str)));

    // ALLOCATE BYTES INTO HI AND LO NIBBLES
    $hi  = NULL;
    $lo  = NULL;
    $mod = 0;
    foreach ($hex as $nib)
    {
        $mod++;
        $mod = $mod % 2;
        if ($mod)
        {
            $hi .= $nib;
        }
        else
        {
            $lo .= $nib;
        }
    }

    // SHOW THE SCALE, THE STRING AND THE HEX
    $num = substr('1...5...10...15...20...25...30...35...40...45...50...55...60...65...70...75...80...85...90...95..100..105..110..115..120..125..130', 0, strlen($str));
    echo $br . $num;
    echo $br . $str;
    echo $br . $hi;
    echo $br . $lo;
    echo $br;
}
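
To see the same difference without opening the two URLs, here is a small sketch that builds the ANSI form of one character from its raw byte and converts it. It assumes "ANSI" here means ISO-8859-1 (Windows-1252 agrees for this character) and that the mbstring extension is loaded.

```php
<?php
// 0xC0 is "À" in ISO-8859-1; on its own it is not a valid UTF-8 sequence.
$ansi = "\xC0";
$utf8 = mb_convert_encoding($ansi, 'UTF-8', 'ISO-8859-1');

var_dump(mb_check_encoding($ansi, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($utf8, 'UTF-8')); // bool(true)

echo bin2hex($ansi), PHP_EOL; // c0
echo bin2hex($utf8), PHP_EOL; // c380
```

The same character is one byte in the ANSI file and two bytes in the UTF-8 file, which is why the two hex dumps differ.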

 

Author Closing Comment

by:Herci
ID: 39968004
I've still not figured out a solution for this yet and that's why it took a long time to give an update. I've decided to close this question but I will keep your answers in mind and carry on doing further research on this. Thanks a lot.
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 39968258
A month and a half?  And you still could not respond, then you gave a bad grade?  What were you expecting?  Please read the grading guidelines then explain why you gave the bad grade without any response or explanation!  Nobody does this at EE.  What was wrong?
http://support.experts-exchange.com/customer/portal/articles/481419