Solved

Character encoding problem

Posted on 2014-02-03
Last Modified: 2014-04-01
I am trying to write some foreign characters into an XML file, but some of the characters are not being encoded correctly. I am using UTF-8 in the XML header.
The following characters work fine:

ÀÁÂÃÄÇÈÉÊËÌÍÎÏÑÒÓÔÕÖÙÚÛÜàáâãäåæç

But the following characters: œ c a e l n s z z
get converted into the following HTML numeric entities when written into the XML:
&#263 &#261 &#281 &#322 &#324 &#347 &#378 &#380

How can I write them in their exact original form?

Thanks.
Question by:Herci
LVL 110

Expert Comment

by:Ray Paseur
ID: 39829190
You may find some ideas in this article.
http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/A_11880-Unicode-PHP-and-Character-Collisions.html

For us to offer any specific help, we would need to see the test data set and see how it interacts with the program code that creates the XML document.  The numeric character entities would seem to be good "visually" when the XML is rendered by a browser, but there is nothing that inherently changes the UTF-8 characters into numeric entities without a specific programmatic step.
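
As a rough illustration of that point, here is a minimal sketch of writing text into XML with no step that produces numeric entities. It assumes the input string is already UTF-8 and that the script file itself is saved as UTF-8. The function name build_xml() and the <names> element are made up for this example and are not taken from the original program.

```php
<?php
// Hypothetical helper: build a tiny XML document by hand.
// htmlspecialchars() with ENT_XML1 escapes only the characters XML reserves
// (&, <, >, and quotes) and leaves multi-byte UTF-8 characters untouched.
function build_xml(string $text): string
{
    $escaped = htmlspecialchars($text, ENT_XML1 | ENT_QUOTES, 'UTF-8');

    return '<?xml version="1.0" encoding="UTF-8"?>' . "\n"
         . '<names>' . $escaped . '</names>' . "\n";
}

// Assumes this file is stored as UTF-8, so the literal below is UTF-8 bytes.
$xml = build_xml('ćąęłńśźż & œ');

echo $xml;
// <?xml version="1.0" encoding="UTF-8"?>
// <names>ćąęłńśźż &amp; œ</names>
```

If the real program calls something like htmlentities() on the text before writing it, that would be the kind of explicit step worth looking for.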
LVL 34

Accepted Solution

by:
Slick812 earned 250 total points
ID: 39830057
Greetings Herci. Unfortunately for you, this problem with your "foreign characters" may not be something you can solve without some awkward changes in how you use PHP. In my opinion, just having the "UTF-8 header" in a document almost never solves problems with "foreign characters" if they are two-byte (multi-byte) characters. PHP as used for the English language is set up to work with single-byte characters, although there are the PHP multi-byte strings and functions. You can see some of the functions in the manual here:
      http://php.net/manual/en/ref.mbstring.php
At the top of that page it says, "Multibyte character encoding schemes and their related issues are fairly complicated," so these issues are often difficult to deal with.
When I view this question in my browser, it says that the last two characters are:
&#378 &#380

and yet in my browser I see them as two English-language "z" characters, so I see this:
       but the following characters  œ c a e l n s z z
The numbers in &#378 &#380 show me that these are multi-byte characters, as a single byte can NOT go above 255 (&#255).
I would think that these:
    &#263 &#261 &#281 &#322 &#324 &#347 &#378 &#380
were sent up in a post from a form, and that the post translated the multi-byte characters to their decimal HTML equivalents.
Either way, HTML entities such as &#347 can NOT be stored in single-byte character sets (English, French).
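
To make the byte counts concrete, here is a small sketch that turns decimal references like the ones above back into UTF-8 and shows that each character takes two bytes. It assumes the references arrive with terminating semicolons (they are shown without them in the question), and the variable names are only for illustration.

```php
<?php
// Decimal numeric character references, as they might arrive from a form post.
// The trailing semicolons are added here so html_entity_decode() recognises them.
$posted = '&#263; &#261; &#281; &#322; &#324; &#347; &#378; &#380;';

// Decode the references into real UTF-8 characters.
$utf8 = html_entity_decode($posted, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $utf8, PHP_EOL;                     // ć ą ę ł ń ś ź ż

// strlen() counts bytes: 8 characters x 2 bytes + 7 spaces = 23.
echo strlen($utf8), PHP_EOL;             // 23

// The last character (code point 380) is stored as the two bytes C5 BC.
echo bin2hex(substr($utf8, -2)), PHP_EOL; // c5bc
```

Whether decoding is the right fix depends on where the references are introduced, which the thread does not establish.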
LVL 110

Assisted Solution

by:Ray Paseur
Ray Paseur earned 250 total points
ID: 39830216
To try to shed a little more light on it, here are links to two scripts.  The scripts are identical, except that one of the scripts is stored in ANSI and the other is stored in UTF-8.  As you can see, they produce different output.  Single-byte characters above code point 127 are not valid UTF-8, and all of these "special" characters are above that code point in ANSI.  So in UTF-8 they have to be represented by a multi-byte character.

http://www.laprbass.com/RAY_temp_herci_ansi.php
http://www.laprbass.com/RAY_temp_herci_utf8.php

You might try copying the UTF-8 version of the script and adding a meta tag to tell the browser that you've got UTF-8 output.

<?php // RAY_temp_herci_ansi.php
ini_set('display_errors', TRUE);
error_reporting(E_ALL);
echo '<pre>';


// SEE http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/Q_28354854.html
// THIS VERSION OF THE SCRIPT IS CREATED IN ANSI AND STORED IN ANSI


$str = 'ÀÁÂÃÄÇÈÉÊËÌÍÎÏÑÒÓÔÕÖÙÚÛÜàáâãäåæç';

hexdump($str);

// SHOW A SHORT HEX STRING BYTE-BY-BYTE
function hexdump($str, $br=PHP_EOL)
{
    if (empty($str)) return FALSE;

    // GET THE HEX BYTE VALUES IN A STRING
    $hex = str_split(implode('', unpack('H*', $str)));

    // ALLOCATE BYTES INTO HI AND LO NIBBLES
    $hi  = NULL;
    $lo  = NULL;
    $mod = 0;
    foreach ($hex as $nib)
    {
        $mod++;
        $mod = $mod % 2;
        if ($mod)
        {
            $hi .= $nib;
        }
        else
        {
            $lo .= $nib;
        }
    }

    // SHOW THE SCALE, THE STRING AND THE HEX
    $num = substr('1...5...10...15...20...25...30...35...40...45...50...55...60...65...70...75...80...85...90...95..100..105..110..115..120..125..130', 0, strlen($str));
    echo $br . $num;
    echo $br . $str;
    echo $br . $hi;
    echo $br . $lo;
    echo $br;
}


<?php // RAY_temp_herci_utf8.php
ini_set('display_errors', TRUE);
error_reporting(E_ALL);
echo '<pre>';


// SEE http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/Q_28354854.html
// THIS VERSION OF THE SCRIPT IS CREATED IN UTF8 AND STORED IN UTF8


$str = 'ÀÁÂÃÄÇÈÉÊËÌÍÎÏÑÒÓÔÕÖÙÚÛÜàáâãäåæç';

hexdump($str);

// SHOW A SHORT HEX STRING BYTE-BY-BYTE
function hexdump($str, $br=PHP_EOL)
{
    if (empty($str)) return FALSE;

    // GET THE HEX BYTE VALUES IN A STRING
    $hex = str_split(implode('', unpack('H*', $str)));

    // ALLOCATE BYTES INTO HI AND LO NIBBLES
    $hi  = NULL;
    $lo  = NULL;
    $mod = 0;
    foreach ($hex as $nib)
    {
        $mod++;
        $mod = $mod % 2;
        if ($mod)
        {
            $hi .= $nib;
        }
        else
        {
            $lo .= $nib;
        }
    }

    // SHOW THE SCALE, THE STRING AND THE HEX
    $num = substr('1...5...10...15...20...25...30...35...40...45...50...55...60...65...70...75...80...85...90...95..100..105..110..115..120..125..130', 0, strlen($str));
    echo $br . $num;
    echo $br . $str;
    echo $br . $hi;
    echo $br . $lo;
    echo $br;
}
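
For a quicker check than reading the hex dump, the following sketch tests whether a string is valid UTF-8 and converts single-byte text to UTF-8. It is only an illustration: it assumes "ANSI" here means ISO-8859-1 (the sample characters have the same byte values in Windows-1252), it assumes the iconv extension is available, and is_valid_utf8() is a made-up helper name.

```php
<?php
// Hypothetical helper: preg_match() with the /u modifier returns false
// when the subject is not valid UTF-8, and 1 when the empty pattern matches.
function is_valid_utf8(string $s): bool
{
    return preg_match('//u', $s) === 1;
}

// À É ç written as single bytes, the way an ANSI-stored script would hold them.
$ansi = "\xC0\xC9\xE7";

// Convert the single-byte text to UTF-8 (assumes the input is ISO-8859-1).
$utf8 = iconv('ISO-8859-1', 'UTF-8', $ansi);

var_dump(is_valid_utf8($ansi));   // bool(false)
var_dump(is_valid_utf8($utf8));   // bool(true)

echo bin2hex($ansi), PHP_EOL;     // c0c9e7
echo bin2hex($utf8), PHP_EOL;     // c380c389c3a7
```

Each single byte above 127 becomes a two-byte sequence after conversion, which is the difference the two scripts above display.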


Author Closing Comment

by:Herci
ID: 39968004
I've still not figured out a solution for this yet and that's why it took a long time to give an update. I've decided to close this question but I will keep your answers in mind and carry on doing further research on this. Thanks a lot.
LVL 110

Expert Comment

by:Ray Paseur
ID: 39968258
A month and a half?  And you still could not respond, then you gave a bad grade?  What were you expecting?  Please read the grading guidelines then explain why you gave the bad grade without any response or explanation!  Nobody does this at EE.  What was wrong?
http://support.experts-exchange.com/customer/portal/articles/481419