> XML is UTF-8
This is not true.
XML can be UTF-8, but it doesn't have to be.
CDATA sections mean that you don't have to escape the famous five,
but they don't make illegal characters legal.
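To illustrate the CDATA point, here is a small sketch in Python (your script is Perl, so take it only as a demonstration; the sample strings are made up). It uses the standard library's xml.etree.ElementTree parser. CDATA lets & and < through unescaped, but a byte that is invalid for the document's encoding is still rejected.

```python
import xml.etree.ElementTree as ET

# Inside CDATA, & and < may appear without escaping.
ok = b"<note><![CDATA[Tom & Jerry <3]]></note>"
cdata_text = ET.fromstring(ok).text

# 0xE9 is "é" in ISO-8859-1, but it is not valid UTF-8 in this position.
# With no declaration the parser assumes UTF-8, and CDATA does not change that.
bad = b"<note><![CDATA[caf\xe9 au lait]]></note>"
try:
    ET.fromstring(bad)
    cdata_rescued_it = True
except ET.ParseError:
    cdata_rescued_it = False
```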
You can state the encoding that was used in the XML through the XML declaration.
For example, <?xml version="1.0" encoding="ISO-8859-1" ?>
indicates that the encoding used is ISO-8859-1 (ISO Latin 1).
If the declaration is left out, the default is UTF-8, and UTF-8 is the most commonly used encoding.
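A quick way to see the effect of the declaration, sketched in Python with the standard library parser (the element name and sample word are made up for the example): the same ISO-8859-1 bytes parse fine with the declaration and fail without it.

```python
import xml.etree.ElementTree as ET

# "é" becomes the single byte 0xE9 in ISO-8859-1.
body = "<word>caf\u00e9</word>".encode("iso-8859-1")

# With the declaration, the parser decodes the bytes as ISO-8859-1.
declared = b'<?xml version="1.0" encoding="ISO-8859-1" ?>' + body
declared_text = ET.fromstring(declared).text

# Without it, the parser assumes UTF-8 and rejects the lone 0xE9 byte.
try:
    ET.fromstring(body)
    undeclared_parsed = True
except ET.ParseError:
    undeclared_parsed = False
```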
You really have to find out what encoding your Perl script is generating
and state that in the declaration.
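If you are not sure what the script produces, one rough check is whether the output bytes decode cleanly as UTF-8. The sketch below is Python, and the function name guess_declared_encoding is made up for this example. The ISO-8859-1 fallback is only an assumption: every byte sequence decodes as ISO-8859-1, so the fallback cannot confirm that encoding; it only tells you the data is not UTF-8.

```python
def guess_declared_encoding(data: bytes) -> str:
    """Return "UTF-8" if the bytes decode cleanly as UTF-8, else a fallback guess."""
    try:
        data.decode("utf-8")
        return "UTF-8"
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 is treated as ISO-8859-1.
        # It could just as well be Windows-1252 or another 8-bit encoding.
        return "ISO-8859-1"


utf8_guess = guess_declared_encoding("caf\u00e9".encode("utf-8"))
latin1_guess = guess_declared_encoding("caf\u00e9".encode("iso-8859-1"))
```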
Why don't you test by adding this at the front of your XML:
<?xml version="1.0" encoding="ISO-8859-1" ?>
It might already work.
If you can't find the correct encoding
(this could be the case if you are merging data from texts, databases, etc., and you are dealing with mixed encodings),
you could add filters that map certain characters to their Unicode number as a numeric character reference, like this: &#233; for "é".
This number is correct regardless of the encoding.
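Such a filter can be sketched in Python as follows (the helper name to_char_refs is made up; in Perl you would need the equivalent). It relies on the built-in "xmlcharrefreplace" error handler, which turns every non-ASCII character into a numeric character reference. Note that it does not escape &, < or >, so that still has to happen separately.

```python
import xml.etree.ElementTree as ET


def to_char_refs(text: str) -> bytes:
    """Replace every non-ASCII character with a numeric character reference."""
    return text.encode("ascii", errors="xmlcharrefreplace")


payload = to_char_refs("caf\u00e9")

# The payload is pure ASCII now, so it parses the same under either declaration.
results = []
for enc in ("ISO-8859-1", "UTF-8"):
    doc = (b'<?xml version="1.0" encoding="' + enc.encode("ascii")
           + b'" ?><word>' + payload + b"</word>")
    results.append(ET.fromstring(doc).text)
```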
By the way,
XML is Unicode; the encoding is just the binary representation. UTF-8 is simply one such representation, but there are many.
I think that is what archang3l meant to say.
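The difference between the Unicode character and its binary representation is easy to see in a few lines of Python (just an illustration, using only built-in string methods): the code point stays the same, while the bytes depend on the encoding.

```python
char = "\u00e9"  # the character "é"

# The Unicode number is the same no matter how the character is stored.
code_point = ord(char)

# The bytes differ from one encoding to the next.
as_latin1 = char.encode("iso-8859-1")
as_utf8 = char.encode("utf-8")
as_utf16 = char.encode("utf-16-be")
```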
cheers
Geert

by: archang3l | Posted on 2007-10-03 at 22:46:00 | ID: 20012002
Hello dkim18,
XML is UTF-8, so foreign language characters should not be any problem.
There are five characters that are markup delimiters in XML. Of these, & and < can never appear in their literal form in XML character data (such as the text value of an element); the other three only need escaping in certain places, but escaping all five is always safe. If these characters are needed as literals, the following named entities should be used:
* &amp; for & (ampersand)
* &lt; for < (left angle bracket, less-than sign)
* &gt; for > (right angle bracket, greater-than sign)
* &quot; for " (quotation mark)
* &apos; for ' (apostrophe)
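As a rough sketch of applying the list above in code, here is a Python version using escape() from the standard library's xml.sax.saxutils (the helper name escape_all_five and the sample sentence are made up). By default escape() handles &, < and >; the extra mapping adds the two quote characters.

```python
from xml.sax.saxutils import escape

# escape() covers &, < and > on its own; this mapping adds the two quotes.
QUOTES = {'"': "&quot;", "'": "&apos;"}


def escape_all_five(text: str) -> str:
    """Replace all five markup delimiters with their named entities."""
    return escape(text, QUOTES)


escaped = escape_all_five('Tom & Jerry say "1 < 2" and it\'s > 0')
```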
Regards,
archang3l