Be sure to convert your xml file to UTF-8 before sending.
UTF-8 should have those characters converted to their equiv &x0000.
We supply an XML feed to various third parties and specify encoding='utf-8' in the XML file. We've had a couple of calls this week from a client who could not parse the XML file: one entry contained the vertical bar character (|) and another contained an e-acute character (é) rather than a plain e.
Apparently these characters are not valid in UTF-8 encoding ...
Is there a list of characters (that are not valid in UTF-8 encoding) that we should be filtering out when producing our XML file, or should we be using some encoding other than UTF-8?
> Be sure to convert your xml file to UTF-8 before sending.
that is one of the two options from my post
> UTF-8 should have those characters converted to their equiv &x0000.
"should" is not the correct phrase here and your format is a bit dubious
é could be converted to &#233; (decimal) or &#xE9; (hexadecimal) but basically that is working around the problem
you don't want to convert the entire list of multibyte characters into numeric character references.
This is surely not the approach I would take
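For reference, this is roughly what that workaround looks like. A small sketch in Python (my choice for illustration; nothing in this thread uses Python), showing the two reference forms for é and Python's built-in `xmlcharrefreplace` error handler. The sample string is made up.

```python
# é is U+00E9, i.e. code point 233 in decimal.
ch = "\u00e9"

decimal_ref = f"&#{ord(ch)};"    # "&#233;"
hex_ref = f"&#x{ord(ch):X};"     # "&#xE9;"

# The 'xmlcharrefreplace' handler writes a decimal reference for every
# character the target encoding cannot represent. The vertical bar is
# plain ASCII, so it passes through untouched.
sample = "caf\u00e9 | bar"       # hypothetical feed value
ascii_safe = sample.encode("ascii", errors="xmlcharrefreplace")

print(decimal_ref, hex_ref, ascii_safe)
```

It produces a file that parses under any ASCII-compatible declaration, but it hides the real question of which encoding the bytes are in.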
You need to find the correct encoding and either pass that encoding (which would make the XML well-formed)
or transform the file into a real UTF-8 file (and then XSLT is the best option)
cheers
Geert
>You have to find out what the encoding really is,
>where does it come from? from a database, from a windows text editor?
>likely the correct encoding is ISO-8859-1 or Windows-1252
>If you pass that encoding with the XML, it will be valid for your customer
Thanks for pointing this out - the data in the XML file comes from a SQL Server database, so I will look into the possibilities of using ISO-8859-1 as an alternative encoding ....
by: Gertone, posted on 2007-09-13 at 02:16:30, ID: 19882421
you sort of take the wrong approach
é and | can be expressed perfectly well in UTF-8
The problem is that what you think is UTF-8, simply isn't
so you have to make sure that you find out what the encoding really is and pass that encoding with the XML declaration
encodings specify how a certain character is stored on disk (as bytes)
UTF-8 is a mechanism that uses one byte for the first 128 characters (U+0000 to U+007F),
two bytes for the next 1,920 (U+0080 to U+07FF), three bytes up to U+FFFF and four bytes beyond that
The first byte of a multi-byte sequence is a lead byte whose high bits signal how many bytes follow
The first 128 characters (actually most of the control characters are forbidden in XML, so there are fewer)
are encoded exactly the same way as in ASCII or ISO-8859-1,
so if your document only has pretty simple characters, you won't spot the difference in encoding
That is why you thought you could safely label your XML as UTF-8, which it obviously isn't
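To make that concrete, here is a minimal Python sketch (Python is my choice for illustration; the sample strings and the helper `is_valid_utf8` are made up). It shows that plain ASCII text gives identical bytes in all three encodings, while é does not, and that a lone ISO-8859-1 é byte is not valid UTF-8.

```python
# Pure ASCII text, including the vertical bar, is byte-identical
# in ASCII, ISO-8859-1 and UTF-8.
plain = "plain | text"
same_bytes = (
    plain.encode("ascii")
    == plain.encode("iso-8859-1")
    == plain.encode("utf-8")
)

# é (U+00E9) is where the encodings diverge.
e_acute = "\u00e9"
latin1_bytes = e_acute.encode("iso-8859-1")   # b'\xe9'      (one byte)
utf8_bytes = e_acute.encode("utf-8")          # b'\xc3\xa9'  (two bytes)


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(same_bytes, latin1_bytes, utf8_bytes)
print(is_valid_utf8(b"caf\xe9"), is_valid_utf8(b"caf\xc3\xa9"))
```

A check like `is_valid_utf8` run over the generated feed would have flagged the e-acute entry before the customer's parser did.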
This is a common mistake: people create XML from various sources and then forget to put the correct XML declaration
(this is when you use tools that are not XML aware to create the XML)
You have to find out what the encoding really is,
where does it come from? from a database, from a windows text editor?
likely the correct encoding is ISO-8859-1 or Windows-1252
If you pass that encoding with the XML, it will be valid for your customer
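A small sketch of the difference the declaration makes, assuming the data really is ISO-8859-1. It uses Python's standard `xml.etree.ElementTree` parser as a stand-in for whatever parser the customer runs; the `<item>` element and the `try_parse` helper are invented for the example.

```python
import xml.etree.ElementTree as ET

# The same ISO-8859-1 bytes, once with the wrong label and once with
# the right one. The element name and content are hypothetical.
body = "<item>caf\u00e9 | bar</item>".encode("iso-8859-1")
mislabelled = b'<?xml version="1.0" encoding="utf-8"?>' + body
correct = b'<?xml version="1.0" encoding="ISO-8859-1"?>' + body


def try_parse(data: bytes):
    """Return the root element's text, or None if the parser rejects it."""
    try:
        return ET.fromstring(data).text
    except ET.ParseError:
        return None


print(try_parse(mislabelled))   # rejected: the é byte is not valid UTF-8
print(try_parse(correct))       # parsed, text contains é and |
```

Note that the vertical bar is never the problem here; the parser only trips over the byte that does not match the declared encoding.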
If for one reason or another the customer requires UTF-8, do a little identity transform in XSLT on your data
the data will remain the same, but the "é" will be transformed into a two-byte UTF-8 sequence
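If XSLT is not at hand, the same effect can be sketched in Python by parsing with the correct declaration and writing the tree back out as UTF-8. This is a re-serialization standing in for the XSLT identity transform, not XSLT itself, and the sample document is made up.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical source document, correctly labelled as ISO-8859-1.
source = b'<?xml version="1.0" encoding="ISO-8859-1"?><item>caf\xe9 | bar</item>'

# Parse using the declared encoding, then serialize the unchanged tree
# as UTF-8 with a matching XML declaration.
root = ET.fromstring(source)
buffer = io.BytesIO()
ET.ElementTree(root).write(buffer, encoding="utf-8", xml_declaration=True)
utf8_doc = buffer.getvalue()

# The single byte E9 has become the two bytes C3 A9; the text is the same.
print(utf8_doc)
```

The content of the document does not change, only the bytes used to store it and the encoding named in the declaration.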
You ask for a list of characters that are not valid UTF-8.... that list would be many tens of thousands of characters long,
and it is not correct that they are not valid, they simply can't be expressed as a single byte in UTF-8,
they require multiple bytes
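A quick Python sketch of how many bytes a few characters need in UTF-8. The sample characters are my own picks, and the last lines print the bit patterns of the two bytes that make up é.

```python
# Sample characters: vertical bar, plain e, e-acute, euro sign, an emoji.
samples = ["|", "e", "\u00e9", "\u20ac", "\U0001F600"]

# Number of bytes each one occupies when encoded as UTF-8.
lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

# Bit patterns for é: the lead byte starts with 110 (a two-byte sequence),
# the continuation byte starts with 10.
e_bits = [format(b, "08b") for b in "\u00e9".encode("utf-8")]

for ch, n in lengths.items():
    print(f"U+{ord(ch):04X} needs {n} byte(s)")
print(e_bits)
```

Every one of these is valid in a UTF-8 XML file, as long as the bytes written to disk really are the UTF-8 ones.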
Hope this helps
Geert