illegal characters in XML using UTF-8

Posted on 2011-05-05
Last Modified: 2016-06-02
What are the illegal characters in XML using UTF-8?
Question by:ecandel420

    Author Comment

    Aside from &, ", ', Carriage Return, Line Feed, < and >

    Entity   Entity Reference   Hex   Description          Value
    amp      &amp;              26    ampersand (&)        U+0026
    lt       &lt;               3C    less-than (<)        U+003C
    apos     &apos;             27    apostrophe (')       U+0027
    quot     &quot;             22    double quote (")     U+0022
    gt       &gt;               3E    greater-than (>)     U+003E
    (none)   &#xD;              0D    Carriage Return      U+000D
    (none)   &#xA;              0A    Line Feed            U+000A

    Accepted Solution

    Once the UTF-8 has been converted into Unicode, the illegal Unicode character positions are documented at

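    Simply put, anything under hex 20 other than TAB (09), LF (0A) and CR (0D) is illegal. The control characters from hex 7F to 9F (except 85) are legal but discouraged in XML 1.0, and XML 1.1 only accepts them as character references. Then there are the "gaps" between the "code pages": the surrogate block D800 to DFFF, the two values FFFE and FFFF, and anything above 10FFFF.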
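
    As a rough illustration of those ranges, here is a minimal Python sketch that flags characters falling outside the XML 1.0 Char production. It assumes XML 1.0 rules, so hex 80 to 9F are accepted even though the spec discourages them. The function names are made up for this example.

```python
import re

# XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
# The character class below is the negation of that set.
_ILLEGAL_XML10 = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_illegal_xml_chars(text):
    """Return (index, code point) pairs for characters XML 1.0 forbids."""
    return [(m.start(), ord(m.group())) for m in _ILLEGAL_XML10.finditer(text)]


def strip_illegal_xml_chars(text):
    """Return text with every forbidden character removed."""
    return _ILLEGAL_XML10.sub("", text)
```

    Whether to strip, replace or reject the offending characters depends on the application; stripping is just the simplest thing to show.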

    But UTF-8 is an encoding of Unicode in which each Unicode position is encoded in 1 to 4 bytes (the original design allowed up to 6, but RFC 3629 limits it to 4). The start byte determines how many other bytes follow, and these bytes must be in specific ranges, which makes it possible to tell if a byte is out of sequence. See

    If you want to check a UTF-8 string for complete consistency with the standards, you'll need to implement a UTF-8 to Unicode converter and then test each Unicode character against the standard (say in a bit map) which you can download from
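
    A minimal sketch of that two-step check in Python, assuming the built-in strict UTF-8 decoder is an acceptable stand-in for a hand-written converter and using plain range comparisons instead of a bit map. The names is_xml10_char and check_utf8_for_xml are hypothetical, and the ranges are those of the XML 1.0 Char production.

```python
def is_xml10_char(cp):
    """True if code point cp matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def check_utf8_for_xml(data):
    """Return a list of problems found in the bytes; an empty list means OK."""
    try:
        # Strict decoding rejects overlong forms, stray continuation bytes,
        # encoded surrogates and sequences longer than 4 bytes.
        text = data.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        return ["invalid UTF-8 at byte offset %d: %s" % (exc.start, exc.reason)]
    # Step two: test every decoded code point against the allowed ranges.
    return [
        "illegal XML 1.0 character U+%04X at index %d" % (ord(ch), i)
        for i, ch in enumerate(text)
        if not is_xml10_char(ord(ch))
    ]
```

    The decoder stops at the first malformed byte sequence, so this reports at most one encoding error per call but every illegal character in an otherwise well-formed string.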

    Expert Comment

    by:Ray Paseur
    This article can shed some light on the issues that we frequently encounter.

    A lot of times people seem to think Unicode (UTF-8) is needed when it really is not.  Example: You are producing XML with Western European characters, but not Chinese or Korean.  In that case you can use ISO-8859-1 and things should work out fine.

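    What programming languages are you using?  If you use PHP I can give you code examples showing some ways of getting past the problem.  One strategy, no matter what language you use, is to replace the troublesome Unicode characters with numeric character references like &#xA; (hex) or &#10; (decimal).  Note that in XML 1.0 a reference may only name a character that is itself legal, so this helps with non-ASCII text but not with the forbidden control characters.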
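
    Here is a minimal sketch of the numeric reference strategy, written in Python rather than PHP purely for illustration. It relies on the standard library's escape helper and the "xmlcharrefreplace" error handler; the function name to_ascii_xml is made up for this example.

```python
from xml.sax.saxutils import escape


def to_ascii_xml(text):
    """Escape markup characters, then turn non-ASCII into &#NNN; references."""
    escaped = escape(text)  # handles &, < and >
    # Characters that cannot be encoded as ASCII become decimal references.
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


print(to_ascii_xml("caf\u00e9 & <b>"))
```

    The result is plain ASCII, so it reads the same whatever encoding the XML declaration names.  It does not remove characters that XML forbids outright; those still have to be filtered separately.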

    Expert Comment

    by:Ray Paseur
    Wow, since this is like five years old, the state of the art has advanced a bit.  I think BigRat gives a good answer here.  Also, numbered entities have always worked with XML.  And since the long-ago time of this question, I've written an article on UTF-8 and its role in character set encoding.
