Confused about XML and CDATA

Posted on 2006-04-28
Last Modified: 2010-05-18
I am trying to work with XML's CDATA option but am running into difficulty. This should be simple but I am obviously overlooking something. Here is the XML:

<?xml version="1.0"?>
<ContractorName><![CDATA[André Dart]]></ContractorName>
<ContractorName><![CDATA[Annabelle McKinley]]></ContractorName>
<ContractorName><![CDATA[Axiom Puree Ltd]]></ContractorName>

However, Microsoft IE when loading the xml file complains...

The XML page cannot be displayed
Cannot view XML input using XSL style sheet. Please correct the error and then click the Refresh button, or try again later.

An invalid character was found in text content. Error processing resource 'file:///C:/test.xml'. Line 6, Position 30


I am also working with Microsoft's DOM object (version 5) from VB6, and loading the XML file there fails as well. If I take out the acute-accented character then it works just fine.

Any suggestions anyone?
Question by:Kitsune
    LVL 142

    Expert Comment

    by:Guy Hengel [angelIII / a3]
    A CDATA section only stops its content from being parsed as markup structure; it does not help with illegal characters.

    <ContractorName><![CDATA[André Dart]]></ContractorName>
    should be
    <ContractorName><![CDATA[Andr&eacute; Dart]]></ContractorName>
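
As a rough illustration of what a CDATA section does and does not do, here is a minimal Python sketch using the standard library's xml.etree.ElementTree (Python is not part of the thread; the sample strings are made up). It suggests that CDATA only switches off markup parsing, and that a character reference placed inside CDATA is kept as literal text rather than expanded.

```python
import xml.etree.ElementTree as ET

# Inside CDATA, "<" and "&" are treated as plain text, not as markup.
doc = "<ContractorName><![CDATA[Dart & Sons <Ltd>]]></ContractorName>"
cdata_text = ET.fromstring(doc).text

# A character reference inside CDATA is NOT expanded; it stays literal.
inside_cdata = ET.fromstring("<n><![CDATA[Andr&#233; Dart]]></n>").text

# Outside CDATA the same reference is expanded to the character U+00E9.
outside_cdata = ET.fromstring("<n>Andr&#233; Dart</n>").text
```

Under these assumptions, CDATA changes how markup characters are read but has no effect on how the bytes of the file are decoded.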
    LVL 1

    Author Comment

    The only illegal characters in XML (according to W3C) are less than and ampersand, although I am replacing all 5 of the main list. However, characters used in European and Asian languages could also be entered and I cannot check for every such character occurrence. Isn't that the whole point of CDATA?

    Note: Only the characters "<" and "&" are strictly illegal in XML. Apostrophes,
    quotation marks and greater than signs are legal, but it is a good habit to
    replace them.
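
The note above can be sketched in Python with the standard library's xml.sax.saxutils.escape, which rewrites only the markup characters and leaves accented letters alone. The helper name to_element and the sample value are hypothetical, used only for illustration.

```python
from xml.sax.saxutils import escape

def to_element(tag: str, value: str) -> str:
    """Wrap user text in an element, escaping only the markup characters."""
    return "<{0}>{1}</{0}>".format(tag, escape(value))

# escape() rewrites &, < and >; the accented letter passes through untouched.
xml_fragment = to_element("ContractorName", "Andr\u00e9 & Dart <Ltd>")
```

The point of the sketch is that non-ASCII characters need no per-character replacement; whether a parser accepts them depends on the encoding of the file, not on escaping.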



    Thanks for your suggestion and I would do this if this were the only possible scenario but there are literally thousands of non-ASCII characters and trying to replace each of them individually is not acceptable. I don't know what my customers may enter and many of them are not native English speakers. What other tricks have you got up your sleeve?  :)
    LVL 60

    Accepted Solution

    Hi Kitsune,

    your parser is treating this as UTF-8
    try to find the correct encoding
    If I copy your XML... IE screams like you experienced
    If I replace the declaration, like this <?xml version="1.0" encoding="ISO-8859-1"?>
    IE is happier

    apparently the encoding of your XML is not UTF-8,
    and the person giving it to you should have added a correct declaration
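
The behaviour described above can be reproduced outside IE. The following is a minimal Python sketch, assuming the file was saved in ISO-8859-1 (the thread does not state the actual encoding) and using the standard library parser rather than MSXML; try_parse is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

body = "<ContractorName><![CDATA[Andr\u00e9 Dart]]></ContractorName>"

# Bytes as an ISO-8859-1 editor would save them: the accented e is the single byte 0xE9.
latin1_bytes = body.encode("iso-8859-1")

def try_parse(data: bytes):
    """Return the element text, or None if the parser rejects the bytes."""
    try:
        return ET.fromstring(data).text
    except ET.ParseError:
        return None

# No encoding declared: the parser assumes UTF-8, and 0xE9 followed by a space is invalid UTF-8.
without_decl = try_parse(b'<?xml version="1.0"?>' + latin1_bytes)

# Declaration matches the bytes: the document parses and the name survives.
with_decl = try_parse(b'<?xml version="1.0" encoding="ISO-8859-1"?>' + latin1_bytes)
```

Here without_decl comes back as None while with_decl holds the contractor name, which mirrors the difference seen in IE when the declaration is changed.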

    LVL 1

    Author Comment

    I'm away from the office now and won't be back until Tuesday. I'll look at the encoding problem then.

    Nobody is giving me anything, we are simply trying to make an XML object as robust as possible for clients. We can't predict the characters that clients may enter in the software and they surely won't be the type of users who even know what XML encoding is let alone which one is suitable for their language character set.

    What is the most flexible or all-encompassing of the encoding options?

    Thanks for your help.
    LVL 1

    Author Comment

    Yes. It is definitely the encoding causing the problem. W3C recommend specifying no encoding; however, unless the XML file has been written using Unicode, this does not work. For anyone reading this post later, the following information may help...

    XML documents may contain foreign characters, like Norwegian æ ø å , or French ê è é.

    To let your XML parser understand these characters, you should save your XML documents as Unicode.

    Windows 2000 Notepad

    Windows 2000 Notepad can save files as Unicode.

    Save the XML file below as Unicode (note that the document does not contain any encoding attribute):

    <?xml version="1.0"?>
      <message>Norwegian: æøå. French: êèé</message>

    The file above, note_encode_none_u.xml, will NOT generate an error in IE 5+, Firefox, or Opera, but it WILL generate an error in Netscape 6.2.

    Windows 2000 Notepad with Encoding

    Windows 2000 Notepad files saved as Unicode use "UTF-16" encoding.

    If you add an encoding attribute to XML files saved as Unicode, windows encoding values will generate an error.

    The following encoding will NOT give an error message:

    <?xml version="1.0" encoding="windows-1252"?>

    The following encoding will NOT give an error message:

    <?xml version="1.0" encoding="ISO-8859-1"?>

    The following encoding will NOT give an error message:

    <?xml version="1.0" encoding="UTF-8"?>

    The following encoding will NOT generate an error in IE 5+, Firefox, or Opera, but it WILL generate an error in Netscape 6.2.

    <?xml version="1.0" encoding="UTF-16"?>

    Error Messages

    If you try to load an XML document into Internet Explorer, you can get two different errors indicating encoding problems:

    An invalid character was found in text content.

    You will get this error message if a character in the XML document does not match the encoding attribute. Normally you will get this error message if your XML document contains "foreign" characters, and the file was saved with a single-byte encoding editor like Notepad, and no encoding attribute was specified.

    Switch from current encoding to specified encoding not supported.

    You will get this error message if your file was saved as Unicode/UTF-16 but the encoding attribute specified an 8-bit-based encoding like Windows-1252, ISO-8859-1 or UTF-8. You can also get this error message if your document was saved in such an 8-bit-based encoding, but the encoding attribute specified a 16-bit encoding like UTF-16.
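
The mismatch described above can be checked before a file is handed to a parser. This is a rough Python sketch, not how IE or MSXML detect the problem: it assumes a UTF-16 file starts with a byte-order mark (as Notepad's "Unicode" option writes), and the function names are hypothetical.

```python
import codecs
import re
from typing import Optional

# Matches the encoding attribute of an XML declaration at the start of the text.
DECL = re.compile(r"""<\?xml[^>]*encoding=["']([A-Za-z0-9._-]+)["']""")

def is_utf16(data: bytes) -> bool:
    """True if the bytes start with a UTF-16 byte-order mark."""
    return data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))

def declared_encoding(data: bytes) -> Optional[str]:
    """Return the encoding named in the XML declaration, if there is one."""
    text = data.decode("utf-16" if is_utf16(data) else "latin-1")
    match = DECL.match(text)
    return match.group(1) if match else None

def declaration_matches_bytes(data: bytes) -> bool:
    """False when a UTF-16 file declares an 8-bit encoding, or the reverse."""
    declared = declared_encoding(data)
    if declared is None:
        return True  # nothing declared, so nothing to contradict
    return declared.lower().startswith("utf-16") == is_utf16(data)
```

For example, a document declaring ISO-8859-1 but saved as UTF-16 would make declaration_matches_bytes return False, which corresponds to the "Switch from current encoding" case. The check does not cover the other error, where an 8-bit file simply declares the wrong 8-bit encoding.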

    The conclusion is that the encoding attribute has to specify the encoding used when the document was saved. My best advice to avoid errors is:

        * Use an editor that supports encoding
        * Make sure you know what encoding it uses
        * Use the same encoding attribute in your XML documents
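
One way to follow the advice above is to let the XML library write both the declaration and the bytes, so the two cannot disagree. The sketch below uses Python's xml.etree.ElementTree as a stand-in (the thread itself uses VB6 and MSXML); serialize is a hypothetical helper and UTF-8 is an assumed choice, not something the thread prescribes.

```python
import io
import xml.etree.ElementTree as ET

def serialize(root: ET.Element, encoding: str = "utf-8") -> bytes:
    """Write the tree with an XML declaration that names the encoding actually used."""
    buffer = io.BytesIO()
    ET.ElementTree(root).write(buffer, encoding=encoding, xml_declaration=True)
    return buffer.getvalue()

root = ET.Element("message")
# Same sample text as the note above, written with escapes for the accented letters.
root.text = "Norwegian: \u00e6\u00f8\u00e5. French: \u00ea\u00e8\u00e9"

data = serialize(root)

# Reading the bytes back yields the original text because declaration and bytes agree.
round_trip = ET.fromstring(data).text
```

Because the serializer chooses the bytes and writes the matching declaration in one step, the end user never needs to know which encoding was used.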
