Confused about XML and CDATA

I am trying to work with XML's CDATA option but am running into difficulty. This should be simple but I am obviously overlooking something. Here is the XML:

<?xml version="1.0"?>
<ContractorName><![CDATA[André Dart]]></ContractorName>
<ContractorName><![CDATA[Annabelle McKinley]]></ContractorName>
<ContractorName><![CDATA[Axiom Puree Ltd]]></ContractorName>

However, Microsoft IE complains when loading the XML file:

The XML page cannot be displayed
Cannot view XML input using XSL style sheet. Please correct the error and then click the Refresh button, or try again later.

An invalid character was found in text content. Error processing resource 'file:///C:/test.xml'. Line 6, Position 30


I am also working with Microsoft's DOM object (version 5) from VB6, and loading the XML file there fails as well. If I take out the character with the acute accent, it works just fine.

Any suggestions anyone?
Geert Bormans (Information Architect) Commented:
Hi Kitsune,

Your parser is treating this file as UTF-8, which is the default when the XML declaration names no encoding.
Try to find out which encoding the file was actually saved in.
If I copy your XML... IE complains just as you described.
If I replace the declaration with <?xml version="1.0" encoding="ISO-8859-1"?>,
IE is happier.

Apparently your XML is not encoded as UTF-8,
and the person giving it to you should have added a correct declaration.
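
To illustrate the diagnosis above, here is a minimal sketch in Python using the standard xml.etree.ElementTree module (both are assumptions for illustration; the question itself uses Microsoft's DOM from VB6). It feeds a parser the same element saved as ISO-8859-1 bytes, first without an encoding in the declaration and then with the declaration suggested above.

```python
import xml.etree.ElementTree as ET

# "é" saved by an ISO-8859-1 (Latin-1) editor is the single byte 0xE9,
# which is not a valid UTF-8 sequence on its own.
body = "<ContractorName><![CDATA[André Dart]]></ContractorName>".encode("iso-8859-1")

no_declared_encoding = b'<?xml version="1.0"?>' + body
latin1_declared = b'<?xml version="1.0" encoding="ISO-8859-1"?>' + body


def try_parse(data):
    """Return the element text, or None if the parser rejects the bytes."""
    try:
        return ET.fromstring(data).text
    except ET.ParseError:
        return None


# Without a declared encoding the parser assumes UTF-8 and rejects byte 0xE9.
result_default = try_parse(no_declared_encoding)
# With the matching declaration the same bytes are accepted.
result_latin1 = try_parse(latin1_declared)

print(result_default, result_latin1)
```

The CDATA section is present in both cases, so it is the declaration, not CDATA, that decides whether the accented character is accepted.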

Guy Hengel [angelIII / a3] (Billing Engineer) Commented:
A CDATA section only stops its content from being parsed as markup; it does nothing about illegal characters.

<ContractorName><![CDATA[André Dart]]></ContractorName>
should be
<ContractorName><![CDATA[Andr&eacute; Dart]]></ContractorName>
Kitsune (Author) Commented:
The only illegal characters in XML (according to W3C) are the less-than sign and the ampersand, although I am replacing all 5 of the main list. However, characters used in European and Asian languages could also be entered, and I cannot check for every such character occurrence. Isn't that the whole point of CDATA?

Note: Only the characters "<" and "&" are strictly illegal in XML. Apostrophes,
quotation marks and greater than signs are legal, but it is a good habit to
replace them.
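
As a rough illustration of the note above, the following Python sketch (Python and the sample string are assumptions, not part of the original question) uses the standard library's xml.sax.saxutils.escape to replace the markup characters. The accented letter passes through untouched, because it is a matter for the file encoding rather than for escaping.

```python
from xml.sax.saxutils import escape

# Hypothetical user input containing markup characters and an accented letter.
raw = "André & Sons <Ltd>"

# escape() replaces &, < and > with entity references by default;
# it leaves non-ASCII characters such as "é" exactly as they are.
escaped = escape(raw)

print(escaped)
```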



Thanks for your suggestion and I would do this if this were the only possible scenario but there are literally thousands of non-ASCII characters and trying to replace each of them individually is not acceptable. I don't know what my customers may enter and many of them are not native English speakers. What other tricks have you got up your sleeve?  :)
Kitsune (Author) Commented:
I'm away from the office now and won't be back until Tuesday. I'll look at the encoding problem then.

Nobody is giving me anything, we are simply trying to make an XML object as robust as possible for clients. We can't predict the characters that clients may enter in the software and they surely won't be the type of users who even know what XML encoding is let alone which one is suitable for their language character set.

What is the most flexible or all-encompassing of the encoding options?

Thanks for your help.
Kitsune (Author) Commented:
Yes, it is definitely the encoding that is causing the problem. W3C recommend using no encoding declaration; however, this does not work unless the XML file has been saved as Unicode. For anyone reading this post later, the following information may help...

XML documents may contain foreign characters, like Norwegian æ ø å, or French ê è é.

To let your XML parser understand these characters, you should save your XML documents as Unicode.

Windows 2000 Notepad

Windows 2000 Notepad can save files as Unicode.

Save the XML file below as Unicode (note that the document does not contain any encoding attribute):

<?xml version="1.0"?>
  <message>Norwegian: æøå. French: êèé</message>

The file above, note_encode_none_u.xml, will NOT generate an error in IE 5+, Firefox, or Opera, but it WILL generate an error in Netscape 6.2.

Windows 2000 Notepad with Encoding

Windows 2000 Notepad files saved as Unicode use "UTF-16" encoding.

If you add an encoding attribute to an XML file saved as Unicode, a value that does not match, such as a Windows encoding, will generate an error.

The following encoding will give an error message:

<?xml version="1.0" encoding="windows-1252"?>

The following encoding will give an error message:

<?xml version="1.0" encoding="ISO-8859-1"?>

The following encoding will give an error message:

<?xml version="1.0" encoding="UTF-8"?>

The following encoding will NOT generate an error in IE 5+, Firefox, or Opera, but it WILL generate an error in Netscape 6.2:

<?xml version="1.0" encoding="UTF-16"?>

Error Messages

If you try to load an XML document into Internet Explorer, you can get two different errors indicating encoding problems:

An invalid character was found in text content.

You will get this error message if a character in the XML document does not match the encoding attribute. Normally you will get this error message if your XML document contains "foreign" characters, and the file was saved with a single-byte encoding editor like Notepad, and no encoding attribute was specified.

Switch from current encoding to specified encoding not supported.

You will get this error message if your file was saved as Unicode/UTF-16 but the encoding attribute specified a byte-oriented encoding like Windows-1252, ISO-8859-1 or UTF-8. You can also get this error message if your document was saved with a byte-oriented encoding, but the encoding attribute specified a double-byte encoding like UTF-16.
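
The mismatch cases described above can be reproduced outside Internet Explorer. The sketch below is only an illustration and assumes Python with its standard xml.etree.ElementTree module, whose parser reports a generic parse error instead of IE's wording. The helper name parses is hypothetical.

```python
import xml.etree.ElementTree as ET

message = "<message>Norwegian: æøå. French: êèé</message>"


def parses(data):
    """Return True if the bytes load as XML, False if the parser rejects them."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


# Saved as UTF-16 (with a byte order mark) but declared as a single-byte encoding.
utf16_declared_latin1 = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>' + message
).encode("utf-16")

# Saved as ISO-8859-1 but declared as UTF-16.
latin1_declared_utf16 = (
    '<?xml version="1.0" encoding="UTF-16"?>' + message
).encode("iso-8859-1")

# Saved as UTF-16 and declared as UTF-16: declaration and bytes agree.
utf16_declared_utf16 = (
    '<?xml version="1.0" encoding="UTF-16"?>' + message
).encode("utf-16")

mismatch_a = parses(utf16_declared_latin1)
mismatch_b = parses(latin1_declared_utf16)
matching = parses(utf16_declared_utf16)

print(mismatch_a, mismatch_b, matching)
```

Only the third document, where the declared encoding agrees with the bytes on disk, is expected to load.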

The conclusion is that the encoding attribute has to specify the encoding used when the document was saved. My best advice to avoid errors is:

    * Use an editor that supports encoding
    * Make sure you know what encoding it uses
    * Use the same encoding attribute in your XML documents
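
When the XML is generated by a program instead of a text editor, the advice in the list above can be followed by letting one library call choose both the bytes and the declaration. This is a minimal sketch under stated assumptions: Python and xml.etree.ElementTree stand in for the VB6/MSXML setup of the question, UTF-8 is chosen because it can represent any Unicode character, and the wrapper element Contractors and the file name contractors.xml are hypothetical.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical wrapper element; a well-formed document needs a single root.
root = ET.Element("Contractors")
for name in ["André Dart", "Annabelle McKinley", "Axiom Puree Ltd"]:
    ET.SubElement(root, "ContractorName").text = name

path = os.path.join(tempfile.mkdtemp(), "contractors.xml")

# write() encodes the text as UTF-8 and emits a declaration naming UTF-8,
# so the bytes on disk and the declared encoding cannot drift apart.
ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)

with open(path, "rb") as f:
    first_line = f.readline()

# Reading the file back returns the original names, accents included.
names = [element.text for element in ET.parse(path).getroot()]

print(first_line, names)
```

With this approach the users entering the names never need to know which encoding is in use.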