Confused about XML and CDATA

Posted on 2006-04-28
Last Modified: 2010-05-18
I am trying to work with XML's CDATA option but am running into difficulty. This should be simple but I am obviously overlooking something. Here is the XML:

<?xml version="1.0"?>
<ROOT>
<Contractors>
<ROW>
<ContractorID>{6D440678-41E2-4F74-A599-5C05DF5F2421}</ContractorID>
<ContractorName><![CDATA[André Dart]]></ContractorName>
</ROW>
<ROW>
<ContractorID>{60BC0624-C453-499B-A3FD-B34E0D036EB1}</ContractorID>
<ContractorName><![CDATA[Annabelle McKinley]]></ContractorName>
</ROW>
<ROW>
<ContractorID>{0DD5B7CC-6315-42A9-8898-F55725A8249F}</ContractorID>
<ContractorName><![CDATA[Axiom Puree Ltd]]></ContractorName>
</ROW>
</Contractors>
</ROOT>


However, Microsoft IE when loading the xml file complains...

The XML page cannot be displayed
Cannot view XML input using XSL style sheet. Please correct the error and then click the Refresh button, or try again later.
--------------------------------------------------------------------------------

An invalid character was found in text content. Error processing resource 'file:///C:/test.xml'. Line 6, Position 30

<ContractorName><![CDATA[Andr



I am also working with Microsoft's DOM object (version 5) from VB6, and loading the XML file fails there too. If I take out the accented character, it works just fine.

Any suggestions anyone?
Question by:Kitsune

Expert Comment

by:Guy Hengel [angelIII / a3]
ID: 16560264
A CDATA section only stops its content from being parsed as markup; it does nothing about illegal characters.

<ContractorName><![CDATA[André Dart]]></ContractorName>
should be
<ContractorName><![CDATA[Andr&eacute; Dart]]></ContractorName>

Author Comment

by:Kitsune
ID: 16560304
The only illegal characters in XML (according to W3Schools) are less than and ampersand, although I am replacing all 5 of the main list. However, characters used in European and Asian languages could also be entered, and I cannot check for every such character occurrence. Isn't that the whole point of CDATA?

Note: Only the characters "<" and "&" are strictly illegal in XML. Apostrophes,
quotation marks and greater than signs are legal, but it is a good habit to
replace them.

See: http://www.w3schools.com/xml/xml_cdata.asp


angelIII,

Thanks for your suggestion and I would do this if this were the only possible scenario but there are literally thousands of non-ASCII characters and trying to replace each of them individually is not acceptable. I don't know what my customers may enter and many of them are not native English speakers. What other tricks have you got up your sleeve?  :)
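
As a quick sanity check on what CDATA does and does not do, here is a minimal sketch in Python. Python and its standard xml.etree.ElementTree parser are assumptions chosen for illustration (the thread itself uses MSXML from VB6), and the sample strings are made up. It suggests that text inside a CDATA section is taken literally, so an entity reference placed there is not expanded, whereas a numeric character reference outside CDATA is.

```python
import xml.etree.ElementTree as ET

# Inside CDATA nothing is interpreted: "&eacute;", "<" and "&" all stay literal.
cdata_doc = "<ContractorName><![CDATA[Andr&eacute; <Dart> & Co]]></ContractorName>"
cdata_text = ET.fromstring(cdata_doc).text

# Outside CDATA, a numeric character reference is expanded by the parser,
# which keeps the file itself pure ASCII regardless of how it is saved.
ncr_doc = "<ContractorName>Andr&#233; Dart</ContractorName>"
ncr_text = ET.fromstring(ncr_doc).text

print(repr(cdata_text))
print(repr(ncr_text))
```

So CDATA only protects against markup characters; it has no effect on how the bytes of the file are decoded.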

Accepted Solution

by:
Geert Bormans earned 500 total points
ID: 16561404
Hi Kitsune,

Your parser is treating this as UTF-8, which is the default when no encoding is declared.
Try to find the correct encoding.
If I copy your XML, IE screams just as you experienced.
If I replace the declaration with <?xml version="1.0" encoding="ISO-8859-1"?>,
IE is happier.

Apparently the encoding of your XML is not UTF-8,
and the person giving it to you should have added a correct declaration.

Cheers!
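
The effect described above can be reproduced outside IE. This is a minimal sketch in Python with the standard xml.etree.ElementTree parser, which is an assumption for illustration and not the MSXML parser used in the question. It saves one of the rows as ISO-8859-1 bytes and parses them with and without a matching declaration.

```python
import xml.etree.ElementTree as ET

name = "Andr\u00e9 Dart"  # the name from the question, with e-acute
body = "<ROOT><ContractorName><![CDATA[" + name + "]]></ContractorName></ROOT>"

# Bytes as a Latin-1 editor would save them: e-acute becomes the single byte 0xE9.
without_decl = ('<?xml version="1.0"?>' + body).encode("iso-8859-1")
with_decl = ('<?xml version="1.0" encoding="ISO-8859-1"?>' + body).encode("iso-8859-1")

# No declared encoding: the parser assumes UTF-8, and a lone 0xE9 is not valid UTF-8.
try:
    ET.fromstring(without_decl)
    parsed_without_declaration = True
except ET.ParseError:
    parsed_without_declaration = False

# Declaration matches the bytes: the document parses and the name survives.
recovered = ET.fromstring(with_decl).find("ContractorName").text

print(parsed_without_declaration, recovered)
```

The CDATA section is present in both cases, which is consistent with the point that the failure comes from the encoding and not from the markup.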

Author Comment

by:Kitsune
ID: 16561576
I'm away from the office now and won't be back until Tuesday. I'll look at the encoding problem then.

Nobody is giving me anything; we are simply trying to make an XML object as robust as possible for clients. We can't predict the characters that clients may enter in the software, and they surely won't be the type of users who even know what XML encoding is, let alone which one is suitable for their language's character set.

Gertone,
What is the most flexible or all-encompassing of the encoding options?

Thanks for your help.

Author Comment

by:Kitsune
ID: 16582602
Yes. It is definitely the encoding causing the problem. W3Schools recommends using no encoding attribute; however, this does not work unless the XML file has been saved as Unicode. For anyone reading this post later, the following information may help...


XML documents may contain foreign characters, like Norwegian æ ø å , or French ê è é.

To let your XML parser understand these characters, you should save your XML documents as Unicode.

Windows 2000 Notepad

Windows 2000 Notepad can save files as Unicode.

Save the XML file below as Unicode (note that the document does not contain any encoding attribute):

<?xml version="1.0"?>
<note>
  <from>Jani</from>
  <to>Tove</to>
  <message>Norwegian: æøå. French: êèé</message>
</note>

The file above, note_encode_none_u.xml will NOT generate an error in IE 5+, Firefox, or Opera, but it WILL generate an error in Netscape 6.2.

Windows 2000 Notepad with Encoding

Windows 2000 Notepad files saved as Unicode use "UTF-16" encoding.

If you add an encoding attribute to an XML file saved as Unicode, any value that names a different encoding, such as a Windows code page, will generate an error.

The following encoding will give an error message:

<?xml version="1.0" encoding="windows-1252"?>

The following encoding will give an error message:

<?xml version="1.0" encoding="ISO-8859-1"?>

The following encoding will give an error message:

<?xml version="1.0" encoding="UTF-8"?>

The following encoding will NOT generate an error in IE 5+, Firefox, or Opera, but it WILL generate an error in Netscape 6.2:

<?xml version="1.0" encoding="UTF-16"?>

<?xml version="1.0" encoding="UTF-16"?>


Error Messages

If you try to load an XML document into Internet Explorer, you can get two different errors indicating encoding problems:

An invalid character was found in text content.

You will get this error message if a character in the XML document does not match the encoding attribute. Normally you will get this error message if your XML document contains "foreign" characters, and the file was saved with a single-byte encoding editor like Notepad, and no encoding attribute was specified.

Switch from current encoding to specified encoding not supported.

You will get this error message if your file was saved as Unicode/UTF-16 but the encoding attribute specified a byte-oriented encoding like Windows-1252, ISO-8859-1 or UTF-8. You can also get this error message if your document was saved with one of those byte-oriented encodings, but the encoding attribute specified a double-byte encoding like UTF-16.
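
The second kind of mismatch can be spotted before the file ever reaches a parser. Below is a minimal sketch in Python; the function names sniff_encodings and declaration_conflicts_with_bom are hypothetical, and it only considers UTF-8 and UTF-16 byte order marks (UTF-32 is ignored). It compares the byte order mark written by the editor with the encoding attribute in the declaration.

```python
import codecs
import re

def sniff_encodings(data: bytes):
    """Return (bom_encoding, declared_encoding); either may be None."""
    bom = None
    rest = data
    if data.startswith(codecs.BOM_UTF8):
        bom, rest = "utf-8", data[len(codecs.BOM_UTF8):]
    elif data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        bom = "utf-16"
    if bom == "utf-16":
        # The utf-16 codec consumes the BOM and picks the byte order from it.
        head = data[:400].decode("utf-16", errors="replace")
    else:
        head = rest[:200].decode("ascii", errors="replace")
    match = re.match(r'<\?xml[^>]*?encoding=["\']([A-Za-z0-9._-]+)["\']', head)
    declared = match.group(1).lower() if match else None
    return bom, declared

def declaration_conflicts_with_bom(data: bytes) -> bool:
    """True when both a BOM and an encoding attribute exist and they disagree."""
    bom, declared = sniff_encodings(data)
    if bom is None or declared is None:
        return False
    try:
        return codecs.lookup(declared).name != codecs.lookup(bom).name
    except LookupError:
        return True  # unknown encoding name: treat as a conflict

# A file saved as "Unicode" (UTF-16 with BOM) but declared as windows-1252.
bad = '<?xml version="1.0" encoding="windows-1252"?><note/>'.encode("utf-16")
good = '<?xml version="1.0" encoding="UTF-16"?><note/>'.encode("utf-16")
print(declaration_conflicts_with_bom(bad), declaration_conflicts_with_bom(good))
```

A check like this cannot detect the first error (a file with no BOM and no declaration that is not really UTF-8); for that case the bytes have to be decoded to find out.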

Conclusion

The conclusion is that the encoding attribute has to specify the encoding used when the document was saved. My best advice to avoid errors is:

    * Use an editor that supports encoding
    * Make sure you know what encoding it uses
    * Use the same encoding attribute in your XML documents
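
When the XML is generated by software instead of typed into an editor, the same advice can be followed by letting the serializer write both the bytes and the declaration. This is a minimal sketch in Python's xml.etree.ElementTree, which is an assumption for illustration since the thread uses MSXML from VB6; the serialize helper is hypothetical. UTF-8 is used because it can represent every Unicode character, so user input in any language fits.

```python
import io
import xml.etree.ElementTree as ET

def serialize(root, encoding="utf-8"):
    """Write the tree so the declaration names the encoding actually used."""
    buf = io.BytesIO()
    ET.ElementTree(root).write(buf, encoding=encoding, xml_declaration=True)
    return buf.getvalue()

# Build one row similar to the document in the question.
root = ET.Element("ROOT")
row = ET.SubElement(ET.SubElement(root, "Contractors"), "ROW")
ET.SubElement(row, "ContractorName").text = "Andr\u00e9 Dart"

data = serialize(root)

# Round trip: the parser reads the declaration and decodes the bytes accordingly.
round_tripped = ET.fromstring(data).find("Contractors/ROW/ContractorName").text
print(data.split(b"?>")[0] + b"?>", round_tripped)
```

Because the declaration and the bytes come from the same call, they cannot drift apart the way they can when a file is edited and re-saved by hand.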