special character encoding in XML

Posted on 2009-07-06
Last Modified: 2013-11-18
I am using Python's xml.dom.minidom to build a service that simulates a production server. I run this service on my local machine to test software, rather than hitting the production server.

While simulating the responses from the production server, there's one part I haven't been able to accurately reproduce...

The server will respond with an xml element like this:

<Item Name="TSCDATA" Value=":020000040001F9&#xA;:100140001600F8F2EFEEF2F6000D0FF303FF0005D4&#xA;:10015000070B0A2101F6000102050304F9F70A035F&#xA;:10016000F1E7E2DDDDDCDFE30BEE0302070A090B5A&#xA;:100170000211F6FCF2EBE7E5E8EBF2FCFEFB050111&#xA;:10018000F9394C3850203730303020474120203169&#xA;:10019000332041343932322032333036303939313C&#xA;:1001A000373430373137313331373030393230341A&#xA;:1001B000323401802406029C3E02010303001DB973&#xA;:00000001FF" />

Notice the "&#xA;", which is used as an end-of-line character.

I can't get my simulator program to create that same sequence of characters. The problem is not the content, it's the line delimiter. I can't produce the "&#xA;" sequence.

I get something like this:

<Item Name="TSCDATA" Value=":02000004DE011B\&amp;#xA;:08000000081022003800000086\&amp;#xA;:08000800081022D020000000C6\&amp;#xA;:08001000081042E0200000008E\&amp;#xA;:08001800061042D02000000098\&amp;#xA;:08002000062042E02000000070\&amp;#xA;:08002800062041D02000000079\&amp;#xA;:08003000042041C02000000083\&amp;#xA;:08003800041041B0200000009B\&amp;#xA;:00000001FF"/>

I've tried a number of variations.

When I supply this as an EOL: "&#xA;", the minidom encodes it as this: "&amp;#xA;"
When I supply this: "\&#xA;", the minidom encodes it as "\&amp;#xA;"
When I supply this: "&&#xA;", the minidom encodes it as "&amp;&amp;#xA;"
When I supply this: "\n", the minidom does not encode it, and leaves it as a linefeed.
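
The first variation can be reproduced with a short sketch. It assumes only the standard-library minidom; the element and attribute names come from the example above, and the shortened value is made up for illustration.

```python
from xml.dom import minidom

doc = minidom.Document()
item = doc.createElement("Item")
item.setAttribute("Name", "TSCDATA")
# minidom treats the attribute value as literal text,
# so the "&" in "&#xA;" is escaped when the element is serialized.
item.setAttribute("Value", ":020000040001F9&#xA;:00000001FF")

serialized = item.toxml()
print(serialized)
```

The printed attribute contains "&amp;#xA;" because the serializer cannot tell a hand-typed character reference from ordinary text containing an ampersand.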

How can I tell minidom either NOT to encode the "&#xA;",
or to encode "\n" as "&#xA;"?


Question by:Brian Withun

Accepted Solution

BigRat earned 500 total points
ID: 24792532
When an XML parser parses a string containing &#xA; it MUST convert it into a line feed character.
When an XML parser parses a string containing a character whose hex value is 0A, it MUST insert a line feed character for it.

So to get &#xA; UNCHANGED in an XML document is impossible, if it is to represent a line feed character.

I suspect the same will happen if you put the data in a CDATA section, since CDATA sections may only contain VALID XML characters.

When an XML document outputs an XML string, say via doc.xml(), then the resulting XML should not have entities (i.e. the &#x..; sequence) for line feeds. It is however NOT forbidden to convert EVERY character into an entity, although the resultant string would be rather bulky.

That said, why is it necessary to have an entity for line feed in the output? If it is necessary you'll have to write a bit of script to post-process it and replace the line feeds with the entity sequence.
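
The post-processing step suggested above could look roughly like this. It is a minimal sketch, not a definitive implementation: build_item_xml is a hypothetical helper name, and the sketch assumes the element is serialized on its own without pretty-printing, so every line feed in the output belongs to the attribute value. Some Python versions leave the line feed as a literal newline and others may emit it as "&#10;", so both forms are replaced.

```python
from xml.dom import minidom


def build_item_xml(lines):
    """Serialize an <Item> whose Value joins `lines` with "&#xA;" (hypothetical helper)."""
    doc = minidom.Document()
    item = doc.createElement("Item")
    item.setAttribute("Name", "TSCDATA")
    # Store real line feeds in the DOM; minidom escapes only what it must.
    item.setAttribute("Value", "\n".join(lines))
    doc.appendChild(item)

    raw = item.toxml()
    # Post-process the serialized text: turn either representation of the
    # line feed into the exact "&#xA;" sequence the real server sends.
    return raw.replace("&#10;", "&#xA;").replace("\n", "&#xA;")


result = build_item_xml([":020000040001F9", ":00000001FF"])
print(result)
```

A parser that reads the result back should see ordinary line feeds in the attribute value, so the change only affects how the bytes look on the wire.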

Author Comment

by:Brian Withun
ID: 25051423
The reason I need the embedded linefeeds is because I'm writing a simulator for an actual XML server.  If I do not do it that way, my simulation will not accurately reflect the behavior of the server it is intended to simulate.

If the server is creating non-standard XML, that is outside my realm of influence and I, too, must create non-standard XML.

It sounds like you are suggesting that this is not possible.  I find it difficult to believe that I cannot embed this string of characters (for example) in an XML document.  I believed XML to be capable of sending anything.


How do I encode the string above without it being mangled into something that it is not?

Is there no way to "escape" these characters?



Expert Comment

ID: 25058446
I have spent more time on this problem. You can't use CDATA sections since minidom does not support them. I can't seem to find the DOM configuration (it probably doesn't have one) where one can switch off character normalization or set the entities property to true. If you can find that interface, try it. I doubt, however, that it will help.

Strictly speaking, any XML processor, and that includes things which just sniff it, MUST handle &#xA; and the newline character in the same way. In fact the ENTIRE contents of an XML element could be encoded in entities, e.g. &#x41; for an "A". That is just as acceptable as plain text (however silly it might look).
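
How a parser resolves these sequences can be checked with a short sketch using minidom itself. The shortened attribute value is made up for illustration. A literal newline typed inside an attribute value is a separate case, since XML attribute-value normalization can turn it into a space, and it is not exercised here.

```python
from xml.dom import minidom

# "&#xA;" is a character reference for a line feed, "&#x41;" for the letter "A".
doc = minidom.parseString('<Item Name="TSCDATA" Value=":0200&#xA;:00FF&#x41;"/>')
value = doc.documentElement.getAttribute("Value")

# The parser hands back the characters themselves, not the references.
print(repr(value))
```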
