special character encoding in XML

I am using python's xml.dom.minidom to create a service that simulates a production server, and I'm using this service on my local machine to test software, rather than hitting the production server for my testing.

While simulating the responses from the production server, there's one part I haven't been able to accurately reproduce...

The server will respond with an xml element like this:

<Item Name="TSCDATA" Value=":020000040001F9&#xA;:100140001600F8F2EFEEF2F6000D0FF303FF0005D4&#xA;:10015000070B0A2101F6000102050304F9F70A035F&#xA;:10016000F1E7E2DDDDDCDFE30BEE0302070A090B5A&#xA;:100170000211F6FCF2EBE7E5E8EBF2FCFEFB050111&#xA;:10018000F9394C3850203730303020474120203169&#xA;:10019000332041343932322032333036303939313C&#xA;:1001A000373430373137313331373030393230341A&#xA;:1001B000323401802406029C3E02010303001DB973&#xA;:00000001FF" />

Notice the "&#xA;", which is used as an end-of-line character.

I can't get my simulator program to create that same sequence of characters. The problem is not the content, it's the line delimiter. I can't produce the "&#xA;" sequence.

I get something like this:

<Item Name="TSCDATA" Value=":02000004DE011B\&amp;#xA;:08000000081022003800000086\&amp;#xA;:08000800081022D020000000C6\&amp;#xA;:08001000081042E0200000008E\&amp;#xA;:08001800061042D02000000098\&amp;#xA;:08002000062042E02000000070\&amp;#xA;:08002800062041D02000000079\&amp;#xA;:08003000042041C02000000083\&amp;#xA;:08003800041041B0200000009B\&amp;#xA;:00000001FF"/>

I've tried a number of variations.

When I supply this as an EOL: "&#xA;", the minidom encodes it as this: "&amp;#xA;"
When I supply this: "\&#xA;", the minidom encodes it as "\&amp;#xA;"
When I supply this: "&&#xA;", the minidom encodes it as "&amp;&amp;#xA;"
When I supply this: "\n", the minidom does not encode it, and leaves it as a linefeed.
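
For anyone who wants to reproduce the first case, here is a minimal sketch. The helper name build_item is made up for illustration and the record data is shortened from the sample above. It only shows that minidom treats the ampersand as ordinary data and escapes it when serializing.

```python
from xml.dom.minidom import Document


def build_item(value):
    """Serialize one <Item> element whose Value attribute holds `value`.

    build_item is a hypothetical helper, not part of minidom.
    """
    doc = Document()
    item = doc.createElement("Item")
    item.setAttribute("Name", "TSCDATA")
    item.setAttribute("Value", value)
    doc.appendChild(item)
    return item.toxml()


# The "&" is plain data as far as the DOM is concerned, so the serializer
# escapes it and the character reference is lost.
print(build_item(":020000040001F9&#xA;:00000001FF"))
```

The printed attribute value contains "&amp;#xA;" rather than "&#xA;", matching the first variation listed above.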

How can I tell the minidom engine to either NOT encode the "&#xA;"
or force it to encode "\n" as "&#xA;" ?

Asked by: Brian Withun

1 Solution
 
BigRat Commented:
When an XML parser parses a string containing &#xA; it MUST convert it into a line feed character.
When an XML parser parses a string containing a character whose hex value is 0A, it MUST insert a line feed character for it.

So getting &#xA; through an XML document UNCHANGED is impossible if it is to represent a line feed character.

I suspect the same will happen if you put the data in a CDATA section, since CDATA sections may only contain VALID XML characters.

When an XML document is written out as an XML string, say via doc.toxml(), the resulting XML should not have entities (i.e. the &#x..; sequence) for line feeds. It is, however, NOT forbidden to convert EVERY character into an entity, although the resultant string would be rather bulky.

That said, why is it necessary to have an entity for line feed in the output? If it is necessary you'll have to write a bit of script to post-process it and replace the line feeds with the entity sequence.
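
A minimal sketch of that post-processing idea, under a few assumptions: the placeholder EOL_TOKEN and the helper serialize_item are invented for illustration, and the placeholder must never occur in the real data. A placeholder is used instead of a raw "\n" because the way minidom writes a raw newline inside an attribute may differ between Python versions, whereas a plain token passes through the serializer untouched.

```python
from xml.dom.minidom import Document

# Hypothetical placeholder: any token that cannot occur in the data and
# contains no XML-special characters (&, <, >, ") would do.
EOL_TOKEN = "@@EOL@@"


def serialize_item(records):
    """Build <Item Name="TSCDATA" .../> with records separated by &#xA;."""
    doc = Document()
    item = doc.createElement("Item")
    item.setAttribute("Name", "TSCDATA")
    item.setAttribute("Value", EOL_TOKEN.join(records))
    doc.appendChild(item)
    # Post-process the serialized text: swap the placeholder for the
    # character reference the production server emits.
    return item.toxml().replace(EOL_TOKEN, "&#xA;")


xml_text = serialize_item([":020000040001F9", ":00000001FF"])
print(xml_text)
```

The replacement happens on the finished string, after minidom has done its escaping, so the "&" of "&#xA;" is never seen by the serializer.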
 
Brian Withun (Author) Commented:
The reason I need the embedded linefeeds is because I'm writing a simulator for an actual XML server.  If I do not do it that way, my simulation will not accurately reflect the behavior of the server it is intended to simulate.

If the server is creating non-standard XML, that is outside my realm of influence and I, too, must create non-standard XML.

It sounds like you are suggesting that this is not possible.  I find it difficult to believe that I cannot embed this string of characters (for example) in an XML document.  I believed XML to be capable of sending anything.

":020000040001F9&#xA;"

How do I encode the string above without it being mangled into something that it is not?

Is there no way to "escape" these characters?

BW

 
BigRat Commented:
I have spent more time on this problem. You can't use CDATA sections here, since the data lives in an attribute value and CDATA sections are only allowed in element content. I can't seem to find a DOM configuration interface in minidom (it probably doesn't have one) where one could switch off character normalization or set the entities property to true. If you can find that interface, try it. I doubt, however, that it will help.

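Strictly speaking, any XML processor, and that includes things which just sniff it, MUST handle &#xA; and the newline character in the same way in element content. Attribute values are the one exception: a literal newline there is normalized to a space on parsing, while &#xA; comes through as a line feed, which is presumably why the server writes it that way. In fact the ENTIRE contents of an XML element could be encoded in entities, e.g. &#x41; for an "A". That is just as acceptable as plain text (however silly it might look).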
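
A quick way to check what a parser does with each form is to parse it and look at the attribute value. This is a small sketch using minidom's parseString; the helper read_value and the sample strings are made up for illustration. One caveat it makes visible: inside an attribute value, the XML attribute-value normalization rule turns a literal line feed into a space, while &#xA; is read back as a real line feed, so the two forms are not interchangeable there.

```python
from xml.dom.minidom import parseString


def read_value(xml_text):
    """Parse a one-element document and return its Value attribute.

    read_value is a hypothetical helper, not part of minidom.
    """
    return parseString(xml_text).documentElement.getAttribute("Value")


# Line feed written as a character reference.
with_reference = read_value('<Item Value="a&#xA;b"/>')
# Line feed written as a raw newline character inside the attribute.
with_raw_newline = read_value('<Item Value="a\nb"/>')
# Every character written as a character reference.
all_references = read_value('<Item Value="&#x61;&#xA;&#x62;"/>')

print(repr(with_reference))
print(repr(with_raw_newline))
print(repr(all_references))
```

The first and third forms parse to the same three-character string with a line feed in the middle; the second has a space where the line feed was.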