Escaping HTML characters to XML

I have some XHTML code, which I want to include as a node-with-subnodes into an XML file.
example:

xml:

<root>
    <record>
        <id>20</id>
        <name>somename</name>
        <description/>
    </record>
</root>


xhtml to be included in "description"

<p>this<b>something bold</b>is to be included</p>



and the result would be:

<root>
    <record>
        <id>20</id>
        <name>somename</name>
        <description>
           <p>
               this
                  <b>something bold</b>
               is to be included
           </p>
        </description>
    </record>
</root>

I am using ASP with Msxml2.DOMDocument.3.0

- - - -

The inserted XHTML comes from an ActiveX in a webform, which produces XHTML.
Everything goes well until the XHTML contains HTML character entities such as &euml; (ë) or &euro;.
Then I cannot load the XHTML into a DOM document.

I have been working around this by converting the HTML named entities to numeric character references, for example replacing "&euml;" with "&#235;". Then everything works again. This, however, takes a simple but long function, in which I have to replace ALL possible HTML named entities with their numeric equivalents.
I wonder if there is an easier way to do it.

Any suggestion is welcome.
sybeAsked:
rdcproCommented:
There was a thread on that recently.  I posted some links to a standard entity catalog DTD, but here's one:
Latin 1 entities:
http://www.utoronto.ca/webdocs/HTMLdocs/HTML_Spec/xhtml1.0/xhtml-lat1.ent
Special entities:
http://www.utoronto.ca/webdocs/HTMLdocs/HTML_Spec/xhtml1.0/xhtml-special.ent
Symbols:
http://www.utoronto.ca/webdocs/HTMLdocs/HTML_Spec/xhtml1.0/xhtml-symbol.ent

I thought there was a definitive ISO or W3C DTD that you could include in your XML that defined all the entities (an entity catalog), but I can't seem to find it at the moment.

Regards,
Mike Sharp
 
sparkplugCommented:
Hi,

I think you can define the entities in a DTD using the syntax <!ENTITY euml "&#235;">.
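
As a rough illustration of the DTD idea, here is a minimal sketch in Python (not the ASP/MSXML setup from the question). It wraps the fragment in a document whose internal DTD subset declares the two entities mentioned in this thread. The `fragment` value and the wrapper element name are assumptions for the example, and whether Msxml2.DOMDocument.3.0 behaves the same way is not verified here.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment, shaped like the one in the question.
fragment = "<p>this<b>something bold</b>is to be included &euml; &euro;</p>"

# Internal DTD subset declaring only the entities we expect to meet.
doctype = (
    "<!DOCTYPE description ["
    '<!ENTITY euml "&#235;">'
    '<!ENTITY euro "&#8364;">'
    "]>"
)

# The parser expands the declared entities while building the tree.
description = ET.fromstring(doctype + "<description>" + fragment + "</description>")

p = description.find("p")
tail_after_bold = p.find("b").tail  # text following </b>, entities expanded
print(tail_after_bold)
```

Any entity that is used but not declared would still make the parse fail, so this only helps if the declarations cover everything the input can contain.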

I wonder, however, whether you need to be able to parse the XHTML that you are inserting. If not, you could escape the XHTML in a CDATA section as follows:

<?xml version="1.0" encoding="UTF-8"?>

<root>
    <record>
        <id>20</id>
        <name>somename</name>
        <description><![CDATA[
           <p>
               this
                  <b>something  bold</b>
               is to be included &euml;
           </p>
        ]]></description>
    </record>
</root>


>S'Plug<
 
sybeAuthor Commented:
I tried the CDATA thing, and the parsing gives no problem. However, I am transforming the XML with XSL to a browser, and the CDATA section then is displayed as text, not as HTML.
I looked for some solution and found the disable-output-escaping which works in Internet Explorer, but not in Mozilla browsers.
So the CDATA solution did/does not bring me closer to solving the problem.
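
The behaviour described above follows from how a CDATA section is parsed: its content becomes one text node, so there are no <p> or <b> elements for the stylesheet to copy. A small Python sketch (an illustration only, not the MSXML/XSL pipeline used here) shows the difference:

```python
import xml.etree.ElementTree as ET

# Same markup twice: once wrapped in CDATA, once as real child elements.
with_cdata = ET.fromstring(
    "<description><![CDATA[<p>this <b>bold</b> &euml;</p>]]></description>"
)
as_nodes = ET.fromstring(
    "<description><p>this <b>bold</b> &#235;</p></description>"
)

print(len(with_cdata))   # number of child elements under <description>
print(with_cdata.text)   # the markup survives only as literal text
print(len(as_nodes))     # here <p> is a real child element
```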

I will try to do something with the DTD thing you mention. I have never worked with that, do you have some links on that?
 
robbertCommented:
You can use TidyCOM ( http://perso.wanadoo.fr/ablavier/TidyCOM/ ) to clean up the source before loading it to a DOMDocument.

There are options for outputting XML (instead of XHTML) and converting HTML entities to their numeric equivalents.
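
Independently of TidyCOM, the entity-to-numeric conversion itself can be driven by a lookup table instead of a long hand-written list of replacements. The sketch below is in Python and relies on the standard library's `html.entities.name2codepoint` table; the function name `named_to_numeric` is made up for this example, and it is not how TidyCOM does it internally.

```python
import re
from html.entities import name2codepoint

# Entities that XML itself predefines; these must stay as they are.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

def named_to_numeric(xhtml: str) -> str:
    """Replace HTML named entities with numeric character references."""
    def repl(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave predefined or unknown names untouched
        return "&#%d;" % name2codepoint[name]
    return _ENTITY.sub(repl, xhtml)

converted = named_to_numeric("<p>caf&eacute; &euml; &euro; &amp; &bogus;</p>")
print(converted)
```

Unknown names such as `&bogus;` are passed through unchanged, so the XML parser will still reject them; that seems safer than silently dropping them.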

I'm not aware of any competing products to TidyCOM (or to HTML Tidy, the application it wraps), and I have used it often, even in mid-sized web applications. Because HTML Tidy is single-threaded, it should only be called in one instance at a time, so expect to restart IIS every few months or so. But, as mentioned, there doesn't seem to be an alternative.
 
sybeAuthor Commented:
robbert,

I had used TidyCOM to create XHTML, but I did not find the options to convert HTML entities to numerics.
I'll look at it again, but maybe you can tell me?
Question has a verified solution.
