Escaping HTML characters to XML

Posted on 2003-10-29
Last Modified: 2013-11-19
I have some XHTML code, which I want to include as a node-with-subnodes into an XML file.
example:

xml:

<root>
    <record>
        <id>20</id>
        <name>somename</name>
        <description/>
    </record>
</root>


xhtml to be included in "description"

<p>this<b>something bold</b>is to be included</p>



and the result would be:

<root>
    <record>
        <id>20</id>
        <name>somename</name>
        <description>
           <p>
               this
                  <b>something bold</b>
               is to be included
           </p>
        </description>
    </record>
</root>

I am using ASP with Msxml2.DOMDocument.3.0

- - - -

The inserted XHTML comes from an ActiveX control in a web form, which produces XHTML.
Everything goes well until the XHTML contains HTML character entities such as &euml; (ë) or &euro;.
Then I cannot load the XHTML into a DOM document, because XML itself does not define those entity names.

I have been working around this by converting the HTML entities to XML numeric character references, for example replacing "&euml;" with "&#235;". Then everything works again. This, however, takes a simple but long function in which I have to replace ALL possible HTML entities with their numeric equivalents.
I wonder if there is an easier way to do it.

Any suggestion is welcome.
Question by:sybe
 
LVL 9

Expert Comment

by:sparkplug
ID: 9641486
Hi,

I think you can define the entities in a DTD using the syntax <!ENTITY euml "&#235;">.
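
As a rough illustration of the DTD suggestion above, here is a minimal sketch in Python rather than ASP (an assumption made only so the example is self-contained; the thread itself uses Msxml2.DOMDocument.3.0). It declares the single entity in an internal DTD subset and then parses a fragment that uses it. The sample text "No&euml;l" is made up for the demonstration.

```python
import xml.etree.ElementTree as ET

# Internal DTD subset that declares the one HTML entity used below.
# More <!ENTITY ...> lines can be added in the same way.
DOC = (
    '<!DOCTYPE description [\n'
    '  <!ENTITY euml "&#235;">\n'
    ']>\n'
    '<description><p>No&euml;l</p></description>'
)

# Without the DOCTYPE, the parser would reject &euml; as an undefined entity.
root = ET.fromstring(DOC)
text = root.find("p").text
print(text)
```

The same idea should carry over to MSXML by prepending a DOCTYPE like this to the XHTML string before loading it, though that is untested here.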

I wonder, however, whether you need to be able to parse the XHTML that you are inserting. If not, you could escape the XHTML section using a CDATA section as follows:

<?xml version="1.0" encoding="UTF-8"?>

<root>
    <record>
        <id>20</id>
        <name>somename</name>
        <description><![CDATA[
           <p>
               this
                  <b>something  bold</b>
               is to be included &euml;
           </p>
        ]]></description>
    </record>
</root>


>S'Plug<
 
LVL 28

Author Comment

by:sybe
ID: 9641713
I tried the CDATA thing, and the parsing gives no problem. However, I am transforming the XML with XSL to a browser, and the CDATA section then is displayed as text, not as HTML.
I looked for some solution and found the disable-output-escaping which works in Internet Explorer, but not in Mozilla browsers.
So the CDATA solution did/does not bring me closer to solving the problem.
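
The behaviour described above follows from how parsers treat CDATA: its content is character data, not markup. A small sketch in Python (used here only as a stand-in for the MSXML parser in the thread) shows the difference between the two forms of the "description" node.

```python
import xml.etree.ElementTree as ET

# The same fragment, once wrapped in CDATA and once as real markup.
with_cdata = ET.fromstring(
    "<description><![CDATA[<p>this <b>bold</b></p>]]></description>"
)
as_markup = ET.fromstring(
    "<description><p>this <b>bold</b></p></description>"
)

cdata_children = len(with_cdata)    # 0: the CDATA content is one text node
markup_children = len(as_markup)    # 1: a real <p> element
print(cdata_children, repr(with_cdata.text))
print(markup_children, as_markup[0].tag)
```

Because the CDATA version contains no `<p>` or `<b>` elements, a stylesheet has nothing to copy as HTML and can only emit the text, which the browser then shows literally.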

I will try to do something with the DTD thing you mention. I have never worked with that, do you have some links on that?
 
LVL 26

Accepted Solution

by:rdcpro
rdcpro earned 150 total points
ID: 9642479
There was a thread on that recently. I posted some links there to the standard entity declaration files; here are the three XHTML 1.0 entity sets:
Latin 1 entities:
http://www.utoronto.ca/webdocs/HTMLdocs/HTML_Spec/xhtml1.0/xhtml-lat1.ent
Special entities:
http://www.utoronto.ca/webdocs/HTMLdocs/HTML_Spec/xhtml1.0/xhtml-special.ent
Symbols:
http://www.utoronto.ca/webdocs/HTMLdocs/HTML_Spec/xhtml1.0/xhtml-symbol.ent

I thought there was a definitive ISO or W3C DTD that you could include in your XML that defined all the entities (an entity catalog), but I can't seem to find it at the moment.

Regards,
Mike Sharp
 
LVL 15

Assisted Solution

by:robbert
robbert earned 150 total points
ID: 9658979
You can use TidyCOM ( http://perso.wanadoo.fr/ablavier/TidyCOM/ ) to clean up the source before loading it to a DOMDocument.

There are options for outputting XML (instead of XHTML) and converting HTML entities to their numeric equivalents.
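
For comparison, the conversion step itself (named entities to numeric references) can be table-driven rather than one long chain of replacements. The sketch below is not TidyCOM and not ASP; it is a Python illustration that relies on the standard library's HTML entity table, and the function name `named_to_numeric` is hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# Entities that XML already defines; these are left untouched.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}
_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

def named_to_numeric(markup):
    """Replace HTML named entities with numeric character references."""
    def repl(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave XML's own and unknown names as-is
        return "&#%d;" % name2codepoint[name]
    return _ENTITY.sub(repl, markup)

converted = named_to_numeric("<p>caf&eacute; costs 2 &euro; &amp; more &euml;</p>")
print(converted)  # <p>caf&#233; costs 2 &#8364; &amp; more &#235;</p>

# The converted fragment now loads as XML without any DTD.
node = ET.fromstring("<description>" + converted + "</description>")
```

In ASP the same approach would need a lookup table built from the entity files, since VBScript has no built-in equivalent of this table.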

I'm not aware of any products that compete with TidyCOM (or with HTMLTidy, the application it wraps), and I have used it often, even in mid-scale web applications. Because HTMLTidy is single-threaded, it should only be called in one instance at a time, so expect to restart IIS every few months or so. But, as mentioned, there doesn't seem to be an alternative.
 
LVL 28

Author Comment

by:sybe
ID: 9662346
robbert,

I had used TidyCOM to create XHTML, but I did not find the options to convert HTML entities to numeric references.
I'll look at it again, but maybe you can tell me?