Solved

Escaping HTML characters to XML

Posted on 2003-10-29
Last Modified: 2013-11-19
I have some XHTML code, which I want to include as a node-with-subnodes into an XML file.
example:

xml:

<root>
    <record>
        <id>20</id>
        <name>somename</name>
        <description/>
    </record>
</root>


xhtml to be included in "description"

<p>this<b>something bold</b>is to be included</p>



and the result would be:

<root>
    <record>
        <id>20</id>
        <name>somename</name>
        <description>
           <p>
               this
                  <b>something bold</b>
               is to be included
           </p>
        </description>
    </record>
</root>

I am using ASP with Msxml2.DOMDocument.3.0

- - - -

The inserted XHTML comes from an ActiveX in a webform, which produces XHTML.
Everything goes well until there are HTML character entities such as &euml; (ë) or &euro;.
Then I cannot load the XHTML into a DOM document.

I have been working around this by converting the HTML entities to XML numeric character references, for example replacing "&euml;" with "&#235;". Then everything works again. However, this takes a simple but long function in which I have to replace ALL possible HTML entities with their numeric equivalents.
I wonder if there is an easier way to do it.

Any suggestion is welcome.
Question by:sybe
5 Comments
 

Expert Comment

by:sparkplug
ID: 9641486
Hi,

I think you can define the entities in a DTD using the syntax <!ENTITY euml "&#235;">.
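
As a minimal sketch of that idea, the declarations could go in an internal DTD subset that is prepended to the XHTML fragment before it is loaded. Using description as the document element and declaring just these two entities are assumptions made for illustration:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Internal DTD subset: declares only the named entities the fragment uses. -->
<!DOCTYPE description [
  <!ENTITY euml "&#235;">   <!-- e with diaeresis -->
  <!ENTITY euro "&#8364;">  <!-- euro sign -->
]>
<description>
  <p>this <b>something bold</b> is to be included &euml; &euro;</p>
</description>
```

With MSXML, a document that carries a DOCTYPE is validated on load by default, and this DTD declares no elements, so validateOnParse would probably have to be set to False before calling loadXML. That is an assumption about the setup in the question, not something confirmed in this thread.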

I wonder, however, whether you need to be able to parse the XHTML that you are inserting. If not, you could escape the XHTML section using a CDATA section as follows:

<?xml version="1.0" encoding="UTF-8"?>

<root>
    <record>
        <id>20</id>
        <name>somename</name>
        <description><![CDATA[
           <p>
               this
                  <b>something  bold</b>
               is to be included &euml;
           </p>
        ]]></description>
    </record>
</root>


>S'Plug<

Author Comment

by:sybe
ID: 9641713
I tried the CDATA thing, and the parsing gives no problem. However, I am transforming the XML with XSL to a browser, and the CDATA section then is displayed as text, not as HTML.
I looked for some solution and found the disable-output-escaping which works in Internet Explorer, but not in Mozilla browsers.
So the CDATA solution did/does not bring me closer to solving the problem.
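
If the XHTML is stored as real child nodes of description rather than as CDATA text, the stylesheet should not need disable-output-escaping at all, because xsl:copy-of writes the nodes to the output as markup. A minimal sketch, assuming the root/record/description layout from the question (the html, body, h2 and div wrappers are made up for illustration):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html"/>

  <xsl:template match="/root">
    <html>
      <body>
        <xsl:apply-templates select="record"/>
      </body>
    </html>
  </xsl:template>

  <xsl:template match="record">
    <h2><xsl:value-of select="name"/></h2>
    <!-- CDATA variant (relies on optional processor support):
         <xsl:value-of select="description" disable-output-escaping="yes"/> -->
    <!-- Node variant: copy the parsed child nodes of description as they are. -->
    <div>
      <xsl:copy-of select="description/node()"/>
    </div>
  </xsl:template>
</xsl:stylesheet>
```

This only helps once the entity problem is solved, since the XHTML has to parse into nodes before it can be copied.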

I will try to do something with the DTD thing you mention. I have never worked with that, do you have some links on that?

Accepted Solution

by:
rdcpro earned 150 total points
ID: 9642479
There was a thread on that recently.  I posted some links to a standard entity catalog DTD, but here's one:
Latin 1 entities:
http://www.utoronto.ca/webdocs/HTMLdocs/HTML_Spec/xhtml1.0/xhtml-lat1.ent
Special entities:
http://www.utoronto.ca/webdocs/HTMLdocs/HTML_Spec/xhtml1.0/xhtml-special.ent
Symbols:
http://www.utoronto.ca/webdocs/HTMLdocs/HTML_Spec/xhtml1.0/xhtml-symbol.ent

I thought there was a definitive ISO or W3C DTD that you could include in your XML that defined all the entities (an entity catalog), but I can't seem to find it at the moment.
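
One way to use those files, sketched under assumptions: declare each file as an external parameter entity in the internal subset and reference it immediately, which is the mechanism the XHTML 1.0 DTDs themselves use to pull in these entity sets. The parameter entity names and the description wrapper below are illustrative, and the symbols file could be added in the same way:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE description [
  <!-- Latin 1 set: defines &euml; among others. -->
  <!ENTITY % HTMLlat1 SYSTEM
    "http://www.utoronto.ca/webdocs/HTMLdocs/HTML_Spec/xhtml1.0/xhtml-lat1.ent">
  %HTMLlat1;
  <!-- Special set: defines &euro; among others. -->
  <!ENTITY % HTMLspecial SYSTEM
    "http://www.utoronto.ca/webdocs/HTMLdocs/HTML_Spec/xhtml1.0/xhtml-special.ent">
  %HTMLspecial;
]>
<description>
  <p>this <b>something bold</b> is to be included &euml; &euro;</p>
</description>
```

The parser has to be allowed to fetch external entities for this to work (in MSXML that is the resolveExternals property), and pointing the SYSTEM identifiers at local copies of the files would avoid a network request on every parse.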

Regards,
Mike Sharp

Assisted Solution

by:robbert
robbert earned 150 total points
ID: 9658979
You can use TidyCOM ( http://perso.wanadoo.fr/ablavier/TidyCOM/ ) to clean up the source before loading it into a DOMDocument.

There are options for outputting XML (instead of XHTML) and converting HTML entities to their numeric equivalents.

I'm not aware of any products that compete with TidyCOM (or with HTML Tidy, the application it wraps), and I have worked with it often, even in mid-sized web applications. Because HTML Tidy is single-threaded, it should only be called from one instance at a time, so expect to restart IIS every few months or so. But, as mentioned, there doesn't seem to be an alternative.

Author Comment

by:sybe
ID: 9662346
robbert,

I had used TidyCOM to create XHTML, but I did not find the options to convert HTML entities to numeric references.
I'll look at it again, but maybe you can tell me which options they are?