Solved

Escaping HTML characters to XML

Posted on 2003-10-29
5
570 Views
Last Modified: 2013-11-19
I have some XHTML code, which I want to include as a node-with-subnodes into an XML file.
example:

xml:

<root>
    <record>
        <id>20</id>
        <name>somename</somename>
        <description/>
    </record>
</root>


xhtml to be included in "description"

<p>this<b>something bold</b>is to be included</p>



and the result would be:

<root>
    <record>
        <id>20</id>
        <name>somename</somename>
        <description>
           <p>
               this
                  <b>something bold</b>
               is to be included
           </p>
        </description>
    </record>
</root>

I am using ASP with Msxml2.DOMDocument.3.0

- - - -

The inserted XHTML comes from an ActiveX in a webform, which produces XHTML.
Everything goes well, untill there are some HTML encoded characters like &euml; (é) or &euro;
Then i can not transform the XHTML to a DOM document.

I have been working to escape the HTML encoded characters to XML encoding, replacing "&euml;"  with "&#235;". Then everything works again. This however takes a simple but long function, in which i have to replace ALL possible HTML-encoded characters with their XML-equivalent.
I wonder if their is an easier way to do it.

Any suggestion is welcome.
0
Comment
Question by:sybe
5 Comments
 
LVL 9

Expert Comment

by:sparkplug
ID: 9641486
Hi,

I think you can define the entities in a DTD using the syntax <!ENTITY euml "&#235;">.

I wonder however whether you need to be able parse the XHTML that you are inserting. If not then you could escape the XHTML section using CDATA tags as follows:

<?xml version="1.0" encoding="UTF-8"?>

<root>
    <record>
        <id>20</id>
        <name>somename</name>
        <description><![CDATA[
           <p>
               this
                  <b>something  bold</b>
               is to be included &euml;
           </p>
        ]]></description>
    </record>
</root>


>S'Plug<
0
 
LVL 28

Author Comment

by:sybe
ID: 9641713
I tried the CDATA thing, and the parsing gives no problem. However, I am transforming the XML with XSL to a browser, and the CDATA section then is displayed as text, not as HTML.
I looked for some solution and found the disable-output-escaping which works in Internet Explorer, but not in Mozilla browsers.
So the CDATA solution did/does not bring me closer to solving the problem.

I will try to do something with the DTD thing you mention. I have never worked with that, do you have some links on that?
0
 
LVL 26

Accepted Solution

by:
rdcpro earned 150 total points
ID: 9642479
There was a thread on that recently.  I posted some links to a standard entity catalog DTD, but here's one:
Latin 1 entities:
http://www.utoronto.ca/webdocs/HTMLdocs/HTML_Spec/xhtml1.0/xhtml-lat1.ent
Special entities:
http://www.utoronto.ca/webdocs/HTMLdocs/HTML_Spec/xhtml1.0/xhtml-special.ent
Symbols:
http://www.utoronto.ca/webdocs/HTMLdocs/HTML_Spec/xhtml1.0/xhtml-symbol.ent

I thought there was a definitive ISO or W3C DTD that you could include in your XML that defined all the entities (an entity catalog), but I can't seem to find it at the moment.

Regards,
Mike Sharp
0
 
LVL 15

Assisted Solution

by:robbert
robbert earned 150 total points
ID: 9658979
You can use TidyCOM ( http://perso.wanadoo.fr/ablavier/TidyCOM/ ) to clean up the source before loading it to a DOMDocument.

There are options for outputting XML (instead of XHTML) and converting HTML entities to their numeric equivalents.

I'm not aware of any concurrant products to TidyCOM, resp., HTMLTidy, and have been working with it, often, and even in mid-scaled web applications. - As HTMLTidy (the actual, wrapped application) is single-threaded, it should only be called in one instance at a time, so look forward to restart IIS every few months or so. - But, as mentioned, there doesn't seem to be an alternative.
0
 
LVL 28

Author Comment

by:sybe
ID: 9662346
robbert,

i had used TidyCom to create XHTNL, but i did not find the options to convert HTML entities to numerics.
i'll look at it again, but maybe you can tell me ?
0

Featured Post

3 Use Cases for Connected Systems

Our Dev teams are like yours. They’re continually cranking out code for new features/bugs fixes, testing, deploying, testing some more, responding to production monitoring events and more. It’s complex. So, we thought you’d like to see what’s working for us.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

Title # Comments Views Activity
Trouble parsing soap xml result 3 40
XSLT list item selection criteria not working 12 28
Test ddwrt:UserLookup 1 53
MS Access XML Export Query Setup Multiple Tag Values 15 30
Browsing the questions asked to the Experts of this forum, you will be amazed to see how many times people are headaching about monster regular expressions (regex) to select that specific part of some HTML or XML file they want to extract. The examp…
Shoutout to Emily Plummer (http://www.experts-exchange.com/members/eplummer26.html) for giving me this article! She did most of it, I just finished it up and posted it for her :)    Introduction In a previous article (http://www.experts-exchang…
Viewers will learn one way to get user input in Java. Introduce the Scanner object: Declare the variable that stores the user input: An example prompting the user for input: Methods you need to invoke in order to properly get  user input:
Learn how to create flexible layouts using relative units in CSS.  New relative units added in CSS3 include vw(viewports width), vh(viewports height), vmin(minimum of viewports height and width), and vmax (maximum of viewports height and width).

920 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

16 Experts available now in Live!

Get 1:1 Help Now