chloh

filtering illegal characters in xml documents

consider the following xml file:

<?xml version="1.0"?>
<attribute>
     <other1> > </other1>
     <other2> < </other2>
     <other3> & </other3>
</attribute>

i'm using jaxp to parse the document. i encountered the following errors when parsing the document:

C:\DOMEcho>java -classpath '.\;C:\DOMEcho;C:\lib\crimson.jar;C:\lib\jaxp.jar;C:\
lib\xalan.jar;.' DOMEcho attribute.xml
Fatal Error: URI=file:C:/DOMEcho/attribute.xml Line=4: The content beginning "<
" is not legal markup. Perhaps the " " (&#20;) character should be a letter.

my investigation reveals that a character such as '>' (within <other1> > </other1>) is invalid. any ideas for solving this? note that i cannot change '>' to its corresponding entity (the xml document is generated by velocity, a publishing framework).

any ideas for solving this so that i can parse my documents successfully? i have tried reading in the entire xml string and converting the illegal characters to their equivalents, but it doesn't work. i would appreciate it if someone could suggest a solution (or even donate some code).
klf

try converting the < to &lt;
 & becomes &amp;
chloh

ASKER

hi klf,

thanks for the comments. the input is a Java string. you'll get a parse exception when the whole xml string is converted into something like:

&lt;xmlTag1&gt;value of this tag&lt;/xmlTag1&gt;

the problem is how to convert only the illegal characters in the tag value, given an xml string. this xml string is dynamically generated.
do not change the < > characters that delimit the xml tags.

e.g

<t1> 10 < 20 </t1>

should be changed to

<t1> 10 &lt; 20 </t1>
chloh

ASKER

hi klf,

your suggestion will definitely solve my problem, but i'm accepting xml documents in the form of <t1> 10 < 20 </t1>.

note: i don't know why my postings appear a few times over here. i just clicked once.
> but i'm accepting xml document in the form of <t1> 10 < 20 </t1>.

these are not well-formed XML documents (according to the XML spec), so you cannot use a "normal" XML parser.
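
To make that concrete, here is a minimal sketch of a well-formedness check using the standard JAXP DOM parser. The class and method names (WellFormedCheck, isWellFormed) are made up for illustration and are not from this thread.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: asks a stock JAXP parser whether a string is well formed.
final class WellFormedCheck {

    /** Returns true if the parser accepts the string, false on a parse error. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // not well formed
        } catch (Exception e) {
            throw new IllegalStateException(e); // parser setup or I/O problem
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<t1> 10 < 20 </t1>"));    // raw '<' in text
        System.out.println(isWellFormed("<t1> 10 &lt; 20 </t1>")); // escaped form
    }
}
```

The raw '<' and '&' are the characters a parser rejects in element text. A bare '>' in text is tolerated by the spec, although escaping it does no harm.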
girionis
 The xml you accept should have the corresponding entities instead of the characters. The XML you are receiving is not well formed. I suggest you check the XML generator and see where the problem lies and try to fix it there.

  Hope it helps.
chloh

ASKER

thanks for your replies. i know the source of the problem. anyway, i'm coding for a rule-based component (a publishing framework which takes input word by word, invoked from java objects). as such, i don't really care whether the xml is well formed or not.

when the xml string is passed on to another component (which parses the string), this is where it fails (the xml document is not well formed). i have been trying to filter out the illegal characters but i'm really stuck here.

i would appreciate it if someone could help me.
ASKER CERTIFIED SOLUTION
heyhey_

if you are using a limited set of XML tags, then you can escape all special characters that are not part of <oneOfYourTags> or </oneOfYourTags>
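
A rough sketch of that idea, assuming the tags have no attributes and the input has no XML declaration, comments or existing entities. The class name, method names and tag list are hypothetical; swap in the tags your generator really emits.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: escapes special characters everywhere except inside known tags.
final class KnownTagEscaper {

    // Assumed whitelist of tag names; anything else that looks like markup gets escaped.
    private static final Pattern KNOWN_TAG =
            Pattern.compile("</?(?:attribute|other1|other2|other3|t1)>");

    static String escapeOutsideKnownTags(String xml) {
        StringBuilder out = new StringBuilder(xml.length() + 16);
        Matcher m = KNOWN_TAG.matcher(xml);
        int last = 0;
        while (m.find()) {
            appendEscaped(out, xml, last, m.start()); // text before the tag
            out.append(m.group());                    // the tag itself, untouched
            last = m.end();
        }
        appendEscaped(out, xml, last, xml.length());  // trailing text
        return out.toString();
    }

    private static void appendEscaped(StringBuilder out, String s, int from, int to) {
        for (int i = from; i < to; i++) {
            char c = s.charAt(i);
            switch (c) {
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '&': out.append("&amp;"); break;
                default:  out.append(c);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(escapeOutsideKnownTags("<t1> 10 < 20 </t1>"));
    }
}
```

Two caveats: a value that literally contains one of the whitelisted tags will still be treated as markup, and input that is already escaped will be escaped twice ("&lt;" becomes "&amp;lt;").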
According to the XML spec, < and & must always be escaped in text; >, ' and " have predefined entities as well, so escaping all five is the safe default.

I use a method like

    /**
     * Encode a string safe for embedding in a default XML structure
     * without the need for specific entity mappings. A String is
     * returned.
     **/
    public static String encodeXMLSafe(String text) {
        return _encodeXMLSafe(text).toString();
    }

    /**
     * Encode a string safe for embedding in a default XML structure
     * without the need for specific entity mappings. A StringBuffer
     * is returned.
     **/
    public static StringBuffer _encodeXMLSafe(String text) {
        int len = text.length();
        // Reserve about 20% extra room for the entity expansions.
        StringBuffer buf = new StringBuffer(len * 12 / 10);
        for (int i = 0; i < len; i++) {
            char c = text.charAt(i);
            if (c == '<') {
                buf.append("&lt;");
            } else if (c == '>') {
                buf.append("&gt;");
            } else if (c == '&') {
                buf.append("&amp;");
            } else if (c == '\'') {
                buf.append("&apos;");
            } else if (c == '"') {
                buf.append("&quot;");
            } else {
                buf.append(c);
            }
        }
        return buf;
    }

to make strings XML safe before writing them out.

Cheers,
--gandalf.
 Will the above not replace *all* such characters with entities, including those in the XML tags (i.e. <mytag> will become &lt;mytag&gt;)?
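
It would. A character-level encoder cannot tell markup from text, so it only helps when it is applied to each value before the tags are wrapped around it. A small sketch of the difference follows; the class and method names are illustrative only, and escape() is a stand-in that covers just the three characters needed here.

```java
// Hypothetical demo: escaping a whole document versus escaping only the value.
final class EscapeScopeDemo {

    // Stand-in encoder. '&' must be replaced first so the entities
    // produced by the later replacements are not escaped again.
    static String escape(String text) {
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;");
    }

    /** Escapes only the value, then wraps it in a tag name the caller trusts. */
    static String element(String tag, String value) {
        return "<" + tag + ">" + escape(value) + "</" + tag + ">";
    }

    public static void main(String[] args) {
        // Whole document: the tag delimiters are escaped as well.
        System.out.println(escape("<t1> 10 < 20 </t1>"));
        // Value only: the tags survive and the text becomes legal.
        System.out.println(element("t1", " 10 < 20 "));
    }
}
```

This only works where the generator still has the value and the tag as separate pieces, which is why fixing it at the generation step is the cleaner route.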
No comment has been added lately, so it's time to clean up this TA.

I will leave a recommendation in the Cleanup topic area that this question is:

- points to heyhey_

Please leave any comments here within the
next seven days.

PLEASE DO NOT ACCEPT THIS COMMENT AS AN ANSWER !

girionis
Cleanup Volunteer