Filtering illegal characters in XML documents

i have the following xml document:

<?xml version="1.0"?>
<attribute>
     <other1> > </other1>
     <other2> < </other2>
     <other3> & </other3>
</attribute>

i'm using jaxp to parse the document. i encountered the following errors when parsing the document:

C:\DOMEcho>java -classpath '.\;C:\DOMEcho;C:\lib\crimson.jar;C:\lib\jaxp.jar;C:\
lib\xalan.jar;.' DOMEcho attribute.xml
Fatal Error: URI=file:C:/DOMEcho/attribute.xml Line=4: The content beginning "<
" is not legal markup. Perhaps the " " (&#20;) character should be a letter.

my investigation suggests that a character such as '>' (within <other1> > </other1>) is invalid. any ideas for solving this? note that i cannot change '>' to its corresponding entity (the xml document is generated by velocity, a publishing framework).

any ideas for solving this so that i can parse my documents successfully? i have tried reading in the entire xml string and converting the illegal characters to their equivalents, but it doesn't work. i would appreciate it if someone could suggest a solution (or even donate some code).
chlohAsked:
 
heyhey_Commented:
there isn't an easy way unless you can precisely define what "illegal characters" are.

e.g. in an expression like

<other2> 1 < 2 > 3 < / 4 > </other2>

how do you decide which "<" and ">" are illegal and which are not?
 
klfCommented:
try converting < to &lt; and & to &amp;
 
chlohAuthor Commented:
hi klf,

thanks for the comments. the input is a Java string. you'll encounter a parse exception when the xml string is something like:

&lt;xmlTag1&gt;value of this tag&lt;/xmlTag1&gt;

the problem is how to convert the illegal characters in the tag values, given an xml string. this xml string is dynamically generated.
 
klfCommented:
do not change the < > characters that delimit the xml tags.

e.g

<t1> 10 < 20 </t1>

should be changed to

<t1> 10 &lt; 20 </t1>
 
chlohAuthor Commented:
hi klf,

your suggestion would definitely solve my problem, but i'm receiving xml documents in the form of <t1> 10 < 20 </t1>.

note: i don't know why my postings appear a few times here. i only clicked once.
 
heyhey_Commented:
> but i'm accepting xml document in the form of <t1> 10 < 20 </t1>.

these are not well-formed XML documents (according to the XML spec), so you cannot use a "normal" XML parser.
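
A quick way to see which of these characters a standard parser actually rejects is to feed small strings to JAXP. The following is only a sketch; the class name WellFormedCheck and the method firstError are made up for this example. Note that a bare > in character data is well formed, which fits the Line=4 in the original error output (the <other2> line); it is the bare < and & that a conforming parser refuses.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class WellFormedCheck {

    /**
     * Returns null if the string parses as XML, otherwise a short
     * description of the first fatal error. (Hypothetical helper.)
     */
    public static String firstError(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return null;
        } catch (SAXParseException e) {
            return "line " + e.getLineNumber() + ": " + e.getMessage();
        } catch (Exception e) {
            return e.toString();
        }
    }

    public static void main(String[] args) {
        String[] samples = {
            "<other1> > </other1>",     // bare > is allowed in text
            "<other2> < </other2>",     // bare < is not
            "<other3> & </other3>",     // bare & is not
            "<other2> &lt; </other2>"   // escaped form parses
        };
        for (String s : samples) {
            System.out.println(s + "  ->  " + firstError(s));
        }
    }
}
```

The exact error text depends on the parser implementation on the classpath, so only the null / non-null distinction should be relied on.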
 
girionisCommented:
 The xml you accept should have the corresponding entities instead of the characters. The XML you are receiving is not well formed. I suggest you check the XML generator and see where the problem lies and try to fix it there.

  Hope it helps.
 
chlohAuthor Commented:
thanks for your replies. i know the source of the problem. anyway, i'm coding for a rule-based component (a publishing framework which takes its input word by word, invoked from java objects). as such, i don't really care whether the xml is well formed or not.

when the xml string is passed on to another component (which parses the string), this is where it fails (the xml document is not well formed). i have been trying to filter out the illegal characters but i'm really stuck here.

would appreciate if someone could help me.
 
heyhey_Commented:
if you are using a limited set of XML tags, then you can escape all special chars that are not part of <oneOfYourTags> or </oneOfYourTags>.
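
One possible reading of this suggestion, as a rough sketch: treat only the known tag names (plus the XML declaration) as markup and escape <, > and & everywhere else. The class name KnownTagEscaper is hypothetical, the tags are assumed to carry no attributes, and text that is already escaped (e.g. &lt;) would be escaped a second time. Text that literally contains one of the known tags is still taken as markup, so the ambiguity mentioned earlier in the thread does not go away.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class KnownTagEscaper {

    private final Pattern markup;

    /** tagNames: the limited set of element names that count as real tags. */
    public KnownTagEscaper(String... tagNames) {
        StringBuilder alt = new StringBuilder();
        for (String name : tagNames) {
            if (alt.length() > 0) {
                alt.append('|');
            }
            alt.append(Pattern.quote(name));
        }
        // Matches the XML declaration, <name> and </name> (no attributes).
        markup = Pattern.compile("<\\?xml[^>]*\\?>|</?(?:" + alt + ")>");
    }

    /** Escapes <, > and & in everything that is not recognised markup. */
    public String escape(String xml) {
        StringBuilder out = new StringBuilder(xml.length() + 16);
        Matcher m = markup.matcher(xml);
        int pos = 0;
        while (m.find()) {
            appendEscaped(out, xml, pos, m.start()); // text before the tag
            out.append(m.group());                   // the tag itself, untouched
            pos = m.end();
        }
        appendEscaped(out, xml, pos, xml.length());  // trailing text
        return out.toString();
    }

    private static void appendEscaped(StringBuilder out, String s, int from, int to) {
        for (int i = from; i < to; i++) {
            char c = s.charAt(i);
            switch (c) {
                case '<': out.append("&lt;");  break;
                case '>': out.append("&gt;");  break;
                case '&': out.append("&amp;"); break;
                default:  out.append(c);
            }
        }
    }

    public static void main(String[] args) {
        KnownTagEscaper escaper = new KnownTagEscaper("t1", "other2");
        System.out.println(escaper.escape("<t1> 10 < 20 </t1>"));
        // prints: <t1> 10 &lt; 20 </t1>
    }
}
```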
 
gandalf94305Commented:
According to the XML spec, < and & must always be escaped in text content; >, ' and " are commonly escaped as well (the quotes matter inside attribute values).

I use a method like

    /**
     * Encode a string safe for embedding in a default XML structure
     * without the need for specific entity mappings. A String is
     * returned.
     */
    public static String encodeXMLSafe(String text) {
        return _encodeXMLSafe(text).toString();
    }

    /**
     * Encode a string safe for embedding in a default XML structure
     * without the need for specific entity mappings. A StringBuffer
     * is returned.
     */
    public static StringBuffer _encodeXMLSafe(String text) {
        int len = text.length();
        // Reserve roughly 20% extra room for the longer entity strings.
        StringBuffer buf = new StringBuffer(len * 12 / 10);
        for (int i = 0; i < len; i++) {
            char c = text.charAt(i);
            if (c == '<') {
                buf.append("&lt;");
            } else if (c == '>') {
                buf.append("&gt;");
            } else if (c == '&') {
                buf.append("&amp;");
            } else if (c == '\'') {
                buf.append("&apos;");
            } else if (c == '"') {
                buf.append("&quot;");
            } else {
                buf.append(c);
            }
        }
        return buf;
    }

to make strings XML safe before writing them out.

Cheers,
--gandalf.
 
girionisCommented:
Won't the above replace *all* such characters with entities, including those in the XML tags (i.e. <mytag> will become &lt;mytag&gt;)?
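
To illustrate the concern: an escaping routine of this kind has no notion of markup, so it only helps if it is applied to each value before the tags are wrapped around it, not to the finished document. A minimal sketch, assuming the tag names themselves are trusted; the class EscapeValuesOnly and the helper element are hypothetical names for this example.

```java
public class EscapeValuesOnly {

    /** Escapes the five predefined XML entities in a piece of text. */
    static String escape(String text) {
        StringBuilder buf = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '<':  buf.append("&lt;");   break;
                case '>':  buf.append("&gt;");   break;
                case '&':  buf.append("&amp;");  break;
                case '\'': buf.append("&apos;"); break;
                case '"':  buf.append("&quot;"); break;
                default:   buf.append(c);
            }
        }
        return buf.toString();
    }

    /** Wraps an escaped value in a tag; the tag name is not escaped. */
    static String element(String tag, String value) {
        return "<" + tag + ">" + escape(value) + "</" + tag + ">";
    }

    public static void main(String[] args) {
        // Escaping the finished document also escapes the markup:
        System.out.println(escape("<t1> 10 < 20 </t1>"));
        // prints: &lt;t1&gt; 10 &lt; 20 &lt;/t1&gt;

        // Escaping only the value keeps the tags intact:
        System.out.println(element("t1", " 10 < 20 "));
        // prints: <t1> 10 &lt; 20 </t1>
    }
}
```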
 
girionisCommented:
No comment has been added lately, so it's time to clean up this TA.

I will leave a recommendation in the Cleanup topic area that this question is:

- points to heyhey_

Please leave any comments here within the
next seven days.

PLEASE DO NOT ACCEPT THIS COMMENT AS AN ANSWER !

girionis
Cleanup Volunteer
Question has a verified solution.
