Solved

filtering illegal characters in xml documents

Posted on 2002-07-11
Last Modified: 2013-11-23
i have the following xml document:

<?xml version="1.0"?>
<attribute>
     <other1> > </other1>
     <other2> < </other2>
     <other3> & </other3>
</attribute>

i'm using jaxp to parse the document. i encountered the following errors when parsing the document:

C:\DOMEcho>java -classpath '.\;C:\DOMEcho;C:\lib\crimson.jar;C:\lib\jaxp.jar;C:\
lib\xalan.jar;.' DOMEcho attribute.xml
Fatal Error: URI=file:C:/DOMEcho/attribute.xml Line=4: The content beginning "<
" is not legal markup. Perhaps the " " (&#20;) character should be a letter.

my investigation reveals that a character such as '>' (within <other1> > </other1>) is invalid. any ideas on solving this? note that i cannot change '>' to its corresponding entity (the xml document is generated by velocity, a publishing framework).

any ideas on solving this so that i can parse my documents successfully? i have tried reading in the entire xml string and converting the illegal characters to their equivalents, but it doesn't work. i would appreciate it if someone could suggest a solution (or even donate some code).
Question by:chloh
15 Comments
 
LVL 1

Expert Comment

by:klf
ID: 7148036
try converting the < to &lt;
 & becomes &amp;
 

Author Comment

by:chloh
ID: 7148060
hi klf,

thanks for the comments. the input is a Java string. you'll encounter a parse exception when the xml string is something like:

&lt;xmlTag1&gt;value of this tag&lt;/xmlTag1&gt;

the problem is how to convert illegal characters in the tag value when given an xml string. this xml string is dynamically generated.
 
LVL 1

Expert Comment

by:klf
ID: 7148090
do not change the < > characters that delimit the xml tags.

e.g

<t1> 10 < 20 </t1>

should be changed to

<t1> 10 &lt; 20 </t1>
 

Author Comment

by:chloh
ID: 7148453
hi klf,

your suggestion would definitely solve my problem, but i'm receiving xml documents in the form of <t1> 10 < 20 </t1>.

note: i don't know why my postings appear several times here. i only clicked once.
 
LVL 16

Expert Comment

by:heyhey_
ID: 7148557
> but i'm accepting xml document in the form of <t1> 10 < 20 </t1>.

these are not well-formed XML documents (according to the XML spec), so you cannot use a "normal" XML parser on them.
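
A quick way to see this with the standard JAXP API is sketched below. The class and method names (WellFormedCheck, isWellFormed) are made up for illustration; only the javax.xml.parsers and org.xml.sax calls are real. It just asks the default parser whether a string is well formed, which is roughly what the downstream component does before failing.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class WellFormedCheck {

    /** Returns true if the default JAXP parser accepts the string as well-formed XML. */
    public static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // Not well formed (or otherwise rejected by the parser).
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<t1> 10 < 20 </t1>"));    // raw '<' in content
        System.out.println(isWellFormed("<t1> 10 &lt; 20 </t1>")); // escaped '<'
    }
}
```

The first string is rejected and the second is accepted, so any fix has to happen before the string reaches the parser.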
 
LVL 35

Expert Comment

by:girionis
ID: 7148595
 The xml you accept should have the corresponding entities instead of the characters. The XML you are receiving is not well formed. I suggest you check the XML generator and see where the problem lies and try to fix it there.

  Hope it helps.
 

Author Comment

by:chloh
ID: 7150957
thanks for your replies. i know the source of the problem. anyway, i'm coding for a rule-based component (a publishing framework which takes its input word by word, invoked from java objects). as such, i don't really care whether the xml is well formed or not.

when the xml string is passed on to another component (which parses the string), this is where it fails (the xml document is not well formed). i have been trying to filter out the illegal characters but am really stuck here.

would appreciate if someone could help me.
 
LVL 16

Accepted Solution

by:
heyhey_ earned 400 total points
ID: 7153480
there isn't an easy way unless you can precisely define what the "illegal characters" are.

e.g. in an expression like

<other2> 1 < 2 > 3 < / 4 > </other2>

how do you decide which "<" and ">" are illegal and which are not?
 
LVL 16

Expert Comment

by:heyhey_
ID: 7153485
if you are using a limited set of XML tags, then you can escape all special chars that are not part of <oneOfYourTags> and </oneOfYourTags>
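
A minimal sketch of that idea, assuming the full set of tag names is known in advance and the tags carry no attributes. The class name KnownTagEscaper and the tag list (taken from the sample document in the question) are illustrative, not part of any library. Everything that matches a known tag or the XML declaration is copied through unchanged; every other '&', '<' and '>' is escaped.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class KnownTagEscaper {

    // Assumption: these are the only tags that can occur, with no attributes.
    // The first alternative keeps the <?xml ...?> declaration intact.
    private static final Pattern MARKUP = Pattern.compile(
            "<\\?xml[^?]*\\?>|</?(?:attribute|other1|other2|other3)>");

    /** Escapes '&', '<' and '>' everywhere except inside recognised markup. */
    public static String escapeOutsideKnownTags(String xml) {
        StringBuilder out = new StringBuilder(xml.length() + 16);
        Matcher m = MARKUP.matcher(xml);
        int pos = 0;
        while (m.find()) {
            appendEscaped(out, xml, pos, m.start()); // text before the tag
            out.append(m.group());                   // the tag itself, untouched
            pos = m.end();
        }
        appendEscaped(out, xml, pos, xml.length());  // text after the last tag
        return out.toString();
    }

    private static void appendEscaped(StringBuilder out, String s, int from, int to) {
        for (int i = from; i < to; i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;");  break;
                case '>': out.append("&gt;");  break;
                default:  out.append(c);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(escapeOutsideKnownTags("<other2> 1 < 2 > 3 < / 4 > </other2>"));
        // prints: <other2> 1 &lt; 2 &gt; 3 &lt; / 4 &gt; </other2>
    }
}
```

Two caveats: input that is already partly escaped gets double-escaped (&lt; becomes &amp;lt;), and a value that literally contains text like <other1> is still treated as a tag, so the ambiguity is only resolved when values never look like your tags.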
 
LVL 3

Expert Comment

by:gandalf94305
ID: 7165188
According to the XML spec, < and & must always be escaped in character data; >, ' and " also have predefined entities and are safest escaped as well.

I use a method like

    /**
     * Encode a string safe for embedding in a default XML structure
     * without the need for specific entity mappings. A String is
     * returned.
     **/
    public static String encodeXMLSafe(String text) {
        return _encodeXMLSafe(text).toString();
    }

    /**
     * Encode a string safe for embedding in a default XML structure
     * without the need for specific entity mappings. A StringBuffer
     * is returned.
     **/
    public static StringBuffer _encodeXMLSafe(String text) {
        int len = text.length();
        // Reserve about 20% extra room for the entity expansions.
        StringBuffer buf = new StringBuffer(len * 12 / 10);
        for (int i = 0; i < len; i++) {
            char c = text.charAt(i);
            if (c == '<') {
                buf.append("&lt;");
            } else if (c == '>') {
                buf.append("&gt;");
            } else if (c == '&') {
                buf.append("&amp;");
            } else if (c == '\'') {
                buf.append("&apos;");
            } else if (c == '"') {
                // Needed when the text ends up inside a double-quoted attribute value.
                buf.append("&quot;");
            } else {
                buf.append(c);
            }
        }
        return buf;
    }

to make strings XML safe before writing them out.

Cheers,
--gandalf.
 
LVL 35

Expert Comment

by:girionis
ID: 7165687
Will the above not replace *all* such characters with entities, including those in the XML tags (i.e. <mytag> will become &lt;mytag&gt;)?
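
It would, if it were run over the whole document. A method like the one above is normally applied to each value before the value is wrapped in its tags, so the tags themselves never pass through it. The sketch below shows that usage; ValueEscapeDemo, escape and element are hypothetical names, and escape is a compact stand-in for the encodeXMLSafe method posted earlier.

```java
public class ValueEscapeDemo {

    /** Escapes the five characters that have predefined XML entities. */
    public static String escape(String text) {
        StringBuilder buf = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '<':  buf.append("&lt;");   break;
                case '>':  buf.append("&gt;");   break;
                case '&':  buf.append("&amp;");  break;
                case '\'': buf.append("&apos;"); break;
                case '"':  buf.append("&quot;"); break;
                default:   buf.append(c);
            }
        }
        return buf.toString();
    }

    /** Builds one element; only the value is escaped, never the tag. */
    public static String element(String name, String value) {
        return "<" + name + ">" + escape(value) + "</" + name + ">";
    }

    public static void main(String[] args) {
        System.out.println(element("t1", " 10 < 20 "));
        // prints: <t1> 10 &lt; 20 </t1>
    }
}
```

This only works when the code that generates the XML can be changed to escape values as it writes them; it does not help with a string that already arrives with tags and raw values mixed together.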
 
LVL 35

Expert Comment

by:girionis
ID: 8917269
No comment has been added lately, so it's time to clean up this TA.

I will leave a recommendation in the Cleanup topic area that this question is:

- points to heyhey_

Please leave any comments here within the
next seven days.

PLEASE DO NOT ACCEPT THIS COMMENT AS AN ANSWER !

girionis
Cleanup Volunteer