Solved

filtering illegal characters in xml documents

Posted on 2002-07-11
Last Modified: 2013-11-23
i have the following xml file:

<?xml version="1.0"?>
<attribute>
     <other1> > </other1>
     <other2> < </other2>
     <other3> & </other3>
</attribute>

i'm using jaxp to parse the document. i encountered the following errors when parsing the document:

C:\DOMEcho>java -classpath '.\;C:\DOMEcho;C:\lib\crimson.jar;C:\lib\jaxp.jar;C:\
lib\xalan.jar;.' DOMEcho attribute.xml
Fatal Error: URI=file:C:/DOMEcho/attribute.xml Line=4: The content beginning "<
" is not legal markup. Perhaps the " " (&#20;) character should be a letter.

my investigation reveals that a character such as '>' (within <other1> > </other1>) is invalid. any ideas for solving this? note that i cannot change '>' to its corresponding entity (the xml document is generated by velocity, a publishing framework).

any ideas on how to solve this so that i can parse my documents successfully? i have tried reading in the entire xml string and converting the illegal characters to their equivalents, but it doesn't work. i would appreciate it if someone could suggest a solution (or even donate some code).
Question by:chloh
 
LVL 1

Expert Comment

by:klf
ID: 7148036
try converting the < to &lt;
 & becomes &amp;
 

Author Comment

by:chloh
ID: 7148060
hi klf,

thanks for the comments. the input is a Java string. you'll encounter parse exception when the xml string is something like:

&lt;xmlTag1&gt;value of this tag&lt;/xmlTag1&gt;

the problem is how to convert the illegal characters in the tag values when given an xml string. this xml string is dynamically generated.
 
LVL 1

Expert Comment

by:klf
ID: 7148090
do not change the < > characters that delimit the xml tags.

e.g.

<t1> 10 < 20 </t1>

should be changed to

<t1> 10 &lt; 20 </t1>
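
The difference can be checked directly with JAXP, the API mentioned in the question. The following is only a minimal sketch: the class and method names (WellFormedCheck, rootText) are made up for illustration and do not come from this thread. It shows that the escaped form parses and yields the original "<" again, while the raw form is rejected.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class WellFormedCheck {

    /**
     * Parses the given XML string with JAXP and returns the text of the
     * root element, or null if the parser rejects the input.
     * (Hypothetical helper, written for this illustration only.)
     */
    static String rootText(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)))
                          .getDocumentElement()
                          .getTextContent();
        } catch (SAXException e) {
            // Thrown for input that is not well formed, e.g. a bare '<' in content.
            return null;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Escaped content parses, and the parser hands back the original character.
        System.out.println("[" + rootText("<t1> 10 &lt; 20 </t1>") + "]");
        // The raw form is rejected (the default error handler may also log to stderr).
        System.out.println(rootText("<t1> 10 < 20 </t1>"));
    }
}
```

The application that reads the parsed document never sees "&lt;"; the entity only exists in the serialized text.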
 

Author Comment

by:chloh
ID: 7148453
hi klf,

your suggestion will definitely solve my problem but i'm accepting xml document in the form of <t1> 10 < 20 </t1>.

note: i don't know why my postings appear few times over here. i just click once.
 
LVL 16

Expert Comment

by:heyhey_
ID: 7148557
> but i'm accepting xml document in the form of <t1> 10 < 20 </t1>.

these are not well-formed XML documents (according to the XML spec), so you cannot use a "normal" XML parser on them.
 
LVL 35

Expert Comment

by:girionis
ID: 7148595
 The xml you accept should have the corresponding entities instead of the characters. The XML you are receiving is not well formed. I suggest you check the XML generator and see where the problem lies and try to fix it there.

  Hope it helps.
 

Author Comment

by:chloh
ID: 7150957
thanks for your replies. i know the source of the problem. anyway, i'm coding for a rule-based component (a publishing framework which takes its input word by word, invoked from java objects). as such, i don't really care whether the xml is well formed or not.

when the xml string is passed on to another component (which parses the string), this is where it fails (the xml document is not well formed). i have been trying to filter out the illegal characters but i'm really stuck here.

would appreciate if someone could help me.
 
LVL 16

Accepted Solution

by:
heyhey_ earned 100 total points
ID: 7153480
there isn't an easy way unless you can precisely define what an "illegal character" is.

e.g. in an expression like

<other2> 1 < 2 > 3 < / 4 > </other2>

how do you decide which "<" and ">" are illegal and which are not?
 
LVL 16

Expert Comment

by:heyhey_
ID: 7153485
if you are using a limited set of XML tags, then you can escape every "<", ">" and "&" that is not part of <oneOfYourTags> or </oneOfYourTags>
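
One way to sketch this idea in Java is shown below. It is an illustration under stated assumptions, not a general repair tool: the class name KnownTagEscaper is hypothetical, the known tags are assumed to carry no attributes, the input is assumed to contain no entities yet (an existing "&lt;" would be escaped a second time), and the content is assumed never to contain the literal text of a known tag. Anything that is not a known tag or the XML declaration is treated as content and escaped.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KnownTagEscaper {

    private final Pattern markup;

    /** tagNames is the limited set of element names the documents may use. */
    KnownTagEscaper(List<String> tagNames) {
        StringBuilder names = new StringBuilder();
        for (String name : tagNames) {
            if (names.length() > 0) {
                names.append('|');
            }
            names.append(Pattern.quote(name));
        }
        // Recognized markup: the XML declaration, or <name> / </name> for a
        // known name. Tags with attributes are deliberately not supported.
        markup = Pattern.compile("<\\?xml[^>]*\\?>|</?(?:" + names + ")>");
    }

    /** Copies recognized markup verbatim and escapes everything in between. */
    String escape(String xml) {
        StringBuilder out = new StringBuilder(xml.length() + 16);
        Matcher m = markup.matcher(xml);
        int pos = 0;
        while (m.find()) {
            appendEscaped(out, xml, pos, m.start());
            out.append(m.group());
            pos = m.end();
        }
        appendEscaped(out, xml, pos, xml.length());
        return out.toString();
    }

    private static void appendEscaped(StringBuilder out, String s, int from, int to) {
        for (int i = from; i < to; i++) {
            char c = s.charAt(i);
            switch (c) {
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '&': out.append("&amp;"); break;
                default:  out.append(c);
            }
        }
    }

    public static void main(String[] args) {
        KnownTagEscaper escaper =
            new KnownTagEscaper(Arrays.asList("attribute", "other1", "other2", "other3"));
        System.out.println(escaper.escape("<other2> 1 < 2 > 3 < / 4 > </other2>"));
    }
}
```

The ambiguity raised above does not go away: if a value can legitimately contain the text of one of the known tags, this approach cannot tell it apart from real markup.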
 
LVL 3

Expert Comment

by:gandalf94305
ID: 7165188
According to the XML spec, < and & must always be escaped in text, and it is safest to escape >, ' and " as well (the quotes matter inside attribute values).

I use a method like

    /**
     * Encode a string safe for embedding in a default XML structure
     * without the need for specific entity mappings. A String is
     * returned.
     */
    public static String encodeXMLSafe(String text) {
        return _encodeXMLSafe(text).toString();
    }

    /**
     * Encode a string safe for embedding in a default XML structure
     * without the need for specific entity mappings. A StringBuffer
     * is returned.
     */
    public static StringBuffer _encodeXMLSafe(String text) {
        int len = text.length();
        // Reserve about 20% extra room for the expanded entities.
        StringBuffer buf = new StringBuffer(len * 12 / 10);
        for (int i = 0; i < len; i++) {
            char c = text.charAt(i);
            if (c == '<') {
                buf.append("&lt;");
            } else if (c == '>') {
                buf.append("&gt;");
            } else if (c == '&') {
                buf.append("&amp;");
            } else if (c == '\'') {
                buf.append("&apos;");
            } else if (c == '"') {
                buf.append("&quot;");
            } else {
                buf.append(c);
            }
        }
        return buf;
    }

to make strings XML safe before writing them out.

Cheers,
--gandalf.
 
LVL 35

Expert Comment

by:girionis
ID: 7165687
 Will the above not replace *all* such characters with entities, including those in the XML tags (i.e. <mytag> will become &lt;mytag&gt;)?
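
An encoder of this kind does escape the tags too if it is run over a complete document, so it only helps when it is applied to each value before the markup is put around it. A minimal sketch of that ordering follows; the class EscapeValuesOnly and its element helper are hypothetical names for this illustration, and the escape method is a compact stand-in for an encoder like the one posted above.

```java
class EscapeValuesOnly {

    /** Escapes the five characters that have predefined XML entities. */
    static String escape(String text) {
        StringBuilder buf = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '<':  buf.append("&lt;");   break;
                case '>':  buf.append("&gt;");   break;
                case '&':  buf.append("&amp;");  break;
                case '\'': buf.append("&apos;"); break;
                case '"':  buf.append("&quot;"); break;
                default:   buf.append(c);
            }
        }
        return buf.toString();
    }

    /**
     * Builds one element: only the value is escaped, the tags are written
     * as they are. The tag name is assumed to be a valid XML name.
     */
    static String element(String tag, String value) {
        return "<" + tag + ">" + escape(value) + "</" + tag + ">";
    }

    public static void main(String[] args) {
        // Escaping the value first keeps the markup intact.
        System.out.println(element("other2", " 1 < 2 "));
        // Escaping the finished document turns the tags into text as well.
        System.out.println(escape("<other2> 1 < 2 </other2>"));
    }
}
```

This only works where the values and the markup are still separate, i.e. in the code that generates the document.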
 
LVL 35

Expert Comment

by:girionis
ID: 8917269
No comment has been added lately, so it's time to clean up this TA.

I will leave a recommendation in the Cleanup topic area that this question is:

- points to heyhey_

Please leave any comments here within the
next seven days.

PLEASE DO NOT ACCEPT THIS COMMENT AS AN ANSWER !

girionis
Cleanup Volunteer