chloh
filtering illegal characters in xml documents
consider the following xml file:
<?xml version="1.0"?>
<attribute>
<other1> > </other1>
<other2> < </other2>
<other3> & </other3>
</attribute>
i'm using jaxp to parse the document. i encountered the following errors when parsing the document:
C:\DOMEcho>java -classpath '.\;C:\DOMEcho;C:\lib\crimson.jar;C:\lib\jaxp.jar;C:\lib\xalan.jar;.' DOMEcho attribute.xml
Fatal Error: URI=file:C:/DOMEcho/attribute.xml Line=4: The content beginning "<
" is not legal markup. Perhaps the " " () character should be a letter.
my investigation reveals that a character such as '>' (within <other1> > </other1>) is invalid. note that i cannot change '>' to its corresponding entity at the source (the xml document is generated by velocity, a publishing framework).
any ideas on solving this so that i can parse my documents successfully? i have tried reading in the entire xml string and converting the illegal characters to their equivalents, but it doesn't work. i would appreciate it if someone could suggest a solution (or even donate some code).
ASKER
hi klf,
thanks for the comments. the input is a Java string. you'll get a parse exception when the xml string is something like:
<xmlTag1>value of this tag</xmlTag1>
the problem is how to convert the illegal characters in the tag value, given an xml string. this xml string is dynamically generated.
do not change the < > characters that delimit the xml tags.
e.g.
<t1> 10 < 20 </t1>
should be changed to
<t1> 10 &lt; 20 </t1>
ASKER
hi klf,
your suggestion would definitely solve my problem, but the xml documents i receive are in the form of <t1> 10 < 20 </t1>.
note: i don't know why my postings appear several times here. i only clicked once.
> but i'm accepting xml document in the form of <t1> 10 < 20 </t1>.
these are not well-formed XML documents (according to the XML spec), so you cannot use a "normal" XML parser on them.
The xml you accept should have the corresponding entities instead of the characters. The XML you are receiving is not well formed. I suggest you check the XML generator and see where the problem lies and try to fix it there.
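A minimal sketch of what fixing it at the generator could look like, assuming the generating side is Java code you control (which may not hold for a template-driven generator such as Velocity). Instead of concatenating strings, the value is added as a DOM text node and serialized with the standard JAXP Transformer, which takes care of the entities. The class name GenerateWellFormed and the method buildElement are made up for illustration.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper: builds <tagName>value</tagName> through the DOM API
// so that the serializer, not hand-written code, escapes the text content.
class GenerateWellFormed {

    static String buildElement(String tagName, String value) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element el = doc.createElement(tagName);
            // The raw value goes in as a text node; no manual escaping here.
            el.appendChild(doc.createTextNode(value));
            doc.appendChild(el);

            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            // Wrapped so callers do not have to deal with checked exceptions.
            throw new IllegalStateException("could not serialize element", e);
        }
    }

    public static void main(String[] args) {
        // The '<' in the value should come out as an entity in the result.
        System.out.println(buildElement("t1", " 10 < 20 "));
    }
}
```

The string returned this way can be handed to any XML parser, because the markup characters inside the value never reach the output unescaped.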
Hope it helps.
ASKER
thanks for your replies. i know the source of the problem. anyway, i'm coding for a rule-based component (a publishing framework which takes in its input word by word, invoked from java objects). as such, i don't really care whether the xml is well-formed or not.
when the xml string is passed on to another component (which parses the string), this is where it fails (the xml document is not well-formed). i have been trying to filter out the illegal characters but i'm really stuck here.
would appreciate it if someone could help me.
ASKER CERTIFIED SOLUTION
if you are using a limited set of XML tags, then you can escape every special character that is not part of <oneOfYourTags> or </oneOfYourTags>
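A rough sketch of that idea, assuming the set of tag names is known in advance and that the tags carry no attributes. The class KnownTagEscaper and the tag list in the pattern are illustrative assumptions, not something taken from the asker's system. Anything matching a known opening or closing tag is copied through unchanged; everything in between has <, > and & replaced by entities.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: treats only a fixed whitelist of tags as markup
// and escapes the special characters found anywhere else.
class KnownTagEscaper {

    // Assumed whitelist; replace with the tag names actually in use.
    private static final Pattern TAG =
            Pattern.compile("</?(?:attribute|other1|other2|other3|t1)>");

    // Escapes the three characters that break element content.
    static String escapeText(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '&': sb.append("&amp;"); break;
                default:  sb.append(c);
            }
        }
        return sb.toString();
    }

    // Copies known tags verbatim and escapes the text between them.
    static String escapeOutsideKnownTags(String xml) {
        StringBuilder out = new StringBuilder(xml.length() + 16);
        Matcher m = TAG.matcher(xml);
        int last = 0;
        while (m.find()) {
            out.append(escapeText(xml.substring(last, m.start())));
            out.append(m.group());
            last = m.end();
        }
        out.append(escapeText(xml.substring(last)));
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: <t1> 10 &lt; 20 </t1>
        System.out.println(escapeOutsideKnownTags("<t1> 10 < 20 </t1>"));
    }
}
```

This has clear limits: an XML declaration, attributes, comments, or entities that are already escaped would all be mangled, and a value that happens to contain the literal text of a known tag is indistinguishable from markup. It is only workable when the input is as simple as the examples in this thread.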
According to the XML spec, the characters <, >, ', ", & should be escaped.
I use a method like

/**
 * Encode a string safe for embedding in a default XML structure
 * without the need for specific entity mappings. A String is
 * returned.
 **/
public static String encodeXMLSafe(String text) {
    return _encodeXMLSafe(text).toString();
}

/**
 * Encode a string safe for embedding in a default XML structure
 * without the need for specific entity mappings. A StringBuffer
 * is returned.
 **/
public static StringBuffer _encodeXMLSafe(String text) {
    int len = text.length();
    // Reserve roughly 20% extra room for the entity expansions.
    StringBuffer buf = new StringBuffer(len * 12 / 10);
    for (int i = 0; i < len; i++) {
        char c = text.charAt(i);
        if (c == '<') {
            buf.append("&lt;");
        } else if (c == '>') {
            buf.append("&gt;");
        } else if (c == '&') {
            buf.append("&amp;");
        } else if (c == '\'') {
            buf.append("&apos;");
        } else if (c == '"') {
            buf.append("&quot;");
        } else {
            buf.append(c);
        }
    }
    return buf;
}

to make strings XML safe before writing them out.
Cheers,
--gandalf.
Will the above not replace *all* such characters with entities, including those in the XML tags (i.e. <mytag> will become &lt;mytag&gt;)?
No comment has been added lately, so it's time to clean up this TA.
I will leave a recommendation in the Cleanup topic area that this question is:
- points to heyhey_
Please leave any comments here within the
next seven days.
PLEASE DO NOT ACCEPT THIS COMMENT AS AN ANSWER !
girionis
Cleanup Volunteer
& becomes &amp;