Solved

URGENT: Convert Special Characters in XML

Posted on 2004-08-31
Last Modified: 2013-11-19
Hi, I'm working on a project.
When a user submits a form, a cgi generates an email in XML format, something like this:

----------------------------------------------------------------------
<?xml version='1.0' encoding='ISO-8859-1'?>
<!-- start xml -->
<DATA>
<ACTION>
<CALL key="event-registration"></CALL>
</ACTION>
<CORE>
<ELEMENT name="euid">982</ELEMENT>
<ELEMENT name="stage">thankyou</ELEMENT>
<ELEMENT name="share_info">A & B & C </ELEMENT>
<ELEMENT name="special1">No<-----Agree  </ELEMENT>
</CORE>
</DATA>
<!---end of xml--->
----------------------------------------------------------------------
After the system receives the email, my parser uses SAXParser and a custom DefaultHandler to parse the XML data:

-----------------------------
                SAXHandler p = new SAXHandler(mail);
                SAXParserFactory factory = SAXParserFactory.newInstance();
                SAXParser parser = factory.newSAXParser();
                parser.parse(tempXML, p);
-----------------------------
As you can see, the elements contain special characters (such as &, < and >).
For a particular reason I can't escape the special characters in the cgi, so I have to take care of it in my Java parser.
Does anyone know a very simple way to do this with SAXParser?
Assume the XML can be converted as a String or a file before parsing...

Thanks
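
To make the failure concrete, here is a minimal sketch (not the poster's actual code) showing that a SAX parser rejects a raw & in element content but accepts the escaped form. The class and method names are made up for illustration, and a plain DefaultHandler stands in for the custom SAXHandler.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of the original project.
class UnescapedAmpersandDemo {

    // Returns true if the XML parses cleanly, false if the parser reports
    // a well-formedness error (surfaced as a SAXParseException).
    static boolean isWellFormed(String xml) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXParseException e) {
            return false;
        } catch (Exception e) {
            // Parser configuration or I/O problems say nothing about the input itself.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String raw = "<ELEMENT name=\"share_info\">A & B & C </ELEMENT>";
        String escaped = "<ELEMENT name=\"share_info\">A &amp; B &amp; C </ELEMENT>";
        System.out.println(isWellFormed(raw));     // prints false
        System.out.println(isWellFormed(escaped)); // prints true
    }
}
```

So whatever fix is chosen has to run on the raw text before it reaches parser.parse(); the handler never gets a chance to see the bad characters.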
0
Comment
Question by:joeyoungkc
21 Comments
 
LVL 86

Expert Comment

by:CEHJ
ID: 11946595
You can use this class to escape the xml:

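      static class HTMLEscape {
            /**
             * Replaces &, <, > and the Latin-1 characters that have an entry
             * in the lookup table with an entity or numeric character reference.
             * Note: the named entities other than &amp;, &lt; and &gt; are HTML
             * entities; an XML parser rejects them unless a DTD declares them.
             *
             * @param  s  the text to escape
             * @return    the escaped text
             */
            public static String escape(String s) {
                  int len = s.length();
                  StringBuffer sb = new StringBuffer(len * 5 / 4);

                  for (int i = 0; i < len; i++) {
                        char c = s.charAt(i);
                        // Only look up characters below 256. Masking with 0xff
                        // would map e.g. U+0126 onto '&' and corrupt it.
                        String elem = c < 256 ? htmlchars[c] : null;

                        if (elem == null) {
                              sb.append(c);
                        } else {
                              sb.append(elem);
                        }
                  }
                  return sb.toString();
            }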


            private static String htmlchars[] = new String[256];

            static {
                  String entry[] = {
                              "nbsp", "iexcl", "cent", "pound", "curren", "yen", "brvbar",
                              "sect", "uml", "copy", "ordf", "laquo", "not", "shy", "reg",
                              "macr", "deg", "plusmn", "sup2", "sup3", "acute", "micro",
                              "para", "middot", "cedil", "sup1", "ordm", "raquo", "frac14",
                              "frac12", "frac34", "iquest",
                              "Agrave", "Aacute", "Acirc", "Atilde", "Auml", "Aring", "AElig",
                              "Ccedil", "Egrave", "Eacute", "Ecirc", "Euml", "Igrave", "Iacute",
                              "Icirc", "Iuml", "ETH", "Ntilde", "Ograve", "Oacute", "Ocirc",
                              "Otilde", "Ouml", "times", "Oslash", "Ugrave", "Uacute", "Ucirc",
                              "Uuml", "Yacute", "THORN", "szlig",
                              "agrave", "aacute", "acirc", "atilde", "auml", "aring", "aelig",
                              "ccedil", "egrave", "eacute", "ecirc", "euml", "igrave", "iacute",
                              "icirc", "iuml", "eth", "ntilde", "ograve", "oacute", "ocirc",
                              "otilde", "ouml", "divide", "oslash", "ugrave", "uacute", "ucirc",
                              "uuml", "yacute", "thorn", "yuml"
                              };

                  htmlchars['&'] = "&amp;";
                  htmlchars['<'] = "&lt;";
                  htmlchars['>'] = "&gt;";

                  for (int c = '\u00A0', i = 0; c <= '\u00FF'; c++, i++) {
                        htmlchars[c] = "&" + entry[i] + ";";
                  }

                  for (int c = '\u0083', i = 131; c <= '\u009f'; c++, i++) {
                        htmlchars[c] = "&#" + i + ";";
                  }

                  htmlchars['\u0088'] = htmlchars['\u008D'] = htmlchars['\u008E'] = null;
                  htmlchars['\u008F'] = htmlchars['\u0090'] = htmlchars['\u0098'] = null;
                  htmlchars['\u009D'] = null;
            }

      }

}

0
 

Author Comment

by:joeyoungkc
ID: 11946784
It seems way too complicated.
Don't I only need to convert the predefined XML characters?

&   ->  &amp;
<   ->  &lt;
>   ->  &gt;
"   ->  &quot;
'   ->  &apos;

 
0
 
LVL 86

Expert Comment

by:CEHJ
ID: 11946808
>>It seems way too complicate.

Why? All you need to do is call

s = HTMLEscape.escape(s);
0

 
LVL 92

Expert Comment

by:objects
ID: 11946814
Try this, or use the replaceAll() method of String:

class XMLUtil{

      private static String escapeChar(char c){
         switch(c){
            case('<')  : return "&lt;";
            case('>')  : return "&gt;";
            case('&')  : return "&amp;";
            case('\'') : return "&apos;";
            case('\"') : return "&quot;";                        
        }
        return null;    
      }
     
      public static String encodeChars(String string){
     
         if(string==null)
         return "null";
         int length = string.length();
         char[] characters = new char[length];
         string.getChars(0, length, characters, 0);
         StringBuffer encoded = new StringBuffer();
         String escape;
         for(int i = 0;i<length;i++){
            escape = escapeChar(characters[i]);
            if(escape == null) encoded.append(characters[i]);
               else encoded.append(escape);
         }
         return encoded.toString();
      }
               
      public static void main(String[] args){
         String test = "AP = ' QT = \" AMP = & LT = < GT = > ";
         System.out.println(encodeChars(test));
      }
}
0
 
LVL 92

Expert Comment

by:objects
ID: 11946821
eg.

s = s.replaceAll("\"", "&quot;");
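
If you go the replace route for all five predefined entities, the order of the calls matters. A rough sketch follows; the class and method names are only for illustration, and it uses String.replace (literal, Java 5+) rather than replaceAll, so no regex escaping is involved. The & has to be handled first, otherwise the & inside an already produced &lt; gets escaped a second time.

```java
// Hypothetical helper, shown only to illustrate call ordering.
class ReplaceOrderDemo {

    // Escapes the five predefined XML entities with literal replacements.
    // '&' goes first so that the '&' in "&lt;", "&gt;" etc. is not re-escaped.
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&apos;");
    }

    // Same idea with '&' handled last, to show the double-escaping problem.
    static String escapeWrongOrder(String s) {
        return s.replace("<", "&lt;").replace("&", "&amp;");
    }

    public static void main(String[] args) {
        System.out.println(escape("A & B < C"));           // prints A &amp; B &lt; C
        System.out.println(escapeWrongOrder("A & B < C")); // prints A &amp; B &amp;lt; C
    }
}
```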
0
 
LVL 86

Expert Comment

by:CEHJ
ID: 11946828
Not sure why you'd want to remove the functionality of replacing other characters that need replacing ...
0
 
LVL 92

Expert Comment

by:objects
ID: 11946874
If you need to add support for more characters, simply add them to the escapeChar switch statement, or add another replaceAll() call if using that.
0
 
LVL 86

Expert Comment

by:CEHJ
ID: 11946898
>>If you need to add support for more characters

You do need to, or you won't find out until it's too late ;-) That's why the class i posted is written in that way
0
 

Author Comment

by:joeyoungkc
ID: 11946912
The problem is that the original line is
<ELEMENT name="share_info">A & B & C </ELEMENT>
and I want to change it to
<ELEMENT name="share_info">A &amp; B &amp; C </ELEMENT>

but you guys' methods will change it to

&lt;ELEMENT name="postal_zip"&gt;A &amp; B &gt; C&lt;/ELEMENT&gt;

....
0
 
LVL 92

Expert Comment

by:objects
ID: 11946919
you need to parse the line, and only convert the value, not the tags.
0
 
LVL 92

Expert Comment

by:objects
ID: 11946932
0
 
LVL 86

Expert Comment

by:CEHJ
ID: 11946950
Try
String[] text = line.split("<[^>]+>");
if (text.length == 1) {
    String s = HTMLEscape(text[0]);
}
0
 
LVL 86

Expert Comment

by:CEHJ
ID: 11946971
>>String s = HTMLEscape(text[0]);

was meant to be

String s = HTMLEscape.escape(text[0]);
0
 
LVL 92

Accepted Solution

by:
objects earned 1200 total points
ID: 11947217
> For some special reason, I can't handle the special characters in cgi,

whats the special reason :)
0
 

Author Comment

by:joeyoungkc
ID: 11947302
The reason is that there are so many cgis, and I don't want to change them one by one =)
So the ideal solution is to solve it during parsing. I used
String[] text = line.split("<[^>]+>");
if (text.length == 1) {
    String s = HTMLEscape(text[0]);
}

and modified it a lot, and it works.

Thanks a lot.
Joe
0
 
LVL 86

Expert Comment

by:CEHJ
ID: 11947319
joeyoungkc, can you tell me how

>>whats the special reason :)

can be an answer to this question?
0
 

Author Comment

by:joeyoungkc
ID: 11947368
The reason is that there are so many cgis creating the same kind of XML, and I don't want to change them one by one, so the best way is to handle the problem in the parser.
0
 

Author Comment

by:joeyoungkc
ID: 11947378
>>whats the special reason :)
that was typo....
0
 
LVL 92

Expert Comment

by:objects
ID: 11948745
> So the ideal solution is to solve it during parsing.i used
> String[] text = line.split("<[^>]+>");
> if (text.length == 1) {
>     String s = HTMLEscape(text[0]);
> }

That's not ideal; in fact, it's not even safe. You'll still end up with corrupted XML data.

How many cgi's are involved?  The change required to each is fairly minor, and then you won't have to worry about anything during parsing.
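
To illustrate the kind of corruption meant here, this small sketch runs the split regex quoted above against two lines from the sample XML in the question. The class and method names are made up. Two things show up: a line that starts with a tag never yields an array of length 1 (there is an empty leading element), and a value containing < is swallowed as if it were a tag.

```java
import java.util.Arrays;

// Hypothetical demo class for the split("<[^>]+>") approach.
class SplitPitfallDemo {

    // The tag-stripping regex quoted earlier in the thread.
    static String[] splitOnTags(String line) {
        return line.split("<[^>]+>");
    }

    public static void main(String[] args) {
        String plain = "<ELEMENT name=\"euid\">982</ELEMENT>";
        String tricky = "<ELEMENT name=\"special1\">No<-----Agree  </ELEMENT>";

        // Leading empty string, so text.length == 1 is never true here.
        System.out.println(Arrays.toString(splitOnTags(plain)));  // prints [, 982]

        // "<-----Agree  </ELEMENT>" matches the regex as one "tag" and is lost.
        System.out.println(Arrays.toString(splitOnTags(tricky))); // prints [, No]
    }
}
```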
0
 
LVL 92

Expert Comment

by:objects
ID: 11948947
And the time spent changing the cgi's is going to save you time sorting out problems in the future (not to mention time spent trying to work out how to handle it during parsing). And in the long run you may find you have to change the cgi's anyway :)

How is the cgi currently generating the xml, and how does it get passed for parsing?
0
 
LVL 92

Expert Comment

by:objects
ID: 11949212
If you can't be convinced then the following regex will work a lot better for pulling the value out. You may need to tweak it a little depending on exactly what you need to deal with and how it is delivered to you but you should get the idea. Give me a yell if you have any questions :)

// Needs java.util.regex.Pattern and java.util.regex.Matcher.
Pattern p = Pattern.compile("<ELEMENT name=\"(.+?)\">(.*?)</ELEMENT>");
Matcher m = p.matcher(s);
if (m.matches())
{
   String name = m.group(1);
   String value = m.group(2);
   System.out.println(name+"="+value);
   // XMLUtil handles all the encoding that you need to worry about
   // though depending on what you're doing with the parsed data you may not need to worry about it at all.
   value = XMLUtil.encodeChars(value);
}

Though as I've already mentioned, I'd strongly suggest biting the bullet and fixing your cgi's, for example by having them wrap each value in a CDATA section:

<ELEMENT name="share_info"><![CDATA[A & B & C ]]></ELEMENT>
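
For anyone who does need to repair the text on the parsing side, here is one way the pieces could fit together. It is a sketch under stated assumptions, not a general solution: it assumes every ELEMENT sits on a single line, that its value contains no real markup, and that values are not already escaped. The names ElementRepairSketch, repair, escapeValue and parse are invented for this example. It escapes only the text between the ELEMENT tags, leaves every other line alone, and then hands the result to a SAX parser.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch; assumes one <ELEMENT> per line and markup-free values.
class ElementRepairSketch {

    // Group 1: opening tag, group 2: raw value, group 3: closing tag.
    private static final Pattern ELEMENT =
            Pattern.compile("^(\\s*<ELEMENT name=\"[^\"]+\">)(.*)(</ELEMENT>\\s*)$");

    // Escapes the characters that are illegal in element content ('&' first).
    static String escapeValue(String v) {
        return v.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Escapes only the text between ELEMENT tags; other lines pass through unchanged.
    static String repair(String xml) {
        StringBuilder out = new StringBuilder();
        for (String line : xml.split("\n")) {
            Matcher m = ELEMENT.matcher(line);
            if (m.matches()) {
                line = m.group(1) + escapeValue(m.group(2)) + m.group(3);
            }
            out.append(line).append('\n');
        }
        return out.toString();
    }

    // Repairs the XML, parses it with SAX and collects name -> value for each ELEMENT.
    static Map<String, String> parse(String xml) {
        final Map<String, String> values = new LinkedHashMap<String, String>();
        DefaultHandler handler = new DefaultHandler() {
            private String name;
            private final StringBuilder text = new StringBuilder();

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if ("ELEMENT".equals(qName)) {
                    name = atts.getValue("name");
                    text.setLength(0);
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // May be called several times per element, so accumulate.
                if (name != null) text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if ("ELEMENT".equals(qName)) {
                    values.put(name, text.toString());
                    name = null;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(repair(xml))), handler);
        } catch (Exception e) {
            throw new IllegalStateException("XML still not well-formed after repair", e);
        }
        return values;
    }
}
```

Calling ElementRepairSketch.parse(xml) on the CORE block from the question should give back the original, unescaped values, since the parser decodes the entities again. A value that the cgi has already escaped would be escaped twice by this approach, which is one more reason to prefer fixing the output at the source.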
0
