Solved

URGENT: Convert Special Characters in XML

Posted on 2004-08-31
Last Modified: 2013-11-19
Hi, I'm working on a project.
When a user submits a form, a CGI script generates an email in XML format, something like this:

----------------------------------------------------------------------
<?xml version='1.0' encoding='ISO-8859-1'?>
<!-- start xml -->
<DATA>
<ACTION>
<CALL key="event-registration"></CALL>
</ACTION>
<CORE>
<ELEMENT name="euid">982</ELEMENT>
<ELEMENT name="stage">thankyou</ELEMENT>
<ELEMENT name="share_info">A & B & C </ELEMENT>
<ELEMENT name="special1">No<-----Agree  </ELEMENT>
</CORE>
</DATA>
<!---end of xml--->
----------------------------------------------------------------------
After the system receives the email, my parser uses SAXParser and a custom DefaultHandler to parse the XML data:

-----------------------------
                SAXHandler p = new SAXHandler(mail);
                SAXParserFactory factory = SAXParserFactory.newInstance();
                SAXParser parser = factory.newSAXParser();
                parser.parse(tempXML, p);
-----------------------------
As you can see, the elements contain special characters (& and <).
For various reasons, I can't handle the special characters in the CGI, so I have to take care of them in my Java parser.
Does anyone know a very simple way to do this with SAXParser?
Assume the input can be converted as a String or as a file...

Thanks
Question by:joeyoungkc
21 Comments
 
LVL 86

Expert Comment

by:CEHJ
ID: 11946595
You can use this class to escape the xml:

      static class HTMLEscape {
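            /**
             *  Escapes markup characters and Latin-1 symbols as entity references.
             *
             * @param  s  the text to escape
             * @return    the escaped text
             */
            public static String escape(String s) {
                  int len = s.length();
                  StringBuffer sb = new StringBuffer(len * 5 / 4);

                  for (int i = 0; i < len; i++) {
                        char c = s.charAt(i);
                        // Only characters below 256 have table entries. Masking with 0xff
                        // would map higher characters onto unrelated entries (U+0126 onto '&').
                        String elem = c < htmlchars.length ? htmlchars[c] : null;

                        if (elem == null) {
                              sb.append(c);
                        } else {
                              sb.append(elem);
                        }
                  }
                  return sb.toString();
            }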


            private static String htmlchars[] = new String[256];

            static {
                  String entry[] = {
                              "nbsp", "iexcl", "cent", "pound", "curren", "yen", "brvbar",
                              "sect", "uml", "copy", "ordf", "laquo", "not", "shy", "reg",
                              "macr", "deg", "plusmn", "sup2", "sup3", "acute", "micro",
                              "para", "middot", "cedil", "sup1", "ordm", "raquo", "frac14",
                              "frac12", "frac34", "iquest",
                              "Agrave", "Aacute", "Acirc", "Atilde", "Auml", "Aring", "AElig",
                              "Ccedil", "Egrave", "Eacute", "Ecirc", "Euml", "Igrave", "Iacute",
                              "Icirc", "Iuml", "ETH", "Ntilde", "Ograve", "Oacute", "Ocirc",
                              "Otilde", "Ouml", "times", "Oslash", "Ugrave", "Uacute", "Ucirc",
                              "Uuml", "Yacute", "THORN", "szlig",
                              "agrave", "aacute", "acirc", "atilde", "auml", "aring", "aelig",
                              "ccedil", "egrave", "eacute", "ecirc", "euml", "igrave", "iacute",
                              "icirc", "iuml", "eth", "ntilde", "ograve", "oacute", "ocirc",
                              "otilde", "ouml", "divide", "oslash", "ugrave", "uacute", "ucirc",
                              "uuml", "yacute", "thorn", "yuml"
                              };

                  htmlchars['&'] = "&amp;";
                  htmlchars['<'] = "&lt;";
                  htmlchars['>'] = "&gt;";

                  for (int c = '\u00A0', i = 0; c <= '\u00FF'; c++, i++) {
                        htmlchars[c] = "&" + entry[i] + ";";
                  }

                  for (int c = '\u0083', i = 131; c <= '\u009f'; c++, i++) {
                        htmlchars[c] = "&#" + i + ";";
                  }

                  htmlchars['\u0088'] = htmlchars['\u008D'] = htmlchars['\u008E'] = null;
                  htmlchars['\u008F'] = htmlchars['\u0090'] = htmlchars['\u0098'] = null;
                  htmlchars['\u009D'] = null;
            }

      }

}

0
 

Author Comment

by:joeyoungkc
ID: 11946784
It seems way too complicated.
Don't I only need to convert the predefined XML characters?

&  ->  &amp;
<  ->  &lt;
>  ->  &gt;
"  ->  &quot;
'  ->  &apos;

 
0
 
LVL 86

Expert Comment

by:CEHJ
ID: 11946808
>>It seems way too complicate.

Why? All you need to do is call

s = HTMLEscape.escape(s);
0
 
LVL 92

Expert Comment

by:objects
ID: 11946814
try this, or use replaceAll() method in string:

class XMLUtil {

      // Returns the entity for one of the five predefined XML characters, or null.
      private static String escapeChar(char c) {
         switch (c) {
            case '<'  : return "&lt;";
            case '>'  : return "&gt;";
            case '&'  : return "&amp;";
            case '\'' : return "&apos;";
            case '"'  : return "&quot;";
            default   : return null;
         }
      }

      // Escapes every predefined character in the string; null input yields "null".
      public static String encodeChars(String string) {
         if (string == null)
            return "null";
         int length = string.length();
         StringBuffer encoded = new StringBuffer(length);
         for (int i = 0; i < length; i++) {
            char c = string.charAt(i);
            String escape = escapeChar(c);
            if (escape == null) encoded.append(c);
            else encoded.append(escape);
         }
         return encoded.toString();
      }

      public static void main(String[] args) {
         String test = "AP = ' QT = \" AMP = & LT = < GT = > ";
         System.out.println(encodeChars(test));
      }
}
0
 
LVL 92

Expert Comment

by:objects
ID: 11946821
eg.

s = s.replaceAll("\"", "&quot;");
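
One thing to watch with chained replaces is the order: the ampersand has to be converted first, otherwise the & in entities produced by the earlier calls is escaped a second time. A minimal sketch, assuming Java 5 or later for String.replace (which treats its arguments as literal text, not as a regex); the class and method names are made up for illustration:

```java
class ReplaceOrderDemo {

    // Escapes the five predefined XML characters with literal replaces.
    // '&' is handled first so the '&' in "&lt;" and friends is not re-escaped.
    static String escapeXml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&apos;");
    }

    public static void main(String[] args) {
        // prints: A &amp; B &lt; C
        System.out.println(escapeXml("A & B < C"));
    }
}
```

Like the other suggestions so far, this escapes everything it is given, so it should only be applied to element values, not to whole lines of markup.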
0
 
LVL 86

Expert Comment

by:CEHJ
ID: 11946828
Not sure why you'd want to remove the functionality of replacing other characters that need replacing ...
0
 
LVL 92

Expert Comment

by:objects
ID: 11946874
If you need to add support for more characters, simply add them to the escapeChar switch statement, or add another replaceAll() call if using that.
0
 
LVL 86

Expert Comment

by:CEHJ
ID: 11946898
>>If you need to add support for more characters

You do need to, or you won't find out until it's too late ;-) That's why the class I posted is written that way.
0
 

Author Comment

by:joeyoungkc
ID: 11946912
The problem is that the original line is
<ELEMENT name="share_info">A & B & C </ELEMENT>
and I want to change it to
<ELEMENT name="share_info">A &amp; B &amp; C </ELEMENT>

but your methods will change it to something like

&lt;ELEMENT name="postal_zip"&gt;A & B &gt; C&lt;/ELEMENT&gt;

....
0
 
LVL 92

Expert Comment

by:objects
ID: 11946919
You need to parse the line and convert only the value, not the tags.
0

 
LVL 92

Expert Comment

by:objects
ID: 11946932
0
 
LVL 86

Expert Comment

by:CEHJ
ID: 11946950
Try
String[] text = line.split("<[^>]+>");
if (text.length == 1) {
    String s = HTMLEscape(text[0]);
}
0
 
LVL 86

Expert Comment

by:CEHJ
ID: 11946971
>>String s = HTMLEscape(text[0]);

was meant to be

String s = HTMLEscape.escape(text[0]);
0
 
LVL 92

Accepted Solution

by:
objects earned 300 total points
ID: 11947217
> For some special reason, I can't handle the special characters in cgi,

whats the special reason :)
0
 

Author Comment

by:joeyoungkc
ID: 11947302
The reason is that there are so many CGIs, and I don't want to change them one by one
=)
So the ideal solution is to solve it during parsing. I used
String[] text = line.split("<[^>]+>");
if (text.length == 1) {
    String s = HTMLEscape(text[0]);
}

and modified it a lot, and it works.

Thanks a lot.
Joe
0
 
LVL 86

Expert Comment

by:CEHJ
ID: 11947319
joeyoungkc, can you tell me how

>>whats the special reason :)

can be an answer to this question?
0
 

Author Comment

by:joeyoungkc
ID: 11947368
The reason is that there are so many CGIs creating the same kind of XML, and I don't want to change them one by one, so the best way is to handle the problem in the parser.
0
 

Author Comment

by:joeyoungkc
ID: 11947378
>>whats the special reason :)
that was typo....
0
 
LVL 92

Expert Comment

by:objects
ID: 11948745
> So the ideal solution is to solve it during parsing.i used
> String[] text = line.split("<[^>]+>");
> if (text.length == 1) {
>     String s = HTMLEscape(text[0]);
> }

That's not ideal; in fact, it's not even safe. You'll still end up with corrupted XML data.
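
To make the risk concrete, here is a small sketch that runs the same split over two lines from the sample email. The class and method names are hypothetical; only the regex comes from the quoted snippet.

```java
import java.util.Arrays;

class SplitDemo {

    // Same split as in the quoted snippet: cut the line at anything that looks like a tag.
    static String[] textParts(String line) {
        return line.split("<[^>]+>");
    }

    public static void main(String[] args) {
        String ok  = "<ELEMENT name=\"share_info\">A & B & C </ELEMENT>";
        String bad = "<ELEMENT name=\"special1\">No<-----Agree  </ELEMENT>";

        // The opening tag at index 0 produces a leading empty string, so the
        // value lands in index 1 and the array length is 2, not 1.
        System.out.println(Arrays.toString(textParts(ok)));

        // The stray '<' is taken as the start of a tag, so everything from
        // "<-----Agree" through "</ELEMENT>" is swallowed and only "No" survives.
        System.out.println(Arrays.toString(textParts(bad)));
    }
}
```

So for a complete element line the `text.length == 1` test never fires, and a value containing `<` is silently truncated.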

How many cgi's are involved?  The change required to each is fairly minor, and then you won't have to worry about anything during parsing.
0
 
LVL 92

Expert Comment

by:objects
ID: 11948947
And the time spent changing the cgi's is going to save you time sorting out problems in the future (not to mention time spent trying to work out how to handle it during parsing). And in the long run you may find you have to change the cgi's anyway :)

How is the cgi currently generating the xml, and how does it get passed for parsing?
0
 
LVL 92

Expert Comment

by:objects
ID: 11949212
If you can't be convinced then the following regex will work a lot better for pulling the value out. You may need to tweak it a little depending on exactly what you need to deal with and how it is delivered to you but you should get the idea. Give me a yell if you have any questions :)

Pattern p = Pattern.compile("<ELEMENT name=\"(.+?)\">(.*?)</ELEMENT>");
Matcher m = p.matcher(s);
if (m.matches())
{
   String name = m.group(1);
   String  value = m.group(2);
   System.out.println(name+"="+value);
   // XMLUtil handles all the encoding that you need to worry about
   // though depending on what you're doing with the parsed data you may not need to worry about it at all.
   value = XMLUtil.encodeChars(value);

}
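
For completeness, a rough end-to-end sketch of the same idea: repair the ELEMENT values in the raw text first, then hand the result to SAXParser. It assumes each ELEMENT sits on a single line, that a value never contains "</ELEMENT>", and that values are never already escaped (an existing &amp; would be escaped twice). PreEscapeDemo, repair, escapeValue and parseValues are illustrative names, not part of the code above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class PreEscapeDemo {

    // Groups: opening tag, raw value, closing tag. '.' does not cross line breaks.
    private static final Pattern ELEMENT =
            Pattern.compile("(<ELEMENT name=\"[^\"]*\">)(.*?)(</ELEMENT>)");

    // Stand-in for XMLUtil.encodeChars, limited to the markup characters.
    static String escapeValue(String v) {
        return v.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Rewrites every ELEMENT value and leaves the tags untouched.
    static String repair(String rawXml) {
        Matcher m = ELEMENT.matcher(rawXml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String fixed = m.group(1) + escapeValue(m.group(2)) + m.group(3);
            // quoteReplacement keeps '$' and '\' in the value from being interpreted.
            m.appendReplacement(out, Matcher.quoteReplacement(fixed));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Parses the repaired text and collects the text content of each ELEMENT.
    static List<String> parseValues(String rawXml) {
        final List<String> values = new ArrayList<String>();
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder text;
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if ("ELEMENT".equals(qName)) text = new StringBuilder();
            }
            @Override
            public void characters(char[] ch, int start, int length) {
                if (text != null) text.append(ch, start, length);
            }
            @Override
            public void endElement(String uri, String local, String qName) {
                if ("ELEMENT".equals(qName)) { values.add(text.toString()); text = null; }
            }
        };
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(repair(rawXml))), handler);
        } catch (Exception e) {
            throw new IllegalStateException("repaired XML still failed to parse", e);
        }
        return values;
    }

    public static void main(String[] args) {
        String raw = "<DATA><CORE>\n"
                + "<ELEMENT name=\"share_info\">A & B & C </ELEMENT>\n"
                + "<ELEMENT name=\"special1\">No<-----Agree  </ELEMENT>\n"
                + "</CORE></DATA>";
        System.out.println(parseValues(raw));
    }
}
```

The handler receives the values with the original characters restored, since the parser resolves the entities again. Anything else that is malformed in the incoming mail would still make the parse fail.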

Though as I've already mentioned, I'd strongly suggest biting the bullet and fixing your CGIs, e.g. by wrapping each value in a CDATA section:

<ELEMENT name="share_info"><![CDATA[A & B & C ]]></ELEMENT>
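
The thread does not say what language the CGIs are written in, so the following is only a Java sketch of the wrapping logic to port; CdataDemo, cdata and element are hypothetical names. The one sequence that cannot appear inside a CDATA section is "]]>", which the sketch splits across two adjacent sections.

```java
class CdataDemo {

    // Wraps a raw value in a CDATA section. A literal "]]>" inside the value
    // would end the section early, so it is split across two sections.
    static String cdata(String value) {
        return "<![CDATA[" + value.replace("]]>", "]]]]><![CDATA[>") + "]]>";
    }

    // Builds one ELEMENT line. Assumes the name is a plain identifier
    // that needs no attribute escaping.
    static String element(String name, String value) {
        return "<ELEMENT name=\"" + name + "\">" + cdata(value) + "</ELEMENT>";
    }

    public static void main(String[] args) {
        // prints: <ELEMENT name="share_info"><![CDATA[A & B & C ]]></ELEMENT>
        System.out.println(element("share_info", "A & B & C "));
    }
}
```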