Solved

URGENT: Convert Special Characters in XML

Posted on 2004-08-31
21
1,802 Views
Last Modified: 2013-11-19
Hi, I'm working on a project.
When a user submits a form, a CGI script generates an email in XML format, something like this:

----------------------------------------------------------------------
<?xml version='1.0' encoding='ISO-8859-1'?>
<!-- start xml -->
<DATA>
<ACTION>
<CALL key="event-registration"></CALL>
</ACTION>
<CORE>
<ELEMENT name="euid">982</ELEMENT>
<ELEMENT name="stage">thankyou</ELEMENT>
<ELEMENT name="share_info">A & B & C </ELEMENT>
<ELEMENT name="special1">No<-----Agree  </ELEMENT>
</CORE>
</DATA>
<!---end of xml--->
----------------------------------------------------------------------
After the system receives the email, my parser uses SAXParser and a DefaultHandler subclass to parse the XML data:

-----------------------------
                SAXHandler p = new SAXHandler(mail);
                SAXParserFactory factory = SAXParserFactory.newInstance();
                SAXParser parser = factory.newSAXParser();
                parser.parse(tempXML, p);
-----------------------------
As you can see, the element values contain special characters (& and <).
For various reasons I can't handle the special characters in the CGI scripts, so I have to take care of them in my Java parser.
Does anyone know a simple way to do this with SAXParser?
Assume the input can be converted as a String or a file...

Thanks
Question by:joeyoungkc
21 Comments
 
LVL 86

Expert Comment

by:CEHJ
ID: 11946595
You can use this class to escape the xml:

      static class HTMLEscape {
            /**
             *  Description of the Method
             *
             * @param  s  Description of the Parameter
             * @return    Description of the Return Value
             */
            public static String escape(String s) {
                  int len = s.length();
                  StringBuffer sb = new StringBuffer(len * 5 / 4);

                  for (int i = 0; i < len; i++) {
                        char c = s.charAt(i);
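                        // Only chars 0..255 have table entries; masking with 0xff would
                        // mis-map higher code points (e.g. U+0126 onto '&')
                        String elem = c < htmlchars.length ? htmlchars[c] : null;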

                        sb.append(elem == null ? "" + c : elem);
                  }
                  return sb.toString();
            }


            private static String htmlchars[] = new String[256];

            static {
                  String entry[] = {
                              "nbsp", "iexcl", "cent", "pound", "curren", "yen", "brvbar",
                              "sect", "uml", "copy", "ordf", "laquo", "not", "shy", "reg",
                              "macr", "deg", "plusmn", "sup2", "sup3", "acute", "micro",
                              "para", "middot", "cedil", "sup1", "ordm", "raquo", "frac14",
                              "frac12", "frac34", "iquest",
                              "Agrave", "Aacute", "Acirc", "Atilde", "Auml", "Aring", "AElig",
                              "Ccedil", "Egrave", "Eacute", "Ecirc", "Euml", "Igrave", "Iacute",
                              "Icirc", "Iuml", "ETH", "Ntilde", "Ograve", "Oacute", "Ocirc",
                              "Otilde", "Ouml", "times", "Oslash", "Ugrave", "Uacute", "Ucirc",
                              "Uuml", "Yacute", "THORN", "szlig",
                              "agrave", "aacute", "acirc", "atilde", "auml", "aring", "aelig",
                              "ccedil", "egrave", "eacute", "ecirc", "euml", "igrave", "iacute",
                              "icirc", "iuml", "eth", "ntilde", "ograve", "oacute", "ocirc",
                              "otilde", "ouml", "divide", "oslash", "ugrave", "uacute", "ucirc",
                              "uuml", "yacute", "thorn", "yuml"
                              };

                  htmlchars['&'] = "&amp;";
                  htmlchars['<'] = "&lt;";
                  htmlchars['>'] = "&gt;";

                  for (int c = '\u00A0', i = 0; c <= '\u00FF'; c++, i++) {
                        htmlchars[c] = "&" + entry[i] + ";";
                  }

                  for (int c = '\u0083', i = 131; c <= '\u009f'; c++, i++) {
                        htmlchars[c] = "&#" + i + ";";
                  }

                  htmlchars['\u0088'] = htmlchars['\u008D'] = htmlchars['\u008E'] = null;
                  htmlchars['\u008F'] = htmlchars['\u0090'] = htmlchars['\u0098'] = null;
                  htmlchars['\u009D'] = null;
            }

      }

}

0
 

Author Comment

by:joeyoungkc
ID: 11946784
It seems way too complicated.
Don't I only need to replace the predefined XML characters?
&     -> &amp;
 
<    ->  &lt;
 
>     ->   &gt;
 
"   ->    &quot;
 
'  ->   &apos;

???

 
0
 
LVL 86

Expert Comment

by:CEHJ
ID: 11946808
>>It seems way too complicate.

Why? All you need to do is call

s = HTMLEscape.escape(s);
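
One caveat may be worth checking before using an HTML-oriented escaper on XML: XML itself predefines only five entities (&amp;, &lt;, &gt;, &quot;, &apos;), so named HTML entities such as &nbsp; are rejected by a parser unless a DTD declares them. The sketch below is only an illustration (the class name EntityCheck and the method parses are made up here); it feeds three tiny documents to the JDK's SAX parser and reports whether each one is accepted.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class EntityCheck {

    /** Returns true if the given string parses as well-formed XML. */
    public static boolean parses(String xml) {
        try {
            SAXParserFactory.newInstance()
                    .newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (Exception e) {
            // Malformed input surfaces as a SAXParseException; any failure counts as "no" here
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(parses("<a>x&nbsp;y</a>"));   // HTML named entity, no DTD
        System.out.println(parses("<a>x&#160;y</a>"));   // numeric character reference
        System.out.println(parses("<a>x &amp; y</a>"));  // predefined XML entity
    }
}
```

If the first call reports false in your environment, numeric references (&#160;) are the safer output for characters outside the five predefined entities.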
0
 
LVL 92

Expert Comment

by:objects
ID: 11946814
try this, or use replaceAll() method in string:

class XMLUtil{

      private static String escapeChar(char c){
         switch(c){
            case('<')  : return "&lt;";
            case('>')  : return "&gt;";
            case('&')  : return "&amp;";
            case('\'') : return "&apos;";
            case('\"') : return "&quot;";                        
        }
        return null;    
      }
     
      public static String encodeChars(String string){
     
         if(string==null)
         return "null";
         int length = string.length();
         char[] characters = new char[length];
         string.getChars(0, length, characters, 0);
         StringBuffer encoded = new StringBuffer();
         String escape;
         for(int i = 0;i<length;i++){
            escape = escapeChar(characters[i]);
            if(escape == null) encoded.append(characters[i]);
               else encoded.append(escape);
         }
         return encoded.toString();
      }
               
      public static void main(String[] args){
         String test = "AP = ' QT = \" AMP = & LT = < GT = > ";
         System.out.println(encodeChars(test));
      }
}
0
 
LVL 92

Expert Comment

by:objects
ID: 11946821
eg.

s = s.replaceAll("\"", "&quot;");
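
If the replace route is taken, the order of the calls matters: the ampersand has to be handled first, otherwise the & inside an already produced &lt; or &quot; gets escaped a second time. A minimal sketch, assuming Java 5 or later for the literal String.replace (the class name ReplaceEscape is hypothetical):

```java
public class ReplaceEscape {

    /** Escapes the five predefined XML characters in a text value. */
    public static String escape(String s) {
        return s.replace("&", "&amp;")    // must come first
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&apos;");
    }

    public static void main(String[] args) {
        System.out.println(escape("A & B < C"));  // prints: A &amp; B &lt; C
    }
}
```

String.replace treats its arguments literally, so none of the characters need regex quoting the way they can with replaceAll().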
0
 
LVL 86

Expert Comment

by:CEHJ
ID: 11946828
Not sure why you'd want to remove the functionality of replacing other characters that need replacing ...
0
 
LVL 92

Expert Comment

by:objects
ID: 11946874
If you need to add support for more characters, simply add them to the escapeChar switch statement, or add another replaceAll() call if using that.
0
 
LVL 86

Expert Comment

by:CEHJ
ID: 11946898
>>If you need to add support for more characters

You do need to, or you won't find out until it's too late ;-) That's why the class i posted is written in that way
0
 

Author Comment

by:joeyoungkc
ID: 11946912
The problem is that the original line is
<ELEMENT name="share_info">A & B & C </ELEMENT>
and I want to change it to
<ELEMENT name="share_info">A &amp; B &amp; C </ELEMENT>

but your methods will change it to

&lt;ELEMENT name="postal_zip"&gt;A & B &gt; C&lt;/ELEMENT&gt;

....
0
 
LVL 92

Expert Comment

by:objects
ID: 11946919
you need to parse the line, and only convert the value, not the tags.
0
 
LVL 92

Expert Comment

by:objects
ID: 11946932
0
 
LVL 86

Expert Comment

by:CEHJ
ID: 11946950
Try
String[] text = line.split("<[^>]+>");
if (text.length == 1) {
    String s = HTMLEscape(text[0]);
}
0
 
LVL 86

Expert Comment

by:CEHJ
ID: 11946971
>>String s = HTMLEscape(text[0]);

was meant to be

String s = HTMLEscape.escape(text[0]);
0
 
LVL 92

Accepted Solution

by:
objects earned 300 total points
ID: 11947217
> For some special reason, I can't handle the special characters in cgi,

whats the special reason :)
0
 

Author Comment

by:joeyoungkc
ID: 11947302
The reason is that there are so many CGIs, and I don't want to change them one by one
=)
So the ideal solution is to solve it during parsing. I used
String[] text = line.split("<[^>]+>");
if (text.length == 1) {
    String s = HTMLEscape(text[0]);
}

and modified it a lot, and it works.

Thanks a lot.
Joe
0
 
LVL 86

Expert Comment

by:CEHJ
ID: 11947319
joeyoungkc, can you tell me how

>>whats the special reason :)

can be an answer to this question?
0
 

Author Comment

by:joeyoungkc
ID: 11947368
The reason is that there are many CGIs that create the same kind of XML, and I don't want to change them one by one, so the best way is to handle the problem in the parser.
0
 

Author Comment

by:joeyoungkc
ID: 11947378
>>whats the special reason :)
that was typo....
0
 
LVL 92

Expert Comment

by:objects
ID: 11948745
> So the ideal solution is to solve it during parsing.i used
> String[] text = line.split("<[^>]+>");
> if (text.length == 1) {
>     String s = HTMLEscape(text[0]);
> }

That's not ideal; in fact it's not even safe. You'll still end up with corrupted XML data.

How many CGIs are involved? The change required to each is fairly minor, and then you won't have to worry about anything during parsing.
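
To make the risk concrete, here is a small sketch that runs the quoted split on the two sample lines from the question (the class name SplitDemo and the method textPieces are made up for illustration):

```java
import java.util.Arrays;

public class SplitDemo {

    /** The split from the thread: anything that looks like <...> is treated as a tag. */
    public static String[] textPieces(String line) {
        return line.split("<[^>]+>");
    }

    public static void main(String[] args) {
        String shareInfo = "<ELEMENT name=\"share_info\">A & B & C </ELEMENT>";
        String special1 = "<ELEMENT name=\"special1\">No<-----Agree  </ELEMENT>";

        System.out.println(Arrays.toString(textPieces(shareInfo)));
        System.out.println(Arrays.toString(textPieces(special1)));
    }
}
```

For the first line the result has two pieces, an empty leading string plus the text, so a `text.length == 1` check never fires. For the second line the pattern treats `<-----Agree  </ELEMENT>` as one tag, so everything after `No` is silently dropped.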
0
 
LVL 92

Expert Comment

by:objects
ID: 11948947
And the time spent changing the cgi's is going to save you time sorting out problems in the future (not to mention time spent trying to work out how to handle it during parsing). And in the long run you may find you have to change the cgi's anyway :)

How is the cgi currently generating the xml, and how does it get passed for parsing?
0
 
LVL 92

Expert Comment

by:objects
ID: 11949212
If you can't be convinced then the following regex will work a lot better for pulling the value out. You may need to tweak it a little depending on exactly what you need to deal with and how it is delivered to you but you should get the idea. Give me a yell if you have any questions :)

Pattern p = Pattern.compile("<ELEMENT name=\"(.+?)\">(.*?)</ELEMENT>");
Matcher m = p.matcher(s);
if (m.matches())
{
   String name = m.group(1);
   String  value = m.group(2);
   System.out.println(name+"="+value);
   // XMLUtil handles all the encoding that you need to worry about
   // though depending on what you're doing with the parsed data you may not need to worry about it at all.
   value = XMLUtil.encodeChars(value);

}
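
Putting the pieces together, the whole workaround could be sketched roughly as follows: escape only the text between the ELEMENT tags, then hand the repaired string to the SAX parser. This is a sketch under assumptions, not a drop-in fix. The class SanitizingParser and its methods are hypothetical, each value is assumed to sit on one line, values are assumed never to be escaped already (otherwise &amp; would be escaped twice), and the rest of the message must already be well-formed.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SanitizingParser {

    // Group 1: opening tag, group 2: raw value, group 3: closing tag
    private static final Pattern ELEMENT =
            Pattern.compile("(<ELEMENT name=\"[^\"]*\">)(.*?)(</ELEMENT>)");

    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Escapes the text of every ELEMENT and leaves all tags untouched. */
    public static String sanitize(String rawXml) {
        Matcher m = ELEMENT.matcher(rawXml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String fixed = m.group(1) + escape(m.group(2)) + m.group(3);
            m.appendReplacement(out, Matcher.quoteReplacement(fixed));
        }
        m.appendTail(out);
        return out.toString();
    }

    /** Sanitizes, parses with SAX, and returns name -> value in document order. */
    public static Map<String, String> parse(String rawXml) {
        final Map<String, String> values = new LinkedHashMap<String, String>();
        DefaultHandler handler = new DefaultHandler() {
            private String name;
            private StringBuilder text;

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if ("ELEMENT".equals(qName)) {
                    name = atts.getValue("name");
                    text = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (text != null) text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if ("ELEMENT".equals(qName)) {
                    values.put(name, text.toString());
                    text = null;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(sanitize(rawXml))), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("XML still not well-formed after sanitizing", e);
        }
        return values;
    }

    public static void main(String[] args) {
        String raw = "<DATA><CORE>"
                + "<ELEMENT name=\"share_info\">A & B & C </ELEMENT>"
                + "<ELEMENT name=\"special1\">No<-----Agree  </ELEMENT>"
                + "</CORE></DATA>";
        System.out.println(parse(raw));
    }
}
```

Because the closing tag is matched lazily, a value such as `No<-----Agree` survives intact, which the split-based approach cannot guarantee.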

Though as I've already mentioned I'd strongly suggest biting the bullet and fixing your cgi's.

<ELEMENT name="share_info"><![CDATA[A & B & C ]]></ELEMENT>
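
The CDATA form above can be generated with a one-line helper on the producing side. The sketch is in Java only to match the rest of the thread (the CGIs may well be written in something else), and the class name CdataUtil is hypothetical. The one subtlety is a value that itself contains `]]>`, which would end the section early and therefore has to be split across two sections.

```java
public class CdataUtil {

    /** Wraps a raw value in a CDATA section so that & and < need no escaping. */
    public static String wrap(String value) {
        // "]]>" inside the value would close the section, so split it in two
        return "<![CDATA[" + value.replace("]]>", "]]]]><![CDATA[>") + "]]>";
    }

    public static void main(String[] args) {
        System.out.println("<ELEMENT name=\"share_info\">" + wrap("A & B & C ") + "</ELEMENT>");
    }
}
```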
0
