Solved

URGENT: Convert Special Characters in XML

Posted on 2004-08-31
Last Modified: 2013-11-19
Hi, I'm working on a project.
When a user submits a form, the CGI generates an email in XML format, something like this:

----------------------------------------------------------------------
<?xml version='1.0' encoding='ISO-8859-1'?>
<!-- start xml -->
<DATA>
<ACTION>
<CALL key="event-registration"></CALL>
</ACTION>
<CORE>
<ELEMENT name="euid">982</ELEMENT>
<ELEMENT name="stage">thankyou</ELEMENT>
<ELEMENT name="share_info">A & B & C </ELEMENT>
<ELEMENT name="special1">No<-----Agree  </ELEMENT>
</CORE>
</DATA>
<!---end of xml--->
----------------------------------------------------------------------
After the system receives the email, my parser uses SAXParser with a custom DefaultHandler to parse the XML data:

-----------------------------
                SAXHandler p = new SAXHandler(mail);
                SAXParserFactory factory = SAXParserFactory.newInstance();
                SAXParser parser = factory.newSAXParser();
                parser.parse(tempXML, p);
-----------------------------
As you can see, the elements contain special characters (such as & and <).
For various reasons I can't handle the special characters in the CGI, so I have to take care of them in my Java parser.
Does anyone know a simple way to do this with SAXParser?
Assume the XML can be converted as a String or as a file...

Thanks
Question by:joeyoungkc

Expert Comment

by:CEHJ
ID: 11946595
You can use this class to escape the xml:

      static class HTMLEscape {
            /**
             *  Replaces &, <, > and the characters covered by the
             *  lookup table below with entity references.
             *
             * @param  s  the text to escape
             * @return    the escaped text
             */
            public static String escape(String s) {
                  int len = s.length();
                  StringBuffer sb = new StringBuffer(len * 5 / 4);

                  for (int i = 0; i < len; i++) {
                        char c = s.charAt(i);
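                        // Look up only characters the table covers; masking with 0xff would
                        // map characters above U+00FF onto unrelated entries (U+0126 onto '&').
                        String elem = c < htmlchars.length ? htmlchars[c] : null;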

                        sb.append(elem == null ? "" + c : elem);
                  }
                  return sb.toString();
            }


            private static String htmlchars[] = new String[256];

            static {
                  String entry[] = {
                              "nbsp", "iexcl", "cent", "pound", "curren", "yen", "brvbar",
                              "sect", "uml", "copy", "ordf", "laquo", "not", "shy", "reg",
                              "macr", "deg", "plusmn", "sup2", "sup3", "acute", "micro",
                              "para", "middot", "cedil", "sup1", "ordm", "raquo", "frac14",
                              "frac12", "frac34", "iquest",
                              "Agrave", "Aacute", "Acirc", "Atilde", "Auml", "Aring", "AElig",
                              "Ccedil", "Egrave", "Eacute", "Ecirc", "Euml", "Igrave", "Iacute",
                              "Icirc", "Iuml", "ETH", "Ntilde", "Ograve", "Oacute", "Ocirc",
                              "Otilde", "Ouml", "times", "Oslash", "Ugrave", "Uacute", "Ucirc",
                              "Uuml", "Yacute", "THORN", "szlig",
                              "agrave", "aacute", "acirc", "atilde", "auml", "aring", "aelig",
                              "ccedil", "egrave", "eacute", "ecirc", "euml", "igrave", "iacute",
                              "icirc", "iuml", "eth", "ntilde", "ograve", "oacute", "ocirc",
                              "otilde", "ouml", "divide", "oslash", "ugrave", "uacute", "ucirc",
                              "uuml", "yacute", "thorn", "yuml"
                              };

                  htmlchars['&'] = "&amp;";
                  htmlchars['<'] = "&lt;";
                  htmlchars['>'] = "&gt;";

                  for (int c = '\u00A0', i = 0; c <= '\u00FF'; c++, i++) {
                        htmlchars[c] = "&" + entry[i] + ";";
                  }

                  for (int c = '\u0083', i = 131; c <= '\u009f'; c++, i++) {
                        htmlchars[c] = "&#" + i + ";";
                  }

                  htmlchars['\u0088'] = htmlchars['\u008D'] = htmlchars['\u008E'] = null;
                  htmlchars['\u008F'] = htmlchars['\u0090'] = htmlchars['\u0098'] = null;
                  htmlchars['\u009D'] = null;
            }

      }

}

 

Author Comment

by:joeyoungkc
ID: 11946784
It seems way too complicated.
Don't I only need to replace the characters that are predefined for XML?
&   ->  &amp;
<   ->  &lt;
>   ->  &gt;
"   ->  &quot;
'   ->  &apos;

 

Expert Comment

by:CEHJ
ID: 11946808
>>It seems way too complicate.

Why? All you need to do is call

s = HTMLEscape.escape(s);

Expert Comment

by:objects
ID: 11946814
Try this, or use the replaceAll() method of String:

class XMLUtil{

      // Returns the entity for one of the five predefined XML characters,
      // or null if the character does not need escaping.
      private static String escapeChar(char c){
         switch(c){
            case('<')  : return "&lt;";
            case('>')  : return "&gt;";
            case('&')  : return "&amp;";
            case('\'') : return "&apos;";
            case('\"') : return "&quot;";
        }
        return null;
      }

      // Escapes every predefined XML character in the given string.
      public static String encodeChars(String string){

         if(string==null)
            return "null";
         int length = string.length();
         char[] characters = new char[length];
         string.getChars(0, length, characters, 0);
         StringBuffer encoded = new StringBuffer();
         String escape;
         for(int i = 0;i<length;i++){
            escape = escapeChar(characters[i]);
            if(escape == null) encoded.append(characters[i]);
               else encoded.append(escape);
         }
         return encoded.toString();
      }

      public static void main(String[] args){
         String test = "AP = ' QT = \" AMP = & LT = < GT = > ";
         System.out.println(encodeChars(test));
      }
}

Expert Comment

by:objects
ID: 11946821
eg.

s = s.replaceAll("\"", "&quot;");
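
With the replace route, the order of the calls matters: & has to be handled first, otherwise the ampersands introduced by the other replacements get escaped a second time. A minimal sketch of that ordering follows. The class and method names are made up for illustration, and String.replace with string arguments assumes Java 5 or later (on 1.4, replaceAll does the same job for these literals).

```java
// Illustrative sketch only; ReplaceOrderSketch and escapeText are made-up names.
class ReplaceOrderSketch {

    // Escapes the five characters that XML predefines entities for.
    // '&' goes first so that the '&' inside "&lt;", "&quot;" etc. is not escaped again.
    static String escapeText(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&apos;");
    }

    public static void main(String[] args) {
        // Values taken from the sample XML in the question.
        System.out.println(escapeText("A & B & C "));
        System.out.println(escapeText("No<-----Agree  "));
    }
}
```

This is only meant for element values, not for whole lines that still contain tags.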

Expert Comment

by:CEHJ
ID: 11946828
Not sure why you'd want to remove the functionality of replacing other characters that need replacing ...

Expert Comment

by:objects
ID: 11946874
If you need to add support for more characters, simply add them to the escapeChar switch statement, or add another replaceAll() call if using that.

Expert Comment

by:CEHJ
ID: 11946898
>>If you need to add support for more characters

You do need to, or you won't find out until it's too late ;-) That's why the class I posted is written that way.
 

Author Comment

by:joeyoungkc
ID: 11946912
The problem is that the original line is
<ELEMENT name="share_info">A & B & C </ELEMENT>
and I want to change it to
<ELEMENT name="share_info">A &amp; B &amp; C </ELEMENT>

but you guys' methods will change it to something like

&lt;ELEMENT name="postal_zip"&gt;A &amp; B &gt; C&lt;/ELEMENT&gt;

....

Expert Comment

by:objects
ID: 11946919
You need to parse the line and convert only the value, not the tags.

Expert Comment

by:CEHJ
ID: 11946950
Try
String[] text = line.split("<[^>]+>");
if (text.length == 1) {
    String s = HTMLEscape(text[0]);
}

Expert Comment

by:CEHJ
ID: 11946971
>>String s = HTMLEscape(text[0]);

was meant to be

String s = HTMLEscape.escape(text[0]);

Accepted Solution

by:
objects earned 300 total points
ID: 11947217
> For some special reason, I can't handle the special characters in cgi,

whats the special reason :)
 

Author Comment

by:joeyoungkc
ID: 11947302
The reason is that there are so many CGIs and I don't want to change them one by one
=)
So the ideal solution is to solve it during parsing. I used
String[] text = line.split("<[^>]+>");
if (text.length == 1) {
    String s = HTMLEscape(text[0]);
}

and modified it a lot, and it works.

Thanks a lot.
Joe

Expert Comment

by:CEHJ
ID: 11947319
joeyoungkc, can you tell me how

>>whats the special reason :)

can be an answer to this question?
 

Author Comment

by:joeyoungkc
ID: 11947368
The reason is that there are so many CGIs creating the same kind of XML, and I don't want to change them one by one, so the best way is to handle the problem in the parser.
 

Author Comment

by:joeyoungkc
ID: 11947378
>>whats the special reason :)
That was a typo....

Expert Comment

by:objects
ID: 11948745
> So the ideal solution is to solve it during parsing.i used
> String[] text = line.split("<[^>]+>");
> if (text.length == 1) {
>     String s = HTMLEscape(text[0]);
> }

That's not ideal; in fact it's not even safe. You'll still end up with corrupted XML data.

How many CGIs are involved? The change required to each is fairly minor, and then you won't have to worry about anything during parsing.
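
To illustrate the kind of corruption the split approach can cause, here is a small sketch run against two lines from the sample in the question. The class and method names are made up for illustration; only the regex is taken from the thread.

```java
import java.util.Arrays;

// Illustrative sketch only; SplitPitfallSketch and textParts are made-up names.
class SplitPitfallSketch {

    // Splits a line on anything that looks like a tag, as suggested earlier in the thread.
    static String[] textParts(String line) {
        return line.split("<[^>]+>");
    }

    public static void main(String[] args) {
        String first = "<ELEMENT name=\"share_info\">A & B & C </ELEMENT>";
        String second = "<ELEMENT name=\"special1\">No<-----Agree  </ELEMENT>";

        // The opening tag sits at index 0, so split() returns an empty
        // leading string followed by the value.
        System.out.println(Arrays.toString(textParts(first)));

        // The stray '<' in the value makes the regex swallow everything
        // up to the final '>' as if it were a single tag.
        System.out.println(Arrays.toString(textParts(second)));
    }
}
```

The first line comes back as two parts, so a length == 1 test never fires for it. On the second line only "No" survives, because everything from the stray < to the closing > of </ELEMENT> is matched as one tag.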

Expert Comment

by:objects
ID: 11948947
And the time spent changing the CGIs is going to save you time sorting out problems in the future (not to mention the time spent working out how to handle it during parsing). In the long run you may find you have to change the CGIs anyway :)

How are the CGIs currently generating the XML, and how does it get passed on for parsing?

Expert Comment

by:objects
ID: 11949212
If you can't be convinced then the following regex will work a lot better for pulling the value out. You may need to tweak it a little depending on exactly what you need to deal with and how it is delivered to you but you should get the idea. Give me a yell if you have any questions :)

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// s is assumed to hold a single line of the raw XML
Pattern p = Pattern.compile("<ELEMENT name=\"(.+?)\">(.*?)</ELEMENT>");
Matcher m = p.matcher(s);
if (m.matches())
{
   String name = m.group(1);
   String value = m.group(2);
   System.out.println(name+"="+value);
   // XMLUtil handles all the encoding that you need to worry about
   // though depending on what you're doing with the parsed data you may not need to worry about it at all.
   value = XMLUtil.encodeChars(value);

}

Though as I've already mentioned, I'd strongly suggest biting the bullet and fixing your CGIs, for example by having them wrap each value in a CDATA section:

<ELEMENT name="share_info"><![CDATA[A & B & C ]]></ELEMENT>
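
If the escaping does have to happen on the Java side, the regex approach and the SAXParser call from the question could be combined roughly as follows. This is a sketch under several assumptions: each ELEMENT sits on its own line, the CGIs never escape anything themselves (otherwise &amp; would be escaped twice), and the class and method names are made up for illustration.

```java
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative sketch only; PreEscapeSketch and its methods are made-up names.
class PreEscapeSketch {

    // Captures the opening tag, the raw value and the closing tag of one ELEMENT line.
    private static final Pattern ELEMENT =
            Pattern.compile("(<ELEMENT name=\"[^\"]+\">)(.*?)(</ELEMENT>)");

    // Escapes the characters that break element content; '&' first.
    static String escapeValue(String value) {
        return value.replace("&", "&amp;")
                    .replace("<", "&lt;")
                    .replace(">", "&gt;");
    }

    // Escapes only the value of an ELEMENT line; any other line is returned unchanged.
    static String fixLine(String line) {
        Matcher m = ELEMENT.matcher(line.trim());
        if (!m.matches()) {
            return line;
        }
        return m.group(1) + escapeValue(m.group(2)) + m.group(3);
    }

    // Applies fixLine to every line of the raw document.
    static String fixDocument(String rawXml) {
        StringBuilder fixed = new StringBuilder();
        for (String line : rawXml.split("\r?\n")) {
            fixed.append(fixLine(line)).append('\n');
        }
        return fixed.toString();
    }

    public static void main(String[] args) throws Exception {
        String raw = "<DATA>\n<CORE>\n"
                + "<ELEMENT name=\"share_info\">A & B & C </ELEMENT>\n"
                + "<ELEMENT name=\"special1\">No<-----Agree  </ELEMENT>\n"
                + "</CORE>\n</DATA>\n";

        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        // A throwaway handler that just echoes character data.
        parser.parse(new InputSource(new StringReader(fixDocument(raw))), new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                System.out.print(new String(ch, start, length));
            }
        });
    }
}
```

The parser hands the handler the unescaped text again, so nothing downstream needs to know the input was repaired first.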
