• Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 995
  • Last Modified:

XML Escape Algorithm

I am looking for a robust c/c++ algorithm that will escape xml strings correctly.

I would like to include non printable characters such as the euro british pound character etc. Obviously & < > must also work.

I cannot believe that there are not plenty of implementations, but I am struggling to find one.

Thanks
Regards
Craig.
0
cmain
Asked:
cmain
  • 3
  • 3
  • 2
  • +3
1 Solution
 
AxterCommented:
>>I am looking for a robust c/c++ algorithm that
>>will escape xml strings correctly.

What do you mean by escape?
0
 
jkrCommented:
The "http://xmlsoft.org/xml.html" (http://xmlsoft.org/xml.html) comes with an encoder, maybe you can get some 'inspiration' from there. The encoding itself is described RFC 2396 (http://www.faqs.org/rfcs/rfc2396.html)
0
 
cmainAuthor Commented:
What I mean by escape is.

& -> &amp;
> -> &gt;
< -> &lt;

These are simple escape sequences. Other sequences are required for characters that are not printable, such as the british pound character.

The RFC covers the format of URL/URI type strings, I am talking about more general text node values in an xml document.
0
Independent Software Vendors: We Want Your Opinion

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

 
jkrCommented:
>> am talking about more general text node values in an xml document

Check http://xmlsoft.org/xml.html again - this lib also does that (IIRC)
0
 
DanRollinsCommented:
There's nothing to it:
In attribute values, replace every occurrence of
    &, <, ", cr, lf, tab
with
    &amp;, &lt;, &quot;, &#xD;, &#xA;, &#x9;

In text elements, replace every occurrence of
    &, <, >, cr
with

   &amp;, &lt;, &gt;, &#xD;

If you want, you can also convert any or ALL characters to the
    &#xNNN;  or &#NNN;

sequence, so if you feel shaky about what to do with British Pound Symbol, just encoide all values >= 128 with the hex code equivallent.

I can write the code for you if you describe the platform on which your encoder will be running (i.e., are you expecting to use STL, or will you just pass char* or will you be using MFC and it's veratile CString type?

-- Dan
0
 
Roshan DavisCommented:
<YOURTAG xmlns:dt="urn:schemas-microsoft-com:datatypes" dt:dt="bin.base64">Microsoft Says - You can use any binary data here....</YOURTAG>

GOOD LUCK
0
 
cmainAuthor Commented:
Hi Dan,

I am using STL (std::string).
I am now intrigued as you how you would do it, so go ahead; and the question is yours.

Regards
Craig
0
 
DanRollinsCommented:
#include <string>
using namespace std;

string EncodeForXml( string sSrc )
{
     string sRet;
     const char* p= sSrc.c_str();
     
     while( *p ) {
          switch( *p ) {
          case ';':  sRet += "&amp;";   break;
          case '<':  sRet += "&;lt;";   break;
          case '>':  sRet += "&;gt;";   break;
          case '\"': sRet += "&;quot;"; break;
          default:
               if ( (*p < ' ') || (*p > 127 ) ) {
                    char szNum[5];
                    sprintf( szNum, "&x%X;", (unsigned char)*p );
                    sRet += szNum;
               }
               else {
                    sRet += *p;
               }
          }
          p++;
     }
     return sRet;
}

void main()
{
     string sSrc(
          "Some XML contains text like 6 < 7 & other XML has \"7 > 6\" in quotes! \n"
          "Some XML contains $ but the Britsh prefer the £ sign." // note, also \xa3 is a pound sterling thingy
     );
     string sDest= EncodeForXml( sSrc );
}

AFAIK, it's fine to 'over encode' that is, one can use &gt; or &x1B; even if it is not required; so that's what my code does.

If you are encoding huge string, the above algorithm is too slow because of all of the string concatenation.  But then, if you are sending such big strings, it is easier and probably better to surround with
    <!CDATA[[ ... ]]>
and do no encoding at all.

-- Dan
0
 
cmainAuthor Commented:
Thanks Dan,

Yes, I have been using CDATA sections in places where they are required.

The reason I have asked for the code sample is actually to encode some attribute values that may contain user typed values. > < and & are quite common.

Thanks for all the help.
0
 
matsondawsonCommented:
Hi,

DanRollins answer is somewhat broken.
Note the buffer overflow, missing encoding for single quote, incorrect encodings for <>".
Also (*p > 127 )  where *p is a signed char having no effect.
You should probably use iterators to iterate chars in strings as well to save system resources.
Here is a fixed version.

#include <string>
#include <sstream>
using namespace std;

/**
 * Escape characters that will interfere with xml.
 *
 * @param sSrc The src string to escape.
 * @return sSrc encoded for insertion into xml.
 */
string encodeForXml( string sSrc )
{
    ostringstream sRet;

    for( string::const_iterator iter = sSrc.begin(); iter!=sSrc.end(); iter++ )
    {
         unsigned char c = (unsigned char)*iter;

         switch( c )
         {
             case ';': sRet << "&amp;"; break;
             case '<': sRet << "&lt;"; break;
             case '>': sRet << "&gt;"; break;
             case '"': sRet << "&quot;"; break;
             case '\'': sRet << "&apos;"; break;

             default:
              if ( c<32 || c>127 )
              {
                   sRet << "&#" << (unsigned int)c << ";";
              }
              else
              {
                   sRet << c;
              }
         }
    }

    return sRet.str();
}

Cheers,
Matt ( matsondawson )
0
 
DanRollinsCommented:
Thanks for the imporvement -- I think.  

It's true that a signed char > 127 would be as rare as hen's teeth, my routine would still encode it because it would then be less than ' ' (which is 32).  And IMHO iterators were invented just so STL purists could have something to talk about at hackerz conventions and such
:-)

-- Dan
0
 
matsondawsonCommented:
Cool,
Your right, make it a signed char and remove the c>127.
I admit I'm a little uncomfortable with operators in c++, because I feel they hide the internal workings of the whole system.
But sometimes you just have to let it all go and join the dark side : )

Cheers,
Matt
0

Featured Post

Upgrade your Question Security!

Your question, your audience. Choose who sees your identity—and your question—with question security.

  • 3
  • 3
  • 2
  • +3
Tackle projects and never again get stuck behind a technical roadblock.
Join Now