Solved

XML Escape Algorithm

Posted on 2002-07-01
Last Modified: 2013-11-19
I am looking for a robust C/C++ algorithm that will escape XML strings correctly.

I would like to include non-printable characters such as the euro and British pound signs. Obviously &, < and > must also work.

I cannot believe that there are not plenty of implementations, but I am struggling to find one.

Thanks
Regards
Craig.
Question by:cmain
12 Comments
 
LVL 30

Expert Comment

by:Axter
ID: 7121637
>>I am looking for a robust c/c++ algorithm that
>>will escape xml strings correctly.

What do you mean by escape?
 
LVL 86

Expert Comment

by:jkr
ID: 7121673
The library at http://xmlsoft.org/xml.html comes with an encoder; maybe you can get some 'inspiration' from there. The encoding itself is described in RFC 2396 (http://www.faqs.org/rfcs/rfc2396.html).
 
LVL 1

Author Comment

by:cmain
ID: 7122402
What I mean by escape is:

& -> &amp;
> -> &gt;
< -> &lt;

These are simple escape sequences. Other sequences are required for characters that are not printable, such as the British pound character.

The RFC covers the format of URL/URI type strings; I am talking about more general text node values in an XML document.
 
LVL 86

Expert Comment

by:jkr
ID: 7122421
>> am talking about more general text node values in an xml document

Check http://xmlsoft.org/xml.html again - this lib also does that (IIRC)
 
LVL 49

Expert Comment

by:DanRollins
ID: 7123618
There's nothing to it:
In attribute values, replace every occurrence of
    &, <, ", cr, lf, tab
with
    &amp;, &lt;, &quot;, &#xD;, &#xA;, &#x9;

In text elements, replace every occurrence of
    &, <, >, cr
with

   &amp;, &lt;, &gt;, &#xD;
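
The two replacement tables above can be sketched as a pair of small functions. This is only an illustration of those rules, not a complete XML serializer; the names escapeAttribute and escapeText are made up for this example, and the attribute version assumes the value will be written between double quotes.

```cpp
#include <string>

// Hypothetical helper: escape a value destined for a double-quoted attribute.
// Follows the attribute table above: & < " CR LF TAB.
std::string escapeAttribute(const std::string& src)
{
    std::string out;
    out.reserve(src.size());          // avoids some reallocations
    for (char c : src) {
        switch (c) {
        case '&':  out += "&amp;";  break;
        case '<':  out += "&lt;";   break;
        case '"':  out += "&quot;"; break;
        case '\r': out += "&#xD;";  break;
        case '\n': out += "&#xA;";  break;
        case '\t': out += "&#x9;";  break;
        default:   out += c;        break;
        }
    }
    return out;
}

// Hypothetical helper: escape character data between tags.
// Follows the text table above: & < > CR.
std::string escapeText(const std::string& src)
{
    std::string out;
    out.reserve(src.size());
    for (char c : src) {
        switch (c) {
        case '&':  out += "&amp;"; break;
        case '<':  out += "&lt;";  break;
        case '>':  out += "&gt;";  break;
        case '\r': out += "&#xD;"; break;
        default:   out += c;       break;
        }
    }
    return out;
}
```

Keeping the two cases separate means quotes and line breaks stay readable in element text, while attribute values survive whitespace normalization by the parser.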

If you want, you can also convert any or ALL characters to the
    &#xNNN;  or &#NNN;

sequence, so if you feel shaky about what to do with the British pound symbol, just encode all values >= 128 with the hex code equivalent.

I can write the code for you if you describe the platform on which your encoder will be running (i.e., are you expecting to use STL, will you just pass char*, or will you be using MFC and its versatile CString type?).
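
As a rough sketch of the "encode everything >= 128" idea: the function below assumes the input is ISO-8859-1 (Latin-1), so each byte value is also the character's code point. That assumption does not hold for UTF-8 or for Windows-1252 bytes 0x80 to 0x9F (where the euro sign lives), which would need a real conversion first. The name encodeHighBytes is invented for this example, and it deliberately leaves &, < and > alone.

```cpp
#include <string>

// Hypothetical helper: replace every byte >= 128 with a hex character
// reference. ASSUMES Latin-1 input (byte value == Unicode code point).
std::string encodeHighBytes(const std::string& latin1)
{
    static const char hex[] = "0123456789ABCDEF";
    std::string out;
    out.reserve(latin1.size());
    for (char ch : latin1) {
        // Go through unsigned char so bytes above 127 are not negative.
        const unsigned char c = static_cast<unsigned char>(ch);
        if (c >= 128) {
            out += "&#x";
            out += hex[c >> 4];       // high nibble (always 8..F here)
            out += hex[c & 0x0F];     // low nibble
            out += ';';
        } else {
            out += ch;
        }
    }
    return out;
}
```

With Latin-1 input, the pound sign byte \xA3 comes out as a six-character reference and plain ASCII passes through unchanged.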

-- Dan
 
LVL 23

Expert Comment

by:Roshan Davis
ID: 7123739
<YOURTAG xmlns:dt="urn:schemas-microsoft-com:datatypes" dt:dt="bin.base64">Microsoft Says - You can use any binary data here....</YOURTAG>

GOOD LUCK
LVL 1

Author Comment

by:cmain
ID: 7123895
Hi Dan,

I am using STL (std::string).
I am now intrigued as to how you would do it, so go ahead, and the question is yours.

Regards
Craig
 
LVL 49

Accepted Solution

by:
DanRollins earned 500 total points
ID: 7124155
#include <cstdio>
#include <string>
using namespace std;

string EncodeForXml( string sSrc )
{
     string sRet;
     const char* p= sSrc.c_str();
     
     while( *p ) {
          switch( *p ) {
          case '&':  sRet += "&amp;";   break;
          case '<':  sRet += "&;lt;";   break;
          case '>':  sRet += "&;gt;";   break;
          case '\"': sRet += "&;quot;"; break;
          default:
               if ( (*p < ' ') || (*p > 127 ) ) {
                    char szNum[5];
                    sprintf( szNum, "&x%X;", (unsigned char)*p );
                    sRet += szNum;
               }
               else {
                    sRet += *p;
               }
          }
          p++;
     }
     return sRet;
}

void main()
{
     string sSrc(
          "Some XML contains text like 6 < 7 & other XML has \"7 > 6\" in quotes! \n"
          "Some XML contains $ but the Britsh prefer the £ sign." // note, also \xa3 is a pound sterling thingy
     );
     string sDest= EncodeForXml( sSrc );
}

AFAIK, it's fine to 'over encode'; that is, one can use &gt; or &#x1B; even if it is not required, so that's what my code does.

If you are encoding a huge string, the above algorithm is too slow because of all the string concatenation.  But then, if you are sending such big strings, it is easier and probably better to surround them with
    <![CDATA[ ... ]]>
and do no encoding at all.
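
A minimal sketch of the CDATA approach, with one caveat the wrapper has to handle: the text itself must not contain the terminator ]]>, so the usual workaround is to end the section after "]]" and open a new one before ">". The function name wrapInCData is invented for this example.

```cpp
#include <string>

// Hypothetical helper: wrap text in a CDATA section, splitting the section
// wherever the text contains the terminator "]]>".
std::string wrapInCData(const std::string& text)
{
    static const std::string terminator = "]]>";
    std::string out = "<![CDATA[";
    std::string::size_type pos = 0;
    for (;;) {
        const std::string::size_type hit = text.find(terminator, pos);
        if (hit == std::string::npos) {
            out.append(text, pos, std::string::npos);   // copy the remainder
            break;
        }
        // Copy up to and including "]]", close the section, open a new
        // one, and continue from the ">" that completed the terminator.
        out.append(text, pos, hit - pos + 2);
        out += "]]><![CDATA[";
        pos = hit + 2;
    }
    out += "]]>";
    return out;
}
```

Note that CDATA only applies to element content; attribute values still need the entity escaping discussed earlier.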

-- Dan
 
LVL 1

Author Comment

by:cmain
ID: 7124186
Thanks Dan,

Yes, I have been using CDATA sections in places where they are required.

The reason I have asked for the code sample is actually to encode some attribute values that may contain user typed values. > < and & are quite common.

Thanks for all the help.
 

Expert Comment

by:matsondawson
ID: 9041131
Hi,

DanRollins's answer is somewhat broken.
Note the buffer overflow, the missing encoding for the single quote, and the incorrect encodings for <, > and ".
Also, the test (*p > 127) has no effect when *p is a signed char.
You should probably use iterators to walk the characters of the string as well, to save system resources.
Here is a fixed version.

#include <string>
#include <sstream>
using namespace std;

/**
 * Escape characters that will interfere with xml.
 *
 * @param sSrc The src string to escape.
 * @return sSrc encoded for insertion into xml.
 */
string encodeForXml( const string& sSrc )
{
    ostringstream sRet;

    for( string::const_iterator iter = sSrc.begin(); iter!=sSrc.end(); ++iter )
    {
         // Work with the unsigned value so bytes above 127 are not negative.
         unsigned char c = (unsigned char)*iter;

         switch( c )
         {
             // The five predefined XML entities.
             case '&': sRet << "&amp;"; break;
             case '<': sRet << "&lt;"; break;
             case '>': sRet << "&gt;"; break;
             case '"': sRet << "&quot;"; break;
             case '\'': sRet << "&apos;"; break;

             default:
              if ( c<32 || c>127 )
              {
                   // Decimal character reference; treats the byte value as
                   // the character code (i.e. assumes Latin-1 input).
                   sRet << "&#" << (unsigned int)c << ";";
              }
              else
              {
                   sRet << c;
              }
         }
    }

    return sRet.str();
}

Cheers,
Matt ( matsondawson )
 
LVL 49

Expert Comment

by:DanRollins
ID: 9043613
Thanks for the improvement -- I think.

It's true that a signed char > 127 would be as rare as hen's teeth, but my routine would still encode it, because it would then be negative and so less than ' ' (which is 32).  And IMHO iterators were invented just so STL purists could have something to talk about at hackerz conventions and such
:-)
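
The point about signed char can be checked with a small sketch. The helper name needsNumericEscape is made up for this example, and it takes an explicitly signed char so the behaviour does not depend on whether plain char is signed on a given compiler.

```cpp
// Hypothetical helper mirroring the test (*p < ' ') || (*p > 127)
// for an explicitly signed char.
bool needsNumericEscape(signed char c)
{
    const int value = c;              // range is -128..127 on typical platforms
    // value > 127 can never be true here; bytes such as 0xA3 arrive as
    // negative numbers and are caught by the first comparison instead.
    return value < ' ' || value > 127;
}
```

So the byte 0xA3 stored in a signed char is -93, which is below 32 and is flagged for escaping, while ordinary printable ASCII is not.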

-- Dan
 

Expert Comment

by:matsondawson
ID: 9044889
Cool,
You're right, make it a signed char and remove the c>127.
I admit I'm a little uncomfortable with operators in C++, because I feel they hide the internal workings of the whole system.
But sometimes you just have to let it all go and join the dark side : )

Cheers,
Matt