XML Escape Algorithm

I am looking for a robust C/C++ algorithm that will escape XML strings correctly.

I would like to include characters outside the printable ASCII range, such as the euro sign, the British pound sign, etc. Obviously &, < and > must also work.

I cannot believe that there are not plenty of implementations, but I am struggling to find one.

Thanks
Regards
Craig.
cmainAsked:

AxterCommented:
>>I am looking for a robust c/c++ algorithm that
>>will escape xml strings correctly.

What do you mean by escape?
0
jkrCommented:
libxml2 (http://xmlsoft.org/xml.html) comes with an encoder; maybe you can get some 'inspiration' from there. The encoding itself is described in RFC 2396 (http://www.faqs.org/rfcs/rfc2396.html).
0
cmainAuthor Commented:
What I mean by escape is:

& -> &amp;
> -> &gt;
< -> &lt;

These are simple escape sequences. Other sequences are required for characters outside the printable ASCII range, such as the British pound character.

The RFC covers the format of URL/URI type strings; I am talking about more general text node values in an XML document.
0
jkrCommented:
>> am talking about more general text node values in an xml document

Check http://xmlsoft.org/xml.html again - this lib also does that (IIRC)
0
DanRollinsCommented:
There's nothing to it:
In attribute values, replace every occurrence of
    &, <, ", cr, lf, tab
with
    &amp;, &lt;, &quot;, &#xD;, &#xA;, &#x9;

In text elements, replace every occurrence of
    &, <, >, cr
with
    &amp;, &lt;, &gt;, &#xD;
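
The two replacement tables above can be sketched roughly as follows. This is only a minimal illustration, assuming std::string input and double-quoted attribute values; the names escapeAttribute and escapeText are made up for this example and do not come from any library.

```cpp
#include <string>

// Hypothetical helper: escape a value that will be written inside a
// double-quoted attribute, using the attribute table above.
std::string escapeAttribute(const std::string& src)
{
    std::string out;
    out.reserve(src.size());  // avoids some reallocations; output may still grow
    for (std::string::size_type i = 0; i < src.size(); ++i) {
        switch (src[i]) {
        case '&':  out += "&amp;";  break;
        case '<':  out += "&lt;";   break;
        case '"':  out += "&quot;"; break;
        case '\r': out += "&#xD;";  break;
        case '\n': out += "&#xA;";  break;
        case '\t': out += "&#x9;";  break;
        default:   out += src[i];
        }
    }
    return out;
}

// Hypothetical helper: escape a value that will be written as element text,
// using the text table above.
std::string escapeText(const std::string& src)
{
    std::string out;
    out.reserve(src.size());
    for (std::string::size_type i = 0; i < src.size(); ++i) {
        switch (src[i]) {
        case '&':  out += "&amp;"; break;
        case '<':  out += "&lt;";  break;
        case '>':  out += "&gt;";  break;
        case '\r': out += "&#xD;"; break;
        default:   out += src[i];
        }
    }
    return out;
}
```

Keeping the two cases apart means quotes, line feeds and tabs in element text are left alone, while in attributes they are escaped so the value survives attribute normalization.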

If you want, you can also convert any or ALL characters to the
    &#xNNN;  or &#NNN;

sequence, so if you feel shaky about what to do with the British pound symbol, just encode all values >= 128 with the hex code equivalent.
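
A rough sketch of that "encode everything >= 128" idea is below. It assumes the input is ISO-8859-1, where each byte value equals the character's Unicode code point; that assumption matters, because the euro sign is not in ISO-8859-1 at all (in Windows-1252 it is byte 0x80, while its code point is U+20AC), and for UTF-8 input the bytes would have to be decoded into code points first. The names appendCharRef and encodeHighBytes are invented for this example.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: append "&#xNN;" for one byte, treating the byte value
// as the code point (only valid under the ISO-8859-1 assumption).
void appendCharRef(std::string& out, unsigned char byte)
{
    char buf[8];  // "&#xFF;" is 6 characters plus the terminating NUL
    std::snprintf(buf, sizeof buf, "&#x%X;", static_cast<unsigned>(byte));
    out += buf;
}

// Hypothetical helper: replace every byte >= 128 with a hex character
// reference and copy everything else unchanged.
std::string encodeHighBytes(const std::string& src)
{
    std::string out;
    for (std::string::size_type i = 0; i < src.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(src[i]);
        if (c >= 128)
            appendCharRef(out, c);
        else
            out += src[i];
    }
    return out;
}
```

With ISO-8859-1 input, the pound sign (byte 0xA3) comes out as &#xA3;.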

I can write the code for you if you describe the platform on which your encoder will be running (i.e., are you expecting to use STL, will you just pass char*, or will you be using MFC and its versatile CString type?).

-- Dan
0
Roshan DavisCommented:
<YOURTAG xmlns:dt="urn:schemas-microsoft-com:datatypes" dt:dt="bin.base64">Microsoft Says - You can use any binary data here....</YOURTAG>

GOOD LUCK
0
cmainAuthor Commented:
Hi Dan,

I am using STL (std::string).
I am now intrigued as to how you would do it, so go ahead; and the question is yours.

Regards
Craig
0
DanRollinsCommented:
#include <cstdio>
#include <string>
using namespace std;

string EncodeForXml( string sSrc )
{
     string sRet;
     const char* p= sSrc.c_str();
     
     while( *p ) {
          switch( *p ) {
          case '&':  sRet += "&amp;";   break;
          case '<':  sRet += "&;lt;";   break;
          case '>':  sRet += "&;gt;";   break;
          case '\"': sRet += "&;quot;"; break;
          default:
               if ( (*p < ' ') || (*p > 127 ) ) {
                    char szNum[5];
                    sprintf( szNum, "&#x%X;", (unsigned char)*p );
                    sRet += szNum;
               }
               else {
                    sRet += *p;
               }
          }
          p++;
     }
     return sRet;
}

int main()
{
     string sSrc(
          "Some XML contains text like 6 < 7 & other XML has \"7 > 6\" in quotes! \n"
          "Some XML contains $ but the British prefer the £ sign." // note, also \xa3 is a pound sterling thingy
     );
     string sDest= EncodeForXml( sSrc );
}

AFAIK, it's fine to 'over encode'; that is, one can use &gt; or &#x1B; even if it is not required, so that's what my code does.

If you are encoding a huge string, the above algorithm is too slow because of all of the string concatenation.  But then, if you are sending such big strings, it is easier and probably better to surround with
    <![CDATA[ ... ]]>
and do no encoding at all.
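
One catch with the CDATA route is that the text itself must not contain the terminator "]]>". A minimal sketch of one common workaround is below: split the terminator across two adjacent CDATA sections. The name wrapInCdata is made up for this example, and note that CDATA sections only exist in element content, so this does not help for attribute values.

```cpp
#include <string>

// Hypothetical helper: wrap text in a CDATA section. A literal "]]>" in the
// text would end the section early, so the section is closed after "]]" and
// a new one is opened before ">".
std::string wrapInCdata(const std::string& src)
{
    std::string out = "<![CDATA[";
    std::string::size_type start = 0;
    std::string::size_type pos;
    while ((pos = src.find("]]>", start)) != std::string::npos) {
        out.append(src, start, pos - start + 2);  // copy up to and including "]]"
        out += "]]><![CDATA[";                     // close, then reopen
        start = pos + 2;                           // resume at the ">"
    }
    out.append(src, start, std::string::npos);     // remaining tail
    out += "]]>";
    return out;
}
```

For input without "]]>" this just adds the opening and closing markers, so no per-character work is done on the common path.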

-- Dan
0
cmainAuthor Commented:
Thanks Dan,

Yes, I have been using CDATA sections in places where they are required.

The reason I have asked for the code sample is actually to encode some attribute values that may contain user typed values. > < and & are quite common.

Thanks for all the help.
0
matsondawsonCommented:
Hi,

DanRollins's answer is somewhat broken.
Note the buffer overflow, the missing encoding for the single quote, and the incorrect encodings for <, > and ".
Also, the test (*p > 127) has no effect when *p is a signed char.
You should probably use iterators to walk the characters of the string as well, to save system resources.
Here is a fixed version.

#include <string>
#include <sstream>
using namespace std;

/**
 * Escape characters that will interfere with xml.
 *
 * @param sSrc The src string to escape.
 * @return sSrc encoded for insertion into xml.
 */
string encodeForXml( string sSrc )
{
    ostringstream sRet;

    for( string::const_iterator iter = sSrc.begin(); iter!=sSrc.end(); iter++ )
    {
         unsigned char c = (unsigned char)*iter;

         switch( c )
         {
             case '&': sRet << "&amp;"; break;
             case '<': sRet << "&lt;"; break;
             case '>': sRet << "&gt;"; break;
             case '"': sRet << "&quot;"; break;
             case '\'': sRet << "&apos;"; break;

             default:
              if ( c<32 || c>127 )
              {
                   sRet << "&#" << (unsigned int)c << ";";
              }
              else
              {
                   sRet << c;
              }
         }
    }

    return sRet.str();
}

Cheers,
Matt ( matsondawson )
0
DanRollinsCommented:
Thanks for the improvement -- I think.  

It's true that a signed char > 127 would be as rare as hen's teeth, but my routine would still encode it because it would then be less than ' ' (which is 32).  And IMHO iterators were invented just so STL purists could have something to talk about at hackerz conventions and such
:-)
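
To spell out the signed char point: where plain char is signed, a byte such as 0xA3 is stored as a negative value, so a test like (*p < ' ') catches it and (*p > 127) never fires; where plain char is unsigned, it is the other way round. A small sketch of a check that does not depend on this is below. The name needsNumericReference is invented for the example, and the 32/127 thresholds are simply the ones used in this thread.

```cpp
// Hypothetical helper: decide whether a byte falls outside the range that is
// copied through unchanged. Casting to unsigned char first makes the result
// the same whether plain char is signed or unsigned on the platform.
bool needsNumericReference(char c)
{
    unsigned char u = static_cast<unsigned char>(c);
    return u < 32 || u > 127;
}
```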

-- Dan
0
matsondawsonCommented:
Cool,
You're right, make it a signed char and remove the c>127.
I admit I'm a little uncomfortable with operators in C++, because I feel they hide the internal workings of the whole system.
But sometimes you just have to let it all go and join the dark side : )

Cheers,
Matt
0