Solved

XML Escape Algorithm

Posted on 2002-07-01
Last Modified: 2013-11-19
I am looking for a robust C/C++ algorithm that will escape XML strings correctly.

I would like to handle non-ASCII characters such as the euro sign and the British pound sign. Obviously &, < and > must also work.

I cannot believe that there are not plenty of implementations, but I am struggling to find one.

Thanks
Regards
Craig.
Question by:cmain
 
LVL 30

Expert Comment

by:Axter
ID: 7121637
>>I am looking for a robust c/c++ algorithm that
>>will escape xml strings correctly.

What do you mean by escape?
 
LVL 86

Expert Comment

by:jkr
ID: 7121673
The libxml library (http://xmlsoft.org/xml.html) comes with an encoder; maybe you can get some 'inspiration' from there. The encoding itself is described in RFC 2396 (http://www.faqs.org/rfcs/rfc2396.html).
 
LVL 1

Author Comment

by:cmain
ID: 7122402
What I mean by escape is.

& -> &amp;
> -> &gt;
< -> &lt;

These are simple escape sequences. Other sequences are required for characters outside the printable ASCII range, such as the British pound character.

The RFC covers the format of URL/URI type strings; I am talking about more general text node values in an XML document.
 
LVL 86

Expert Comment

by:jkr
ID: 7122421
>> am talking about more general text node values in an xml document

Check http://xmlsoft.org/xml.html again - this lib also does that (IIRC)
 
LVL 49

Expert Comment

by:DanRollins
ID: 7123618
There's nothing to it:
In attribute values, replace every occurrence of
    &, <, ", cr, lf, tab
with
    &amp;, &lt;, &quot;, &#xD;, &#xA;, &#x9;

In text elements, replace every occurrence of
    &, <, >, cr
with

   &amp;, &lt;, &gt;, &#xD;

If you want, you can also convert any or ALL characters to the
    &#xNNN;  or &#NNN;

sequence, so if you feel shaky about what to do with the British pound symbol, just encode all values >= 128 with the hex code equivalent.

I can write the code for you if you describe the platform on which your encoder will be running (i.e., are you expecting to use STL, or will you just pass char*, or will you be using MFC and its versatile CString type?).
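
As a rough illustration of the two replacement tables listed above, here is a minimal sketch that assumes std::string input. The function names escapeXmlAttribute and escapeXmlText are made up for this example, and characters with values >= 128 are passed through unchanged here.

```cpp
#include <string>

// Hypothetical helper: escapes a value that will be placed inside a
// double-quoted XML attribute, following the attribute list above.
std::string escapeXmlAttribute(const std::string& in)
{
    std::string out;
    out.reserve(in.size());               // avoid some reallocations
    for (std::string::size_type i = 0; i < in.size(); ++i) {
        switch (in[i]) {
        case '&':  out += "&amp;";  break;
        case '<':  out += "&lt;";   break;
        case '"':  out += "&quot;"; break;
        case '\r': out += "&#xD;";  break;
        case '\n': out += "&#xA;";  break;
        case '\t': out += "&#x9;";  break;
        default:   out += in[i];    break;
        }
    }
    return out;
}

// Hypothetical helper: escapes element text, following the text list above.
std::string escapeXmlText(const std::string& in)
{
    std::string out;
    out.reserve(in.size());
    for (std::string::size_type i = 0; i < in.size(); ++i) {
        switch (in[i]) {
        case '&':  out += "&amp;"; break;
        case '<':  out += "&lt;";  break;
        case '>':  out += "&gt;";  break;
        case '\r': out += "&#xD;"; break;
        default:   out += in[i];   break;
        }
    }
    return out;
}
```

Keeping the two cases in separate functions means a newline in element text stays a literal newline, while the same newline in an attribute value is written as &#xA; so that it survives attribute-value normalization.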

-- Dan
 
LVL 23

Expert Comment

by:Roshan Davis
ID: 7123739
<YOURTAG xmlns:dt="urn:schemas-microsoft-com:datatypes" dt:dt="bin.base64">Microsoft Says - You can use any binary data here....</YOURTAG>

GOOD LUCK
 
LVL 1

Author Comment

by:cmain
ID: 7123895
Hi Dan,

I am using STL (std::string).
I am now intrigued as to how you would do it, so go ahead; the question is yours.

Regards
Craig
 
LVL 49

Accepted Solution

by:
DanRollins earned 2000 total points
ID: 7124155
#include <cstdio>   // sprintf
#include <string>
using namespace std;

string EncodeForXml( string sSrc )
{
     string sRet;
     const char* p= sSrc.c_str();
     
     while( *p ) {
          switch( *p ) {
          case '&':  sRet += "&amp;";   break;
          case '<':  sRet += "&;lt;";   break;
          case '>':  sRet += "&;gt;";   break;
          case '\"': sRet += "&;quot;"; break;
          default:
               if ( (*p < ' ') || (*p > 127 ) ) {
                    char szNum[5];
                    sprintf( szNum, "&x%X;", (unsigned char)*p );
                    sRet += szNum;
               }
               else {
                    sRet += *p;
               }
          }
          p++;
     }
     return sRet;
}

int main()
{
     string sSrc(
          "Some XML contains text like 6 < 7 & other XML has \"7 > 6\" in quotes! \n"
          "Some XML contains $ but the British prefer the £ sign." // note, also \xa3 is a pound sterling thingy
     );
     string sDest= EncodeForXml( sSrc );
}

AFAIK, it's fine to 'over encode'; that is, one can use &gt; or &#x1B; even if it is not required, so that's what my code does.

If you are encoding a huge string, the above algorithm is too slow because of all of the string concatenation.  But then, if you are sending such big strings, it is easier and probably better to surround with
    <![CDATA[ ... ]]>
and do no encoding at all.
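
One caveat with the CDATA approach is that the text itself must not contain the terminator "]]>". A minimal sketch of a wrapper that handles this by splitting the text across two CDATA sections is shown below; the name wrapInCdata is hypothetical and not part of the code above.

```cpp
#include <string>

// Hypothetical helper: wraps text in a CDATA section. A CDATA section ends
// at the first "]]>", so any "]]>" in the input is split so that "]]" ends
// one section and ">" starts the next.
std::string wrapInCdata(const std::string& in)
{
    const std::string open  = "<![CDATA[";
    const std::string close = "]]>";

    std::string out = open;
    std::string::size_type pos = 0;
    for (;;) {
        std::string::size_type hit = in.find(close, pos);
        if (hit == std::string::npos) {
            out.append(in, pos, std::string::npos);   // copy the remainder
            break;
        }
        out.append(in, pos, hit - pos + 2);  // copy up to and including "]]"
        out += close;                        // end the current section
        out += open;                         // start a new one
        pos = hit + 2;                       // continue at the ">"
    }
    out += close;
    return out;
}
```

For example, wrapInCdata("x]]>y") produces two adjacent sections whose contents are "x]]" and ">y", which a parser joins back into the original text.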

-- Dan
 
LVL 1

Author Comment

by:cmain
ID: 7124186
Thanks Dan,

Yes, I have been using CDATA sections in places where they are required.

The reason I have asked for the code sample is actually to encode some attribute values that may contain user typed values. > < and & are quite common.

Thanks for all the help.
 

Expert Comment

by:matsondawson
ID: 9041131
Hi,

DanRollins's answer is somewhat broken.
Note the buffer overflow, the missing encoding for the single quote, and the incorrect encodings for <, > and ".
Also, the test (*p > 127) has no effect when *p is a signed char.
You should probably also use iterators to walk over the characters of the string, to save system resources.
Here is a fixed version.

#include <string>
#include <sstream>
using namespace std;

/**
 * Escape characters that will interfere with xml.
 *
 * @param sSrc The src string to escape.
 * @return sSrc encoded for insertion into xml.
 */
string encodeForXml( const string& sSrc )
{
    ostringstream sRet;

    for( string::const_iterator iter = sSrc.begin(); iter!=sSrc.end(); ++iter )
    {
         // Work on the unsigned value so that bytes above 127 compare correctly.
         unsigned char c = (unsigned char)*iter;

         switch( c )
         {
             case '&': sRet << "&amp;"; break;
             case '<': sRet << "&lt;"; break;
             case '>': sRet << "&gt;"; break;
             case '"': sRet << "&quot;"; break;
             case '\'': sRet << "&apos;"; break;

             default:
              if ( c<32 || c>127 )
              {
                   // Decimal character reference. For bytes above 127 this is
                   // only the right code point if sSrc is ISO-8859-1 (Latin-1).
                   sRet << "&#" << (unsigned int)c << ";";
              }
              else
              {
                   sRet << c;
              }
         }
    }

    return sRet.str();
}
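
The per-byte references in the version above only match Unicode code points for Latin-1 text, and the euro sign from the original question (U+20AC) does not exist in Latin-1. The following is a minimal sketch of a variant that assumes the input is valid UTF-8; that encoding is an assumption (the thread does not say which one is in use), and the name encodeUtf8ForXml is hypothetical. Control characters below 32 are passed through unchanged in this sketch.

```cpp
#include <cstdio>
#include <string>

// Hypothetical variant: assumes sSrc is valid UTF-8 and writes every
// non-ASCII code point as a hexadecimal character reference.
std::string encodeUtf8ForXml(const std::string& sSrc)
{
    std::string out;
    std::string::size_type i = 0;
    while (i < sSrc.size()) {
        unsigned char c = static_cast<unsigned char>(sSrc[i]);
        if (c < 0x80) {                        // plain ASCII byte
            switch (c) {
            case '&':  out += "&amp;";  break;
            case '<':  out += "&lt;";   break;
            case '>':  out += "&gt;";   break;
            case '"':  out += "&quot;"; break;
            case '\'': out += "&apos;"; break;
            default:   out += static_cast<char>(c); break;
            }
            ++i;
            continue;
        }
        // The lead byte tells how many continuation bytes follow.
        int extra = (c >= 0xF0) ? 3 : (c >= 0xE0) ? 2 : 1;
        unsigned long cp = c & (0x3F >> extra);   // payload bits of the lead byte
        ++i;
        for (int k = 0; k < extra && i < sSrc.size(); ++k, ++i) {
            // Each continuation byte contributes its low six bits.
            cp = (cp << 6) | (static_cast<unsigned char>(sSrc[i]) & 0x3F);
        }
        char buf[16];                          // "&#x10FFFF;" needs 11 bytes
        std::snprintf(buf, sizeof buf, "&#x%lX;", cp);
        out += buf;
    }
    return out;
}
```

With this sketch the UTF-8 bytes C2 A3 (pound sign) come out as &#xA3; and E2 82 AC (euro sign) as &#x20AC;. It does no validation, so malformed input produces meaningless references rather than an error.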

Cheers,
Matt ( matsondawson )
 
LVL 49

Expert Comment

by:DanRollins
ID: 9043613
Thanks for the improvement -- I think.  

It's true that a signed char > 127 would be as rare as hen's teeth, but my routine would still encode it because it would then be less than ' ' (which is 32).  And IMHO iterators were invented just so STL purists could have something to talk about at hackerz conventions and such
:-)
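
The point about signed chars can be checked with a small sketch. The helper names below are hypothetical, and the conversion from unsigned char to signed char is assumed to wrap to negative values, as it does on common two's-complement compilers.

```cpp
// Test as seen by a signed char: bytes 128..255 show up as negative values,
// so "c < ' '" already catches them and "c > 127" could never be true.
bool signedTest(signed char c)
{
    return c < ' ';
}

// Test as seen by an unsigned char: both comparisons are needed.
bool unsignedTest(unsigned char c)
{
    return c < 32 || c > 127;
}

// Compares the two tests over every possible byte value.
bool testsAgreeOnAllBytes()
{
    for (int b = 0; b < 256; ++b) {
        unsigned char u = static_cast<unsigned char>(b);
        signed char s = static_cast<signed char>(u);   // assumed to wrap
        if (signedTest(s) != unsignedTest(u)) {
            return false;
        }
    }
    return true;
}
```

Under that assumption both tests select the same set of bytes, so either form works as long as the comparison matches the signedness of the variable.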

-- Dan
 

Expert Comment

by:matsondawson
ID: 9044889
Cool,
You're right, make it a signed char and remove the c>127.
I admit I'm a little uncomfortable with operators in C++, because I feel they hide the internal workings of the whole system.
But sometimes you just have to let it all go and join the dark side : )

Cheers,
Matt