Solved

XML Escape Algorithm

Posted on 2002-07-01
Last Modified: 2013-11-19
I am looking for a robust C/C++ algorithm that will escape XML strings correctly.

I would like to include non-printable characters such as the euro and British pound signs, etc. Obviously &, < and > must also work.

I cannot believe that there are not plenty of implementations, but I am struggling to find one.

Thanks
Regards
Craig.
Question by:cmain
12 Comments
 
LVL 30

Expert Comment

by:Axter
ID: 7121637
>>I am looking for a robust c/c++ algorithm that
>>will escape xml strings correctly.

What do you mean by escape?
0
 
LVL 86

Expert Comment

by:jkr
ID: 7121673
The library at http://xmlsoft.org/xml.html comes with an encoder; maybe you can get some 'inspiration' from there. The encoding itself is described in RFC 2396 (http://www.faqs.org/rfcs/rfc2396.html).
0
 
LVL 1

Author Comment

by:cmain
ID: 7122402
What I mean by escape is:

& -> &amp;
> -> &gt;
< -> &lt;

These are simple escape sequences. Other sequences are required for characters that are not printable, such as the British pound character.

The RFC covers the format of URL/URI type strings; I am talking about more general text node values in an XML document.
0
 
LVL 86

Expert Comment

by:jkr
ID: 7122421
>> am talking about more general text node values in an xml document

Check http://xmlsoft.org/xml.html again - this lib also does that (IIRC)
0
 
LVL 49

Expert Comment

by:DanRollins
ID: 7123618
There's nothing to it:
In attribute values, replace every occurrence of
    &, <, ", cr, lf, tab
with
    &amp;, &lt;, &quot;, &#xD;, &#xA;, &#x9;

In text elements, replace every occurrence of
    &, <, >, cr
with

   &amp;, &lt;, &gt;, &#xD;

If you want, you can also convert any or ALL characters to the
    &#xNNN;  or &#NNN;

sequence, so if you feel shaky about what to do with the British pound symbol, just encode all values >= 128 with the hex code equivalent.

I can write the code for you if you describe the platform on which your encoder will be running (i.e., are you expecting to use STL, will you just pass char*, or will you be using MFC and its versatile CString type?).

-- Dan
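
A minimal sketch of the two replacement tables listed above, for readers who want to see them side by side. The names XmlContext and escapeXml are hypothetical (they do not appear in this thread), and the attribute branch assumes the attribute value is delimited with double quotes. Characters >= 128 are passed through untouched here.

```cpp
#include <string>

// Hypothetical helper type: where the escaped string will be placed.
enum class XmlContext { Text, Attribute };

// Applies the replacement table for the given context:
//   attribute values: & < " cr lf tab
//   text elements:    & < > cr
std::string escapeXml(const std::string& src, XmlContext ctx)
{
    std::string out;
    out.reserve(src.size());
    for (char ch : src) {
        switch (ch) {
        case '&':  out += "&amp;"; break;   // both contexts
        case '<':  out += "&lt;";  break;   // both contexts
        case '\r': out += "&#xD;"; break;   // both contexts
        case '>':
            if (ctx == XmlContext::Text) out += "&gt;"; else out += ch;
            break;
        case '"':
            if (ctx == XmlContext::Attribute) out += "&quot;"; else out += ch;
            break;
        case '\n':
            if (ctx == XmlContext::Attribute) out += "&#xA;"; else out += ch;
            break;
        case '\t':
            if (ctx == XmlContext::Attribute) out += "&#x9;"; else out += ch;
            break;
        default:
            out += ch;                      // everything else is copied as-is
        }
    }
    return out;
}
```

Calling escapeXml(value, XmlContext::Attribute) is intended for text placed between double quotes in a start tag, and XmlContext::Text for element content.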
0
 
LVL 23

Expert Comment

by:Roshan Davis
ID: 7123739
<YOURTAG xmlns:dt="urn:schemas-microsoft-com:datatypes" dt:dt="bin.base64">Microsoft Says - You can use any binary data here....</YOURTAG>

GOOD LUCK
0
 
LVL 1

Author Comment

by:cmain
ID: 7123895
Hi Dan,

I am using STL (std::string).
I am now intrigued as to how you would do it, so go ahead; and the question is yours.

Regards
Craig
0
 
LVL 49

Accepted Solution

by:
DanRollins earned 2000 total points
ID: 7124155
#include <cstdio>   // sprintf
#include <string>
using namespace std;

string EncodeForXml( string sSrc )
{
     string sRet;
     const char* p= sSrc.c_str();
     
     while( *p ) {
          switch( *p ) {
          case '&':  sRet += "&amp;";   break;
          case '<':  sRet += "&;lt;";   break;
          case '>':  sRet += "&;gt;";   break;
          case '\"': sRet += "&;quot;"; break;
          default:
               if ( (*p < ' ') || (*p > 127 ) ) {
                    char szNum[5];
                    sprintf( szNum, "&x%X;", (unsigned char)*p );
                    sRet += szNum;
               }
               else {
                    sRet += *p;
               }
          }
          p++;
     }
     return sRet;
}

int main()
{
     string sSrc(
          "Some XML contains text like 6 < 7 & other XML has \"7 > 6\" in quotes! \n"
          "Some XML contains $ but the British prefer the £ sign." // note, also \xa3 is a pound sterling thingy
     );
     string sDest= EncodeForXml( sSrc );
}

AFAIK, it's fine to 'over-encode'; that is, one can use &gt; or &#x1B; even where it is not required, so that's what my code does.

If you are encoding a huge string, the above algorithm is too slow because of all of the string concatenation.  But then, if you are sending such big strings, it is easier and probably better to surround it with
    <![CDATA[ ... ]]>
and do no encoding at all.

-- Dan
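
A hedged sketch of the CDATA approach mentioned above. One caveat is that the text itself must not contain the sequence ]]>, because that would end the section early; a common workaround is to close the section after ]] and reopen it before >. The function name wrapInCdata is hypothetical and not from this thread.

```cpp
#include <string>

// Wraps `text` in one or more CDATA sections, splitting wherever the
// input contains "]]>" so that the terminator never appears inside a section.
std::string wrapInCdata(const std::string& text)
{
    static const std::string kOpen  = "<![CDATA[";
    static const std::string kClose = "]]>";

    std::string out = kOpen;
    std::string::size_type pos = 0;
    for (;;) {
        const std::string::size_type hit = text.find(kClose, pos);
        if (hit == std::string::npos) {
            out.append(text, pos, std::string::npos);  // copy the remainder
            break;
        }
        // Copy up to and including "]]", end this section, start a new one;
        // the ">" is then emitted at the start of the next section.
        out.append(text, pos, hit - pos + 2);
        out += kClose;
        out += kOpen;
        pos = hit + 2;
    }
    out += kClose;
    return out;
}
```

This only addresses markup characters; it does not make characters that XML disallows (most control characters, for example) acceptable.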
0
 
LVL 1

Author Comment

by:cmain
ID: 7124186
Thanks Dan,

Yes, I have been using CDATA sections in places where they are required.

The reason I have asked for the code sample is actually to encode some attribute values that may contain user typed values. > < and & are quite common.

Thanks for all the help.
0
 

Expert Comment

by:matsondawson
ID: 9041131
Hi,

DanRollins' answer is somewhat broken.
Note the buffer overflow, the missing encoding for the single quote, and the incorrect encodings for <, > and ".
Also, the test (*p > 127) has no effect when *p is a signed char.
You should probably use iterators to iterate over the chars in strings as well, to save system resources.
Here is a fixed version.

#include <string>
#include <sstream>
using namespace std;

/**
 * Escape characters that will interfere with xml.
 *
 * @param sSrc The src string to escape.
 * @return sSrc encoded for insertion into xml.
 */
string encodeForXml( string sSrc )
{
    ostringstream sRet;

    for( string::const_iterator iter = sSrc.begin(); iter!=sSrc.end(); iter++ )
    {
         unsigned char c = (unsigned char)*iter;

         switch( c )
         {
             case '&': sRet << "&amp;"; break;
             case '<': sRet << "&lt;"; break;
             case '>': sRet << "&gt;"; break;
             case '"': sRet << "&quot;"; break;
             case '\'': sRet << "&apos;"; break;

             default:
              if ( c<32 || c>127 )
              {
                   sRet << "&#" << (unsigned int)c << ";";
              }
              else
              {
                   sRet << c;
              }
         }
    }

    return sRet.str();
}

Cheers,
Matt ( matsondawson )
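
A variation on the version above, offered as a sketch rather than a definitive implementation. It rests on two assumptions that are not stated in this thread: each byte of the input is one ISO-8859-1 (Latin-1) character, and control characters other than tab, LF and CR are dropped, since XML 1.0 does not allow them even as numeric references. Under the Latin-1 assumption the pound sign (byte 0xA3) becomes &#xA3;. The euro sign is U+20AC and has no single Latin-1 byte, so it would need a wider input encoding than this sketch handles. The name escapeXmlLatin1 is hypothetical.

```cpp
#include <string>

// Sketch: escapes a Latin-1 byte string for use in XML text or attributes.
std::string escapeXmlLatin1(const std::string& src)
{
    static const char kHex[] = "0123456789ABCDEF";

    std::string out;
    out.reserve(src.size() + src.size() / 8);   // avoid repeated reallocation

    for (std::string::size_type i = 0; i < src.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(src[i]);
        switch (c) {
        case '&':  out += "&amp;";  break;
        case '<':  out += "&lt;";   break;
        case '>':  out += "&gt;";   break;
        case '"':  out += "&quot;"; break;
        case '\'': out += "&apos;"; break;
        default:
            if (c == '\t' || c == '\n' || c == '\r' || c >= 0x7F) {
                // Numeric character reference in hex, e.g. 0xA3 -> &#xA3;
                out += "&#x";
                if (c >> 4) out += kHex[c >> 4];
                out += kHex[c & 0x0F];
                out += ';';
            } else if (c < 0x20) {
                // Assumption: silently drop controls XML 1.0 cannot represent.
            } else {
                out += static_cast<char>(c);
            }
        }
    }
    return out;
}
```

Building into a std::string with reserve() is one way to limit the concatenation cost raised earlier in the thread; whether dropping control characters is acceptable depends on the application.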
0
 
LVL 49

Expert Comment

by:DanRollins
ID: 9043613
Thanks for the improvement -- I think.  

While it's true that a signed char > 127 would be as rare as hen's teeth, my routine would still encode it, because it would then be less than ' ' (which is 32).  And IMHO iterators were invented just so STL purists could have something to talk about at hackerz conventions and such
:-)

-- Dan
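
To make the signed-char point concrete, here is a small sketch with two hypothetical helpers (neither name appears in this thread). Whether plain char is signed is implementation-defined, so the first form relies on bytes >= 128 showing up as negative values on signed-char platforms, while the second converts to unsigned char first and does not depend on that.

```cpp
// Test on plain char: on a signed-char platform, bytes >= 128 are negative
// and are caught by `ch < ' '`; on an unsigned-char platform they are
// caught by `ch > 127` instead.
bool needsEncodingPlainChar(char ch)
{
    return ch < ' ' || ch > 127;
}

// Same decision made on the byte value itself, independent of char signedness.
bool needsEncodingByte(char ch)
{
    const unsigned char c = static_cast<unsigned char>(ch);
    return c < 32 || c > 127;
}
```

Both functions are meant to flag the same bytes; the second simply states the intent without relying on how the compiler treats char.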
0
 

Expert Comment

by:matsondawson
ID: 9044889
Cool,
You're right, make it a signed char and remove the c>127.
I admit I'm a little uncomfortable with operators in C++, because I feel they hide the internal workings of the whole system.
But sometimes you just have to let it all go and join the dark side : )

Cheers,
Matt
0


Question has a verified solution.
