Encode Special Characters in XML

Hello, I'm trying to write a method that will encode a string (which may contain special characters) so that it is valid for XML.

Here is what I have so far. It works well for the most part, but some of the text being cleaned up for the XML still doesn't come through correctly.

Is there a better way to do this?

Thanks
CPG
public static string CleanProductText(string sProductTextToClean) {
         string sDescription = HtmlRemoval.StripTagsCharArray(sProductTextToClean);
         sDescription = sDescription.Replace("&#8226;", "").Replace("\r\n", "").Replace("&nbsp;","").Replace("•" ,"");
         if (sDescription.Length > 250) {
            sDescription = sDescription.Substring(0, 250);
         }
         return Common.RemoveDiacritics(Common.RemoveSpecialCharacters(sDescription));
      }

/// <summary>
   /// Remove HTML tags from string using char array.
   /// </summary>
   public static string StripTagsCharArray(string source) {
      char[] array = new char[source.Length];
      int arrayIndex = 0;
      bool inside = false;

      for (int i = 0; i < source.Length; i++) {
         char let = source[i];
         if (let == '<') {
            inside = true;
            continue;
         }
         if (let == '>') {
            inside = false;
            continue;
         }
         if (!inside) {
            array[arrayIndex] = let;
            arrayIndex++;
         }
      }
      return new string(array, 0, arrayIndex);
   }


public static string RemoveSpecialCharacters(string dirty) {
      // Swap characters the feed rejects for numeric character references
      // or plain-ASCII equivalents. String.Replace returns the input
      // unchanged when nothing matches, so no IndexOf check is needed.
      return dirty
         .Replace("®", "&#174;")
         .Replace("µ", "&#181;")
         .Replace("°", "&#176;")
         .Replace("±", "&#177;")
         .Replace("ï", "i")
         .Replace("©", "&#169;")
         .Replace("¾", "3/4")
         .Replace("½", "1/2");
   }
public static string RemoveDiacritics(String s) {
      string normalizedString = s.Normalize(NormalizationForm.FormD);
      StringBuilder stringBuilder = new StringBuilder();

      for (int i = 0; i < normalizedString.Length; i++) {
         Char c = normalizedString[i];
         if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            stringBuilder.Append(c);
      }
      return stringBuilder.ToString();
   }
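
The one-Replace-per-character list in RemoveSpecialCharacters can be generalized. Below is a minimal sketch, not the code from this post: it turns every non-ASCII character into a numeric character reference, so new symbols do not need their own Replace call. The names NumericRefSketch and EncodeNonAscii are made up for illustration, and the input is assumed to be well-formed UTF-16 (no unpaired surrogates). It does not escape &, < or >.

```csharp
using System;
using System.Text;

public static class NumericRefSketch
{
    // Hypothetical helper: replaces each non-ASCII character with &#N;
    // where N is the decimal Unicode code point.
    public static string EncodeNonAscii(string input)
    {
        var sb = new StringBuilder(input.Length);
        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            if (c < 128)
            {
                // Plain ASCII is copied through unchanged.
                sb.Append(c);
            }
            else if (char.IsSurrogatePair(input, i))
            {
                // Characters outside the BMP occupy two chars; emit one reference.
                sb.Append("&#").Append(char.ConvertToUtf32(input, i)).Append(';');
                i++;
            }
            else
            {
                sb.Append("&#").Append((int)c).Append(';');
            }
        }
        return sb.ToString();
    }
}
```

One caveat: inside a CDATA section an XML parser treats a reference such as &#176; as literal text, so whether it is turned back into the character depends on whoever consumes the feed.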

copyPasteGhostAsked:
 
robastaCommented:
Do you really have to remove the characters? You can use CDATA to have the XML parser ignore them (and still be valid XML). http://www.w3schools.com/xml/xml_cdata.asp

C# how to : http://www.discussweb.com/c-programming/2041-how-write-contents-cdata-xml-using-c.html
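
For reference, a minimal sketch of the CDATA approach in C#, assuming the feed is built with System.Xml.XmlWriter (the thread does not say how the file is actually generated). CDataSketch and WriteDescription are made-up names for illustration.

```csharp
using System;
using System.Text;
using System.Xml;

public static class CDataSketch
{
    // Hypothetical helper: wraps the text in a CDATA section inside
    // a <description> element and returns the resulting markup.
    public static string WriteDescription(string text)
    {
        // Omit the XML declaration so only the element itself is returned.
        var settings = new XmlWriterSettings { OmitXmlDeclaration = true };
        var sb = new StringBuilder();
        using (XmlWriter writer = XmlWriter.Create(sb, settings))
        {
            writer.WriteStartElement("description");
            writer.WriteCData(text);   // markup characters such as & and < are not escaped here
            writer.WriteEndElement();
        }
        return sb.ToString();
    }
}
```

Calling CDataSketch.WriteDescription("Tom & Jerry <b>") should give <description><![CDATA[Tom & Jerry <b>]]></description>. Note that CDATA only stops the parser from reading markup; it does not change which characters or encoding the receiving side accepts.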


 
copyPasteGhostAuthor Commented:
They are in CDATA sections.

The party we are uploading the XML to is very strict.

Good idea though.
 
StealthyDevCommented:
Wherever you store the text, you need to encode it first before writing it.

When reading it back, decode it properly.
 
copyPasteGhostAuthor Commented:
@senthurpandian - I'm using this to put my products on Google. I will not be doing the decoding. Is your solution still valid?
 
StealthyDevCommented:
No, then you cannot do it that way.

Where in Google are you putting it? Google itself should specify an encoding, then. Try their documentation.

Or you need to skip that particular special character. :-/
 
copyPasteGhostAuthor Commented:
That's the messed-up part. I am giving them this:

<item>
      <title><![CDATA[Active Robots 2.5GHz Folding Antenna w/SMA Connector]]></title>
      <link><![CDATA[http://www.myDomain.com/active-robots-2-5-ghz.html]]></link>
      <description><![CDATA[ High quality 2.4~2.5GHz antenna Adjustable elbow Suitable for transmitter, receiver and transceiver applications. The Active Robots 2.5GHz Folding Antenna w/SMA Connector is a high quality 2.4~2.5GHz antenna suitable for transmitter, receiver and tran]]></description>
      <g:price>7.87</g:price>
      <g:image_link><![CDATA[http://www.myDomain.com/big/en/active-robots-2-5-ghz.jpg]]></g:image_link>
      <g:id><![CDATA[RB-Act-13]]></g:id>
      <g:payment_accepted>Cash</g:payment_accepted>
      <g:payment_accepted>Visa</g:payment_accepted>
      <g:payment_accepted>Amex</g:payment_accepted>
      <g:payment_accepted>Paypal</g:payment_accepted>
      <g:payment_accepted>Mastercard</g:payment_accepted>
      <g:brand>Robots Ltd.</g:brand>
      <g:condition>New</g:condition>
      <g:manufacturer>Robots Ltd.</g:manufacturer>
      <g:mpn><![CDATA[ANT-2.5G]]></g:mpn>
      <g:product_type><![CDATA[Antennas]]></g:product_type>
    </item>

And when they parse it, it fails and says I sent this:

High quality 2.4~2.5GHz antenna Adjustable elbow Suitable for transmitter, receiver and transceiver applicationsThe Active Robots 2.5GHz Folding Antenna w/SMA Connector is a high quality 2.4~2.5GHz antenna suitable for transmitter, receiver and tran

Any idea?
 
copyPasteGhostAuthor Commented:
And when they parse it, it fails and says I sent this: *

Note the extra character after the 2.4~2.5GHz, which doesn't appear in my source...
 
StealthyDevCommented:
Sorry dude, I can't find anything in your previous post.

If you can tell me where in Google you are uploading this, I can help you.

Besides, I can tell you that you will be given something like this:

http://code.google.com/p/html-entities/
 
copyPasteGhostAuthor Commented:
We are uploading to the Google Merchant Center.

http://www.google.ca/search?q=merchant+center&rls=com.microsoft:en-us&ie=UTF-8&oe=UTF-8&startIndex=&startPage=1&redir_esc=&ei=VozMS-OFKcT6lwfRyb2PBg

Check it out and let me know what you find.

Thanks.
 
StealthyDevCommented:
Is this the one you are looking for?

http://www.google.com/support/merchants/bin/answer.py?hl=en&answer=160079

UTF-8 Encoder:
http://java.sun.com/docs/books/tutorial/i18n/text/string.html

Or you can use your own packages.

Best regards.

 
copyPasteGhostAuthor Commented:
OK, strings in .NET are encoded in UTF-16 by default.

I have the XML file set to UTF-8 encoding. That might be the problem...

I've tried something like this:

static public string EncodeToUTF8(string toEncode) {
      UTF8Encoding encoding = new UTF8Encoding();
      byte[] postBytes = encoding.GetBytes(toEncode);
      return encoding.GetString(postBytes);
   }

I'm testing it now. I'll let you know how it works..
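
A side note on the round trip above, as a hedged sketch rather than a fix for the problem in this thread: for well-formed text, encoding a .NET string to UTF-8 bytes and decoding it straight back returns an equal string, so a method like EncodeToUTF8 does not change what ends up in the file. The encoding is applied when the text is written out as bytes. Utf8Sketch, RoundTrip and WriteUtf8 are made-up names, and the use of XmlWriter is an assumption since the thread does not show how the feed file is produced.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class Utf8Sketch
{
    // Same string -> bytes -> string round trip as EncodeToUTF8 above.
    public static string RoundTrip(string s)
    {
        var encoding = new UTF8Encoding();
        return encoding.GetString(encoding.GetBytes(s));
    }

    // Hypothetical helper: serializes one element as UTF-8 bytes.
    // The encoding is chosen on the writer, not on the string.
    public static byte[] WriteUtf8(string text)
    {
        var settings = new XmlWriterSettings
        {
            Encoding = new UTF8Encoding(false)   // false = no byte order mark
        };
        using (var stream = new MemoryStream())
        {
            using (XmlWriter writer = XmlWriter.Create(stream, settings))
            {
                writer.WriteElementString("description", text);
            }
            return stream.ToArray();
        }
    }
}
```

When writing to disk, the same XmlWriterSettings can be passed to XmlWriter.Create together with a file path or a FileStream.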
 
copyPasteGhostAuthor Commented:
OK, it's still not working!

Here is the Super Cleaner Method

It appears that all my errors are about an extra character that appears when Google parses my file. The character is not there when the file is sent, and it magically appears on their side...

Any ideas? I know this is a tough one...
public static string CleanProductText(string sProductTextToClean) {
         string sDescription = HtmlRemoval.StripTagsCharArray(sProductTextToClean);
         sDescription = sDescription.Replace("&#8226;", "").Replace("\r\n", "").Replace("&nbsp;","").Replace("•" ,"");
         if (sDescription.Length > 250) {
            sDescription = sDescription.Substring(0, 250);
         }
         return Common.EncodeToUTF8(Common.RemoveDiacritics(Common.RemoveSpecialCharacters(sDescription)));
      }

 
StealthyDevCommented:
Write the same XML into a file before sending to Google.

Please attach the same file here.
 
copyPasteGhostAuthor Commented:
Turns out it was a bug on Google's side... go figure!

Thanks anyways!