Encode Special Characters in XML

Hello, I'm trying to write a method that will encode my string (which may contain special characters) in a format that is valid for XML.

Here is what I have so far. It mostly works, but some of the text being cleaned up for the XML is still not coming through correctly.

Is there a better way to do this?

Thanks
CPG
public static string CleanProductText(string sProductTextToClean) {
         string sDescription = HtmlRemoval.StripTagsCharArray(sProductTextToClean);
         sDescription = sDescription.Replace("&#8226;", "").Replace("\r\n", "").Replace("&nbsp;","").Replace("•" ,"");
         if (sDescription.Length > 250) {
            sDescription = sDescription.Substring(0, 250);
         }
         return Common.RemoveDiacritics(Common.RemoveSpecialCharacters(sDescription));
      }

/// <summary>
   /// Remove HTML tags from string using char array.
   /// </summary>
   public static string StripTagsCharArray(string source) {
      char[] array = new char[source.Length];
      int arrayIndex = 0;
      bool inside = false;

      for (int i = 0; i < source.Length; i++) {
         char let = source[i];
         if (let == '<') {
            inside = true;
            continue;
         }
         if (let == '>') {
            inside = false;
            continue;
         }
         if (!inside) {
            array[arrayIndex] = let;
            arrayIndex++;
         }
      }
      return new string(array, 0, arrayIndex);
   }


public static string RemoveSpecialCharacters(string dirty) {
      //° =  "&#176;" 
      //® = "&#174;" 
      //± = "&#177;"
      //i with ¨ = "&#239;" 
      //© = "&#169;"
      //¾ = "&#190;" 
      if(dirty.IndexOf('®') != -1) {
         dirty = dirty.Replace("®", "&#174;");
      }
      if(dirty.IndexOf('µ') != -1) {
         dirty = dirty.Replace("µ", "&#181;");
      }
      if (dirty.IndexOf('°') != -1) {
         dirty = dirty.Replace("°", "&#176;");
      }
      if (dirty.IndexOf('±') != -1) {
         dirty = dirty.Replace("±", "&#177;");
      }
      if (dirty.IndexOf('ï') != -1) {
         dirty = dirty.Replace("ï", "i");
      }
      if (dirty.IndexOf('©') != -1) {
         dirty = dirty.Replace("©", "&#169;");
      }
      if (dirty.IndexOf('¾') != -1) {
         dirty = dirty.Replace("¾", "3/4");
      }
      if (dirty.IndexOf('½') != -1) {
         dirty = dirty.Replace("½", "1/2");
      }
      return dirty;
   }
// Requires: using System.Globalization; using System.Text;
public static string RemoveDiacritics(String s) {
      string normalizedString = s.Normalize(NormalizationForm.FormD);
      StringBuilder stringBuilder = new StringBuilder();

      for (int i = 0; i < normalizedString.Length; i++) {
         Char c = normalizedString[i];
         if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            stringBuilder.Append(c);
      }
      return stringBuilder.ToString();
   }

copyPasteGhost Asked:
robasta Commented:
Do you really have to remove the characters? You can use CDATA to have the XML parser ignore them (and still be valid XML). http://www.w3schools.com/xml/xml_cdata.asp

C# how to : http://www.discussweb.com/c-programming/2041-how-write-contents-cdata-xml-using-c.html
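
For readers following the CDATA suggestion, here is a minimal sketch of writing a CDATA section with the standard XmlWriter class. The class name CDataSketch and the element name "description" are placeholders for illustration, not something from this thread. Note that CDATA only avoids escaping markup characters such as & and <; non-ASCII characters are still subject to whatever encoding the file is written in.

```csharp
using System.Text;
using System.Xml;

public static class CDataSketch {
   // Wraps the given text in <description><![CDATA[...]]></description>
   // and returns the resulting XML fragment as a string.
   // The element name "description" is only an example.
   public static string WriteDescription(string text) {
      StringBuilder sb = new StringBuilder();
      XmlWriterSettings settings = new XmlWriterSettings();
      settings.OmitXmlDeclaration = true;   // keep the output to the element itself

      using (XmlWriter writer = XmlWriter.Create(sb, settings)) {
         writer.WriteStartElement("description");
         writer.WriteCData(text);           // markup characters are written as-is
         writer.WriteEndElement();
      }
      return sb.ToString();
   }
}
```

Calling CDataSketch.WriteDescription("Fish & Chips <b>") should give <description><![CDATA[Fish & Chips <b>]]></description>, with the ampersand and angle brackets left untouched.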


copyPasteGhost (Author) Commented:
They are in CDATA sections.

The party we are uploading the XML to is very strict.

Good idea though.
StealthyDev Commented:
Wherever you write the text, you need to encode it first.

When reading it back, decode it properly.
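
As a rough illustration of "encode first" for text that is not inside a CDATA section, the sketch below uses SecurityElement.Escape from the .NET framework, which replaces the five XML markup characters (<, >, ", ' and &) with their entity forms. The wrapper class and method names are made up for this example. If the feed is built with XmlWriter, WriteString performs this kind of escaping on its own, so a separate call would not be needed there.

```csharp
using System.Security;

public static class EscapeSketch {
   // Escapes the XML markup characters in the given text so that it can be
   // placed in element content or an attribute value.
   // Accented letters and symbols such as ® are left unchanged.
   public static string EscapeForXml(string text) {
      return SecurityElement.Escape(text);
   }
}
```

For example, EscapeSketch.EscapeForXml("Fish & Chips <b>") returns "Fish &amp; Chips &lt;b&gt;". Any standard XML parser on the receiving side turns the entities back into the original characters, so no custom decoding step is involved.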
copyPasteGhost (Author) Commented:
@senthurpandian - I'm using this to put my products on Google. I will not be doing the decoding... is your solution still valid?
StealthyDev Commented:
No, then you cannot do it that way.

Where in Google are you putting this? Google itself should specify an encoding then. Just try their documentation.

Or you need to skip that particular special character. :-/
copyPasteGhost (Author) Commented:
That's the messed-up part. I am giving them this:

<item>
      <title><![CDATA[Active Robots 2.5GHz Folding Antenna w/SMA Connector]]></title>
      <link><![CDATA[http://www.myDomain.com/active-robots-2-5-ghz.html]]></link>
      <description><![CDATA[ High quality 2.4~2.5GHz antenna Adjustable elbow Suitable for transmitter, receiver and transceiver applications. The Active Robots 2.5GHz Folding Antenna w/SMA Connector is a high quality 2.4~2.5GHz antenna suitable for transmitter, receiver and tran]]></description>
      <g:price>7.87</g:price>
      <g:image_link><![CDATA[http://www.myDomain.com/big/en/active-robots-2-5-ghz.jpg]]></g:image_link>
      <g:id><![CDATA[RB-Act-13]]></g:id>
      <g:payment_accepted>Cash</g:payment_accepted>
      <g:payment_accepted>Visa</g:payment_accepted>
      <g:payment_accepted>Amex</g:payment_accepted>
      <g:payment_accepted>Paypal</g:payment_accepted>
      <g:payment_accepted>Mastercard</g:payment_accepted>
      <g:brand>Robots Ltd.</g:brand>
      <g:condition>New</g:condition>
      <g:manufacturer>Robots Ltd.</g:manufacturer>
      <g:mpn><![CDATA[ANT-2.5G]]></g:mpn>
      <g:product_type><![CDATA[Antennas]]></g:product_type>
    </item>

And when they parse it, it fails and says I sent this:

High quality 2.4~2.5GHz antenna Adjustable elbow Suitable for transmitter, receiver and transceiver applicationsThe Active Robots 2.5GHz Folding Antenna w/SMA Connector is a high quality 2.4~2.5GHz antenna suitable for transmitter, receiver and tran

Any idea?
copyPasteGhost (Author) Commented:
And when they parse it, it fails and says I sent this: *

Note the stray character after the 2.4~2.5GHz, which doesn't appear in my source...
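
One way to check whether the source really is free of unexpected characters is to list every non-ASCII character in the string together with its code point. The sketch below does that; the class and method names are hypothetical. A non-breaking space (U+00A0) is one common example of a character that looks like an ordinary space in an editor, though nothing in this thread confirms that it was the character involved here.

```csharp
using System.Collections.Generic;
using System.Globalization;

public static class CharDump {
   // Returns one entry per non-ASCII character in the text,
   // formatted like "index 3: U+00A0".
   public static List<string> FindNonAscii(string text) {
      List<string> found = new List<string>();
      for (int i = 0; i < text.Length; i++) {
         char c = text[i];
         if (c > 127) {
            found.Add(string.Format(CultureInfo.InvariantCulture,
                                    "index {0}: U+{1:X4}", i, (int)c));
         }
      }
      return found;
   }
}
```

Running CharDump.FindNonAscii("2.4\u00A0GHz") returns a single entry, "index 3: U+00A0". An empty list means the string contains only plain ASCII.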
StealthyDev Commented:
Sorry, I can't find anything in your previous post.

If you can tell me where you are uploading this to Google, I can help you.

Besides, I would guess you will be given something like this:

http://code.google.com/p/html-entities/
copyPasteGhost (Author) Commented:
We are uploading to the Google Merchant Center.

http://www.google.ca/search?q=merchant+center&rls=com.microsoft:en-us&ie=UTF-8&oe=UTF-8&startIndex=&startPage=1&redir_esc=&ei=VozMS-OFKcT6lwfRyb2PBg

Check it out and let me know what you can find.

Thanks.
StealthyDev Commented:
Is this the one you are looking for?

http://www.google.com/support/merchants/bin/answer.py?hl=en&answer=160079

UTF-8 Encoder:
http://java.sun.com/docs/books/tutorial/i18n/text/string.html

Or you can use your own packages.

Best regards.

copyPasteGhost (Author) Commented:
OK, strings in .NET are encoded in UTF-16 by default.

I have the XML file set to the UTF-8 encoding. That might be the problem...

I've tried something like this:

static public string EncodeToUTF8(string toEncode) {
      UTF8Encoding encoding = new UTF8Encoding();
      byte[] postBytes = encoding.GetBytes(toEncode);
      return encoding.GetString(postBytes);
   }

I'm testing it now. I'll let you know how it works..
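
A note on the round trip above: a .NET string is always held as UTF-16 in memory, so converting it to UTF-8 bytes and straight back yields an equal string for well-formed text. The encoding only takes effect at the point where the string is turned into bytes, for example when the file is written. The sketch below illustrates both points; the class and method names are invented for the example, and writing without a byte order mark is an assumption, not a documented requirement of the receiving side.

```csharp
using System.IO;
using System.Text;

public static class Utf8Sketch {
   // True when encoding to UTF-8 and decoding again gives back an equal string,
   // which is the case for any well-formed text.
   public static bool RoundTripIsNoOp(string text) {
      UTF8Encoding encoding = new UTF8Encoding();
      return encoding.GetString(encoding.GetBytes(text)) == text;
   }

   // Returns the UTF-8 bytes that would actually end up in the file.
   public static byte[] ToUtf8Bytes(string text) {
      return new UTF8Encoding(false).GetBytes(text);
   }

   // Writes the text to disk as UTF-8 without a byte order mark (assumed here).
   public static void WriteUtf8(string path, string text) {
      File.WriteAllText(path, text, new UTF8Encoding(false));
   }
}
```

For instance, Utf8Sketch.ToUtf8Bytes("\u00A0") yields the two bytes 0xC2 0xA0, while the in-memory string is still the single character U+00A0. If the XML declaration says UTF-8, the bytes on disk need to be produced with a UTF-8 encoder like this for the two to agree.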
copyPasteGhost (Author) Commented:
OK, it's still not working!

Here is the Super Cleaner method.

It appears that all my errors are about a stray character that appears when Google parses my file. The character is not there when the file is sent, and it magically appears on their side...

Any ideas? I know this is a tough one...
public static string CleanProductText(string sProductTextToClean) {
         string sDescription = HtmlRemoval.StripTagsCharArray(sProductTextToClean);
         sDescription = sDescription.Replace("&#8226;", "").Replace("\r\n", "").Replace("&nbsp;","").Replace("•" ,"");
         if (sDescription.Length > 250) {
            sDescription = sDescription.Substring(0, 250);
         }
         return Common.EncodeToUTF8(Common.RemoveDiacritics(Common.RemoveSpecialCharacters(sDescription)));
      }

StealthyDev Commented:
Write the same XML to a file before sending it to Google.

Please attach that file here.
copyPasteGhost (Author) Commented:
Turns out it was a bug on Google's side... go figure!

Thanks anyways!