• Status: Solved

Working with Unicode Characters in XML

I am working with a program called EnCase 7 that has a scripting engine.  From what I can tell, the scripting engine is a barebones implementation of C++.  I am trying to write out an XML file from EnCase to load into C#.  This XML file contains data from EnCase that I want to move into .NET so I can do additional processing.  EnCase does not have a component that creates XML files, so I am using its standard file writer to create them.  All the text files I am working with are UTF-16.
Most of the XML files that I create load into .NET with no issues.  However, I am having issues with characters that .NET does not like, for example &, <, > and the Unicode character \x01.  This morning I found two more: þ and &nbsp;.  Below is a function I created to replace these characters with their correct HTML replacement.

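  String uft16CleanUp(String cleanString)
  {
    // Replace characters that have no XML escape with a plain space.
    cleanString.Replace("\x01", " ", 0, -1);
    cleanString.Replace("&nbsp;", " ", 0, -1);
    // Escape & before < and >; otherwise the & in "&lt;" and "&gt;"
    // would itself be escaped again, producing "&amp;lt;" and "&amp;gt;".
    cleanString.Replace("&", "&amp;", 0, -1);
    cleanString.Replace("<", "&lt;", 0, -1);
    cleanString.Replace(">", "&gt;", 0, -1);
    return cleanString;
  }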

Does anyone have a better suggestion to go about this?  I am currently “fishing” for characters and taking up a lot of time.  I want to know if there is a predefined list of characters that XML needs converted or if there is a Unicode range that I should automatically convert.
Any help would be greatly appreciated.
Asked by: rye004

anarki_jimbel Commented:
There are built-in methods that help you escape characters.  I don't want to repeat what the link says, so please read:
http://weblogs.sqlteam.com/mladenp/archive/2008/10/21/Different-ways-how-to-escape-an-XML-string-in-C.aspx

Hope it helps.

rye004 (Author) Commented:
Thank you for sending this to me.  Unfortunately I need to do this in C++ and not C#.  The scripting engine that I am using in EnCase is a limited implementation of C++, so it does not have the additional features that .NET has.

anarki_jimbel Commented:
Hmm... OK.

Basically, there are only five characters to escape:

"   &quot;
'   &apos;
<   &lt;
>   &gt;
&   &amp;

Your code is missing the first two.

However, this list does not take any Unicode characters into account...
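
For the C++ side, one way to stop fishing for individual characters is to combine the five escapes above with the character range that the XML 1.0 specification itself allows (its "Char" production: tab, line feed, carriage return, U+0020 to U+D7FF, U+E000 to U+FFFD, and U+10000 to U+10FFFF). The following is only a minimal sketch in standard C++, not EnScript: the names `isValidXmlChar` and `escapeXml` are made up for illustration, it assumes the text is held as UTF-16 in a `std::u16string`, and it would need adapting to whatever string type EnCase provides. Replacing forbidden characters with a space is an assumption carried over from the function in the question.

```cpp
#include <string>

// True if the code point is allowed in an XML 1.0 document
// (the "Char" production of the XML 1.0 specification).
bool isValidXmlChar(char32_t cp) {
    return cp == 0x9 || cp == 0xA || cp == 0xD ||
           (cp >= 0x20 && cp <= 0xD7FF) ||
           (cp >= 0xE000 && cp <= 0xFFFD) ||
           (cp >= 0x10000 && cp <= 0x10FFFF);
}

// Hypothetical helper: escapes the five predefined XML entities and
// replaces characters that XML 1.0 forbids with a space.
std::u16string escapeXml(const std::u16string& in) {
    std::u16string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        char16_t c = in[i];
        // A well-formed surrogate pair encodes a code point at or above
        // U+10000, which XML allows: copy both code units unchanged.
        if (c >= 0xD800 && c <= 0xDBFF && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            out += c;
            out += in[++i];
            continue;
        }
        switch (c) {
            case u'&':  out += u"&amp;";  break;
            case u'<':  out += u"&lt;";   break;
            case u'>':  out += u"&gt;";   break;
            case u'"':  out += u"&quot;"; break;
            case u'\'': out += u"&apos;"; break;
            default:
                // Lone surrogates and control characters such as \x01
                // fail the check and become a space.
                out += isValidXmlChar(c) ? c : u' ';
        }
    }
    return out;
}
```

Because the input is scanned once, the "&" inside a freshly written entity is never escaped a second time. Under this rule þ (U+00FE) is a legal XML character and passes through unchanged.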

rye004 (Author) Commented:
My problem with doing that is that I keep finding additional characters (like þ) that need to be encoded.
As a test, I would like to write some kind of foreach loop over every possible Unicode character and check whether each one needs to be encoded.  If it does, I can add it to a list of characters that I need to handle.
Do you know how I would make a foreach loop through all possible Unicode characters in C#?
Hopefully this makes sense.

anarki_jimbel Commented:
I will use a solution from http://stackoverflow.com/questions/1668571/how-to-generate-all-the-characters-in-the-utf-8-charset-in-net.
I have modified it a bit to read the data from a file on the hard disk (see attached). The file was downloaded from http://unicode.org/Public/UNIDATA/UnicodeData.txt.

Below is just a button click handler for a simple form. The application prints to an output window.

code point = character code

        private void button1_Click(object sender, EventArgs e)
        {
            // Each line of UnicodeData.txt starts with a code point in hex, followed by ';'.
            // The using block closes the file when the loop is done.
            using (System.IO.StreamReader reader = new System.IO.StreamReader(@"C:\Test\UnicodeData.txt"))
            {
                System.Text.UTF8Encoding encoder = new System.Text.UTF8Encoding();
                while (true)
                {
                    string line = reader.ReadLine();
                    if (line == null) break;
                    int codePoint = Convert.ToInt32(line.Substring(0, line.IndexOf(";")), 16);
                    if (codePoint >= 0xD800 && codePoint <= 0xDFFF)
                    {
                        //surrogate boundary; not valid codePoint, but listed in the document
                    }
                    else
                    {
                        string utf16 = char.ConvertFromUtf32(codePoint);
                        byte[] utf8 = encoder.GetBytes(utf16);
                        //TODO: something with the UTF-8-encoded character

                        System.Diagnostics.Debug.WriteLine("'" + utf16 + "'");
                    }
                }
            }
            System.Diagnostics.Debug.WriteLine("Finished");
        }

Or print just the code points in hex form:

                else
                {
                   System.Diagnostics.Debug.WriteLine("'" + string.Format("{0:x4}",codePoint) + "'");
                }
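
The same kind of loop can also be run against the XML 1.0 rule directly rather than against UnicodeData.txt, which gives the list of ranges that can never appear in an XML 1.0 document. This is only a sketch in standard C++ under that assumption; `allowedInXml10` and `forbiddenXmlRanges` are hypothetical names, and the check is the "Char" production from the XML 1.0 specification, not anything specific to .NET or EnCase.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// XML 1.0 "Char" production: the code points a document may contain.
static bool allowedInXml10(std::uint32_t cp) {
    return cp == 0x9 || cp == 0xA || cp == 0xD ||
           (cp >= 0x20 && cp <= 0xD7FF) ||
           (cp >= 0xE000 && cp <= 0xFFFD) ||
           (cp >= 0x10000 && cp <= 0x10FFFF);
}

// Walks every code point from U+0000 to U+10FFFF and collects the
// contiguous ranges (inclusive) that XML 1.0 does not allow.
std::vector<std::pair<std::uint32_t, std::uint32_t>> forbiddenXmlRanges() {
    std::vector<std::pair<std::uint32_t, std::uint32_t>> ranges;
    bool inRange = false;
    std::uint32_t start = 0;
    for (std::uint32_t cp = 0; cp <= 0x10FFFF; ++cp) {
        if (!allowedInXml10(cp)) {
            // Open a new forbidden range at the first disallowed code point.
            if (!inRange) { start = cp; inRange = true; }
        } else if (inRange) {
            // The previous code point was the last disallowed one.
            ranges.emplace_back(start, cp - 1);
            inRange = false;
        }
    }
    if (inRange) ranges.emplace_back(start, 0x10FFFF);
    return ranges;
}
```

The result is five ranges: U+0000 to U+0008, U+000B to U+000C, U+000E to U+001F, U+D800 to U+DFFF, and U+FFFE to U+FFFF. Everything else only needs the five entity escapes.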

UnicodeData.txt