

Working with Unicode Characters in XML

Posted on 2012-09-11
Medium Priority
Last Modified: 2012-10-29
I am working with a program called EnCase 7 that has a scripting engine.  From what I can tell, the scripting engine is a barebones implementation of C++.  I am trying to write out an XML file from EnCase to load into C#.  This XML file holds data from EnCase that I want to move into .Net so I can do additional processing.  EnCase does not have a component that creates XML files, so I am using its standard file writer to create them.  All the text files I am working with are UTF-16.
Most of the XML files that I create load into .Net with no issues.  However, I am having issues with characters that .Net does not like, for example &, <, > and the Unicode character \x01.  This morning I found two more: þ and a non-breaking space.  Below is a function I created to replace these characters with their correct XML entity replacements.

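  String uft16CleanUp(String cleanString)
  {
    // \x01 is a control character that XML 1.0 does not allow, so swap it for a space.
    cleanString.Replace("\x01", " ", 0, -1);
    cleanString.Replace("&nbsp;", " ", 0, -1);
    // Escape "&" before "<" and ">"; otherwise the "&" in "&lt;" and "&gt;" is escaped a second time.
    cleanString.Replace("&", "&amp;", 0, -1);
    cleanString.Replace("<", "&lt;", 0, -1);
    cleanString.Replace(">", "&gt;", 0, -1);
    return cleanString;
  }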

Does anyone have a better suggestion for how to go about this?  I am currently “fishing” for characters, which takes up a lot of time.  I want to know if there is a predefined list of characters that XML needs converted, or if there is a Unicode range that I should automatically convert.
Any help would be greatly appreciated.
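
On the question of a predefined range: the XML 1.0 specification does define one. Its Char production allows only U+0009, U+000A, U+000D, U+0020 to U+D7FF, U+E000 to U+FFFD and U+10000 to U+10FFFF. Anything else, such as \x01, cannot appear in an XML 1.0 document at all, so it has to be dropped or replaced rather than escaped. The following is a minimal sketch in standard C++ (not EnScript), working on UTF-16 code units; the function names isXml10Char and replaceIllegalXmlChars are hypothetical, and replacing with a space is just one possible policy.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: true if the code point matches the XML 1.0 "Char" production.
bool isXml10Char(char32_t cp) {
    return cp == 0x9 || cp == 0xA || cp == 0xD ||
           (cp >= 0x20 && cp <= 0xD7FF) ||
           (cp >= 0xE000 && cp <= 0xFFFD) ||
           (cp >= 0x10000 && cp <= 0x10FFFF);
}

// Hypothetical helper: replace every UTF-16 code unit that XML 1.0 forbids with a space.
std::u16string replaceIllegalXmlChars(const std::u16string& in) {
    std::u16string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        char16_t u = in[i];
        bool highSurrogate = u >= 0xD800 && u <= 0xDBFF;
        if (highSurrogate && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // A well-formed surrogate pair encodes U+10000..U+10FFFF, which is legal.
            out.push_back(u);
            out.push_back(in[++i]);
        } else if (isXml10Char(u)) {
            out.push_back(u);
        } else {
            // Control characters, lone surrogates, U+FFFE and U+FFFF end up here.
            out.push_back(u' ');
        }
    }
    return out;
}
```

Note that þ (U+00FE) and the non-breaking space (U+00A0) both pass this check, so they are legal XML characters and should not need replacing by themselves.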
Question by: rye004

Expert Comment

ID: 38388812
There are built-in methods that help you to escape characters.  I don't want to repeat the link, please read:

Hope it helps.

Author Comment

ID: 38388886
Thank you for sending this to me.  Unfortunately I need to do this in C++ and not C#.  The scripting engine that I am using in EnCase uses a limited implementation of C++, therefore it does not have the additional features that .Net has.

Expert Comment

ID: 38389001
Hmm... OK.

Basically, there are only five characters to escape:

"   &quot;
'   &apos;
<   &lt;
>   &gt;
&   &amp;

Your code is missing the first two.

However, this list does not take any Unicode characters into account...
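
As an illustration of the five-entity list above, here is a minimal sketch in standard C++ (not EnScript) that escapes a UTF-16 string in a single pass. The function name escapeXmlText is hypothetical. Because each input character is examined exactly once, there is no ordering concern; with chained replace calls, by contrast, the ampersand has to be handled before the others or the "&" in "&lt;" gets escaped again.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: escape the five predefined XML entities in one pass.
std::u16string escapeXmlText(const std::u16string& in) {
    std::u16string out;
    out.reserve(in.size());
    for (char16_t u : in) {
        switch (u) {
            case u'&':  out += u"&amp;";  break;
            case u'<':  out += u"&lt;";   break;
            case u'>':  out += u"&gt;";   break;
            case u'"':  out += u"&quot;"; break;
            case u'\'': out += u"&apos;"; break;
            default:    out.push_back(u); break;  // everything else is copied unchanged
        }
    }
    return out;
}
```

Escaping the quote characters only matters inside attribute values, but doing it everywhere is harmless.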

Author Comment

ID: 38392836
My problem with doing that is I keep finding additional characters (like þ) that need to be encoded.
I am curious: as a test, I want to see if I can make some type of foreach loop through all possible Unicode characters and check whether each one needs to be encoded.  If it does, I can build a list of the characters that I will need to make the change for.
Do you know how I would make a foreach loop through all possible Unicode characters in C#?
Hopefully this makes sense.
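
The same experiment can also be sketched without a parser: walk every code point from U+0000 to U+10FFFF and test it against the XML 1.0 Char production (U+0009, U+000A, U+000D, U+0020 to U+D7FF, U+E000 to U+FFFD, U+10000 to U+10FFFF). The sketch below, in standard C++ with hypothetical function names, collects the ranges that are not allowed. Under that rule the disallowed set is small and fixed, and þ (U+00FE) is not in it, so a failure on þ is more likely an encoding mismatch between the file and its XML declaration than a character that needs escaping; that is an assumption, not something confirmed in this thread.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper: true if the code point matches the XML 1.0 "Char" production.
bool isXml10Char(std::uint32_t cp) {
    return cp == 0x9 || cp == 0xA || cp == 0xD ||
           (cp >= 0x20 && cp <= 0xD7FF) ||
           (cp >= 0xE000 && cp <= 0xFFFD) ||
           (cp >= 0x10000 && cp <= 0x10FFFF);
}

// Hypothetical helper: loop over all code points and return the inclusive
// ranges [first, last] that XML 1.0 does not allow.
std::vector<std::pair<std::uint32_t, std::uint32_t>> illegalXml10Ranges() {
    std::vector<std::pair<std::uint32_t, std::uint32_t>> ranges;
    bool open = false;
    std::uint32_t start = 0;
    for (std::uint32_t cp = 0; cp <= 0x10FFFF; ++cp) {
        if (!isXml10Char(cp)) {
            if (!open) { start = cp; open = true; }  // a new illegal run begins
        } else if (open) {
            ranges.emplace_back(start, cp - 1);      // the run ended at the previous code point
            open = false;
        }
    }
    if (open) ranges.emplace_back(start, std::uint32_t(0x10FFFF));
    return ranges;
}
```

The ranges returned are U+0000 to U+0008, U+000B to U+000C, U+000E to U+001F, U+D800 to U+DFFF and U+FFFE to U+FFFF.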

Accepted Solution

anarki_jimbel earned 1500 total points
ID: 38393326
I will use a solution from http://stackoverflow.com/questions/1668571/how-to-generate-all-the-characters-in-the-utf-8-charset-in-net .
I have modified it a bit to read the data from a file on the hard disk (see attached). The file was downloaded from http://unicode.org/Public/UNIDATA/UnicodeData.txt.

Below is just a button click handler for a simple form. The application prints to an output window.

code point = character code

        private void button1_Click(object sender, EventArgs e)
        {
            // Each line of UnicodeData.txt starts with a hex code point followed by ";".
            string definedCodePoints = System.IO.File.ReadAllText(@"C:\Test\UnicodeData.txt");
            System.IO.StringReader reader = new System.IO.StringReader(definedCodePoints);
            System.Text.UTF8Encoding encoder = new System.Text.UTF8Encoding();
            while (true)
            {
                string line = reader.ReadLine();
                if (line == null) break;
                int codePoint = Convert.ToInt32(line.Substring(0, line.IndexOf(";")), 16);
                if (codePoint >= 0xD800 && codePoint <= 0xDFFF)
                {
                    //surrogate boundary; not valid codePoint, but listed in the document
                }
                else
                {
                    string utf16 = char.ConvertFromUtf32(codePoint);
                    byte[] utf8 = encoder.GetBytes(utf16);
                    //TODO: something with the UTF-8-encoded character

                    System.Diagnostics.Debug.WriteLine("'" + utf16 + "'");
                }
            }
        }


Or print just the code points in hexadecimal form:

                   System.Diagnostics.Debug.WriteLine("'" + string.Format("{0:x4}",codePoint) + "'");


