Working with Unicode Characters in XML

Posted on 2012-09-11
Last Modified: 2012-10-29
I am working with a program called EnCase 7 that has a scripting engine.  As far as I can tell, the scripting engine is a bare-bones implementation of C++.  I am trying to write out an XML file from EnCase to load into C#.  The XML file holds data from EnCase that I want to move into .NET so I can do additional processing.  EnCase does not have a component that creates XML files, so I am building them with its standard file writer.  All the text files I am working with are UTF-16.
Most of the XML files that I create load into .NET with no issues.  However, I am having issues with characters that .NET does not like, for example &, <, > and the Unicode character \x01.  This morning I found two more: þ and &nbsp;.  Below is a function I created to replace these characters with their correct XML replacements.

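  String uft16CleanUp(String cleanString)
  {
    // Replace input that the XML parser rejects outright with a space.
    cleanString.Replace("\x01", " ", 0, -1);
    cleanString.Replace("&nbsp;", " ", 0, -1);
    // Escape "&" before "<" and ">"; otherwise the "&" in "&lt;" and "&gt;" is escaped a second time.
    cleanString.Replace("&", "&amp;", 0, -1);
    cleanString.Replace("<", "&lt;", 0, -1);
    cleanString.Replace(">", "&gt;", 0, -1);
    return cleanString;
  }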

Does anyone have a better way to go about this?  I am currently “fishing” for characters, which takes a lot of time.  I want to know if there is a predefined list of characters that XML needs converted, or a Unicode range that I should convert automatically.
Any help would be greatly appreciated.
Question by:rye004
5 Comments
 
LVL 30

Expert Comment

by:anarki_jimbel
ID: 38388812
There are built-in methods that help you escape characters.  Rather than repeat the article here, please read:
http://weblogs.sqlteam.com/mladenp/archive/2008/10/21/Different-ways-how-to-escape-an-XML-string-in-C.aspx

Hope it helps.
 

Author Comment

by:rye004
ID: 38388886
Thank you for sending this to me.  Unfortunately I need to do this in C++ and not C#.  The scripting engine that I am using in EnCase uses a limited implementation of C++, so it does not have the additional features that .NET has.
 
LVL 30

Expert Comment

by:anarki_jimbel
ID: 38389001
Hmm... OK.

Basically, there are only five characters to escape:

"   &quot;
'   &apos;
<   &lt;
>   &gt;
&   &amp;

Your code is missing the first two.

However, this list does not take any Unicode characters into account...
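For what it is worth, a single pass over the string can handle both the five characters above and the characters XML refuses outright. Below is a rough C++ sketch, not EnScript: `escapeForXml` is a made-up name and `std::u16string` stands in for EnCase's String type, so the calls would need adapting. It assumes the XML 1.0 Char ranges (#x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD, #x10000-#x10FFFF) and replaces anything outside them with a space. By those ranges þ (U+00FE) is a legal character, so if it still breaks the load, the encoding of the file may be worth checking; that part is a guess.

```cpp
#include <cstddef>
#include <string>

// True if the UTF-16 code unit is a high (leading) surrogate.
static bool isHighSurrogate(char16_t c) { return c >= 0xD800 && c <= 0xDBFF; }

// True if the UTF-16 code unit is a low (trailing) surrogate.
static bool isLowSurrogate(char16_t c) { return c >= 0xDC00 && c <= 0xDFFF; }

// Hypothetical helper: escapes the five predefined XML entities and replaces
// code units that XML 1.0 does not allow with a space.
std::u16string escapeForXml(const std::u16string& in) {
    std::u16string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        const char16_t c = in[i];
        switch (c) {
            case u'&':  out += u"&amp;";  continue;
            case u'<':  out += u"&lt;";   continue;
            case u'>':  out += u"&gt;";   continue;
            case u'"':  out += u"&quot;"; continue;
            case u'\'': out += u"&apos;"; continue;
            default: break;
        }
        if (isHighSurrogate(c) && i + 1 < in.size() && isLowSurrogate(in[i + 1])) {
            // A well-formed surrogate pair encodes U+10000..U+10FFFF, all legal.
            out += c;
            out += in[++i];
        } else if (c == 0x9 || c == 0xA || c == 0xD ||
                   (c >= 0x20 && c <= 0xD7FF) || (c >= 0xE000 && c <= 0xFFFD)) {
            out += c;  // legal XML 1.0 character, copied unchanged
        } else {
            out += u' ';  // control character, lone surrogate, U+FFFE or U+FFFF
        }
    }
    return out;
}
```

Because each input character is looked at exactly once, the order-of-replacement problem (escaping the "&" of an entity that was just written) cannot occur. Note that the literal text "&nbsp;" comes out as "&amp;nbsp;", since XML predefines only the five entities listed above.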
 

Author Comment

by:rye004
ID: 38392836
My problem with doing that is that I keep finding additional characters (like þ) that need to be encoded.
As a test, I would like to make some type of foreach loop through all possible Unicode characters and check whether each one needs to be encoded.  If it does, I can add it to a list of characters that I need to convert.
Do you know how I would make a foreach loop through all possible Unicode characters in C#?
Hopefully this makes sense.
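The brute-force loop described here can be sketched as below. It is written in C++ rather than C#, since the final code has to live on the EnScript side, but the logic is a plain integer loop and should port directly. It assumes the XML 1.0 Char production is the rule for "needs handling"; `isXmlChar` and `disallowedRanges` are hypothetical names.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// XML 1.0 "Char" production: the code points a document may contain.
bool isXmlChar(std::uint32_t cp) {
    return cp == 0x9 || cp == 0xA || cp == 0xD ||
           (cp >= 0x20 && cp <= 0xD7FF) ||
           (cp >= 0xE000 && cp <= 0xFFFD) ||
           (cp >= 0x10000 && cp <= 0x10FFFF);
}

// Walks every code point U+0000..U+10FFFF and collects the contiguous
// ranges (inclusive) that fail isXmlChar.
std::vector<std::pair<std::uint32_t, std::uint32_t>> disallowedRanges() {
    std::vector<std::pair<std::uint32_t, std::uint32_t>> ranges;
    bool inRange = false;
    std::uint32_t start = 0;
    for (std::uint32_t cp = 0; cp <= 0x10FFFF; ++cp) {
        if (!isXmlChar(cp)) {
            if (!inRange) {
                start = cp;  // first code point of a new disallowed run
                inRange = true;
            }
        } else if (inRange) {
            ranges.emplace_back(start, cp - 1);  // run ended just before cp
            inRange = false;
        }
    }
    if (inRange) ranges.emplace_back(start, 0x10FFFF);
    return ranges;
}
```

Under that assumption the loop finds five ranges: 0x0-0x8, 0xB-0xC, 0xE-0x1F, 0xD800-0xDFFF and 0xFFFE-0xFFFF. Everything else is legal in XML 1.0 as long as &, <, >, " and ' are escaped.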
 
LVL 30

Accepted Solution

by:
anarki_jimbel earned 1500 total points
ID: 38393326
I will use a solution from http://stackoverflow.com/questions/1668571/how-to-generate-all-the-characters-in-the-utf-8-charset-in-net .
I have modified it a bit to read the file from the hard disk (see attached). The file was downloaded from http://unicode.org/Public/UNIDATA/UnicodeData.txt.

Below is just a button click handler for a simple form. The application prints to an output window.

code point = character code

        private void button1_Click(object sender, EventArgs e)
        {
            // Each line of UnicodeData.txt starts with a hex code point followed by ';'.
            // "using" closes the file even if parsing throws.
            using (System.IO.StreamReader reader = new System.IO.StreamReader(@"C:\Test\UnicodeData.txt"))
            {
                System.Text.UTF8Encoding encoder = new System.Text.UTF8Encoding();
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    int codePoint = Convert.ToInt32(line.Substring(0, line.IndexOf(";")), 16);
                    if (codePoint >= 0xD800 && codePoint <= 0xDFFF)
                    {
                        //surrogate range; not valid code points, but listed in the document
                    }
                    else
                    {
                        string utf16 = char.ConvertFromUtf32(codePoint);
                        byte[] utf8 = encoder.GetBytes(utf16);
                        //TODO: something with the UTF-8-encoded character

                        System.Diagnostics.Debug.WriteLine("'" + utf16 + "'");
                    }
                }
            }
            System.Diagnostics.Debug.WriteLine("Finished");
        }



Or, to print just the code points in hex form, change the else branch to:

                else
                {
                   System.Diagnostics.Debug.WriteLine("'" + string.Format("{0:x4}",codePoint) + "'");
                }


UnicodeData.txt