• Status: Solved

URGENT: Removing Bad Characters from an XML File

I have an XML file that contains some characters which are not allowed in XML files. I don't know what these characters are, but XmlDocument.Load throws an error saying the char code is 0x16. When I display the character, it is just the standard bad-char square.

My question is this: what is the best way to go through the file and replace all of the bad chars with white space? Which characters are invalid in an XML file?
Asked:
jjacksn
1 Solution
 
_TAD_ Commented:


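XML is not limited to the standard ASCII character set. Unless the XML declaration says otherwise, a parser assumes the file is UTF-8 (or UTF-16), so characters with a decimal value larger than 127 are fine as long as the bytes in the file match the declared encoding. If a character cannot be represented in the file's encoding, you have to use a numeric character reference such as &#xA3; or &#163; (displays "£").

What XML 1.0 does not allow is most control characters: anything below 0x20 other than tab (0x09), line feed (0x0A) and carriage return (0x0D). Your 0x16 is one of those.


In order to scrub your XML file you'll have to open the file with the right encoding (UTF-8 unless it declares something else), examine each character one by one, and determine whether it is one of those disallowed control characters.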
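
For reference, the XML 1.0 specification lists the legal characters as #x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD and #x10000-#x10FFFF. Below is a minimal sketch of a scrubber built on that list. The class and method names (XmlScrubber, IsLegalXmlChar, Scrub) are made up for illustration, and replacing with a space is just one possible policy.

```csharp
using System.Text;

public static class XmlScrubber
{
    // True if c is a legal XML 1.0 character on its own.
    // Surrogate halves are rejected here; valid pairs are handled in Scrub.
    public static bool IsLegalXmlChar(char c)
    {
        return c == 0x9 || c == 0xA || c == 0xD
            || (c >= 0x20 && c <= 0xD7FF)
            || (c >= 0xE000 && c <= 0xFFFD);
    }

    // Returns a copy of text with every illegal character replaced by a space.
    public static string Scrub(string text)
    {
        StringBuilder sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (char.IsHighSurrogate(c) && i + 1 < text.Length
                && char.IsLowSurrogate(text[i + 1]))
            {
                // A valid surrogate pair encodes U+10000..U+10FFFF, which is legal.
                sb.Append(c).Append(text[i + 1]);
                i++;
            }
            else
            {
                sb.Append(IsLegalXmlChar(c) ? c : ' ');
            }
        }
        return sb.ToString();
    }
}
```

Assuming the file really is UTF-8, something like File.WriteAllText(path, XmlScrubber.Scrub(File.ReadAllText(path))) before calling XmlDocument.Load should get past the 0x16 error. Keep in mind that the space still changes the data, so it is worth checking what was replaced.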


Excerpt from http://www.biglist.com/lists/xsl-list/archives/200203/msg01152.html

<----->

The POUND SIGN is character number A3 (hex) in Unicode. "U+00A3" is how you
can write it unambiguously in prose.

Encoding provides a way of representing that A3 as bytes.

iso-8859-1:  A3
     utf-8:  C2 A3
    utf-16:  00 A3 (big endian)
             A3 00 (little endian)

utf-8 and utf-16 can represent any Unicode character, but other encodings are
more limited, usually only representing 256 characters max.

If a character cannot be represented in a particular encoding, you write it as
a sequence of characters that can be represented in any encoding (spaces added
for clarity):

   & # x A 3 ;    or    & # 1 6 3 ;

<---- END ---->
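
If you want to see those byte sequences for yourself, here is a small sketch using the standard System.Text.Encoding classes. The PoundSignBytes class and its methods are just illustrative names. Note that in .NET, Encoding.Unicode is little-endian UTF-16 and Encoding.BigEndianUnicode is the big-endian form.

```csharp
using System;
using System.Text;

public static class PoundSignBytes
{
    // Encodes text with the given encoding and returns the bytes as hex, e.g. "C2-A3".
    public static string Hex(Encoding encoding, string text)
    {
        return BitConverter.ToString(encoding.GetBytes(text));
    }

    public static void PrintAll()
    {
        string pound = "\u00A3"; // POUND SIGN
        Console.WriteLine(Hex(Encoding.GetEncoding("iso-8859-1"), pound)); // A3
        Console.WriteLine(Hex(Encoding.UTF8, pound));                      // C2-A3
        Console.WriteLine(Hex(Encoding.BigEndianUnicode, pound));          // 00-A3
        Console.WriteLine(Hex(Encoding.Unicode, pound));                   // A3-00
        // ASCII cannot represent the character, so the default fallback writes '?'.
        Console.WriteLine(Hex(Encoding.ASCII, pound));                     // 3F
    }
}
```

The last line is the case the excerpt describes: the encoding has no byte for the character, so in an XML file you would write &#xA3; instead.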
 
jjacksn (Author) Commented:
How would I open/manipulate it in UTF-8 format? Would this be faster than opening the file, getting the stream, examining each char value to see if it is greater than 127, and replacing it with whitespace if it is? The file is less than 5 MB.
 
jjacksn (Author) Commented:
I ended up just doing this. Could this do something bad I don't know about?

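// Requires: using System.IO; using System.Windows.Forms;
// Read the whole file into memory (StreamReader decodes as UTF-8 by default).
string s;
using (StreamReader sr = new StreamReader("c:\\database2.xml"))
{
    s = sr.ReadToEnd();
}
// Show a slice of the text before cleaning.
MessageBox.Show(s.Substring(206600, 500));
int count = 0;
for (int i = 206600; i < s.Length; i++)
{
    int j = (int)s[i];
    // Keep printable ASCII (32-126) and tab; anything else, including
    // CR/LF and non-ASCII characters, becomes a space.
    if (!((31 < j && j < 127) || j == 9))
    {
        // Replace swaps every occurrence of this character in the whole string,
        // so count ends up as the number of Replace calls, not characters changed.
        count++;
        s = s.Replace((char)j, (char)32);
    }
}
MessageBox.Show(count + "");
// Show a slice of the text after cleaning.
MessageBox.Show(s.Substring(206300, 500));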

The file was about 500 KB and 12 Replace calls were made.

Is there any harm in this method?
 
_TAD_ Commented:

What you did is perfectly acceptable.

However... if your XML file contained something like the UK pound sign (£), it now contains a space instead. This won't "hurt" your program, but it costs you readability in some of your elements. I guess it really depends on what those characters were.


If you can, run the process again and collect the decimal or hex equivalent of each replaced character in a list so you can see what is being replaced. If you are replacing stray control codes or something equally trivial, then you don't even need to substitute a space; you can just drop the character. If, on the other hand, it is a real character (just not standard ASCII), then you may want to replace it with its numeric character reference, &#x<hex equivalent>;
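
A rough sketch of both ideas, assuming the file has already been read into a string. CharAudit, Tally and Escape are hypothetical names, not anything from the framework. Tally gives you the list of suspicious characters with counts; Escape drops control characters and rewrites the rest as hex references.

```csharp
using System.Collections.Generic;
using System.Text;

public static class CharAudit
{
    // Plain ASCII (32-127) plus tab, line feed and carriage return.
    static bool IsPlain(char c)
    {
        return (c >= 32 && c < 128) || c == 9 || c == 10 || c == 13;
    }

    // Counts every non-plain character, keyed by its UTF-16 code unit.
    public static SortedDictionary<int, int> Tally(string text)
    {
        var counts = new SortedDictionary<int, int>();
        foreach (char c in text)
        {
            if (IsPlain(c)) continue;
            int seen;
            counts.TryGetValue(c, out seen); // seen stays 0 if the key is new
            counts[c] = seen + 1;
        }
        return counts;
    }

    // Keeps plain characters, rewrites U+00A0 and above as &#x...; references,
    // and drops everything else.
    public static string Escape(string text)
    {
        var sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (IsPlain(c))
            {
                sb.Append(c);
            }
            else if (char.IsSurrogatePair(text, i))
            {
                // Two code units form one character above U+FFFF.
                sb.Append("&#x").Append(char.ConvertToUtf32(text, i).ToString("X")).Append(';');
                i++;
            }
            else if (c >= 0xA0 && c <= 0xFFFD && !char.IsSurrogate(c))
            {
                sb.Append("&#x").Append(((int)c).ToString("X")).Append(';');
            }
            // Anything else (control characters, lone surrogates, U+FFFE/U+FFFF) is dropped.
        }
        return sb.ToString();
    }
}
```

Looping over Tally(s) and printing each key with ToString("X") shows exactly which characters were in the file. Escape is only safe for text and attribute values; references are not expanded inside comments or CDATA sections and are not allowed in element names.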
 
Salubritas Commented:
I just solved this problem by setting the encoding of the XML file to ASCII, rather than the default UTF-8:

<?xml version="1.0" encoding="ascii"?>