Solved

URGENT:  Removing Bad Characters from an XML File

Posted on 2003-12-03
Last Modified: 2012-05-04
I have an XML file that contains some characters which are not allowed in XML files. I don't know what these characters are, but XmlDocument.Load throws an error reporting char code 0x16. When I display the character, it is just the standard bad-char square.

My question is this: what is the best way to go through the file and replace all of the bad chars with white space? And which characters are invalid in an XML file?
Question by:jjacksn

Expert Comment

by:_TAD_
ID: 9872458


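XML is not limited to the standard ASCII character set: when a file has no encoding declaration, the parser assumes UTF-8 (or UTF-16 if there is a byte order mark). If the encoding you declare cannot represent a character (for ASCII, anything with a decimal value larger than 127), you have to write it as a numeric character reference using the hex equivalent, e.g. &#xA3; (displays "£").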


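In order to scrub your XML file you'll have to open the file in UTF-8 format, examine each character one by one, and check whether its value is greater than 127 or is a control character. The characters XML 1.0 actually forbids are the control characters below 0x20 other than tab (0x09), line feed (0x0A) and carriage return (0x0D), which is why 0x16 fails, plus unpaired surrogates, 0xFFFE and 0xFFFF.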


Excerpt from http://www.biglist.com/lists/xsl-list/archives/200203/msg01152.html

<----->

The POUND SIGN is character number A3 (hex) in Unicode. "U+00A3" is how you
can write it unambiguously in prose.

Encoding provides a way of representing that A3 as bytes.

iso-8859-1:  A3
     utf-8:  C2 A3
    utf-16:  00 A3 (big endian)
             A3 00 (little endian)

utf-8 and utf-16 can represent any Unicode character, but other encodings are
more limited, usually only representing 256 characters max.

If a character cannot be represented in a particular encoding, you write it as
a sequence of characters that can be represented in any encoding (spaces added
for clarity):

   & # x A 3 ;    or    & # 1 6 3 ;

<---- END ---->
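
To make the byte table in the excerpt concrete, here is a small sketch that prints the bytes of U+00A3 in each encoding and builds the numeric character reference. The class and method names (PoundSignEncodings, HexBytes, ToCharRef) are made up for illustration and are not from the quoted post.

```csharp
using System;
using System.Text;

public static class PoundSignEncodings
{
    // Returns the bytes of `text` in the given encoding as space-separated hex.
    public static string HexBytes(string text, Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetBytes(text)).Replace("-", " ");
    }

    // Builds a hexadecimal numeric character reference such as &#xA3;
    public static string ToCharRef(char c)
    {
        return "&#x" + ((int)c).ToString("X") + ";";
    }

    public static void Main()
    {
        string pound = "\u00A3";
        Console.WriteLine(HexBytes(pound, Encoding.GetEncoding("iso-8859-1"))); // A3
        Console.WriteLine(HexBytes(pound, Encoding.UTF8));                      // C2 A3
        Console.WriteLine(HexBytes(pound, Encoding.BigEndianUnicode));          // 00 A3
        Console.WriteLine(HexBytes(pound, Encoding.Unicode));                   // A3 00 (little endian)
        Console.WriteLine(ToCharRef('\u00A3'));                                 // &#xA3;
    }
}
```

Note that Encoding.Unicode in .NET is little-endian UTF-16, which is why its bytes come out as A3 00.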

Author Comment

by:jjacksn
ID: 9872578
How would I open/manipulate it in UTF-8 format? Would this be faster than opening the file, getting the stream, examining each char value to see if it is greater than 127, and putting in a whitespace to replace it if it is? The file is less than 5 megs.
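
For anyone with the same question, a minimal sketch of reading the file as UTF-8 and blanking out illegal characters in a single pass might look like this. It assumes the file really is UTF-8 and small enough to hold in memory. XmlCleaner, IsLegalXmlChar and Clean are hypothetical names, the input and output paths are taken from the command line, and the ranges follow the XML 1.0 definition of a legal character.

```csharp
using System;
using System.IO;
using System.Text;

public static class XmlCleaner
{
    // True if this UTF-16 code unit is a legal XML 1.0 character on its own.
    // Surrogates are not covered here; Clean handles them as pairs.
    public static bool IsLegalXmlChar(char c)
    {
        return c == 0x9 || c == 0xA || c == 0xD
            || (c >= 0x20 && c <= 0xD7FF)
            || (c >= 0xE000 && c <= 0xFFFD);
    }

    // Replaces every character that XML 1.0 forbids with a space.
    public static string Clean(string input)
    {
        StringBuilder sb = new StringBuilder(input.Length);
        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            if (char.IsHighSurrogate(c) && i + 1 < input.Length
                && char.IsLowSurrogate(input[i + 1]))
            {
                // A well-formed pair encodes U+10000..U+10FFFF, which is legal.
                sb.Append(c).Append(input[i + 1]);
                i++;
            }
            else
            {
                // Lone surrogates fail IsLegalXmlChar and become spaces too.
                sb.Append(IsLegalXmlChar(c) ? c : ' ');
            }
        }
        return sb.ToString();
    }

    public static void Main(string[] args)
    {
        // args[0] = input path, args[1] = output path (both supplied by the caller).
        string text = File.ReadAllText(args[0], Encoding.UTF8);
        File.WriteAllText(args[1], Clean(text), new UTF8Encoding(false));
    }
}
```

Building the result in a StringBuilder keeps this linear in the file size, so a file of a few megabytes should not be a problem. This only deals with raw characters; a reference such as &#x16; written out in the file would still be rejected by an XML 1.0 parser.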

Author Comment

by:jjacksn
ID: 9872711
I ended up just doing this. Could this do something bad I don't know about?

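// Read the whole file into memory (it is small enough).
StreamReader sr = new StreamReader("c:\\database2.xml");
string s = sr.ReadToEnd();
sr.Close();
MessageBox.Show(s.Substring(206600, 500));
int count = 0;
// Scanning starts at offset 206600, near the reported error.
for (int i = 206600; i < s.Length; i++)
{
    int j = (int)s[i];
    // Treat anything other than printable ASCII (32..126) or tab as invalid.
    // Note that this also catches CR and LF.
    if (!((31 < j && j < 127) || j == 9))
    {
        count++;
        // Replace swaps every occurrence of this character in the whole string for a space.
        s = s.Replace((char)j, (char)32);
    }
}
MessageBox.Show(count + "");
MessageBox.Show(s.Substring(206300, 500));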

The thing was about 500 KB and 12 replacement calls were made.

Is there any harm in this method?

Accepted Solution

by:
_TAD_ earned 500 total points
ID: 9874196

What you did is perfectly acceptable.

However... if your XML file contained something like the UK Dollar symbol (it's a funny little 'c'-like character with a horizontal line through the middle), it now contains a space instead. This won't "hurt" your program, but it costs you some readability in the affected elements. I guess it really depends on what those characters were.


If you can run it again, run the process and collect the decimal or hex equivalent of each character in a list so you can see what is being replaced. If you are replacing control characters or something equally trivial, then you don't even need to replace that character with a space. You can just remove the character. If on the other hand it is a real character (just not standard ASCII), then you may want to replace it with a numeric character reference of the form &#x<hex equivalent>;
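
A rough sketch of that suggestion, under a few assumptions: "bad" means anything other than printable ASCII, tab, CR and LF (which keeps line breaks, unlike the loop posted above), control characters are simply dropped, and real non-ASCII characters become hex references. BadCharReport, Count and Escape are hypothetical names, and the file path comes from the command line.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class BadCharReport
{
    static bool IsPlainAscii(char c)
    {
        return (c >= 32 && c < 127) || c == 9 || c == 10 || c == 13;
    }

    // Counts each non-plain character by code so you can inspect what would change.
    public static SortedDictionary<int, int> Count(string text)
    {
        SortedDictionary<int, int> counts = new SortedDictionary<int, int>();
        foreach (char c in text)
        {
            if (IsPlainAscii(c)) continue;
            int seen;
            counts.TryGetValue(c, out seen);   // seen stays 0 if the key is new
            counts[c] = seen + 1;
        }
        return counts;
    }

    // Keeps plain ASCII, writes other legal characters as &#xHEX; and drops the rest.
    public static string Escape(string text)
    {
        StringBuilder sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (IsPlainAscii(c))
            {
                sb.Append(c);
            }
            else if (char.IsSurrogatePair(text, i))
            {
                // Characters above U+FFFF take two chars; emit one reference for the pair.
                sb.Append("&#x").Append(char.ConvertToUtf32(text, i).ToString("X")).Append(';');
                i++;
            }
            else if (c > 127 && c <= 0xFFFD && !char.IsSurrogate(c))
            {
                sb.Append("&#x").Append(((int)c).ToString("X")).Append(';');
            }
            // Everything else (control characters, DEL, lone surrogates) is dropped.
        }
        return sb.ToString();
    }

    public static void Main(string[] args)
    {
        string text = File.ReadAllText(args[0], Encoding.UTF8);
        foreach (KeyValuePair<int, int> pair in Count(text))
        {
            Console.WriteLine("0x{0:X2}: {1} occurrence(s)", pair.Key, pair.Value);
        }
    }
}
```

Running the report first shows whether the hits are stray control characters such as 0x16 or real text, which is what decides between dropping them and escaping them.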

Expert Comment

by:Salubritas
ID: 11183858
I just solved this problem by setting the encoding of the xml file to ascii, rather than the default utf-8:

<?xml version="1.0" encoding="ascii"?>