Solved

URGENT:  Removing Bad Characters from an XML File

Posted on 2003-12-03
5
4,960 Views
Last Modified: 2012-05-04
I have an xml file that has some characters which are no allowed in XML files (I don't know what these characters are, but its throwing an error because the char code is 0x16 when I try to call XmlDocument.Load.  When I display the character, it is just the standard bad char square.  

My Question is this, what is the best way to go through the file, replace all of the bad chars with white space?  What are invalid characters in an xml file?  
0
Comment
Question by:jjacksn
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 2
  • 2
5 Comments
 
LVL 22

Expert Comment

by:_TAD_
ID: 9872458


I am pretty sure that XML (by default) only uses the standard ASCII character set.  If you want to use a character that has a decimal value larger than 127 then you have use the hex equivilent   &#A3   (displays "#").


In order to scrub your XML file you'll have to open the file in UTF-8  format and examine each character one by one and determine if the decimal value is greater than 127


exerpt from http://www.biglist.com/lists/xsl-list/archives/200203/msg01152.html 

<----->

The POUND SIGN is character number A3 (hex) in Unicode. "U+00A3" is how you
can write it unambiguously in prose.

Encoding provides a way of representing that A3 as bytes.

iso-8859-1:  A3
     utf-8:  C2 A3
    utf-16:  00 A3 (little endian)
             A3 00 (big endian)

utf-8 and utf-16 can represent any Unicode character, but other encodings are
more limited, usually only representing 256 characters max.

If a character cannot be represented in a particular encoding, you write it as
a sequence of characters that can be represented in any encoding (spaces added
for clarity):

   & # x A 3 ;    or    & # 1 6 3 ;

<---- END ---->
0
 
LVL 5

Author Comment

by:jjacksn
ID: 9872578
How would I open/manipulate it in UTF-8 format?  would this be faster than opening the file, getting the stream, examing each char value to see if it is greater than 127, and putting in a whitespace to replace it if it is?  The file is less than 5 megs.  
0
 
LVL 5

Author Comment

by:jjacksn
ID: 9872711
I ended up just doing this.  could this do something bad I don't know about?

StreamReader sr = new StreamReader("c:\\database2.xml");
string s = sr.ReadToEnd();
MessageBox.Show(s.Substring(206600, 500));
int count = 0;
for(int i = 206600; i < s.Length; i++)
{
       int j = (int)s[i];
       //Remove all invalid characters from the asci text.
       if(!((31 < j && j < 127) || j == 9))
       {
       count ++;
      s = s.Replace((char)j, (char)32);
        }
}
MessageBox.Show(count + "");
MessageBox.Show(s.Substring(206300, 500));

The thing was about 500kb and 12 replacements calls were made.

is there any harm in this method?
0
 
LVL 22

Accepted Solution

by:
_TAD_ earned 500 total points
ID: 9874196

What you did is perfectly acceptable.

However... If your XML file contained something like the Uk Dollar symbol (its a funny little 'c' like character with a horozontal line through the middle), it now contains a space instead.  This won't "hurt" your program, but it makes for loss of readability on some of your elements.  I guess it really depends on what those characters were.


If you can run it again, run the process and collect the decimal or hex equivilent of each character in a list so you can see what is being replaced.  If you are replacing tab stops or something trivial, then you don't even need to replace that character with a space.  You can just pull the character.  If on the other hand it is a real character (just not standard ascii) then you may want to replace that character with the &#<hex equivilent>
0
 

Expert Comment

by:Salubritas
ID: 11183858
I just solved this problem by setting the encoding of the xml file to ascii, rather than the default utf-8:

<?xml version="1.0" encoding="ascii"?>
0

Featured Post

Free Tool: Postgres Monitoring System

A PHP and Perl based system to collect and display usage statistics from PostgreSQL databases.

One of a set of tools we are providing to everyone as a way of saying thank you for being a part of the community.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Introduction Although it is an old technology, serial ports are still being used by many hardware manufacturers. If you develop applications in C#, Microsoft .NET framework has SerialPort class to communicate with the serial ports.  I needed to…
Introduction Hi all and welcome to my first article on Experts Exchange. A while ago, someone asked me if i could do some tutorials on object oriented programming. I decided to do them on C#. Now you may ask me, why's that? Well, one of the re…
Are you ready to implement Active Directory best practices without reading 300+ pages? You're in luck. In this webinar hosted by Skyport Systems, you gain insight into Microsoft's latest comprehensive guide, with tips on the best and easiest way…

726 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question