Solved

URGENT:  Removing Bad Characters from an XML File

Posted on 2003-12-03
Last Modified: 2012-05-04
I have an XML file that contains some characters which are not allowed in XML files. I don't know what these characters are, but XmlDocument.Load throws an error saying the char code is 0x16. When I display the character, it is just the standard bad-char square.

My question is this: what is the best way to go through the file and replace all of the bad chars with white space? What are the invalid characters in an XML file?
Question by:jjacksn
5 Comments
 

Expert Comment

by:_TAD_
ID: 9872458


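XML is not limited to the standard ASCII character set: by default a file is read as UTF-8 (or UTF-16), so characters with a decimal value larger than 127 are fine as long as the bytes match the declared encoding. If the encoding cannot represent a character, you have to use the numeric character reference instead, e.g. &#xA3; (displays "£").


In order to scrub your XML file you'll have to open it with the right encoding (UTF-8 unless the XML declaration says otherwise) and examine each character one by one. What XML 1.0 actually forbids is most control characters: anything below 0x20 except tab (0x09), line feed (0x0A) and carriage return (0x0D), which is why your 0x16 throws. Characters above 127 only cause trouble when the file's real encoding differs from the declared one.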
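
To make the "examine each character one by one" step concrete, here is a minimal sketch of a check against the character ranges XML 1.0 allows (tab, LF, CR, 0x20-0xD7FF, 0xE000-0xFFFD, and surrogate pairs for U+10000 and up). The class and method names (XmlCharScrubber, IsLegalXmlChar, Scrub) are made up for illustration and are not part of any library.

```csharp
using System;
using System.Text;

// Hypothetical helper, named for illustration only.
public static class XmlCharScrubber
{
    // True if the UTF-16 code unit c is allowed by XML 1.0 on its own.
    // Surrogates (0xD800-0xDFFF) are only legal as pairs; Scrub handles those.
    public static bool IsLegalXmlChar(char c)
    {
        return c == 0x9 || c == 0xA || c == 0xD
            || (c >= 0x20 && c <= 0xD7FF)
            || (c >= 0xE000 && c <= 0xFFFD);
    }

    // Returns a copy of text with every illegal character replaced by a space.
    public static string Scrub(string text)
    {
        StringBuilder sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (char.IsHighSurrogate(c) && i + 1 < text.Length
                && char.IsLowSurrogate(text[i + 1]))
            {
                // A valid surrogate pair encodes a character in U+10000-U+10FFFF.
                sb.Append(c).Append(text[i + 1]);
                i++;
            }
            else
            {
                // Lone surrogates fail IsLegalXmlChar and become spaces too.
                sb.Append(IsLegalXmlChar(c) ? c : ' ');
            }
        }
        return sb.ToString();
    }
}
```

Something like File.WriteAllText(path, XmlCharScrubber.Scrub(File.ReadAllText(path))) would apply it to a whole file, assuming the file really is UTF-8. Building the result in a StringBuilder keeps it to a single pass over the text.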


Excerpt from http://www.biglist.com/lists/xsl-list/archives/200203/msg01152.html

<----->

The POUND SIGN is character number A3 (hex) in Unicode. "U+00A3" is how you
can write it unambiguously in prose.

Encoding provides a way of representing that A3 as bytes.

iso-8859-1:  A3
     utf-8:  C2 A3
    utf-16:  00 A3 (big endian)
             A3 00 (little endian)

utf-8 and utf-16 can represent any Unicode character, but other encodings are
more limited, usually only representing 256 characters max.

If a character cannot be represented in a particular encoding, you write it as
a sequence of characters that can be represented in any encoding (spaces added
for clarity):

   & # x A 3 ;    or    & # 1 6 3 ;

<---- END ---->
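
If you want to see those byte sequences for yourself, a small sketch like the following prints them from .NET. The class and method names (PoundSignBytes, Hex) are made up for illustration; the Encoding members used are the standard ones in System.Text.

```csharp
using System;
using System.Text;

// Hypothetical helper, named for illustration only.
public static class PoundSignBytes
{
    // Hex dump of text encoded with the given encoding, bytes separated by spaces.
    public static string Hex(Encoding encoding, string text)
    {
        // BitConverter.ToString gives "C2-A3"; swap the dashes for spaces.
        return BitConverter.ToString(encoding.GetBytes(text)).Replace('-', ' ');
    }

    // Writes one line per encoding for the POUND SIGN, U+00A3.
    public static void Print()
    {
        string pound = "\u00A3";
        Console.WriteLine("iso-8859-1: " + Hex(Encoding.GetEncoding("iso-8859-1"), pound)); // A3
        Console.WriteLine("utf-8:      " + Hex(Encoding.UTF8, pound));                      // C2 A3
        Console.WriteLine("utf-16 BE:  " + Hex(Encoding.BigEndianUnicode, pound));          // 00 A3
        Console.WriteLine("utf-16 LE:  " + Hex(Encoding.Unicode, pound));                   // A3 00
    }
}
```

Note that Encoding.Unicode in .NET is the little-endian form of UTF-16, so the low byte A3 comes first there.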

Author Comment

by:jjacksn
ID: 9872578
How would I open/manipulate it in UTF-8 format? Would this be faster than opening the file, getting the stream, examining each char value to see if it is greater than 127, and putting in a whitespace to replace it if it is? The file is less than 5 megs.

Author Comment

by:jjacksn
ID: 9872711
I ended up just doing this. Could this do something bad I don't know about?

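string s;
// Read the whole file into memory; the using block closes the reader.
using (StreamReader sr = new StreamReader("c:\\database2.xml"))
{
    s = sr.ReadToEnd();
}
MessageBox.Show(s.Substring(206600, 500));
int count = 0;
// Scan from offset 206600 to the end of the string.
for (int i = 206600; i < s.Length; i++)
{
    int j = (int)s[i];
    // Replace everything except printable ASCII (32-126) and tab with a space.
    // Note: this also replaces line feeds (10) and carriage returns (13).
    if (!((31 < j && j < 127) || j == 9))
    {
        count++;
        // Replace swaps every occurrence in the string, not just the one at i.
        s = s.Replace((char)j, (char)32);
    }
}
MessageBox.Show(count + "");
MessageBox.Show(s.Substring(206300, 500));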

The file was about 500 KB and 12 replacement calls were made.

Is there any harm in this method?

Accepted Solution

by:
_TAD_ earned 500 total points
ID: 9874196

What you did is perfectly acceptable.

However... if your XML file contained something like the UK pound sign (£), it now contains a space instead. This won't "hurt" your program, but it makes for a loss of readability in some of your elements. I guess it really depends on what those characters were.


If you can run it again, run the process and collect the decimal or hex equivalent of each character in a list so you can see what is being replaced. If you are replacing tab stops or something trivial, then you don't even need to replace that character with a space; you can just pull the character. If on the other hand it is a real character (just not standard ASCII), then you may want to replace that character with its character reference, &#x<hex equivalent>;
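
A rough sketch of that suggestion, assuming the file has already been read into a string: Collect tallies what would be replaced, and Escape turns real non-ASCII characters into &#x...; references while control characters (which XML 1.0 does not allow even as references) become spaces. The names ReplacementReport, Collect and Escape are made up for illustration. Unlike the loop in the question, this keeps CR and LF.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, named for illustration only.
public static class ReplacementReport
{
    // Printable ASCII plus tab, line feed and carriage return is left alone.
    static bool IsPlainAscii(int cp)
    {
        return (cp > 31 && cp < 127) || cp == 9 || cp == 10 || cp == 13;
    }

    // Reads the code point at index i and advances i past a surrogate pair.
    static int NextCodePoint(string text, ref int i)
    {
        if (char.IsSurrogatePair(text, i))
        {
            int cp = char.ConvertToUtf32(text, i);
            i++;
            return cp;
        }
        return text[i];
    }

    // Maps each code point that would be touched to its number of occurrences.
    public static SortedDictionary<int, int> Collect(string text)
    {
        SortedDictionary<int, int> seen = new SortedDictionary<int, int>();
        for (int i = 0; i < text.Length; i++)
        {
            int cp = NextCodePoint(text, ref i);
            if (IsPlainAscii(cp)) continue;
            int n;
            seen.TryGetValue(cp, out n);
            seen[cp] = n + 1;
        }
        return seen;
    }

    // Illegal characters become spaces; other non-ASCII becomes &#x...;
    public static string Escape(string text)
    {
        StringBuilder sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            int cp = NextCodePoint(text, ref i);
            bool illegal = cp < 0x20 || (cp >= 0xD800 && cp <= 0xDFFF)
                        || cp == 0xFFFE || cp == 0xFFFF;
            if (IsPlainAscii(cp)) sb.Append((char)cp);
            else if (illegal) sb.Append(' ');
            else sb.Append("&#x").Append(cp.ToString("X")).Append(';');
        }
        return sb.ToString();
    }
}
```

Printing each key of Collect(s) with ToString("X") gives the hex list to look over before deciding. One caveat: a character reference is only valid in text and attribute values, so this assumes the odd characters are not inside element or attribute names.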

Expert Comment

by:Salubritas
ID: 11183858
I just solved this problem by setting the encoding of the XML file to ASCII, rather than the default UTF-8:

<?xml version="1.0" encoding="ascii"?>
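
If the file is produced from .NET code, one way to get an ASCII-encoded document is to set the encoding on the writer. This is only a sketch under that assumption; the class name AsciiXmlWriter and the element name "item" are made up. As far as I know, XmlWriter then writes characters the encoding cannot represent as numeric character references. It does not make a control character such as 0x16 legal; with default settings the writer rejects those.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

// Hypothetical helper, named for illustration only.
public static class AsciiXmlWriter
{
    // Serializes <item>value</item> as an ASCII-encoded XML document.
    public static byte[] Write(string value)
    {
        XmlWriterSettings settings = new XmlWriterSettings();
        settings.Encoding = Encoding.ASCII; // drives both the declaration and the bytes

        using (MemoryStream ms = new MemoryStream())
        {
            // Disposing the writer flushes everything into the stream.
            using (XmlWriter writer = XmlWriter.Create(ms, settings))
            {
                writer.WriteElementString("item", value);
            }
            return ms.ToArray();
        }
    }
}
```

Loading the bytes back with XmlDocument.Load should give the original text again, since the parser expands the references.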