URGENT:  Removing Bad Characters from an XML File

Posted on 2003-12-03
Last Modified: 2012-05-04
I have an XML file containing some characters that are not allowed in XML files. I don't know what these characters are, but XmlDocument.Load throws an error reporting char code 0x16 when I try to load the file. When I display the character, it is just the standard bad-character square.

My question is this: what is the best way to go through the file and replace all of the bad characters with whitespace? And which characters are invalid in an XML file?
Question by:jjacksn
LVL 22

Expert Comment

by:_TAD_
ID: 9872458


I am pretty sure that XML (by default) only uses the standard ASCII character set.  If you want to use a character that has a decimal value larger than 127, then you have to use the hex equivalent, e.g. &#xA3; (displays "£").


In order to scrub your XML file, you'll have to open the file as UTF-8, examine each character one by one, and determine whether its decimal value is greater than 127.
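A minimal sketch of that character-by-character scrub, in C# since the question uses XmlDocument. Instead of the 127 cutoff, it assumes the character range defined by the XML 1.0 specification (tab, line feed, carriage return, U+0020 to U+D7FF, U+E000 to U+FFFD, plus surrogate pairs for U+10000 and above), which is the rule that rejects 0x16. The names XmlScrub, ReplaceInvalidXmlChars and LoadScrubbed are made up for illustration and are not part of any library.

```csharp
using System;
using System.IO;
using System.Text;

public static class XmlScrub
{
    // True if a single UTF-16 code unit is a legal XML 1.0 character by itself.
    // Surrogates (U+D800 to U+DFFF) return false here; valid pairs are
    // handled in ReplaceInvalidXmlChars.
    private static bool IsLegalXmlChar(char c)
    {
        return c == 0x9 || c == 0xA || c == 0xD
            || (c >= 0x20 && c <= 0xD7FF)
            || (c >= 0xE000 && c <= 0xFFFD);
    }

    // Returns a copy of text with every character that XML 1.0 forbids
    // replaced by a single space.
    public static string ReplaceInvalidXmlChars(string text)
    {
        var sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            bool isPair = char.IsHighSurrogate(c)
                && i + 1 < text.Length
                && char.IsLowSurrogate(text[i + 1]);
            if (isPair)
            {
                // A well-formed surrogate pair encodes U+10000 or above: keep both units.
                sb.Append(c).Append(text[++i]);
            }
            else
            {
                sb.Append(IsLegalXmlChar(c) ? c : ' ');
            }
        }
        return sb.ToString();
    }

    // Reads the file as UTF-8 (an assumption about the file) and scrubs it.
    public static string LoadScrubbed(string path)
    {
        return ReplaceInvalidXmlChars(File.ReadAllText(path, Encoding.UTF8));
    }
}
```

The scrubbed string could then be passed to XmlDocument.LoadXml rather than XmlDocument.Load. Because characters above 127 are left alone, accented letters and currency symbols survive the scrub.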


Excerpt from http://www.biglist.com/lists/xsl-list/archives/200203/msg01152.html

<----->

The POUND SIGN is character number A3 (hex) in Unicode. "U+00A3" is how you
can write it unambiguously in prose.

Encoding provides a way of representing that A3 as bytes.

iso-8859-1:  A3
     utf-8:  C2 A3
    utf-16:  00 A3 (big endian)
             A3 00 (little endian)

utf-8 and utf-16 can represent any Unicode character, but other encodings are
more limited, usually only representing 256 characters max.

If a character cannot be represented in a particular encoding, you write it as
a sequence of characters that can be represented in any encoding (spaces added
for clarity):

   & # x A 3 ;    or    & # 1 6 3 ;

<---- END ---->
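The character-reference form from the excerpt can be produced mechanically. The following is only a sketch: CharRefs and EscapeNonAscii are hypothetical names, and the input is assumed to contain no unpaired surrogates (char.ConvertToUtf32 throws if it does). Note that this helps only with characters that are legal in XML to begin with; in XML 1.0 a control character such as 0x16 is not allowed even when written as a character reference.

```csharp
using System;
using System.Text;

public static class CharRefs
{
    // Rewrites every character above 127 as a hexadecimal character
    // reference (for example U+00A3 becomes "&#xA3;") and leaves ASCII untouched.
    public static string EscapeNonAscii(string text)
    {
        var sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            // Decode one full code point; throws ArgumentException on a lone surrogate.
            int codePoint = char.ConvertToUtf32(text, i);
            if (char.IsSurrogatePair(text, i))
            {
                i++; // the code point used two UTF-16 units
            }

            if (codePoint > 127)
            {
                sb.Append("&#x").Append(codePoint.ToString("X")).Append(';');
            }
            else
            {
                sb.Append((char)codePoint);
            }
        }
        return sb.ToString();
    }
}
```

For instance, EscapeNonAscii("\u00A3 5") gives "&#xA3; 5", which can be stored in a file of any encoding.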
 
LVL 5

Author Comment

by:jjacksn
ID: 9872578
How would I open and manipulate it in UTF-8 format?  Would this be faster than opening the file, getting the stream, examining each char value to see if it is greater than 127, and replacing it with whitespace if it is?  The file is less than 5 MB.
 
LVL 5

Author Comment

by:jjacksn
ID: 9872711
I ended up just doing this.  Could this do something bad that I don't know about?

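// Requires: using System.IO; using System.Windows.Forms;
string s;
using (StreamReader sr = new StreamReader("c:\\database2.xml"))
{
    s = sr.ReadToEnd();
}
// Show a sample of the text before cleaning.
MessageBox.Show(s.Substring(206600, 500));
int count = 0;
for (int i = 206600; i < s.Length; i++)
{
    int j = (int)s[i];
    // Keep printable ASCII (32-126) and tab (9); replace everything else with a space.
    if (!((31 < j && j < 127) || j == 9))
    {
        count++;
        // Replace swaps every occurrence of this character in the whole string,
        // so count is the number of Replace calls, not the number of characters replaced.
        s = s.Replace((char)j, (char)32);
    }
}
MessageBox.Show(count + "");
MessageBox.Show(s.Substring(206300, 500));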

The file was about 500 KB, and 12 replacement calls were made.

Is there any harm in this method?
 
LVL 22

Accepted Solution

by:
_TAD_ earned 2000 total points
ID: 9874196

What you did is perfectly acceptable.

However... if your XML file contained something like the UK pound sign (£, a funny little character with a horizontal line through the middle), it now contains a space instead.  This won't "hurt" your program, but it costs you readability in some of your elements.  I guess it really depends on what those characters were.


If you can run it again, run the process and collect the decimal or hex equivalent of each character in a list so you can see what is being replaced.  If you are replacing tab stops or something trivial, then you don't even need to replace that character with a space.  You can just remove the character.  If, on the other hand, it is a real character (just not standard ASCII), then you may want to replace that character with its character reference, &#x<hex equivalent>;
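One way to collect that list is sketched below. It uses the same test as the loop in the question (printable ASCII 32-126 and tab are kept) and tallies everything else by code. CharReport and CountSuspectChars are hypothetical names chosen for this example.

```csharp
using System;
using System.Collections.Generic;

public static class CharReport
{
    // Counts how often each character outside printable ASCII (32-126) and tab
    // occurs, keyed by its numeric code and sorted by that code.
    public static SortedDictionary<int, int> CountSuspectChars(string text)
    {
        var counts = new SortedDictionary<int, int>();
        foreach (char c in text)
        {
            bool printableAscii = c > 31 && c < 127;
            if (printableAscii || c == '\t')
            {
                continue;
            }

            int seen;
            counts.TryGetValue(c, out seen); // seen stays 0 for a new key
            counts[c] = seen + 1;
        }
        return counts;
    }

    // Prints one line per suspect character: hex code, decimal code, count.
    public static void Print(SortedDictionary<int, int> counts)
    {
        foreach (KeyValuePair<int, int> entry in counts)
        {
            Console.WriteLine("0x{0:X2} ({0}): {1} occurrence(s)", entry.Key, entry.Value);
        }
    }
}
```

Running this on the file contents before any replacement shows whether the offenders are control characters such as 0x16, line breaks (codes 10 and 13, which the loop in the question also replaces), or real non-ASCII characters that deserve a character reference.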
 

Expert Comment

by:Salubritas
ID: 11183858
I just solved this problem by setting the encoding of the XML file to ASCII, rather than the default UTF-8:

<?xml version="1.0" encoding="ascii"?>