Solved

character conversion error

Posted on 2004-04-08
Last Modified: 2013-12-03
hey all. i put this question here for lack of a better place. i'm having some trouble with some xml character conversion.  i'm getting an error that says:

Caused by: org.xml.sax.SAXParseException: Character conversion error: "Unconvertible UTF-8 character beginning with 0xa0" (line number may be too low).

my question is, does anyone know a good website or something where i can find out what character "0xa0" is?  i don't know how to do a search and replace for this in an ascii program like notepad. i have access to an hp unix server if there is some handy utility in there i can use. please advise. thx.
Question by:benpung
12 Comments
 
LVL 26

Expert Comment

by:rdcpro
ID: 10788138
You have an encoding problem, and a search and replace solution won't fix it.  This is probably the second octet of the UTF-16 non-breaking space character, which is 0x00A0.  On its own, A0 is not a valid UTF-8 character (in UTF-8 it can only appear as a continuation byte), hence your encoding is not UTF-8.  I'm betting you're parsing a string, which will be UTF-16.
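To see what 0xa0 looks like in each encoding, a small check along these lines may help. It is only a sketch: the class name NbspBytes and the helper hex() are made up for illustration, and it assumes Java 7 or later for StandardCharsets.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class NbspBytes {
    /** Returns the bytes of text in the given charset as space-separated hex. */
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String nbsp = "\u00A0"; // U+00A0 NO-BREAK SPACE
        System.out.println("ISO-8859-1: " + hex(nbsp, StandardCharsets.ISO_8859_1)); // A0
        System.out.println("UTF-8:      " + hex(nbsp, StandardCharsets.UTF_8));      // C2 A0
        System.out.println("UTF-16BE:   " + hex(nbsp, StandardCharsets.UTF_16BE));   // 00 A0
    }
}
```

So a bare A0 byte shows up when the non-breaking space was written in a single-byte encoding such as ISO-8859-1, or as half of a UTF-16 code unit, but never by itself in well-formed UTF-8.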

Regards,
Mike Sharp
0
 
LVL 1

Author Comment

by:benpung
ID: 10790419
i have gotten a similar error before.  users enter special characters into reports (which become xml files) b/c they copy and paste text instead of typing it.  for example, yesterday i had to remove some of the § symbols (Alt + 0167). my problem is that i don't know how to find the character 0xa0. i'm not sure what character that is.  i think it's a bullet point or something, but i'm not sure.  i need a way to find and remove that character from the xml file prior to parsing it.  this is only a test run so if i could just find and delete the character manually that would suffice.
0
 
LVL 26

Expert Comment

by:rdcpro
ID: 10792098
You won't find the character, because it doesn't exist in UTF-8.  Your statement that you had to remove some § symbols confirms to me that you have an encoding problem.  Your actual encoding is UTF-16.  The 0xa0 is the second octet of the common non-breaking space character, which would be the two bytes 0xC2 0xA0 in UTF-8, if the encoding were correct.  You can't fix this by removing offending characters.  You have to find the source of the encoding issue.  Trust me, I've felt the pain many times, and I can smell an encoding problem a mile off.  I just yesterday fixed a similar problem for a client that had been bugging them for some time.

Open the XML in IE6, and you'll see where the character is, but UTF-8 is capable of representing *any* character in existence, and removing special characters is never necessary. Your problem likely stems from the fact that the XML is, at some point, existing as a string.

Are you using the transformNode() method anywhere??  How is the XML being generated?  

Regards,
Mike Sharp

0
 
LVL 26

Expert Comment

by:rdcpro
ID: 10794126
Sorry, the 0A is actually the second octet of the UTF-16 linefeed character in the CR-LF sequence, which is x000D x000A.  The non-breaking space character is x00A0; the ordinary space is x0020.

Regards,
Mike Sharp
0
 
LVL 1

Author Comment

by:benpung
ID: 10794549
the original xml is being generated by a system. the xml is actually a web report.  i am then taking that report and parsing it into a different xml file so i can insert the data into an oracle table.  i'm doing the parsing on a unix server with a java program.  if you want to see the program or xsl i'm using let me know and i can post a copy next week.  usually when this happens it's because a user enters an illegal character into the front end, which then shows up in the xml report, which i then try to parse, and then i get this error.  usually i can just find the illegal character in the front end and get rid of it.  if what you say is correct and i can't find/eliminate this character in the front end, how do i go about finding and eliminating the problem? i'm relatively new to xml so maybe i'm not being much help explaining what i'm doing. if you have any specific questions please ask.

ben
0
 
LVL 26

Expert Comment

by:rdcpro
ID: 10795068
In terms of XML, an illegal character is one that cannot be represented in the current encoding.  A lone 0xA0 byte is an example of this for UTF-8.  There is no way to enter this byte by itself (aside from programmatically, of course) through a UI.  Probably what's happening is your parser doesn't correctly interpret your encoding, and assumes UTF-8 (which is the default).  Most parsers use either the Byte Order Mark or else they look at the character unit boundaries to determine the encoding.  If it incorrectly assumes UTF-8, when it runs across a character that is outside of the 128 standard ASCII characters, it will probably throw the exception.

Is this XML available as a file?  If it's not too big (ie: <150k or so) you could email it to me, and I'll see if I can determine the encoding.  

The way to find and eliminate the problem is to find and eliminate the point at which the encoding is going wrong.  This is not always an easy task.  

If you open the file in Notepad or something, you should be able to find the offending character.  If the XML is not in file form, then look for a non-breaking space in the UI (often pasted in along with copied text), as I think that's what's getting scrambled.  But remember, this isn't a fix...it just removes the character.

Encoding schemes use multiple bytes to represent characters outside their single-unit range.  UTF-8 uses from 1 to 4 bytes, depending on the character: anything above 127 takes more than one byte.  You read the first byte to determine how many following bytes there will be.  The first byte also identifies a range of code points.  UTF-16 originally used 2 bytes for each character, but that turned out to be insufficient.  Unicode has since defined surrogate pairs for UTF-16, so a character there takes either 2 or 4 bytes.  The only encoding scheme that does not use this variable byte length system is UTF-32, which is essentially the same as UCS-4.  In UTF-32, all code points are represented by 4 bytes.
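The "read the first byte" rule can be sketched as below. The class and method names (Utf8LeadByte, sequenceLength) are hypothetical, and the ranges are the lead-byte ranges of well-formed UTF-8; the method looks only at the first byte and does not validate the bytes that follow.

```java
class Utf8LeadByte {
    /**
     * Returns the total length in bytes of the UTF-8 sequence that starts
     * with b, or -1 if b cannot start a sequence at all.
     */
    static int sequenceLength(int b) {
        b &= 0xFF;                            // treat the argument as an unsigned byte
        if (b <= 0x7F) return 1;              // 0xxxxxxx: plain ASCII
        if (b >= 0xC2 && b <= 0xDF) return 2; // 110xxxxx: one continuation byte follows
        if (b >= 0xE0 && b <= 0xEF) return 3; // 1110xxxx: two continuation bytes follow
        if (b >= 0xF0 && b <= 0xF4) return 4; // 11110xxx: three continuation bytes follow
        // 0x80-0xBF are continuation bytes; 0xC0, 0xC1 and 0xF5-0xFF never occur.
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(sequenceLength(0x41)); // 'A'
        System.out.println(sequenceLength(0xC2)); // lead byte of C2 A0
        System.out.println(sequenceLength(0xA0)); // the byte from the error message
    }
}
```

For 0xA0 the method returns -1, which matches the wording of the exception: a UTF-8 character cannot begin with that byte.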

If you're reading the source XML into a string variable, then parsing it using your java parser, then I'm guessing that's where the problem is occurring.  The XML declaration in the XML is saying UTF-8, but the conversion to a string changed it to UCS-2, which is essentially UTF-16.  0x00A0 is a legal code unit in UTF-16, but a lone 0xA0 byte is not legal in UTF-8.

If this is the case, you must either specify the encoding on your parser to be UTF-16 (if that's possible--don't know what you're using) or you must avoid the use of the string in the middle of the process, and read the XML stream directly.  If the XML is copied to your local filesystem, then it could be changing the encoding there as well.
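One way to "specify the encoding on your parser" with the standard javax.xml.transform API is to decode the bytes yourself and hand the transformer a Reader instead of a file. This is a sketch under assumptions: the class name TransformWithCharset and the fourth command line argument are invented here, and you still have to know (or find out) what the real encoding of the file is, for example ISO-8859-1 or UTF-16.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import javax.xml.transform.Result;
import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class TransformWithCharset {
    /**
     * Decodes the XML bytes with the given charset and applies the stylesheet.
     * Because the parser receives characters rather than bytes, it does not
     * guess the encoding and ignores any encoding declaration in the XML.
     */
    static void transform(InputStream xmlIn, Charset xmlCharset, Source xslt, Result result)
            throws TransformerException {
        Reader xmlReader = new InputStreamReader(xmlIn, xmlCharset);
        Transformer transformer = TransformerFactory.newInstance().newTransformer(xslt);
        transformer.transform(new StreamSource(xmlReader), result);
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical usage: xmlFileName xsltFileName outputFileName charsetName
        try (InputStream in = new FileInputStream(args[0])) {
            transform(in, Charset.forName(args[3]),
                      new StreamSource(new File(args[1])),
                      new StreamResult(new File(args[2])));
        }
    }
}
```

The other route is to leave the code alone and make the file declare its own encoding in the XML declaration, so the parser stops assuming UTF-8.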

There may be "business rules" that say a user shouldn't use a particular character (meaning character as an end user would think of it, but really they're thinking of a Glyph), but as long as the encoding is right, it shouldn't matter to the XML (except for low-ascii and markup characters like <,>, ', " and &).

Mark Davis, the President of the Unicode Consortium, has some excellent stuff on Unicode. Here's an article you should read:

http://www-106.ibm.com/developerworks/library/utfencodingforms/index.html

and he's got some great tools for playing with Character Codes at:
http://www.macchiato.com/unicode/convert.html
and
http://www.macchiato.com/unicode/charts.html

And finally, this site:
http://skew.org/xml/tutorial/

will tell you more about encoding than you EVER wanted to know!

Regards,
Mike Sharp

0

 
LVL 1

Author Comment

by:benpung
ID: 10817208
that is some very good information, thanks.  i haven't had time to go through all the links yet, but i wanted to reply.  the xml is available as a file, but it's BIG (~3MB) so i can't post or email.  

you said "If you open the file in Notepad or something, you should be able to find the offending character. "

when i open a 3MB file, it's huge.  it would take me forever to scroll through it looking for the character.  is there a way i can use find/replace to find this thing?

as far as my parser goes, i'm using the apache.xalan parser that comes with java1.4.  i just have a simple java program that applies my xsl file to my xml file and produces an output file. i will continue to play with this and see if i can get anywhere with it.  if you have any suggestions on a good way to find the character (b/c i think that's where i need to start in order to solve the problem) i would appreciate it. thanks.
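For a file too big to scroll through, a small scanner can report where the bad bytes are. The following is a rough sketch, not a full validator: FindBadUtf8 and scan() are made-up names, it needs Java 7 or later, and it checks only the lead byte / continuation byte structure of UTF-8 (it will not catch every ill-formed sequence).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class FindBadUtf8 {
    /** Returns one message per byte that breaks the UTF-8 lead/continuation structure. */
    static List<String> scan(byte[] data) {
        List<String> problems = new ArrayList<>();
        int line = 1;    // 1-based line number, counted by '\n' bytes
        int pending = 0; // continuation bytes still expected
        for (int offset = 0; offset < data.length; offset++) {
            int b = data[offset] & 0xFF;
            boolean continuation = (b & 0xC0) == 0x80; // 10xxxxxx
            if (pending > 0 && continuation) {
                pending--;
                continue;
            }
            if (pending > 0) {
                problems.add(describe(line, offset, b, "interrupts a multi-byte sequence"));
                pending = 0;
            }
            if (b <= 0x7F) {
                if (b == '\n') {
                    line++;
                }
            } else if (b >= 0xC2 && b <= 0xDF) {
                pending = 1;
            } else if (b >= 0xE0 && b <= 0xEF) {
                pending = 2;
            } else if (b >= 0xF0 && b <= 0xF4) {
                pending = 3;
            } else {
                problems.add(describe(line, offset, b, "cannot start a UTF-8 sequence"));
            }
        }
        if (pending > 0) {
            problems.add("file ends inside a multi-byte sequence");
        }
        return problems;
    }

    private static String describe(int line, int offset, int b, String why) {
        return String.format("line %d, byte offset %d: 0x%02X %s", line, offset, b, why);
    }

    public static void main(String[] args) throws IOException {
        // A 3 MB report fits comfortably in memory.
        for (String problem : scan(Files.readAllBytes(Paths.get(args[0])))) {
            System.out.println(problem);
        }
    }
}
```

Run it with the report's file name as the only argument; each line of output gives a line number and byte offset you can jump to in an editor.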

ben
0
 
LVL 32

Expert Comment

by:shalomc
ID: 10857203
ben,
Could you at least post here the first 10 lines?

0
 
LVL 1

Author Comment

by:benpung
ID: 10858396
sure, i can do that. here are the first 10 lines:

<xml xmlns:s="uuid:BDC6E3F0-6DA3-11d1-A2A3-00AA00C14882" xmlns:dt="uuid:C2F41010-65B3-11d1-A29F-00AA00C14882" xmlns:rs="urn:schemas-microsoft-com:rowset" xmlns:z="#RowsetSchema" ERRFLAG="I" RPTFILE="">
      <s:Schema id="RowsetSchema">
            <s:ElementType name="row" content="eltOnly" rs:updatable="true">
                  <s:AttributeType name="BeginDate" rs:number="1" rs:nullable="true" rs:write="true">
                        <s:datatype dt:type="dateTime" rs:dbtype="variantdate" dt:maxLength="16" rs:precision="0" rs:fixedlength="true" rs:maybenull="false"/>
                  </s:AttributeType>
                  <s:AttributeType name="EndDate" rs:number="2" rs:nullable="true" rs:write="true">
                        <s:datatype dt:type="dateTime" rs:dbtype="variantdate" dt:maxLength="16" rs:precision="0" rs:fixedlength="true" rs:maybenull="false"/>
                  </s:AttributeType>
0
 
LVL 26

Accepted Solution

by:
rdcpro earned 50 total points
ID: 10859914
What would be more  useful would be to see the code that you use to handle this file.  

The above is simply part of the inline schema, but it does tell me that your encoding problem is likely occurring when you handle this XML.

Regards,
Mike Sharp
0
 
LVL 1

Author Comment

by:benpung
ID: 10863654
here is the xsl file to parse the code:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:rs='urn:schemas-microsoft-com:rowset' xmlns:z='#RowsetSchema'
        version="1.0"><xsl:output method="xml" indent="no"/>

<xsl:template match="rs:data">
<xml>
<xsl:text>&#10;</xsl:text>
<ROWSET>Table_Name="my_table"
Data_Source="my_data_source"
Refresh_Set="my set"
<xsl:apply-templates />
</ROWSET>
<xsl:text>&#10;</xsl:text>
</xml>
<xsl:text>&#10;</xsl:text>
</xsl:template>
<xsl:template match="z:row">
<ROW><PROGRAM_AREA> <xsl:value-of select="@c2"/> </PROGRAM_AREA><SUBPROGRAM_AREA> <xsl:value-of select="@c3"/> </SUBPROGRAM_AREA><REFERENCE_MASTER> <xsl:value-of select="@c4"/> </REFERENCE_MASTER><SITE_MASTER> <xsl:value-of select="@c5"/> </SITE_MASTER><SITE_MSTR_BRIEF_NM> <xsl:value-of select="@c6"/> </SITE_MSTR_BRIEF_NM><ACTIVITY_CATEGORY><xsl:value-of select="@c7"/></ACTIVITY_CATEGORY><EQUIPMENT_CATEGORY><xsl:value-of select="@c8"/></EQUIPMENT_CATEGORY><REF_MASTER_REQUIREMENT_BRIEF><xsl:value-of select="@c9"/></REF_MASTER_REQUIREMENT_BRIEF><APPLIED_REQUIREMENT> <xsl:value-of select="@c10"/> </APPLIED_REQUIREMENT><SITE_NAME>  <xsl:value-of select="@c11"/></SITE_NAME> </ROW>
<xsl:text>&#10;</xsl:text>
</xsl:template>
</xsl:stylesheet>

and here is the java code that applies the above xsl file to the xml report and produces a new file:

import java.io.File;
import javax.xml.transform.Result;
import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class TransformXml {

    /**
     * Accept three command line arguments: the name of an XML
     * file, the name of an XSLT stylesheet, and the output file.
     * The result of the transformation is written to the output file.
     */
    public static void main(String[] args) throws TransformerException {
        if (args.length != 3) {
            System.err.println("Usage:");
            System.err.println("  java " + TransformXml.class.getName()
                    + " xmlFileName xsltFileName outputFileName");
            System.exit(1);
        }

        File xmlFile = new File(args[0]);
        File xsltFile = new File(args[1]);
        File outputFile = new File(args[2]);

        // StreamSource(File) hands the parser raw bytes, so the parser works
        // out the encoding itself (UTF-8 unless the file says otherwise).
        Source xmlSource = new StreamSource(xmlFile);
        Source xsltSource = new StreamSource(xsltFile);
        Result result = new StreamResult(outputFile);

        // create an instance of TransformerFactory
        TransformerFactory transFact = TransformerFactory.newInstance();

        // compile the stylesheet, then apply it to the XML report
        Transformer trans = transFact.newTransformer(xsltSource);
        trans.transform(xmlSource, result);
    }
}

if you need anything else let me know.
0
 
LVL 1

Author Comment

by:benpung
ID: 11112286
i just did a manual work around to find the values in the front end and warn the users not to be entering things by copy and pasting, or at least if they do, to verify that they are not getting characters that they don't know how to enter from the keyboard (b/c they don't use the ALT + XXXX to get these characters).  you did well worth 50 pts of effort trying to help so points to you. thx.
0
