Solved

JDOM, UTF-8 and the Pound Sterling character

Posted on 2002-07-02
Last Modified: 2012-05-04
Can anyone give any advice on how to pass the pound sterling (£) sign into JDOM? I receive a message stream from the Microsoft environment (it's an ASP page) which looks like a valid XML document, but it always fails as JDOM tries to build its document.

The input message has a valid XML header which says that it's in UTF-8 format, but when JDOM tries to parse the message it fails with the following message:

org.jdom.JDOMException: Error in building: The data "The Deal Consideration of ?
.00 is greater than currently permitted in Rapier for this Channel.  Rule: 17490
" is not legal for a JDOM attribute: 0xA3 is not a legal XML character.
        at org.jdom.input.SAXBuilder.build(SAXBuilder.java:373)
        at org.jdom.input.SAXBuilder.build(SAXBuilder.java:673)
        at SPAR.RapierRequest.doRequest(RapierRequest.java:107)
        at SPAR.RapierDeal.DoDeal(RapierDeal.java:101)
        at SPAR.RapierDeal.<init>(RapierDeal.java:35)
        at SPAR.DealExecute.CallRapier(DealExecute.java:186)
        at SPAR.DealExecute.<init>(DealExecute.java:53)
        at SPAR.DealRequestInterface.<init>(DealRequestInterface.java:51)
        at SPAR.OrchidRouter.doPost(OrchidRouter.java:170)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:760)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:853)
        at org.apache.tomcat.core.ServletWrapper.handleRequest(ServletWrapper.java:503)
        at org.apache.tomcat.core.ContextManager.service(ContextManager.java:559)
        at org.apache.tomcat.service.http.HttpConnectionHandler.processConnection(HttpConnectionHandler.java:160)


If we change the encoding of the message we get the same error, but with a different hex value being reported.

What's more confusing is that if I post information into JDOM from a browser (IE 5) it quite happily deals with the £ sign.

I'm _very_ confused by this......

Dave
Question by:howesd
23 Comments
 
LVL 86

Expert Comment

by:CEHJ
Try substituting the &pound; entity for the pound sign. Soon ASCII entities won't be valid anyway.
0
 
LVL 16

Expert Comment

by:heyhey_
IMHO "0xA3" is not a valid UTF8 symbol.

What code do you use for parsing?
0
 
LVL 35

Expert Comment

by:girionis
 You could also try to substitute the '£' with the entity:

 &#163; which is the decimal representation of the '£'.

  Hope it helps.
0
 
LVL 1

Author Comment

by:howesd
HeyHey

I could be wrong about the 0xA3 ..... I was doing it from memory - but basically you're just agreeing with my parser - it thinks 0xA3 isn't legal as well!

My code is as follows (snippets):

    Document docResponse = null;

    // Build the Rapier address details.
    String rapier_addr = SCDets.getString("rapier_xml_ip_address");
    String rapier_port = SCDets.getString("rapier_xml_host_port");
    String rapier_page = "/" + SCDets.getString("rapier_xml_default_page");

    int int_rapier_port = Integer.parseInt(rapier_port.trim());

    URL u = new URL("http", rapier_addr.trim(), int_rapier_port, rapier_page);

    // Build the http connection and set some properties on it.
    URLConnection uc = u.openConnection();
    HttpURLConnection http = (HttpURLConnection) uc;

    http.setDoOutput(true);
    http.setDoInput(true);
    http.setRequestMethod("POST");
    String authString = SCDets.getString("rapier_xml_user_password");
    String auth = "Basic " + new sun.misc.BASE64Encoder().encode(authString.getBytes());
    logger.debug("Auth String = " + auth);
    logger.debug(RapierLogger.toString());

    http.setRequestProperty("Proxy-Authorization", auth);

    OutputStream out = http.getOutputStream();
    OutputStreamWriter wout = new OutputStreamWriter(out);

    XMLOutputter xOut = new XMLOutputter();

    // Rapier doesn't work if you send it encoding and xml declaration stuff.
    xOut.setOmitEncoding(true);
    xOut.setOmitDeclaration(true);

    xOut.output(RapierMessage, wout);

    wout.flush();
    wout.close();

    RapierLogger.info(xOut.outputString(RapierMessage));

    // This is the statement that fails when the response contains a £ sign.
    SAXBuilder builder = new SAXBuilder();
    docResponse = builder.build(http.getInputStream());

    RapierLogger.info(new XMLOutputter().outputString(docResponse));

    // Close the http resources as soon as possible.
    http.disconnect();

....

My code sends its message out OK (it doesn't include any £ signs) but fails on the docResponse = builder.build(http.getInputStream()) statement. From other places on the net I've seen discussions about the £ sign saying "The UTF8 representation of the character 163 ( the pound sterling symbol ) is a sequence of 2 bytes 0xC2 and 0xA3". It looks to me like the sending application is not encoding the stream successfully, but the first line of the message I receive ( <?xml version="1.0" encoding="UTF-8"?>^M ) indicates that it is UTF-8 encoded.
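The two-byte claim quoted above is easy to check. The following is only a small sketch (the class name PoundBytes and the hex helper are made up for illustration) that prints the bytes Java produces for the pound sign under UTF-8 and under ISO-8859-1:

```java
import java.nio.charset.StandardCharsets;

public class PoundBytes {
    // Render a byte array as space-separated upper-case hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String pound = "\u00A3"; // the pound sterling sign, character 163
        // prints "UTF-8:      C2 A3"
        System.out.println("UTF-8:      " + hex(pound.getBytes(StandardCharsets.UTF_8)));
        // prints "ISO-8859-1: A3"
        System.out.println("ISO-8859-1: " + hex(pound.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```

A lone 0xA3 byte in the stream is therefore what a single-byte Latin encoding would produce, not what UTF-8 would produce, which fits the theory that the sender labels the document as UTF-8 without actually encoding it that way.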

It beats me .....





0
 
LVL 1

Author Comment

by:howesd
girionis

I don't really have any control over the message stream which I receive. It comes from a different application which the developers are a little wary of altering.

Dave
0
 
LVL 35

Expert Comment

by:girionis
 Could you not search and replace the pound symbol with the decimal entity when you receive the stream of data (and before you do any XML processing)?
0
 
LVL 16

Expert Comment

by:heyhey_
to summarize:

- you have a stream with unknown encoding (that claims to be UTF8 encoded);
- your parser has problems with that stream, because of the encoding;

the only possible solution is to "fix" the stream, i.e.
1. fix the .asp that generates it (not possible ?)
2. "guess" the input stream encoding and convert it to UTF8
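
Option 2 could be sketched roughly as follows. This is only an illustration under assumptions: the class and method names are hypothetical, and ISO-8859-1 as the fallback is a guess about what the ASP page really sends (windows-1252 would be another candidate). The idea is to decode strictly as UTF-8 first and only fall back when the bytes are malformed:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class EncodingGuess {
    // Decode strictly as UTF-8; if the bytes are not valid UTF-8, decode them
    // with the fallback charset instead (the fallback is an assumption about
    // what the sender actually used).
    static String decode(byte[] raw, Charset fallback) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
        } catch (CharacterCodingException e) {
            return new String(raw, fallback);
        }
    }

    public static void main(String[] args) {
        // Simulate the suspect stream: a raw 0xA3 byte in a "UTF-8" document.
        byte[] suspect = "<msg text=\"\u00A35.00\"/>".getBytes(StandardCharsets.ISO_8859_1);
        String text = decode(suspect, StandardCharsets.ISO_8859_1);
        System.out.println(text);
    }
}
```

The decoded String could then be handed to the parser through a Reader (for example a StringReader passed to SAXBuilder.build), in which case the parser should work from the characters rather than from the declared encoding. Reading the whole response into a byte array first is assumed to be acceptable for messages of this size.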
0
 
LVL 86

Expert Comment

by:CEHJ
For that matter, are you sure the Rapier page is valid XML?
0
 
LVL 1

Author Comment

by:howesd
I'm pretty certain that the Rapier page is valid XML - there are lots of other applications in the organisation which can process its output. However, I think these are all using the Microsoft parser to read the messages, so I'm beginning to think that it's an example of where MS haven't stuck to the standards but their tools all work as they all fail to conform in the same way.

The developers of the other side of the interface have changed their code to remove the £ sign and I am now able to process their messages happily, but I'm fairly certain that there will be other characters that we haven't yet come across which will cause the interface to blow up.

I'm going to leave this question open for a couple of days if you don't mind to see if there's any other input.

Dave
0
 
LVL 86

Expert Comment

by:CEHJ
>>I'm beginning to think that it's  an example of where MS haven't stuck to the standards but their tools all work as they all fail to conform in the same way.

I'm certainly no apologist for M$, but of course, the reasons there are no problems otherwise could be merely to do with intelligent, defensive coding in their parsers [ugh it's painful to say that :-)].

>>changed their code to remove the £ sign

What have they put in instead?

>>but I'm fairly certain that there will be other characters that we haven't yet come across which will cause the interface to blow up

Anything > 0x7F should do it!

Somebody's asked me to post something I wrote to 'un-pretty-print' html, which I can adapt slightly so that *you* can code defensively against bad input if you want ...
0
 
LVL 86

Expert Comment

by:CEHJ
>>IMHO "0xA3" is not a valid UTF8 symbol

Quite right heyhey.
0
 
LVL 35

Expert Comment

by:girionis
 I had the same problem as well. This is due to M$ not conforming to the standards and screwing everything up. The thing is that I could read the stream of data but I could not display the characters properly on the screen and I wrote a little method that replaces all the M$ special characters with their corresponding decimal entity:

// Replace characters in the 128-160 range, and anything from 256 upwards,
// with their decimal entities; everything else is copied through unchanged.
char[] characters = streamOfData.toCharArray();
StringBuilder chars = new StringBuilder(characters.length);
for (char c : characters)
{
    if ((c >= 128 && c <= 160) || c >= 256)
        chars.append("&#").append((int) c).append(';');
    else
        chars.append(c);
}

  Maybe it could help you as well.
0
 
LVL 86

Expert Comment

by:CEHJ
>>This is due to M$ not conforming to the standards and screwing everything up

But in howesd's case, I would guess that the problem is caused by improper coding of the source document. There shouldn't be any raw pound sterling characters appearing in the source. Wouldn't be a problem for me - I haven't got one on this keyboard, hence the words :-)
0
 
LVL 35

Expert Comment

by:girionis
 In the past I never had problems with the pound symbol doing XML processing (either DOM or SAX) and it was there, the raw pound symbol instead of the &pound; entity or the &#163; one. But you can never be sure... Even a different locale on the computer can have disastrous results.

  The only thing left is to try using the pound entity and see what happens.
0
 
LVL 1

Author Comment

by:howesd
I did some experimenting on this, writing a Java application which would replace the "sending" application that's giving me the problem. What I found was that if I created a JDOM Document element, told it to set the encoding to UTF-8 and put a £ in the document, I would still get the problem of not being able to parse the document in my receiving application. However, when I told the output stream writer that I'm sending the document through that it had to use UTF-8 encoding, it all started working correctly.

This is leading me to believe that the ASP is creating a document and putting the UTF-8 header at the top, but the actual transport mechanism which is sending the document out doesn't do any encoding.
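
The experiment can be approximated in a few lines. This is a sketch only: it uses the JDK's built-in DOM parser as a stand-in for JDOM's SAXBuilder, and the class name and sample document are invented for illustration. The same text, declared as UTF-8, is written through an OutputStreamWriter with two different charsets and then parsed:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;

public class WriterCharsetDemo {
    // The declaration claims UTF-8 regardless of how the bytes are written.
    static final String XML =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?><msg text=\"\u00A35.00\"/>";

    // Write the document through an OutputStreamWriter using the given charset.
    static byte[] write(Charset charset) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, charset)) {
            w.write(XML);
        }
        return out.toByteArray();
    }

    // True if the JDK's built-in parser accepts the bytes.
    static boolean parses(byte[] bytes) {
        try {
            DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bytes));
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("written as UTF-8:      " + parses(write(StandardCharsets.UTF_8)));
        System.out.println("written as ISO-8859-1: " + parses(write(StandardCharsets.ISO_8859_1)));
    }
}
```

The point of the sketch is that the encoding named in the XML declaration and the charset used by the writer are set independently; an OutputStreamWriter created without an explicit charset uses the platform default, which may not match the declaration.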

And in answer to CEHJ's question, they've just taken the £ sign out of the message and not replaced it with anything at all.

Dave
0
 
LVL 86

Accepted Solution

by:
CEHJ earned 300 total points
>>
This is leading me to believe that the ASP is creating a document and putting the UTF-8 header at the top, but the actual transport mechanism which is sending the document out doesn't do any encoding.
>>

I think you're probably right. The Rapier men's points are probably blunter than they make out :-)
0
 
LVL 35

Expert Comment

by:girionis
No comment has been added lately, so it's time to clean up this TA.

I will leave a recommendation in the Cleanup topic area that this question is:

- To be deleted and points NOT refunded

Please leave any comments here within the
next seven days.

PLEASE DO NOT ACCEPT THIS COMMENT AS AN ANSWER !

girionis
Cleanup Volunteer
0
 
LVL 1

Author Comment

by:howesd
I meant to accept CEHJ's comment as an answer a long time ago, purely on the strength of his excellent joke :)
0
 
LVL 35

Expert Comment

by:girionis
 Better late than never :-)
0
 
LVL 86

Expert Comment

by:CEHJ
;-)
0
