howesd

JDOM, UTF-8 and the Pound Sterling character

Can anyone give any advice on how to pass the pound sterling ( £ ) sign in to JDOM? I receive a message stream from the Microsoft environment ( it's an ASP page ) which looks like a valid XML document, but it always fails as JDOM tries to build up its document.

The input message has a valid XML header which says that it's in UTF-8 format, but when JDOM tries to parse the message it fails with the following message:

org.jdom.JDOMException: Error in building: The data "The Deal Consideration of ?.00 is greater than currently permitted in Rapier for this Channel.  Rule: 17490" is not legal for a JDOM attribute: 0xA3 is not a legal XML character.
        at org.jdom.input.SAXBuilder.build(SAXBuilder.java:373)
        at org.jdom.input.SAXBuilder.build(SAXBuilder.java:673)
        at SPAR.RapierRequest.doRequest(RapierRequest.java:107)
        at SPAR.RapierDeal.DoDeal(RapierDeal.java:101)
        at SPAR.RapierDeal.<init>(RapierDeal.java:35)
        at SPAR.DealExecute.CallRapier(DealExecute.java:186)
        at SPAR.DealExecute.<init>(DealExecute.java:53)
        at SPAR.DealRequestInterface.<init>(DealRequestInterface.java:51)
        at SPAR.OrchidRouter.doPost(OrchidRouter.java:170)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:760)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:853)
        at org.apache.tomcat.core.ServletWrapper.handleRequest(ServletWrapper.java:503)
        at org.apache.tomcat.core.ContextManager.service(ContextManager.java:559)
        at org.apache.tomcat.service.http.HttpConnectionHandler.processConnection(HttpConnectionHandler.java:160)


If we change the encoding of the message we get the same error but a different Hex value being reported.

What's more confusing is that if I post information in to JDOM from a browser ( IE 5 ) it quite happily deals with the £ sign.

I'm _very_ confused by this......

Dave
CEHJ

Try substituting the &pound; entity for the pound sign. Soon ASCII entities won't be valid anyway.
heyhey_

IMHO "0xA3" is not a valid UTF8 symbol.

what code do you use for parsing ?
 You could also try to substitute the '£' with the entity:

 &#163; which is the decimal representation of the '£'.

  Hope it helps.
howesd

ASKER

HeyHey

I could be wrong about the 0xA3 ..... I was doing it from memory - but basically you're just agreeing with my parser - it thinks 0xA3 isn't legal as well!

My code is as follows ( snippets )

    Document docResponse = null;

    // Build the rapier address details.
    String rapier_addr = SCDets.getString("rapier_xml_ip_address");
    String rapier_port = SCDets.getString("rapier_xml_host_port");
    String rapier_page = "/" + SCDets.getString("rapier_xml_default_page");

    int int_rapier_port = new Integer(rapier_port.trim()).intValue();

    URL u = new URL("http", rapier_addr.trim(), int_rapier_port, rapier_page);

    // Build the http connection and set some properties on it.
    URLConnection uc = u.openConnection();
    HttpURLConnection http = (HttpURLConnection) uc;

    http.setDoOutput(true);
    http.setDoInput(true);
    http.setRequestMethod("POST");
    String authString = SCDets.getString("rapier_xml_user_password");
    String auth = "Basic " + new sun.misc.BASE64Encoder().encode(authString.getBytes());
    logger.debug("Auth String = " + auth);
    logger.debug(RapierLogger.toString());

    http.setRequestProperty("Proxy-Authorization", auth);

    OutputStream out = http.getOutputStream();
    OutputStreamWriter wout = new OutputStreamWriter(out);

    XMLOutputter xOut = new XMLOutputter();

    // Rapier doesn't work if you send it encoding and xml declaration stuff.
    xOut.setOmitEncoding(true);
    xOut.setOmitDeclaration(true);

    xOut.output(RapierMessage, wout);

    wout.flush();
    wout.close();

    RapierLogger.info(xOut.outputString(RapierMessage));

    SAXBuilder builder = new SAXBuilder();
    docResponse = builder.build(http.getInputStream());

    RapierLogger.info(new XMLOutputter().outputString(docResponse));

    // Close the http resources as soon as possible.
    http.disconnect();

....

My code sends its message out OK ( which doesn't include any £ signs ) but fails on the docResponse = builder.build(http.getInputStream()) statement. From other places on the net I've seen discussions about the £ sign saying "The UTF8 representation of the character 163 ( the pound sterling symbol ) is a sequence of 2 bytes 0xC2 and 0xA3". It looks to me like the sending application is not encoding the stream correctly, but the first line of the message I receive ( <?xml version="1.0" encoding="UTF-8"?>^M ) indicates that it is UTF-8 encoded.
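The byte-level claim quoted above is easy to check. The following is a minimal sketch, not part of the original code: the class name PoundBytes and the helper hex are made up for illustration, and StandardCharsets assumes Java 7 or later. It prints the bytes that the pound sign turns into under UTF-8 and under ISO-8859-1.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: shows how U+00A3 (the pound sign) is encoded.
final class PoundBytes {

    /** Formats bytes as space-separated upper-case hex. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to 0..255 so negative bytes do not print as FFFFFFxx.
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String pound = "\u00A3";
        // UTF-8 needs two bytes for U+00A3.
        System.out.println("UTF-8:      " + hex(pound.getBytes(StandardCharsets.UTF_8)));      // C2 A3
        // ISO-8859-1 uses the single byte 0xA3.
        System.out.println("ISO-8859-1: " + hex(pound.getBytes(StandardCharsets.ISO_8859_1))); // A3
    }
}
```

A lone 0xA3 byte in a stream that is declared as UTF-8 is therefore what one would expect from a sender that writes in a single-byte encoding such as ISO-8859-1 while still announcing UTF-8 in the header.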

It beats me .....





howesd

ASKER

girionis

I don't really have any control over the message stream which I receive. It comes from a different application which the developers are a little wary of altering.

Dave
 Could you not search and replace the pound symbol with the decimal entity when you receive the stream of data (and before you do any XML processing)?
to summarize:

- you have a stream with unknown encoding (that claims to be UTF8 encoded);
- your parser has problems with that stream, because of the encoding;

the only possible solution is to "fix" the stream, i.e.
1. fix the .asp that generates it (not possible ?)
2. "guess" the input stream encoding and convert it to UTF8
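Option 2 can be sketched as follows. This is only an illustration under stated assumptions: the real encoding is guessed to be ISO-8859-1, the class name MislabelledXml and the method parseAs are hypothetical, and the JDK's built-in JAXP parser is used instead of JDOM so the example is self-contained. The idea is to decode the bytes yourself and hand the parser a Reader; a SAX parser that is given a character stream reads it directly and disregards the encoding named in the XML declaration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: parse XML whose declared encoding is wrong.
final class MislabelledXml {

    /** Decodes the stream with the charset we believe is correct, then parses the characters. */
    static Document parseAs(InputStream in, Charset actual) {
        try {
            // Decoding happens here, so the parser never sees the raw bytes.
            Reader reader = new InputStreamReader(in, actual);
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(reader));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse response", e);
        }
    }

    public static void main(String[] args) {
        // Simulate the sender: the header says UTF-8, the bytes are ISO-8859-1.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><msg text=\"\u00A3100.00\"/>";
        byte[] wire = xml.getBytes(StandardCharsets.ISO_8859_1);

        Document doc = parseAs(new ByteArrayInputStream(wire), StandardCharsets.ISO_8859_1);
        System.out.println(doc.getDocumentElement().getAttribute("text"));
    }
}
```

JDOM's SAXBuilder also has a build overload that accepts a Reader, so the same approach should carry over to the code in the question by wrapping http.getInputStream() in an InputStreamReader. The weak point is the guess itself: if the sender ever switches to genuine UTF-8, a hard-coded ISO-8859-1 reader will silently garble every non-ASCII character.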
For that matter, are you sure the Rapier page is valid XML?
howesd

ASKER

I'm pretty certain that the Rapier page is valid XML - there are lots of other applications in the organisation which can process its output. However, I think these are all using the Microsoft parser to read the messages, so I'm beginning to think that it's an example of where MS haven't stuck to the standards but their tools all work as they all fail to conform in the same way.

The developers of the other side of the interface have changed their code to remove the £ sign and I am now able to process their messages happily, but I'm fairly certain that there will be other characters that we haven't yet come across which will cause the interface to blow up.

I'm going to leave this question open for a couple of days if you don't mind to see if there's any other input.

Dave
>>I'm beginning to think that it's  an example of where MS haven't stuck to the standards but their tools all work as they all fail to conform in the same way.

I'm certainly no apologist for M$, but of course, the reasons there are no problems otherwise could be merely to do with intelligent, defensive coding in their parsers [ugh it's painful to say that :-)].

>>changed their code to remove the £ sign

What have they put in instead?

>>but I'm fairly certain that there will be other characters that we haven't yet come across which will cause the interface to blow up

Any byte > 0x7F should do it!

Somebody's asked me to post something I wrote to 'un-pretty-print' html, which I can adapt slightly so that *you* can code defensively against bad input if you want ...
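One way to code defensively against this kind of input is to check whether the received bytes really are well-formed UTF-8 before handing them to the parser. The sketch below is an assumption about how such a check could look, not the code referred to above; the class name Utf8Check and the method isValidUtf8 are hypothetical. It relies on the standard CharsetDecoder with malformed input set to REPORT, so bad bytes raise an exception instead of being replaced.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: strict UTF-8 validation of a byte array.
final class Utf8Check {

    /** Returns true only if every byte sequence in data is well-formed UTF-8. */
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            // A stray byte such as a lone 0xA3 ends up here.
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] properPound = {(byte) 0xC2, (byte) 0xA3};
        byte[] barePound = {(byte) 0xA3};
        System.out.println(isValidUtf8(properPound)); // true
        System.out.println(isValidUtf8(barePound));   // false
    }
}
```

If the check fails, the receiver can fall back to decoding with a guessed single-byte charset, or reject the message with a clear error rather than a parser stack trace.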
>>IMHO "0xA3" is not a valid UTF8 symbol

Quite right heyhey.
 I had the same problem as well. This is due to M$ not conforming to the standards and screwing everything up. I could read the stream of data but could not display the characters properly on the screen, so I wrote a little method that replaces all the M$ special characters with their corresponding decimal entity:

// Replace characters in the range 128-160, and anything from 256 upwards,
// with decimal character references; all other characters pass through.
char[] characters = streamOfData.toCharArray();
int arrayLength = characters.length;
StringBuffer chars = new StringBuffer();
for (int i = 0; i < arrayLength; i++)
{
    int code = characters[i];
    if ((128 <= code && code <= 160) || code >= 256)
        chars.append("&#" + code + ";");
    else
        chars.append(characters[i]);
}

  Maybe it could help you as well.
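Note that the range test in the snippet above skips 161 to 255, so the pound sign (163) would pass through unescaped. A variant that escapes every non-ASCII code point might look like the sketch below. The class name NumericEntities and the method escapeNonAscii are hypothetical, and the approach assumes the text has already been decoded from bytes with the correct charset; it does nothing for a stream that cannot be decoded in the first place.

```java
// Hypothetical helper: replace every non-ASCII character with a decimal
// numeric character reference, e.g. the pound sign becomes &#163;.
final class NumericEntities {

    static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            // Work on code points so surrogate pairs become one reference.
            int cp = text.codePointAt(i);
            if (cp > 0x7F) {
                out.append("&#").append(cp).append(';');
            } else {
                out.append((char) cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeNonAscii("Cost: \u00A3100.00")); // Cost: &#163;100.00
    }
}
```

Because the result is pure ASCII, it reads the same whether the receiver treats it as UTF-8, ISO-8859-1 or windows-1252, which is why numeric references are a common workaround when the transport encoding cannot be trusted.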
>>This is due to M$ not conforming to the standards and screwing everything up

But in howesd's case, I would guess that the problem is caused by improper coding of the source document. There shouldn't be any raw pound sterling characters appearing in the source. Wouldn't be a problem for me - I haven't got one on this keyboard, hence the words :-)
 In the past I never had problems with the pound symbol doing XML processing (either DOM or SAX) and it was there, the raw pound symbol instead of the &pound; entity or the &#163; one. But you can never be sure... Even a different locale on the computer can have disastrous results.

  The only thing left is to try using the pound entity and see what happens.
howesd

ASKER

I did some experimenting on this, writing a Java application which would replace the "sending" application that's giving me the problem. What I found was that if I created a JDOM Document element, told it to set the encoding to UTF-8 and put a £ in the document, I would still get the problem of not being able to parse the document in my receiving application. However, when I told the output stream writer that I'm sending the document through that it had to use UTF-8 encoding, it all started working correctly.

This is leading me to believe that the ASP is creating a document and putting the UTF-8 header at the top, but the actual transport mechanism which is sending the document out doesn't do any encoding.
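The experiment described above can be reproduced in a few lines. This is a sketch under assumptions rather than the actual test program: the class name HeaderVersusBytes and its methods are hypothetical, the JDK's JAXP parser stands in for JDOM, and ISO-8859-1 is passed explicitly to stand in for a platform default encoding that is not UTF-8. The same document text, with the same UTF-8 header, is written through an OutputStreamWriter twice; only the writer's charset differs.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;

// Hypothetical demo: the XML header claims UTF-8 in both cases,
// but only one writer actually produces UTF-8 bytes.
final class HeaderVersusBytes {

    static final String XML =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?><msg>\u00A3100.00</msg>";

    /** Writes text through an OutputStreamWriter using the given charset. */
    static byte[] write(String text, Charset charset) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(sink, charset)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    /** Returns true if the bytes parse as XML when read as a raw byte stream. */
    static boolean parses(byte[] wire) {
        try {
            DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(wire));
            return true;
        } catch (SAXException | IOException e) {
            // Malformed UTF-8 is reported here by the parser.
            return false;
        }
        catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("ISO-8859-1 writer: " + parses(write(XML, StandardCharsets.ISO_8859_1)));
        System.out.println("UTF-8 writer:      " + parses(write(XML, StandardCharsets.UTF_8)));
    }
}
```

The declaration at the top of a document is only a label. What the parser actually receives is decided by whichever component turns characters into bytes, which is consistent with the suspicion that the ASP page writes the UTF-8 header but sends the body in a different encoding.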

And in answer to CEHJ's question, they've just taken the £ sign out of the message and not replaced it with anything at all.

Dave
ASKER CERTIFIED SOLUTION
CEHJ

No comment has been added lately, so it's time to clean up this TA.

I will leave a recommendation in the Cleanup topic area that this question is:

- To be deleted and points NOT refunded

Please leave any comments here within the next seven days.

PLEASE DO NOT ACCEPT THIS COMMENT AS AN ANSWER !

girionis
Cleanup Volunteer
howesd

ASKER

I meant to accept CEHJ's comment as an answer a long time ago, purely on the strength of his excellent joke :)
 Better late than never :-)
;-)