
What is the default WML character encoding?

janegil asked
Many WAP sites (e.g. http://wap.mbl.is/) use ISO-8859-1 as their default WML encoding. But WML is XML, and XML uses UTF-8 as the default encoding unless the MIME type says otherwise.

So my question is: Does the MIME type "text/vnd.wap.wml" imply that ISO-8859-1 is the default character encoding?

From http://www.w3.org/TR/REC-xml#NT-EncName :
In the absence of information provided by an external transport protocol (e.g. HTTP or MIME), it is an error [...] for an entity which begins with neither a Byte Order Mark nor an encoding declaration to use an encoding other than UTF-8.

Commented:
Strictly speaking, "text/vnd.wap.wml" implies that the body of the message is an XML data stream. This stream has an <?xml?> declaration, and if that does NOT contain an encoding attribute the data MUST be in UTF-8.

The problem is that for HTML the body of the response is HTML "encoded" in a particular character set. This charset is determined by a "charset=" parameter on the Content-Type: text/... header. The body is then "decoded" into 31-bit Unicode, with entities (ampersand number semicolon sequences) expanded, to get the HTML.

The problem is:
   1. HTTP defaults the body to ISO-8859-1 if NO charset is present,
   2. XML defaults the data to UTF-8 if NO encoding is present.

So if you omit BOTH, it is probably anybody's guess what the result will be.
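To make the ambiguity concrete, here is a minimal Python sketch of how a client might pick an encoding. The function names are made up for illustration, and the precedence (charset parameter first, then the XML declaration) is an assumption, not something any WAP browser is known to do. It also ignores byte order marks.

```python
import re
from email.message import Message

# Matches the encoding pseudo-attribute of a leading XML declaration.
XML_DECL_ENCODING = re.compile(
    rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

def charset_param(content_type):
    """Return the lowercased charset= parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()

def declared_encoding(body):
    """Return the lowercased encoding named in the <?xml?> declaration, or None."""
    match = XML_DECL_ENCODING.match(body)
    return match.group(1).decode("ascii").lower() if match else None

def candidate_encodings(content_type, body):
    """List the encodings a client could defensibly assume for this response."""
    charset = charset_param(content_type)
    if charset:
        return [charset]          # assumed: an explicit charset= wins
    declared = declared_encoding(body)
    if declared:
        return [declared]         # otherwise the XML declaration decides
    # Neither is present: the XML default and the HTTP text/* default disagree.
    return ["utf-8", "iso-8859-1"]
```

When both are omitted the function returns two candidates, which is exactly the "anybody's guess" case.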

I suggest that you ALWAYS send the data encoded in UTF-8 and DON'T put a charset= parameter on the Content-Type: header. That's what I do and I don't have any problems with it.

Author

Commented:
Yes, I do specify my encoding.

But I was looking for a paragraph I could use, to strike down those who don't.

Seems they're untouchable: if they use ISO-8859-1, they can rightly claim to be using the HTTP default for text/* subtypes, and if they use UTF-8, they can claim to be using the XML default.

So I'll have to fall back on the argument that the page doesn't work, which may be a harder-hitting argument anyway.

Commented:
I have found that for the Nokia toolkit I MUST specify charset=utf-8 on the Content-Type: header when fetching via HTTP.
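For illustration, a small Python sketch of a response that states its encoding explicitly. The helper name `wml_response` and the dictionary shape are hypothetical, and this is not tied to any particular server framework.

```python
def wml_response(deck):
    """Encode a WML deck as UTF-8 and build headers that say so explicitly."""
    body = deck.encode("utf-8")
    headers = {
        # The charset parameter removes any need for the client to guess.
        "Content-Type": "text/vnd.wap.wml; charset=utf-8",
        # Length is counted in bytes after encoding, not in characters.
        "Content-Length": str(len(body)),
    }
    return headers, body
```

Note that the Content-Length has to be computed from the encoded bytes, since non-ASCII characters take more than one byte in UTF-8.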

I have had all sorts of problems assuming that the default type for text/* is ISO-8859-1. It seems that the browser makes this "iso-8859-browser". To overcome that problem I "invented" 7-bit HTML. I send all characters whose Unicode value is below 128 "as-is" and all other characters as an ampersand number semicolon sequence. I now have no problem displaying Hebrew in the Russian version of IE.

One could theoretically do this in WML, since UTF-8 only diverges from ASCII at 128 and above, and all the WML control characters (which are XML control characters) lie in the ASCII area. So escape ALL characters at or above 128 and the encoding will be irrelevant.
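A minimal Python sketch of that 7-bit approach, assuming the markup itself (`<`, `>`, `&`) has already been written or escaped correctly. The function name `to_seven_bit` is made up for this example.

```python
def to_seven_bit(text):
    """Return bytes in which every character at or above 128 is a numeric character reference."""
    # "xmlcharrefreplace" turns each non-ASCII character into &#NNNN; (decimal).
    # It does not touch '&' or '<', so markup passes through unchanged.
    return text.encode("ascii", "xmlcharrefreplace")


hebrew = "\u05e9\u05dc\u05d5\u05dd"  # the Hebrew word "shalom"
payload = to_seven_bit("<p>" + hebrew + "</p>")

# The bytes are pure ASCII, so they decode identically under either default.
assert payload.decode("utf-8") == payload.decode("iso-8859-1")
```

Because the output contains only bytes below 128, a client that assumes UTF-8 and one that assumes ISO-8859-1 end up with the same characters before the references are expanded.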