Find character encoding from locale

Is there a standard way to find the character encoding (for XML) given a locale?

For example, if the locale is 'en' or 'English' or similar, the character encoding should be 'iso-8859-1'. And if the locale is 'iw' or 'he' or 'Hebrew' or similar, the character encoding should be 'iso-8859-8'.

ahoffmann

Do you mean something like this in HTML:
<META HTTP-EQUIV="Content-Type" ...
for perl?

ASKER CERTIFIED SOLUTION

Sapa

yonat

ASKER

It's for an XML file I need to generate. The XML starts with:
<?xml version="1.0" encoding="iso-8859-1"?>

What I want to do is to figure out what encoding to write based on the current locale. I thought of just writing a big hash that maps locale names to encodings, but locale names vary wildly on different platforms. So I'm looking for a better way. Any ideas?

ahoffmann

> .. based on the current locale.
Which is "current", the writer or the viewer (through his browser)?

yonat

ASKER

Thanks for your answer, Sapa. I know many languages have more than one charset - I only need one that will work well, so it doesn't really matter that there are others.

Sapa

> So I'm looking for a better way. Any ideas?

sure. Use UTF-8.

Andrey

More precisely, you should already know the character set of every hardcoded string, and you should ask any other message source for its character set. If you don't know it, choosing a charset is pointless. If you do know it, pass the data through in the same charset or recode it to any allowed one, for example UTF-8.

yonat

ASKER

The server's locale, which I retrieve with setlocale(LC_CTYPE).

yonat

ASKER

Sapa, the locale is user-configured (or actually admin-configured). I prefer to make life a little easier for the user so that she will not have to configure the charset too - just the locale.

About utf8: will it always work? Or does it depend on how the data is actually written?

Sapa

> utf8: will it always work? Or does it depend on how the data is actually written?

UTF-8 is 8-bit Unicode Transformation Format. It aims to represent every character in every human language. So you can convert any textual data from any source into UTF-8 without loss. It also lets you create multilingual documents. BTW, UTF-8 is the default character set for XML. Perl has also had internal support for UTF-8 since v5.6.

But you need to know the source codeset of any textual data you want to convert to UTF-8. If you get data from the keyboard (standard input), there is no reliable way to find the codeset. Some Unix-like OSes (Linux, for example) have the C function nl_langinfo(CODESET), but others don't. You can also get the current codeset if the full locale name (LANG=he_IL.ISO-8859-8 instead of LANG=he_IL) is set in the environment.
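The two lookups mentioned above can be sketched as follows. This is a minimal illustration in Python rather than Perl; the function names codeset_from_locale_name and current_codeset are hypothetical, and the nl_langinfo call is guarded because it is not available on every platform.

```python
import locale


def codeset_from_locale_name(name):
    """Return the codeset part of a full locale name, or None if absent."""
    # Drop an optional modifier such as "@euro" first.
    base = name.split("@", 1)[0]
    if "." not in base:
        return None  # e.g. "he_IL" carries no codeset information
    return base.split(".", 1)[1] or None


def current_codeset():
    """Ask the C library for the codeset of the active LC_CTYPE locale."""
    try:
        locale.setlocale(locale.LC_CTYPE, "")  # adopt the environment's locale
    except locale.Error:
        return None  # the configured locale is not installed
    # nl_langinfo does not exist on every platform, so check before calling.
    if hasattr(locale, "nl_langinfo"):
        return locale.nl_langinfo(locale.CODESET)
    return None
```

For LANG=he_IL.ISO-8859-8 the first function yields ISO-8859-8; for LANG=he_IL it yields None, which is exactly the case where some other source of information is needed.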

ahoffmann

> UTF-8 is 8-bit Unicode Transformation Format. It aims to represent every character in every human language.
near correct. You mean UTF-16 instead :-)

Even though XML uses it as the default, the viewer must support it too, otherwise you get garbage.

yonat

ASKER

Thanks!
So if I blindly assume utf8, when will things go wrong? (I assume the client is MSIE5+ for the feature that uses this, but I don't know the OS. And the server has Perl 5.6 or better, but again no assumptions about the OS.)

Sapa

> > UTF-8 is 8-bit Unicode Transformation Format. It aims to represent every character in every human language.
> near correct. You mean UTF-16 instead :-)

No, I am absolutely correct. UTF-16 covers only a subset of the whole Universal Character Set (UCS) - BMP characters and non-BMP characters up to 0x10ffff - but UTF-8 covers the whole UCS as described in the ISO 10646 standard (up to 0x7fffffff). I think you were confused by the number '16', which is greater than '8' :-)
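As a small illustration of how the two encodings treat a non-BMP character, the sketch below (Python, with U+1D11E chosen arbitrarily as the example code point) encodes the same character both ways. UTF-8 uses four bytes, while UTF-16 uses a surrogate pair.

```python
# U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the BMP.
clef = "\U0001D11E"

utf8_bytes = clef.encode("utf-8")       # four bytes in UTF-8
utf16_bytes = clef.encode("utf-16-be")  # surrogate pair D834 DD1E in UTF-16

print(utf8_bytes.hex())   # f09d849e
print(utf16_bytes.hex())  # d834dd1e
```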

Sapa

> So if I blindly assume utf8, when will things go wrong?

It depends on your data sources. It will be OK if all the data you want to send to the user is in UTF-8. If your application is 'pure CGI' and you send your form in UTF-8 (with Content-Type: text/html; charset=utf-8), the browser should return the form data in UTF-8 too. If you read data from file, you should know what codeset was used when this file was written.
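Once the source codeset is known, the conversion itself is short. A minimal sketch in Python, where to_utf8 is a hypothetical helper name and the sample bytes are the Hebrew word "shalom" in ISO-8859-8:

```python
def to_utf8(raw, source_codeset):
    """Recode bytes from a known source codeset into UTF-8 bytes."""
    # Decoding fails loudly (UnicodeDecodeError) if the bytes do not
    # fit the claimed codeset, which is better than silent garbage.
    text = raw.decode(source_codeset)
    return text.encode("utf-8")


# Four Hebrew letters stored as ISO-8859-8 bytes.
hebrew_iso = b"\xf9\xec\xe5\xed"
hebrew_utf8 = to_utf8(hebrew_iso, "iso-8859-8")
```

The hard part remains knowing source_codeset; the function cannot guess it.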

ahoffmann

I was lazy, sorry Sapa.

yonat

ASKER

Thanks Sapa. You wrote: "If you read data from file, you should know what codeset was used when this file was written."

How?

Here is my situation:
- The admin specifies the locale.
- Users submit text in forms.
- The CGI script writes this text to files.

ahoffmann

Aha, slowly you're giving the information I requested very early on:

> Which is "current", the writer or the viewer (through his browser)?

So, IIRC you have a CGI (in Perl) which is fed by a browser.
Then you have to check $ENV{'HTTP_ACCEPT_CHARSET'} and take the first value; that's probably the one the user of the browser configured as the default.
With this value, you can convert the characters to UTF-8.
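A sketch of reading that header, in Python for illustration. The helper name preferred_charset is hypothetical; it goes slightly beyond "take the first value" by honouring q-values and using position only as a tie-break, and it falls back to a default when the header is missing, since clients are not required to send it.

```python
import os


def preferred_charset(accept_charset, default="ISO-8859-1"):
    """Pick the most preferred charset from an Accept-Charset header value."""
    if not accept_charset:
        return default  # header absent or empty
    candidates = []
    for position, item in enumerate(accept_charset.split(",")):
        parts = [p.strip() for p in item.split(";")]
        name = parts[0]
        if not name or name == "*":
            continue  # a wildcard names no concrete charset
        quality = 1.0
        for param in parts[1:]:
            if param.lower().startswith("q="):
                try:
                    quality = float(param[2:])
                except ValueError:
                    quality = 0.0
        if quality > 0:
            # Sort by highest quality first, then by earliest position.
            candidates.append((-quality, position, name))
    if not candidates:
        return default
    return min(candidates)[2]


# In a CGI script the header arrives as an environment variable.
charset = preferred_charset(os.environ.get("HTTP_ACCEPT_CHARSET"))
```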

yonat

ASKER

Unfortunately, I checked $ENV{'HTTP_ACCEPT_CHARSET'} and it does not exist. It seems the client does not always pass this information. (When does it?)

ahoffmann

quote from RFC2616:

...
The "charset" parameter is used with some media types to define the
character set (section 3.4) of the data. When no explicit charset
parameter is provided by the sender, media subtypes of the "text"
type are defined to have a default charset value of "ISO-8859-1" when
received via HTTP. Data in character sets other than "ISO-8859-1" or
its subsets MUST be labeled with an appropriate charset value. See
section 3.4.1 for compatibility problems.
...
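The default described in the quote can be applied when reading a Content-Type value. A minimal Python sketch, assuming the caller only passes "text" media types; charset_of is a hypothetical helper name.

```python
def charset_of(content_type, default="ISO-8859-1"):
    """Return the charset parameter of a Content-Type value, or the default."""
    # Everything after the first ";" is a list of parameters.
    for param in content_type.split(";")[1:]:
        key, _, value = param.partition("=")
        value = value.strip().strip('"')  # the value may be quoted
        if key.strip().lower() == "charset" and value:
            return value
    # No explicit charset: fall back to the default for text types.
    return default
```

For example, "text/html; charset=utf-8" gives utf-8, while a bare "text/html" gives ISO-8859-1.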

yonat

ASKER

Thanks. I just realized this whole thing may be related to another phenomenon so I'm going to investigate it further, and get back here later.

yonat

ASKER

Okay - this didn't solve the other problem, but anyway:
Thanks for all your help and patience. In the end I decided to make my own locale-to-charset mapping, and let the admin override the charset if they want to.

ahoffmann, I am adding a new question for you - please submit an answer there.
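A locale-to-charset mapping with an admin override could look roughly like the sketch below. It is written in Python for illustration and is not the asker's actual code. The table only holds the pairs named in the question, and the utf-8 fallback and all function names are assumptions.

```python
# Language part of a locale name -> charset (entries from the question).
LANGUAGE_CHARSETS = {
    "en": "iso-8859-1",
    "english": "iso-8859-1",
    "he": "iso-8859-8",
    "iw": "iso-8859-8",
    "hebrew": "iso-8859-8",
}


def charset_for_locale(locale_name, override=None, default="utf-8"):
    """Map a locale name to a charset; an admin override always wins."""
    if override:
        return override
    # Strip "@modifier" and ".codeset", then keep only the language part,
    # so "he_IL", "he_IL.ISO-8859-8" and "Hebrew_Israel.1255" all match.
    base = locale_name.split("@", 1)[0].split(".", 1)[0]
    language = base.split("_", 1)[0].strip().lower()
    return LANGUAGE_CHARSETS.get(language, default)


def xml_declaration(charset):
    """Build the XML declaration line for the chosen charset."""
    return '<?xml version="1.0" encoding="%s"?>' % charset
```

Reducing every name to its language part is one way to cope with locale names varying between platforms, at the cost of ignoring any codeset the locale name itself specifies.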