Multi-byte languages such as CJK (Chinese-Japanese-Korean) in XSL-FO

Hi,

I have an interview requirement that asks for experience working with multi-byte languages such as CJK (Chinese-Japanese-Korean), or perhaps experience with double-byte or multi-byte character sets.

Where can I find the relevant information? And what makes it different from working with Western languages such as English?


Urgent answer required.

Thank You,
Jags.
jagadeesh_motamarriAsked:

bpmurrayCommented:
You should know the major codepages for these locales:
 
Codepage 932 or Shift-JIS for Japan (JIS = Japanese Industrial Standards)
Big5 for Traditional Chinese, as in Taiwan
GB2312 for Simplified Chinese, or GB 18030 nowadays, which is a huge character set that covers the Basic Multilingual Plane (BMP) of Unicode as well as the supplementary planes.

Unicode, and its UTF-8 encoding in particular, is really the most important character set, but Japanese users in particular tend to dislike it because it doesn't differentiate between Chinese and Japanese ideographs (the so-called Han unification). The Japanese national standard that corresponds to Unicode is JIS X 0221.
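
To make the "multi-byte" part concrete, here is a minimal Java sketch that measures how many bytes the same text takes in UTF-8 and UTF-16. The class and method names (EncodedLength, utf8Bytes and so on) are made up for illustration; only the String and StandardCharsets calls are standard library.

```java
import java.nio.charset.StandardCharsets;

final class EncodedLength {
    // Number of bytes the text occupies when encoded as UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes the text occupies as UTF-16 (big-endian, no byte-order mark).
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    // Number of Unicode characters (code points), which can be less than s.length().
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String kanji = "\u65E5";            // a BMP ideograph
        String plane1 = "\uD801\uDC00";     // U+10400, a Plane 1 character (surrogate pair)
        System.out.println(utf8Bytes("A") + " " + utf8Bytes(kanji) + " " + utf8Bytes(plane1));
        System.out.println(utf16Bytes(kanji) + " " + utf16Bytes(plane1));
        System.out.println(codePoints(plane1) + " vs length " + plane1.length());
    }
}
```

The point to take away is that neither bytes nor Java chars map one-to-one onto characters once you leave ASCII, so any code that assumes "one unit = one character" needs a second look.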

The best book on this stuff is CJKV Information Processing by Ken Lunde: http://www.oreilly.com/catalog/cjkvinfo/chapter/foreword.html. If you don't have time to get hold of the book, Ken has posted loads of URLs on his site here: http://www.praxagora.com/lunde/cjkv-urls.html.

The best site is probably Unicode - http://www.unicode.org





jagadeesh_motamarriAuthor Commented:
Can you explain what is different about the implementation when using a single-byte character set
jagadeesh_motamarriAuthor Commented:
and a multi-byte character set?
bpmurrayCommented:
I forgot to mention the DOS/Windows codepages for the encodings above:

GB2312 (or GBK) is called CP936
Big5 is called CP950

I also forgot about K. Korean uses CP 949 = KS C 5601.

Just a few quick points about the various languages:

All use ideographs, which are characters that stand for the meaning of a word rather than its sound. All of the ideographs came originally from China.
Japanese also uses Katakana, syllabic characters usually used for foreign words, and Hiragana, which is interspersed with the ideographs (Kanji), often as grammatical particles and word endings.
Korean ideographs are called Hanja. Korean also has Hangul: Unicode encodes 11,172 precomposed Hangul syllables, each made up of up to 4 Jamo in a square.

If you're only looking at Windows, Nadine Kano's book is pretty good (I can't remember its name), and MS's site has a lot of MS-specific information: http://www.microsoft.com/globaldev/default.mspx
jagadeesh_motamarriAuthor Commented:
From a programming point of view, what changes do I need to take care of? A regular implementation assumes English by default.
jagadeesh_motamarriAuthor Commented:
Or let me put it this way -

My consultant asked me: "Do you have any experience working with multi-byte languages such as CJK (Chinese-Japanese-Korean), or maybe double-byte or multi-byte character set experience?"

How should I respond to this question in order to convince the interviewer that I have knowledge of this topic?
bpmurrayCommented:
First, there are libraries to  help you do this kind of thing: ICU for C & C++, ICU4J for Java; iconv on Unix/Linux; and the native support in most libraries (although that's almost never enough to do full internationalization of an app). However, I'll go through this from first principles, assuming there are no libs to help.

Let's compare CP932 (Japanese Windows) and CP1252 (what you're probably using). Each character in 1252 is one byte. That makes it simple to scan along a buffer of text (I'll use C, but the idea is essentially the same for all languages):

const char *text = "Hello world!";
const char *pStr;

for (pStr = text; *pStr != '\0'; pStr++) /* One byte per character */
 ...


Now, for CP 932, you can't do that, because the characters can be one or two bytes. Some encodings even have up to 4 bytes, but the problem is the same as for 2:

const char *text = "ABBCCDEEFGG"; /* Double letters stand for double-byte characters */
const char *pStr;

for (pStr = text; *pStr != '\0'; pStr++) /* Increment by one - every character is at least one byte */
{
   char myChar[3] = {0}; /* Up to two bytes plus a terminator */
   myChar[0] = *pStr;
   if (isMultiByteChar((unsigned char)*pStr) && *(pStr + 1) != '\0')
   {
      myChar[1] = *(pStr + 1); /* Take the trail byte as well */
      pStr++;
   }
   /* Process myChar as the current character */
}

The isMultiByteChar function is the interesting bit: from the first byte, you can tell whether this is a single- or double-byte character. In CP932, if the first byte is in the range 0x81 to 0x9F or 0xE0 to 0xFC, it's the first byte of a two-byte character. Have a look at this: http://www.microsoft.com/globaldev/reference/dbcs/932.mspx. The greyed portions are the "lead bytes".
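
The same lead-byte test can be sketched in Java rather than C. This is only an illustration under the CP932 ranges quoted above; the class name Cp932Scan and its methods are hypothetical, and in real code a library decoder would normally do this work for you.

```java
final class Cp932Scan {
    // True if b is a CP932 lead byte, using the ranges 0x81-0x9F and 0xE0-0xFC.
    static boolean isLeadByte(byte b) {
        int v = b & 0xFF; // Java bytes are signed, so mask to get 0..255
        return (v >= 0x81 && v <= 0x9F) || (v >= 0xE0 && v <= 0xFC);
    }

    // Counts characters (not bytes) in a CP932-encoded buffer.
    static int countChars(byte[] buf) {
        int count = 0;
        for (int i = 0; i < buf.length; i++) {
            // A lead byte is followed by one trail byte; guard against a truncated buffer.
            if (isLeadByte(buf[i]) && i + 1 < buf.length) {
                i++; // skip the trail byte
            }
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // 'A', then the two-byte sequence 0x82 0xA0 (hiragana A in Shift-JIS), then 'B'.
        byte[] sample = {0x41, (byte) 0x82, (byte) 0xA0, 0x42};
        System.out.println(countChars(sample) + " characters in " + sample.length + " bytes");
    }
}
```

Note that single-byte half-width Katakana (0xA1 to 0xDF) fall outside both lead-byte ranges, so they are counted as one byte each.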

Now, the other codepages - 949, 936 and 950 - are slightly different. These keep the single-byte ASCII characters as you know them, but bytes in the range 0x81 to 0xFE are all lead bytes for double-byte characters.

bpmurrayCommented:
The consultant is probably looking for an understanding of the area. Have a look at the codepages on MS's site (http://www.microsoft.com/globaldev/reference/WinCP.mspx) and you'll see that they're not really that difficult to handle. A completely different issue is the area of cultural norms, with sorting, accents, casing, etc. That's a lot more complex.
jagadeesh_motamarriAuthor Commented:
What exactly should I answer him?
bpmurrayCommented:
Consultant: Do you have any experience working with multi-byte languages such as CJK (Chinese-Japanese-Korean) or maybe double or multi-byte characters?
You: Yes. I'm always careful to ensure nothing I write prevents the application from running in any locale. I prefer to use Unicode (UTF-16 or UTF-8), but sometimes you just have to manage the native encoding directly. I find that the easiest solution is to translate from the native encoding using MultiByteToWideChar on Windows, process the data as UTF-16 in the program (one 16-bit unit per character for everything in the Basic Multilingual Plane) and convert it back at the end using WideCharToMultiByte. Of course, I always extract strings, graphics, colors, fonts, etc. to resource files so that they can be more easily localised, and I prefer having resources in a single DLL, so that this is the only thing that has to be replaced to run in a different language.
bpmurrayCommented:
Is this multi-platform or a single, e.g. Windows only? Does it involve multiple programming languages or only one? Which?

If you give me this info, I can craft something that addresses these more specifically, with links to where you can get more in-depth info.
jagadeesh_motamarriAuthor Commented:
java...
bpmurrayCommented:
OK.  Continuing your answer ...

Of course, since Java uses UTF-16 Unicode internally, it makes text processing that much easier. I convert the data coming in to Unicode, process it internally and convert it back. Java makes this pretty easy with its sun.io.ByteToCharConverter and CharToByteConverter (sample code here)

   // japaneseBytes is a byte[] holding the Shift-JIS input
   ByteToCharConverter toUnicode = ByteToCharConverter.getConverter("SJIS"); // remember - JIS is Japanese
   String uniStr = new String(toUnicode.convertAll(japaneseBytes));
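
The converter classes named above are internal ones and, as far as I know, are not available in current JDKs. The publicly documented route to the same conversion is java.nio.charset.Charset together with the String constructor and getBytes. A minimal sketch follows; the class name LegacyText and its two methods are hypothetical, and it assumes the JDK in use ships the charset you ask for (Shift_JIS is part of the extended charsets in a standard JDK).

```java
import java.nio.charset.Charset;

final class LegacyText {
    // Decode bytes in a legacy encoding into a Java String (UTF-16 internally).
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    // Encode a String back into the legacy encoding for output.
    static byte[] encode(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        if (Charset.isSupported("Shift_JIS")) {
            byte[] japaneseBytes = {(byte) 0x82, (byte) 0xA0}; // hiragana A in Shift-JIS
            String uniStr = decode(japaneseBytes, "Shift_JIS");
            byte[] roundTrip = encode(uniStr, "Shift_JIS");
            System.out.println(uniStr.length() + " char, " + roundTrip.length + " bytes");
        }
    }
}
```

Charset.forName throws an unchecked exception for an unknown name, which is why the usage above checks Charset.isSupported first.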

I always externalize my strings into resource bundles or property files and use getString to retrieve the string I need. (sample code)
   ResourceBundle messages = ResourceBundle.getBundle("MessagesBundle", currentLocale);
   System.out.println(messages.getString("hello"));

Other internationalization help that Java provides natively includes Calendars, although for full calendaring support, it's necessary to extend this functionality with what's available in ICU4J (http://icu.sourceforge.net/icu4j_faq.html). More i18n features in Java are number formatting, culturally correct comparisons and casing. It even does casing for Turkish correctly.
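
As a rough illustration of the number formatting and culturally correct comparisons mentioned here, the sketch below uses java.text.NumberFormat and java.text.Collator. The class name LocaleAware and its methods are invented for the example; the locales are just sample choices.

```java
import java.text.Collator;
import java.text.NumberFormat;
import java.util.Arrays;
import java.util.Locale;

final class LocaleAware {
    // Format a number with the grouping and decimal separators of the given locale.
    static String formatNumber(double value, Locale locale) {
        return NumberFormat.getNumberInstance(locale).format(value);
    }

    // Sort words by the locale's collation rules instead of raw code point order.
    static String[] sortForLocale(String[] words, Locale locale) {
        String[] copy = words.clone();
        Arrays.sort(copy, Collator.getInstance(locale));
        return copy;
    }

    public static void main(String[] args) {
        System.out.println(formatNumber(1234567.89, Locale.US));
        System.out.println(formatNumber(1234567.89, Locale.GERMANY));

        String[] words = {"cherry", "Banana", "apple"};
        String[] byCodePoint = words.clone();
        Arrays.sort(byCodePoint); // uppercase letters sort before lowercase here
        System.out.println(Arrays.toString(byCodePoint));
        System.out.println(Arrays.toString(sortForLocale(words, Locale.ENGLISH)));
    }
}
```

The plain Arrays.sort puts "Banana" first because uppercase letters have lower code points, while the Collator gives the alphabetical order a reader would expect.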

FYI: Turkish uppercase of "i" is an uppercase "I" with a dot on top and lowercase of "I" is a lowercase "i" with no dot.
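
A small Java sketch of the Turkish case rule, for anyone who wants to see it in action. The class name TurkishCasing and its methods are made up for the example; toUpperCase(Locale) and toLowerCase(Locale) are the standard String methods.

```java
import java.util.Locale;

final class TurkishCasing {
    private static final Locale TURKISH = Locale.forLanguageTag("tr");

    // Uppercase using Turkish rules: "i" becomes dotted capital I (U+0130).
    static String upperTurkish(String s) {
        return s.toUpperCase(TURKISH);
    }

    // Lowercase using Turkish rules: "I" becomes dotless small i (U+0131).
    static String lowerTurkish(String s) {
        return s.toLowerCase(TURKISH);
    }

    // Locale-neutral uppercase, suitable for identifiers, keys and protocol strings.
    static String upperNeutral(String s) {
        return s.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(upperTurkish("i").equals("\u0130"));
        System.out.println(lowerTurkish("I").equals("\u0131"));
        System.out.println(upperNeutral("i").equals("I"));
    }
}
```

This is also why it is worth passing an explicit locale when changing the case of things like file extensions or keywords: the no-argument toUpperCase() uses the default locale, which on a Turkish system would turn "title" into a string that no longer equals "TITLE".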
jagadeesh_motamarriAuthor Commented:
Great answers...!!!

Thank You,
Jagadeesh Motamarri.