Solved

Multi-byte languages such as CJK (Chinese-Japanese-Korean) in XSL-FO

Posted on 2006-07-11
Last Modified: 2013-11-18
Hi,

I have an interview requirement that asks for experience working with multi-byte languages such as CJK (Chinese-Japanese-Korean), or perhaps experience with double-byte or multi-byte character sets.

Where can I find the relevant information? And what makes this different from working with Western languages such as English?

An urgent answer is required.

Thank You,
Jags.
Question by:jagadeesh_motamarri
LVL 15

Expert Comment

by:bpmurray
ID: 17082080
You should know the major codepages for these locales:
 
Codepage 932 or Shift-JIS for Japan (JIS = Japanese Industrial Standards)
Big5 for Traditional Chinese, as in Taiwan
GB2312 for Simplified Chinese, or GB 18030 nowadays, which is a huge character set that covers the whole of Unicode: the Basic Multilingual Plane (BMP) and the supplementary planes.
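
To get a feel for how these encodings differ in practice, here is a minimal Java sketch that encodes the same three-character Japanese word in several charsets and reports the byte count. The class and method names are made up for illustration, and whether the legacy charsets (Shift_JIS, Big5, GBK) are available depends on the JDK you run it on, so the code checks first and returns -1 when one is missing.

```java
import java.nio.charset.Charset;

// Hypothetical demo class: compares how many bytes one string needs per encoding.
class EncodingSizes {
    // The word "Nihongo" (Japanese language), written with Unicode escapes
    // so that the encoding of the source file itself does not matter.
    static final String NIHONGO = "\u65E5\u672C\u8A9E";

    // Returns the encoded length in bytes, or -1 if this JVM lacks the charset.
    static int encodedLength(String text, String charsetName) {
        if (!Charset.isSupported(charsetName)) {
            return -1;
        }
        return text.getBytes(Charset.forName(charsetName)).length;
    }

    public static void main(String[] args) {
        String[] names = {"UTF-8", "UTF-16BE", "Shift_JIS", "Big5", "GBK"};
        for (String name : names) {
            System.out.println(name + ": " + encodedLength(NIHONGO, name) + " bytes");
        }
    }
}
```

The three characters take 9 bytes in UTF-8 and 6 bytes in UTF-16BE; the legacy double-byte codepages use two bytes for each of them where the charset is installed.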

Unicode, usually encoded as UTF-8 or UTF-16, is the most important character set really, but the Japanese folk in particular don't like it because it doesn't differentiate between Chinese and Japanese ideographs (this is known as Han unification). The Japanese national standard that corresponds to Unicode is JIS X 0221.

The best book on this stuff is CJKV Information Processing by Ken Lunde: http://www.oreilly.com/catalog/cjkvinfo/chapter/foreword.html. If you don't have time to get hold of the book, Ken has posted loads of URLs on his site here: http://www.praxagora.com/lunde/cjkv-urls.html.

The best site is probably Unicode - http://www.unicode.org





 
LVL 10

Author Comment

by:jagadeesh_motamarri
ID: 17082099
Can you explain what makes it different, when it comes to implementation, when using a single-byte character set
 
LVL 10

Author Comment

by:jagadeesh_motamarri
ID: 17082102
and a multi-byte character set?
 
LVL 15

Expert Comment

by:bpmurray
ID: 17082216
I forgot to mention the DOS/Windows codepages for the encodings above:

GB2312 (or GBK) is called CP936
Big5 is called CP950

I also forgot about K. Korean uses CP 949 = KS C 5601.

Just a few quick points about the various languages:

All use ideographs, which are characters that stand for a meaning rather than a sound. All of the ideographs came originally from China.
Japanese also uses Katakana, syllabic characters usually used for foreign words, and Hiragana, which are interspersed with the ideographs (Kanji), often as grammatical particles and inflected endings.
Korean ideographs are called Hanja, and Korean also has Hangul: 11,172 precomposed syllables in Unicode, each built from two or three Jamo arranged in a square.

If you're only looking at Windows, Nadine Kano's book is pretty good (I can't remember its name), and MS's site has a lot of MS-specific information: http://www.microsoft.com/globaldev/default.mspx
 
LVL 10

Author Comment

by:jagadeesh_motamarri
ID: 17082234
From a programming point of view, what changes do I need to take care of? A regular implementation assumes English by default.
 
LVL 10

Author Comment

by:jagadeesh_motamarri
ID: 17082286
Or let me put it this way -

My consultant asked me: "Do you have any experience working with multi-byte languages such as CJK (Chinese-Japanese-Korean), or perhaps experience with double-byte or multi-byte character sets?"

How should I respond to this question in order to convince the interviewer that I have knowledge of this topic?
 
LVL 15

Expert Comment

by:bpmurray
ID: 17082438
First, there are libraries to help you do this kind of thing: ICU for C & C++, ICU4J for Java; iconv on Unix/Linux; and the native support in most libraries (although that's almost never enough to do full internationalization of an app). However, I'll go through this from first principles, assuming there are no libs to help.

Let's compare CP932 (Japanese Windows) and CP1252 (what you're probably using). Each character in 1252 is one byte. That makes it simple to scan along a buffer of text (I'll use C, but the idea is essentially the same for all languages):

const char *text = "Hello world!";
const char *pStr;

/* Every character is exactly one byte, so step one byte at a time */
for (pStr = text; *pStr != '\0'; pStr++)
 ...


Now, for CP 932, you can't do that, because the characters can be one or two bytes. Some encodings even have up to 4 bytes, but the problem is the same as for 2:

const char *text = "ABBCCDEEFGG"; /* Double letters stand for a double-byte character */
const char *pStr;

for (pStr = text; *pStr != '\0'; pStr++) /* Increment by one - every character is at least one byte */
{
   char myChar[3] = {0};   /* Room for two bytes plus a terminating NUL */
   myChar[0] = *pStr;
   if (isMultiByteChar(*pStr) && *(pStr + 1) != '\0')
   {
      myChar[1] = *(pStr + 1);   /* Copy the trail byte */
      pStr++;                    /* and skip over it */
   }
   /* Process myChar as the current character */
}

The isMultiByteChar is the interesting bit: from the first byte, you can tell if this is a single- or double-byte character. If the first byte (treated as an unsigned value) is in the range 0x81 to 0x9F or 0xE0 to 0xFC, it's the first of a two-byte character. Have a look at this: http://www.microsoft.com/globaldev/reference/dbcs/932.mspx. The greyed portions are the "lead bytes".

Now, the other codepages - 949, 936 and 950 - are slightly different. These have the ASCII characters as you know them from space to tilde (0x20 to 0x7E), but the bytes from 0x81 to 0xFE are all lead bytes for double-byte characters.
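
Since the asker works in Java, the same lead-byte walk can be sketched there too. This is only an illustration of the CP932 ranges quoted above (0x81 to 0x9F and 0xE0 to 0xFC); the class and method names are made up, and in real code you would normally let a charset decoder do this for you.

```java
// Hypothetical sketch: walks a CP932 (Shift-JIS) byte buffer character by character.
class Cp932Scanner {
    // True if the byte starts a two-byte character in CP932.
    static boolean isLeadByte(byte b) {
        int v = b & 0xFF; // Java bytes are signed, so mask to get 0..255
        return (v >= 0x81 && v <= 0x9F) || (v >= 0xE0 && v <= 0xFC);
    }

    // Counts characters (not bytes) in a CP932-encoded buffer.
    static int countCharacters(byte[] text) {
        int count = 0;
        int i = 0;
        while (i < text.length) {
            // A lead byte consumes its trail byte too, if one is present.
            if (isLeadByte(text[i]) && i + 1 < text.length) {
                i += 2;
            } else {
                i += 1;
            }
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // 'A', then two double-byte Kanji (0x93 0xFA and 0x96 0x7B), then 'B'.
        byte[] text = {0x41, (byte) 0x93, (byte) 0xFA, (byte) 0x96, 0x7B, 0x42};
        System.out.println(text.length + " bytes, "
                + countCharacters(text) + " characters");
    }
}
```

Note the trail byte 0x7B in the second Kanji: on its own it is the ASCII '{'. That is exactly why you can't search or split a double-byte buffer byte by byte without tracking lead bytes.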

LVL 15

Expert Comment

by:bpmurray
ID: 17082484
The consultant is probably looking for an understanding of the area. Have a look at the codepages on MS's site (http://www.microsoft.com/globaldev/reference/WinCP.mspx) and you'll see that they're not really that difficult to handle. A completely different issue is the area of cultural norms, with sorting, accents, casing, etc. That's a lot more complex.
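
As a small taste of the sorting side, here is a hedged Java sketch comparing a plain sort with a locale-aware one using java.text.Collator. The class and method names are invented for the example, and exact collation rules for other locales depend on the locale data shipped with your JDK.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical demo class: plain sorting versus locale-aware sorting.
class CollationDemo {
    // Sorts by UTF-16 code unit value, so uppercase ASCII comes before lowercase.
    static String[] sortByCodeUnit(String[] words) {
        String[] copy = words.clone();
        Arrays.sort(copy);
        return copy;
    }

    // Sorts the way a reader in the given locale would expect.
    static String[] sortForLocale(String[] words, Locale locale) {
        String[] copy = words.clone();
        Arrays.sort(copy, Collator.getInstance(locale));
        return copy;
    }

    public static void main(String[] args) {
        String[] words = {"cherry", "apple", "Banana"};
        System.out.println(String.join(", ", sortByCodeUnit(words)));
        System.out.println(String.join(", ", sortForLocale(words, Locale.ENGLISH)));
    }
}
```

The plain sort puts "Banana" first because 'B' has a lower code than 'a'; the collator gives apple, Banana, cherry.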
 
LVL 10

Author Comment

by:jagadeesh_motamarri
ID: 17082510
What exactly should I answer him?
 
LVL 15

Expert Comment

by:bpmurray
ID: 17082563
Consultant: Do you have any experience working with multi-byte languages such as CJK (Chinese-Japanese-Korean), or maybe double-byte or multi-byte characters?
You: Yes. I'm always careful to ensure nothing I write prevents the application from running in any locale. I prefer to use Unicode UTF-16 or UTF-8, but sometimes you just have to manage the native encoding directly. I find that the easiest solution is to translate from the native encoding using MultiByteToWideChar on Windows, process the data as UTF-16 in the program (one 16-bit unit per character for everything in the BMP) and convert it back at the end using WideCharToMultiByte. Of course, I always extract strings, graphics, colors, fonts, etc. to resource files so that they can be more easily localised, and I prefer having resources in a single DLL, so that this is the only thing that has to be replaced to run in a different language.
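
One caveat worth being able to mention if the interviewer digs deeper: UTF-16 uses one 16-bit unit only for characters in the BMP. Rarer characters, including some CJK ideographs, take two units (a surrogate pair). A minimal Java sketch of the difference, with made-up class and method names:

```java
// Hypothetical demo class: UTF-16 code units versus actual characters (code points).
class Utf16Units {
    // Number of 16-bit units the string occupies.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode characters (code points) in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // 'A', then the ideograph U+20BB7 written as a surrogate pair, then 'B'.
        String text = "A\uD842\uDFB7B";
        System.out.println(codeUnits(text) + " units, " + codePoints(text) + " characters");

        // Walk the string one code point at a time instead of one char at a time.
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            System.out.printf("U+%04X%n", cp);
            i += Character.charCount(cp);
        }
    }
}
```

The string above is 4 units long but holds only 3 characters, so the loop advances by Character.charCount rather than by one.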
 
LVL 15

Expert Comment

by:bpmurray
ID: 17082628
Is this multi-platform or a single, e.g. Windows only? Does it involve multiple programming languages or only one? Which?

If you give me this info I can craft something that addresses these more specifically, with links to where you can get more in-depth info.
 
LVL 10

Author Comment

by:jagadeesh_motamarri
ID: 17082641
java...
 
LVL 15

Accepted Solution

by:
bpmurray earned 500 total points
ID: 17082840
OK.  Continuing your answer ...

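Of course, since Java uses UTF-16 Unicode internally, it makes text processing that much easier. I convert the data coming in to Unicode, process it internally and convert it back on the way out. Java makes this pretty easy with java.nio.charset.Charset together with the String constructor and getBytes method that take a charset. (The old ByteToCharConverter and CharToByteConverter classes live in the internal sun.io package, were never a supported API, and have been removed from recent JDKs.) Sample code here:

   Charset sjis = Charset.forName("Shift_JIS");        // remember - JIS is Japanese
   String uniStr = new String(japaneseBytes, sjis);    // native bytes -> UTF-16 String
   byte[] nativeBytes = uniStr.getBytes(sjis);         // and back again at the end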
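
One thing to be aware of when converting incoming bytes to Unicode: new String(bytes, charset) quietly substitutes a replacement character for bytes it can't decode. If you would rather detect bad input, a CharsetDecoder can be told to report errors instead. A minimal sketch, assuming you are happy to get null back for invalid data (the class and method names are made up):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: decodes bytes strictly instead of silently replacing bad ones.
class StrictDecoder {
    // Returns the decoded text, or null if the bytes are not valid in the charset.
    static String decodeOrNull(byte[] bytes, Charset charset) {
        try {
            return charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        byte[] complete = {(byte) 0xE6, (byte) 0x97, (byte) 0xA5}; // one Kanji in UTF-8
        byte[] truncated = {(byte) 0xE6, (byte) 0x97};             // last byte missing
        System.out.println(decodeOrNull(complete, StandardCharsets.UTF_8) != null);
        System.out.println(decodeOrNull(truncated, StandardCharsets.UTF_8) != null);
    }
}
```

The same helper works for Shift_JIS or any other charset your JDK supports; UTF-8 is used here only because every JDK is guaranteed to have it.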

I always resource my strings to bundle files or property files and use getString to retrieve the string I need. (sample code)
   messages = ResourceBundle.getBundle("MessagesBundle", currentLocale);
   System.out.println(messages.getString("hello"));
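
If you want to try the idea without setting up files on the classpath, the sketch below parses property-file text held in strings. This is an assumption made purely to keep the example self-contained: in a real application ResourceBundle.getBundle finds MessagesBundle.properties, MessagesBundle_ja.properties and so on for you, and also does the fallback that is hand-written here. The class name, keys and texts are invented.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

// Hypothetical demo class: resource lookup with a fallback language.
class InlineBundles {
    // Stand-ins for the contents of two property files.
    static final String ENGLISH = "hello = Hello\nbye = Goodbye\n";
    static final String JAPANESE = "hello = \u3053\u3093\u306B\u3061\u306F\n";

    // Parses property-file text into a bundle.
    static ResourceBundle parse(String propertiesText) {
        try {
            return new PropertyResourceBundle(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Looks the key up in the requested language, falling back to English.
    static String message(String language, String key) {
        ResourceBundle fallback = parse(ENGLISH);
        ResourceBundle preferred = "ja".equals(language) ? parse(JAPANESE) : fallback;
        return preferred.containsKey(key) ? preferred.getString(key) : fallback.getString(key);
    }

    public static void main(String[] args) {
        System.out.println(message("en", "hello"));
        System.out.println(message("ja", "bye")); // not translated, so English is used
    }
}
```

The point to make in the interview is the same either way: no user-visible string lives in the code, so a translator only ever touches the bundle files.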

Java natively provides other internationalization help too, such as Calendars, although for full calendaring support it's necessary to extend this functionality with what's available in ICU4J (http://icu.sourceforge.net/icu4j_faq.html). More i18n features in Java are number formatting, culturally correct comparisons and casing. It even does casing for Turkish correctly.
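
Number formatting is the quickest of these to show. A minimal sketch using java.text.NumberFormat, with an invented class name; the two locales are picked only because their separators are the mirror image of each other:

```java
import java.text.NumberFormat;
import java.util.Locale;

// Hypothetical demo class: the same number formatted for two locales.
class NumberDemo {
    // Formats a value using the grouping and decimal separators of the locale.
    static String format(double value, Locale locale) {
        return NumberFormat.getNumberInstance(locale).format(value);
    }

    public static void main(String[] args) {
        System.out.println(format(1234.56, Locale.US));      // 1,234.56
        System.out.println(format(1234.56, Locale.GERMANY)); // 1.234,56
    }
}
```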

FYI: In Turkish, the uppercase of "i" is an uppercase "I" with a dot on top (İ), and the lowercase of "I" is a lowercase "i" with no dot (ı).
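
You can see the Turkish rule in action by passing a locale to toUpperCase and toLowerCase. A minimal sketch with an invented class name:

```java
import java.util.Locale;

// Hypothetical demo class: locale-sensitive case conversion.
class TurkishCasing {
    static final Locale TURKISH = Locale.forLanguageTag("tr");

    static String upper(String s, Locale locale) {
        return s.toUpperCase(locale);
    }

    static String lower(String s, Locale locale) {
        return s.toLowerCase(locale);
    }

    public static void main(String[] args) {
        // Print code points so the result is readable on any console.
        System.out.printf("tr upper of i: U+%04X%n", (int) upper("i", TURKISH).charAt(0));
        System.out.printf("tr lower of I: U+%04X%n", (int) lower("I", TURKISH).charAt(0));
        System.out.printf("root upper of i: U+%04X%n", (int) upper("i", Locale.ROOT).charAt(0));
    }
}
```

This is also why code that compares protocol keywords or file extensions should pass Locale.ROOT explicitly: calling toUpperCase() with no argument uses the machine's default locale, and "file".toUpperCase() will not equal "FILE" on a Turkish system.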
 
LVL 10

Author Comment

by:jagadeesh_motamarri
ID: 17083263
Great answers...!!!

Thank You,
Jagadeesh Motamarri.
