Solved

Multi-byte languages such as CJK (Chinese-Japanese-Korean) in XSL-FO

Posted on 2006-07-11
Last Modified: 2013-11-18
Hi,

I have an interview requirement that asks for experience working with multi-byte languages such as CJK (Chinese-Japanese-Korean), or possibly double- or multi-byte character set experience.

Where can I find the relevant information? And what makes this different from working with Western languages such as English?


An urgent answer is required.

Thank You,
Jags.
Question by:jagadeesh_motamarri
 
LVL 15

Expert Comment

by:bpmurray
ID: 17082080
You should know the major codepages for these locales:
 
Codepage 932 or Shift-JIS for Japan (JIS = Japanese Industrial Standards)
Big5 for Traditional Chinese, as in Taiwan
GB2312 for Simplified Chinese, or GB 18030 nowadays, which is a huge character set covering the whole of Unicode, that is the Basic Multilingual Plane (BMP) and the supplementary planes.

Unicode, and its UTF-8 encoding, are the most important really, but the Japanese folk in particular don't like them because Unicode doesn't differentiate between Chinese and Japanese ideographs. The Japanese counterpart of the Unicode standard is JIS X 0221.

The best book on this stuff is CJKV Information Processing by Ken Lunde: http://www.oreilly.com/catalog/cjkvinfo/chapter/foreword.html. If you don't have time to get hold of the book, Ken has posted loads of URLs on his site here: http://www.praxagora.com/lunde/cjkv-urls.html.

The best site is probably Unicode - http://www.unicode.org





0
 
LVL 10

Author Comment

by:jagadeesh_motamarri
ID: 17082099
Can you explain what is different in the implementation when using a single-byte character set
0
 
LVL 10

Author Comment

by:jagadeesh_motamarri
ID: 17082102
and a multi-byte character set?
0
 
LVL 15

Expert Comment

by:bpmurray
ID: 17082216
I forgot to mention the DOS/Windows codepages for the encodings above:

GB2312 (or GBK) is called CP936
Big5 is called CP950

I also forgot about K. Korean uses CP 949 = KS C 5601.

Just a few quick points about the various languages:

All use ideographs, which are little drawings that illustrate the meaning behind the word. All of the ideographs came originally from China.
Japanese also uses Katakana, syllabic characters usually used for foreign words, and Hiragana, syllabic characters interspersed with the ideographs (Kanji), often as grammatical particles and word endings
Korean ideographs are called Hanja, and Korean also has Hangul: 11,172 syllable characters in Unicode, each made up of up to 4 Jamo in a square.

If you're only looking at Windows, Nadine Kano's book is pretty good (I can't remember its name), and MS's site has a lot of MS-specific information: http://www.microsoft.com/globaldev/default.mspx
0
 
LVL 10

Author Comment

by:jagadeesh_motamarri
ID: 17082234
From a programming point of view, what changes do I need to take care of? A regular implementation assumes English by default.
0
 
LVL 10

Author Comment

by:jagadeesh_motamarri
ID: 17082286
Or let me put it this way -

My consultant asked me: "Do you have any experience working with multi-byte languages such as CJK (Chinese-Japanese-Korean), or maybe double- or multi-byte character set experience?"

How should I respond to this question in order to convince the interviewer that I have knowledge of this topic?
0
 
LVL 15

Expert Comment

by:bpmurray
ID: 17082438
First, there are libraries to  help you do this kind of thing: ICU for C & C++, ICU4J for Java; iconv on Unix/Linux; and the native support in most libraries (although that's almost never enough to do full internationalization of an app). However, I'll go through this from first principles, assuming there are no libs to help.

Let's compare CP932 (Japanese Windows) and CP1252 (what you're probably using). Each character in 1252 is one byte. That makes it simple to scan along a buffer of text (I'll use C, but the idea is essentially the same for all languages):

char *text = "Hello world!";
char *pStr;

for (pStr = text; *pStr != '\0'; pStr++)
 ...


Now, for CP 932, you can't do that, because the characters can be one or two bytes. Some encodings even have up to 4 bytes, but the problem is the same as for 2:

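const char *text = "ABBCCDEEFGG"; /* Double letters stand for a double-byte character */
const char *pStr;

for (pStr = text; *pStr != '\0'; pStr++) /* Increment by one - every character is at least one byte */
{
   char myChar[2];
   myChar[0] = *pStr;
   /* Cast to unsigned char: lead bytes are >= 0x81 and would be negative as a signed char */
   if (isMultiByteChar((unsigned char)*pStr) && *(pStr + 1) != '\0')
   {
      myChar[1] = *(pStr + 1); /* trail byte */
      pStr++;                  /* step over the trail byte */
   } else {
      myChar[1] = 0;
   }
   /* Process myChar as the current character */
}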

The isMultiByteChar is the interesting bit: from the first byte, you can tell if this is a single- or double-byte character. If the first byte is in the range 0x81 to 0x9F or 0xE0 to 0xFC, it's the first of a two-byte character. Have a look at this: http://www.microsoft.com/globaldev/reference/dbcs/932.mspx. The greyed portions are the "lead bytes".
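Since the question turns out to be about Java, here is a minimal sketch of the same lead-byte test in Java. The class and method names (Cp932Scanner, isLeadByte, charLengths) are made up for illustration, and the byte ranges are the CP932 ones quoted above; treating a dangling lead byte at the end of the buffer as a single byte is an assumption of this sketch, not a rule of the encoding.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: walk a CP932 (Shift-JIS) byte buffer character by character. */
final class Cp932Scanner {

    /** True if b is a CP932 lead byte: 0x81-0x9F or 0xE0-0xFC. */
    static boolean isLeadByte(byte b) {
        int v = b & 0xFF; // Java bytes are signed; mask to get 0..255
        return (v >= 0x81 && v <= 0x9F) || (v >= 0xE0 && v <= 0xFC);
    }

    /** Returns the byte length (1 or 2) of each character in the buffer, in order. */
    static List<Integer> charLengths(byte[] text) {
        List<Integer> lengths = new ArrayList<>();
        int i = 0;
        while (i < text.length) {
            // Assumption: a lead byte with nothing after it is truncated input,
            // so it is counted as a single byte instead of reading past the end.
            int len = (isLeadByte(text[i]) && i + 1 < text.length) ? 2 : 1;
            lengths.add(len);
            i += len;
        }
        return lengths;
    }

    public static void main(String[] args) {
        // 'A', two double-byte characters (0x93 0xFA and 0x96 0x7B), then 'B'.
        byte[] text = {0x41, (byte) 0x93, (byte) 0xFA, (byte) 0x96, 0x7B, 0x42};
        System.out.println(charLengths(text)); // 6 bytes, 4 characters
    }
}
```

Note that the trail byte 0x7B is also the ASCII code for "{". That is why you have to scan from the start and track lead bytes: looking at a byte in isolation cannot tell you whether it is a character or the second half of one.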

Now, the other codepages (949, 936 and 950) are slightly different. These keep the printable ASCII characters as you know them in the range below 0x80, but the bytes from 0x81 to 0xFE are all lead bytes for double-byte characters.

0
 
LVL 15

Expert Comment

by:bpmurray
ID: 17082484
The consultant is probably looking for an understanding of the area. Have a look at the codepages on MS's site (http://www.microsoft.com/globaldev/reference/WinCP.mspx) and you'll see that they're not really that difficult to handle. A completely different issue is the area of cultural norms, with sorting, accents, casing, etc. That's a lot more complex.
0
 
LVL 10

Author Comment

by:jagadeesh_motamarri
ID: 17082510
What exactly should I answer him?
0
 
LVL 15

Expert Comment

by:bpmurray
ID: 17082563
Consultant: Do you have any experience working with multi-byte languages such as CJK (Chinese-Japanese-Korean) or maybe double or multi-byte characters?
You: Yes. I'm always careful to ensure nothing I write prevents the application from running in any locale. I prefer to use Unicode UTF-16 or UTF-8, but sometimes you just have to manage the native encoding directly. I find that the easiest solution is to translate from the native encoding using MultiByteToWideChar on Windows, process the data as UTF-16 in the program and convert it back at the end using WideCharToMultiByte. Of course, I always extract strings, graphics, colors, fonts, etc. to resource files so that they can be more easily localised, and I prefer having resources in a single DLL, so that this is the only thing that has to be replaced to run in a different language.
0
 
LVL 15

Expert Comment

by:bpmurray
ID: 17082628
Is this multi-platform or a single, e.g. Windows only? Does it involve multiple programming languages or only one? Which?

If you give me this info I can craft something that addresses these more specifically, with links to where you can get more in-depth info.
0
 
LVL 10

Author Comment

by:jagadeesh_motamarri
ID: 17082641
java...
0
 
LVL 15

Accepted Solution

by:
bpmurray earned 500 total points
ID: 17082840
OK.  Continuing your answer ...

Of course, since Java uses UTF-16 Unicode internally, it makes text processing that much easier. I convert the data coming in to Unicode, process it internally and convert it back. Java makes this pretty easy with java.nio.charset.Charset and the String methods that take a charset; the older sun.io.ByteToCharConverter and CharToByteConverter classes were internal and are not available in current JDKs (sample code here)

   Charset sjis = Charset.forName("Shift_JIS"); // remember - JIS is Japanese
   String uniStr = new String(japaneseBytes, sjis); // native bytes -> UTF-16 String
   byte[] nativeBytes = uniStr.getBytes(sjis); // and back to the native encoding at the end
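To make the convert-in, process, convert-out flow concrete, here is a small self-contained sketch using java.nio.charset. The class and method names (NativeRoundTrip, toUnicode, toNative) are hypothetical, and it assumes the JDK in use ships the Shift_JIS charset, which full JDK distributions normally do but stripped-down runtimes may not.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

/** Sketch: decode legacy-encoded bytes to a Java String and encode them back. */
final class NativeRoundTrip {

    /** Decode bytes in the named encoding into a Java (UTF-16) String. */
    static String toUnicode(byte[] nativeBytes, String charsetName) {
        return new String(nativeBytes, Charset.forName(charsetName));
    }

    /** Encode a Java String back into the named encoding. */
    static byte[] toNative(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // Two kanji (U+65E5 U+672C, the word for Japan) as Shift-JIS bytes.
        byte[] sjis = {(byte) 0x93, (byte) 0xFA, (byte) 0x96, 0x7B};
        String s = toUnicode(sjis, "Shift_JIS");
        // Byte count and character count differ once the text is decoded.
        System.out.println(sjis.length + " bytes, " + s.length() + " chars");
        System.out.println(Arrays.equals(sjis, toNative(s, "Shift_JIS")));
    }
}
```

Once the text is a String, ordinary character-based code works on it; the multi-byte handling is confined to the two conversion points.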

I always resource my strings to bundle files or property files and use getString to retrieve the string I need. (sample code)
   messages = ResourceBundle.getBundle("MessagesBundle", currentLocale);
   System.out.println(messages.getString("hello"));
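For a feel of what sits behind getString, here is a rough sketch that builds the bundles in memory instead of loading MessagesBundle.properties and MessagesBundle_ja.properties from the classpath. The class InMemoryMessages, its message helper and the property text are all made up for illustration; the hand-written fallback to the default bundle only imitates what ResourceBundle.getBundle does for you through its parent chain.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Locale;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

/** Sketch: per-locale strings looked up by key, with fallback to a default bundle. */
final class InMemoryMessages {

    // Stand-ins for MessagesBundle.properties and MessagesBundle_ja.properties.
    private static final String DEFAULT_PROPS = "hello=Hello\nbye=Goodbye\n";
    // Property files may use \\uXXXX escapes for non-ASCII text.
    private static final String JA_PROPS = "hello=\\u3053\\u3093\\u306B\\u3061\\u306F\n";

    /** Look up key for the given language tag, falling back to the default text. */
    static String message(String languageTag, String key) {
        Locale locale = Locale.forLanguageTag(languageTag);
        try {
            ResourceBundle base = new PropertyResourceBundle(new StringReader(DEFAULT_PROPS));
            if ("ja".equals(locale.getLanguage())) {
                ResourceBundle ja = new PropertyResourceBundle(new StringReader(JA_PROPS));
                if (ja.containsKey(key)) {
                    return ja.getString(key);
                }
            }
            return base.getString(key); // not translated: use the default string
        } catch (IOException e) {
            throw new IllegalStateException("could not parse in-memory properties", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(message("en", "hello"));
        System.out.println(message("ja", "bye")); // falls back to the default bundle
    }
}
```

The point to make in an interview is that the code never contains the user-visible text itself, only the key, so translators can work on the property files without touching the program.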

Java also provides other internationalization help natively, such as Calendars, although for full calendaring support it's necessary to extend this functionality with what's available in ICU4J (http://icu.sourceforge.net/icu4j_faq.html). Further i18n features in Java are number formatting, culturally correct comparisons and casing. It even does casing for Turkish correctly.
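As a quick illustration of the number formatting and comparison features, here is a short sketch using java.text.NumberFormat and java.text.Collator. The wrapper class LocaleAwareText and its method names are hypothetical; the locales are just examples.

```java
import java.text.Collator;
import java.text.NumberFormat;
import java.util.Arrays;
import java.util.Locale;

/** Sketch: locale-aware number formatting and sorting. */
final class LocaleAwareText {

    /** Format a number using the conventions of the given language tag. */
    static String formatNumber(double value, String languageTag) {
        Locale locale = Locale.forLanguageTag(languageTag);
        return NumberFormat.getNumberInstance(locale).format(value);
    }

    /** Return a copy of words sorted with the locale's collation rules. */
    static String[] sortForLocale(String[] words, String languageTag) {
        String[] copy = words.clone();
        Arrays.sort(copy, Collator.getInstance(Locale.forLanguageTag(languageTag)));
        return copy;
    }

    public static void main(String[] args) {
        System.out.println(formatNumber(1234.5, "en-US"));
        System.out.println(formatNumber(1234.5, "de-DE"));

        String[] words = {"cherry", "Banana", "apple"};
        String[] byCodePoint = words.clone();
        Arrays.sort(byCodePoint); // raw char order puts uppercase first
        System.out.println(Arrays.toString(byCodePoint));
        System.out.println(Arrays.toString(sortForLocale(words, "en")));
    }
}
```

The plain Arrays.sort puts "Banana" ahead of "apple" because it compares character codes, while the Collator orders the words the way a reader would expect.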

FYI: Turkish uppercase of "i" is an uppercase "I" with a dot on top and lowercase of "I" is a lowercase "i" with no dot.
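The Turkish case is easy to check for yourself. A minimal sketch, with a hypothetical TurkishCasing wrapper around the standard String.toUpperCase(Locale) and String.toLowerCase(Locale) methods:

```java
import java.util.Locale;

/** Sketch: case conversion depends on the locale, Turkish being the classic example. */
final class TurkishCasing {

    static String upper(String s, String languageTag) {
        return s.toUpperCase(Locale.forLanguageTag(languageTag));
    }

    static String lower(String s, String languageTag) {
        return s.toLowerCase(Locale.forLanguageTag(languageTag));
    }

    public static void main(String[] args) {
        // U+0130 is capital I with dot above, U+0131 is dotless small i.
        System.out.println(upper("i", "tr").equals("\u0130"));
        System.out.println(lower("I", "tr").equals("\u0131"));
        // In English the same letters map to plain I and i.
        System.out.println(upper("i", "en").equals("I"));
    }
}
```

This is also why it is safer to pass an explicit locale when changing the case of identifiers or keywords: with the no-argument toUpperCase() the result depends on the default locale of the machine the code runs on.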
0
 
LVL 10

Author Comment

by:jagadeesh_motamarri
ID: 17083263
Great answers...!!!

Thank You,
Jagadeesh Motamarri.
0
