Solved

Unicode - 2 byte chars - 1 byte chars .... What to do ?

Posted on 2001-06-08
Last Modified: 2012-06-27
Hi folks,

Following issue is bugging me and I'd like to have some insights so that I can decide how to proceed in my project.

I interpret text data from existing formats.
Some of that text data is 1 byte per character, some of it is 2 bytes per character (this is Unicode ... right ?).

At the moment, the Objects that are created from the information found store all texts as 1-byte characters !
In other words, when I find Unicode ... I convert it to 1-byte characters.
For the detection and conversion of Unicode to single bytes I wrote a small routine.  The byte order also needs to be swapped before I can call the standard conversion routine !

This all works great but I work(ed) with Western texts only so far.

I'm wondering how that will affect Oriental character sets ? Will the standard conversion routines fail ?
So ... I'm also wondering if I shouldn't just store the Unicode in the Objects ... maybe even convert the 1-byte characters to Unicode before I store them in the Objects ?
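To illustrate the concern about Oriental text: a minimal sketch, assuming the Unicode data is held as 16-bit units and the 1-byte target is ISO Latin-1. The helper name `narrow_to_latin1` is made up for this example; it is not the routine from the post. Anything above 0x00FF has nowhere to go in one byte, so it is replaced and flagged rather than converted.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: narrows UTF-16 text to ISO Latin-1 (one byte per character).
// Units above 0x00FF have no Latin-1 equivalent, so they are replaced by '?'
// and *lossy is set to true.
std::string narrow_to_latin1(const std::u16string& wide, bool* lossy) {
    assert(lossy != nullptr);
    *lossy = false;
    std::string narrow;
    narrow.reserve(wide.size());
    for (char16_t unit : wide) {
        if (unit <= 0x00FF) {
            narrow.push_back(static_cast<char>(unit));
        } else {
            narrow.push_back('?');  // information is lost here
            *lossy = true;
        }
    }
    return narrow;
}
```

With this sketch, Western text passes through unchanged and `lossy` stays false, while a string of kanji comes back as question marks with `lossy` set. As far as I know, WideCharToMultiByte behaves comparably on a Western code page: it substitutes a default character and can report that it did so through its last parameter.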

In the Objects there's also code dealing with the texts.
E.g. Objects, when asked for certain strings, may probe child Objects for their name and/or text properties and add them to the string that is finally returned !
I guess I will then have to review them all too, to make sure they all work with Unicode ??

How would I store the unicode then ?
Now the Object has a pointer char *Text
When the name is assigned it becomes Text = new char[strlen(InputText)+1] ; and then the data is copied.

What would be the best approach with Unicode ?
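One possible shape for the storage, sketched under the assumption of a C++11 compiler: `TextObject` is a hypothetical stand-in for the Object above, and `std::u16string` replaces the hand-managed `char*`. On Windows, where `wchar_t` is 16 bits wide, `std::wstring` or a `wchar_t*` sized with `wcslen` would play the same role.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for the Object described above.
class TextObject {
public:
    // Copies a 0x0000-terminated sequence of 16-bit units.
    void SetText(const char16_t* input) {
        assert(input != nullptr);
        text_ = input;  // the string finds the terminator and allocates for us
    }
    const std::u16string& Text() const { return text_; }
    // Length in 16-bit units, not in bytes.
    std::size_t Length() const { return text_.size(); }

private:
    std::u16string text_;
};
```

Letting the string class own the buffer removes the `new char[strlen(...)+1]` arithmetic, which is exactly the part that goes wrong when the units are no longer single bytes.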

Thanks
Question by:sneeuw

Expert Comment

by:griessh
You are touching my problem areas here:-)
The 256 characters you can encode with 1-byte coding are not enough to get ALL possible characters on your display. The ISO Latin-1 characters are all we usually have to display in our western world. But already Russian (Cyrillic) and a few of the Turkish characters won't fit in that scheme anymore. So if you have to worry about these languages, you have to find new ways of encoding characters. You can always use a different font for a different language, but then your program has to know in which language it is working. It gets even worse when you have to use the Chinese/Japanese/Korean/Vietnamese characters (CJKV). That's where unicode comes in. With 2 bytes (65536 values) you can encode practically ALL characters in current use (the drawback is, your fonts get blown up). If you look at unicode tables, you will see, that the upper byte is used for language style information, the lower byte for the actual characters.
Short result: If you have to show Kanji etc., use unicode. Use RogueWave libraries or something similar for "wide characters" (16 bit). That will help.
Some URLs: basic unicode info at http://czyborra.com, the official unicode site at http://www.unicode.org.
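A caveat worth a small sketch: 16 bits give 65536 values, but the Unicode code space runs up to U+10FFFF. In UTF-16, characters beyond U+FFFF are stored as a pair of 16-bit units (a surrogate pair). The function name below is chosen for illustration only.

```cpp
#include <cassert>
#include <string>

// Encodes one Unicode code point as one or two UTF-16 units.
std::u16string encode_utf16(char32_t cp) {
    // Valid code points end at U+10FFFF and exclude the surrogate range itself.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::u16string out;
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));  // fits in a single unit
    } else {
        cp -= 0x10000;  // 20 bits remain, split 10/10
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));    // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));  // low surrogate
    }
    return out;
}
```

'F' (U+0046) stays a single unit, while U+20000, a CJK ideograph outside the 16-bit range, becomes the pair 0xD840 0xDC00. Code that assumes "one wide character equals one character" holds for most text in practice but not for all of it.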

One more question: Are you working under UNIX or Windows?
======
Werner

Author Comment

by:sneeuw
Windows

Expert Comment

by:griessh
I'm more the UNIX person here, maybe a Win expert can comment here, too. I just know that newer versions of Windows are more usable with unicode. But still, unicode is unicode and you won't be able to get the full character range ...

======
Werner

Author Comment

by:sneeuw
> But still, unicode is unicode and you won't be able to get the full character range ...

How do you mean ?
I thought Unicode IS the 2 byte set that covers ALL characters per character set ?



Author Comment

by:sneeuw
> If you look at unicode tables, you will see, that the upper byte is used for language style information, the lower byte for the actual characters.

I'm not sure I get this ?
Then you still have only one byte per character .. right ?
One byte for the style ... one byte for the actual character ... so 256 styles, 256 possible characters ??

Expert Comment

by:griessh
Yes, you are right, the 2 bytes cover all characters. But they made it a bit more convenient for us. Let's look at some special fonts:
The ISO Latin-1 set we are using in the Western world has the high byte 0x00. The characters in the range 0x0000 to 0x00ff are exactly the characters from the ISO Latin-1 character set 0x00 to 0xff.
The ISO 8859-5 (Cyrillic) letters are mapped into the range 0x0400 to 0x04ff, so if you have a 1-byte ISO 8859-5 font, you will get the same characters from a unicode font by a fixed translation into that block (the letters sit at a constant offset; it is not quite as simple as prepending 0x04 to the code).
That just means there is some order in the unicode system that makes it possible to map characters back to single-byte font sets, but you have to be aware what that means for your application.
Let me give you an example: I have an X/Motif app that has to work in different languages. It receives input from the network in unicode. My UI can only handle single-byte fonts. So I have a conversion table that says: Currently I am in Russian mode and I have Russian (ISO 8859-5) fonts installed. If a (unicode) character comes over the network that does not have 0x04 as high byte, I can't display it. All other characters (with the 0x04) I can map one-to-one into my ISO 8859-5 font.
Other languages have to be treated differently (Turkish needs an ISO 8859-9 font) or are only possible when I have unicode fonts installed (Japanese using Katakana, Hiragana and Kanji).
The problem is that most fonts are single-byte fonts (256 characters). There are not too many real unicode fonts available. (There should be only one TrueType unicode font per font style, but that would have to have definitions for 65536 characters! A megabyte font set!)
I hope I didn't make it too confusing now.

If you know that a character is mapped into a specific font, you can always use a conversion from 2-byte to 1-byte. But if you want to be open for all characters, you have to save unicode (2-byte) characters.
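Such a 2-byte to 1-byte conversion could look roughly like this for the Cyrillic case. It is a sketch based on the published ISO 8859-5 layout, not the conversion table from the X/Motif app described above; the function name and the -1 return convention are assumptions.

```cpp
#include <cassert>

// Maps a 16-bit Unicode value to its ISO 8859-5 byte, or returns -1 when the
// character does not exist in that 1-byte set.
int unicode_to_iso8859_5(char16_t u) {
    if (u <= 0x00A0 || u == 0x00AD) return u;  // ASCII, controls, NBSP, soft hyphen
    if (u == 0x00A7) return 0xFD;              // section sign
    if (u == 0x2116) return 0xF0;              // numero sign
    if (u >= 0x0401 && u <= 0x045F &&
        u != 0x040D && u != 0x0450 && u != 0x045D) {
        const int byte = u - 0x0360;           // Cyrillic letters: constant offset
        assert(byte >= 0xA1 && byte <= 0xFF);
        return byte;
    }
    return -1;  // e.g. kanji, or Latin-1 letters with accents
}
```

The Cyrillic capital A (0x0410) maps to byte 0xB0, so the relation is "subtract 0x0360" rather than "drop the 0x04". Anything that returns -1 is a character the single-byte font cannot show, which is the point where storing the full unicode value starts to matter.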

=======
Werner

Author Comment

by:sneeuw
Hi,

I'm letting it sink in .... ;-))

I quickly did a test and noticed that the unicode characters that came in indeed had 0x00 as Hi Byte !
(e.g. 0x0046 was 'F')
(Well in fact the data comes in 0x4600 but I swap all WORDS first !!)
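The word swap described here can be sketched as follows, assuming the source bytes are stored high byte first (big-endian). `from_big_endian` is a hypothetical name; assembling each unit from two bytes gives the same result as reading a WORD and swapping it, and it does not depend on the byte order of the machine.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Builds 16-bit units from raw bytes that are stored high byte first.
std::u16string from_big_endian(const std::uint8_t* bytes, std::size_t byte_count) {
    assert(bytes != nullptr && byte_count % 2 == 0);
    std::u16string out;
    out.reserve(byte_count / 2);
    for (std::size_t i = 0; i < byte_count; i += 2) {
        // High byte first, then low byte: 0x00 0x46 becomes 0x0046 ('F').
        out.push_back(static_cast<char16_t>((bytes[i] << 8) | bytes[i + 1]));
    }
    return out;
}
```

Feeding it the four bytes 00 46 00 6F yields the two units 0x0046 0x006F, that is "Fo".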

Now ...
I removed the conversion with WideCharToMultiByte(...) for a minute and let the full unicode data be stored in the object.

Then I added the text to a ListNode in a ListView
I made sure I cast the text to wchar_t* first so that Node->Caption = String((wchar_t*)Object->Text) ; would 'know' it was unicode for sure.

I ONLY saw the first character suggesting that the first 0x00 character stopped the display !
(I'm not entirely sure ... I have to back-track because I use fcts such as strlen and strncpy of which I suspect they don't deal with unicode well ?)

Is this then normal ?
Does a normal installed font not know of unicode ?

But still leaves me with a lot of questions ...
what if the Hi Byte is different from 0x00 ...
Will it then not work ?
Do I need to install a font per language ?

Do I need to store the unicode when I find out the font supports that particular unicode ??? and if so ... how do I detect that ??

Expert Comment

by:griessh
>>I use fcts such as strlen and strncpy of which
I suspect they don't deal with unicode well ?

You are absolutely right. The strxxx() functions deal with char*, so a 0x00 byte inside a unicode character is treated as the end of the string (that's where I am stuck in my work, my UIL compiler does the same thing &*%^$!). How your fonts behave, I really don't know, we are getting into the Windows programming area now. All you other folks here: HELP US OUT!
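A small demonstration of the effect, assuming the text is stored as little-endian 16-bit units, the way Windows holds it in memory. The helper and the constant are invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// "Fo" as little-endian 16-bit units: 0x0046 0x006F 0x0000.
const unsigned char kWideFo[] = {0x46, 0x00, 0x6F, 0x00, 0x00, 0x00};

// Treats a buffer as a char* string, the way strlen() and friends do, and
// reports how many bytes they see before the first 0x00.
std::size_t bytes_seen_by_strlen(const unsigned char* buffer) {
    assert(buffer != nullptr);
    return std::strlen(reinterpret_cast<const char*>(buffer));
}
```

`bytes_seen_by_strlen(kWideFo)` yields 1: only the 'F' survives, which matches the single character that showed up in the ListView. Had the bytes been in big-endian order, the very first byte would be 0x00 and the result would be 0.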

Werner

Author Comment

by:sneeuw
During another test I made sure no strlen function changed the length.
When the text (cast to wchar_t*) was provided to the VCL component it DID work !
So I guess I don't need to implement checking of any kind.

I DO need conformity !
should I store ALL unicode (and convert the non-unicode strings) or should I store 1-byte text and convert all unicode before I store it ???

In case of Unicode I need to go through the code again to make sure all functions work correctly (I used strlen and strncpy etc...)

But I want the code to be ready for future Oriental localization
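If everything is to be stored as Unicode, the 1-byte texts have to be widened on the way in. A minimal sketch, assuming those texts really are ISO Latin-1; `widen_latin1` is a hypothetical helper. Text in a Windows code page such as 1252 differs from Latin-1 in the range 0x80 to 0x9F and would need a table (or MultiByteToWideChar) instead.

```cpp
#include <cassert>
#include <string>

// Widens ISO Latin-1 text to 16-bit units. Latin-1 bytes 0x00-0xFF have the
// same numeric values as Unicode 0x0000-0x00FF, so each byte is zero-extended.
std::u16string widen_latin1(const std::string& narrow) {
    std::u16string wide;
    wide.reserve(narrow.size());
    for (char c : narrow) {
        // Go through unsigned char so bytes >= 0x80 do not sign-extend.
        wide.push_back(static_cast<char16_t>(static_cast<unsigned char>(c)));
    }
    assert(wide.size() == narrow.size());  // exactly one unit per input byte
    return wide;
}
```

This direction never loses information, which is one argument for converting the 1-byte strings up rather than the 2-byte strings down.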

Expert Comment

by:griessh
OK, let's do it again:
The function size_t strlen(const char* s) returns the length of a string that uses char (1 byte). The end-of-string character is 0x00.
If we look at a unicode string sequence: 0x091a 0x091b 0x091c 0x0000 you will see that the length is not 6 but 3 wide characters. That's why we have to use the wide functions here. If you are using VC++ please check your documentation about 'wide characters' (other compilers should have those functions also). Unicode IS wide, regular strings are 1 byte.
That is the same on all OSs.
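For comparison, a sketch of a length function that counts 16-bit units. The names are made up; `std::char_traits<char16_t>::length` is standard C++11. The classic C wide functions are `wcslen` and `wcsncpy` from `<wchar.h>`, but they work on `wchar_t`, which is 16 bits on Windows and 32 bits on most UNIX systems, so the unit size is one thing that is not the same everywhere.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// The sequence from the post above, terminated by 0x0000.
const char16_t kSample[] = {0x091A, 0x091B, 0x091C, 0x0000};

// Counts 16-bit units up to (not including) the 0x0000 terminator:
// the wide counterpart of strlen().
std::size_t wide_length(const char16_t* s) {
    assert(s != nullptr);
    return std::char_traits<char16_t>::length(s);
}
```

`wide_length(kSample)` gives 3 for the sequence quoted above.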

Werner

Author Comment

by:sneeuw
I know ;-)
I was just being informative ;-))
Wanted to say that (after some changes) the wide characters led to correctly displayed text !
The only thing left I guess is to make up my mind (and I was looking for some help) ... should I store all in wide or in single byte characters ?

I have a feeling the Borland cpp builder (visual) components can take both but need overhead.  I have the impression they store single byte anyway ... so I see no reason to do it to ...
(Correct me if I'm wrong)

The big question then remains ...
What will happen with oriental wide characters ?

Expert Comment

by:ShaunWilde
it sounds like you have come across MBCS (multibyte character set), also known as DBCS (double-byte character set). Strings can contain single-byte characters, double-byte characters, or both. The C run-time library provides functions, macros, and data types for MBCS-only programming.

this is different to UNICODE as that always has 2 bytes per character
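To make the difference concrete, here is a sketch that counts characters in a Shift-JIS string, one common double-byte encoding. The lead-byte ranges are those of Shift-JIS; the function names are invented, and on Windows the CRT's `_mbslen` or the API `IsDBCSLeadByte` would normally do this job for the active code page.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True if the byte starts a two-byte character in Shift-JIS.
bool is_sjis_lead_byte(unsigned char b) {
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Counts characters (not bytes) in a Shift-JIS string.
std::size_t sjis_char_count(const std::string& s) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++count) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        // A lead byte consumes the following trail byte as well, if present.
        i += (is_sjis_lead_byte(b) && i + 1 < s.size()) ? 2 : 1;
    }
    assert(count <= s.size());  // never more characters than bytes
    return count;
}
```

The four bytes 'A', 0x82, 0xA0, 'B' hold three characters, because 0x82 0xA0 together form one hiragana. In a wide string the same text would simply be three 16-bit units, with no lead-byte bookkeeping.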


Expert Comment

by:griessh
>> The only thing left I guess is to make up my mind (and I was looking for some help) ... should I store
all in wide or in single byte characters ?

And the only thing I wanted to make clear is: If you are concerned about (Asian) unicode characters, you should store 2-byte characters. That's what I do, too. And my Japanese works just fine (with the exceptions I mentioned).

Werner

Author Comment

by:sneeuw
> with the exceptions I mentioned

You mean the fonts getting blown up ?

I'm sorry we haven't reached a conclusion yet but I'm still in doubt about what to do.  I have the impression (although not properly documented) that the unicode approach might not work under Win95 + the fact that (visual) VCL components (Borland cpp Builder) rather like single-byte strings (although it seems to work).

I did learn from all of this that there's unicode and there are wide characters which is not the same.

I now use the routine :
WideCharToMultiByte
to convert all (suspected) wide char text that comes in.

While testing, the output of this Win32 routine was always 1-byte character strings but the input was always ISO Latin-1.

What when the input is kanji or so ...
Any idea what the output will be ?
1 byte or two bytes, or will the function fail ?
And then ...
Suppose the output is two-byte character strings ...
can there be (not ending) null characters in the string ?
Or are NULL characters in the string only possible in ISO Latin-1  ??

Accepted Solution

by:griessh (earned 200 total points)
Wide characters are a 2-byte type that is used for unicode. Multibyte means that characters can be 1, 2, 3, ... bytes long, so there has to be a way to tell where each character ends (in practice a lead byte that announces the length). The char end-of-string character is 0x00, the wide character end-of-string is 0x0000. If you have an input that is kanji, then you WILL get a wide character, because there are too many kanji characters for a 1-byte type (>> 256), so storing and reading wide characters has to work. That's all I can tell you. Please try different languages under Windows (maybe you can get a Russian/Turkish/Japanese Windows; if not, I can't help you anymore) to see if it works. I have no Asian Windows and no Borland compiler/library to test it, that's all on your side now. I doubt that there is anybody out there able to tell you more about it. I am trying to get that type of information now for almost 4 years and had to work my way through the unicode issues myself, never got any answer in any group.
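On the question of embedded nulls in the converted output: a sketch, assuming the multibyte target is UTF-8 (which, as far as I know, is what WideCharToMultiByte produces when called with CP_UTF8). `bmp_to_utf8` is a hypothetical helper that handles a single 16-bit unit only.

```cpp
#include <cassert>
#include <string>

// Encodes one character given as a single non-surrogate 16-bit unit as UTF-8.
std::string bmp_to_utf8(char16_t u) {
    assert(u < 0xD800 || u > 0xDFFF);  // surrogate halves are not characters
    std::string out;
    if (u < 0x80) {                    // 1 byte: plain ASCII
        out.push_back(static_cast<char>(u));
    } else if (u < 0x800) {            // 2 bytes: e.g. accented Latin, Cyrillic
        out.push_back(static_cast<char>(0xC0 | (u >> 6)));
        out.push_back(static_cast<char>(0x80 | (u & 0x3F)));
    } else {                           // 3 bytes: e.g. kana and kanji
        out.push_back(static_cast<char>(0xE0 | (u >> 12)));
        out.push_back(static_cast<char>(0x80 | ((u >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (u & 0x3F)));
    }
    return out;
}
```

The kanji at 0x6F22 comes out as the three bytes E6 BC A2, and plain 'F' stays one byte. No byte of the result is 0x00 unless the input unit itself is 0x0000, so `strlen` and `strcpy` keep working on such output. The double-byte Windows code pages have the same property as far as I am aware, but there a kanji takes two bytes rather than three.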

Werner

Author Comment

by:sneeuw
> I doubt that there is anybody out there able to tell you more about it. I am trying to get that
type of information now for almost 4 years and had to work my way through the unicode issues myself,
never got any answer in any group.

hmm that's disappointing.
Anyway your help is/was greatly appreciated.

Peter

Expert Comment

by:Crane
sneeuw:
I think it is quite easy to solve your problem but I am not sure what your question is. ;-)
Are you still reading Joliet-format CDs?