Solved

Multi Byte MBCS vs. Wide Char

Posted on 2001-06-18
Last Modified: 2007-11-27
Hi,

At the moment, when I suspect (tested or not) that input-text is unicode (wide char) I convert using the routine :
WideCharToMultiByte()

This way I seem to end up with a 1-byte/char string.
(for western language text e.g. Dutch, English, ...)

- What if the input is e.g. Chinese or Hebrew or ...
With what will I end up ?? Or will the function simply fail ?
- Suppose the function succeeds ... Do I end up with a string which (can) contain(s) 1 and/or 2 byte characters ?
- Suppose I end up with a string which contains 1 or 2 byte characters (I don't care as long as it's a valid widely-system-supported data) can this Multi-Byte string contain NULL characters ?
  - In other words, can I use standard string functions such as strlen() which uses the NULL terminator and in case of Multi-Byte ... Is ONE NULL char enough as termination or should there be two NULL chars too like in Wide Char ?
- Finally, for those who are familiar with VCL & Borland cpp AnsiString's implementation :  I think I understood from the documentation that the VCL component AnsiString stores strings as MBCS ... is this correct ?

Few questions in one but all so strongly related I felt I had to post in one question.  Pls. try to 'touch' all sub-questions.  But maybe I'm completely 'missing the ball' and are the sub-questions irrelevant ?
Question by:sneeuw
LVL 1

Accepted Solution

by:
jizhang earned 75 total points
WideCharToMultiByte is used to map a wide character string to its
multibyte character string counterpart.

int WideCharToMultiByte(
    UINT CodePage,
    DWORD dwFlags,
    LPCWSTR lpWideCharStr,
    int cchWideChar,
    LPSTR lpMultiByteStr,
    int cchMultiByte,
    LPCSTR lpDefaultChar,
    LPBOOL lpUsedDefaultChar
    );

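You can convert a Unicode string to an ANSI string with this function.
But it is not a translator; it only maps each character to the target code page. For example:
wchar_t wk[] = {0x4e2d, 0x6587, 0x0000};
holds the 2 Unicode characters 0x4e2d, 0x6587, which correspond to
Chinese GB "0xd6d0 0xcec4".
WideCharToMultiByte() will not output  0xd6 0xd0 0xce 0xc4 unless the code page you pass is a Chinese one (e.g. 936). With CP_ACP on a Western system those characters have no mapping, and the default character (normally '?') is written instead.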
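
What happens to Chinese input on a Western code page can be imitated without the Win32 API. The sketch below is a hypothetical stand-in: the name narrowToLatin1 and its parameters are invented for illustration, and ISO 8859-1 is assumed as the target code page. It only mimics the default-character behaviour that lpDefaultChar and lpUsedDefaultChar expose in WideCharToMultiByte(); it is not how Windows implements the conversion.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a wide-to-narrow conversion into a single-byte
// Western code page (assumed here: ISO 8859-1, where 0x00-0xFF map 1:1).
// This is NOT the Win32 API; it only imitates the "default character" idea.
std::string narrowToLatin1(const std::u16string& wide,
                           char defaultChar,
                           bool& usedDefault)
{
    assert(defaultChar != '\0');  // a zero replacement would cut C strings short
    usedDefault = false;
    std::string narrow;
    narrow.reserve(wide.size());
    for (char16_t unit : wide) {
        if (unit <= 0xFF) {
            // Representable in the target code page: exactly one byte.
            narrow.push_back(static_cast<char>(static_cast<unsigned char>(unit)));
        } else {
            // Not representable: substitute and remember that we did so.
            narrow.push_back(defaultChar);
            usedDefault = true;
        }
    }
    return narrow;
}
```

Fed the two characters 0x4e2d, 0x6587 with '?' as the default character, this returns two question marks and sets usedDefault. The call does not fail, which is why checking the flag matters when the input may not be Western text.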

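You can use NULL as string end:
wchar_t wk[] = {0x4e2d, 0x6587, L'\0'};
where the terminator is the 16-bit value 0x0000 (writing NULL there also compiles where NULL is defined as 0, but L'\0' is the clearer spelling).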

>"standard string functions such as strlen() "
 You cannot use strlen() to get the length of a wide string, because
 wchar_t (or TCHAR in a Unicode build) is a different type from char; use wcslen() instead.
 I think that wchar_t is unsigned short int.
 You cannot use strcpy() to copy wchar_t either; use wcscpy().

CString with _T("..") literals can be either Unicode or ANSI, depending on the build.
But the number CString::GetLength() returns is a character count, which may not be the
size in bytes; to get bytes you need to multiply by sizeof(TCHAR):

    CString str = _T("Hello, World");
    archive.Write( str, str.GetLength( ) * sizeof( TCHAR ) );
as you would in a Unicode program.

Do not confuse char, TCHAR and CString; they are 3 different types.
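
The "length times sizeof(TCHAR)" rule is not specific to CString. As a minimal sketch, assuming plain zero-terminated strings instead of MFC types (byteSize is a made-up helper name), the same calculation looks like this:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Bytes occupied by the characters of a zero-terminated string, terminator
// excluded. CharT is deduced: char, wchar_t, char16_t, ...
template <typename CharT>
std::size_t byteSize(const CharT* s)
{
    assert(s != nullptr);
    // length() counts code units; multiply by the size of one unit for bytes.
    return std::char_traits<CharT>::length(s) * sizeof(CharT);
}
```

byteSize("Hello, World") gives 12, while byteSize(L"Hello, World") gives 12 * sizeof(wchar_t), which is 24 where wchar_t is 16 bits wide, as on Win32.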


Author Comment

by:sneeuw
> WideCharToMultiByte is used to map a wide character string to its multibyte character string counterpart.

OK, but WHAT is a Multi Byte ??
Is it just another way of storing a string in 2 bytes/character or 1 byte/character or mixed or ... ?

> WideCharToMultiByte() will not output  0xd6 0xd0 0xce 0xc4

What will be the output then ?

> You can use NULL as string end:
OK, but suppose I have a char* Text which is made of Multi Byte characters ... do I terminate with 1 or 2 NULL bytes ?

> I think that wchar_t is unsigned short int.
Correct !
But what if I have a MultiByte string ?
Can a MultiByte string contain NULL bytes other than the terminating one ?

LVL 30

Expert Comment

by:Axter
MultiByte is proprietary to VC++, and Microsoft itself has abandoned this string type.
Don't use it.  You can still use the functions for converting Wide to char and char to Wide.
 
LVL 30

Expert Comment

by:Axter
The MultiByte string had multiple problems with it.
The Wide string type is part of the ANSI C++ standard.
MultiByte is not.
 

Author Comment

by:sneeuw
> MultiByte is proprietary to VC++, and Microsoft itself has abandoned this string type.
> Don't use it.  You can still use the functions for converting Wide to char and char to Wide.

Thanks but this confuses me even more ?
I have the impression that Borland C++ components which use the Borland AnsiString class store text as MultiByte ?

Actually, for a minute it all looked very nice to me (hence this posting to make sure).

I get text as input which can be either single byte or 2 bytes per character.  I test for Wide char and if true I convert to Multi-Byte (Using the function WideCharToMultiByte())

For the Western texts I tested with this resulted in 1 byte characters and I could use the (earlier written) code based on strlen() etc ...
Also, this converted text (char *Text) I can present to Borland C++ Builder Objects and all works nice !!

The problem I'm facing ...
Will it also work when the Wide Chars turn out to be not Western (e.g.) Chinese ?

Will the conversion work ?
What will be the result ?
I have NO clue ?

You say Multi-Byte is obsolete ?
How would you handle this situation then ?
Taking into account that the input can be 1 or 2 bytes per character.
At the moment, the Objects I create get a name based on these texts, and these Objects have a member char *Text.
 
LVL 11

Expert Comment

by:griessh
Hi sneeuw, still struggling :-)
========================================================
Here is an excerpt from the MS help:
Single-byte and Multibyte Character Sets
The ASCII character set defines characters in the range 0x00 - 0x7F. There are a number of other character sets, primarily European, that define the characters within the range 0x00 - 0x7F identically to the ASCII character set and also define an extended character set from 0x80 - 0xFF. Thus an 8-bit, single-byte-character set (SBCS) is sufficient to represent the ASCII character set as well as the character sets for many European languages. However, some non-European character sets, such as Japanese Kanji, include many more characters than can be represented in a single-byte coding scheme, and therefore require multibyte-character set (MBCS) encoding.

Note   Many SBCS routines in the Microsoft run-time library handle multibyte bytes, characters, and strings as appropriate. Many multibyte-character sets define the ASCII character set as a subset. In many multibyte character sets, each character in the range 0x00 - 0x7F is identical to the character that has the same value in the ASCII character set. For example, in both ASCII and MBCS character strings, the one-byte NULL character ('\0') has value 0x00 and indicates the terminating null character.

A multibyte character set may consist of both one-byte and two-byte characters. Thus a multibyte-character string may contain a mixture of single-byte and double-byte characters. A two-byte multibyte character has a lead byte and a trail byte. In a particular multibyte-character set, the lead bytes fall within a certain range, as do the trail bytes. When these ranges overlap, it may be necessary to evaluate the context to determine whether a given byte is functioning as a lead byte or a trail byte.
======================================================

To deal with multibyte characters, you have to use multibyte function calls (like the mblen() in VC++ that gives you the length of your multibyte character.) You CANNOT use the regular string functions, because they rely on "1 character = 1 byte"! The system has to know how it stores the characters, so make sure you have enough storage space allocated.

One more help page:

========================================================
International Enabling

Most traditional C and C++ code makes assumptions about character and string manipulation that do not work well for international applications. While both MFC and the run-time library support Unicode or MBCS, there is still work for you to do. To guide you, this section explains the meaning of "international enabling" in Visual C++:

Both Unicode and MBCS are enabled by means of portable data types in MFC function parameter lists and return types. These types are conditionally defined in the appropriate ways, depending on whether your build defines the symbol _UNICODE or the symbol _MBCS (which means DBCS). Different variants of the MFC libraries are automatically linked with your application, depending on which of these two symbols your build defines.

Class library code uses portable run-time functions and other means to ensure correct Unicode or MBCS behavior.

You still must handle certain kinds of internationalization tasks in your code:
Use the same portable run-time functions that make MFC portable under either environment.

Make literal strings and characters portable under either environment, using the _T macro. For more information, see Generic-Text Mappings in TCHAR.H.

Take precautions when parsing strings under MBCS. These precautions are not needed under Unicode. For more information, see MBCS Programming Tips.

Take care if you mix ANSI (8-bit) and Unicode (16-bit) characters in your application. It's possible to use ANSI characters in some parts of your program and Unicode characters in others, but you cannot mix them in the same string.

Don't "hard-code" strings in your application. Instead, make them STRINGTABLE resources by adding them to the application's .rc file. Your application can then be localized without requiring source code changes or recompilation. For more information on STRINGTABLE resources, see the String Editor documentation in Visual C++ User's Guide.
Note   European and MBCS character sets have some characters, such as accented letters, with character codes greater than 0x80. Since most code uses signed characters, these characters greater than 0x80 are sign-extended when converted to int. This is a problem for array indexing because the sign-extended characters, being negative, will index outside the array.

Languages that use MBCS, such as Japanese, are also unique. Since a character may consist of one or two bytes, you should always manipulate both bytes at the same time.

======================================================
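
The parsing precaution in the excerpt above can be made concrete with Shift-JIS (Windows code page 932), where a trail byte may have the same value as an ASCII character. The following is only a sketch: the helper names are invented, and the lead-byte ranges 0x81-0x9F and 0xE0-0xFC are written out by hand instead of being taken from the run-time library.

```cpp
#include <cassert>
#include <cstddef>

// Lead-byte ranges of Shift-JIS, hard-coded for this illustration.
inline bool isSjisLeadByte(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Returns the byte offset of the first single-byte character equal to
// `target`, or -1 if there is none. Lead and trail bytes are stepped over
// together, so a trail byte that happens to equal `target` is never reported.
long findSingleByteChar(const char* text, char target)
{
    assert(text != nullptr);
    const unsigned char* p = reinterpret_cast<const unsigned char*>(text);
    for (std::size_t i = 0; p[i] != 0; ++i) {
        if (isSjisLeadByte(p[i])) {
            if (p[i + 1] == 0) break;   // truncated character at end of string
            ++i;                        // skip the trail byte as well
        } else if (p[i] == static_cast<unsigned char>(target)) {
            return static_cast<long>(i);
        }
    }
    return -1;
}
```

In the byte sequence 0x95 0x5C 'a' '\\' 'b', the second byte is the trail byte of a Kanji character but has the value of a backslash. A plain byte search reports offset 1; the lead-byte-aware search reports offset 3, the real backslash.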

and one more:

======================================================
SBCS and MBCS Data Types
Any Microsoft MBCS run-time library routine that handles only one multibyte character or one byte of a multibyte character expects an unsigned int argument (where 0x00 <= character value <= 0xFFFF and 0x00 <= byte value <= 0xFF ). An MBCS routine that handles multibyte bytes or characters in a string context expects a multibyte-character string to be represented as an unsigned char pointer.

Caution   Each byte of a multibyte character can be represented in an 8-bit char. However, an SBCS or MBCS single-byte character of type char with a value greater than 0x7F is negative. When such a character is converted directly to an int or a long, the result is sign-extended by the compiler and can therefore yield unexpected results.

Therefore it is best to represent a byte of a multibyte character as an 8-bit unsigned char. Or, to avoid a negative result, simply convert a single-byte character of type char to an unsigned char before converting it to an int or a long.
======================================================
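
The sign-extension caution can be shown with a small byte histogram. This is a sketch with a made-up function name (countBytes); the point is only the cast to unsigned char before the byte is used as an index.

```cpp
#include <cassert>
#include <cstddef>

// Counts how often each byte value occurs in a zero-terminated string.
// `histogram` must point to 256 counters.
void countBytes(const char* text, std::size_t* histogram)
{
    assert(text != nullptr && histogram != nullptr);
    for (; *text != '\0'; ++text) {
        // Where char is signed, a byte such as 0xE9 has a negative value;
        // used directly as an index it would address memory before the array.
        const unsigned char index = static_cast<unsigned char>(*text);
        ++histogram[index];
    }
}
```

With the input "caf\xE9" the counter at position 0xE9 is incremented, regardless of whether the compiler treats char as signed or unsigned.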

Good luck!

======
Werner
 
LVL 1

Expert Comment

by:jizhang
I think a multibyte string could be stored as unsigned char (just a guess).
 {0x4e2d, 0x6587, NULL};  would be 4e2d658700.

The compiler knows what "NULL" is:
in wchar_t it is 0x0000 (an unsigned short),
in char it is '\0',
in multibyte it is 0x00.
They are not the same.

>Can a MultiByte string contain NULL bytes other than the terminating one ?
  Good question.
  A control character can be in a data file; you can open a file as binary and
  read/write it one byte at a time.
  NULL bytes can be stored in an array, and a string is an array.
  But when you print or display it, it acts as a control character (or not),
  depending on how you deal with it.
  If you use SetWindowText() or TextOutW(), a control character stays a control character.
  0x2028 and 0x2029 (the Unicode line and paragraph separators) play the role of CRLF in a Unicode program.
  Similar things happen with NULL, ASCII 0x00 or '\0'.
  You need to write your code to decide what to do with those
  control characters: treat them as a value, a delimiter, a control or whatever.
 
LVL 11

Expert Comment

by:griessh
Axter

Sorry to object, but Multibyte is not VC specific. We even have it here in our X/Motif environment. Multibyte is more compact, but has some problems with handling strings. That's why wide characters are more convenient (who cares about a few more bytes for a string?).
But you are right in that, since it is much easier to deal with wide characters, people don't use Multibyte too often. And UNICODE just maps so nicely onto wide characters anyway.

======
Werner

Author Comment

by:sneeuw
>> To deal with multibyte characters, you have to use multibyte function calls (like the mblen() in VC++
that gives you the length of your multibyte character.) You CANNOT use the regular string functions,
because they rely on "1 character = 1 byte"! The system has to know how it stores the characters,
so make sure you have enough storage space allocated.

Is that really so ?
I mean, in my case I use strlen() to determine the exact amount of BYTES, not characters !!  So I know how many bytes to allocate to e.g. copy the string in !
Since, according to the (very useful) information you copied, MultiByte strings are also terminated with one single NULL byte !?  Correct ?
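
The byte-oriented use of strlen() described here can be sketched as follows. The function name duplicateBytes is made up, and the sketch assumes the usual case in which the multibyte string ends with a single 0x00 byte and contains no other zero byte.

```cpp
#include <cassert>
#include <cstring>
#include <memory>

// Duplicates a zero-terminated narrow string byte for byte. Only the byte
// count matters here, so it makes no difference whether the text is SBCS
// or MBCS, as long as a single 0x00 byte terminates it.
std::unique_ptr<char[]> duplicateBytes(const char* source)
{
    assert(source != nullptr);
    const std::size_t byteCount = std::strlen(source);      // bytes, not characters
    std::unique_ptr<char[]> copy(new char[byteCount + 1]);  // +1 for the terminator
    std::memcpy(copy.get(), source, byteCount + 1);
    return copy;
}
```

For the GB2312 bytes 0xD6 0xD0 0xCE 0xC4 strlen() reports 4 although the text has 2 characters. That is the right number for allocating and copying, and the wrong one for anything that counts or indexes characters.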


> >Can a MultiByte string contain NULL bytes other than the terminating one ?

According to the info griessh copied, MBCS also uses a NULL byte to terminate.  So I take it WideCharToMultiByte() returns a string which does not contain NULL bytes in the middle of the string, unlike what can be the case with Unicode ?
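
The difference asked about here is easy to inspect at byte level. The sketch below uses char16_t as a portable stand-in for the 16-bit wchar_t of Win32; rawBytes and countZeroBytes are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Copies the raw bytes of a zero-terminated UTF-16 string (terminator
// excluded) into a byte buffer, in the machine's own byte order.
std::vector<unsigned char> rawBytes(const char16_t* wide)
{
    assert(wide != nullptr);
    const std::size_t units = std::char_traits<char16_t>::length(wide);
    std::vector<unsigned char> bytes(units * sizeof(char16_t));
    if (!bytes.empty())
        std::memcpy(bytes.data(), wide, bytes.size());
    return bytes;
}

// Number of 0x00 bytes in a buffer.
std::size_t countZeroBytes(const std::vector<unsigned char>& bytes)
{
    std::size_t zeros = 0;
    for (unsigned char b : bytes)
        if (b == 0) ++zeros;
    return zeros;
}
```

The wide string u"AB" contains two zero bytes (one per character), so byte-oriented functions such as strlen() stop early on it. The GB2312 bytes 0xD6 0xD0 0xCE 0xC4 contain none, which matches the one-byte terminator described in the help excerpt.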
 
LVL 11

Expert Comment

by:griessh
sneeuw

For single byte characters there are the strxxx() functions, wide characters have their own function set and multibyte characters have their own function set. Whatever the implementation of a data type on your specific platform is, and however your compiler/libraries work with these data types, the type specific functions are made to work with these types.
If you use a wide character function or single character function with multibyte character strings, don't be surprised if it doesn't work. Please just accept that functions like strlen() WILL NOT work with multibyte characters.

The data type is the part of a program that tells the compiler how to deal with the bytes in memory. You can always define a float and force the compiler to look at it as an integer. They will never be the same, because their definition is different. The same goes for characters. If you don't believe that your compiler builders can handle Kanji, then find out how Kanji characters are defined as multibytes, create a string with Kanji characters, use an appropriate font and display the string.

The VC help says that VC implements the EOS character as 0x00 and that this is the same character as the ASCII NULL. It does not say that there is no NULL in the other possible multibyte characters! Do not rely on that! The multibyte string length function will tell you how many characters are in the string.

======
Werner
 
LVL 1

Expert Comment

by:jizhang
There are strlen, wcslen, _mbslen, _mbstrlen 4 functions.
Each gets the length of a string.
size_t  strlen( const char *string );
size_t  wcslen( const wchar_t *string );
size_t  _mbslen( const unsigned char *string );
size_t  _mbstrlen( const char *string );

If you can store your multibyte string in a char*, you may call strlen() on it
(it returns bytes, not characters); otherwise you cannot.

You have to use unsigned char* to store some multibyte character sets.


Author Comment

by:sneeuw
>> multibyte characters have their own function set

Which ones ?  I see sets for single byte and wide char but not for MultiByte ?  I still (sorry to doubt you guys ;-) have the impression that the normal single byte routines also apply to MultiByte strings.  As long as you (e.g. in case of strlen()) assume you're dealing with BYTES not characters.

>> The same goes for characters. If you don't believe that your
compiler builders can handle Kanji, then find out how Kanji characters are defined as multibytes, create
a string with Kanji characters, use an appropriate font and display the string.

The problem is ... I've seen it work, from a photograph in a Taiwanese magazine (they sent me the magazine because they had done an article on my soft).

I'm just not sure if it will work always ?
That's why I'm trying to understand.

Btw.  Here's how Borland declares AnsiString (the constructors of the class) :
__fastcall AnsiString();
__fastcall AnsiString(const char* src);
__fastcall AnsiString(const AnsiString& src);
__fastcall AnsiString(const char* src, unsigned char len);
__fastcall AnsiString(const wchar_t* src);
__fastcall AnsiString(int src);
__fastcall AnsiString(double src);
__fastcall AnsiString(char src);
__fastcall AnsiString(short);
__fastcall AnsiString(unsigned short);
__fastcall AnsiString(unsigned int);
__fastcall AnsiString(long);
__fastcall AnsiString(unsigned long);
__fastcall AnsiString(__int64);
__fastcall AnsiString(unsigned __int64);
__fastcall AnsiString(const WideString &src);

These constructors are apparently able to take MBCS text
provided as a pointer to char. SBCS or MBCS, what I get is correct text in the object !
Furthermore, AnsiString does not store the text as wide char but as MBCS (I think), according to Borland (found in the help text), because Win95 is not able to deal with WideChar well !??

I know ... I'm confused too ;-)))

>> There are strlen, wcslen, _mbslen, _mbstrlen 4 functions.
I could not find in the Online help (nor could I compile during a test) :
mbstrlen and mbslen
 
LVL 11

Expert Comment

by:griessh
Those mb*() functions are VC++ functions; Borland might have different names. Did you do a search for "multibyte" in your compile environment? If there is nobody out here in EE who knows about Borland and multibyte, you have to do some legwork here :-)
As for your doubts: I still remember the day when I switched my AIX box into Japanese and saw the first Japanese Xterm on the screen. It was a start!

======
Werner
 
LVL 1

Expert Comment

by:jizhang
I see. You do not use MS VC++.
MS VC++ has 3 sets of functions: for wide char, char and MB.

To store Chinese GB or BIG5 codes, you need to use
unsigned char, or int, because of the code range.

For example, in Chinese GB2312 each Chinese character
has 2 bytes, and each byte value is > 0xa0.
A Chinese GB2312 file may also contain ASCII mixed in.
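
Going only by the rule just stated (a byte above 0xA0 starts a two-byte character, anything else is a single byte), characters in a mixed GB2312/ASCII string can be counted as sketched below. countGb2312Chars is an invented name, and the sketch deliberately ignores GBK extensions and other code pages.

```cpp
#include <cassert>
#include <cstddef>

// Counts characters (not bytes) in a zero-terminated GB2312 string that may
// contain ASCII: a byte above 0xA0 is taken as the first half of a two-byte
// character, everything else as a one-byte character.
std::size_t countGb2312Chars(const char* text)
{
    assert(text != nullptr);
    const unsigned char* p = reinterpret_cast<const unsigned char*>(text);
    std::size_t chars = 0;
    while (*p != 0) {
        if (*p > 0xA0 && p[1] != 0)
            p += 2;     // lead byte plus trail byte
        else
            p += 1;     // ASCII, or a dangling lead byte at the very end
        ++chars;
    }
    return chars;
}
```

For "GB:" followed by the bytes 0xD6 0xD0 0xCE 0xC4 the byte length is 7 while the character count is 5, which is the gap between what strlen() and a multibyte-aware length function report.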

It is possible to program without wide chars
and still render with the right font. If you are interested in
 my prog (executable), which works this way,
 you may get it from
 http://www.zanmen.com/homepages/JiZhang/software/uni03.zip
 

Author Comment

by:sneeuw
OK,
Took some time but now I managed to create something which can be used as an example to work from.

In this program there are still two scenarios where I struggle with MBCS vs. Unicode vs. SBCS etc. ...

1. What I read of CD File-System tables (e.g. ISO9660) is single byte (SBCS) or dual-byte (Unicode).  At the moment all my code is still char (single byte) oriented but this seems to work well if I convert the Unicode strings to MBCS (WideCharToMultiByte).  In this case all still works great (even strlen() because I only use this fct to determine length in bytes, NOT characters).
[This converts the unicode to strings without NULL characters except a terminating NULL and works well with VCL strings (The Borland cpp Builder strings, which are MBCS internally as well)]

2. Localisation.
For this I use :
LoadString( StrRes, Identifier, Buffer, BufferSize) ;
which works great too.
Here again I struggle with ... what about e.g. Chinese.
So somebody translated the *.rc files for me and I built a dll from them.  This doesn't seem to work on a Chinese system (can't test at home).
So I thought, why not load Unicode from the dlls and use that to display (with a hex editor I saw that all dlls (e.g. the Italian one) contain unicode texts).
This Unicode I then could use to Display and would probably solve the issue ... I THINK.

For this I tried the unicode variant (notice the W) :
LoadStringW( StrRes, Identifier, Buffer, BufferSize) ;
(Buffer in this case is a wchar_t pointer)

This doesn't seem to work (at all).  Even for the Spanish or whatever dll there's no unicode extracted from the dlls
The function fails with (from GetLastError()) code 120 which doesn't tell me anything.

Bright ideas appreciated !

Thanks,

For all the files etc. go to www.isobuster.com, scroll down and follow the link to 0.99.7.2 and you'll see what I mean.
 
LVL 5

Expert Comment

by:Netminder
sneeuw,

These questions are still open and our records show you logged in recently. Please resolve them appropriately as soon as possible. Continued disregard of your open questions will result in the force/acceptance of a comment as an answer; other actions affecting your account may also be taken. I will revisit these questions in approximately seven (7) days. Please note that the recommended minimum for an "Easy" question is 50 points.
http://experts-exchange.com/jsp/qShow.jsp?ta=winprog&qid=20183446
http://experts-exchange.com/jsp/qShow.jsp?ta=winprog&qid=20158806
http://experts-exchange.com/jsp/qShow.jsp?ta=cplusprog&qid=20192985
http://experts-exchange.com/jsp/qShow.jsp?ta=cplusprog&qid=20151309
http://experts-exchange.com/jsp/qShow.jsp?ta=cplusprog&qid=20137274
http://experts-exchange.com/jsp/qShow.jsp?ta=cplusprog&qid=20136466
http://experts-exchange.com/jsp/qShow.jsp?ta=delphi&qid=20088277
http://experts-exchange.com/jsp/qShow.jsp?ta=javascript&qid=20183228

EXPERTS: Please leave your thoughts on this question here.

Thanks,

Netminder
Community Support Moderator
Experts Exchange
 
LVL 11

Expert Comment

by:griessh
Sorry, but as in all my discussions with sneeuw about UNICODE, wide characters and multibyte strings, I claim this question for me. I gave sneeuw ALL :-) the available information about this topic and he still insists it works differently. Having done internationalization for several years now (I have to deal with Cyrillic, Turkish, ISO Latin, UNICODE, kanji, hiragana, katakana etc.) I know the pitfalls of this topic. No more theoretical discussion is needed; without some code that fails we won't be able to help the asker any further.

======
Werner
 
LVL 5

Expert Comment

by:Netminder
Force/accepted by

Netminder
Community Support Moderator
Experts Exchange

griessh: points for you at http://www.experts-exchange.com/jsp/qShow.jsp?ta=cplusprog&qid=20270940