MultiByteToWideChar

The MultiByteToWideChar() function behaves differently when the code page is changed from CP_ACP to CP_UTF8.

The input string "Grüezi zäme" is to be converted to wide characters.

It works fine when I use CP_ACP, but when I use CP_UTF8 it removes the special characters "ü" and "ä".

I am using CP_UTF8 because MSDN suggests it for consistent results.



davinder101Asked:
Gideon7Commented:
UTF8 is a complete encoding of Unicode (UCS), which encompasses the codepoints for almost every known written language in the world.  CP_ACP selects the default ANSI code page for the current system locale.  Although it varies by language (e.g., ACP=1252 = Windows-1252 for Western Latin), I am not aware of a single Windows code page with a character that is not representable by at least one UCS codepoint.  This includes all current ideographic languages (Chinese, Korean, Japanese, etc).
The Unicode committee limited UCS to 17 planes (U+0000 through U+10FFFF), specifically so that UTF8 needs at most four octets per codepoint.  So your statement is incorrect.  There is definitely a mapping from any conceivable CP_ACP code page to UTF8 (UCS).
The problem is that the user is entering a fixed literal string using the ANSI codepage into the MultiByteToWideChar function.  Changing the input codepage to CP_UTF8 requires also changing the input string to UTF8 before submitting it to MultiByteToWideChar.
 
Gideon7Commented:
Assuming that CP_ACP is Latin1 (Windows-1252 or ISO 8859-1), the input string "Grüezi zäme" is encoded as the octet stream 47 72 fc 65 7a 69 20 7a e4 6d 65.   For CP_UTF8 the input string is encoded as the octet stream 47 72 c3 bc 65 7a 69 20 7a c3 a4 6d 65.
Verify that your input to MultiByteToWideChar matches the given octet strings shown above for the respective code page argument (CP_xxx).
Note that CP_UTF8 does not allow the use of any special flags.  That is, the argument dwFlags must be zero.
 
bharatpurCommented:
The UTF-8 standard works on a multi-byte character set, i.e. the conversion unit taken as input can be one or more bytes depending on the value of the initial byte.  There is no one-to-one mapping of special characters from CP_ACP to CP_UTF8 beyond the ASCII range (0-127).  "ü" and "ä", i.e. (252, 228), fall outside that range, so you cannot expect "ü" and "ä" to convert to the same glyph under CP_UTF8.
 
bharatpurCommented:
Hi Gideon,
Can you please suggest how to change the input string to UTF-8 before submitting it to MultiByteToWideChar?
 
bharatpurCommented:
ü is the octet FC.  According to UTF-8 the values FC-FD are not accepted (invalid octets); they are restricted by RFC 3629 as the start of a 6-byte sequence.

ä is the octet E4.  Although E4 is a valid UTF-8 lead octet, it marks the start of a three-byte sequence.  But here ä is followed by m, which is always read as a single octet, so the combination "äm" is not a valid UTF-8 sequence.

So invalid octets and their combinations are dropped during the conversion from multibyte to wide characters; that is why you cannot see ü and ä in the wide-char string.
 
Gideon7Commented:
No.  You are confusing the multi-byte UTF8 representation with the one-byte Windows-1252 (ISO-8859-1) representation.
In UTF8, ü is represented by two octets c3 bc.  ä is represented by two octets c3 a4.
The input must be the multibyte values c3 bc or c3 a4.  Not the one byte values fc or e4.
The following code is wrong:
LPCSTR szName = "Grüezi zäme";  // ANSI (ISO-8859-1) literal
WCHAR wszOut[NCHARS];
// Wrong - the literal is ANSI, so the code page must be CP_ACP (or 1252), not CP_UTF8.
::MultiByteToWideChar(CP_UTF8, 0, szName, -1, wszOut, NCHARS);
You need to use CP_ACP or 1252, not CP_UTF8.  Octets above 7f in a plain C/C++ string literal take whatever encoding the source file is saved in - here ANSI, not UTF8.
 
bharatpurCommented:
I am not confused.  I gave an explanation of how "Grüezi zäme", i.e. (47 72 fc 65 7a 69 20 7a e4 6d 65), contradicts the UTF-8 encoding principles.
 
Gideon7Commented:
My answer is correct.
Question has a verified solution.
