Posted on 2009-03-30
Last Modified: 2013-11-20
MultiByteToWideChar() behaves differently when the code page is changed from CP_ACP to CP_UTF8

The input string "Grüezi zäme" is to be converted to wide characters.

It works fine when I use CP_ACP,

but when I use CP_UTF8 the special characters

"ü" and "ä" are removed.

I am using CP_UTF8 because MSDN suggests it for consistent results.

Question by:davinder101

Expert Comment

ID: 24027630
Assuming that CP_ACP is Latin1 (Windows-1252 or ISO 8859-1), the input string "Grüezi zäme" is encoded as the octet stream 47 72 fc 65 7a 69 20 7a e4 6d 65.   For CP_UTF8 the input string is encoded as the octet stream 47 72 c3 bc 65 7a 69 20 7a c3 a4 6d 65.
Verify that your input to MultiByteToWideChar matches the octet stream shown above for the respective code page argument (CP_xxx).
Note that CP_UTF8 does not allow most of the special flags.  That is, the argument dwFlags must be zero (or, on Windows Vista and later, MB_ERR_INVALID_CHARS).
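One way to do that check is to dump the bytes of the buffer just before the call. The following is only a minimal sketch in portable C++; the helper name toHex is made up for illustration and is not part of any Windows API.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: render every byte of a narrow string as two
// lowercase hex digits, separated by single spaces.
std::string toHex(const std::string& bytes) {
    std::string out;
    char buf[4];
    for (unsigned char b : bytes) {
        if (!out.empty()) out += ' ';
        std::snprintf(buf, sizeof buf, "%02x", b);
        out += buf;
    }
    return out;
}
```

If the dump of your input shows fc and e4, the buffer is Latin1 and belongs with CP_ACP (or 1252). If it shows c3 bc and c3 a4, it is UTF8 and belongs with CP_UTF8.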

Expert Comment

ID: 24294840
UTF-8 is a multibyte encoding, i.e. the unit taken as input can be one or more bytes, depending on the value of the initial byte. There is no one-to-one mapping of special characters between CP_ACP and CP_UTF8 beyond the ASCII range 0 to 127.  "ü" and "ä", i.e. 252 and 228, fall outside that range, so you can't expect "ü" and "ä" to convert to the same glyphs under UTF8.

Accepted Solution

Gideon7 earned 500 total points
ID: 24295624
UTF8 is isomorphic to Unicode (UCS), which encompasses the codepoints for almost every known language in the world.  CP_ACP selects the system's default ANSI code page.  Although it varies by language (e.g., ACP = 1252 = Windows-1252 for Western Latin), I am not aware of a single Windows language for which a glyph is not representable by at least one UCS codepoint.  This includes all current ideographic languages (Chinese, Korean, Japanese, etc).
UCS is limited to the first 17 planes (codepoints U+0000 through U+10FFFF), so a UTF8 encoding needs at most four octets.  So your statement is incorrect.  There is definitely a mapping from any conceivable CP_ACP code page to UTF8 (UCS).
The problem is that the user is entering a fixed literal string using the ANSI codepage into the MultiByteToWideChar function.  Changing the input codepage to CP_UTF8 requires also changing the input string to UTF8 before submitting it to MultiByteToWideChar.
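To illustrate what "changing the input string to UTF8" means at the byte level, here is a rough sketch in portable C++. It assumes the input is ISO 8859-1, where every byte value equals its Unicode codepoint; Windows-1252 differs in the range 80 to 9f, so this is not a general CP_ACP converter. The function name latin1ToUtf8 is made up for illustration. On Windows the general route would be MultiByteToWideChar with CP_ACP followed by WideCharToMultiByte with CP_UTF8.

```cpp
#include <string>

// Hypothetical helper: re-encode ISO 8859-1 bytes as UTF-8.
// Assumption: the input really is ISO 8859-1, so byte value == codepoint.
std::string latin1ToUtf8(const std::string& latin1) {
    std::string utf8;
    utf8.reserve(latin1.size() * 2);
    for (unsigned char c : latin1) {
        if (c < 0x80) {
            // ASCII is encoded as itself.
            utf8 += static_cast<char>(c);
        } else {
            // Codepoints 0x80..0xFF take two octets: 110xxxxx 10xxxxxx.
            utf8 += static_cast<char>(0xC0 | (c >> 6));
            utf8 += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return utf8;
}
```

With this, the 11-octet Latin1 string becomes the 13-octet UTF8 string, and fc turns into c3 bc. Note that if the goal is only a wide string, converting to UTF8 first is unnecessary: passing the original ANSI bytes with CP_ACP already gives the right result.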

Expert Comment

ID: 24302309
Hi Gideon
Can you please suggest how to change the input string to UTF8 before submitting it to MultiByteToWideChar?

Expert Comment

ID: 24303360
ü is the octet FC. In UTF8 the values FC-FD are invalid octets: RFC 3629 prohibits them because they used to start 6-byte sequences.

ä is the octet E4. This is a valid UTF8 octet, but it marks the start of a 3-byte character. Here ä is followed by m (6D), which is always a single-octet character and never a continuation octet. So the combination äm is not a valid UTF8 encoding.

So invalid octets and their combinations are left out during conversion from the multibyte to the widechar string; that's why you are unable to see ü and ä in the widechar string.
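The byte rules described in this comment can be sketched as a small validity check in portable C++. This is my own illustration following RFC 3629, not the code MultiByteToWideChar runs, and the name isValidUtf8 is made up. As far as I know, what Windows does with invalid octets also depends on the version: older systems drop them, while Vista and later substitute U+FFFD unless MB_ERR_INVALID_CHARS is set.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if the bytes form well-formed UTF-8
// according to RFC 3629 (no overlong forms, no surrogates, max U+10FFFF).
bool isValidUtf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char lead = static_cast<unsigned char>(s[i]);
        if (lead < 0x80) { ++i; continue; }  // single-octet ASCII

        std::size_t extra;   // number of continuation octets expected
        unsigned long cp;    // codepoint being assembled
        if (lead >= 0xC2 && lead <= 0xDF)      { extra = 1; cp = lead & 0x1F; }
        else if (lead >= 0xE0 && lead <= 0xEF) { extra = 2; cp = lead & 0x0F; }
        else if (lead >= 0xF0 && lead <= 0xF4) { extra = 3; cp = lead & 0x07; }
        else return false;  // 80..BF, C0, C1, F5..FF can never start a sequence

        if (s.size() - i <= extra) return false;  // sequence is cut off
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char cont = static_cast<unsigned char>(s[i + k]);
            if ((cont & 0xC0) != 0x80) return false;  // not 10xxxxxx
            cp = (cp << 6) | (cont & 0x3F);
        }
        if (extra == 2 && cp < 0x800) return false;                        // overlong
        if (extra == 3 && (cp < 0x10000 || cp > 0x10FFFF)) return false;   // overlong or too large
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;                    // surrogate
        i += extra + 1;
    }
    return true;
}
```

Fed the Latin1 octets 47 72 fc ..., the check fails at fc. Fed "z" e4 "me", it fails because 6d is not a continuation octet.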

Expert Comment

ID: 24306134
No.  You are confusing the multi-byte UTF8 representation with the one-byte Windows-1252 (ISO-8859-1) representation.
In UTF8, ü is represented by two octets c3 bc.  ä is represented by two octets c3 a4.
The input must be the multibyte values c3 bc or c3 a4.  Not the one byte values fc or e4.
The following code is wrong:
LPCSTR szName = "Grüezi zäme";  // ISO-8859-1
// Wrong - CP_UTF8 should be 1252.
::MultiByteToWideChar(CP_UTF8, 0, szName, -1, wszOut, NCHARS);
You need to use CP_ACP or 1252, not CP_UTF8.  Octets above 7f are not portable in C/C++ literal strings; here the compiler typically stores them in the ANSI code page, not as UTF8.
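For comparison, here is a hedged sketch of what a UTF8 to UTF-16 conversion does with the correct two-octet input. It is portable C++ written only for illustration; the name utf8ToUtf16 is made up, it is not how MultiByteToWideChar is implemented, and it is simplified in that it does not reject overlong forms or encoded surrogates.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decode UTF-8 octets into UTF-16 code units.
// Returns false on malformed input; 'out' may then hold a partial result.
bool utf8ToUtf16(const std::string& in, std::u16string& out) {
    out.clear();
    for (std::size_t i = 0; i < in.size();) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        std::size_t extra;  // continuation octets that must follow
        char32_t cp;        // codepoint being assembled
        if (lead < 0x80)                { extra = 0; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }
        else return false;  // stray continuation octet or F8..FF

        if (in.size() - i <= extra) return false;  // sequence is cut off
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char cont = static_cast<unsigned char>(in[i + k]);
            if ((cont & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (cont & 0x3F);
        }
        if (cp > 0x10FFFF) return false;

        if (cp >= 0x10000) {
            // Outside the BMP: emit a surrogate pair.
            cp -= 0x10000;
            out += static_cast<char16_t>(0xD800 + (cp >> 10));
            out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        } else {
            out += static_cast<char16_t>(cp);
        }
        i += extra + 1;
    }
    return true;
}
```

To keep the input independent of the source file's encoding, the octets can be spelled with hex escapes, e.g. "Gr\xc3\xbc" "ezi z\xc3\xa4me". The literal is split after \xbc because the following e would otherwise be read as another hex digit.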

Expert Comment

ID: 24311453
I am not confused. I gave the explanation of how Grüezi zäme, i.e. (47 72 fc 65 7a 69 20 7a e4 6d 65), contradicts the UTF8 encoding principle.

Expert Comment

ID: 25237991
My answer is correct.
