Posted on 2009-03-30
Last Modified: 2013-11-20
MultiByteToWideChar() behaves differently when the code page is changed from CP_ACP to CP_UTF8

The input string is "Grüezi zäme", which I want to convert to wide characters.

It works fine when I use CP_ACP.

But when I use CP_UTF8, it removes the special characters "ü" and "ä".

I am using CP_UTF8 because MSDN suggests using it for consistent results.

Question by:davinder101

Expert Comment

ID: 24027630
Assuming that CP_ACP is Latin1 (Windows-1252 or ISO 8859-1), the input string "Grüezi zäme" is encoded as the octet stream 47 72 fc 65 7a 69 20 7a e4 6d 65.   For CP_UTF8 the input string is encoded as the octet stream 47 72 c3 bc 65 7a 69 20 7a c3 a4 6d 65.
Verify that your input to MultiByteToWideChar matches the given octet strings shown above for the respective code page argument (CP_xxx).
Note that CP_UTF8 does not allow most of the special flags. The argument dwFlags must be either zero or MB_ERR_INVALID_CHARS.
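To check which octets are actually being handed to MultiByteToWideChar, it can help to dump the input buffer as hex and compare it with the two octet streams listed above. The following is only a minimal sketch in portable C++; `to_hex` and `check_to_hex` are hypothetical helper names, not part of the Windows API, and the escaped literals stand in for whatever buffer your program really passes.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper: renders every octet of a narrow string as two
// lowercase hex digits, separated by single spaces.
std::string to_hex(const std::string& bytes) {
    std::string out;
    char buf[4];
    for (unsigned char c : bytes) {
        if (!out.empty()) out += ' ';
        std::snprintf(buf, sizeof buf, "%02x", static_cast<unsigned>(c));
        out += buf;
    }
    return out;
}

// Usage example: the same text in the two encodings discussed above.
// The literals are split after \xfc and \xbc so that the following 'e'
// is not swallowed by the hex escape.
void check_to_hex() {
    // Windows-1252 / ISO 8859-1 octets
    assert(to_hex("Gr\xfc" "ezi z\xe4me") ==
           "47 72 fc 65 7a 69 20 7a e4 6d 65");
    // UTF-8 octets
    assert(to_hex("Gr\xc3\xbc" "ezi z\xc3\xa4me") ==
           "47 72 c3 bc 65 7a 69 20 7a c3 a4 6d 65");
}
```

If the dump of your real input shows fc and e4, the buffer is in the ANSI code page and CP_UTF8 is the wrong code page argument for it.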

Expert Comment

ID: 24294840
UTF-8 is a multi-byte encoding, i.e. the conversion unit taken as input can be one byte or several bytes, depending on the value of the initial byte. Beyond the ASCII range (0 to 127), there is no one-to-one mapping of special characters between CP_ACP and CP_UTF8. "ü" and "ä", i.e. 252 and 228, fall outside that range, so you can't expect "ü" and "ä" to convert to the same glyph in UTF-8.

Accepted Solution

Gideon7 earned 500 total points
ID: 24295624
UTF-8 can encode every code point of Unicode (UCS), which encompasses the code points for almost every known language in the world. CP_ACP selects the system's default ANSI code page. Although that code page varies by language (e.g., ACP 1252 = Windows-1252 for Western European languages), I am not aware of a single Windows language with a glyph that is not representable by at least one UCS code point. This includes all current ideographic languages (Chinese, Korean, Japanese, etc.).
Unicode is limited to 17 planes (U+0000 through U+10FFFF), and UTF-8 encodes any of those code points using at most four octets. So your statement is incorrect. There is definitely a mapping from any conceivable CP_ACP code page to UTF-8 (UCS).
The problem is that the user is passing a fixed literal string, encoded in the ANSI code page, to the MultiByteToWideChar function. Changing the input code page to CP_UTF8 requires also changing the input string to UTF-8 before submitting it to MultiByteToWideChar.
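To illustrate what "changing the input string to UTF-8" means at the octet level, here is a minimal portable sketch. It assumes the input is ISO 8859-1, where each octet equals its Unicode code point; Windows-1252 differs from that in the range 80-9F, so this is not a general CP_ACP converter. `latin1_to_utf8` and `check_latin1_to_utf8` are hypothetical names introduced for illustration. On Windows, the more usual route would be to convert to UTF-16 with MultiByteToWideChar using CP_ACP and, if UTF-8 octets are really needed, convert onward with WideCharToMultiByte using CP_UTF8.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: re-encodes ISO 8859-1 octets as UTF-8.
// Assumption: every input octet is its own code point (true for
// ISO 8859-1, not for Windows-1252 octets 0x80-0x9F).
std::string latin1_to_utf8(const std::string& latin1) {
    std::string utf8;
    for (unsigned char c : latin1) {
        if (c < 0x80) {
            // ASCII is identical in both encodings.
            utf8 += static_cast<char>(c);
        } else {
            // Code points 0x80-0xFF need two octets: 110xxxxx 10xxxxxx.
            utf8 += static_cast<char>(0xC0 | (c >> 6));
            utf8 += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return utf8;
}

// Usage example: fc becomes c3 bc, e4 becomes c3 a4.
void check_latin1_to_utf8() {
    assert(latin1_to_utf8("Gr\xfc" "ezi z\xe4me") ==
           "Gr\xc3\xbc" "ezi z\xc3\xa4me");
}
```

The result of `latin1_to_utf8` is the kind of buffer that CP_UTF8 expects as input.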

Expert Comment

ID: 24302309
Hi Gideon
Can you please suggest how to change the input string to UTF-8 before submitting it to MultiByteToWideChar?

Expert Comment

ID: 24303360
ü is the octet FC. In UTF-8, the values FC-FD are invalid octets: they marked the start of a 6-byte sequence in the original design and are prohibited by RFC 3629.

ä is the octet E4. This is a valid UTF-8 octet, but it marks the start of a three-byte sequence and must be followed by two continuation octets (80-BF). Here ä is followed by m (6D), which is always a single-octet character, so the combination äm is not a valid UTF-8 encoding.

So invalid octets and their combinations are left out during the conversion from the multibyte string to the wide-character string, which is why you are unable to see ü and ä in the wide-character string.
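The byte-level reasoning above can be sketched as a small structural check. This is a simplified illustration in portable C++, not how MultiByteToWideChar is implemented: it only checks lead octets and continuation octets, and it does not reject overlong three- and four-byte forms or surrogate code points. `is_valid_utf8` and `check_is_valid_utf8` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: simplified structural UTF-8 check.
// It verifies that every lead octet is allowed by RFC 3629 and is
// followed by the right number of continuation octets (10xxxxxx).
bool is_valid_utf8(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t extra = 0;
        if (lead < 0x80) extra = 0;                       // ASCII
        else if (lead >= 0xC2 && lead <= 0xDF) extra = 1; // two-octet form
        else if (lead >= 0xE0 && lead <= 0xEF) extra = 2; // three-octet form
        else if (lead >= 0xF0 && lead <= 0xF4) extra = 3; // four-octet form
        else return false; // 80-C1 and F5-FF (including FC) cannot lead
        if (extra > n - i - 1) return false;              // truncated sequence
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;         // not a continuation
        }
        i += extra + 1;
    }
    return true;
}

// Usage example with the octet streams from this thread.
void check_is_valid_utf8() {
    assert(!is_valid_utf8("Gr\xfc" "ezi z\xe4me"));       // ANSI octets
    assert(!is_valid_utf8("z\xe4me"));                    // e4 followed by 'm'
    assert(is_valid_utf8("Gr\xc3\xbc" "ezi z\xc3\xa4me")); // UTF-8 octets
}
```

Running a check like this on the input buffer before calling MultiByteToWideChar with CP_UTF8 shows quickly whether the data is in the wrong encoding.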

Expert Comment

ID: 24306134
No.  You are confusing the multi-byte UTF8 representation with the one-byte Windows-1252 (ISO-8859-1) representation.
In UTF8, ü is represented by two octets c3 bc.  ä is represented by two octets c3 a4.
The input must be the multibyte values c3 bc or c3 a4.  Not the one byte values fc or e4.
The following code is wrong:
LPCSTR szName = "Grüezi zäme";  // ISO-8859-1
// Wrong - CP_UTF8 should be 1252.
::MultiByteToWideChar(CP_UTF8, 0, szName, -1, wszOut, NCHARS);
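You need to use CP_ACP or 1252, not CP_UTF8.  A narrow string literal containing characters above 7f is not stored as UTF-8 here; by default the compiler encodes it in the ANSI code page.  To pass UTF-8 input instead, spell out the octets with escape sequences, e.g. "Gr\xc3\xbc" "ezi z\xc3\xa4me".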

Expert Comment

ID: 24311453
I am not confused. I explained how Grüezi zäme, i.e. (47 72 fc 65 7a 69 20 7a e4 6d 65), contradicts the UTF-8 encoding rules.

Expert Comment

ID: 25237991
My answer is correct.
