Solved

MultiByteToWideChar

Posted on 2009-03-30
Last Modified: 2013-11-20
The MultiByteToWideChar() function behaves differently when the code page is changed from CP_ACP to CP_UTF8.

The input string to be converted to wide characters is Grüezi zäme.

It works fine when I use CP_ACP,

but when I use CP_UTF8 it removes the special characters

"ü" and "ä".


I am using CP_UTF8 because MSDN suggests using CP_UTF8 for consistent results.



Question by:davinder101
 
LVL 12

Expert Comment

by:Gideon7
ID: 24027630
Assuming that CP_ACP is Latin1 (Windows-1252 or ISO 8859-1), the input string "Grüezi zäme" is encoded as the octet stream 47 72 fc 65 7a 69 20 7a e4 6d 65. For CP_UTF8 the input string is encoded as the octet stream 47 72 c3 bc 65 7a 69 20 7a c3 a4 6d 65.
Verify that your input to MultiByteToWideChar matches the octet stream shown above for the respective code page argument (CP_xxx).
Note that CP_UTF8 does not allow most of the special flags. That is, the argument dwFlags must be either zero or MB_ERR_INVALID_CHARS.
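One way to do the verification suggested above is to dump the buffer as hex just before the call. The following is a minimal sketch in portable C++; HexDump is a hypothetical helper name, not something from the Windows API or from this thread.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: render each octet of a buffer as two lowercase hex
// digits separated by spaces, e.g. "47 72 fc".
std::string HexDump(const char* data, std::size_t len)
{
    std::string out;
    char buf[4];
    for (std::size_t i = 0; i < len; ++i) {
        // Go through unsigned char so octets >= 0x80 are not sign-extended.
        const unsigned value = static_cast<unsigned char>(data[i]);
        std::snprintf(buf, sizeof buf, "%02x", value);
        if (i != 0) out += ' ';
        out += buf;
    }
    return out;
}
```

Calling HexDump(szName, strlen(szName)) on the string you are about to convert and comparing the result with the two octet streams listed above tells you which code page argument the data actually matches.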
 

Expert Comment

by:bharatpur
ID: 24294840
The UTF-8 standard is a multibyte character encoding, i.e. the unit taken as input for conversion can be one byte or two (or more) bytes depending on the initial byte value. There is no one-to-one mapping of special characters between CP_ACP and CP_UTF8 beyond the ASCII range 0 to 127. "ü" and "ä", i.e. (252, 228), fall outside that range, so you can't expect "ü" and "ä" to convert to the same glyph in UTF-8.
 
LVL 12

Accepted Solution

by:
Gideon7 earned 500 total points
ID: 24295624
UTF-8 is isomorphic to Unicode (UCS), which encompasses the code points for almost every known language in the world. CP_ACP selects the system default ANSI code page. Although that code page varies by language (e.g., ACP = 1252 = Windows-1252 for Western Latin), I am not aware of a single Windows code page containing a character that is not representable by at least one UCS code point. This includes all current ideographic languages (Chinese, Korean, Japanese, etc.).
Unicode is limited to 17 planes (U+0000 through U+10FFFF), so every code point has a UTF-8 encoding of at most four octets. So your statement is incorrect. There is definitely a mapping from any conceivable CP_ACP code page to UTF-8 (UCS).
The problem is that the user is passing a fixed literal string encoded in the ANSI code page to the MultiByteToWideChar function. Changing the input code page to CP_UTF8 also requires changing the input string to UTF-8 before submitting it to MultiByteToWideChar.
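As a rough illustration of what "changing the input string to UTF-8" means at the byte level, here is a minimal sketch that re-encodes ISO 8859-1 bytes as UTF-8. It assumes the input really is ISO 8859-1, where each byte value equals its code point; Windows-1252 differs in the range 80 to 9f, so this is not a general CP_ACP converter. Latin1ToUtf8 is a hypothetical helper name.

```cpp
#include <string>

// Hypothetical helper: re-encode ISO 8859-1 bytes as UTF-8.
// Assumption: every input byte is an ISO 8859-1 character, so the byte
// value is also the Unicode code point (U+0000..U+00FF).
std::string Latin1ToUtf8(const std::string& in)
{
    std::string out;
    out.reserve(in.size() * 2);
    for (std::string::size_type i = 0; i < in.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(in[i]);
        if (c < 0x80) {
            out += static_cast<char>(c);                  // ASCII: one octet, unchanged
        } else {
            out += static_cast<char>(0xC0 | (c >> 6));    // lead octet 110xxxxx
            out += static_cast<char>(0x80 | (c & 0x3F));  // continuation 10xxxxxx
        }
    }
    return out;
}
```

With this, fc becomes c3 bc and e4 becomes c3 a4. For an arbitrary ANSI code page on Windows, the usual route is probably MultiByteToWideChar with CP_ACP followed by WideCharToMultiByte with CP_UTF8, although once you have the wide string from the first call there is normally no need to go back to UTF-8 at all.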
 

Expert Comment

by:bharatpur
ID: 24302309
Hi Gideon,
Can you please suggest how to change the input string to UTF-8 before submitting it to MultiByteToWideChar?

 

Expert Comment

by:bharatpur
ID: 24303360
ü is represented by the octet FC. In UTF-8 the values FC-FD are invalid octets: they used to introduce a 6-byte sequence, which RFC 3629 no longer allows.

ä is represented by the octet E4. This is a valid UTF-8 octet, but it marks the start of a 3-byte character, so it must be followed by two continuation octets. Here ä is followed by m, which is always a single-octet character, so the combination äm is not a valid UTF-8 encoding.

So the invalid octets and their combinations are left out during conversion from the multibyte string to the wide-char string. That's why you are unable to see ü and ä in the wide-char string.
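The byte-level reasoning can be checked with a small structural test. This is a sketch only: it classifies the lead octet using the RFC 3629 ranges (FC cannot start any sequence, E4 starts a three-octet sequence) and verifies that the right number of continuation octets follow. It deliberately does not check the finer rules such as overlong three- and four-octet forms or surrogates. Both function names are hypothetical.

```cpp
#include <cstddef>

// Hypothetical helper: length of the sequence that `lead` starts under
// RFC 3629, or 0 if `lead` cannot start a sequence.
int Utf8SequenceLength(unsigned char lead)
{
    if (lead <= 0x7F) return 1;                  // ASCII
    if (lead >= 0xC2 && lead <= 0xDF) return 2;  // 110xxxxx
    if (lead >= 0xE0 && lead <= 0xEF) return 3;  // 1110xxxx
    if (lead >= 0xF0 && lead <= 0xF4) return 4;  // 11110xxx
    return 0;  // 80..BF continuation, C0/C1 overlong, F5..FF never valid
}

// Structural check only: each lead octet must be followed by the expected
// number of continuation octets (10xxxxxx).
bool LooksLikeUtf8(const char* data, std::size_t len)
{
    std::size_t i = 0;
    while (i < len) {
        const int n = Utf8SequenceLength(static_cast<unsigned char>(data[i]));
        if (n == 0 || static_cast<std::size_t>(n) > len - i) return false;
        for (int k = 1; k < n; ++k) {
            const unsigned char c = static_cast<unsigned char>(data[i + k]);
            if ((c & 0xC0) != 0x80) return false;  // not a continuation octet
        }
        i += static_cast<std::size_t>(n);
    }
    return true;
}
```

Run against the octets 47 72 fc 65 7a 69 20 7a e4 6d 65 this check fails at fc, and it would also fail at e4 6d, while the octets 47 72 c3 bc 65 7a 69 20 7a c3 a4 6d 65 pass.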
 
LVL 12

Expert Comment

by:Gideon7
ID: 24306134
No. You are confusing the multi-byte UTF-8 representation with the one-byte Windows-1252 (ISO-8859-1) representation.
In UTF-8, ü is represented by the two octets c3 bc, and ä is represented by the two octets c3 a4.
With CP_UTF8 the input must contain the multibyte values c3 bc and c3 a4, not the one-byte values fc and e4.
The following code is wrong:
LPCSTR szName = "Grüezi zäme";  // narrow literal, stored as ISO-8859-1 octets
WCHAR wszOut[NCHARS];
// Wrong: the code page should be CP_ACP or 1252, not CP_UTF8.
::MultiByteToWideChar(CP_UTF8, 0, szName, -1, wszOut, NCHARS);
You need to use CP_ACP or 1252, not CP_UTF8. Characters above 7f in a narrow C/C++ string literal are not portable: Visual C++ typically stores them in the ANSI code page, not as UTF-8.
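For comparison, here is a hedged sketch of the two pairings that should work: Latin-1 octets with code page 1252, and UTF-8 octets with CP_UTF8. Hex escapes spell the octets explicitly, so the result does not depend on how the source file is saved. The names kLatin1, kUtf8 and ToWide are illustrative and not from this thread, and the Windows-only part is guarded so the rest compiles elsewhere.

```cpp
#include <cstddef>
#include <string>

// Same text in two encodings, written with hex escapes. The literals are
// split after the escapes so a following 'e' is not read as a hex digit.
static const char kLatin1[] = "Gr\xfc" "ezi z\xe4me";          // 11 octets + NUL
static const char kUtf8[]   = "Gr\xc3\xbc" "ezi z\xc3\xa4me";  // 13 octets + NUL

#ifdef _WIN32
#include <windows.h>

// Hypothetical helper: convert a NUL-terminated byte string in the given
// code page to UTF-16. Returns an empty string on failure.
std::wstring ToWide(const char* bytes, UINT codePage)
{
    // MB_ERR_INVALID_CHARS asks for a failure on malformed input instead
    // of having the bad octets dropped or replaced.
    int n = ::MultiByteToWideChar(codePage, MB_ERR_INVALID_CHARS,
                                  bytes, -1, NULL, 0);
    if (n <= 0) return std::wstring();
    std::wstring out(static_cast<std::size_t>(n), L'\0');
    n = ::MultiByteToWideChar(codePage, MB_ERR_INVALID_CHARS,
                              bytes, -1, &out[0], n);
    if (n <= 0) return std::wstring();
    out.resize(static_cast<std::size_t>(n) - 1);  // drop the terminating NUL
    return out;
}
#endif
```

Under these assumptions, ToWide(kLatin1, 1252) and ToWide(kUtf8, CP_UTF8) should yield the same wide string, while ToWide(kLatin1, CP_UTF8) should fail, with GetLastError() reporting ERROR_NO_UNICODE_TRANSLATION.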
 

Expert Comment

by:bharatpur
ID: 24311453
I am not confused. I explained how Grüezi zäme, i.e. (47 72 fc 65 7a 69 20 7a e4 6d 65), contradicts the UTF-8 encoding rules.
 
LVL 12

Expert Comment

by:Gideon7
ID: 25237991
My answer is correct.
