?
Solved

MultiByteToWideChar

Posted on 2009-03-30
10
Medium Priority
?
2,595 Views
Last Modified: 2013-11-20
MultiByteToWideChar() Function behave differently when code page is changed from CP_ACP to CP_UTF8

the input string is Grüezi zäme to be converted into widechar

work fine when use CP_ACP

but when i use CP_UTF8 it removes special chars

"ü" and "ä"


using CP_UTF8 as MSDN suggest we should use CP_UTF8 for consistent result



0
Comment
Question by:davinder101
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 4
  • 4
10 Comments
 
LVL 12

Expert Comment

by:Gideon7
ID: 24027630
Assuming that CP_ACP is Latin1 (Windows-1252 or ISO 8859-1), the input string "Grüezi zäme" is encoded as the octet stream 47 72 fc 65 7a 69 20 7a e4 6d 65.   For CP_UTF8 the input string is encoded as the octet stream 47 72 c3 bc 65 7a 69 20 7a c3 a4 6d 65.
Verify that your input to MultiByteToWideChar matches the given octet strings shown above for the respective code page argument (CP_xxx).
Note that CP_UTF8 does not allow the use of any special flags.  That is, the argumment dwFlags must be zero.
0
 

Expert Comment

by:bharatpur
ID: 24294840
UTF 8 standard works on multi byte character set. i.e. the conversion unit that is take as input can be one byte/two byte depending on the initial byte value. There is no one to one mapping of  special characters between CP_ACP char to CP_UTF8 after the ascii char range from 0->32 or 0->127(32/127 I am not sure about).  "ü" and "ä" i.e (252,228) fall out of range ..So you cant expect "ü" and "ä" conversion to same Glyph in UTF8.  
0
 
LVL 12

Accepted Solution

by:
Gideon7 earned 2000 total points
ID: 24295624
UTF8 is isomorphic to Unicode (UCS), which encompasses the codepoints for almost every known language in the world.  CP_ACP defines use of the default ANSI character code for a particular code page.  Although it varies by language (e.g., ACP=1252 = Windows-1252 for Western Latin), I am not aware of a single Windows language for which a glyph is not representable by at least one UCS codepoint.  This includes all current ideographic languages (Chinese, Korean, Japanese, etc).
The Unicode committee limited UCS to the first 16 multilingual planes specifically to allow for a UTF8 encoding using at most four octets.  So your statement is incorrect.  There is definitely a mapping from any conceivable CP_ACP code page to UTF8 (UCS).
The problem is that the user is entering a fixed literal string using the ANSI codepage into the MultiByteToWideChar function.  Changing the input codepage to CP_UTF8 requires also changing the input string to UTF8 before submitting it to MultiByteToWideChar.
0
What does it mean to be "Always On"?

Is your cloud always on? With an Always On cloud you won't have to worry about downtime for maintenance or software application code updates, ensuring that your bottom line isn't affected.

 

Expert Comment

by:bharatpur
ID: 24302309
Hi Gideon
Can you pls suggest hoe to change the input string to UTF8 before submitting it to MultiByteToWideChar.
0
 

Expert Comment

by:bharatpur
ID: 24303360
ü represents octet FC. According to UTF8 range of values between FC-FD are not accepted(invalid octets). Restricted by RFC 3629: start of 6-byte sequence.

ä represent octet E4. Although this is a valid UTF8 octet but it represents the start of 2 byte char. But ä is followed by m which is always read in single octet. So the combination  äm is not a valid UTF8 encoding.

So invalid octet and there combination are left out during conversion from multibyte to widechar string thats why you are unable to see ü  and ä  in the widechar sting.
0
 
LVL 12

Expert Comment

by:Gideon7
ID: 24306134
No.  You are confusing the multi-byte UTF8 representation with the one-byte Windows-1252 (ISO-8859-1) representation.
In UTF8, ü is represented by two octets c3 bc.  ä is represented by two octets c3 a4.
The input must be the multibyte values c3 bc or c3 a4.  Not the one byte values fc or e4.
The following code is wrong:
LPCSTR szName = L"Grüezi zäme";  // ISO-8859-1
WCHAR wszOut[NCHARS];
// Wrong - CP_UTF8 should be 1252.
::MultiByteToWideChar(CP_UTF8, 0, szName, -1, wszOut, NCHARS);
You need to use CP_ACP or 1252, not CP_UTF8.  Octets above 7f are not supported in C/C++ literal strings.
0
 

Expert Comment

by:bharatpur
ID: 24311453
I am not confused. I gave the explanation how Grüezi zäme i.e ( 47 72 fc 65 7a 69 20 7a e4 6d 65) contardict UTF8 encoding principle.
0
 
LVL 12

Expert Comment

by:Gideon7
ID: 25237991
My answer is correct.
0

Featured Post

On Demand Webinar: Networking for the Cloud Era

Ready to improve network connectivity? Watch this webinar to learn how SD-WANs and a one-click instant connect tool can boost provisions, deployment, and management of your cloud connection.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Exception Handling is in the core of any application that is able to dignify its name. In this article, I'll guide you through the process of writing a DRY (Don't Repeat Yourself) Exception Handling mechanism, using Aspect Oriented Programming.
Have you tried to learn about Unicode, UTF-8, and multibyte text encoding and all the articles are just too "academic" or too technical? This article aims to make the whole topic easy for just about anyone to understand.
This video will show you how to get GIT to work in Eclipse.   It will walk you through how to install the EGit plugin in eclipse and how to checkout an existing repository.
In this video we outline the Physical Segments view of NetCrunch network monitor. By following this brief how-to video, you will be able to learn how NetCrunch visualizes your network, how granular is the information collected, as well as where to f…
Suggested Courses
Course of the Month11 days, 18 hours left to enroll

752 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question