Link to home
Start Free TrialLog in
Avatar of festijazz
festijazz

asked on

How to convert MFC::CString to UTF8 wchar_t*

Hello Everyone,
I wonder if MFC::CString is ansi or utf-8 by default.  My thought would be to consider it as ansi so the question is how to convert the following sequence:

1. CString str = _T("Hello World");
2. char* pszAnsi = str.GetBuffer();
3. wchar_t* pwUnicode = CString(pszAnsi).AllocSysString();
4. wchar_t* pwUTF-8 = ConvertUnicodeToUTF-8(pwUnicode);

Is it the right sequency ? I know I can directly skip from 1 to 3 but my big wish is to have a method from char* ansi to wchar_t* utf-8 directly.

What can you tell me about that conversion query ?
Thank you very much in advance.
Best regards.
MiQi
Avatar of AndyAinscow
AndyAinscow
Flag of Switzerland image

It is ansii or unicode depending on the project chosen when it was created.  (That is why you see the _T macro in use in MFC code).  Have a look in the project settings.  There you will see which was selected.
There is a lot of confusion over the use of Unicode in Windows applications, most of which is due to Microsoft using wholly incorrect terminology and misleading statements. I wrote an article about this, which you may find helpful to read as it tries to demystify things a little.

https://www.experts-exchange.com/articles/18363/When-is-Unicode-not-Unicode-When-Microsoft-gets-involved.html

FWIW: As recommended by the following link, I always work with UTF8 internally and only ever convert to UTF16 or ANSI at an API boundary. Not only is UTF8 a way simpler transformation format, it's also the only format that is totally cross platform as it has no issues with byte ordering or data type sizes.

http://utf8everywhere.org/

As for your original question, I believe AndyAinscow has probably provided the answer you need; however, just to elaborate: the whole point of using the _T macro is that you really shouldn't have to care about the character encoding; at least not unless you have a specific function that needs either UTF16 or ANSI. If neither is the case, you can just forget all about the encoding and just happily code away.

If you still have a concern about this it would be helpful to know the "use case" so we can better guide you.

All the best.

-Rx.
Avatar of festijazz
festijazz

ASKER

I got a request to convert single byte characters array to utf 8 bytes array. Most of the time,I convert to unicode bstr and do not care of cross platform for the library I made. So my concern is how to do such a conversion.
Also in Linux utf 8 may be coded on 4 bytes.
how to handle via a single method this transaparant conversion?
thank you very much in advance.
Best regards.
MiQi
>>I got a request to convert single byte characters array to utf 8 bytes array.

So why did you ask about converting a CString to a utf8
because my code has cstring but can be easily updated to char*
>>because my code has cstring but can be easily updated to char*

First why do that possibly backwords and complex step.
Second you have asked a question and then later said that isn't what I am interested in (if you want to know something then ask about it, not something else).
Third have you bothered to do what I said in my first comment.
hello sir,
converting from cstring to char* has to be done for portability, that is not a big pain. Then be sure I will read yout links but I wanted to add a comment before going to sleep.

I will come back to you once reading all your precious inputs.
thanks y ou
best regards.
MiQi
>> in Linux utf 8 may be coded on 4 bytes.
By definition, UTF8 is a multibyte transformation format, as it UTF16.  They take as many bytes as necessary. Only UTF32 is a 4 Byte encoding format. I have a feeling you are confusing encoding types with data types. The wchar_t type is normally 4 bytes on Linux and 2 bytes on Windows
ASKER CERTIFIED SOLUTION
Avatar of sarabande
sarabande
Flag of Luxembourg image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Thank you very much,
I did also research on my side and I came to the same results:

      char* MultiBytesString1 = "HÄllÜ WÖrld";
      char* MultiBytesString2 = "Hello World";
      wchar_t* WideCharacters1 = GetWC(MultiBytesString1);
      wchar_t* WideCharacters2 = GetWC(MultiBytesString2);

      char* utf8_str1 = ToUTF8(WideCharacters1); //
      char* utf8_str2 = ToUTF8(WideCharacters2); //

      int UTF8_Size1 = strlen(utf8_str1) + 1;  // -> goes to the file.
      int UTF8_Size2 = strlen(utf8_str2) + 1;  // -> goes to the file.

      bool utf8_1 = is_valid_utf8(utf8_str1);
      bool utf8_2 = is_valid_utf8(utf8_str2);

      wchar_t* WideChars1 = FromUTF8(utf8_str1);
      wchar_t* WideChars2 = FromUTF8(utf8_str2);

      char* MBCS1 = GetMBCS(WideChars1);
      char* MBCS2 = GetMBCS(WideChars2);