• Status: Solved

How to convert MFC::CString to UTF8 wchar_t*

Hello Everyone,
I wonder whether MFC's CString is ANSI or UTF-8 by default. My assumption is that it is ANSI, so the question is whether the following conversion sequence is correct:

1. CString str = _T("Hello World");
2. char* pszAnsi = str.GetBuffer();
3. wchar_t* pwUnicode = CString(pszAnsi).AllocSysString();
4. wchar_t* pwUTF-8 = ConvertUnicodeToUTF-8(pwUnicode);

Is this the right sequence? I know I can go directly from step 1 to step 3, but what I would really like is a method that converts directly from an ANSI char* to a UTF-8 wchar_t*.

What can you tell me about this conversion?
Thank you very much in advance.
Best regards.
MiQi
Asked by: festijazz
 
AndyAinscow (Freelance programmer / Consultant) commented:
It is ANSI or Unicode depending on the character set chosen when the project was created. (That is why you see the _T macro used in MFC code.) Have a look in the project settings; there you will see which one was selected.
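
As a rough illustration of what that project setting changes: when the Unicode character set is selected (which defines `_UNICODE` and `UNICODE`), `TCHAR` and `_T("...")` resolve to `wchar_t` and `L"..."`; otherwise they resolve to `char` and `"..."`. The sketch below mimics that mechanism with hypothetical names (`MY_UNICODE`, `my_tchar`, `MY_T`, `my_tstring`) so that it compiles without the Windows SDK. It is not the actual `<tchar.h>` source.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-ins for _UNICODE, TCHAR and _T. These names are invented
// for this sketch; the real definitions live in the Windows headers.
#ifdef MY_UNICODE
typedef wchar_t my_tchar;
#define MY_T(s) L##s
#else
typedef char my_tchar;
#define MY_T(s) s
#endif

// String type that follows the build setting, similar in spirit to CString.
typedef std::basic_string<my_tchar> my_tstring;

// Bytes occupied by one character unit in the current build.
inline std::size_t tchar_unit_size() { return sizeof(my_tchar); }

// The same source line yields a narrow or a wide string depending on the build.
inline my_tstring greeting() { return MY_T("Hello World"); }
```

Building with `-DMY_UNICODE` switches both the character type and the literal prefix without touching `greeting()`, which is the effect the project setting has on `CString` and `_T`.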
 
evilrix (Senior Software Engineer, Avast) commented:
There is a lot of confusion over the use of Unicode in Windows applications, most of which is due to Microsoft using wholly incorrect terminology and misleading statements. I wrote an article about this, which you may find helpful to read as it tries to demystify things a little.

https://www.experts-exchange.com/articles/18363/When-is-Unicode-not-Unicode-When-Microsoft-gets-involved.html

FWIW: As recommended by the following link, I always work with UTF8 internally and only ever convert to UTF16 or ANSI at an API boundary. Not only is UTF8 a way simpler transformation format, it's also the only format that is totally cross platform as it has no issues with byte ordering or data type sizes.

http://utf8everywhere.org/
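
To make the byte-ordering point concrete, here is a minimal sketch that serializes the euro sign (U+20AC) both ways. The helper names `utf16_unit_bytes` and `utf8_three_bytes` are made up for this illustration, and the UTF-8 helper only covers the three-byte range U+0800..U+FFFF.

```cpp
#include <cstdint>
#include <string>

// Serialize one UTF-16 code unit as two bytes, little- or big-endian.
inline std::string utf16_unit_bytes(std::uint16_t unit, bool little_endian) {
    const char lo = static_cast<char>(unit & 0xFF);
    const char hi = static_cast<char>(unit >> 8);
    return little_endian ? std::string{lo, hi} : std::string{hi, lo};
}

// UTF-8 bytes for a code point in the three-byte range (U+0800..U+FFFF).
// The byte sequence is defined by the encoding itself, not by the machine.
inline std::string utf8_three_bytes(std::uint16_t cp) {
    std::string out;
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
    return out;
}
```

The UTF-16 form needs a byte-order convention (or a BOM) agreed between writer and reader; the UTF-8 form is the same three bytes everywhere.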

As for your original question, I believe AndyAinscow has probably provided the answer you need; however, just to elaborate: the whole point of using the _T macro is that you really shouldn't have to care about the character encoding; at least not unless you have a specific function that needs either UTF16 or ANSI. If neither is the case, you can just forget all about the encoding and just happily code away.

If you still have a concern about this it would be helpful to know the "use case" so we can better guide you.

All the best.

-Rx.
 
festijazz (Author) commented:
I got a request to convert a single-byte character array to a UTF-8 byte array. Most of the time I convert to a Unicode BSTR and do not care about cross-platform support for the library I made, so my concern is how to do such a conversion.
Also, on Linux UTF-8 may be coded on 4 bytes.
How can I handle this conversion transparently in a single method?
thank you very much in advance.
Best regards.
MiQi
 
AndyAinscow (Freelance programmer / Consultant) commented:
>>I got a request to convert single byte characters array to utf 8 bytes array.

So why did you ask about converting a CString to UTF-8?
 
festijazz (Author) commented:
Because my code uses CString, but it can easily be changed to char*.
 
AndyAinscow (Freelance programmer / Consultant) commented:
>>because my code has cstring but can be easily updated to char*

First, why take that possibly backwards and complex step?
Second, you asked a question and then later said that it isn't what you are interested in (if you want to know something, ask about that, not something else).
Third, have you done what I suggested in my first comment?
 
festijazz (Author) commented:
Hello sir,
Converting from CString to char* has to be done for portability; that is not a big pain. I will certainly read your links, but I wanted to add a comment before going to sleep.

I will come back to you once I have read all your valuable input.
Thank you.
best regards.
MiQi
 
evilrix (Senior Software Engineer, Avast) commented:
>> in Linux utf 8 may be coded on 4 bytes.
By definition, UTF8 is a multibyte transformation format, as is UTF16. Both are variable-width: UTF8 uses 1 to 4 bytes per code point and UTF16 uses one or two 2-byte code units. Only UTF32 is a fixed 4-byte encoding format. I have a feeling you are confusing encoding types with data types. The wchar_t type is normally 4 bytes on Linux and 2 bytes on Windows.
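
A small sketch of that distinction, with hypothetical helper names (`utf8_length`, `utf16_length`, `wchar_size`): the first two depend only on the code point and the encoding, while the last depends only on the platform's data type.

```cpp
#include <cstddef>

// Number of bytes UTF-8 needs for one code point (0 if it cannot be encoded).
inline std::size_t utf8_length(char32_t cp) {
    if (cp < 0x80) return 1;
    if (cp < 0x800) return 2;
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  // surrogates are not characters
    if (cp < 0x10000) return 3;
    if (cp <= 0x10FFFF) return 4;
    return 0;                                    // beyond the Unicode range
}

// Number of 16-bit code units UTF-16 needs for the same code point.
inline std::size_t utf16_length(char32_t cp) {
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;
    if (cp < 0x10000) return 1;
    if (cp <= 0x10FFFF) return 2;                // surrogate pair
    return 0;
}

// Size of the wchar_t data type on this platform, which is a separate matter.
inline std::size_t wchar_size() { return sizeof(wchar_t); }
```

For example, `utf8_length` gives the same answer for U+20AC on Windows and Linux, whereas `wchar_size()` does not.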
 
sarabande commented:
in MFC you could use the Win32 function WideCharToMultiByte to convert from a wchar_t string to a char string that has UTF-8 multi-byte encoding.

LPCWSTR pWideString = L"Some Ansi Text with characters beyond ASCII like € or µ ";

// a UTF-8 character may use up to 4 bytes, so 4 bytes per wchar_t is a safe upper bound
int utf8size = static_cast<int>(wcslen(pWideString)) * 4 + 1;
char* pUTF8String = new char[utf8size];

// for CP_UTF8 the flags argument must be 0 or WC_ERR_INVALID_CHARS
WideCharToMultiByte(CP_UTF8, 0, pWideString,
       -1, pUTF8String, utf8size, NULL, NULL);

....

delete[] pUTF8String;  // free memory after use


you can assign pUTF8String to a std::string or, if your application uses the 'multi-byte character set' (look at the General page of the configuration settings), also to a CString (see Andy's comments).

note: if you display the converted text in the UI you may see strange characters, since a multi-byte MFC build normally interprets char strings in the ANSI code page rather than as UTF-8.
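
For code that must also build outside Windows, the same UTF-16 to UTF-8 step can be written by hand. The following is a minimal sketch and not the Win32 implementation: `append_utf8` and `utf16_to_utf8` are hypothetical helper names, the input is assumed to be `char16_t` code units, and unpaired surrogates are replaced with U+FFFD, which is an assumption about the error handling you want.

```cpp
#include <cstddef>
#include <string>

// Append the UTF-8 encoding of one code point (assumed valid, <= U+10FFFF).
inline void append_utf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Convert UTF-16 code units to a UTF-8 byte string.
inline std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    out.reserve(in.size() * 3);  // one UTF-16 unit yields at most 3 UTF-8 bytes
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {  // high surrogate: needs a partner
            if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                ++i;                         // consume the low surrogate too
            } else {
                cp = 0xFFFD;                 // unpaired: replacement character
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD;                     // stray low surrogate
        }
        append_utf8(out, cp);
    }
    return out;
}
```

Returning a `std::string` avoids the manual `new[]`/`delete[]` pair and the need to guess the buffer size up front.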

if your initial input is ANSI text, you would first convert the text to wide characters (UTF-16).

const char* text = "Some Ansi Text with characters beyond ASCII like € or µ ";
_bstr_t bstr = text;   // needs <comutil.h>; converts using the ANSI code page
LPCWSTR pWideString = (const wchar_t*)bstr;


the _bstr_t class is a helper that can be used to convert from ansi to utf16 and back.
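
If a direct char-to-UTF-8 method is wanted without the intermediate wide string, it is possible when the source code page is known. The sketch below assumes the input is ISO-8859-1 (Latin-1), where each byte value equals its Unicode code point; `latin1_to_utf8` is a hypothetical helper name. Windows "ANSI" usually means a code page such as Windows-1252, which differs from Latin-1 in the 0x80..0x9F range (the € sign is 0x80 there), so this sketch would not convert € correctly; for real ANSI input, going through UTF-16 as shown above is the safer route.

```cpp
#include <string>

// Convert ISO-8859-1 (Latin-1) bytes to UTF-8.
// Assumption: the input really is Latin-1, not Windows-1252 or another code page.
inline std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    out.reserve(in.size() * 2);  // each Latin-1 byte becomes at most 2 bytes
    for (unsigned char c : in) {
        if (c < 0x80) {
            out += static_cast<char>(c);                   // ASCII is unchanged
        } else {
            out += static_cast<char>(0xC0 | (c >> 6));     // lead byte
            out += static_cast<char>(0x80 | (c & 0x3F));   // continuation byte
        }
    }
    return out;
}
```

Plain ASCII text passes through unchanged, so the output is only longer than the input when accented or other non-ASCII characters are present.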

Sara
 
festijazz (Author) commented:
Thank you very much,
I also did some research on my side and came to the same result:

      char* MultiBytesString1 = "HÄllÜ WÖrld";
      char* MultiBytesString2 = "Hello World";
      wchar_t* WideCharacters1 = GetWC(MultiBytesString1);
      wchar_t* WideCharacters2 = GetWC(MultiBytesString2);

      char* utf8_str1 = ToUTF8(WideCharacters1); //
      char* utf8_str2 = ToUTF8(WideCharacters2); //

      int UTF8_Size1 = strlen(utf8_str1) + 1;  // -> goes to the file.
      int UTF8_Size2 = strlen(utf8_str2) + 1;  // -> goes to the file.

      bool utf8_1 = is_valid_utf8(utf8_str1);
      bool utf8_2 = is_valid_utf8(utf8_str2);

      wchar_t* WideChars1 = FromUTF8(utf8_str1);
      wchar_t* WideChars2 = FromUTF8(utf8_str2);

      char* MBCS1 = GetMBCS(WideChars1);
      char* MBCS2 = GetMBCS(WideChars2);
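
The helper functions used above are not shown in the thread. As an illustration only, a check like `is_valid_utf8` could look roughly like the sketch below. This is a hypothetical reimplementation, not the author's code; it takes a `std::string` (a `char*` converts implicitly) and rejects truncated sequences, overlong forms, surrogates, and values above U+10FFFF.

```cpp
#include <cstddef>
#include <string>

// Hypothetical reimplementation of a UTF-8 validity check.
inline bool is_valid_utf8(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        std::size_t extra;    // continuation bytes expected after the lead byte
        char32_t cp, min_cp;  // decoded value and smallest value for this length
        if (b0 < 0x80)                { extra = 0; cp = b0;        min_cp = 0; }
        else if ((b0 & 0xE0) == 0xC0) { extra = 1; cp = b0 & 0x1F; min_cp = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { extra = 2; cp = b0 & 0x0F; min_cp = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { extra = 3; cp = b0 & 0x07; min_cp = 0x10000; }
        else return false;    // stray continuation byte or invalid lead byte

        if (i + extra >= n) return false;  // sequence is cut off at the end
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min_cp) return false;                   // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate range
        if (cp > 0x10FFFF) return false;                 // outside Unicode
        i += extra + 1;
    }
    return true;
}
```

Note that a raw single-byte string such as "HÄllÜ" in an ANSI code page fails this check, because a byte like 0xC4 looks like a lead byte but is not followed by a continuation byte.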
Question has a verified solution.
