festijazz
How to convert MFC::CString to UTF8 wchar_t*
Hello Everyone,
I wonder whether MFC's CString is ANSI or UTF-8 by default. My guess is that it is ANSI, so my question is whether the following sequence is the right way to convert it:
1. CString str = _T("Hello World");
2. char* pszAnsi = str.GetBuffer();
3. wchar_t* pwUnicode = CString(pszAnsi).AllocSysString();
4. wchar_t* pwUTF8 = ConvertUnicodeToUTF8(pwUnicode);
Is this the right sequence? I know I can go directly from 1 to 3, but what I would really like is a method that goes from char* ANSI to wchar_t* UTF-8 directly.
What can you tell me about this conversion?
Thank you very much in advance.
Best regards.
MiQi
It is ANSI or Unicode depending on the character set chosen for the project when it was created. (That is why you see the _T macro in use in MFC code.) Have a look in the project settings; there you will see which was selected.
There is a lot of confusion over the use of Unicode in Windows applications, most of which is due to Microsoft using wholly incorrect terminology and misleading statements. I wrote an article about this, which you may find helpful to read as it tries to demystify things a little.
https://www.experts-exchange.com/articles/18363/When-is-Unicode-not-Unicode-When-Microsoft-gets-involved.html
FWIW: As recommended by the following link, I always work with UTF8 internally and only ever convert to UTF16 or ANSI at an API boundary. Not only is UTF8 a way simpler transformation format, it's also the only format that is totally cross platform as it has no issues with byte ordering or data type sizes.
http://utf8everywhere.org/
As for your original question, I believe AndyAinscow has probably provided the answer you need; however, just to elaborate: the whole point of using the _T macro is that you really shouldn't have to care about the character encoding; at least not unless you have a specific function that needs either UTF16 or ANSI. If neither is the case, you can just forget all about the encoding and just happily code away.
If you still have a concern about this it would be helpful to know the "use case" so we can better guide you.
All the best.
-Rx.
ASKER
I got a request to convert a single-byte character array to a UTF-8 byte array. Most of the time, I convert to a Unicode BSTR and do not care about cross-platform support for the library I made. So my concern is how to do such a conversion.
Also, on Linux UTF-8 may be coded on 4 bytes.
How can a single method handle this conversion transparently?
Thank you very much in advance.
Best regards.
MiQi
>>I got a request to convert single byte characters array to utf 8 bytes array.
So why did you ask about converting a CString to UTF-8?
ASKER
because my code has cstring but can be easily updated to char*
>>because my code has cstring but can be easily updated to char*
First, why take that possibly backwards and complex step?
Second, you asked a question and then later said that it isn't what you are interested in (if you want to know something, then ask about it, not something else).
Third, have you bothered to do what I said in my first comment?
ASKER
Hello sir,
Converting from CString to char* has to be done for portability; that is not a big pain. Be sure I will read your links, but I wanted to add a comment before going to sleep.
I will come back to you once I have read all your precious inputs.
Thank you.
best regards.
MiQi
>> in Linux utf 8 may be coded on 4 bytes.
By definition, UTF8 is a multibyte transformation format, as is UTF16. They take as many bytes as necessary: UTF8 uses one to four bytes per code point, and UTF16 uses two or four. Only UTF32 is a fixed 4-byte encoding format. I have a feeling you are confusing encoding types with data types. The wchar_t type is normally 4 bytes on Linux and 2 bytes on Windows.
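To make the "as many bytes as necessary" point concrete, here is a minimal, portable sketch of how a single code point maps to one to four UTF-8 bytes. The function name `encode_utf8` is made up for this illustration and is not part of MFC or any library mentioned in this thread.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8.
// Returns an empty string for values that are not valid code points
// (UTF-16 surrogates, or anything above U+10FFFF).
std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {
        // 1 byte: plain ASCII
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // surrogates are not encodable
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, `encode_utf8(0x41)` ('A') yields one byte, `encode_utf8(0xC4)` ('Ä') two, `encode_utf8(0x20AC)` ('€') three and `encode_utf8(0x1F600)` four. The byte count depends only on the code point, not on the platform or on `sizeof(wchar_t)`.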
ASKER CERTIFIED SOLUTION
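The following is not the accepted solution, only a rough sketch of the single-byte to UTF-8 conversion asked about earlier in the thread. It assumes the input is ISO-8859-1 (Latin-1), where every byte value equals its Unicode code point. A real "ANSI" code page such as Windows-1252 differs in the 0x80 to 0x9F range, so on Windows the usual route is `MultiByteToWideChar` with `CP_ACP` followed by `WideCharToMultiByte` with `CP_UTF8`. The name `latin1_to_utf8` is made up for this example.

```cpp
#include <string>

// Hypothetical helper: convert a single-byte string to UTF-8.
// ASSUMPTION: the input is ISO-8859-1 (Latin-1), so each byte is its own
// code point. This is NOT correct for other code pages (e.g. Windows-1252).
std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    out.reserve(in.size() * 2);  // worst case: every byte becomes two bytes
    for (unsigned char c : in) {
        if (c < 0x80) {
            // ASCII is identical in Latin-1 and UTF-8
            out += static_cast<char>(c);
        } else {
            // U+0080..U+00FF always take two bytes: 110xxxxx 10xxxxxx
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}
```

Under that assumption, the Latin-1 byte 0xC4 ('Ä') becomes the two bytes 0xC3 0x84, and pure ASCII text is returned unchanged. The result is a char-based byte string, which is the natural container for UTF-8; no wchar_t is involved.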
ASKER
Thank you very much,
I also did some research on my side and came to the same result:
// GetWC, ToUTF8, is_valid_utf8, FromUTF8 and GetMBCS are my own helpers (not shown here).
const char* MultiBytesString1 = "HÄllÜ WÖrld";
const char* MultiBytesString2 = "Hello World";
// ANSI (multi-byte) -> wide characters
wchar_t* WideCharacters1 = GetWC(MultiBytesString1);
wchar_t* WideCharacters2 = GetWC(MultiBytesString2);
// wide characters -> UTF-8
char* utf8_str1 = ToUTF8(WideCharacters1);
char* utf8_str2 = ToUTF8(WideCharacters2);
// size in bytes including the terminating null
size_t UTF8_Size1 = strlen(utf8_str1) + 1; // -> goes to the file.
size_t UTF8_Size2 = strlen(utf8_str2) + 1; // -> goes to the file.
bool utf8_1 = is_valid_utf8(utf8_str1);
bool utf8_2 = is_valid_utf8(utf8_str2);
// and back again: UTF-8 -> wide characters -> multi-byte
wchar_t* WideChars1 = FromUTF8(utf8_str1);
wchar_t* WideChars2 = FromUTF8(utf8_str2);
char* MBCS1 = GetMBCS(WideChars1);
char* MBCS2 = GetMBCS(WideChars2);
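The helper functions used above are not shown in the thread. As an illustration only, a check along the lines of is_valid_utf8 could be sketched as below. The name `is_valid_utf8_sketch` and its behaviour are assumptions of this example, not the poster's code: it checks that the bytes are well formed and rejects overlong forms, surrogates and values above U+10FFFF.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical validator: returns true if s is well-formed UTF-8.
bool is_valid_utf8_sketch(const std::string& s) {
    // Smallest code point that legitimately needs a sequence of each length.
    static const std::uint32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        std::uint32_t cp = 0;
        if (b0 < 0x80)                { len = 1; cp = b0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
        else return false;              // stray continuation or invalid lead byte
        if (len > n - i) return false;  // sequence truncated at end of input
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a 10xxxxxx byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min_cp[len]) return false;                // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;    // UTF-16 surrogate
        if (cp > 0x10FFFF) return false;                   // beyond Unicode range
        i += len;
    }
    return true;
}
```

With this sketch, "Hello World" and the two-byte sequence 0xC3 0x84 ('Ä') pass, while a lone Latin-1 byte 0xC4 fails because the expected continuation byte is missing. That makes it a quick way to catch text that was never converted from the ANSI code page.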