Solved

Convert C++ string to UTF8

Posted on 2004-08-04
Last Modified: 2011-04-14
Hi,

Is there any way to read in a text file as UTF8, or to store it in a UTF8 buffer? Thanks.

Perry.

P.S. This is extremely urgent so I'm awarding it 500 points.
0
Comment
Question by:dumbo2569
15 Comments
 
LVL 86

Accepted Solution

by:
jkr earned 200 total points
ID: 11720497
Sure, all you have to take care of is using wide character (UNICODE) strings, e.g.

#include <string>
#include <fstream>
using namespace std;

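void read_unicode_file ( const char* pszName, wstring& buf) {

    wifstream is (pszName);
    wstring strLine;
    // Test the read itself: a loop on !is.eof() never ends if the file cannot be opened
    while (getline ( is, strLine)) {

        buf += strLine;
        buf += L'\n';   // getline drops the line break, so put it back
    }
}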

would read the whole file into a wstring called buf.
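If the file on disk is already UTF8, one alternative is to skip the wide-character stream and read the raw bytes into a std::string, which then serves as the UTF8 buffer. This is only a minimal sketch, assuming the file really is UTF8 encoded; the helper name readFileBytes is made up for illustration.

```cpp
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>

// Hypothetical helper: returns the file's bytes exactly as stored on disk.
// Binary mode avoids newline translation, so UTF-8 sequences pass through untouched.
std::string readFileBytes(const std::string& path)
{
    std::ifstream in(path.c_str(), std::ios::in | std::ios::binary);
    if (!in)
        throw std::runtime_error("cannot open " + path);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

No decoding happens here, so the function does not validate the input; it only guarantees that the buffer matches the file byte for byte.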
0
 
LVL 86

Expert Comment

by:jkr
ID: 11720590
BTW, if it is about converting ASCII text to UNICODE, you could use 'mbstowcs()', e.g.

/* MBSTOWCS.CPP illustrates the behavior of the mbstowcs function
 */

#include <stdlib.h>
#include <stdio.h>

int main( void )
{
    const wchar_t *pwchello = L"Hi";
    char     mbhello[ 16 ];   /* room for the converted bytes plus the terminator */
    wchar_t  wchello[ 16 ];
    size_t   n;

    printf( "Convert to multibyte string:\n" );
    n = wcstombs( mbhello, pwchello, sizeof( mbhello ) );
    if ( n == (size_t)-1 ) return 1;   /* unconvertible character */
    printf( "\tBytes written: %u\n", (unsigned)n );
    printf( "\tHex value of first" );
    printf( " multibyte character: %#.4x\n\n", (unsigned char)mbhello[0] );

    printf( "Convert back to wide-character string:\n" );
    n = mbstowcs( wchello, mbhello, sizeof( wchello ) / sizeof( wchello[0] ) );
    if ( n == (size_t)-1 ) return 1;   /* invalid multibyte sequence */
    printf( "\tCharacters converted: %u\n", (unsigned)n );
    printf( "\tHex value of first" );
    printf( " wide character: %#.4x\n\n", (unsigned)wchello[0] );
    return 0;
}
0
 
LVL 1

Author Comment

by:dumbo2569
ID: 11720873
What's the difference between UNICODE and UTF?
0
 
LVL 86

Expert Comment

by:jkr
ID: 11720895
UNICODE is the superset to UTF8.
0
 
LVL 1

Author Comment

by:dumbo2569
ID: 11721315
Is there any way to print buf to the console so that I can verify the data?
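One way to check the data without depending on the console's code page is to print the numeric value of every wide character. The following is a minimal sketch, not a definitive answer; dumpCodeUnits is a hypothetical helper name.

```cpp
#include <cstddef>
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: formats each wchar_t in the string as a hex value,
// separated by spaces, so the contents can be checked on any console.
std::string dumpCodeUnits(const std::wstring& text)
{
    std::ostringstream out;
    out << std::hex << std::uppercase << std::setfill('0');
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (i != 0)
            out << ' ';
        // Pad to at least four hex digits; larger values print in full
        out << std::setw(4) << static_cast<unsigned long>(text[i]);
    }
    return out.str();
}
```

Printing the result with cout << dumpCodeUnits(buf) gives, for instance, 0048 0069 for the wide string L"Hi".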
0
 
LVL 4

Assisted Solution

by:AssafLavie
AssafLavie earned 300 total points
ID: 11721792
>  UNICODE is the superset to UTF8.

jkr, isn't UTF8 the 8-bit encoding of Unicode (which is 16 bit)?
Just like UTF7 is the 70bit encoding of Unicode...
No?
0
 
LVL 4

Expert Comment

by:AssafLavie
ID: 11721794
Correction: I meant to say 7bit, not 70... obviously.
0
 
LVL 86

Expert Comment

by:jkr
ID: 11722440
>>jkr, isn't UTF8 the 8-bit encoding of Unicode (which is 16 bit)?

Oops, you are right, it is a special code page - see e.g. http://support.microsoft.com/default.aspx?scid=kb;en-us;175392 ("INFO: UTF8 Support")

----------------------------->8----------------------------

UTF8 is a code page that uses a string of bytes to represent a 16-bit Unicode string where ASCII text (<=U+007F) remains unchanged as a single byte, U+0080-07FF (including Latin, Greek, Cyrillic, Hebrew, and Arabic) is converted to a 2-byte sequence, and U+0800-FFFF (Chinese, Japanese, Korean, and others) becomes a 3-byte sequence.

----------------------------->8----------------------------
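The byte ranges quoted above can be sketched as a small encoder. This is an illustration under stated assumptions, not the code behind the Windows code page: appendUtf8 is a hypothetical helper, and the 4-byte branch for code points above U+FFFF goes beyond the 16-bit range the quoted text covers.

```cpp
#include <string>

// Hypothetical helper: appends the UTF-8 encoding of one code point to 'out'.
// Returns false for values that are not valid Unicode scalar values.
bool appendUtf8(unsigned long cp, std::string& out)
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return false;                       // out of range or a lone surrogate
    if (cp <= 0x7F) {
        out += static_cast<char>(cp);       // ASCII: unchanged single byte
    } else if (cp <= 0x7FF) {               // 2-byte sequence
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {              // 3-byte sequence
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                                // 4-byte sequence (beyond 16 bits)
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return true;
}
```

For example, U+0041 stays the single byte 0x41, while U+20AC (the euro sign) becomes the three bytes E2 82 AC.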

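So, the idea is to convert the wide string to UTF8:

    // Up to 3 UTF8 bytes per 16-bit character, plus the terminating NUL
    int cbUtf8 = (int) buf.size() * 3 + 1;
    char* pszUtf8   =   new char [ cbUtf8];

    if  (   !WideCharToMultiByte    (   CP_UTF8,
                                        0,
                                        buf.c_str(),
                                        -1,
                                        pszUtf8,
                                        cbUtf8,
                                        NULL,
                                        NULL
                                    )
        )
        {
            // error
        }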

0
 
LVL 1

Author Comment

by:dumbo2569
ID: 11730522
This just converts a UNICODE wstring (buf) to a normal ASCII char* string. I need to convert a string to a UTF8 string.

See, my problem is that I need to send an MQMessage to an MQ queue. My program (C++, sender program), sends the message correctly to the queue. However, another client program (JAVA, consumer program), reads the message as UTF8.

My code: (sends the message in ASCII)
  //message preparation
  ImqString strText(strFile.c_str());   // ImqString is an MQ defined type
  msg.setFormat( MQFMT_STRING );      
  msg.setMessageLength(strFile.length());  
  msg.writeItem(strText);  

Their code: (reads the message in UTF8)
message.readUTF();

However, the problem with the MQ C++ APIs is that they only accept char* and strings as input parameters for the message. There is no C++ equivalent to "message.writeUTF(string)".

Is there any way to create a UTF8 string and trick the compiler into using a char* to point to that string?
0
 
LVL 86

Expert Comment

by:jkr
ID: 11730734
>>This just converts a UNICODE wstring (buf) to normal ASCII char* string. I need to convert a string to a UTF8 string.

So, try the other way round:

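    const char* psz = "this is a test string in ASCII";
    int len = (int) strlen (psz) + 1;
    wchar_t* buf = new wchar_t [ len];

    // MultiByteToWideChar takes six arguments; the output size is in wide characters
    if  (   !MultiByteToWideChar (   CP_UTF8,
                                       0,
                                       psz,
                                       -1,
                                       buf,
                                       len
                                   )
       )
       {
           // error
       }

    // Note: buf now holds 16-bit wide characters, not UTF8 bytes; the cast below
    // only reinterprets those bytes.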

  ImqString strText((char*) buf);   // ImqString is an MQ defined type
  msg.setFormat( MQFMT_STRING );    
  msg.setMessageLength(strFile.length());  
  msg.writeItem(strText);  
0
 
LVL 1

Author Comment

by:dumbo2569
ID: 11730825
JKR,

I just tried your suggestion and I'm getting this error. I don't think you can cast a char* directly from a wchar_t*.

C:\code\CPP\MQPut\MQPut.cpp(218) : error C2040: 'buf' : 'unsigned short *' differs in levels of indirection from 'class std::basic_string<unsigned short,struct std::char_traits<unsigned short>,class std::allocator<unsigned short> >'
C:\code\CPP\MQPut\MQPut.cpp(234) : error C2440: 'type cast' : cannot convert from 'class std::basic_string<unsigned short,struct std::char_traits<unsigned short>,class std::allocator<unsigned short> >' to 'char *'
0
 
LVL 86

Expert Comment

by:jkr
ID: 11730855
Have you used

wchar_t* buf = new wchar_t [ len];

or is it still the above code? I thought we were reading ASCII text now...
0
 
LVL 1

Author Comment

by:dumbo2569
ID: 11731032
I just did a copy and paste.

I'm getting compilation errors on this line:
wchar_t* buf = new wchar_t [ len];

and on this line, from the (char*) buf cast:
ImqString strText((char*) buf);   // ImqString is an MQ defined type

BTW, thanks for all your help.
0
 
LVL 4

Expert Comment

by:AssafLavie
ID: 11732806
The following converts Unicode to UTF8:

#include <windows.h>
#include <string>
#include <iostream>
#include <assert.h>
#include <vector>

using namespace std;


string unicodeToUtf8(const wstring& source)
{
      if (source.empty())
            return string();
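      // First call: ask how many bytes the UTF8 form needs
      int neededSize = WideCharToMultiByte(CP_UTF8, 0, source.c_str(), (int)source.size(), NULL, 0, NULL, NULL);
      assert(neededSize);
      vector<char> buffer(neededSize);
      // Keep the conversion outside assert() so it still runs when NDEBUG is defined
      int written = WideCharToMultiByte(CP_UTF8, 0, source.c_str(), (int)source.size(), &buffer[0], neededSize, NULL, NULL);
      assert(written == neededSize);
      // The buffer is not NUL-terminated, so pass the length explicitly
      return string(&buffer[0], written);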
}

int main()
{
      cout << unicodeToUtf8(L"unicode text here");
      return 0;
}

HTH
0
 
LVL 1

Author Comment

by:dumbo2569
ID: 11775897
Thanks for all your help guys.

After tons of research, I realized that UTF8 is actually backwards compatible with ASCII, meaning that a plain ASCII string is the same as a UTF8 string, so no conversion was necessary. Even though UTF8 is a multi-byte encoding scheme (1, 2, 3 or 4 bytes/character), it is a 1-byte/character string if you are using just plain ASCII.

Since you guys were a bunch of help and I did use a lot of your code to point me to my conclusion, I'll split the points up.

Thanks.
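The conclusion above can be checked in code before sending a message. As a minimal sketch, assuming the payload sits in a std::string (isPlainAscii is a hypothetical helper name, not part of the MQ API): if every byte is 0x7F or lower, the buffer is already valid UTF8 and can be passed on unchanged.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true when every byte is in the 7-bit ASCII range.
// Such a buffer is byte-for-byte identical to its UTF-8 encoding.
bool isPlainAscii(const std::string& text)
{
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (static_cast<unsigned char>(text[i]) > 0x7F)
            return false;
    }
    return true;
}
```

If the check fails, the text contains bytes outside ASCII and would need a real conversion from its source code page, such as the WideCharToMultiByte approach shown earlier in the thread.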
0
