Solved

Convert C++ string to UTF8

Posted on 2004-08-04
Last Modified: 2011-04-14
Hi,

Is there any way to read in a text file as UTF8, or to store it in a UTF8 buffer?  Thanks.

Perry.

P.S. This is extremely urgent so I'm awarding it 500 points.
Question by:dumbo2569
15 Comments
 
LVL 86

Accepted Solution

by:
jkr earned 200 total points
ID: 11720497
Sure, all you have to take care of is using wide character (UNICODE) strings, e.g.

#include <string>
#include <fstream>
using namespace std;

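void read_unicode_file ( const char* pszName, wstring& buf) {

    wifstream is (pszName);
    wstring strLine;
    // Test getline() itself; looping on !is.eof() would process a failed last read
    while (getline ( is, strLine)) {

        buf += strLine;   // note: getline() drops the line terminator
    }
}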

would read the whole file into a wstring called buf.
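
For the other half of the question (storing the file in a UTF8 buffer), one option is to skip decoding altogether and keep the raw bytes. This is only a sketch: read_file_bytes is a made-up helper name, and the result is UTF8 only if the file on disk already is.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper (not from the thread): returns the file's raw bytes.
// Nothing is decoded or converted, so the buffer holds UTF8 only if the
// file itself is stored as UTF8. Returns an empty string if the file
// cannot be opened.
std::string read_file_bytes(const char* path)
{
    std::ifstream in(path, std::ios::binary);   // binary: no newline translation
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

The returned std::string can be handed to any API that takes a char* via c_str(); the bytes are passed through untouched.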
0
 
LVL 86

Expert Comment

by:jkr
ID: 11720590
BTW, if it is about converting ASCII text to UNICODE, you could use 'mbstowcs()', e.g.

/* MBSTOWCS.CPP illustrates the behavior of the mbstowcs function
 */

#include <stdlib.h>
#include <stdio.h>

int main( void )
{
    size_t i;
    char          *pmbhello = (char *)malloc( MB_CUR_MAX );
    const wchar_t *pwchello = L"Hi";
    wchar_t       *pwc      = (wchar_t *)malloc( sizeof( wchar_t ));

    printf( "Convert to multibyte string:\n" );
    /* Writes at most MB_CUR_MAX bytes; the result is not NUL-terminated
       if the whole string does not fit. */
    i = wcstombs( pmbhello, pwchello, MB_CUR_MAX );
    printf( "\tCharacters converted: %u\n", (unsigned)i );
    printf( "\tHex value of first" );
    printf( " multibyte character: %#.4x\n\n", (unsigned char)pmbhello[0] );

    printf( "Convert back to wide-character string:\n" );
    /* pwc has room for one wide character, so convert at most one. */
    i = mbstowcs( pwc, pmbhello, 1 );
    printf( "\tCharacters converted: %u\n", (unsigned)i );
    printf( "\tHex value of first" );
    printf( " wide character: %#.4x\n\n", (unsigned)pwc[0] );

    free( pwc );
    free( pmbhello );
    return 0;
}
0
 
LVL 1

Author Comment

by:dumbo2569
ID: 11720873
What's the difference between UNICODE and UTF?
0

 
LVL 86

Expert Comment

by:jkr
ID: 11720895
UNICODE is the superset to UTF8.
0
 
LVL 1

Author Comment

by:dumbo2569
ID: 11721315
Is there any way to print buf to the console so that I can verify the data?
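
One way to check what ended up in buf without depending on the console's code page is to dump each wide character as a hex number. A minimal sketch; hex_dump is a made-up name, not something from the code above.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: formats each wide character of s as an uppercase hex
// value padded to at least four digits, separated by single spaces.
std::string hex_dump(const std::wstring& s)
{
    std::ostringstream os;
    os << std::hex << std::uppercase << std::setfill('0');
    for (std::wstring::size_type i = 0; i < s.size(); ++i) {
        if (i != 0)
            os << ' ';
        os << std::setw(4) << static_cast<unsigned long>(s[i]);
    }
    return os.str();
}
```

Writing the result with cout << hex_dump(buf) prints plain ASCII, so the values can be compared against a Unicode chart by hand.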
0
 
LVL 4

Assisted Solution

by:AssafLavie
AssafLavie earned 300 total points
ID: 11721792
>  UNICODE is the superset to UTF8.

jkr, isn't UTF8 the 8-bit encoding of Unicode (which is 16 bit)?
Just like UTF7 is the 70bit encoding of Unicode...
No?
0
 
LVL 4

Expert Comment

by:AssafLavie
ID: 11721794
Correction: I meant to say 7bit, not 70, obviously.
0
 
LVL 86

Expert Comment

by:jkr
ID: 11722440
>>jkr, isn't UTF8 the 8-bit encoding of Unicode (which is 16 bit)?

Ooops, you are right, it is a special codepage - see e.g. http://support.microsoft.com/default.aspx?scid=kb;en-us;175392 ("INFO: UTF8 Support")

----------------------------->8----------------------------

UTF8 is a code page that uses a string of bytes to represent a 16-bit Unicode string where ASCII text (<=U+007F) remains unchanged as a single byte, U+0080-07FF (including Latin, Greek, Cyrillic, Hebrew, and Arabic) is converted to a 2-byte sequence, and U+0800-FFFF (Chinese, Japanese, Korean, and others) becomes a 3-byte sequence.

----------------------------->8----------------------------
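
The byte ranges quoted above can be written out by hand, which also works where WideCharToMultiByte is not available. This is just a sketch under the same assumption as the KB text (16-bit values only); encode_utf8_bmp is a made-up name.

```cpp
#include <string>

// Hypothetical helper: encodes one code point in U+0000..U+FFFF as UTF8.
// Code points above U+FFFF (4-byte sequences) yield an empty string, and
// surrogate values (U+D800..U+DFFF) are not rejected.
std::string encode_utf8_bmp(unsigned int cp)
{
    std::string out;
    if (cp <= 0x7F) {                 // ASCII: unchanged, one byte
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {         // U+0080..U+07FF: two bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {        // U+0800..U+FFFF: three bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, U+0041 ('A') stays the single byte 0x41, U+00E9 becomes C3 A9, and U+20AC (the euro sign) becomes E2 82 AC.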

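So, the idea is to convert the wide string using the CP_UTF8 code page:

    // First call: ask how many bytes the UTF8 result needs (including the NUL)
    int cb = WideCharToMultiByte ( CP_UTF8, 0, buf.c_str(), -1, NULL, 0, NULL, NULL);
    char* pszAnsi   =   new char [ cb];

    if  (   cb == 0
        ||  !WideCharToMultiByte    (   CP_UTF8,
                                        0,
                                        buf.c_str(),
                                        -1,
                                        pszAnsi,
                                        cb,
                                        NULL,
                                        NULL
                                    )
        )
        {
            // error
        }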

0
 
LVL 1

Author Comment

by:dumbo2569
ID: 11730522
This just converts a UNICODE wstring (buf) to normal ASCII char* string. I need to convert a string to a UTF8 string.

See, my problem is that I need to send an MQMessage to an MQ queue. My program (C++, sender program) sends the message correctly to the queue. However, another client program (Java, consumer program) reads the message as UTF8.

My code: (sends the message in ASCII)
  //message preparation
  ImqString strText(strFile.c_str());   // ImqString is an MQ defined type
  msg.setFormat( MQFMT_STRING );      
  msg.setMessageLength(strFile.length());  
  msg.writeItem(strText);  

Their code: (reads the message in UTF8)
message.readUTF();

However, the problem with the MQ C++ APIs is that they only accept char* and strings as input parameters for the message. There is no C++ equivalent to "message.writeUTF(string)".

Is there any way to create a UTF8 string and trick the compiler into using a char* to point to that string?
0
 
LVL 86

Expert Comment

by:jkr
ID: 11730734
>>This just converts a UNICODE wstring (buf) to normal ASCII char* string. I need to convert a string to a UTF8 string.

So, try the other way round:

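    const char* psz = "this is a test string in ASCII";
    int len = (int) strlen (psz) + 1;   // includes the terminating NUL
    wchar_t* buf = new wchar_t [ len];

   // MultiByteToWideChar takes six arguments; the output size is in wide characters
   if  (   !MultiByteToWideChar (   CP_UTF8,
                                       0,
                                       psz,
                                       -1,
                                       buf,
                                       len
                                   )
       )
       {
           // error
       }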

  ImqString strText((char*) buf);   // ImqString is an MQ defined type
  msg.setFormat( MQFMT_STRING );    
  msg.setMessageLength(strFile.length());  
  msg.writeItem(strText);  
0
 
LVL 1

Author Comment

by:dumbo2569
ID: 11730825
JKR,

I just tried your suggestion and I'm getting this error. I don't think you can cast a char* directly from a wchar_t*.

C:\code\CPP\MQPut\MQPut.cpp(218) : error C2040: 'buf' : 'unsigned short *' differs in levels of indirection from 'class std::basic_string<unsigned short,struct std::char_traits<unsigned short>,class std::allocator<unsigned short> >'
C:\code\CPP\MQPut\MQPut.cpp(234) : error C2440: 'type cast' : cannot convert from 'class std::basic_string<unsigned short,struct std::char_traits<unsigned short>,class std::allocator<unsigned short> >' to 'char *'
0
 
LVL 86

Expert Comment

by:jkr
ID: 11730855
Have you used

wchar_t* buf = new wchar_t [ len];

or is it still the above code? I thought we were reading ASCII text now...
0
 
LVL 1

Author Comment

by:dumbo2569
ID: 11731032
I just did a copy and paste.

I'm getting compilation errors on this line:
wchar_t* buf = new wchar_t [ len];

and on this line from the (char*)buf;
ImqString strText((char*) buf);   // ImqString is an MQ defined type

BTW, thanks for all your help.
0
 
LVL 4

Expert Comment

by:AssafLavie
ID: 11732806
The following converts Unicode to UTF8:

#include <windows.h>
#include <string>
#include <iostream>
#include <assert.h>
#include <vector>

using namespace std;


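string unicodeToUtf8(const wstring& source)
{
      if (source.empty())
            return string();
      int sourceSize = static_cast<int>(source.size());
      int neededSize = WideCharToMultiByte(CP_UTF8, 0, source.c_str(), sourceSize, NULL, 0, NULL, NULL);
      if (neededSize <= 0)
            return string();   // conversion failed
      vector<char> buffer(neededSize);
      // Call outside assert() so the conversion still runs when NDEBUG is defined
      int written = WideCharToMultiByte(CP_UTF8, 0, source.c_str(), sourceSize, &buffer[0], neededSize, NULL, NULL);
      assert(written == neededSize);
      (void)written;
      // The buffer is not NUL-terminated, so build the string from the whole range
      return string(buffer.begin(), buffer.end());
}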

int main()
{
      cout << unicodeToUtf8(L"unicode text here");
      return 0;
}

HTH
0
 
LVL 1

Author Comment

by:dumbo2569
ID: 11775897
Thanks for all your help guys.

After tons of research, I realized that UTF8 is actually backwards compatible with ASCII, meaning that a plain ASCII string is the same as a UTF8 string, so no conversion was necessary. Even though UTF8 is a multi-byte encoding scheme (1, 2, 3 or 4 bytes/character), it is a 1-byte/character string if you are using just plain ASCII.

Since you guys were a bunch of help and I did use a lot of your code to point me to my conclusion, I'll split the points up.

Thanks.
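
The conclusion above only holds while every byte in the message is plain ASCII, and that is easy to check before sending. A minimal sketch; is_plain_ascii is a made-up name.

```cpp
#include <string>

// Hypothetical helper: true if every byte is below 0x80, i.e. the buffer is
// plain ASCII and therefore already valid UTF8, byte for byte.
bool is_plain_ascii(const std::string& s)
{
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        if (static_cast<unsigned char>(s[i]) >= 0x80)
            return false;
    }
    return true;
}
```

If this returns false for a message, the text contains characters outside ASCII and would need a real conversion (for example the unicodeToUtf8 function posted earlier) instead of being sent as-is.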
0
