Solved

Convert C++ string to UTF8

Posted on 2004-08-04
Last Modified: 2011-04-14
Hi,

Is there any way to read in a text file in UTF8 format, or to store it in a UTF8 buffer? Thanks.

Perry.

P.S. This is extremely urgent so I'm awarding it 500 points.
Question by:dumbo2569
 
LVL 86

Accepted Solution

by:
jkr earned 600 total points
ID: 11720497
Sure, all you have to take care of is using wide character (UNICODE) strings, e.g.

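#include <string>
#include <fstream>
using namespace std;

// Note: wifstream converts bytes to wchar_t using the stream's locale (the
// global locale unless changed, initially "C"), which does not decode UTF-8.
void read_unicode_file ( const char* pszName, wstring& buf) {

    wifstream is (pszName);
    wstring strLine;
    // Test getline itself; looping on !is.eof() would process a failed read.
    while (getline ( is, strLine)) {

        buf += strLine;
        buf += L'\n';   // getline strips the newline, so put it back
    }
}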

would read the whole file into a wstring called buf.
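
If the file on disk is already stored as UTF8, another option is to skip the wide-character conversion and keep the bytes exactly as they are. A minimal sketch, assuming the file fits in memory; the name read_file_bytes is made up for this example and is not part of any library:

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Hypothetical helper: returns the file's bytes unchanged, so a UTF-8 file
// yields a UTF-8 encoded std::string. Returns false if the file cannot be opened.
bool read_file_bytes(const char* pszName, std::string& buf)
{
    // Binary mode prevents newline translation from altering the bytes.
    std::ifstream is(pszName, std::ios::in | std::ios::binary);
    if (!is)
        return false;

    std::ostringstream contents;
    contents << is.rdbuf();   // copy the whole stream into memory
    buf = contents.str();
    return true;
}
```

The resulting std::string is a plain byte buffer, so buf.c_str() can be handed to any API that takes a char*.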
 
LVL 86

Expert Comment

by:jkr
ID: 11720590
BTW, if it is about converting ASCII text to UNICODE, you could use 'mbstowcs()', e.g.

/* MBSTOWCS.CPP illustrates the behavior of the mbstowcs function
 */

#include <stdlib.h>
#include <stdio.h>
#include <wchar.h>

int main( void )
{
    const wchar_t *pwchello = L"Hi";
    size_t   len      = wcslen( pwchello );
    /* worst case: MB_CUR_MAX bytes per character, plus the terminator */
    char    *pmbhello = (char *)malloc( len * MB_CUR_MAX + 1 );
    /* one wide character per input character, plus the terminator */
    wchar_t *pwc      = (wchar_t *)malloc( ( len + 1 ) * sizeof( wchar_t ) );
    size_t   n;

    if ( pmbhello == NULL || pwc == NULL )
        return 1;

    printf( "Convert to multibyte string:\n" );
    n = wcstombs( pmbhello, pwchello, len * MB_CUR_MAX + 1 );
    printf( "\tBytes written: %u\n", (unsigned)n );
    printf( "\tHex value of first" );
    printf( " multibyte character: %#.4x\n\n", (unsigned char)pmbhello[0] );

    printf( "Convert back to wide-character string:\n" );
    n = mbstowcs( pwc, pmbhello, len + 1 );
    printf( "\tCharacters converted: %u\n", (unsigned)n );
    printf( "\tHex value of first" );
    printf( " wide character: %#.4x\n\n", (unsigned)pwc[0] );

    free( pmbhello );
    free( pwc );
    return 0;
}
 
LVL 1

Author Comment

by:dumbo2569
ID: 11720873
What's the difference between UNICODE and UTF?
 
LVL 86

Expert Comment

by:jkr
ID: 11720895
UNICODE is the superset to UTF8.
 
LVL 1

Author Comment

by:dumbo2569
ID: 11721315
Is there any way to print buf to the console so that I can verify the data?
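
One way to check the data without depending on the console's code page is to print each element of the wide string as a hex number and compare the values against a Unicode chart. A rough sketch; dump_hex is a name invented here:

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: formats every wchar_t in the string as a hex value
// padded to at least 4 digits, separated by spaces.
std::string dump_hex(const std::wstring& buf)
{
    std::ostringstream out;
    out << std::hex << std::uppercase << std::setfill('0');
    for (std::wstring::size_type i = 0; i < buf.size(); ++i) {
        if (i != 0)
            out << ' ';
        // Widen to unsigned long so the value prints as a number, not a character.
        out << std::setw(4) << static_cast<unsigned long>(buf[i]);
    }
    return out.str();
}
```

Printing it is then a matter of something like std::cout << dump_hex(buf) << std::endl; which only ever sends ASCII digits to the console.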
 
LVL 4

Assisted Solution

by:AssafLavie
AssafLavie earned 900 total points
ID: 11721792
>  UNICODE is the superset to UTF8.

jkr, isn't UTF8 the 8-bit encoding of Unicode (which is 16 bit)?
Just like UTF7 is the 70bit encoding of Unicode...
No?
 
LVL 4

Expert Comment

by:AssafLavie
ID: 11721794
Correction: I meant to say 7-bit, not 70... obviously.
 
LVL 86

Expert Comment

by:jkr
ID: 11722440
>>jkr, isn't UTF8 the 8-bit encoding of Unicode (which is 16 bit)?

Ooops, you are right, it is a special code page - see e.g. http://support.microsoft.com/default.aspx?scid=kb;en-us;175392 ("INFO: UTF8 Support")

----------------------------->8----------------------------

UTF8 is a code page that uses a string of bytes to represent a 16-bit Unicode string where ASCII text (<=U+007F) remains unchanged as a single byte, U+0080-07FF (including Latin, Greek, Cyrillic, Hebrew, and Arabic) is converted to a 2-byte sequence, and U+0800-FFFF (Chinese, Japanese, Korean, and others) becomes a 3-byte sequence.

----------------------------->8----------------------------
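
The byte counts in the quoted text can be written down as a small lookup. This is only a sketch: utf8_sequence_length is a hypothetical name, and the 4-byte branch covers code points above U+FFFF, which lie outside the 16-bit range the quote talks about.

```cpp
// Hypothetical helper: number of bytes UTF-8 uses for one Unicode code point,
// or 0 for values that are not encodable characters.
int utf8_sequence_length(unsigned long cp)
{
    if (cp <= 0x7F)
        return 1;                       // ASCII, stored unchanged
    if (cp <= 0x7FF)
        return 2;                       // e.g. Greek, Cyrillic, Hebrew, Arabic
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                       // UTF-16 surrogate halves, not characters
    if (cp <= 0xFFFF)
        return 3;                       // e.g. Chinese, Japanese, Korean
    if (cp <= 0x10FFFF)
        return 4;                       // beyond the 16-bit range
    return 0;                           // above the Unicode range
}
```

For example, 'A' (U+0041) takes one byte, U+00E9 takes two, and U+4E2D takes three.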

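So, the idea is to convert the wide string to UTF8, asking first how large the buffer has to be:

    // First call: NULL/0 for the output returns the required size in bytes,
    // including the terminator because the input length is -1.
    int cb = WideCharToMultiByte ( CP_UTF8, 0, buf.c_str(), -1, NULL, 0, NULL, NULL);
    // Despite its name, pszAnsi receives UTF8 bytes.
    char* pszAnsi   =   new char [ cb > 0 ? cb : 1];

    if  (   !cb || !WideCharToMultiByte (   CP_UTF8,
                                            0,
                                            buf.c_str(),
                                            -1,
                                            pszAnsi,
                                            cb,
                                            NULL,
                                            NULL
                                        )
        )
        {
            // error
        }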

 
LVL 1

Author Comment

by:dumbo2569
ID: 11730522
This just converts a UNICODE wstring (buf) to normal ASCII char* string. I need to convert a string to a UTF8 string.

See, my problem is that I need to send an MQMessage to an MQ queue. My program (C++, sender program), sends the message correctly to the queue. However, another client program (JAVA, consumer program), reads the message as UTF8.

My code: (sends the message in ASCII)
  //message preparation
  ImqString strText(strFile.c_str());   // ImqString is an MQ defined type
  msg.setFormat( MQFMT_STRING );      
  msg.setMessageLength(strFile.length());  
  msg.writeItem(strText);  

Their code: (reads the message in UTF8)
message.readUTF();

However, the problem with the MQ C++ APIs is that they only accept char* and strings as input parameters for the message. There is no C++ equivalent to "message.writeUTF(string)".

Is there any way to create a UTF8 string and trick the compiler into using a char* to point to that string?
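
For what it's worth, no trick should be needed on the C++ side: UTF8 text is just a sequence of bytes, so a std::string (and the char* from c_str()) can hold it as-is. What does change is that the length in bytes can be larger than the number of characters. A small sketch of that difference, assuming well-formed input; utf8_char_count is a hypothetical name:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts the characters (code points) in a UTF-8 string
// by skipping continuation bytes, which always have the bit pattern 10xxxxxx.
// Assumes the input is well-formed UTF-8.
std::size_t utf8_char_count(const std::string& utf8)
{
    std::size_t count = 0;
    for (std::string::size_type i = 0; i < utf8.size(); ++i) {
        unsigned char b = static_cast<unsigned char>(utf8[i]);
        if ((b & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

With std::string s = "caf\xC3\xA9"; s.size() is 5 while utf8_char_count(s) is 4, so any length passed along with the char* would presumably need to be the byte count from size().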
 
LVL 86

Expert Comment

by:jkr
ID: 11730734
>>This just converts a UNICODE wstring (buf) to normal ASCII char* string. I need to convert a string to a UTF8 string.

So, try the other way round:

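    const char* psz = "this is a test string in ASCII";
    int len = (int) strlen (psz) + 1;   // characters including the terminator
    wchar_t* buf = new wchar_t [ len];

   // MultiByteToWideChar takes six arguments (no default-char parameters).
   if  (   !MultiByteToWideChar (   CP_UTF8,
                                       0,
                                       psz,
                                       -1,
                                       buf,
                                       len      // size of buf in wide characters
                                   )
       )
       {
           // error
       }

  // Caution: the cast only reinterprets the wide-character buffer as bytes; it does not produce UTF8.
  ImqString strText((char*) buf);   // ImqString is an MQ defined type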
  msg.setFormat( MQFMT_STRING );    
  msg.setMessageLength(strFile.length());  
  msg.writeItem(strText);  
 
LVL 1

Author Comment

by:dumbo2569
ID: 11730825
JKR,

I just tried your suggestion and I'm getting this error. I don't think you can cast a char* directly from a wchar_t*.

C:\code\CPP\MQPut\MQPut.cpp(218) : error C2040: 'buf' : 'unsigned short *' differs in levels of indirection from 'class std::basic_string<unsigned short,struct std::char_traits<unsigned short>,class std::allocator<unsigned short> >'
C:\code\CPP\MQPut\MQPut.cpp(234) : error C2440: 'type cast' : cannot convert from 'class std::basic_string<unsigned short,struct std::char_traits<unsigned short>,class std::allocator<unsigned short> >' to 'char *'
 
LVL 86

Expert Comment

by:jkr
ID: 11730855
Have you used

wchar_t* buf = new wchar_t [ len];

or is it still the above code? I thought we were reading ASCII text now...
 
LVL 1

Author Comment

by:dumbo2569
ID: 11731032
I just did a copy and paste.

I'm getting compilation errors on this line:
wchar_t* buf = new wchar_t [ len];

and on this line from the (char*)buf;
ImqString strText((char*) buf);   // ImqString is an MQ defined type

BTW, thanks for all your help.
 
LVL 4

Expert Comment

by:AssafLavie
ID: 11732806
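The following converts a Unicode (UTF-16) wide string to UTF8:

#include <windows.h>
#include <string>
#include <iostream>
#include <assert.h>
#include <vector>

using namespace std;


string unicodeToUtf8(const wstring& source)
{
      if (source.empty())
            return string();
      // First call: query the number of UTF8 bytes needed. No terminator is
      // counted because an explicit input length is passed.
      int neededSize = WideCharToMultiByte(CP_UTF8, 0, source.c_str(), (int)source.size(), NULL, 0, NULL, NULL);
      if (neededSize <= 0)
            return string();
      vector<char> buffer(neededSize);
      // Keep the call outside assert() so it still runs when NDEBUG is defined.
      int written = WideCharToMultiByte(CP_UTF8, 0, source.c_str(), (int)source.size(), &buffer[0], neededSize, NULL, NULL);
      assert(written == neededSize);
      (void)written;
      // The buffer is not null-terminated, so build the string from an explicit length.
      return string(&buffer[0], neededSize);
}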

int main()
{
      cout << unicodeToUtf8(L"unicode text here");
      return 0;
}

HTH
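
For builds where the Windows API is not available, roughly the same conversion can be done by hand. The sketch below is an assumption-laden alternative, not a drop-in replacement: wideToUtf8 is a hypothetical name, it accepts wchar_t values that are either UTF-16 code units or whole code points, and it substitutes U+FFFD for unpaired surrogates and out-of-range values.

```cpp
#include <string>

// Hypothetical portable counterpart to unicodeToUtf8 (no Windows API calls).
std::string wideToUtf8(const std::wstring& source)
{
    std::string out;
    for (std::wstring::size_type i = 0; i < source.size(); ++i) {
        unsigned long cp = static_cast<unsigned long>(source[i]);

        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < source.size()) {
            unsigned long low = static_cast<unsigned long>(source[i + 1]);
            if (low >= 0xDC00 && low <= 0xDFFF) {
                // Combine a UTF-16 surrogate pair into one code point.
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                ++i;
            }
        }
        if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
            cp = 0xFFFD;   // unpaired surrogate or out-of-range value

        if (cp <= 0x7F) {                 // 1 byte
            out += static_cast<char>(cp);
        } else if (cp <= 0x7FF) {         // 2 bytes
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp <= 0xFFFF) {        // 3 bytes
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                          // 4 bytes
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

Handling both surrogate pairs and full code points keeps the function usable whether wchar_t is 16 bits (as on Windows) or 32 bits.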
 
LVL 1

Author Comment

by:dumbo2569
ID: 11775897
Thanks for all your help guys.

After tons of research, I realized that UTF8 is actually backwards compatible with ASCII, meaning that a plain ASCII string is the same as a UTF8 string, so no conversion was necessary. Even though UTF8 is a multi-byte encoding scheme (1, 2, 3 or 4 bytes/character), it is a 1-byte/character string if you are using just plain ASCII.
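
That conclusion only holds while every byte stays in the 7-bit range, which is easy to check before sending. A minimal sketch, with is_plain_ascii as a made-up name:

```cpp
#include <string>

// Hypothetical helper: true if every byte is in the 7-bit ASCII range,
// in which case the bytes are also valid UTF-8 exactly as they stand.
bool is_plain_ascii(const std::string& text)
{
    for (std::string::size_type i = 0; i < text.size(); ++i) {
        if (static_cast<unsigned char>(text[i]) > 0x7F)
            return false;
    }
    return true;
}
```

If this returns false, the text contains bytes above 0x7F (for example accented letters from a Windows code page), and those would still need a real conversion to UTF8.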

Since you guys were a bunch of help and I did use a lot of your code to point me to my conclusion, I'll split the points up.

Thanks.