Solved

Convert C++ string to UTF8

Posted on 2004-08-04
Last Modified: 2011-04-14
Hi,

Is there any way to read in a text file as UTF8, or to store it in a UTF8 buffer?  Thanks.

Perry.

P.S. This is extremely urgent so I'm awarding it 500 points.
Question by:dumbo2569
15 Comments
 
LVL 86

Accepted Solution

by:
jkr earned 600 total points
ID: 11720497
Sure, all you have to take care of is using wide character (UNICODE) strings, e.g.

#include <string>
#include <fstream>
using namespace std;

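void read_unicode_file ( const char* pszName, wstring& buf) {

    // The stream converts bytes to wide characters using its imbued locale;
    // the default "C" locale does not decode UTF-8.
    wifstream is (pszName);
    wstring strLine;
    // Loop on the read itself: testing !is.eof() never terminates
    // if the file cannot be opened.
    while (getline ( is, strLine)) {

        buf += strLine;
        buf += L'\n';   // getline() strips the line terminator
    }
}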

would read the whole file into a wstring called buf.
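If the file on disk is already UTF8 encoded, a narrow std::string can serve as the UTF8 buffer, because it simply holds the bytes unchanged. The following is a minimal sketch under that assumption; read_file_bytes is a hypothetical helper name, not something from the thread or from any library.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: returns the file's bytes exactly as stored on disk.
// If the file is UTF8 encoded, the returned std::string is a UTF8 buffer.
std::string read_file_bytes(const char* path)
{
    // Binary mode prevents newline translation from altering the bytes.
    std::ifstream in(path, std::ios::binary);
    if (!in)
        return std::string();   // could not open; caller sees an empty buffer
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

No decoding happens here, so the length of the result is a byte count, not a character count.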
0
 
LVL 86

Expert Comment

by:jkr
ID: 11720590
BTW, if it is about converting ASCII text to UNICODE, you could use 'mbstowcs()', e.g.

/* MBSTOWCS.CPP illustrates the behavior of the mbstowcs function
 */

#include <stdlib.h>
#include <stdio.h>
#include <wchar.h>

int main( void )
{
    size_t n;
    const wchar_t *pwchello = L"Hi";
    size_t  cch       = wcslen( pwchello ) + 1;   /* characters incl. terminator */
    /* Size both buffers for the whole string, not for a single character. */
    char    *pmbhello = (char *)malloc( cch * MB_CUR_MAX );
    wchar_t *pwc      = (wchar_t *)malloc( cch * sizeof( wchar_t ));

    printf( "Convert to multibyte string:\n" );
    n = wcstombs( pmbhello, pwchello, cch * MB_CUR_MAX );
    printf( "\tBytes written: %u\n", (unsigned)n );
    printf( "\tHex value of first" );
    /* Print the character value, not the pointer. */
    printf( " multibyte character: %#.4x\n\n", (unsigned)(unsigned char)pmbhello[0] );

    printf( "Convert back to wide-character string:\n" );
    n = mbstowcs( pwc, pmbhello, cch );
    printf( "\tCharacters converted: %u\n", (unsigned)n );
    printf( "\tHex value of first" );
    printf( " wide character: %#.4x\n\n", (unsigned)pwc[0] );

    free( pmbhello );
    free( pwc );
    return 0;
}
0
 
LVL 1

Author Comment

by:dumbo2569
ID: 11720873
What's the difference between UNICODE and UTF?
0
 
LVL 86

Expert Comment

by:jkr
ID: 11720895
UNICODE is the superset to UTF8.
0
 
LVL 1

Author Comment

by:dumbo2569
ID: 11721315
Is there any way to print buf to the console so that I can verify the data?
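One way to verify the data without depending on how the console renders wide characters is to dump each character as a hex value. This is only a sketch; to_hex is a hypothetical helper name, and it assumes the text is in a std::wstring such as buf from the accepted solution.

```cpp
#include <string>

// Hypothetical helper: renders each wchar_t of a wide string as hex
// (at least four digits), separated by spaces.
std::string to_hex(const std::wstring& text)
{
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (std::wstring::size_type i = 0; i < text.size(); ++i) {
        unsigned long value = static_cast<unsigned long>(text[i]);
        std::string item;
        // Build the digits from least to most significant, then pad to 4.
        do {
            item.insert(item.begin(), digits[value & 0xF]);
            value >>= 4;
        } while (value != 0);
        while (item.size() < 4)
            item.insert(item.begin(), '0');
        if (i != 0)
            out += ' ';
        out += item;
    }
    return out;
}
```

Printing is then a matter of cout << to_hex(buf), which writes plain ASCII and so works on any console.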
0
 
LVL 4

Assisted Solution

by:AssafLavie
AssafLavie earned 900 total points
ID: 11721792
>  UNICODE is the superset to UTF8.

jkr, isn't UTF8 the 8-bit encoding of Unicode (which is 16 bit)?
Just like UTF7 is the 70bit encoding of Unicode...
No?
0
 
LVL 4

Expert Comment

by:AssafLavie
ID: 11721794
Correction: I meant to say 7bit, not 70, obviously.
0
 
LVL 86

Expert Comment

by:jkr
ID: 11722440
>>jkr, isn't UTF8 the 8-bit encoding of Unicode (which is 16 bit)?

Ooops, you are right, it is a special codepage - see e.g. http://support.microsoft.com/default.aspx?scid=kb;en-us;175392 ("INFO: UTF8 Support")

----------------------------->8----------------------------

UTF8 is a code page that uses a string of bytes to represent a 16-bit Unicode string where ASCII text (<=U+007F) remains unchanged as a single byte, U+0080-07FF (including Latin, Greek, Cyrillic, Hebrew, and Arabic) is converted to a 2-byte sequence, and U+0800-FFFF (Chinese, Japanese, Korean, and others) becomes a 3-byte sequence.

----------------------------->8----------------------------

So, the idea is to

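    // Each 16-bit wide character needs at most 3 bytes in UTF8,
    // plus one byte for the terminating NUL.
    int cbUtf8      =   (int) buf.size() * 3 + 1;
    char* pszAnsi   =   new char [ cbUtf8];   // receives UTF8 bytes, despite the name

    if  (   !WideCharToMultiByte    (   CP_UTF8,
                                        0,
                                        buf.c_str(),
                                        -1,
                                        pszAnsi,
                                        cbUtf8,
                                        NULL,
                                        NULL
                                    )
        )
        {
            // error
        }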
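The byte-length rules in the KB quote above can be sketched without any Windows API. This is an illustration, not the KB article's code: encode_utf8 is a hypothetical helper name, and the 4-byte branch for code points above U+FFFF goes beyond what the quote covers.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: encodes one Unicode code point as UTF8 bytes.
std::string encode_utf8(unsigned long cp)
{
    std::string out;
    if (cp <= 0x7F) {                       // ASCII: 1 byte, unchanged
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {              // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        if (cp >= 0xD800 && cp <= 0xDFFF)
            throw std::invalid_argument("surrogate values are not code points");
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {            // 4 bytes: 11110xxx + three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        throw std::invalid_argument("code point beyond U+10FFFF");
    }
    return out;
}
```

For example, U+0041 stays a single byte, while U+20AC (the euro sign) falls in the 3-byte range.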

0
 
LVL 1

Author Comment

by:dumbo2569
ID: 11730522
This just converts a UNICODE wstring (buf) to normal ASCII char* string. I need to convert a string to a UTF8 string.

See, my problem is that I need to send an MQMessage to an MQ queue. My program (C++, sender program), sends the message correctly to the queue. However, another client program (JAVA, consumer program), reads the message as UTF8.

My code: (sends the message in ASCII)
  //message preparation
  ImqString strText(strFile.c_str());   // ImqString is an MQ defined type
  msg.setFormat( MQFMT_STRING );      
  msg.setMessageLength(strFile.length());  
  msg.writeItem(strText);  

Their code: (reads the message in UTF8)
message.readUTF();

However, the problem with the MQ C++ APIs is that they only accept char* and strings as input parameters for the message. There is no C++ equivalent to "message.writeUTF(string)".

Is there any way to create a UTF8 string and trick the compiler into using a char* to point to that string?
0
 
LVL 86

Expert Comment

by:jkr
ID: 11730734
>>This just converts a UNICODE wstring (buf) to normal ASCII char* string. I need to convert a string to a UTF8 string.

So, try the other way round:

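    const char* psz = "this is a test string in ASCII";
    int len = (int) strlen (psz) + 1;   // characters including the terminating NUL
    wchar_t* buf = new wchar_t [ len];

   // MultiByteToWideChar() takes six arguments, and 'buf' holds 'len' characters
   if  (   !MultiByteToWideChar (   CP_UTF8,
                                       0,
                                       psz,
                                       -1,
                                       buf,
                                       len
                                   )
       )
       {
           // error
       }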

  ImqString strText((char*) buf);   // ImqString is an MQ defined type
  msg.setFormat( MQFMT_STRING );    
  msg.setMessageLength(strFile.length());  
  msg.writeItem(strText);  
0
 
LVL 1

Author Comment

by:dumbo2569
ID: 11730825
JKR,

I just tried your suggestion and I'm getting this error. I don't think you can cast a char* directly from a wchar_t*.

C:\code\CPP\MQPut\MQPut.cpp(218) : error C2040: 'buf' : 'unsigned short *' differs in levels of indirection from 'class std::basic_string<unsigned short,struct std::char_traits<unsigned short>,class std::allocator<unsigned short> >'
C:\code\CPP\MQPut\MQPut.cpp(234) : error C2440: 'type cast' : cannot convert from 'class std::basic_string<unsigned short,struct std::char_traits<unsigned short>,class std::allocator<unsigned short> >' to 'char *'
0
 
LVL 86

Expert Comment

by:jkr
ID: 11730855
Have you used

wchar_t* buf = new wchar_t [ len];

or is it still the above code? I thought we were reading ASCII text now...
0
 
LVL 1

Author Comment

by:dumbo2569
ID: 11731032
I just did a copy and paste.

I'm getting compilation errors on this line:
wchar_t* buf = new wchar_t [ len];

and on this line from the (char*)buf;
ImqString strText((char*) buf);   // ImqString is an MQ defined type

BTW, thanks for all your help.
0
 
LVL 4

Expert Comment

by:AssafLavie
ID: 11732806
The following converts Unicode to UTF8:

#include <windows.h>
#include <string>
#include <iostream>
#include <assert.h>
#include <vector>

using namespace std;


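string unicodeToUtf8(const wstring& source)
{
      if (source.empty())
            return string();
      // First call: ask how many bytes the UTF8 output needs.
      int neededSize = WideCharToMultiByte(CP_UTF8, 0, source.c_str(), (int)source.size(), NULL, 0, NULL, NULL);
      assert(neededSize);
      vector<char> buffer(neededSize);
      // Second call: convert. Kept outside assert() so it still runs when NDEBUG is defined.
      int written = WideCharToMultiByte(CP_UTF8, 0, source.c_str(), (int)source.size(), &buffer[0], neededSize, NULL, NULL);
      assert(written == neededSize);
      (void)written;
      // An explicit length was passed, so the buffer has no terminating NUL;
      // build the string from the range instead of from a C string.
      return string(buffer.begin(), buffer.end());
}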

int main()
{
      cout << unicodeToUtf8(L"unicode text here");
      return 0;
}

HTH
0
 
LVL 1

Author Comment

by:dumbo2569
ID: 11775897
Thanks for all your help guys.

After tons of research, I realized that UTF8 is actually backwards compatible with ASCII. Meaning that a plain ASCII string is the same as a UTF8 string, so no conversion was necessary. Even though UTF8 is a multi-byte encoding scheme (1, 2, 3 or 4 bytes/character), it is a 1-byte/character string if you are using just plain ASCII.
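That conclusion only holds while every byte is in the ASCII range, which can be checked before sending. A minimal sketch, where is_plain_ascii is a hypothetical helper name rather than part of any MQ API:

```cpp
#include <string>

// Hypothetical helper: true when every byte is in the ASCII range 0x00-0x7F,
// in which case the buffer is already valid UTF8 with no conversion.
bool is_plain_ascii(const std::string& text)
{
    for (std::string::size_type i = 0; i < text.size(); ++i) {
        // Cast through unsigned char so bytes >= 0x80 compare correctly
        // on platforms where char is signed.
        if (static_cast<unsigned char>(text[i]) > 0x7F)
            return false;
    }
    return true;
}
```

If the check fails, the text contains non-ASCII bytes, and whether it is valid UTF8 then depends on the encoding it was written in.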

Since you guys were a bunch of help and I did use a lot of your code to point me to my conclusion, I'll split the points up.

Thanks.
0


Question has a verified solution.
