Solved

Convert C++ string to UTF8

Posted on 2004-08-04
Last Modified: 2011-04-14
Hi,

Is there any way to read in a text file as UTF8, or to store it in a UTF8 buffer?  Thanks.

Perry.

P.S. This is extremely urgent so I'm awarding it 500 points.
Question by:dumbo2569
15 Comments
 
LVL 86

Accepted Solution

by:
jkr earned 200 total points
ID: 11720497
Sure, all you have to take care of is using wide character (UNICODE) strings, e.g.

#include <string>
#include <fstream>
using namespace std;

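void read_unicode_file ( const char* pszName, wstring& buf) {

    wifstream is (pszName);
    wstring strLine;
    // Test getline() itself: looping on !is.eof() never terminates
    // if the file cannot be opened
    while (getline ( is, strLine)) {

        buf += strLine;   // note: getline() drops the line breaks
    }
}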

would read the whole file into a wstring called buf.
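
A side note for anyone who needs the UTF8 bytes rather than wide characters: wifstream converts the file's bytes according to the stream's locale, so what ends up in buf depends on that locale. If the file is already saved as UTF8, a minimal sketch is to read the raw bytes into a std::string, which is then a UTF8 buffer as-is. The name readFileBytes is made up for illustration, and the sketch assumes the file really is UTF8 on disk.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: returns the file's bytes exactly as stored on disk.
// If the file was saved as UTF-8, the returned string is a UTF-8 buffer.
// Returns an empty string if the file cannot be opened.
std::string readFileBytes(const char* path)
{
    std::ifstream in(path, std::ios::binary);   // binary: no newline translation
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

The result can be passed to anything that takes a char* via c_str(), with size() giving the byte count rather than the character count.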
0
 
LVL 86

Expert Comment

by:jkr
ID: 11720590
BTW, if it is about converting ASCII text to UNICODE, you could use 'mbstowcs()', e.g.

/* MBSTOWCS.CPP illustrates the behavior of the mbstowcs function
 */

#include <stdlib.h>
#include <stdio.h>

int main( void )
{
    size_t i;
    const wchar_t *pwchello = L"Hi";
    /* room for two characters plus the terminator in each buffer */
    char    *pmbhello = (char *)malloc( 3 * MB_CUR_MAX );
    wchar_t *pwc      = (wchar_t *)malloc( 3 * sizeof( wchar_t ));

    printf( "Convert to multibyte string:\n" );
    i = wcstombs( pmbhello, pwchello, 3 * MB_CUR_MAX );
    printf( "\tBytes converted: %u\n", (unsigned)i );
    printf( "\tHex value of first" );
    /* print the first byte itself, not the pointer */
    printf( " multibyte character: %#.4x\n\n", (unsigned)(unsigned char)pmbhello[0] );

    printf( "Convert back to wide-character string:\n" );
    i = mbstowcs( pwc, pmbhello, 3 );
    printf( "\tCharacters converted: %u\n", (unsigned)i );
    printf( "\tHex value of first" );
    printf( " wide character: %#.4x\n\n", (unsigned)pwc[0] );

    free( pmbhello );
    free( pwc );
    return 0;
}
0
 
LVL 1

Author Comment

by:dumbo2569
ID: 11720873
What's the difference between UNICODE and UTF?
0
 
LVL 86

Expert Comment

by:jkr
ID: 11720895
UNICODE is the superset of UTF8.
0
 
LVL 1

Author Comment

by:dumbo2569
ID: 11721315
Is there any way to print buf to the console so that I can verify the data?
0
 
LVL 4

Assisted Solution

by:AssafLavie
AssafLavie earned 300 total points
ID: 11721792
>  UNICODE is the superset of UTF8.

jkr, isn't UTF8 the 8-bit encoding of Unicode (which is 16 bit)?
Just like UTF7 is the 70bit encoding of Unicode...
No?
0
 
LVL 4

Expert Comment

by:AssafLavie
ID: 11721794
correction: i mean to say 7bit, not 70.. obviously.
0
 
LVL 86

Expert Comment

by:jkr
ID: 11722440
>>jkr, isn't UTF8 the 8-bit encoding of Unicode (which is 16 bit)?

Ooops, you are right, it is a special code page - see e.g. http://support.microsoft.com/default.aspx?scid=kb;en-us;175392 ("INFO: UTF8 Support")

----------------------------->8----------------------------

UTF8 is a code page that uses a string of bytes to represent a 16-bit Unicode string where ASCII text (<=U+007F) remains unchanged as a single byte, U+0080-07FF (including Latin, Greek, Cyrillic, Hebrew, and Arabic) is converted to a 2-byte sequence, and U+0800-FFFF (Chinese, Japanese, Korean, and others) becomes a 3-byte sequence.

----------------------------->8----------------------------

So, the idea is to

    // UTF8 needs up to 3 bytes per 16-bit wchar_t, plus the terminator
    int   cbAnsi    =   (int) buf.size() * 3 + 1;
    char* pszAnsi   =   new char [ cbAnsi];

    if  (   !WideCharToMultiByte    (   CP_UTF8,
                                        0,
                                        buf.c_str(),
                                        -1,
                                        pszAnsi,
                                        cbAnsi,
                                        NULL,
                                        NULL
                                    )
        )
        {
            // error
        }
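
For reference, the byte layout quoted from the KB article can also be sketched without the Win32 API. The helper below is only an illustration (appendUtf8 is a made-up name, not something from the KB article or from MQ): it takes one Unicode code point and appends its UTF8 bytes to a std::string. It also covers the 4-byte form for code points above U+FFFF, which the quoted text does not mention.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: append the UTF-8 encoding of one code point to `out`.
void appendUtf8(std::string& out, unsigned long cp)
{
    // Surrogates and values above U+10FFFF are not encodable code points.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

    if (cp <= 0x7F) {                       // ASCII: one byte, unchanged
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {               // two bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {              // three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                                // four bytes: 11110xxx followed by three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}
```

Calling it with U+0041 appends the single byte 'A', while U+00E9 appends the two bytes 0xC3 0xA9.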

0
 
LVL 1

Author Comment

by:dumbo2569
ID: 11730522
This just converts a UNICODE wstring (buf) to normal ASCII char* string. I need to convert a string to a UTF8 string.

See, my problem is that I need to send an MQMessage to an MQ queue. My program (C++, sender program), sends the message correctly to the queue. However, another client program (JAVA, consumer program), reads the message as UTF8.

My code: (sends the message in ASCII)
  //message preparation
  ImqString strText(strFile.c_str());   // ImqString is an MQ defined type
  msg.setFormat( MQFMT_STRING );      
  msg.setMessageLength(strFile.length());  
  msg.writeItem(strText);  

Their code: (reads the message in UTF8)
message.readUTF();

However, the problem with the MQ C++ APIs is that they only accept char* and strings as input parameters for the message. There is no C++ equivalent to "message.writeUTF(string)".

Is there any way to create a UTF8 string and trick the compiler into using a char* to point to that string?
0
 
LVL 86

Expert Comment

by:jkr
ID: 11730734
>>This just converts a UNICODE wstring (buf) to normal ASCII char* string. I need to convert a string to a UTF8 string.

So, try the other way round:

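    const char* psz = "this is a test string in ASCII";
    int len = (int) strlen (psz) + 1;   // includes the terminating null
    wchar_t* buf = new wchar_t [ len];

   // MultiByteToWideChar takes six arguments, and the last one must not
   // exceed the number of wchar_t elements allocated for buf
   if  (   !MultiByteToWideChar (   CP_UTF8,
                                       0,
                                       psz,
                                       -1,
                                       buf,
                                       len
                                   )
       )
       {
           // error
       }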

  ImqString strText((char*) buf);   // ImqString is an MQ defined type
  msg.setFormat( MQFMT_STRING );    
  msg.setMessageLength(strFile.length());  
  msg.writeItem(strText);  
0
 
LVL 1

Author Comment

by:dumbo2569
ID: 11730825
JKR,

I just tried your suggestion and I'm getting this error. I don't think you can cast a char* directly from a wchar_t*.

C:\code\CPP\MQPut\MQPut.cpp(218) : error C2040: 'buf' : 'unsigned short *' differs in levels of indirection from 'class std::basic_string<unsigned short,struct std::char_traits<unsigned short>,class std::allocator<unsigned short> >'
C:\code\CPP\MQPut\MQPut.cpp(234) : error C2440: 'type cast' : cannot convert from 'class std::basic_string<unsigned short,struct std::char_traits<unsigned short>,class std::allocator<unsigned short> >' to 'char *'
0
 
LVL 86

Expert Comment

by:jkr
ID: 11730855
Have you used

wchar_t* buf = new wchar_t [ len];

or is it still the above code? I thought we were reading ASCII text now...
0
 
LVL 1

Author Comment

by:dumbo2569
ID: 11731032
I just did a copy and paste.

I'm getting compilation errors on this line:
wchar_t* buf = new wchar_t [ len];

and on this line from the (char*)buf;
ImqString strText((char*) buf);   // ImqString is an MQ defined type

BTW, thanks for all your help.
0
 
LVL 4

Expert Comment

by:AssafLavie
ID: 11732806
The following converts Unicode to UTF8:

#include <windows.h>
#include <string>
#include <iostream>
#include <assert.h>
#include <vector>

using namespace std;


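string unicodeToUtf8(const wstring& source)
{
      if (source.empty())
            return string();
      // First call: ask how many bytes the UTF8 form needs (no terminator, since a length is passed)
      int neededSize = WideCharToMultiByte(CP_UTF8, 0, source.c_str(), (int)source.size(), NULL, 0, NULL, NULL);
      assert(neededSize > 0);
      vector<char> buffer(neededSize);
      // Keep the call outside assert() so it still runs when NDEBUG is defined
      int written = WideCharToMultiByte(CP_UTF8, 0, source.c_str(), (int)source.size(), &buffer[0], neededSize, NULL, NULL);
      assert(written == neededSize);
      (void)written;
      // The buffer is not null-terminated, so build the string from an explicit range
      return string(buffer.begin(), buffer.end());
}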

int main()
{
      cout << unicodeToUtf8(L"unicode text here");
      return 0;
}

HTH
0
 
LVL 1

Author Comment

by:dumbo2569
ID: 11775897
Thanks for all your help guys.

After tons of research, I realized that UTF8 is actually backwards compatible with ASCII. That means a plain ASCII string is already a valid UTF8 string, so no conversion was necessary. Even though UTF8 is a multi-byte encoding scheme (1, 2, 3 or 4 bytes per character), it uses exactly 1 byte per character if you are using just plain ASCII.

Since you guys were a big help and I did use a lot of your code to point me to my conclusion, I'll split the points up.

Thanks.
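
For anyone landing here with the same problem, the ASCII point can be checked mechanically before sending. A minimal sketch, assuming the message text is held in a std::string (isPlainAscii is a made-up name, not part of the MQ API): if every byte is 0x7F or below, the buffer is already valid UTF8 and can go out unchanged; otherwise it needs a real conversion from whatever code page it is in.

```cpp
#include <string>

// Hypothetical helper: true when every byte is in the 7-bit ASCII range
// (0x00-0x7F). Such a buffer is already valid UTF-8 as it stands.
bool isPlainAscii(const std::string& text)
{
    for (std::string::size_type i = 0; i < text.size(); ++i) {
        if (static_cast<unsigned char>(text[i]) > 0x7F)
            return false;   // non-ASCII byte: encoding matters from here on
    }
    return true;
}
```

The cast to unsigned char matters because plain char may be signed, in which case bytes above 0x7F would compare as negative values.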
0
