Solved

IOStreams and char to wchar_t conversion...

Posted on 2004-08-23
16
1,012 Views
Last Modified: 2013-12-14
Hi, Experts!

I'm relatively new to using STL IOStreams and need somebody to kick me in the right direction.

My understanding is that if one were to use a wide stream to read UTF-8-encoded strings, the stream would automatically interpret a multiple-byte character and blow it out to its wide-character equivalent via the codecvt<wchar_t, char, mbstate_t> facet of the locale associated with the stream. Is this correct?

If it is, my problem is that the wide stream is reading each individual narrow character and turning it into a wide character (1:1) instead of decoding a series of narrow characters as one wide character.

Below is a contrived example, but it illustrates my point:

------------------------
#define UNICODE

#include "afxwin.h"
#include <fstream>
#include <iostream>
#include <string>

using namespace std;

int main()
{
     ofstream fout;
     wifstream fin;
     wstring strIncorrect;
     wstring strCorrect;
     char fileName[] = "c:\\test.txt";

     // write a Japanese UTF-8 character to the file
     fout.open(fileName);
     fout << "\xe3\x82\xb9";
     fout.close();

     // read the Japanese character from the file using a wide stream
     // and display it
     fin.open(fileName);
     getline(fin, strIncorrect);
     fin.close();

     strCorrect = L'\u30B9'; // the Unicode equivalent of "\xe3\x82\xb9"

     wcout << L"Incorrect size: " << (unsigned int)strIncorrect.size() << endl;
     wcout << L"Correct size: " << (unsigned int)strCorrect.size() << endl;

     // use message boxes to display the unicode character since wcout won't
     // display it properly
     AfxMessageBox(CString(L"Incorrect: ") + strIncorrect.c_str());
     AfxMessageBox(CString(L"Correct: ") + strCorrect.c_str());
}

------------------------

If you had Asian fonts installed and opened the generated c:\test.txt in Notepad, you'd see a Japanese character instead of the individual narrow characters "πé╣"; likewise with the message boxes that pop up.

In the above example, how do I get the stream to give me a single wide character instead of the individual characters that make up the multibyte character?

Compiler is VC7.

Help puhleeeze!!!
Question by:rafd123

16 Comments
 
Expert Comment by:rstaveley (LVL 17)

Streams have internal and external representations: the internal representation is how strings are held in memory, and the external representation is how they are read from and written to disk. If you create a wofstream and output L"Hello" in the US locale (or any European locale), you will get a file of 5 bytes, because the external representation uses 8-bit characters; only the internal representation uses wchar_t. Similarly, a wifstream expects 8-bit characters in the external representation (i.e. it expects L"Hello" to arrive as 5 bytes).

Author Comment by:rafd123 (LVL 4)

I understand the concept of internal and external representations of characters.

However, I was under the impression that the codecvt<wchar_t, char, mbstate_t> facet (wchar_t being the internal representation, char being the external representation, and mbstate_t being the state type used to determine if a char is a single character or the first byte of a multibyte character) would be able to translate a series of individual 8-bit characters into a wide character.

My understanding is that when the wide stream reads an 8-bit character using codecvt<wchar_t, char, mbstate_t>::in(), the in() function returns codecvt_base::partial to tell the stream that the current 8-bit character is the first byte of a multibyte character so that it may continue to find that last 8-bit character of a series of 8-bit characters that is to become the internal character (in this case, calling mbtowc() to create a wchar_t).

Am I living in some sort of twisted reality of my own creation? :D

Expert Comment by:rstaveley (LVL 17)

Here's my disclaimer... I picked up on your question, because I've noticed this TA is relatively light on experience when it comes to i18n. Related questions in the past haven't had much of a response.  I have the Langer & Kreft IOStreams and Locales look sitting on my shelf, but I must confess that I've never needed to use it in anger. So please take my guidance with a pinch of salt.

Now here's a wild stab... I don't see you imbuing a locale in your test code above.

Author Comment by:rafd123 (LVL 4)

I bought the exact book two days ago for this very purpose; I assure you that before two days ago, I knew nothing about this junk. I apologize if I angered you in any way, but I'm trying REAL hard to figure out how this crap works (by reading books on the subject, asking experts, stepping into STL code, etc...).

I tried imbuing the stream (even with the Japanese_Japan locale) with no success:

fin.imbue(locale("Japanese_Japan"));

I had no choice but to step into Microsoft's STL code (ugh!!!). It turns out that the codecvt<wchar_t, char, mbstate_t>::do_in() DOES call _Mbrtowc() (per the xlocale header file)...which eventually calls MultiByteToWideChar() (per xmbtowc.c)...the exact behavior I've been expecting. The trick, apparently, is to imbue the stream with a codecvt that is associated with a locale that is associated with a code page of CP_UTF8:

locale loc(fin.getloc(), new codecvt<wchar_t, char, mbstate_t>(_Locinfo("Japanese_Japan")));
fin.imbue(loc);

The question now is: what locale has a code page of UTF8?



Expert Comment by:rstaveley (LVL 17)

> I apologize if I angered you in any way

Sorry if I gave that impression. Of course you didn't. I'm similarly perplexed about locales too, but other things keep cropping up and I never quite get to grips with them :-)

I get the impression that there is no locale with a code page of UTF8. Similarly, I believe there are no locales with wide character external representations. It would be an elegant way to do ISO-8859-1 <> UNICODE conversions if there were. Do let me know what you find out, though. I'm sorry, I'm not being much help. Hopefully an i18n expert will emerge in this thread... it may wind up being you, rafd123  ;-)

Expert Comment by:itsmeandnobodyelse (LVL 39)

rstaveley, I admire your diction, though it always costs me some time to understand what you've written. I found out what a disclaimer is, translated TA to 'Topic Area', and got i18n == 'internationalization'.

But what do you want to say with

>> I have the Langer & Kreft IOStreams and Locales look sitting on my shelf, but I
>> must confess that I've never needed to use it in anger

and

>> So please take my guidance with a pinch of salt.

???

Regards, Alex

Sorry for being ignorant ;-)


Accepted Solution by:anthony_w (LVL 4, earned 400 total points)

In the absence of a locale which supports UTF-8, you have to do the conversion yourself. You can write a custom facet to do the UTF-8 translations, and add that facet to the locale used by the stream. Alternatively, you can do the conversion before writing/after reading.

Here is some code to encode/decode to/from UTF8.

decodeUtf8 reads bytes from the iterator range passed in, and returns a single character. You use it like so:

std::ifstream inputStream("somefile",std::ios::binary);
std::istreambuf_iterator<char> start(inputStream);
std::istreambuf_iterator<char> const end;

while(start!=end)
{
    unsigned long const readChar=decodeUtf8(start,end);
    doSomethingWith(readChar);
}

encodeUtf8 does the reverse; it encodes a single character as UTF8 and dumps it to the output iterator. e.g.:

std::wstring const someWideString=getWideStringFromSomewhere();
std::ofstream outputStream("somefile",std::ios::binary);
std::ostreambuf_iterator<char> outputIterator(outputStream);

for(std::wstring::const_iterator it=someWideString.begin();it!=someWideString.end();++it)
{
    encodeUtf8(*it,outputIterator);
}

template<typename InputIterator>
unsigned long decodeUtf8(InputIterator& current,InputIterator const& end)
{
    if(current==end)
        throw std::runtime_error("No Char");
       
    unsigned char c=*current++;
          
    if(c&0x80)
    {
        unsigned numBytesInChar=1;
        numBytesInChar=(c&0x40)?
            ((c&0x20)?
             ((c&0x10)?
              ((c&0x8)?
               ((c&0x4)?
                ((c&0x2)?
                 0:6):5):4):3):2):0;
          
        if(!numBytesInChar)
            throw std::runtime_error("Invalid UTF8 encoded character");
            
        unsigned long currentChar=c&(0x7f>>numBytesInChar);
                   
        for(unsigned currentByte=1;currentByte!=numBytesInChar;++currentByte)
        {
            if(current==end)
                throw std::runtime_error("Invalid UTF8 encoded character");
                       
            c=*current++;
                       
            if((c&0xc0)!=0x80)
                throw std::runtime_error("Invalid UTF8 encoded character");
                       
            currentChar=(currentChar<<6)+(c&0x3f);
        }
      
        return currentChar;
    }
    else
    {
        return c;
    }
}


template<typename OutputIterator>
void encodeUtf8(unsigned long c,OutputIterator& dest)
{
    const unsigned numBytesInChar=(c&0x7c000000)?6:
        ((c&0x3e00000)?5:
         ((c&0x1f0000)?4:
          ((c&0xf800)?3:
           ((c&0x780)?2:1))));
           
    const unsigned long val=c;
           
    unsigned shift=6*(numBytesInChar-1);
      
    // write the first byte if we haven't already done so
    *dest++=(numBytesInChar==1)?(val&0x7f):
        (((0x3f00>>numBytesInChar)&0xff)+((val>>shift)&0x3f));
    shift-=6;

    for(unsigned bytesLeft=numBytesInChar-1;bytesLeft;++dest,shift-=6,--bytesLeft)
    {
        *dest=0x80+((val>>shift)&0x3f);
    }
}

Expert Comment by:rstaveley (LVL 17)

Alex,

My apologies for the excessive and unnecessary jargon. I guess it's a reflection of my own locale :-)

>> I have the Langer & Kreft IOStreams and Locales look sitting on my shelf, but I
>> must confess that I've never needed to use it in anger

look = book (don't know how my 'l' key found its way over to the 'b'!)
"use it in anger" = use it for any real purpose
"take ... with a pinch of salt" = don't completely believe what is said

anthony_w,

That's good code, but you shied away from illustrating how to write the custom facet. Is that because the custom facet is the wrong tool for the job?

Author Comment by:rafd123 (LVL 4)

I had a feeling the conversion might need to be written from scratch.

Ok, dudes. Here's my attempt at a UTF8-aware codecvt (code review, anyone?):

----------------------------------

class Utf8CodeCvt : public std::codecvt<wchar_t, char, mbstate_t>
{
public:
     Utf8CodeCvt(){}
     virtual ~Utf8CodeCvt(){}

     static bool IsLeadByte(const char& c)  // Note: Implemented this 'cause Microsoft's implementation is tied to the global locale
     {
          return NumBytesInMbChar(c) == 1 ? false : true;
     }

     static unsigned int NumBytesInMbChar(const char& c)
     {
          unsigned ret = 1;

          if(c&0x80)
          {
               ret = (c&0x40)?((c&0x20)?((c&0x10)?((c&0x8)?1:4):3):2):1;
          }

          return ret;
     }

protected:
     virtual result do_in(mbstate_t& state,
                          const char* from,
                          const char* from_end,
                          const char*& from_next,
                          wchar_t* to,
                          wchar_t* to_end,
                          wchar_t*& to_next) const
     {
          result ret = noconv;

          if(0 == state)
          {
               int bytesInChar = NumBytesInMbChar(*from);
               state = bytesInChar > 1 ? bytesInChar : 0; //if this is a multibyte char, store the number
                                                          //of bytes in state; if not, the state is 0
          }          

          if(0 == state)
          {
               // state is 0, so this must be a single byte character
               *to = from[0];
               to_next = to + 1; // set the next character in the "to" buffer one past the character
                                 // we just wrote to it
               from_next = from_end; // continue processing the "from" buffer from where it left off
               ret = ok;
          }
          else if(from_end - from == state && to_end - to >= 1)
          {
               // we have all the bytes necessary for the multibyte character and
               // the "to" buffer is large enough to store a wide character; let's
               // do the conversion
               if(0 == MultiByteToWideChar(CP_UTF8,
                                           NULL,
                                           from,
                                           state,
                                           to,
                                           1))
               {                    
                    ret = error;
               }
               else
               {
                    to_next = to + 1; // set the next character in the "to" buffer one past the character
                                      // we just wrote to it
                    ret = ok;
               }

               state = 0; // reset the state
               from_next = from_end; // continue processing the "from" buffer from where it left off
          }
          else
          {
               if(from_end - from < state) // need more "from" bytes to make a complete multi byte character
               {
                    from_next = from; // continue processing the "from" buffer from the same place it started
               }
               else // need more "to" bytes to store the converted character
               {
                    assert(to_end - to < 1);
                    to_next = to_end; // set where the next allocation of the "to" buffer should be
               }
               
               return partial; // indicate to the stream that we need more
          }

          return ret;          
     }

     virtual result do_out(mbstate_t& state,
                           const wchar_t* from,
                           const wchar_t* from_end,
                           const wchar_t*& from_next,
                           char* to,
                           char* to_end,
                           char*& to_next) const
     {
          result ret = noconv;

          if(0 == state)
          {
               // store the number of chars required in state
               state = WideCharToMultiByte(CP_UTF8,
                                           NULL,
                                           from,
                                           (int)(from_end - from),
                                           NULL,
                                           0,
                                           NULL,
                                           NULL);

               if(0 == state)
               {
                    ret = error;
               }              
          }

          if(error != ret)
          {
               assert(0 < state);

               if(to_end - to >= state)
               {
                    // the "to" buffer is large enough; let's do the conversion
                    int bytesWritten = WideCharToMultiByte(CP_UTF8,
                                                           NULL,
                                                           from,
                                                           (int)(from_end - from),
                                                           to,
                                                           (int)(to_end - to),
                                                           NULL,
                                                           NULL);

                    if(0 == bytesWritten)
                    {
                         ret = error;
                    }
                    else
                    {
                         to_next = to + bytesWritten; // set the next character in the "to" buffer one past the characters written
                         ret = ok;
                    }

                    state = 0; // reset the state
                    from_next = from_end; // continue processing the "from" buffer from where it left off
               }
               else // need more "to" bytes to store the converted character
               {
                    to_next = to_end; // set where the next allocation of the "to" buffer should be
                    ret = partial;
               }
          }          

          return ret;
     }

     virtual result do_unshift(mbstate_t& state,
                               char* to,
                               char* to_end,
                               char*& to_next) const
     {
          return noconv; // not needed since the encoding is state independent
     }

     virtual int do_length(mbstate_t& state,
                           const char* from,
                           const char* end,
                           size_t max ) const throw()
     {
          return (int)((max < (size_t)(end - from)) ? max : end - from);
     }

     virtual bool do_always_noconv() const throw()
     {
          return false;
     }

     virtual int do_max_length() const throw()
     {
          return 4; // maximum number of chars it takes to make a wchar_t
     }

     virtual int do_encoding() const throw()
     {
          return 0; // 0 indicates variable number of chars to make a wchar_t
     }
};

Expert Comment by:rstaveley (LVL 17)

It isn't portable, of course, because you are using a couple of Windoze functions and you have a hard-coded assumption about sizeof(wchar_t) - unless I'm mistaken about do_max_length().

Now to reveal the extent of my ignorance... how would you use this to convert iso-8859-1 <> utf8... because that ought to be a simple test to achieve with test data with French accents in it? You ought to be able to write XML which can be viewed with Internet Explorer with either encoding and see it work.

Author Comment by:rafd123 (LVL 4)

It's true that it isn't portable because of the MultiByteToWideChar and WideCharToMultiByte functions; I plan on fixing that...just wanted to get something to work quick and dirty. The do_max_length() should be portable since it's referring to the maximum number of chars (not bytes) that make up a UTF-8 character; perhaps my comment is misleading.

This conversion routine should be able to handle any iso-8859-1 characters you throw at it since iso-8859-1 is a subset of UTF-8...specifically, the first 256 characters...right?

Thanks for your help, guys!! It's MUCH appreciated!

Assisted Solution by:rstaveley (LVL 17, earned 100 total points)

If I use XSLT to transform a document with ISO-8859-1 encoding to UTF-8 and I put a French word like élève into the ISO-8859-1 document, there are different byte values for the accented characters in the UTF-8.

Let's say I start with:
--------8<--------
<?xml version="1.0" encoding="iso-8859-1"?>
<test>
élève
</test>
--------8<--------

NB: You can read that as I pasted it, because Experts Exchange uses charset=iso-8859-1.

I transform it with:
--------8<--------
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" encoding="utf-8" />
<xsl:template match="/">
      <xsl:copy-of select="." />
</xsl:template>
</xsl:stylesheet>
--------8<--------

The accented characters each use two bytes, as you might expect.

Transforming it back again with...
--------8<--------
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" encoding="iso-8859-1" />
<xsl:template match="/">
      <xsl:copy-of select="." />
</xsl:template>
</xsl:stylesheet>
--------8<--------

...gets the single byte characters again.

So we can see that ISO-8859-1 isn't a subset of UTF-8. It is an 8-bit character set, favoured by Europeans and Americans who haven't got anything better to do with characters > 127.

So getting back to the world of C++, do you reckon that your facet could be used to achieve the same conversions which the XSLTs above do? It should be a pretty good test. All you need to do is change the encoding= attribute in the XML declaration?

Author Comment by:rafd123 (LVL 4)

Ah, you're right rstaveley! ISO-8859-1 isn't a subset of UTF-8. The Utf8CodeCvt will puke on ISO-8859-1 characters!

However, if the file you're reading isn't UTF-8 encoded, then there's no need to imbue the stream with Utf8CodeCvt...because if you did, it could possibly fail to read it correctly if you have ISO-8859-1 characters in it.

Try this:
--------8<--------
<?xml version="1.0" encoding="UTF-8"?>
<test>
élève
</test>
--------8<--------

When you try to parse this with an XML parser, it should puke for the same reason that Utf8CodeCvt pukes.

So I guess the lesson for us (I'm not being patronizing when I say this...I'm glad we're fleshing this out) is that the program using the stream needs to somehow determine in advance how to imbue the stream (e.g. detecting BOMs, analyzing the encoding attribute of the XML declaration, etc...) before doing so.

Would you say this is accurate?

BTW, I meant to split the "accepted answer" points since both you and anthony_w contributed...but I goofed because I have never done it before, and it doesn't look like I can change it. I'm going to have to get a moderator to change it! Sorry about that!

Thanks again for your help!

Expert Comment by:rstaveley (LVL 17)

Presumably XML parsers detect the BOM and, in its absence, read the stream as 7-bit US-ASCII up to the first '>', and then imbue the stream with the appropriate encoding per the encoding attribute in the XML declaration.

I was trying to think of an application for your custom facet in the context of XML. I guess you could imbue the input stream with UTF-8-ness when you've read up to the first '>' in an XML file, and have found encoding="utf-8". For "iso-8859-1", you could leave it as it is (assuming your locale can cope with it?). To generate UTF-8, you need to imbue the output stream with UTF-8-ness.

You ought to be able to create command-line utilities (e.g. utf8ify and unutf8ify) to convert text files to and from UTF-8, using your facet. It would be quite a nice illustration.

Expert Comment by:anthony_w (LVL 4)

rstaveley said:

> That's good code, but you shied away from illustrating how to write the custom facet. Is that because the custom facet is the wrong tool for the job?

A custom facet requires that you know the encoding before you start, and limits you to using streams (using a facet outside of a stream is a real PITA). If that's your usage, then that's probably best, because once imbued you can use the stream as normal. When I wrote the code above, I wanted to be able to handle any iterators (e.g. for data stored in a std::string, or std::vector<char>, as well as for streams), so a facet was inappropriate.


> Presumably XML parsers detect the BOM and in its absense, read the stream as US ASCII7 up to the first '>', and then imbue the stream with the appropriate encoding per the encoding attribute in the XML declaration.

The XML standard suggests ways of determining the encoding (appendix F). In the first instance, if you have any out-of-band info specifying the character set (e.g. a MIME charset header), the parser should probably use that. Failing that, the presence of a BOM can narrow it down to UTF-8 or a 16- or 32-bit encoding with a specific byte order. Failing that, the text must have an XML declaration with an encoding, if it is not UTF-8 or UTF-16. Next you can read the first few bytes, which must represent <?xml, and narrow it down to an encoding with enough information to read the full XML declaration and determine the encoding (e.g. EBCDIC, but no specific code page, or an encoding which shares the first 128 characters with US-ASCII, such as ISO-8859-x or UTF-8).

> I was trying to think of an application for your custom facet in the context of XML. I guess you could imbue the input stream with UTF-8-ness when you've read up to the first '>' in an XML file, and have found encoding="utf-8". For "iso-8859-1", you could leave it as it is (assuming your locale can cope with it?). To generate UTF-8, you need to imbue the output stream with UTF-8-ness.

Yes.