rstaveley asked:

Reading/writing wide character files with wifstream/wofstream

I've just tried the following code with VC 7.1 on Windoze and GCC 3.2 on Linux, and it generates a file of three bytes in both environments:
--------8<--------
#include <iostream>
#include <fstream>
#include <string>

int main()
{
      const char filename[] = "three_wchars.txt";
      std::wofstream file(filename);
      if (!file) {
            std::cerr << "Error: Unable to create " << filename << '\n';
            return 1;
      }
      std::wstring wstr(L"abc");
      file << wstr;
      file.close();
}
--------8<--------
I was expecting to get a 6 byte file.

What's going on?
rstaveley (ASKER)

If you've got an answer, you'll probably be able to sort out theblip's question at http:/Q_20796791.html too.
Looks like DanRollins found what I'm after here....

http:/Q_20318167.html#7124067
...but I can't get it to work.

Here's my best shot at re-implementing his approach to create a 3 character file, which I was hoping would show up as a 6 byte file (but I still get 3 bytes on GCC 3.2 and VC 7.1):
--------8<--------
#include <iostream>
#include <fstream>
#include <string>
#include <locale>

class Simple_codecvt : public std::codecvt<wchar_t,char,mbstate_t> {
public:
        typedef wchar_t _E;
        typedef char _To;
        typedef mbstate_t _St;
        explicit Simple_codecvt(size_t _R = 0) : std::codecvt<wchar_t,char,mbstate_t>(_R) {}
protected:
        virtual result do_in(_St& _State,const _To *_F1,const _To *_L1,const _To *& _Mid1,_E *_F2,_E *_L2,_E *& _Mid2) const {return noconv;}
        virtual result do_out(_St& _State,const _E *_F1,const _E *_L1,const _E *& _Mid1,_To *_F2, _To *_L2,_To *& _Mid2) const {return noconv;}
        virtual result do_unshift(_St& _State,_To *_F2, _To *_L2,_To *& _Mid2) const {return noconv;}
        virtual int do_length(_St& _State, const _To *_F1,const _To *_L1, size_t _N2) const throw() {return (_N2 < (size_t)(_L1-_F1)?_N2 :_L1 - _F1);}
        virtual bool do_always_noconv() const throw() {return true;}
        virtual int do_max_length() const throw() {return 2;}
        virtual int do_encoding() const throw() {return 2;}
};

int main()
{
        try {
                std::locale loc(std::locale::classic(),new Simple_codecvt);
                std::wofstream file;
                file.imbue(loc);
                file.open("three_wchars.txt");
                if (!file) {
                        std::cerr << "Error: Unable to create file\n";
                        return 1;
                }

                //std::wstring wstr(L"abc");
                //file << wstr /*<< std::endl*/;

                file << L"123";

                file.close();
        }
        catch (const std::exception& e) { // catch by reference to avoid slicing
                std::cerr << "Exception: " << e.what() << std::endl;
        }
}
--------8<--------
ASKER CERTIFIED SOLUTION
DanRollins

Thanks for looking at it, Dan. Sorry to drag this out of the archives!

This compiles for me on VC 7.1, when yours didn't. However, it doesn't do what I want it to do :-)

Presumably yours was VC 6.0 (because of the date)??

I'll try yours on VC 6.0, which I should still have hereabouts.
Yes... yours works for VC6.

The _ADDFAC macro isn't supported on VC7.1. My attempt to implement it was probably what was wrong.
Ah.... I'd put a wchar_t in the external representation.

The following compiles and works in VC 7.1 and VC 6.0. Unsurprisingly, however, bearing in mind the leading underscore, the _ADDFAC macro isn't supported by GCC. I wonder what the portable way of doing this is?
--------8<--------
#include <iostream>
#include <fstream>
#include <string>
#include <locale>

class Simple_codecvt : public std::codecvt<wchar_t,char,mbstate_t> {
public:
        typedef wchar_t _E;
        typedef char _To;
        typedef mbstate_t _St;
        explicit Simple_codecvt(size_t _R = 0) : std::codecvt<wchar_t,char,mbstate_t>(_R) {}
protected:
        virtual result do_in(_St& _State,const _To *_F1,const _To *_L1,const _To *& _Mid1,_E *_F2,_E *_L2,_E *& _Mid2) const {return noconv;}
        virtual result do_out(_St& _State,const _E *_F1,const _E *_L1,const _E *& _Mid1,_To *_F2, _To *_L2,_To *& _Mid2) const {return noconv;}
        virtual result do_unshift(_St& _State,_To *_F2, _To *_L2,_To *& _Mid2) const {return noconv;}
        virtual int do_length(_St& _State, const _To *_F1,const _To *_L1, size_t _N2) const throw() {return (_N2 < (size_t)(_L1-_F1)?_N2 :_L1 - _F1);}
        virtual bool do_always_noconv() const throw() {return true;}
        virtual int do_max_length() const throw() {return 2;}
        virtual int do_encoding() const throw() {return 2;}
};

int main()
{
        try {
                std::locale loc = std::_ADDFAC(std::locale::classic(),new Simple_codecvt);
                std::wofstream file;
                file.imbue(loc);
                file.open("three_wchars.txt",std::ios::trunc|std::ios::binary);
                if (!file) {
                        std::cerr << "Error: Unable to create file\n";
                        return 1;
                }

                std::wstring wstr(L"abc");
                file << wstr;
                file.close();
        }
        catch (const std::exception& e) { // catch by reference to avoid slicing
                std::cerr << "Exception: " << e.what() << std::endl;
        }
        return 0;
}
--------8<--------
Thanks for the points.  
I tried unscrambling that STL mess to figure out what _ADDFAC does, but once they get into the locale support, it's just a hopeless jumble.  That's the main reason I hate STL.  I can't imagine that code so obfuscated could be anything close to efficient.

If I want to output 32 bytes, I just output them :)

-- Dan
It beats me why there isn't a simple mechanism to specify the encoding of the stream that you are about to open for reading/writing in STL. It feels very wrong that wofstream needs to be opened in binary mode and for all the double-back somersaults to do UTF-16-ish file I/O. The need to use implementation specific macros like _ADDFAC... humph.

Having said that, many thanks for finding a way, Dan :-)
hmmm....

Did I miss something?

What was the answer?
> What was the answer?

The answer to the question was that wofstream has two representations of the character. There is the internal representation, which we all know and love; this is wchar_t (16 bits on VC/Windoze and 32 bits on GCC/Linux). There is, however, also an external representation - the representation that gets written to or read from file - and that was what threw me.

DanRollins's codecvt posting last year pointed me towards the embarrassingly virginal pages of my IOStreams and Locales for Dummies reference, which told me that IOStreams use an external representation, which is typically compact and appropriate for your locale. My locale is en_GB, which means that wfstream uses an 8-bit representation for characters. That's why file << wstr only generated a 3 byte file on my systems.
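
The internal/external split described above can be checked with a small sketch. It assumes the global locale is still the default "C" locale when the stream is created; the helper name `bytes_written_by_wofstream` and the demo file name are made up for illustration.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: writes `text` through a std::wofstream that uses the
// default locale, then reports how many bytes actually landed on disk.
std::size_t bytes_written_by_wofstream(const std::wstring& text,
                                       const char* filename)
{
    {
        std::wofstream out(filename, std::ios::binary | std::ios::trunc);
        out << text;
    }   // the stream is flushed and closed at the end of this scope

    // Re-read the file as raw bytes and count them.
    std::ifstream in(filename, std::ios::binary);
    std::string raw((std::istreambuf_iterator<char>(in)),
                    std::istreambuf_iterator<char>());
    return raw.size();
}
```

For `L"abc"` this should report 3 rather than `3 * sizeof(wchar_t)`, because the default codecvt facet narrows each plain ASCII character to a single byte on the way out.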

The rest of this thread reflects how hard it is for fools like me to apply a facet to a locale to get it to behave differently from the default implementation. The need to use non-portable macros like _ADDFAC in VC6/VC7, which means that the code isn't portable to GCC, is like reading "There be dragons" on an ancient map.

Simple_codecvt isn't portable. I haven't been able to test it, but it should surely need to have the following modification to be portable:

        virtual int do_max_length() const throw() {return sizeof(wchar_t);} /* Not necessarily 2 */
        virtual int do_encoding() const throw() {return sizeof(wchar_t);}

The reason why I haven't been able to test it is that I haven't found out the portable (or indeed GCC-specific) equivalent of _ADDFAC and therefore don't really know how to add this facet to the locale in GCC.
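
As far as I can tell, the portable counterpart of _ADDFAC is the standard `std::locale(const std::locale&, Facet*)` constructor, which the earlier attempt in this thread already uses. The sketch below pairs it with a facet whose `do_out`/`do_in` actually copy bytes instead of returning `noconv`; that this is what a conforming library needs is an assumption on my part, not something established in the thread. `Raw_wchar_codecvt`, `write_raw_wide` and `read_raw_wide` are hypothetical names, and `noexcept` assumes a C++11 or later compiler.

```cpp
#include <cstddef>
#include <cstring>
#include <cwchar>
#include <fstream>
#include <iterator>
#include <locale>
#include <string>

// Hypothetical facet: stores each wchar_t as sizeof(wchar_t) raw bytes in
// native byte order. This is NOT UTF-16 where wchar_t is 32 bits wide.
class Raw_wchar_codecvt : public std::codecvt<wchar_t, char, std::mbstate_t> {
public:
    explicit Raw_wchar_codecvt(std::size_t refs = 0)
        : std::codecvt<wchar_t, char, std::mbstate_t>(refs) {}

protected:
    // Internal (wchar_t) to external (char): copy the bytes of each character.
    virtual result do_out(std::mbstate_t&,
                          const wchar_t* from, const wchar_t* from_end,
                          const wchar_t*& from_next,
                          char* to, char* to_end, char*& to_next) const
    {
        from_next = from;
        to_next = to;
        while (from_next != from_end) {
            if (static_cast<std::size_t>(to_end - to_next) < sizeof(wchar_t))
                return partial;             // output buffer is full
            std::memcpy(to_next, from_next, sizeof(wchar_t));
            to_next += sizeof(wchar_t);
            ++from_next;
        }
        return ok;
    }

    // External (char) to internal (wchar_t): reassemble whole characters.
    virtual result do_in(std::mbstate_t&,
                         const char* from, const char* from_end,
                         const char*& from_next,
                         wchar_t* to, wchar_t* to_end, wchar_t*& to_next) const
    {
        from_next = from;
        to_next = to;
        while (to_next != to_end &&
               static_cast<std::size_t>(from_end - from_next) >= sizeof(wchar_t)) {
            std::memcpy(to_next, from_next, sizeof(wchar_t));
            from_next += sizeof(wchar_t);
            ++to_next;
        }
        return from_next == from_end ? ok : partial;
    }

    // Stateless encoding: nothing to flush at the end of output.
    virtual result do_unshift(std::mbstate_t&, char* to, char*,
                              char*& to_next) const
    {
        to_next = to;
        return noconv;
    }

    // Bytes consumed to produce at most `max` wide characters.
    virtual int do_length(std::mbstate_t&, const char* from,
                          const char* from_end, std::size_t max) const
    {
        const std::size_t whole =
            static_cast<std::size_t>(from_end - from) / sizeof(wchar_t);
        return static_cast<int>((whole < max ? whole : max) * sizeof(wchar_t));
    }

    virtual bool do_always_noconv() const noexcept { return false; }
    virtual int do_max_length() const noexcept { return static_cast<int>(sizeof(wchar_t)); }
    virtual int do_encoding() const noexcept { return static_cast<int>(sizeof(wchar_t)); }
};

// Writes `text` as raw wchar_t bytes; returns false if the file can't be written.
bool write_raw_wide(const std::wstring& text, const char* filename)
{
    std::locale loc(std::locale::classic(), new Raw_wchar_codecvt);
    std::wofstream file;
    file.imbue(loc);                        // imbue before open, as in the thread
    file.open(filename, std::ios::binary | std::ios::trunc);
    if (!file) return false;
    file << text;
    file.close();
    return file.good();
}

// Reads a file produced by write_raw_wide back into a wide string.
std::wstring read_raw_wide(const char* filename)
{
    std::locale loc(std::locale::classic(), new Raw_wchar_codecvt);
    std::wifstream file;
    file.imbue(loc);
    file.open(filename, std::ios::binary);
    return std::wstring((std::istreambuf_iterator<wchar_t>(file)),
                        std::istreambuf_iterator<wchar_t>());
}
```

With this facet, writing `L"abc"` should give a file of `3 * sizeof(wchar_t)` bytes, i.e. 6 with VC and 12 with GCC on Linux, so the files are not interchangeable between the two platforms.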

It would be nice to think that a codecvt class could be written to allow UTF-16 files to be read/written by wfstream http:/Q_20796791.html#9752719, but I can't see how the BOM (byte order mark) at the beginning of the stream would be elegantly handled by codecvt.
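
One way to sidestep the BOM question, closer in spirit to Dan's "just output them" remark, is to do the encoding outside the stream and write plain bytes. This is a different technique from a codecvt facet, not a solution to it. The sketch assumes the wide string holds valid Unicode code points (or UTF-16 code units where wchar_t is 16 bits) and always emits little-endian output; `append_utf16le_unit`, `to_utf16le_with_bom` and `write_utf16le_file` are hypothetical names.

```cpp
#include <fstream>
#include <string>

// Appends one 16-bit code unit in little-endian byte order.
void append_utf16le_unit(std::string& bytes, unsigned long unit)
{
    bytes += static_cast<char>(unit & 0xFF);
    bytes += static_cast<char>((unit >> 8) & 0xFF);
}

// Encodes `text` as UTF-16LE, preceded by a BOM (bytes FF FE on disk).
std::string to_utf16le_with_bom(const std::wstring& text)
{
    std::string bytes;
    append_utf16le_unit(bytes, 0xFEFF);
    for (std::wstring::size_type i = 0; i < text.size(); ++i) {
        unsigned long cp = static_cast<unsigned long>(text[i]);
        if (cp > 0xFFFF) {
            // Only reachable where wchar_t is wider than 16 bits:
            // split the code point into a surrogate pair.
            cp -= 0x10000;
            append_utf16le_unit(bytes, 0xD800 + (cp >> 10));
            append_utf16le_unit(bytes, 0xDC00 + (cp & 0x3FF));
        } else {
            // BMP character, or an existing UTF-16 code unit on 16-bit wchar_t.
            append_utf16le_unit(bytes, cp);
        }
    }
    return bytes;
}

// Writes the encoded bytes with a narrow stream opened in binary mode.
bool write_utf16le_file(const std::wstring& text, const char* filename)
{
    const std::string bytes = to_utf16le_with_bom(text);
    std::ofstream file(filename, std::ios::binary | std::ios::trunc);
    file.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
    return file.good();
}
```

For `L"abc"` this produces 8 bytes: the two BOM bytes followed by three 2-byte code units, regardless of whether wchar_t is 16 or 32 bits on the platform.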

The question was answered, but there are quite a few questions that ought to follow up from here...  :-)
Thanks for the clarification.