Solved: converting a Unicode file to an ASCII file and vice versa?

What have you tried so far?

What kind of Unicode text file do you have?
UTF-8, UTF-16 or UTF-32? Or just Windows Unicode (UCS-2)?

jkr,

Does that really work? I thought wfstreams expected characters to be narrow in their external representation and were only wchar_t when in RAM. If I write to a wofstream I see narrow characters on disk.

Here's some struggling I did with this a couple of years ago http:/Q_20797258.html
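A quick way to check the narrow-on-disk behaviour described above is to write through a wofstream and read the file back as raw bytes. This is only a sketch: the function name `bytes_written_by_wofstream` and the scratch file name are made up for illustration, and it assumes the default (classic "C" locale) codecvt facet.

```cpp
#include <cassert>    // assert(), for the usage check described after the block
#include <fstream>
#include <iterator>
#include <locale>
#include <stdexcept>
#include <string>

// Writes `text` through a std::wofstream imbued with the classic "C" locale,
// then reads the file back byte by byte so the on-disk form can be inspected.
// The function name and `path` are hypothetical, chosen for this illustration.
std::string bytes_written_by_wofstream(const std::wstring& text, const char* path)
{
    {
        std::wofstream fout;
        fout.imbue(std::locale::classic());
        fout.open(path, std::ios::out | std::ios::binary);
        if (!fout)
            throw std::runtime_error("unable to create file");
        fout << text;
    }   // leaving the scope flushes and closes the stream

    std::ifstream fin(path, std::ios::in | std::ios::binary);
    return std::string((std::istreambuf_iterator<char>(fin)),
                       std::istreambuf_iterator<char>());
}
```

With plain ASCII input, `assert(bytes_written_by_wofstream(L"Hello", "wofstream_demo.txt") == "Hello");` should hold: one byte per character on disk, even though the program only ever handled wchar_t.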

chip3d

Yes, if you use a wostream, the codecvt facet will narrow the wchar_t before writing... As long as you have values no higher than 255, everything is written to the file; as soon as there is a higher value, like Arabic letters, the stream will stop and set the failbit.

You need to replace the default codecvt facet with one that performs no conversion, just as described in the article in your previous post.

But be careful with cross-platform apps. Most Windows compilers use a 16-bit wchar_t, while on Linux it is normally 32-bit...

Or just read the document with a byte stream and perform a reinterpret_cast, or a conversion to the internal type. This way you could also read the multibyte formats like UTF-8 and UTF-16. Source for the conversion is available from unicode.org: ftp://www.unicode.org/Public/PROGRAMS/CVTUTF/

For full Unicode support you can use the ICU library from IBM...
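The byte-stream route suggested above can be sketched without relying on sizeof(wchar_t) at all. The following is a minimal illustration, assuming a C++11 compiler (for char16_t and std::u16string) and a UTF-16LE file; `decode_utf16le` and `read_utf16le_file` are hypothetical helper names, not part of any library.

```cpp
#include <cassert>    // assert(), for the sanity check below
#include <cstddef>
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>

// Assembles 16-bit code units from raw little-endian byte pairs.
// A leading BOM (U+FEFF) is dropped if present.
std::u16string decode_utf16le(const std::string& bytes)
{
    if (bytes.size() % 2 != 0)
        throw std::runtime_error("odd number of bytes in UTF-16 data");

    std::u16string units;
    units.reserve(bytes.size() / 2);
    for (std::size_t i = 0; i < bytes.size(); i += 2) {
        const unsigned lo = static_cast<unsigned char>(bytes[i]);
        const unsigned hi = static_cast<unsigned char>(bytes[i + 1]);
        units.push_back(static_cast<char16_t>(lo | (hi << 8)));
    }
    assert(units.size() == bytes.size() / 2);

    if (!units.empty() && units[0] == 0xFEFF)
        units.erase(0, 1);
    return units;
}

// Reads the whole file in binary mode and decodes it.
std::u16string read_utf16le_file(const char* path)
{
    std::ifstream fin(path, std::ios::in | std::ios::binary);
    if (!fin)
        throw std::runtime_error("unable to open file");
    const std::string bytes((std::istreambuf_iterator<char>(fin)),
                            std::istreambuf_iterator<char>());
    return decode_utf16le(bytes);
}
```

Because the byte order is handled explicitly, the same code should behave identically whether wchar_t is 16 or 32 bits wide. Surrogate pairs are left as two code units; turning them into code points is a separate step.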

rstaveley

The C++ standard library leaves this whole area as an exercise for the reader. Java provides a fundamental distinction between character and byte streams - see http://java.sun.com/docs/books/tutorial/i18n/text/stream.html.

yasimplicity

ASKER

mr jkr:

Your code does not convert anything.

Unicode is still Unicode.

ANSI is still ANSI.

rstaveley

yasimplicity, that's because wfstreams have an external (= on file) representation that uses narrow characters. I recommend that you look at my codecvt example at http:/Q_20797258.html#9749822 and read my conclusion at the bottom of that thread (that example isn't portable because it makes assumptions about sizeof(wchar_t)). As confirmed by chip3d, you need to prevent the stream from converting the wide characters to and from narrow ones when they are written to or read from disk. That's where a "do nothing" codecvt facet is needed, which you need to imbue the stream with.

Beware that a codecvt facet does not handle the BOM (= byte order mark), and you'll need to write/read that to/from the UNICODE stream yourself to make it a recognisable UNICODE text file.

I'm rusty in this area. It was a few years back that I had a struggle with it and wasn't entirely comfortable with my conclusion. You'll find that the codecvt is usable, however, as long as you bear in mind the need to apply/strip your own byte order mark.
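Since the BOM has to be handled by hand, something needs to look at the first few bytes of the file. Here is a rough sketch of that check. The byte signatures are the standard BOM encodings of U+FEFF; the `Encoding` enum and `detect_bom` function are hypothetical names invented for this example.

```cpp
#include <cassert>    // assert(), for the usage checks described after the block
#include <cstddef>
#include <string>

enum Encoding { Unknown, Utf8, Utf16LE, Utf16BE, Utf32LE, Utf32BE };

// Inspects the first bytes of a file (passed in `head`) and reports which
// BOM, if any, they start with. `bom_length` receives the number of bytes
// to skip before the text proper (0 when no BOM is recognised).
Encoding detect_bom(const std::string& head, std::size_t& bom_length)
{
    struct Signature { Encoding encoding; const char* bytes; std::size_t length; };

    // The UTF-32LE signature starts with the UTF-16LE one, so the longer
    // signatures are tried first.
    static const Signature signatures[] = {
        { Utf32LE, "\xFF\xFE\x00\x00", 4 },
        { Utf32BE, "\x00\x00\xFE\xFF", 4 },
        { Utf8,    "\xEF\xBB\xBF",     3 },
        { Utf16LE, "\xFF\xFE",         2 },
        { Utf16BE, "\xFE\xFF",         2 },
    };
    const std::size_t count = sizeof signatures / sizeof signatures[0];

    for (std::size_t i = 0; i < count; ++i) {
        const Signature& s = signatures[i];
        if (head.size() >= s.length &&
            head.compare(0, s.length, s.bytes, s.length) == 0) {
            bom_length = s.length;
            return s.encoding;
        }
    }
    bom_length = 0;
    return Unknown;
}
```

For example, passing the four bytes FF FE 48 00 should report `Utf16LE` with a `bom_length` of 2, while plain "Hello" reports `Unknown`. Note that FF FE 00 00 is ambiguous in principle (it could be a UTF-16LE BOM followed by U+0000), so treat the result as a heuristic.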

rstaveley

I've just dug up the following code snippet from a project in which I had to load a UNICODE UTF-16LE file. I'd figured out how to imbue streams properly with locales when I wrote this, which wasn't the case in the links I directed you to, which use a funky Microsoft-specific macro. This example also handles the BOM properly - though not portably.

You might find this useful, but beware that it isn't portable, because it makes assumptions that wchar_t is 16-bit (Windows) rather than 32-bit (Linux), because it was designed to work with the UTF-16LE files you get generated by MSXML.

What does it do? Nothing much! It reads a UTF-16LE file (unicode.xml) line by line and writes the content to a UTF-16LE file (unicode_2.xml) line by line, but with a few simple edits you could use it to convert to and from ANSI.

This compiles and works with MS VC 7.1 and compiles but won't work with GCC 3.2+ on Linux, because of the sizeof wchar_t assumption. If you can get this into shape for Linux, let me know. It would be nice to make this portable.

--------8<--------
#include <iostream>
#include <fstream>
#include <string>
#include <locale>
#include <list>
#include <algorithm>
#include <iterator>

namespace noconv {

class codecvt : public std::codecvt<wchar_t,char,mbstate_t> {

public:
      explicit codecvt(size_t refs = 0) : std::codecvt<wchar_t,char,mbstate_t>(refs) {}

protected:
      virtual      result do_in(
            state_type& state
            ,const extern_type *from_begin
            ,const extern_type *from_end
            ,const extern_type *&from_next
            ,intern_type *to_begin
            ,intern_type *to_end
            ,intern_type *&to_next
            ) const
      {
       return noconv;
      }

/* Here's where we convert from the internal representation to the external
representation written to disk */

      virtual      result do_out(
            state_type& state
            ,const intern_type *from_begin
            ,const intern_type *from_end
            ,const intern_type *&from_next
            ,extern_type *to_begin
            ,extern_type *to_end
            ,extern_type *&to_next
            ) const
      {
            const intern_type *src = from_begin;
            const intern_type *src_end = from_end;
            intern_type *dst = reinterpret_cast<intern_type*>(to_begin);
            intern_type *dst_end = reinterpret_cast<intern_type*>(to_end);

            while (dst+1 <= dst_end && src < src_end)
                  *dst++ = *src++;

            from_next = src;
            to_next = reinterpret_cast<extern_type*>(dst);

       return ok;
      }
      virtual      result do_unshift(
            state_type& state
            ,extern_type *to_begin
            ,extern_type *to_end
            ,extern_type *&to_next
            ) const
      {
       return noconv;
      }

      virtual      int do_length(
            state_type& state
            , const extern_type *from_begin
            ,const extern_type *from_end
            ,size_t max_internal_chars
            ) const throw()
      {
       return std::min(max_internal_chars,size_t(from_end-from_begin));
      }

/* Never converts? Not true for us. */

      virtual      bool do_always_noconv() const throw()
      {
       return false;
      }

/* Max extern_type for one intern_type */

      virtual      int do_max_length() const throw()
      {
       return sizeof(intern_type);
      }

/* do_encoding returns one of the following:
      -1, if the external representation of a character uses
      a stateful encoding

      a constant number representing the maximum width in externT
      elements used to represent a character in a fixed-width encoding

      0, if the external representation of the characters in the
      character set uses a variable size encoding
*/

      virtual      int do_encoding() const throw()
      {
       return sizeof(intern_type);
      }
};
} // noconv namespace

int main()
{
      std::list<std::wstring> contents;
      try {
            std::locale loc(std::locale::classic(),new noconv::codecvt);
            std::wifstream fin;
            fin.imbue(loc);
            fin.open("unicode.xml");
            if (!fin) {
                  std::cerr << "Error: Unable to open fin\n";
                  return 2;
            }

            std::wstring wstr;

            wchar_t signature;
            fin.read(&signature,1);

            // Little-endian reading
            if (signature != 0xfeff)
                  return std::cerr << "Error: File is not UTF-16 UNICODE\n",3;

            bool shown = false;
            while (getline(fin,wstr)) {
                  if (!shown) {
                        std::wcout << L'"' << wstr << L'"' << L'\n';
                        shown = true;
                  }
                  contents.push_back(wstr);
            }

            fin.close();

            std::wofstream fout;
            fout.imbue(loc);
            fout.open("unicode_2.xml");
            if (!fout) {
                  std::cerr << "Error: Unable to create fout\n";
                  return 2;
            }

            signature = 0xfeff;
            fout.write(&signature,1);

            //copy(contents.begin(),contents.end(),std::ostream_iterator<std::wstring,wchar_t>(fout,L"\n"));
            copy(contents.begin(),contents.end(),std::ostream_iterator<std::wstring,wchar_t>(fout));
            fout.close();

      }
      catch (const std::exception& e) {
            std::cerr << "Exception: " << e.what() << std::endl;
      }

}
--------8<--------

yasimplicity

ASKER

Doesn't ASCII have a BOM?

rstaveley

No. Try this:

--------8<--------
#include <cstdlib>      // system()
#include <fstream>

int main()
{
      std::ofstream fout("hello.txt");
      fout << "Hello";
      fout.close();
      system("dir hello.txt");
      system("type hello.txt");
}
--------8<--------

Your text file has 5 bytes in it, with each byte corresponding to a character in the string "Hello".

Now try this:
--------8<--------
#include <iostream>
#include <fstream>
#include <string>
#include <locale>
#include <algorithm>    // std::min
#include <cstdlib>      // system()

namespace noconv {

class codecvt : public std::codecvt<wchar_t,char,mbstate_t> {

public:
      explicit codecvt(size_t refs = 0) : std::codecvt<wchar_t,char,mbstate_t>(refs) {}

protected:
      virtual result do_in(
            state_type& state
            ,const extern_type *from_begin
            ,const extern_type *from_end
            ,const extern_type *&from_next
            ,intern_type *to_begin
            ,intern_type *to_end
            ,intern_type *&to_next
            ) const
      {
            return noconv;
      }

/* Here's where we convert from the internal representation to the external
representation written to disk */

      virtual result do_out(
            state_type& state
            ,const intern_type *from_begin
            ,const intern_type *from_end
            ,const intern_type *&from_next
            ,extern_type *to_begin
            ,extern_type *to_end
            ,extern_type *&to_next
            ) const
      {
            const intern_type *src = from_begin;
            const intern_type *src_end = from_end;
            intern_type *dst = reinterpret_cast<intern_type*>(to_begin);
            intern_type *dst_end = reinterpret_cast<intern_type*>(to_end);

            while (dst+1 <= dst_end && src < src_end)
                  *dst++ = *src++;

            from_next = src;
            to_next = reinterpret_cast<extern_type*>(dst);

            return ok;
      }

      virtual result do_unshift(
            state_type& state
            ,extern_type *to_begin
            ,extern_type *to_end
            ,extern_type *&to_next
            ) const
      {
            return noconv;
      }

      virtual int do_length(
            state_type& state
            ,const extern_type *from_begin
            ,const extern_type *from_end
            ,size_t max_internal_chars
            ) const throw()
      {
            return std::min(max_internal_chars,size_t(from_end-from_begin));
      }

/* Never converts? Not true for us. */

      virtual bool do_always_noconv() const throw()
      {
            return false;
      }

/* Max extern_type for one intern_type */

      virtual int do_max_length() const throw()
      {
            return sizeof(intern_type);
      }

/* do_encoding returns one of the following:
      -1, if the external representation of a character uses
      a stateful encoding

      a constant number representing the maximum width in externT
      elements used to represent a character in a fixed-width encoding

      0, if the external representation of the characters in the
      character set uses a variable size encoding
*/

      virtual int do_encoding() const throw()
      {
            return sizeof(intern_type);
      }
};
} // noconv namespace

int main()
{
      try {
            std::locale loc(std::locale::classic(),new noconv::codecvt);
            std::wofstream fout;
            fout.imbue(loc);
            fout.open("hello2.txt");
            if (!fout) {
                  std::cerr << "Error: Unable to create fout\n";
                  return 2;
            }

            wchar_t signature = 0xfeff;
            fout.write(&signature,1);

            fout << L"Hello";
            fout.close();

            system("dir hello2.txt");
            system("type hello2.txt");
      }
      catch (const std::exception& e) {
            std::cerr << "Exception: " << e.what() << std::endl;
      }
}
--------8<--------

Your Unicode text file has its 16-bit BOM in it, indicating that it is a 16-bit little-endian file. Along with the 5 x 16 bits for L"Hello" you have a file size of 12 bytes.

BOMs are an uncomfortable thing. Here's a good look-up for them: http://www.i18nguy.com/unicode/c-unicode.html#BOM
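The 12-byte arithmetic (2 bytes of BOM plus 5 x 2 bytes of text) can also be reproduced without a custom codecvt, by building the bytes explicitly and writing them through a narrow stream in binary mode. This is a sketch under the assumption of a C++11 compiler; `encode_utf16le_with_bom` and `write_bytes` are hypothetical helper names.

```cpp
#include <cassert>    // assert(), for the usage check described after the block
#include <fstream>
#include <ios>
#include <string>

// Serialises 16-bit code units as UTF-16LE, preceded by the BOM (FF FE).
std::string encode_utf16le_with_bom(const std::u16string& text)
{
    std::string bytes;
    bytes.reserve(2 + 2 * text.size());
    bytes += '\xFF';                                  // BOM U+FEFF, low byte first
    bytes += '\xFE';
    for (char16_t unit : text) {
        bytes += static_cast<char>(unit & 0xFF);         // low byte
        bytes += static_cast<char>((unit >> 8) & 0xFF);  // high byte
    }
    return bytes;
}

// Writes the bytes unchanged; binary mode stops any newline translation
// from corrupting the 16-bit units.
bool write_bytes(const char* path, const std::string& bytes)
{
    std::ofstream fout(path, std::ios::out | std::ios::binary);
    fout.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
    return fout.good();
}
```

`assert(encode_utf16le_with_bom(u"Hello").size() == 12);` should hold, matching the file size above. Since the byte order is spelled out, this version does not depend on the width or endianness of wchar_t.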

rstaveley

It would be nice if content type and encoding could both be provided by the directory system (cf. MIME). Having BOMs to handle encoding partially and file extensions to cover content type partially is a real mess, isn't it?

yasimplicity

ASKER

copy(

contents.begin(),
contents.end(),
std::ostream_iterator<std::wstring,wchar_t>(fout)

);

How do I copy it as ASCII rather than wide Unicode?

rstaveley

Converting from wchar_t to char the "standard way" is pretty ugly. You need to use ctype's narrow, which means using a facet from the locale. You would have thought that you'd be able to use iterators with it and convert directly from an istreambuf_iterator to an ostreambuf_iterator, but the standard just has it working with character pointers. In the code below, I load a wchar_t vector and use narrow to convert the wide characters to narrow characters.

Now that I'm writing this explanation I ask myself why I didn't simply copy from an istream_iterator imbued with our no-conversion locale and write to an ostream_iterator imbued with the classic locale. That would definitely be a lot simpler than the following code, but having gone to the effort of putting together the following illustration, I can't bring myself to delete it :-)

Here it is for what it's worth...
--------8<--------
#include <iostream>
#include <fstream>
#include <string>
#include <locale>
#include <iterator>
#include <algorithm>
#include <vector>
#include <cctype>

namespace noconv {

class codecvt : public std::codecvt<wchar_t,char,mbstate_t> {

public:
      explicit codecvt(size_t refs = 0) : std::codecvt<wchar_t,char,mbstate_t>(refs) {}

protected:
      virtual      result do_in(
            state_type& state
            ,const extern_type *from_begin
            ,const extern_type *from_end
            ,const extern_type *&from_next
            ,intern_type *to_begin
            ,intern_type *to_end
            ,intern_type *&to_next
            ) const
      {
              return noconv;
      }

/* Here's where we convert from the internal representation to the external
representation written to disk */

      virtual      result do_out(
            state_type& state
            ,const intern_type *from_begin
            ,const intern_type *from_end
            ,const intern_type *&from_next
            ,extern_type *to_begin
            ,extern_type *to_end
            ,extern_type *&to_next
            ) const
      {
            const intern_type *src = from_begin;
            const intern_type *src_end = from_end;
            intern_type *dst = reinterpret_cast<intern_type*>(to_begin);
            intern_type *dst_end = reinterpret_cast<intern_type*>(to_end);

            while (dst+1 <= dst_end && src < src_end)
                  *dst++ = *src++;

            from_next = src;
            to_next = reinterpret_cast<extern_type*>(dst);

            return ok;
      }
      virtual      result do_unshift(
            state_type& state
            ,extern_type *to_begin
            ,extern_type *to_end
            ,extern_type *&to_next
            ) const
      {
            return noconv;
      }

      virtual      int do_length(
            state_type& state
            , const extern_type *from_begin
            ,const extern_type *from_end
            ,size_t max_internal_chars
            ) const throw()
      {
            return std::min(max_internal_chars,size_t(from_end-from_begin));
      }

/* Never converts? Not true for us. */

      virtual      bool do_always_noconv() const throw()
      {
            return false;
      }

/* Max extern_type for one intern_type */

      virtual      int do_max_length() const throw()
      {
            return sizeof(intern_type);
      }

/* do_encoding returns one of the following:
      -1, if the external representation of a character uses
      a stateful encoding

      a constant number representing the maximum width in externT
      elements used to represent a character in a fixed-width encoding

      0, if the external representation of the characters in the
      character set uses a variable size encoding
*/

      virtual      int do_encoding() const throw()
      {
            return sizeof(intern_type);
      }
};
} // noconv namespace

int main()
{
      try {
            std::locale loc(std::locale::classic(),new noconv::codecvt);
            {
                  std::wofstream wfout;      // Create a 16LE unicode file
                  wfout.imbue(loc);
                  wfout.open("wide.txt");
                  if (!wfout)
                        return std::cerr << "Error: Unable to create wfout\n",2;

                  wchar_t signature = 0xfeff;
                  wfout.write(&signature,1);

                  wfout << L"Hello";
            }
            {
                  std::wifstream wfin;      // Open the 16LE unicode file
                  wfin.imbue(loc);
                  wfin.open("wide.txt");
                  if (!wfin)
                        return std::cerr << "Error: Unable to open wfin\n",2;

                  wchar_t signature;
                  wfin.read(&signature,1);

                  // Little-endian reading
                  if (signature != 0xfeff)
                        return std::cerr << "Error: File is not UTF-16LE UNICODE\n",3;

                  typedef std::istreambuf_iterator<wchar_t> IItr;
                  std::vector<wchar_t> wcontent(IItr(wfin),(IItr()));
                  const int contentLength = wcontent.size();
                  std::vector<char> ncontent(contentLength);
                  bool success = (std::use_facet<std::ctype<wchar_t> >(loc).narrow
                        (&wcontent[0],&wcontent[contentLength],'?',&ncontent[0]) != 0);
                  if (!success)
                        return std::cerr << "Error: Narrow failed\n",4;

                  std::ofstream nfout;      // Create a narrow character (ANSI) file
                  nfout.open("narrow.txt");
                  if (!nfout)
                        return std::cerr << "Error: Unable to create nfout\n",5;

                  typedef std::ostreambuf_iterator<char> OItr;
                  copy(ncontent.begin(),ncontent.end(),OItr(nfout));
            }
      }
      catch (const std::exception& e) {
            std::cerr << "Exception: " << e.what() << std::endl;
      }
}
--------8<--------

rstaveley

Oh yes... beware that there is no attempt to handle non-ASCII characters in that! If it was seriously being used to convert between UTF-8 and UTF-16 as it suggests, it ought to deal with multi-byte UTF-8 characters. It would be safer to use this only to convert between ISO-8859-1 (Latin1) and UTF-16. You need to work with jkr's suggested wcstombs/mbstowcs functions to handle multi-byte characters.

All said and done, if you *really* need to convert between XML file formats in Windows, look no further than Microsoft's XSLT support. Scroll to the bottom of http://msdn.microsoft.com/XML/XMLDownloads/default.aspx and follow the link to the Command Line Transformation Utility (msxsl.exe), which comes with source code.
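To make the multi-byte point concrete, here is roughly what a hand-written UTF-16 to UTF-8 conversion has to do, including surrogate pairs. It is a minimal sketch, assuming a C++11 compiler; `utf16_to_utf8` is a hypothetical name, and a real project would more likely lean on ICU or the unicode.org conversion source mentioned earlier in the thread.

```cpp
#include <cassert>    // assert(), for the usage checks described after the block
#include <cstddef>
#include <stdexcept>
#include <string>

// Converts UTF-16 code units to UTF-8 bytes. Unpaired surrogates are
// rejected with an exception rather than silently replaced.
std::string utf16_to_utf8(const std::u16string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];

        if (cp >= 0xD800 && cp <= 0xDBFF) {             // high surrogate
            if (i + 1 >= in.size() || in[i + 1] < 0xDC00 || in[i + 1] > 0xDFFF)
                throw std::runtime_error("unpaired high surrogate");
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;                                        // consumed the low surrogate
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            throw std::runtime_error("unpaired low surrogate");
        }

        if (cp < 0x80) {                                // 1 byte: ASCII
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {                        // 2 bytes
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {                      // 3 bytes
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                                        // 4 bytes
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

ASCII passes through unchanged, U+00E9 becomes the two bytes C3 A9, and U+20AC becomes E2 82 AC, which is exactly the case a one-char-per-character narrowing cannot represent.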

yasimplicity

ASKER

That is it.

But there are some runtime errors about going out of the bounds of "wcontent" when executing

&wcontent[contentLength]

I'll deal with it myself.

Anyhow, thanks a lot, Mr <rstaveley>
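The out-of-bounds report is most likely down to operator[] being called with an index equal to size(), which checked builds of the standard library reject even though the element is never read. One possible fix is to form the end pointer by arithmetic and to guard against an empty vector. A sketch, with `narrow_with_ctype` as a hypothetical helper name:

```cpp
#include <cassert>    // assert(), for the usage check described after the block
#include <locale>
#include <string>
#include <vector>

// Narrows wide characters with ctype<wchar_t>::narrow, substituting
// `fallback` for anything `loc` cannot represent in a single char.
std::string narrow_with_ctype(const std::vector<wchar_t>& wcontent,
                              const std::locale& loc = std::locale::classic(),
                              char fallback = '?')
{
    if (wcontent.empty())
        return std::string();        // &wcontent[0] is not valid on an empty vector

    std::vector<char> ncontent(wcontent.size());
    const wchar_t* first = &wcontent[0];
    const wchar_t* last = first + wcontent.size();   // one past the end, never dereferenced

    std::use_facet<std::ctype<wchar_t> >(loc).narrow(first, last, fallback, &ncontent[0]);
    return std::string(ncontent.begin(), ncontent.end());
}
```

For a vector holding L'H' and L'i', `assert(narrow_with_ctype(w) == "Hi");` should hold, and an empty vector simply yields an empty string.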