• Status: Solved

Converting a Unicode file to an ASCII file, and vice versa?

How do I read a Unicode text file and convert it into an ASCII file, and vice versa?

How do I read a Unicode file into an ASCII string, or read an ASCII file into a Unicode string, using only standard C++?

Thanks.

 
Asked by: yasimplicity
 
pcgabe commented:
What have you tried so far?
 
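jkr commented:
// UNICODE -> ANSI
#include <cstdlib>   // wcstombs, MB_CUR_MAX
#include <fstream>
#include <string>
#include <vector>

using namespace std;

int main () {

    wifstream is("unicode.txt");
    ofstream os("ansi.txt");

    wstring wstr;

    // Loop on getline itself (streams have eof(), not is_eof())
    while (getline(is, wstr)) {

        // One wide character can need up to MB_CUR_MAX bytes
        vector<char> buf(wstr.length() * MB_CUR_MAX + 1);

        // wcstombs returns (size_t)-1 on an unconvertible character
        if (wcstombs(&buf[0], wstr.c_str(), buf.size()) == static_cast<size_t>(-1))
            return 1;

        os << &buf[0] << endl;
    }

    return 0;
}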

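// ANSI -> UNICODE
#include <cstdlib>   // mbstowcs
#include <fstream>
#include <string>
#include <vector>

using namespace std;

int main () {

    ifstream is("ansi.txt");
    wofstream os("unicode.txt");

    string str;

    // Loop on getline itself (streams have eof(), not is_eof())
    while (getline(is, str)) {

        // Never more wide characters than input bytes, plus the terminator
        vector<wchar_t> buf(str.length() + 1);

        // mbstowcs returns (size_t)-1 on an invalid multibyte sequence
        if (mbstowcs(&buf[0], str.c_str(), buf.size()) == static_cast<size_t>(-1))
            return 1;

        os << &buf[0] << endl;
    }

    return 0;
}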
 
chip3d commented:
What kind of Unicode text file do you have?
UTF-8, UTF-16, or UTF-32? Or just Windows Unicode (UCS-2)?
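
If the file carries a byte order mark, the question can often be answered by looking at its first few bytes. The following is a minimal sketch of that check, not a full encoding detector: the function name detect_bom is made up for illustration, and a file without a BOM is simply reported as "unknown".

```cpp
#include <cstddef>
#include <string>

// Inspect the first bytes of a file and name the encoding its BOM implies.
// detect_bom is a hypothetical helper; "unknown" means no BOM was found
// (plain ASCII/ANSI files have none, and UTF-8 files often omit it).
std::string detect_bom(const std::string& head)
{
    const std::size_t n = head.size();
    // Check the 4-byte marks first: the UTF-32LE BOM begins with the UTF-16LE one.
    if (n >= 4 && head.compare(0, 4, std::string("\xFF\xFE\x00\x00", 4)) == 0) return "UTF-32LE";
    if (n >= 4 && head.compare(0, 4, std::string("\x00\x00\xFE\xFF", 4)) == 0) return "UTF-32BE";
    if (n >= 3 && head.compare(0, 3, "\xEF\xBB\xBF") == 0) return "UTF-8";
    if (n >= 2 && head.compare(0, 2, "\xFF\xFE") == 0) return "UTF-16LE";
    if (n >= 2 && head.compare(0, 2, "\xFE\xFF") == 0) return "UTF-16BE";
    return "unknown";
}
```

To use it, open the file with std::ifstream in std::ios::binary mode, read up to four bytes, and pass them in as a std::string built with the (pointer, length) constructor so that zero bytes are preserved.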
 
rstaveley commented:
jkr,

Does that really work? I thought wfstreams expected characters to be narrow in their external representation, and that they were only wchar_t while in RAM. If I write to a wofstream, I see narrow characters on disk.

Here's some struggling I did with this a couple of years ago: http:/Q_20797258.html
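
The narrowing is easy to observe by writing a wide string through a default wofstream and counting the bytes that land on disk. This is a small sketch under the assumption that the stream uses the classic "C" locale; bytes_on_disk is a made-up helper name and the file path is whatever the caller passes.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <locale>
#include <string>

// Write a wide string through a wofstream imbued with the classic locale
// and return the number of bytes that ended up in the file.
// bytes_on_disk is a hypothetical helper for illustration only.
std::size_t bytes_on_disk(const char* path, const std::wstring& text)
{
    {
        std::wofstream fout(path, std::ios::binary);
        fout.imbue(std::locale::classic());   // default codecvt: wide -> narrow
        fout << text;
    }   // fout is flushed and closed here

    // Read the file back as raw bytes and count them
    std::ifstream fin(path, std::ios::binary);
    const std::string bytes((std::istreambuf_iterator<char>(fin)),
                            std::istreambuf_iterator<char>());
    return bytes.size();
}
```

Calling bytes_on_disk("probe.txt", L"Hello") should report 5 bytes, one per character, rather than 5 x sizeof(wchar_t).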
 
chip3d commented:
Yes, if you use a wofstream, the codecvt facet narrows each wchar_t before writing. As long as every value is below 255, everything is written to the file; as soon as there is a higher value, such as an Arabic letter, the stream stops and sets the failbit.

You need to replace the default codecvt facet with one that performs no conversion, as described in the article from your previous post.

But be careful with cross-platform apps. Most Windows compilers use a 16-bit wchar_t, whereas on Linux it is normally 32-bit.

Alternatively, read the document as a byte stream and perform a reinterpret_cast, or a conversion to the internal type. That way you can also read the multibyte formats such as UTF-8 and UTF-16. You can get conversion source code from unicode.org: ftp://www.unicode.org/Public/PROGRAMS/CVTUTF/

For full Unicode support you can use the ICU library from IBM.
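
The byte-stream route might look roughly like the sketch below, which assembles UTF-16LE code units by hand instead of relying on reinterpret_cast, so it does not depend on sizeof(wchar_t) or on the machine's byte order. The function names are made up for illustration, and unsigned short is assumed to be 16 bits wide, as it is on the mainstream compilers discussed here.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Decode raw UTF-16LE bytes into 16-bit code units.
// A leading BOM (0xFEFF) is dropped; a trailing odd byte is ignored.
// Assumes unsigned short is 16 bits wide.
std::vector<unsigned short> decode_utf16le(const std::string& bytes)
{
    std::vector<unsigned short> units;
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        const unsigned lo = static_cast<unsigned char>(bytes[i]);
        const unsigned hi = static_cast<unsigned char>(bytes[i + 1]);
        units.push_back(static_cast<unsigned short>(lo | (hi << 8)));
    }
    if (!units.empty() && units[0] == 0xFEFF)
        units.erase(units.begin());
    return units;
}

// Read a whole file as bytes (binary mode, so no newline translation).
std::string read_bytes(const char* path)
{
    std::ifstream fin(path, std::ios::binary);
    return std::string((std::istreambuf_iterator<char>(fin)),
                       std::istreambuf_iterator<char>());
}
```

Something like decode_utf16le(read_bytes("unicode.txt")) then yields the code units, which can be narrowed or converted further as needed.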

 



 
rstaveley commented:
The C++ standard library leaves this whole area as an exercise for the reader. Java provides a fundamental differentiation between character and byte streams; see http://java.sun.com/docs/books/tutorial/i18n/text/stream.html.
 
yasimplicity (author) commented:
jkr,

Your code does not convert anything. The Unicode file is still Unicode, and the ANSI file is still ANSI.
 
rstaveley commented:
yasimplicity, that's because wfstreams have an external (= on file) representation that uses narrow characters. I recommend that you look at my codecvt example at http:/Q_20797258.html#9749822 and read my conclusion at the bottom of that thread (that example isn't portable because it makes assumptions about sizeof(wchar_t)). As confirmed by chip3d, you need to prevent the stream from converting the wide characters to and from narrow ones when they are written to or read from disk. That's where a "do nothing" codecvt facet is needed, which you need to imbue the stream with.

Beware that a codecvt facet does not handle the BOM (= byte order mark); you'll need to write it to, or read it from, the Unicode stream yourself to make the file a recognisable Unicode text file.

I'm rusty in this area. It was a few years back that I had a struggle with it, and I wasn't entirely comfortable with my conclusion. You'll find that the codecvt is usable, however, as long as you bear in mind the need to apply/strip your own byte order mark.
 
rstaveley commented:
I've just dug up the following code snippet from a project in which I had to load a UNICODE UTF-16LE file. I'd figured out how to imbue streams properly with locales when I wrote this, which wasn't the case in the links I directed you to, which use a funky Microsoft-specific macro. This example also handles the BOM properly - though not portably.

You might find this useful, but beware that it isn't portable, because it makes assumptions that wchar_t is 16-bit (Windows) rather than 32-bit (Linux), because it was designed to work with the UTF-16LE files you get generated by MSXML.

What does it do? Nothing much! It reads a UTF-16LE file (unicode.xml) line by line and writes the content to a UTF-16LE file (unicode_2.xml) line by line, but with a few simple edits you could use it to convert to and from ANSI.

This compiles and works with MS VC 7.1 and compiles but won't work with GCC 3.2+ on Linux, because of the sizeof wchar_t assumption. If you can get this into shape for Linux, let me know. It would be nice to make this portable.

--------8<--------
#include <iostream>
#include <fstream>
#include <string>
#include <locale>
#include <list>
#include <algorithm>
#include <iterator>

namespace noconv {

class codecvt : public std::codecvt<wchar_t,char,mbstate_t> {

public:
      explicit codecvt(size_t refs = 0) : std::codecvt<wchar_t,char,mbstate_t>(refs) {}

protected:
      virtual      result do_in(
            state_type& state
            ,const extern_type *from_begin
            ,const extern_type *from_end
            ,const extern_type *&from_next
            ,intern_type *to_begin
            ,intern_type *to_end
            ,intern_type *&to_next
            ) const
      {
              return noconv;
      }

/* Here's where we convert from the internal representation to the external
   representation written to disk */

      virtual      result do_out(
            state_type& state
            ,const intern_type *from_begin
            ,const intern_type *from_end
            ,const intern_type *&from_next
            ,extern_type *to_begin
            ,extern_type *to_end
            ,extern_type *&to_next
            ) const
      {
            const intern_type *src = from_begin;
            const intern_type *src_end = from_end;
            intern_type *dst = reinterpret_cast<intern_type*>(to_begin);
            intern_type *dst_end = reinterpret_cast<intern_type*>(to_end);

            while (dst+1 <= dst_end && src < src_end)
                  *dst++ = *src++;

            from_next = src;
            to_next = reinterpret_cast<extern_type*>(dst);

              return ok;
      }
      virtual      result do_unshift(
            state_type& state
            ,extern_type *to_begin
            ,extern_type *to_end
            ,extern_type *&to_next
            ) const
      {
              return noconv;
      }

      virtual      int do_length(
            state_type& state
            , const extern_type *from_begin
            ,const extern_type *from_end
            ,size_t max_internal_chars
            ) const throw()
      {
              return std::min(max_internal_chars,size_t(from_end-from_begin));
      }

/* Never converts? Not true for us. */

      virtual      bool do_always_noconv() const throw()
      {
              return false;
      }

/* Max extern_type for one intern_type */

      virtual      int do_max_length() const throw()
      {
              return sizeof(intern_type);
      }

/* do_encoding returns one of the following:
      -1, if the external representation of a character uses
      a stateful encoding

      a constant number representing the maximum width in externT
      elements used to represent a character in a fixed-width encoding

      0, if the external representation of the characters in the
      character set uses a variable size encoding
 */

      virtual      int do_encoding() const throw()
      {
              return sizeof(intern_type);
      }
};
} // noconv namespace

int main()
{
      std::list<std::wstring> contents;
      try {
            std::locale loc(std::locale::classic(),new noconv::codecvt);
            std::wifstream fin;
            fin.imbue(loc);
            fin.open("unicode.xml");
            if (!fin) {
                  std::cerr << "Error: Unable to open fin\n";
                  return 2;
            }

            std::wstring wstr;

            wchar_t signature;
            fin.read(&signature,1);

            // Little-endian reading
            if (signature != 0xfeff)
                  return std::cerr << "Error: File is not UTF-16 UNICODE\n",3;

            bool shown = false;
            while (getline(fin,wstr)) {
                  if (!shown) {
                        std::wcout << L'"' << wstr << L'"' << L'\n';
                        shown = true;
                  }
                  contents.push_back(wstr);
            }

            fin.close();

            std::wofstream fout;
            fout.imbue(loc);
            fout.open("unicode_2.xml");
            if (!fout) {
                  std::cerr << "Error: Unable to create fout\n";
                  return 2;
            }

            signature = 0xfeff;
            fout.write(&signature,1);

            //copy(contents.begin(),contents.end(),std::ostream_iterator<std::wstring,wchar_t>(fout,L"\n"));
            copy(contents.begin(),contents.end(),std::ostream_iterator<std::wstring,wchar_t>(fout));
            fout.close();

      }
      catch (const std::exception& e) {
            std::cerr << "Exception: " << e.what() << std::endl;
      }

}
--------8<--------
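
One way the sizeof(wchar_t) assumption could be avoided is to skip the codecvt entirely and build the UTF-16LE bytes by hand, then write them with an ordinary binary ofstream. The sketch below is not the approach used in the snippet above; encode_utf16le and put_unit are made-up names, and the wide string is assumed to hold UTF-16 code units where wchar_t is 16 bits and whole code points where it is 32 bits.

```cpp
#include <cstddef>
#include <string>

// Append one 16-bit code unit to out, low byte first (little-endian).
static void put_unit(std::string& out, unsigned long unit)
{
    out += static_cast<char>(unit & 0xFF);
    out += static_cast<char>((unit >> 8) & 0xFF);
}

// Encode a wide string as UTF-16LE bytes with a leading BOM.
// Where wchar_t is 32 bits (e.g. Linux), code points above U+FFFF are split
// into surrogate pairs; where it is 16 bits (Windows), units pass through.
std::string encode_utf16le(const std::wstring& text)
{
    std::string out;
    put_unit(out, 0xFEFF);                          // BOM: bytes FF FE on disk
    for (std::size_t i = 0; i < text.size(); ++i) {
        unsigned long cp = static_cast<unsigned long>(text[i]);
        if (cp > 0xFFFF) {
            cp -= 0x10000;
            put_unit(out, 0xD800 + (cp >> 10));     // high surrogate
            put_unit(out, 0xDC00 + (cp & 0x3FF));   // low surrogate
        } else {
            put_unit(out, cp);
        }
    }
    return out;
}
```

The resulting string can be written with std::ofstream opened in std::ios::binary mode, which also keeps the runtime from translating line endings.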
 
yasimplicity (author) commented:
Doesn't ASCII have a BOM?
 
rstaveley commented:
No. Try this:

--------8<--------
#include <cstdlib>   // system
#include <fstream>

int main()
{
      std::ofstream fout("hello.txt");
      fout << "Hello";
      fout.close();
      system("dir hello.txt");
      system("type hello.txt");
}
--------8<--------

Your text file has 5 bytes in it, with each byte corresponding to a character in the string "Hello".

Now try this:
--------8<--------
#include <algorithm>   // std::min
#include <cstdlib>     // system
#include <iostream>
#include <fstream>
#include <string>
#include <locale>

namespace noconv {

class codecvt : public std::codecvt<wchar_t,char,mbstate_t> {

public:
     explicit codecvt(size_t refs = 0) : std::codecvt<wchar_t,char,mbstate_t>(refs) {}

protected:
     virtual     result do_in(
          state_type& state
          ,const extern_type *from_begin
          ,const extern_type *from_end
          ,const extern_type *&from_next
          ,intern_type *to_begin
          ,intern_type *to_end
          ,intern_type *&to_next
          ) const
     {
             return noconv;
     }

/* Here's where we convert from the internal representation to the external
   representation written to disk */

     virtual     result do_out(
          state_type& state
          ,const intern_type *from_begin
          ,const intern_type *from_end
          ,const intern_type *&from_next
          ,extern_type *to_begin
          ,extern_type *to_end
          ,extern_type *&to_next
          ) const
     {
          const intern_type *src = from_begin;
          const intern_type *src_end = from_end;
          intern_type *dst = reinterpret_cast<intern_type*>(to_begin);
          intern_type *dst_end = reinterpret_cast<intern_type*>(to_end);

          while (dst+1 <= dst_end && src < src_end)
               *dst++ = *src++;

          from_next = src;
          to_next = reinterpret_cast<extern_type*>(dst);

             return ok;
     }
     virtual     result do_unshift(
          state_type& state
          ,extern_type *to_begin
          ,extern_type *to_end
          ,extern_type *&to_next
          ) const
     {
             return noconv;
     }

     virtual     int do_length(
          state_type& state
          , const extern_type *from_begin
          ,const extern_type *from_end
          ,size_t max_internal_chars
          ) const throw()
     {
             return std::min(max_internal_chars,size_t(from_end-from_begin));
     }

/* Never converts? Not true for us. */

     virtual     bool do_always_noconv() const throw()
     {
             return false;
     }

/* Max extern_type for one intern_type */

     virtual     int do_max_length() const throw()
     {
             return sizeof(intern_type);
     }

/* do_encoding returns one of the following:
     -1, if the external representation of a character uses
     a stateful encoding

     a constant number representing the maximum width in externT
     elements used to represent a character in a fixed-width encoding

     0, if the external representation of the characters in the
     character set uses a variable size encoding
 */

     virtual     int do_encoding() const throw()
     {
             return sizeof(intern_type);
     }
};
} // noconv namespace

int main()
{
     try {
          std::locale loc(std::locale::classic(),new noconv::codecvt);
          std::wofstream fout;
          fout.imbue(loc);
          fout.open("hello2.txt");
          if (!fout) {
               std::cerr << "Error: Unable to create fout\n";
               return 2;
          }

          wchar_t signature = 0xfeff;
          fout.write(&signature,1);

          fout << L"Hello";
          fout.close();

        system("dir hello2.txt");
        system("type hello2.txt");
     }
     catch (const std::exception& e) {
          std::cerr << "Exception: " << e.what() << std::endl;
     }
}
--------8<--------

Your Unicode text file starts with its 16-bit BOM, indicating that it is a 16-bit little-endian file. Along with the 5 x 16 bits for L"Hello", that gives a file size of 12 bytes.

BOMs are an uncomfortable thing. Here's a good look-up for them: http://www.i18nguy.com/unicode/c-unicode.html#BOM
 
rstaveley commented:
It would be nice if content type and encoding could both be provided by the directory system (cf. MIME). Having BOMs to handle encoding partially and file extensions to cover content type partially is a real mess, isn't it?
 
yasimplicity (author) commented:
copy(contents.begin(),
     contents.end(),
     std::ostream_iterator<std::wstring,wchar_t>(fout));

How do I copy it as ASCII rather than wide Unicode?
 
rstaveley commented:
Converting from wchar_t to char the "standard way" is pretty ugly. You need to use ctype's narrow, which means using a facet from the locale. You would have thought that you'd be able to use iterators with it and convert directly from an istreambuf_iterator to an ostreambuf_iterator, but the standard just has it working with character pointers. In the code below, I load a wchar_t vector and use narrow to convert the wide characters to narrow characters.

Now that I'm writing this explanation I ask myself why I didn't simply copy from an istream_iterator imbued with our no-conversion locale and write to an ostream_iterator imbued with the classic locale. That would definitely be a lot simpler than the following code, but having gone to the effort of putting together the following illustration, I can't bring myself to delete it :-)

Here it is for what it's worth...
--------8<--------
#include <iostream>
#include <fstream>
#include <string>
#include <locale>
#include <iterator>
#include <algorithm>
#include <vector>
#include <cctype>

namespace noconv {

class codecvt : public std::codecvt<wchar_t,char,mbstate_t> {

public:
      explicit codecvt(size_t refs = 0) : std::codecvt<wchar_t,char,mbstate_t>(refs) {}

protected:
      virtual      result do_in(
            state_type& state
            ,const extern_type *from_begin
            ,const extern_type *from_end
            ,const extern_type *&from_next
            ,intern_type *to_begin
            ,intern_type *to_end
            ,intern_type *&to_next
            ) const
      {
               return noconv;
      }

/* Here's where we convert from the internal representation to the external
   representation written to disk */

      virtual      result do_out(
            state_type& state
            ,const intern_type *from_begin
            ,const intern_type *from_end
            ,const intern_type *&from_next
            ,extern_type *to_begin
            ,extern_type *to_end
            ,extern_type *&to_next
            ) const
      {
            const intern_type *src = from_begin;
            const intern_type *src_end = from_end;
            intern_type *dst = reinterpret_cast<intern_type*>(to_begin);
            intern_type *dst_end = reinterpret_cast<intern_type*>(to_end);

            while (dst+1 <= dst_end && src < src_end)
                  *dst++ = *src++;

            from_next = src;
            to_next = reinterpret_cast<extern_type*>(dst);

            return ok;
      }
      virtual      result do_unshift(
            state_type& state
            ,extern_type *to_begin
            ,extern_type *to_end
            ,extern_type *&to_next
            ) const
      {
            return noconv;
      }

      virtual      int do_length(
            state_type& state
            , const extern_type *from_begin
            ,const extern_type *from_end
            ,size_t max_internal_chars
            ) const throw()
      {
            return std::min(max_internal_chars,size_t(from_end-from_begin));
      }

/* Never converts? Not true for us. */

      virtual      bool do_always_noconv() const throw()
      {
            return false;
      }

/* Max extern_type for one intern_type */

      virtual      int do_max_length() const throw()
      {
            return sizeof(intern_type);
      }

/* do_encoding returns one of the following:
      -1, if the external representation of a character uses
      a stateful encoding

      a constant number representing the maximum width in externT
      elements used to represent a character in a fixed-width encoding

      0, if the external representation of the characters in the
      character set uses a variable size encoding
 */

      virtual      int do_encoding() const throw()
      {
            return sizeof(intern_type);
      }
};
} // noconv namespace

int main()
{
      try {
            std::locale loc(std::locale::classic(),new noconv::codecvt);
            {
                  std::wofstream wfout;      // Create a 16LE unicode file
                  wfout.imbue(loc);
                  wfout.open("wide.txt");
                  if (!wfout)
                        return std::cerr << "Error: Unable to create wfout\n",2;

                  wchar_t signature = 0xfeff;
                  wfout.write(&signature,1);

                  wfout << L"Hello";
            }
            {
                  std::wifstream wfin;      // Open the 16LE unicode file
                  wfin.imbue(loc);
                  wfin.open("wide.txt");
                  if (!wfin)
                        return std::cerr << "Error: Unable to open wfin\n",2;

                  wchar_t signature;
                  wfin.read(&signature,1);

                  // Little-endian reading
                  if (signature != 0xfeff)
                        return std::cerr << "Error: File is not UTF-16LE UNICODE\n",3;

                  typedef std::istreambuf_iterator<wchar_t> IItr;
                  std::vector<wchar_t> wcontent(IItr(wfin),(IItr()));
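                  const std::size_t contentLength = wcontent.size();
                  std::vector<char> ncontent(contentLength);
                  // Was &wcontent[contentLength], which indexes one past the end
                  // (undefined behaviour); form the end pointer by arithmetic instead
                  const wchar_t* wbegin = contentLength ? &wcontent[0] : 0;
                  bool success = contentLength == 0 ||
                        (std::use_facet<std::ctype<wchar_t> >(loc).narrow
                        (wbegin,wbegin+contentLength,'?',&ncontent[0]) != 0);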
                  if (!success)
                        return std::cerr << "Error: Narrow failed\n",4;

                  std::ofstream nfout;      // Create a narrow character (ANSI) file
                  nfout.open("narrow.txt");
                  if (!nfout)
                        return std::cerr << "Error: Unable to create nfout\n",5;

                  typedef std::ostreambuf_iterator<char> OItr;
                  copy(ncontent.begin(),ncontent.end(),OItr(nfout));
            }
      }
      catch (const std::exception& e) {
            std::cerr << "Exception: " << e.what() << std::endl;
      }
}
--------8<--------
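
The narrowing step on its own can be wrapped in a small helper that works on a std::wstring and never indexes past the end of its buffer. This is a sketch rather than a replacement for the program above; narrow_copy is a made-up name, and the classic locale and '?' fallback are just defaults chosen to match the listing.

```cpp
#include <locale>
#include <string>

// Narrow a wide string with the locale's ctype facet, substituting
// `fallback` for any character that has no single-byte equivalent.
// narrow_copy is a hypothetical helper for illustration only.
std::string narrow_copy(const std::wstring& wide,
                        const std::locale& loc = std::locale::classic(),
                        char fallback = '?')
{
    std::string result(wide.size(), fallback);
    if (!wide.empty()) {
        const wchar_t* first = wide.data();
        // first + size() is a valid one-past-the-end pointer; no indexing needed
        std::use_facet<std::ctype<wchar_t> >(loc)
            .narrow(first, first + wide.size(), fallback, &result[0]);
    }
    return result;
}
```

Each wide character produces exactly one narrow character, so the result always has the same length as the input.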
 
rstaveley commented:
There's actually more to this than my previous posting suggests, alas. It has been a while since I had a go with all of this, and I'd forgotten that you need to open the noconv streams in binary mode and that, having opened them in binary mode, you need to do your own '\n' -> "\r\n" handling in Windows.

The following code takes a UTF-8 XML file, expecting the first line to be <?xml version="1.0" encoding="UTF-8"?>, and converts it into a UTF-16 file (as 16LE), rewriting the first line as <?xml version="1.0" encoding="UTF-16"?>. Converting in the opposite direction is analogous and left as an exercise for the reader. This works!

--------8<--------
#include <iostream>
#include <fstream>
#include <iterator>
#include <algorithm>
#include <string>
#include <cctype>
#include <locale>

namespace noconv {

class codecvt : public std::codecvt<wchar_t,char,mbstate_t> {

public:
      explicit codecvt(size_t refs = 0) : std::codecvt<wchar_t,char,mbstate_t>(refs) {}

protected:
      virtual      result do_in(
            state_type& state
            ,const extern_type *from_begin
            ,const extern_type *from_end
            ,const extern_type *&from_next
            ,intern_type *to_begin
            ,intern_type *to_end
            ,intern_type *&to_next
            ) const
      {
            return noconv;
      }

/* Here's where we convert from the internal representation to the external
   representation written to disk */

      virtual      result do_out(
            state_type& state
            ,const intern_type *from_begin
            ,const intern_type *from_end
            ,const intern_type *&from_next
            ,extern_type *to_begin
            ,extern_type *to_end
            ,extern_type *&to_next
            ) const
      {
            const intern_type *src = from_begin;
            const intern_type *src_end = from_end;
            intern_type *dst = reinterpret_cast<intern_type*>(to_begin);
            intern_type *dst_end = reinterpret_cast<intern_type*>(to_end);

            while (dst+1 <= dst_end && src < src_end)
                  *dst++ = *src++;

            from_next = src;
            to_next = reinterpret_cast<extern_type*>(dst);

            return ok;
      }
      virtual      result do_unshift(
            state_type& state
            ,extern_type *to_begin
            ,extern_type *to_end
            ,extern_type *&to_next
            ) const
      {
            return noconv;
      }

      virtual      int do_length(
            state_type& state
            , const extern_type *from_begin
            ,const extern_type *from_end
            ,size_t max_internal_chars
            ) const throw()
      {
            return std::min(max_internal_chars,size_t(from_end-from_begin));
      }

/* Never converts? Not true for us. */

      virtual      bool do_always_noconv() const throw()
      {
            return false;
      }

/* Max extern_type for one intern_type */

      virtual      int do_max_length() const throw()
      {
            return sizeof(intern_type);
      }

/* do_encoding returns one of the following:
      -1, if the external representation of a character uses
      a stateful encoding

      a constant number representing the maximum width in externT
      elements used to represent a character in a fixed-width encoding

      0, if the external representation of the characters in the
      character set uses a variable size encoding
 */

      virtual      int do_encoding() const throw()
      {
            return sizeof(intern_type);
      }
};
} // noconv namespace

int main(int argc,const char* argv[])
{
      if (argc != 3)
            return std::cerr << "Usage: " << argv[0] << " {UTF-8 XML filename} {UTF-16 16LE XML filename}\n",1;
      try {
            std::locale loc(std::locale::classic(),new noconv::codecvt);

            // Use the standard locale to open a wide character stream with
            // narrow characters in the external representation
            std::wifstream fin(argv[1]);
            if (!fin)
                  return std::cerr << "Error: Unable to open UTF-8 input file " << argv[1] << '\n',1;
            std::wstring firstline;
            if (!getline(fin,firstline))
                  return std::cerr << "Error: Unable to read first line from UTF-8 input file " << argv[1] << '\n',2;

            std::wstring::size_type pos = firstline.find(L"encoding=");
            if (pos == firstline.npos)
                  std::wcerr << L"Warning: No encoding specified in " << firstline << L'\n';
            else if (firstline.size() > pos+9+6) {
                  std::wstring encoding = firstline.substr(pos+9+1,5);
                  transform(encoding.begin(),encoding.end(),encoding.begin(),toupper);
                  if (encoding != L"UTF-8")
                        return std::wcerr << L"Error: Unsuitable encoding " << encoding << L'\n',3;
                  firstline.replace(pos+9+1,5,L"UTF-16");
            }
            else
                  return std::wcerr << L"Error: Invalid encoding in " << firstline << L'\n',4;

            // Create a 16LE Unicode file by using the noconv::codecvt
            std::wofstream fout;
            fout.imbue(loc);
            fout.open(argv[2]
                  ,std::ios::binary      // Need to open in binary mode!
                  );
            if (!fout)
                  return std::cerr << "Error: Unable to create Unicode 16LE output file " << argv[2] << '\n',5;

            wchar_t signature = 0xfeff;
            fout.write(&signature,1);

            fout << firstline << L"\r\n";

#if 0      /* Need to do our own CR+LF handling in binary mode */

            typedef std::istreambuf_iterator<wchar_t> IItr;
            typedef std::ostreambuf_iterator<wchar_t> OItr;
            copy(IItr(fin),(IItr()),OItr(fout));

#else /* Read lines from our non-binary mode input file */

            std::wstring line;
            while (getline(fin,line))
                  fout << line << L"\r\n"; /* Write out our own CR+LF */

#endif
      }
      catch (const std::exception& e) {
            std::cerr << "Exception: " << e.what() << std::endl;
      }
}
--------8<--------
 
rstaveley commented:
Oh yes... beware that there is no attempt to handle non-ASCII characters in that! If it were seriously being used to convert between UTF-8 and UTF-16, as it suggests, it ought to deal with multi-byte UTF-8 characters. It would be safer to use this only to convert between ISO-8859-1 (Latin1) and UTF-16. You need to work with jkr's suggested wcstombs/mbstowcs functions to handle multi-byte characters.

All said and done, if you *really* need to convert between XML file formats in Windows, look no further than Microsoft's XSLT support. Scroll to the bottom of http://msdn.microsoft.com/XML/XMLDownloads/default.aspx and follow the link to the Command Line Transformation Utility (msxsl.exe), which comes with source code.
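
For the multi-byte UTF-8 handling mentioned above, a hand-written decoder is one alternative to the locale-dependent mbstowcs. The following is a sketch only: utf8_to_utf16 is a made-up name, unsigned short is assumed to be 16 bits wide, and overlong forms and encoded surrogates are not rejected.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Decode UTF-8 bytes into UTF-16 code units. Throws std::runtime_error on
// truncated or malformed sequences. Overlong forms and encoded surrogates
// are not rejected in this sketch.
std::vector<unsigned short> utf8_to_utf16(const std::string& in)
{
    std::vector<unsigned short> out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        unsigned long cp;
        std::size_t extra;                          // continuation bytes expected
        if      (lead < 0x80)           { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");
        if (extra > in.size() - i - 1)
            throw std::runtime_error("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;
        if (cp > 0x10FFFF)
            throw std::runtime_error("code point out of range");
        if (cp > 0xFFFF) {                          // needs a surrogate pair
            cp -= 0x10000;
            out.push_back(static_cast<unsigned short>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<unsigned short>(0xDC00 + (cp & 0x3FF)));
        } else {
            out.push_back(static_cast<unsigned short>(cp));
        }
    }
    return out;
}
```

The code units can then be written out low byte first, after a FF FE byte order mark, to produce a UTF-16LE file.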
 
yasimplicity (author) commented:
That is it.

But there are some runtime errors about going out of the bounds of "wcontent" when executing

&wcontent[contentLength]

I'll deal with it myself.

Anyhow, thanks a lot, rstaveley.