Solved

Borland C++ Builder 6 and reading Unicode (Chinese characters) from a text file. Reads a blank. Something I'm doing wrong?

Posted on 2003-11-13
Last Modified: 2012-08-13
Hi, I'm using Borland C++ Builder 6. I am trying to read a text file which contains Unicode characters (Chinese characters specifically). I think the code I've come up with should work, except that when I run it, the "getline" in the loop only runs once and the message box displays a blank. I'm new to Unicode and wide chars. Is there something I am doing wrong?

My code is below:

        wifstream file("test.txt");

        if (!file)
        {
                Application->MessageBox("Sorry, Can't open file","Error!",MB_OK);
                return;
        }

        wstring buff;

        // loop while not at the end of file
        while (!file.eof())
        {

                // read a line from the file
                getline(file,buff);

                //Display the read line in a messagebox
                MessageBoxW(NULL, buff.data(), L"Output",MB_OK);
        }
        // close the file
        file.close();
Question by:theblip
13 Comments
 
LVL 17

Expert Comment

by:rstaveley
ID: 9739197
> buff.data()

This does not give you a '\0' terminated string.

Use buff.c_str().
 

Author Comment

by:theblip
ID: 9739586
buff.c_str() still gives me a blank message box :(
 
LVL 17

Expert Comment

by:rstaveley
ID: 9740259
Can you try putting the following definitions at the top of your .cpp file before your includes...

#define _UNICODE
#define UNICODE

...and then use the MessageBox function rather than the MessageBoxW function?
 
LVL 17

Expert Comment

by:rstaveley
ID: 9741435
Sorry about this, my help has been misguided. I've just tried some tests myself with VC 7.1, but am finding that wide characters seem to be written/read to/from files by <fstream> no differently from standard chars.

Here is my test program:
--------8<--------
/*
  Compile and link from command line as follows:

      cl /EHsc test.cpp user32.lib
 */
#include <windows.h>
#include <string>
#include <fstream>

int main()
{
      std::wstring str0(L"This is a line of text");
      MessageBoxW(0,str0.c_str(),L"About to write wstring...",MB_OK);

      std::wofstream file0("test2.txt");
      if (!file0) {
            MessageBox(0,"Sorry, Can't write file","Error!",MB_OK);
            return 1;
      }

      // Write the wstring to the file
      file0 << str0 << std::endl;

      // close the file
      file0.close();

      std::wstring str(L"Hello, Wide World");
      MessageBoxW(0,str.c_str(),L"About to read wstring from the file...",MB_OK);

      std::wifstream file("test2.txt");
      if (!file) {
            MessageBox(0,"Sorry, Can't open file","Error!",MB_OK);
            return 1;
      }

      // Loop while not at the end of file
      while (getline(file,str)) {
            //Display the read line in a messagebox
            MessageBoxW(NULL,str.c_str(),L"wstring read is...",MB_OK);
      }

      // close the file
      file.close();

      std::string str2("Hello, Normal World");
      MessageBox(0,str2.c_str(),"About to read string from the file...",MB_OK);

      std::ifstream file2("test2.txt");
      if (!file2) {
            MessageBox(0,"Sorry, Can't open file","Error!",MB_OK);
            return 1;
      }

      // Loop while not at the end of file
      while (getline(file2,str2)) {
            //Display the read line in a messagebox
            MessageBox(NULL,str2.c_str(),"string read is...",MB_OK);
      }

      // close the file
      file2.close();
}
--------8<--------
What I find is that test2.txt is created by wofstream as a 24 byte file... i.e. not as UNICODE at all. Its contents are read by wifstream/ifstream with no regard for the character width. That doesn't bode well for real wide characters.

I'm almost certainly missing something with regard to my understanding of UNICODE. Sorry for the bum steer!
 
LVL 17

Expert Comment

by:rstaveley
ID: 9741865
I've posted a question on this myself at http:/Q_20797258.html in case this has gone off the boil with other experts. If I get a good answer I'll direct them here to scoop up your points too :-)
 

Author Comment

by:theblip
ID: 9744039
Thanks for trying to help... I guess we both have to wait a little longer :)
 
LVL 17

Expert Comment

by:rstaveley
ID: 9744296
I've got a feeling that it is a case of imbuing the file with a locale which gives it UTF-16 character traits... or something like that.

I'm not getting any good leads from Google, though I'd expect it to be a common line of enquiry.

How can you create a UNICODE text file using STL? It seems like a modest request... 8-)
 
LVL 17

Accepted Solution

by:
rstaveley earned 250 total points
ID: 9746979
I read that the facet codecvt, used by basic_filebuf, is responsible for converting between internal and external character encoding. Presumably, external encoding (i.e. the stuff that gets written to disk) is 8-bit in the US/European locales, because it is more compact.

There is some useful commentary at http:/Q_20536226.html with regard to dealing with UTF-16 vs UTF-8 UNICODE. Do you know what encoding your test.txt file has?

If your default locale isn't Chinese on your system, my guess is that you need to imbue the Chinese locale onto wifstream to get it to assume the appropriate (wide?) external encoding. Otherwise, assuming you have a UTF-16 file, the first character wifstream picks up from the 8-bit external encoding is a '\0'. That is treated as a regular '\0' terminator when c_str() hands the string to MessageBoxW as an LPCWSTR, so nothing is displayed.

Let me know how you get on, theblip.  Hopefully we'll get some input from an expert with i18n experience!
 
LVL 17

Expert Comment

by:rstaveley
ID: 9749927
For what it is worth, here http:/Q_20797258.html#9749822 is how to write a wchar_t file (courtesy of Dan Rollins, who should get all the credit, if he puts a post here).

The reading counterpart is as follows:
--------8<--------
#include <iostream>
#include <fstream>
#include <string>
#include <locale>

class Simple_codecvt : public std::codecvt<wchar_t,char,mbstate_t> {
public:
        typedef wchar_t _E;
        typedef char _To;
        typedef mbstate_t _St;
        explicit Simple_codecvt(size_t _R = 0) : std::codecvt<wchar_t,char,mbstate_t>(_R) {}
protected:
        virtual result do_in(_St& _State,const _To *_F1,const _To *_L1,const _To *& _Mid1,_E *_F2,_E *_L2,_E *& _Mid2) const {return noconv;}
        virtual result do_out(_St& _State,const _E *_F1,const _E *_L1,const _E *& _Mid1,_To *_F2, _To *_L2,_To *& _Mid2) const {return noconv;}
        virtual result do_unshift(_St& _State,_To *_F2, _To *_L2,_To *& _Mid2) const {return noconv;}
        virtual int do_length(_St& _State, const _To *_F1,const _To *_L1, size_t _N2) const throw() {return (_N2 < (size_t)(_L1-_F1)?_N2 :_L1 - _F1);}
        virtual bool do_always_noconv() const throw() {return true;}
        virtual int do_max_length() const throw() {return 2;}
        virtual int do_encoding() const throw() {return 2;}
};

int main()
{
        try {
                std::locale loc = std::_ADDFAC(std::locale::classic(),new Simple_codecvt);
                std::wifstream file;
                file.imbue(loc);
                file.open("three_wchars.txt",std::ios::binary);
                if (!file) {
                        std::cerr << "Error: Unable to open file\n";
                        return 1;
                }

                std::wstring wstr;
                file >> wstr;
                file.close();

                std::wcout << L"We read: " << wstr << std::endl;
        }
        catch (const std::exception& e) {
                std::cerr << "Exception: " << e.what() << std::endl;
        }
        return 0;
}
--------8<--------

This technique may work on your test.txt file, if it is UTF-16 unicode and doesn't have a leading wchar_t(0) like you see in UTF-16 XML files. It is certainly worth trying....

                std::locale loc = std::_ADDFAC(std::locale::classic(),new Simple_codecvt);
                std::wifstream file;
                file.imbue(loc);
                file.open("test.txt",std::ios::binary);

Note that by wchar_t, I mean as implemented on Windows as 16 bits. You'll find that it is implemented as 32 bits on Linux.
 

Author Comment

by:theblip
ID: 9750177
After more research, it has dawned on me that just using widechars does not equal a UNICODE document. :P
A widechar merely means it can be used to represent characters which require more than 8-bits.

My basic understanding is as follows:

A UTF-16 document (which uses 2 bytes for most characters, and 4 bytes for characters outside the Basic Multilingual Plane) should start with a Byte Order Mark, which indicates the order ("endianness") in which the bytes are stored: the bytes 0xFF 0xFE for little-endian, or 0xFE 0xFF for big-endian.

A UTF-8 document (which can use from 1 to 4 bytes for each character) may start with a different marker (the bytes 0xEF 0xBB 0xBF) to indicate that it is UTF-8. Characters in the ASCII range are stored using 1 byte. Characters outside the ASCII range use more bytes (up to 4) as necessary. The encoding follows the rules detailed in the link provided above by rstaveley (http:/Q_20536226.html).
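
As a rough illustration of the markers described here, the following is a minimal sketch that looks at the first bytes of a file's raw contents and reports which marker, if any, is present. The names Encoding and detect_bom are made up for this example, it assumes the bytes were read in binary mode, and it deliberately ignores UTF-32 markers.

```cpp
#include <cstddef>
#include <string>

// Encodings this sketch can tell apart by their leading marker (hypothetical names).
enum Encoding { ENC_UNKNOWN, ENC_UTF8, ENC_UTF16_LE, ENC_UTF16_BE };

// Inspect the first bytes of a file's raw contents.
// bom_length receives the number of marker bytes to skip before the text starts.
Encoding detect_bom(const std::string& bytes, std::size_t& bom_length)
{
    const std::size_t n = bytes.size();
    const unsigned char b0 = n > 0 ? static_cast<unsigned char>(bytes[0]) : 0;
    const unsigned char b1 = n > 1 ? static_cast<unsigned char>(bytes[1]) : 0;
    const unsigned char b2 = n > 2 ? static_cast<unsigned char>(bytes[2]) : 0;

    if (n >= 3 && b0 == 0xEF && b1 == 0xBB && b2 == 0xBF) {
        bom_length = 3;              // UTF-8 marker
        return ENC_UTF8;
    }
    if (n >= 2 && b0 == 0xFF && b1 == 0xFE) {
        bom_length = 2;              // UTF-16, least significant byte first
        return ENC_UTF16_LE;
    }
    if (n >= 2 && b0 == 0xFE && b1 == 0xFF) {
        bom_length = 2;              // UTF-16, most significant byte first
        return ENC_UTF16_BE;
    }

    bom_length = 0;                  // no marker found; the encoding must be guessed or known
    return ENC_UNKNOWN;
}
```

A file without a marker comes back as ENC_UNKNOWN, which does not mean it is not Unicode; plenty of UTF-8 files are written without one.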

The file I'm trying to read is a UTF-8 file. The best solution that I can see at the moment is to read in the file byte by byte and to convert them into widechars based on the UTF-8 rules. I do not think there are any standard functions to handle this, so custom code would be necessary.
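
A minimal sketch of that byte-by-byte conversion might look like the following. The function name utf8_to_wstring is hypothetical, and the code assumes reasonably well-formed input: it rejects bad lead bytes, bad continuation bytes and truncated sequences, but it does not check for overlong encodings or other finer points of the UTF-8 rules. Where wchar_t is 16 bits (as on Windows), characters above U+FFFF are stored as a surrogate pair.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Decode a string of UTF-8 bytes into wide characters (hypothetical helper).
std::wstring utf8_to_wstring(const std::string& utf8)
{
    std::wstring out;
    std::size_t i = 0;

    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        unsigned long cp;       // the decoded code point
        std::size_t extra;      // how many continuation bytes follow the lead byte

        if (lead < 0x80)                { cp = lead;        extra = 0; }  // 0xxxxxxx
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }  // 110xxxxx
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }  // 1110xxxx
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }  // 11110xxx
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (extra > utf8.size() - i - 1)
            throw std::runtime_error("truncated UTF-8 sequence");

        // Each continuation byte has the form 10xxxxxx and contributes 6 bits.
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(utf8[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;

        if (sizeof(wchar_t) == 2 && cp > 0xFFFF) {
            // 16-bit wchar_t: split into a UTF-16 surrogate pair.
            cp -= 0x10000;
            out += static_cast<wchar_t>(0xD800 + (cp >> 10));
            out += static_cast<wchar_t>(0xDC00 + (cp & 0x3FF));
        } else {
            out += static_cast<wchar_t>(cp);
        }
    }
    return out;
}
```

The result of calling c_str() on the returned string can be passed to MessageBoxW. On Windows, the Win32 function MultiByteToWideChar with the CP_UTF8 code page is another route that avoids hand-written decoding.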

Since I am using BCB6, I'm going to take a look at this library of delphi unicode components I have found which are usable in BCB6. It is available here as source: http://home.ccci.org/wolbrink/tnt/delphi_unicode_controls.htm
You can see an example of how loading a unicode file is done. Check out the "TntClasses.pas" file and search for the TTntStrings.LoadFromStream() method. The code is in Pascal but it should give an idea of how it is done. I think I will be trying this out and will let you know how it turns out.
 
LVL 17

Expert Comment

by:rstaveley
ID: 9750380
It is surprising to me that IOStreams doesn't have an easy mechanism for doing the conversion. I would have thought that the boost libraries would have a set of codecvt templates for doing basic UTF-8/UTF-16 <> wchar_t file I/O, but I can't seem to see anything. Let us know how you get on, theblip.
 

Author Comment

by:theblip
ID: 9752719
I have found a page that is quite useful. http://www.i18nguy.com/unicode/c-unicode.html#streams

Regarding codecvt templates, this is what I found from the link above:

----- Quote -----

2. Stream I/O will convert Unicode data from/to native (ANSI) code page on read/write, not UTF-8 or UTF-16. However the stream class can be modified to read/write UTF-8. You can implement a facet to convert between Unicode and UTF-8.

codecvt <wchar_t, char_traits <wchar_t> >

---- UnQuote ----

Regarding my original question with the "blanks", the best answer I've found is at this link: http://www.i18nfaq.com/vcpp.html
Please see "3. I am reading a Unicode file Using fgetws(). But I get null embedded strings. why ?"
The answer given is to open the file in binary mode.
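
To illustrate the binary-mode advice, here is a minimal sketch that opens a UTF-16 little-endian file in binary mode and assembles each pair of bytes into one wide character, skipping a leading Byte Order Mark if there is one. The function name read_utf16le_file is made up for this example. It assumes the file really is little-endian UTF-16, returns an empty string if the file cannot be opened, and does not combine surrogate pairs on platforms where wchar_t is 32 bits.

```cpp
#include <fstream>
#include <string>

// Read a UTF-16 little-endian file into a std::wstring (hypothetical helper).
std::wstring read_utf16le_file(const char* path)
{
    // Binary mode stops the runtime from translating or dropping bytes such as
    // 0x0D, 0x0A or 0x1A, which can legitimately appear as half of a 16-bit unit.
    std::ifstream in(path, std::ios::in | std::ios::binary);

    std::wstring text;
    char pair[2];
    bool first = true;

    while (in.read(pair, 2)) {
        // Low byte comes first in little-endian order.
        const unsigned int unit =
            static_cast<unsigned char>(pair[0]) |
            (static_cast<unsigned char>(pair[1]) << 8);

        if (first && unit == 0xFEFF) {   // leading Byte Order Mark: skip it
            first = false;
            continue;
        }
        first = false;
        text += static_cast<wchar_t>(unit);
    }
    return text;   // a trailing odd byte, if any, is ignored
}
```

Line endings are left untouched in binary mode, so a Windows text file will still contain L'\r' before each L'\n' and the caller has to strip it.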

I feel that I have enough information regarding reading unicode data to be able to successfully try something.
Thanks a lot, rstaveley! Your comments and links were able to point me in the right direction.
 
LVL 17

Expert Comment

by:rstaveley
ID: 9753210
This has all been very interesting for me too :-)