Solved

How to read Unicode files in Visual C++ Multibyte Application

Posted on 2010-11-12
24
4,434 Views
Last Modified: 2012-05-10
An MFC application using the Multibyte Character Set cannot read Chinese (PRC) Unicode files created by C# .NET.
It can read legacy files, which have no BOM and are MBCS.  The Unicode file begins with the BOM FF FE.  "Male" (U+7537) is stored as the bytes 37 75 in the Unicode file but loads as 0xE7 'ç' 0x94 '”' 0xB7 '·'.
The Multibyte file stores "Male" as 0xC4 'Ä' 0xD0 'Ð'.
What's the best way to read the Unicode files if the application is MFC Visual C++ using MBCS?
1. Convert the unicode string to MBCS when writing the file in C#?
2. Modify the C++ app to correctly read the Unicode files?
3. Create another C++ app in Unicode to read and convert these files?
I have tried _ismbblead, setlocale, CFile, fopen, and _open in Visual C++, and FileStream in C#.  No matter what I try, I can never get the hex bytes as they are stored inside the file.  I always get the bytes encoded.  If the file format doesn't match the app format, I'm stuck.  This is my current code in the multibyte C++ app:
   CString pathName = fileDlg.GetPathName();
   //char *pLocale = setlocale(LC_CTYPE, "zh-hk"); //has no effect on encoding
   //_setmbcp(_MB_CP_LOCALE); //has no effect on encoding
   FILE *fh = fopen(pathName, "rb");
   const int MAX_COUNT = 100;
   char buffer[MAX_COUNT];
   memset(buffer, 0, MAX_COUNT);
   fgets(buffer, MAX_COUNT, fh); //Male

And this is the code in the C# .NET test app, which reads Unicode but not MBCS:
            using (StreamReader sr = new StreamReader(vpdName))
            {
               int lineIndex = 0;
               while (sr.Peek() >= 0)
               {
                  string str = sr.ReadLine();    

This is tough!  I've worked on it for 3 days and spent many hours searching this forum and others for help on this problem.  My goal is to read the Unicode file and convert the Chinese strings so that they display properly in a multibyte app.  I think this means that I need to convert Unicode U+7537 (stored as bytes 37 75) to MBCS 0xC4 0xD0.  Can this be done?  But first I need to get that Unicode string!  And the multibyte app always reads and encodes the Unicode file so that the strings are garbage: they don't display properly and cannot be converted.
Question by:Forehand
24 Comments
 
LVL 40

Expert Comment

by:evilrix
ID: 34124641
Before we go any further, can we just get some terminology straight, because Microsoft's terminology is pretty confusing.

What format is the file? You say Unicode, but that is not a format; it is a character set. Is it UTF8, 16 or 32? I would guess UTF16, since this is what Microsoft generally calls Unicode.

When you say your MFC app is multibyte, what format is that? UTF8, ANSI (or even UTF16, because, contrary to what Microsoft would have you believe, UTF16 is also a multibyte encoding format)?

Generally the simplest way to handle Unicode files for cross-platform/application exchange is UTF8, because this is easy to handle on all platforms and the basic data type is always char. Of course, if your C# app is creating UTF16 you are probably stuck with that, so the best solution, in my view, would be to read the file as UTF16 and convert internally.

The tools for Unicode character encoding on Windows are pretty poor. I'd suggest you consider using ICU, which is a cross-platform Unicode handling framework from IBM. It's free and open source.

http://site.icu-project.org/

Author Comment

by:Forehand
ID: 34124776
From the BOM FF FE, the encoding of the Unicode file is UTF-16.  But when I read it in C# it reads correctly, and the FileStream.Encoding.EncodingName after "Male" is read is UTF8.  The documentation states that all files will automatically be decoded correctly using the BOM, and this certainly seems to be true, provided the app was built with Unicode as the character set or built with C# .NET, where the default is UTF-16.
How can I tell what the format of the MFC application is?  I look in the properties and see only 2 relevant properties.  One is "Use Multi-Byte Character Set."  The other is in the C++ preprocessor properties and is "MBCS."  So the application is not Unicode.  I would guess that the files are read with ANSI encoding.
I am now working on this solution: create a small console app in Unicode Visual C++ which can read the file, then try to convert it to Chinese using WideCharToMultiByte.  I wish I could make the conversions without going through a separate application.  However, I don't know how to write a multibyte file using Unicode strings in C#.  Also, I get only garbage when I read a Unicode file in a Visual C++ application built with MBCS.

Expert Comment

by:evilrix
ID: 34124834
That BOM suggests it's UTF16 Little Endian.

>> So the application is not Unicode.

That doesn't mean anything other than that it will use wide rather than narrow char types - this is why I really hate the fact Microsoft calls it Unicode. It's not. There are three things that come into play when dealing with text:

1. The data type - wide (wchar_t) or narrow (char). When UNICODE and _UNICODE are defined, the natural char type is wide and wide versions of C and API functions are called. These expect UTF16 encoding. When _MBCS is defined, the natural char type is narrow.

2. The character encoding. This could be ANSI (think code pages), UTF8 or UTF16 (or others, but we'll consider just these for simplicity). UTF16 is the standard for UNICODE and ANSI is the standard for MBCS.

3. The character set. Unicode is a 32 bit character set; UTFx is a way of encoding these 32 bit values into smaller data types, which could be wide or narrow.

So, you see, it matters not one little bit whether your app is built with MBCS or UNICODE when it comes to reading a file. What matters is that you know what the format is and you treat it accordingly. If it's UTF16 you can read that regardless of what type of app you've built. You just need to read it into wide (wchar_t) types and treat it as UTF16. If you want to handle it as UTF8 or ANSI you will need to re-encode it. You can do that using ICU, or there are some API functions provided by Windows.

http://en.wikipedia.org/wiki/Character_encoding
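As an illustration of that re-encoding step, here is a minimal portable sketch (BMP code points only, no surrogate-pair handling; a real application would use WideCharToMultiByte or ICU, and the function name is mine, not from the thread):

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Convert UTF-16LE bytes to UTF-8. Handles BMP code points only
// (no surrogate pairs) -- enough to show the principle.
std::string utf16le_to_utf8(const std::vector<unsigned char>& bytes)
{
    std::string out;
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        uint16_t cp = bytes[i] | (bytes[i + 1] << 8); // little endian
        if (cp < 0x80) {                              // 1-byte sequence
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {                      // 2-byte sequence
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                                      // 3-byte sequence
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

Fed the question's UTF-16LE bytes 37 75 (U+7537), this produces exactly the E7 94 B7 the asker observed -- which shows those "garbage" bytes were simply the UTF-8 form of the same character.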

Author Comment

by:Forehand
ID: 34124908
OK.  I'm going to try opening the Unicode file with a wide character version of fopen.  But I don't know ahead of time whether the file is Unicode format or ANSI format.  I thought I could read the first byte (ha!).  The BOM never gets returned when I open the file and read it.  Do you know if there is a way to determine how the file is encoded?  My MBCS app has to be able to read both Unicode and Multibyte (ANSI) files.

Expert Comment

by:evilrix
ID: 34124942
>>  But I don't know ahead of time whether the file is Unicode format or ANSI format.

That's what the BOM is there for - to help you figure this out. You should open the file and read it as a series of bytes. Process the BOM and then treat that series of bytes either as a series of chars or a series of wchar_t.

But, I say again - look at using ICU as it will take care of all of this for you... and it's really simple to use.

>> My MBCS app has to be able to read both Unicode and Multibyte (ANSI) files.

As I said above, it can. Forget it's a MBCS app... it's not relevant and is just confusing you. Think only about the file. It's a Unicode file, with a BOM. You need to open it and handle it in this way - the fact your app is MBCS doesn't change or even hinder that.
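That "process the BOM" step can be sketched like this (a minimal illustration recognising only the common BOMs; the function name is illustrative):

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Sniff the first bytes of a file for a byte order mark.
// Returns "unknown" when there is no BOM -- the file could then
// be ANSI/MBCS or BOM-less UTF-8, which a BOM check cannot decide.
std::string sniff_bom(const char* path)
{
    std::ifstream in(path, std::ios::binary);
    unsigned char b[3] = {0, 0, 0};
    in.read(reinterpret_cast<char*>(b), 3);
    std::size_t got = static_cast<std::size_t>(in.gcount());
    if (got >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF) return "UTF-8";
    if (got >= 2 && b[0] == 0xFF && b[1] == 0xFE) return "UTF-16LE";
    if (got >= 2 && b[0] == 0xFE && b[1] == 0xFF) return "UTF-16BE";
    return "unknown";
}
```

Opened in binary mode, the stream hands back the FF FE bytes untouched, so the check is a plain byte comparison.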

Author Comment

by:Forehand
ID: 34125018
I have tried every method of opening this file: fopen, CFile, _open, Windows CreateFile.  No matter what I try, I never get the BOM.  I only get garbage encoding.  Both of my test apps were created with Visual Studio defaults.  I would love to be able to read the BOM!!  I thought fopen(filename, "rb") would read the file as a series of bytes.  Nope!  It uses the BOM to encode the file according to your locale and code page (I guess).  But even if I change the locale it doesn't matter.  I NEVER GET THE BOM!
I am going to try now to read the file using your suggestion to use Unicode functions.  I hesitate to use ICU because I work for a corporation that doesn't like programmers to use 3rd-party tools without permission.  I will give you partial credit if your advice succeeds.  I would also like to mark your suggestions as helpful, but I don't know if that will close my question, and I haven't solved the problem yet.

Expert Comment

by:evilrix
ID: 34125019

Expert Comment

by:evilrix
ID: 34125034
>> I have tried every method of opening this file
You should just be opening it as a binary file.

fstream in("myfile.txt", std::ios::binary);

or

fopen("myfile.txt", "rb");

Expert Comment

by:evilrix
ID: 34125040
>> I  will give you partial credit if your advice succeeds.

There is no rush to close this -- I'm not in it for the points, so take your time. We'll work it out together and get to a point (I hope) where you understand what is going on.

Expert Comment

by:evilrix
ID: 34125207
A very quick and dirty example of reading the file. Your BOM will be the first 2 bytes in intext.
#include <iostream>
#include <fstream>
#include <iomanip>

wchar_t const outtext[] = L"hello world";
char const BOM[] = { 0xFF, 0xFE };
size_t const insize = sizeof(outtext) + sizeof(BOM);

int main()
{
   std::ofstream out("c:/temp/myfile.txt", std::ios::binary);
   out.write(BOM, sizeof(BOM));
   out.write((char *)outtext, sizeof(outtext));
   out.close();

   wchar_t intext[insize];

   std::ifstream in("c:/temp/myfile.txt", std::ios::binary);
   in.read((char *)intext, insize);
   in.close();

   std::cout.write((char *)intext, insize); // wide stream
   std::cout << "\n";

   char * ptext = (char *)intext;
   for(size_t i = 0 ; i < insize ; ++i)
   {
      std::cout << std::hex << (0xFF & (int)ptext[i]);
   }
}

Author Comment

by:Forehand
ID: 34125210
Results of fread:
   FILE *fh = fopen(pathName, "rb");
   const int MAX_COUNT = 100;
   char buffer[MAX_COUNT];
   memset(buffer, 0, MAX_COUNT);
   fread(buffer, 1, 1, fh); // <-- buffer[0] contains an asterisk
The first line of the Unicode file begins with
FF FE 2A 00 42 00
FF FE is the BOM.  After that comes the string, "*BEGINDATA*".  As you see, fread, even with "rb" skips the BOM.  Also, I can't override the encoding with the ccs option.  If I try ANSI, the program crashes because you aren't allowed to have ANSI if the BOM is FF FE.

Next I tried fstream.  In this case, I seem to always get 0xcc or 204 no matter what.
   CString pathName = fileDlg.GetPathName();
   fstream in(pathName, std::ios::binary);
   byte by;
   in.read((char *)&by, 1);
   in.read((char *)&by, 1);
   in.read((char *)&by, 1);
   in.close();

Author Comment

by:Forehand
ID: 34125232
Maybe I have to try a different app type.  My test app is built using the default for MFC app in Visual Studio 2005.  I will try your example later this weekend or Monday.  Thanks a lot.

Expert Comment

by:evilrix
ID: 34125268
Your fread is wrong. Look at what you wrote

fread(buffer, 1, 1, fh);

This will read in 1 byte only.

The spec for fread is as follows.
size_t fread ( void * ptr, size_t size, size_t count, FILE * stream );
http://www.cplusplus.com/reference/clibrary/cstdio/fread/

Your code should be either

fread(buffer, sizeof(buffer), 1, fh);

or

fread(buffer, 1, sizeof(buffer), fh);

Try the code I posted above, it works -- I know, I tested it :)


>> Maybe I have to try a different app type.

Please trust me... it is nothing to do with the app type... forget about this as you are just confusing yourself. Nothing, I repeat nothing about the app type is going to prevent you reading the file as a series of bytes that you can then treat as UTF16.
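The corrected call, wrapped as a small helper so it can be reused (hypothetical name and path; a sketch, not the expert's exact code):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Read up to `cap` raw bytes from a file opened in binary mode.
// fread performs no translation in "rb" mode, so a UTF-16 BOM
// (FF FE) arrives verbatim as the first two bytes of `buf`.
// Returns the number of bytes actually read (0 on failure to open).
std::size_t read_raw(const char* path, unsigned char* buf, std::size_t cap)
{
    std::FILE* fh = std::fopen(path, "rb");
    if (!fh) return 0;                           // always check the open succeeded
    std::size_t got = std::fread(buf, 1, cap, fh);
    std::fclose(fh);
    return got;
}
```

Note the order of checks: first confirm the file opened, then look at how many bytes fread returned, and only then interpret the buffer contents.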

Expert Comment

by:evilrix
ID: 34125287
>> In this case, I seem to always get 0xcc

This is the default value assigned by the debugger to an uninitialised char -- your file stream is NOT being opened successfully. In other words, the reason it's failing is because you are not reading anything.

Author Comment

by:Forehand
ID: 34128196
I have copied your code into my project.  I'm still getting 0xcc for intext.  You have been incredibly patient.  About reading 1 byte - I thought, why should I read more?  If the first byte is 0xFF, then I have a Unicode file.  It seems as though the failure to read is based on something other than the size of the variable.  I feel as though your help has brought me so close to a solution!  And yet somehow the read is not working!  insize is 26, which is not quite correct.  The size should be 24, because outtext is 11 chars * 2 = 22.  0xFF, 0xFE is probably counted as 4 bytes but should be counted as 2 bytes.

   CString pathName = fileDlg.GetPathName();
   fstream in(pathName, std::ios::binary);
   wchar_t const outtext[] = L"*BEGINDATA*";
   char const BOM[] = { 0xFF, 0xFE };
   size_t const insize = sizeof(outtext) + sizeof(BOM);
   wchar_t intext[insize];
   char * ptext = (char *)intext;
   in.read((char *)intext, insize);
   in.close();

Expert Comment

by:evilrix
ID: 34128625
>> I have copied your code into my project.  I'm still getting 0xcc for intext

Verbatim? I tested it with VS2008 and it does exactly what I would expect. You should see it output "hello world" followed by another line with the hex that represents the wide chars.

>>  If the first byte is 0xFF, then I have a Unicode file.

Maybe... but not necessarily.

>> And yet somehow the read is not working!

It would seem so... try putting in some additional code to check the file is open and also that the stream has not gone into an error state.

>>  insize is 26 which is not quite correct.

Sure it is... don't forget there is a null at the end of L"hello world"
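The arithmetic behind that 26 can be checked directly. char16_t is used here as a portable stand-in for the 2-byte wchar_t on Windows (on other platforms wchar_t may be 4 bytes, which would change the numbers):

```cpp
#include <cassert>

// A wide string literal includes its terminating null, so
// u"hello world" is 12 code units, not 11.
char16_t const outtext[] = u"hello world"; // 11 chars + 1 null = 12 units
char const BOM[] = { '\xFF', '\xFE' };     // 2 bytes

static_assert(sizeof(outtext) == 24, "12 units * 2 bytes each");
static_assert(sizeof(BOM) == 2, "two BOM bytes");
// Hence insize = 24 + 2 = 26, exactly as observed.
```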

Author Comment

by:Forehand
ID: 34131358
OK, I had used fstream instead of ifstream (so the stream was never successfully opened for reading).  When I corrected this call, I got the correct text but still no BOM.  ptext (defined as char*) contains
0x0012f1d4 "*BEGINDATA*"All RespondeÌÌÌÌÌÌÌÌÌÌÌÌÌÌÌÌÌÌÌÌÌÌÌÌÌÌÌÌÌÌ"
intext defined as wchar contains garbage Chinese characters.  The first character is
0x422a L'¿'
I also tried this in a Unicode app.  You are right there is no difference.
I looked at your file "mytext.txt" in a hex editor.  The first 3 bytes were FF FE 68 00 65 00
Your file reads great.  The first byte of infile is 0xFFFE as expected.
My file in the hex editor is FF FE 42 00 62 00
My file reads like garbage in my apps.  I am very puzzled.

Expert Comment

by:evilrix
ID: 34131754
>> When I corrected this call, I got the correct text but still no BOM.

How are you reaching this conclusion? When I step through the debugger the two bytes are clearly there. You are just reading a binary file. As long as you have opened it as a binary file C++ makes absolutely no translations on any of the content read. I can only assume you are not correctly opening the file -- do you check this?

>> I also tried this in a Unicode app.  You are right there is no difference.

Ta daah :)

>> My file reads like garbage in my apps.  I am very puzzled.

Ok, please attach your file and your code (full, so I can compile it and test it).

Author Comment

by:Forehand
ID: 34138047
The hex dump of my file was inaccurate!  I was relying on a text processor, UltraEdit, to display files in hex.  The file that I was reading, chinacd.vpd, DOES NOT HAVE A BOM.  It displays properly if I open it in binary mode in Visual Studio 6.0.  I guess UltraEdit somehow detected that this file needed translation into Chinese, translated it, and then displayed the results of the translation in hex instead of the actual bytes of the file. chinacd.txt  I wish I could do that translation.
I still have a problem, but it is not the problem I thought it was.  I thought that my source code was reading the file improperly.  Instead, the source code was reading correctly; it was the hex dump from UltraEdit that was incorrect.
I attach the file and source code as you requested, but the source code works.  It reads my file as a Unicode file and reverses each two-byte character, which results in garbage.  This file is not a Unicode file, so I shouldn't read it as Unicode, I guess.  But now I'm puzzled.  How can this file be distinguished from a multibyte character set file?  I need to be able to read both kinds of files and display them properly.  C# .NET can read the first one OK.  The C++ multibyte app using CFile and CArchive can read the second one OK.  How can I tell the difference between these files when they both start the same way? chinvp.txt
ChinaCD.txt can be translated correctly into Chinese characters by Excel, Outlook, UltraEdit, and Notepad.
ChinVP.txt also seems to display "properly" in Notepad, etc.  It looks funny with an English (United States) locale, but it displays Chinese characters fine with a Hong Kong S.A.R. locale.  Do you think I can read both files in the same application?  Will the classes that you recommended earlier do this job?
// UnicodeToMultibyte.cpp : Defines the entry point for the console application.
//

#include "stdafx.h"
#include <iostream>
#include <fstream>
#include <iomanip>

wchar_t const outtext[] = L"*BEGINDATA*";
char const BOM[] = { 0xFF, 0xFE };
char const MALE[] = { 0x37, 0x75 };
size_t const insize = sizeof(outtext) + sizeof(BOM) + sizeof(MALE);

int _tmain(int argc, _TCHAR* argv[])
{
   std::ofstream out("c:/temp/myfile2.txt", std::ios::binary);
   out.write(BOM, sizeof(BOM));
   out.write((char *)outtext, sizeof(outtext));
   out.write(MALE, sizeof(MALE));
   out.close();

   wchar_t intext[300]; //big enough to hold all of chinacd.vpd

   std::ifstream in("c:/temp/myfile2.txt", std::ios::binary);
   in.read((char *)intext, insize);
   in.close();
   char * ptext = (char *)intext;

  /* std::cout.write((char *)intext, insize); // wide stream
   std::cout << "\n";
   char * ptext = (char *)intext;
   for(size_t i = 0 ; i < insize ; ++i)
   {
      std::cout << std::hex << (0xFF & (int)ptext[i]);
   }
*/

   std::ifstream chinaIn("c:/imswin/data/chinacd.vpd", std::ios::binary);
   chinaIn.read((char *)intext, 296); //295 is sizeof chinacd.vpd
   chinaIn.close();

   return 0;
}

Assisted Solution

by:evilrix
evilrix earned 500 total points
ID: 34138134
>> How can this file be distinguished from a multibyte character set file.

I refer you back to this: http:#34125019

You have to figure it out by analysing the content. It is for this reason I strongly suggest you consider ICU. Trying to write a Unicode decoder is not a trivial task -- this is why ICU is used by so many big-name companies.

http://site.icu-project.org/#TOC-Who-Uses-ICU-

Consider the various possible encodings you need to try and detect. Now consider that a BOM is completely optional... it might not even exist (as you've discovered). The only way to know how it's encoded is to parse it and figure it out.

I appreciate what you said about needing to get this cleared but the effort in trying to code a proper Unicode parser is going to be significant if you need to handle any possible combination. It's not so painful if you can assume it'll always be a specific format but from what you've said I don't think that is the case for you.
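To make concrete why content analysis is only ever a guess, here is a crude heuristic sketch (purely illustrative; real detectors such as ICU's charset detection are statistical and far more robust): mostly-Latin UTF-16 text has a NUL in every other byte, while ANSI/MBCS and UTF-8 text normally contains no NULs at all.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Guess the encoding of a byte buffer: BOM first, then a NUL-count
// heuristic for BOM-less data. A guess, not a proof -- Chinese UTF-16
// text (like chinvp.txt above) contains few NULs and would fool it.
std::string guess_encoding(const std::vector<unsigned char>& b)
{
    if (b.size() >= 2 && b[0] == 0xFF && b[1] == 0xFE) return "UTF-16LE";
    if (b.size() >= 2 && b[0] == 0xFE && b[1] == 0xFF) return "UTF-16BE";
    std::size_t nuls = 0;
    for (unsigned char c : b)
        if (c == 0) ++nuls;
    if (!b.empty() && nuls * 3 > b.size())    // more than a third NULs
        return "UTF-16?";
    return "8-bit (ANSI/MBCS or UTF-8)?";
}
```

The trailing question marks are deliberate: without a BOM, the best any detector can return is a confidence-weighted guess.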

Accepted Solution

by:
Forehand earned 0 total points
ID: 34150039
I am able to read Unicode files in C# .NET using StreamReader.  If I specify the encoding, I can read files created with Chinese code page 936 (Chinese Simplified, GB2312).
         Encoding mbcs = Encoding.GetEncoding(936); //code page 936 = GB2312 (Big5 would be 950)
         string mbName = @"c:\imswin\data\china_mb.txt"; //multibyte character set 0xC4 0xD0
         StreamReader srMbcs = new StreamReader(mbName, mbcs);
         string str = srMbcs.ReadLine();
         srMbcs.Close();
If I don't specify the encoding, the files are read correctly when they were created in a Unicode format.  In this case, the files had a UTF-8 format, so that U+7537 was written as 0xE7 94 B7.
         string ucName = @"c:\imswin\data\china_uc.txt"; //unicode
         StreamReader sr = new StreamReader(ucName);
         string unicodeString = sr.ReadLine(); //U+7537
         sr.Close();
Currently it is almost impossible to detect the encoding of a file without a BOM.  Whenever I read the multibyte file, it is always read with UTF-8 encoding, even though this is the wrong encoding for this file.  IsTextUnicode always returns true.  Here are some other references that indicate how difficult this is.

Rick Strahl's web log: http://www.codeproject.com/KB/recipes/DetectEncoding.aspx
StreamReader() specifically has an overload that's supposed to help with detection of byte order marks and, based on that, is supposed to sniff the document's encoding. It actually works, but only if the content is encoded as UTF-8/16/32 - i.e. when it actually has a byte order mark. It doesn't revert back to Encoding.Default if it can't find a byte order mark - the default without a byte order mark is UTF-8, which usually will result in invalid text parsing.

A more complex way to detect how a file is encoded:
http://www.codeproject.com/KB/recipes/DetectEncoding.aspx

My solution is to translate the Unicode strings to multibyte character set strings before creating the file that will be read by the legacy multibyte app.  That way, the file will always be in the correct format for multibyte reading.
        Encoding unicode = new UnicodeEncoding(true, false); //big-endian, no BOM
        // Convert the string into a byte[].
        byte[] unicodeBytes = unicode.GetBytes(unicodeString);
        // Perform the conversion from one encoding to the other.
        Encoding mbcs = Encoding.GetEncoding(936); //code page 936 = GB2312
        byte[] mbcsBytes = Encoding.Convert(unicode, mbcs, unicodeBytes);

Expert Comment

by:evilrix
ID: 34150188
Just to point out in case you had not realised, this started out as a C++ question (although I realise the files are created in C#) and I know almost nothing about C# so I'm afraid I cannot comment or provide you with any sensible suggestions in that area.

Also, I have no objection if you want to keep this question open whilst you still try and figure out what you are doing in the C++ side of things but as from now I will be offline for the next 5 days (it's my birthday and I'm going to party it up for a few days on a mini-holiday). I will post an alert to some other C++ experts to hopefully keep an eye on things here for you.

-Rx.

Expert Comment

by:evilrix
ID: 34150344
Forehand,

Thank you for your kind words in your closing comment. The appreciation means more than any points (although those are nice too) :)

not-so-evilrix.

Author Closing Comment

by:Forehand
ID: 34182485
Evilrix truly deserves his genius rating and in addition should get an Angel rating, because he helped me so much.  (But maybe that would cancel out Evil?)  I spoke to a representative for advice on how to grade.  I wanted to give fewer points and an A grade, because there is no complete solution in the sense that there is no easy way to get the encoding from a file.  Everything that Evilrix said was true, but it didn't work for me because UltraEdit's hex representation of my file didn't match the way it was actually stored.  I decided it was easier to control the way the files were written, which I CAN control, rather than to spend the effort modifying an older app to be able to read and detect everything.