Solved

How to read Unicode files in Visual C++ Multibyte Application

Posted on 2010-11-12
24
4,434 Views
Last Modified: 2012-05-10
An MFC application using the Multibyte Character Set cannot read Chinese (PRC) Unicode files created by C# .NET.
It can read legacy files, which have no BOM and are MBCS.  The Unicode file begins with the BOM FF FE.  "Male" (U+7537) is stored as the bytes 37 75 in the Unicode file but loads as 0xE7 'ç' 0x94 '”' 0xB7 '·'.
The Multibyte file stores "Male" as 0xC4 'Ä' 0xD0 'Ð'.
What's the best way to read the Unicode files if the application is MFC Visual C++ using MBCS?
1. Convert the unicode string to MBCS when writing the file in C#?
2. Modify the C++ app to correctly read the Unicode files?
3. Create another C++ app in Unicode to read and convert these files?
I have tried _ismbblead, setlocale, CFile, fopen, and _open in Visual C++, and FileStream in C#.  No matter what I try, I can never get the hex bytes as they are stored inside the file.  I always get the bytes encoded.  If the file format doesn't match the app format, I'm stuck.  This is my current code in the multibyte C++ app:
   CString pathName = fileDlg.GetPathName();
   //char *pLocale = setlocale(LC_CTYPE, "zh-hk"); //has no effect on encoding
   //_setmbcp(_MB_CP_LOCALE); //has no effect on encoding
   FILE *fh = fopen(pathName, "rb");
   const int MAX_COUNT = 100;
   char buffer[MAX_COUNT];
   memset(buffer, 0, MAX_COUNT);
   fgets(buffer, MAX_COUNT, fh); //Male

And this is the code in the C# .NET test app, which reads Unicode but not MBCS:
            using (StreamReader sr = new StreamReader(vpdName))
            {
               int lineIndex = 0;
               while (sr.Peek() >= 0)
               {
                  string str = sr.ReadLine();    

This is tough!  I've worked on it for 3 days and spent many hours searching this forum and others for help on this problem.  My goal is to read the Unicode file and convert the Chinese strings so that they display properly in a multibyte app.  I think this means that I need to convert Unicode U+7537 (stored as bytes 37 75) to MBCS 0xC4 0xD0.  Can this be done?  But first I need to get that Unicode string!  And the multibyte app always reads and encodes the Unicode file so that the strings are garbage: they don't display properly and cannot be converted.
Question by:Forehand
24 Comments
 
LVL 40

Expert Comment

by:evilrix
ID: 34124641
Before we go any further, can we just get some terminology straight, because Microsoft's terminology is pretty confusing.

What format is the file? You say Unicode, but that is not a format; it is a character set. Is it UTF8, 16 or 32? I would guess UTF16, since this is what Microsoft generally calls Unicode.

When you say your MFC app is multibyte, what format is that? UTF8, ANSI (or even UTF16, because, contrary to what Microsoft would have you believe, UTF16 is also a multibyte encoding format)?

Generally the simplest way to handle Unicode files for cross-platform/application exchange is UTF8, because this is easy to handle on all platforms and the basic data type is always char. Of course, if your C# app is creating UTF16 you are probably stuck with that, so the best solution, in my view, would be to read the file as UTF16 and convert internally.

The tools for Unicode character encoding on Windows are pretty poor. I'd suggest you consider using ICU, which is a cross-platform Unicode handling framework from IBM. It's free and open source.

http://site.icu-project.org/

Author Comment

by:Forehand
ID: 34124776
From the BOM FF FE, the encoding of the Unicode file is UTF-16.  But when I read it in C# it reads correctly, and the FileStream.Encoding.EncodingName after "Male" is read is UTF8.  The documentation states that all files will automatically be decoded correctly using the BOM, and this certainly seems to be true, provided the app was built with Unicode as the character set or built with C# .NET, where the default is UTF-16.
How can I tell what the format of the MFC application is?  I look in the properties and see only 2 relevant properties.  One is "Use Multi-Byte Character Set."  The other is in the C++ preprocessor properties and is "MBCS."  So the application is not Unicode.  I would guess that the files are read with ANSI encoding.
I am now working on this solution: create a small console app in Unicode Visual C++ which can read the file, then try to convert it to Chinese using WideCharToMultiByte.  I wish I could make the conversions without going through a separate application.  However, I don't know how to write a multibyte file using Unicode strings in C#.  Also, I get only garbage when I read a Unicode file in a Visual C++ application built with MBCS.

Expert Comment

by:evilrix
ID: 34124834
That BOM suggests it's UTF16 Little Endian.

>> So the application is not Unicode.

That doesn't mean anything other than that it will use wide rather than narrow char types - this is why I really hate the fact Microsoft calls it Unicode. It's not. There are three things that come into play when dealing with text:

1. The data type - wide (wchar_t) or narrow (char). When UNICODE and _UNICODE are defined, the natural char type is wide and wide versions of C and API functions are called. These expect UTF16 encoding. When _MBCS is defined, the natural char type is narrow.

2. The character encoding. This could be ANSI (think code pages), UTF8 or UTF16 (or others, but we'll consider just these for simplicity). UTF16 is the standard for UNICODE and ANSI is the standard for MBCS.

3. The character set. Unicode is a 32 bit character set; UTFx is a way of encoding these 32 bit values into smaller data types, which could be wide or narrow.

So, you see, it matters not one little bit whether your app is built with MBCS or UNICODE when it comes to reading a file. What matters is that you know what the format is and you treat it accordingly. If it's UTF16 you can read that regardless of what type of app you've built. You just need to read it into wide (wchar_t) types and treat it as UTF16. If you want to handle it as UTF8 or ANSI you will need to re-encode it. You can do that using ICU, or there are some API functions provided by Windows.

http://en.wikipedia.org/wiki/Character_encoding
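As an illustration of that re-encoding step, here is a minimal portable sketch (BMP code points only, no surrogate-pair handling; a real application would use WideCharToMultiByte or ICU, and the function name is mine, not from the thread):

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Convert UTF-16LE bytes to UTF-8. Handles BMP code points only
// (no surrogate pairs) -- enough to show the principle.
std::string utf16le_to_utf8(const std::vector<unsigned char>& bytes)
{
    std::string out;
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        uint16_t cp = bytes[i] | (bytes[i + 1] << 8); // little endian
        if (cp < 0x80) {                              // 1-byte sequence
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {                      // 2-byte sequence
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                                      // 3-byte sequence
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

Fed the question's UTF-16LE bytes 37 75 (U+7537), this produces exactly the E7 94 B7 the asker observed -- which shows those "garbage" bytes were simply the UTF-8 form of the same character.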

Author Comment

by:Forehand
ID: 34124908
OK.  I'm going to try opening the Unicode file with a wide character version of fopen.  But I don't know ahead of time whether the file is Unicode format or ANSI format.  I thought I could read the first byte (ha!).  The BOM never gets returned when I open the file and read it.  Do you know if there is a way to determine how the file is encoded?  My MBCS app has to be able to read both Unicode and Multibyte (ANSI) files.

Expert Comment

by:evilrix
ID: 34124942
>>  But I don't know ahead of time whether the file is Unicode format or ANSI format.

That's what the BOM is there for - to help you figure this out. You should open the file and read it as a series of bytes. Process the BOM and then treat that series of bytes either as a series of chars or a series of wchar_t.

But, I say again - look at using ICU as it will take care of all of this for you... and it's really simple to use.

>> My MBCS app has to be able to read both Unicode and Multibyte (ANSI) files.

As I said above, it can. Forget it's a MBCS app... it's not relevant and is just confusing you. Think only about the file. It's a Unicode file, with a BOM. You need to open it and handle it in this way - the fact your app is MBCS doesn't change or even hinder that.
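That "process the BOM" step can be sketched like this (a minimal illustration recognising only the common BOMs; the function name is illustrative):

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Sniff the first bytes of a file for a byte order mark.
// Returns "unknown" when there is no BOM -- the file could then
// be ANSI/MBCS or BOM-less UTF-8, which a BOM check cannot decide.
std::string sniff_bom(const char* path)
{
    std::ifstream in(path, std::ios::binary);
    unsigned char b[3] = {0, 0, 0};
    in.read(reinterpret_cast<char*>(b), 3);
    std::size_t got = static_cast<std::size_t>(in.gcount());
    if (got >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF) return "UTF-8";
    if (got >= 2 && b[0] == 0xFF && b[1] == 0xFE) return "UTF-16LE";
    if (got >= 2 && b[0] == 0xFE && b[1] == 0xFF) return "UTF-16BE";
    return "unknown";
}
```

Opened in binary mode, the stream hands back the FF FE bytes untouched, so the check is a plain byte comparison.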

Author Comment

by:Forehand
ID: 34125018
I have tried every method of opening this file: fopen, CFile, _open, Windows CreateFile.  No matter what I try, I never get the BOM.  I only get garbage encoding.  Both of my test apps were created with Visual Studio defaults.  I would love to be able to read the BOM!!  I thought fopen(filename, "rb") would read the file as a series of bytes.  Nope!  It uses the BOM to encode the file according to your locale and code page (I guess).  But even if I change the locale it doesn't matter.  I NEVER GET THE BOM!
I am going to try now to read the file using your suggestion to use Unicode functions.  I hesitate to use ICU because I work for a corporation that doesn't like programmers to use 3rd-party tools without permission.  I will give you partial credit if your advice succeeds.  I would also like to mark your suggestions as helpful, but I don't know if that will close my question, and I haven't solved the problem yet.

Expert Comment

by:evilrix
ID: 34125019

Expert Comment

by:evilrix
ID: 34125034
>> I have tried every method of opening this file
You should just be opening it as a binary file.

fstream in("myfile.txt", std::ios::binary);

or

fopen("myfile.txt", "rb");

Expert Comment

by:evilrix
ID: 34125040
>> I  will give you partial credit if your advice succeeds.

There is no rush to close this -- I'm not in it for the points, so take your time. We'll work it out together and get to a point (I hope) where you understand what is going on.

Expert Comment

by:evilrix
ID: 34125207
A very quick and dirty example of reading the file. Your BOM will be the first 2 bytes in intext.
#include <iostream>
#include <fstream>
#include <iomanip>

wchar_t const outtext[] = L"hello world";
char const BOM[] = { 0xFF, 0xFE };
size_t const insize = sizeof(outtext) + sizeof(BOM);

int main()
{
   std::ofstream out("c:/temp/myfile.txt", std::ios::binary);
   out.write(BOM, sizeof(BOM));
   out.write((char *)outtext, sizeof(outtext));
   out.close();

   wchar_t intext[insize];

   std::ifstream in("c:/temp/myfile.txt", std::ios::binary);
   in.read((char *)intext, insize);
   in.close();

   std::cout.write((char *)intext, insize); // wide stream
   std::cout << "\n";

   char * ptext = (char *)intext;
   for(size_t i = 0 ; i < insize ; ++i)
   {
      std::cout << std::hex << (0xFF & (int)ptext[i]);
   }
}

Author Comment

by:Forehand
ID: 34125210
Results of fread:
   FILE *fh = fopen(pathName, "rb");
   const int MAX_COUNT = 100;
   char buffer[MAX_COUNT];
   memset(buffer, 0, MAX_COUNT);
   fread(buffer, 1, 1, fh); // <-- buffer[0] contains an asterisk
The first line of the Unicode file begins with
FF FE 2A 00 42 00
FF FE is the BOM.  After that comes the string, "*BEGINDATA*".  As you see, fread, even with "rb" skips the BOM.  Also, I can't override the encoding with the ccs option.  If I try ANSI, the program crashes because you aren't allowed to have ANSI if the BOM is FF FE.

Next I tried fstream.  In this case, I seem to always get 0xcc or 204 no matter what.
   CString pathName = fileDlg.GetPathName();
   fstream in(pathName, std::ios::binary);
   byte by;
   in.read((char *)&by, 1);
   in.read((char *)&by, 1);
   in.read((char *)&by, 1);
   in.close();

Author Comment

by:Forehand
ID: 34125232
Maybe I have to try a different app type.  My test app is built using the default for MFC app in Visual Studio 2005.  I will try your example later this weekend or Monday.  Thanks a lot.

Expert Comment

by:evilrix
ID: 34125268
Your fread is wrong. Look at what you wrote

fread(buffer, 1, 1, fh);

This will read in 1 byte only.

The spec for fread is as follows.
size_t fread ( void * ptr, size_t size, size_t count, FILE * stream );
http://www.cplusplus.com/reference/clibrary/cstdio/fread/

Your code should be either

fread(buffer, sizeof(buffer), 1, fh);

or

fread(buffer, 1, sizeof(buffer), fh);

Try the code I posted above, it works -- I know, I tested it :)


>> Maybe I have to try a different app type.

Please trust me... it is nothing to do with the app type... forget about this as you are just confusing yourself. Nothing, I repeat nothing about the app type is going to prevent you reading the file as a series of bytes that you can then treat as UTF16.
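The corrected call, wrapped as a small helper so it can be reused (hypothetical name and path; a sketch, not the expert's exact code):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Read up to `cap` raw bytes from a file opened in binary mode.
// fread performs no translation in "rb" mode, so a UTF-16 BOM
// (FF FE) arrives verbatim as the first two bytes of `buf`.
// Returns the number of bytes actually read (0 on failure to open).
std::size_t read_raw(const char* path, unsigned char* buf, std::size_t cap)
{
    std::FILE* fh = std::fopen(path, "rb");
    if (!fh) return 0;                           // always check the open succeeded
    std::size_t got = std::fread(buf, 1, cap, fh);
    std::fclose(fh);
    return got;
}
```

Note the order of checks: first confirm the file opened, then look at how many bytes fread returned, and only then interpret the buffer contents.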

Expert Comment

by:evilrix
ID: 34125287
>> In this case, I seem to always get 0xcc

This is the default value assigned by the debugger to an uninitialised char -- your file stream is NOT being opened successfully. In other words, the reason it's failing is because you are not reading anything.

Author Comment

by:Forehand
ID: 34128196
I have copied your code into my project.  I'm still getting 0xcc for intext.  You have been incredibly patient.  About reading 1 byte - I thought, why should I read more?  If the first byte is 0xFF, then I have a Unicode file.  It seems as though the failure to read is based on something other than the size of the variable.  I feel as though your help has brought me so close to a solution!  And yet somehow the read is not working!  insize is 26, which is not quite correct.  The size should be 24, because outtext is 11 chars * 2 = 22.  0xFF, 0xFE is probably counted as 4 bytes but should be counted as 2 bytes.

   CString pathName = fileDlg.GetPathName();
   fstream in(pathName, std::ios::binary);
   wchar_t const outtext[] = L"*BEGINDATA*";
   char const BOM[] = { 0xFF, 0xFE };
   size_t const insize = sizeof(outtext) + sizeof(BOM);
   wchar_t intext[insize];
   char * ptext = (char *)intext;
   in.read((char *)intext, insize);
   in.close();

Expert Comment

by:evilrix
ID: 34128625
>> I have copied your code into my project.  I'm still getting 0xcc for intext

Verbatim? I tested it with VS2008 and it does exactly what I would expect. You should see it output "hello world" followed by another line with the hex that represents the wide chars.

>>  If the first byte is 0xFF, then I have a Unicode file.

Maybe... but not necessarily.

>> And yet somehow the read is not working!

It would seem so... try putting in some additional code to check the file is open and also that the stream has not gone into an error state.

>>  insize is 26 which is not quite correct.

Sure it is... don't forget there is a null at the end of L"hello world"
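The arithmetic behind that 26 can be checked directly. char16_t is used here as a portable stand-in for the 2-byte wchar_t on Windows (on other platforms wchar_t may be 4 bytes, which would change the numbers):

```cpp
#include <cassert>

// A wide string literal includes its terminating null, so
// u"hello world" is 12 code units, not 11.
char16_t const outtext[] = u"hello world"; // 11 chars + 1 null = 12 units
char const BOM[] = { '\xFF', '\xFE' };     // 2 bytes

static_assert(sizeof(outtext) == 24, "12 units * 2 bytes each");
static_assert(sizeof(BOM) == 2, "two BOM bytes");
// Hence insize = 24 + 2 = 26, exactly as observed.
```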

Author Comment

by:Forehand
ID: 34131358
OK, I had used fstream instead of ifstream (so the stream was never successfully opened for reading).  When I corrected this call, I got the correct text but still no BOM.  ptext (defined as char*) contains
0x0012f1d4 "*BEGINDATA*"All RespondeÌÌÌÌÌÌÌÌÌÌÌÌÌÌÌÌÌÌÌÌÌÌÌÌÌÌÌÌÌÌ"
intext defined as wchar contains garbage Chinese characters.  The first character is
0x422a L'¿'
I also tried this in a Unicode app.  You are right there is no difference.
I looked at your file "mytext.txt" in a hex editor.  The first 3 bytes were FF FE 68 00 65 00
Your file reads great.  The first byte of infile is 0xFFFE as expected.
My file in the hex editor is FF FE 42 00 62 00
My file reads like garbage in my apps.  I am very puzzled.

Expert Comment

by:evilrix
ID: 34131754
>> When I corrected this call, I got the correct text but still no BOM.

How are you reaching this conclusion? When I step through the debugger the two bytes are clearly there. You are just reading a binary file. As long as you have opened it as a binary file C++ makes absolutely no translations on any of the content read. I can only assume you are not correctly opening the file -- do you check this?

>> I also tried this in a Unicode app.  You are right there is no difference.

Ta daah :)

>> My file reads like garbage in my apps.  I am very puzzled.

Ok, please attach your file and your code (full, so I can compile it and test it).

Author Comment

by:Forehand
ID: 34138047
The hex dump of my file was inaccurate!  I was relying on a text processor, UltraEdit, to display files in hex.  The file that I was reading, chinacd.vpd, DOES NOT HAVE A BOM.  It displays properly if I open it in binary mode in Visual Studio 6.0.  I guess UltraEdit somehow detected that this file needed translation into Chinese, translated it, and then displayed the results of the translation in hex instead of the actual bytes of the file. chinacd.txt  I wish I could do that translation.
I still have a problem, but it is not the problem I thought it was.  I thought that my source code was reading the file improperly.  Instead, the source code was reading correctly; it was the hex dump from UltraEdit that was incorrect.
I attach the file and source code as you requested, but the source code works.  It reads my file as a Unicode file and reverses each two-byte character, which results in garbage.  This file is not a Unicode file, so I shouldn't read it as Unicode, I guess.  But now I'm puzzled.  How can this file be distinguished from a multibyte character set file?  I need to be able to read both kinds of files and display them properly.  C# .NET can read the first one OK.  The C++ multibyte app using CFile and CArchive can read the second one OK.  How can I tell the difference between these files when they both start the same way? chinvp.txt
ChinaCD.txt can be translated correctly into Chinese characters by Excel, Outlook, UltraEdit, and Notepad.
ChinVP.txt also seems to display "properly" in Notepad, etc.  It looks funny with an English (United States) locale, but it displays Chinese characters fine with a Hong Kong S.A.R. locale.  Do you think I can read both files in the same application?  Will the classes that you recommended earlier do this job?
// UnicodeToMultibyte.cpp : Defines the entry point for the console application.
//

#include "stdafx.h"
#include <iostream>
#include <fstream>
#include <iomanip>

wchar_t const outtext[] = L"*BEGINDATA*";
char const BOM[] = { 0xFF, 0xFE };
char const MALE[] = { 0x37, 0x75 };
size_t const insize = sizeof(outtext) + sizeof(BOM) + sizeof(MALE);

int _tmain(int argc, _TCHAR* argv[])
{
   std::ofstream out("c:/temp/myfile2.txt", std::ios::binary);
   out.write(BOM, sizeof(BOM));
   out.write((char *)outtext, sizeof(outtext));
   out.write(MALE, sizeof(MALE));
   out.close();

   wchar_t intext[300]; //big enough to hold all of chinacd.vpd

   std::ifstream in("c:/temp/myfile2.txt", std::ios::binary);
   in.read((char *)intext, insize);
   in.close();
   char * ptext = (char *)intext;

  /* std::cout.write((char *)intext, insize); // wide stream
   std::cout << "\n";
   char * ptext = (char *)intext;
   for(size_t i = 0 ; i < insize ; ++i)
   {
      std::cout << std::hex << (0xFF & (int)ptext[i]);
   }
*/

   std::ifstream chinaIn("c:/imswin/data/chinacd.vpd", std::ios::binary);
   chinaIn.read((char *)intext, 296); //295 is sizeof chinacd.vpd
   chinaIn.close();

   return 0;
}

Assisted Solution

by:evilrix
evilrix earned 500 total points
ID: 34138134
>> How can this file be distinguished from a multibyte character set file.

I refer you back to this: http:#34125019

You have to figure it out by analysing the content. It is for this reason I strongly suggest you consider ICU. Trying to write a Unicode decoder is not a trivial task -- this is why ICU is used by so many big-name companies.

http://site.icu-project.org/#TOC-Who-Uses-ICU-

Consider the various possible encodings you need to try and detect. Now consider that a BOM is completely optional... it might not even exist (as you've discovered). The only way to know how it's encoded is to parse it and figure it out.

I appreciate what you said about needing to get this cleared but the effort in trying to code a proper Unicode parser is going to be significant if you need to handle any possible combination. It's not so painful if you can assume it'll always be a specific format but from what you've said I don't think that is the case for you.
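To make concrete why content analysis is only ever a guess, here is a crude heuristic sketch (purely illustrative; real detectors such as ICU's charset detection are statistical and far more robust): mostly-Latin UTF-16 text has a NUL in every other byte, while ANSI/MBCS and UTF-8 text normally contains no NULs at all.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Guess the encoding of a byte buffer: BOM first, then a NUL-count
// heuristic for BOM-less data. A guess, not a proof -- Chinese UTF-16
// text (like chinvp.txt above) contains few NULs and would fool it.
std::string guess_encoding(const std::vector<unsigned char>& b)
{
    if (b.size() >= 2 && b[0] == 0xFF && b[1] == 0xFE) return "UTF-16LE";
    if (b.size() >= 2 && b[0] == 0xFE && b[1] == 0xFF) return "UTF-16BE";
    std::size_t nuls = 0;
    for (unsigned char c : b)
        if (c == 0) ++nuls;
    if (!b.empty() && nuls * 3 > b.size())    // more than a third NULs
        return "UTF-16?";
    return "8-bit (ANSI/MBCS or UTF-8)?";
}
```

The trailing question marks are deliberate: without a BOM, the best any detector can return is a confidence-weighted guess.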

Accepted Solution

by:
Forehand earned 0 total points
ID: 34150039
I am able to read Unicode files in C# .NET using StreamReader.  If I specify the encoding, I can read files created with Chinese code page 936 (Chinese Simplified, GB2312).
         Encoding mbcs = Encoding.GetEncoding(936); //code page 936 = GB2312 (Big5 would be 950)
         string mbName = @"c:\imswin\data\china_mb.txt"; //multibyte character set 0xC4 0xD0
         StreamReader srMbcs = new StreamReader(mbName, mbcs);
         string str = srMbcs.ReadLine();
         srMbcs.Close();
If I don't specify the encoding, the files are read correctly when they were created in a Unicode format.  In this case, the files had a UTF-8 format, so that U+7537 was written as 0xE7 94 B7.
         string ucName = @"c:\imswin\data\china_uc.txt"; //unicode
         StreamReader sr = new StreamReader(ucName);
         string unicodeString = sr.ReadLine(); //U+7537
         sr.Close();
Currently it is almost impossible to detect the encoding of a file without a BOM.  Whenever I read the multibyte file, it is always read with UTF-8 encoding, even though this is the wrong encoding for this file.  IsTextUnicode always returns true.  Here are some other references that indicate how difficult this is.

Rick Strahl's web log: http://www.codeproject.com/KB/recipes/DetectEncoding.aspx
StreamReader() specifically has an overload that's supposed to help with detection of byte order marks and, based on that, is supposed to sniff the document's encoding. It actually works, but only if the content is encoded as UTF-8/16/32 - i.e. when it actually has a byte order mark. It doesn't revert back to Encoding.Default if it can't find a byte order mark - the default without a byte order mark is UTF-8, which usually will result in invalid text parsing.

A more complex way to detect how a file is encoded:
http://www.codeproject.com/KB/recipes/DetectEncoding.aspx

My solution is to translate the Unicode strings to multibyte character set strings before creating the file that will be read by the legacy multibyte app.  That way, the file will always be in the correct format for multibyte reading.
        Encoding unicode = new UnicodeEncoding(true, false); //big-endian, no BOM
        // Convert the string into a byte[].
        byte[] unicodeBytes = unicode.GetBytes(unicodeString);
        // Perform the conversion from one encoding to the other.
        Encoding mbcs = Encoding.GetEncoding(936); //code page 936 = GB2312
        byte[] mbcsBytes = Encoding.Convert(unicode, mbcs, unicodeBytes);

Expert Comment

by:evilrix
ID: 34150188
Just to point out in case you had not realised, this started out as a C++ question (although I realise the files are created in C#) and I know almost nothing about C# so I'm afraid I cannot comment or provide you with any sensible suggestions in that area.

Also, I have no objection if you want to keep this question open whilst you still try and figure out what you are doing in the C++ side of things but as from now I will be offline for the next 5 days (it's my birthday and I'm going to party it up for a few days on a mini-holiday). I will post an alert to some other C++ experts to hopefully keep an eye on things here for you.

-Rx.

Expert Comment

by:evilrix
ID: 34150344
Forehand,

Thank you for your kind words in your closing comment. The appreciation means more than any points (although those are nice too) :)

not-so-evilrix.

Author Closing Comment

by:Forehand
ID: 34182485
Evilrix truly deserves his genius rating and in addition should get an Angel rating, because he helped me so much.  (But maybe that would cancel out Evil?)  I spoke to a representative for advice on how to grade.  I wanted to give fewer points and an A grade, because there is no complete solution in the sense that there is no easy way to get the encoding from a file.  Everything that Evilrix said was true, but it didn't work for me because UltraEdit's hex representation of my file didn't match the way it was actually stored.  I decided it was easier to control the way the files were written, which I CAN control, rather than to spend the effort modifying an older app to be able to read and detect everything.