PMH4514

Read Unicode string from text file into CString

I have a text file which was created with a C# program like this:
using (StreamWriter sw = new StreamWriter(defaultFile, false))
{
    sw.WriteLine(profile_name);
}

Where profile_name = 10.67µm steps to 277.42µm

This file created has a single line that reads as follows when I open it in Notepad:
10.67µm steps to 277.42µm

Note the µ character.

I have a C++ (Visual Studio 2010) project that needs to read this line into a CString, but I always get garbage using CStdioFile.

I even tried using that CStdioFileEx  (http://www.codeproject.com/Articles/4119/CStdioFile-derived-class-for-multibyte-and-Unicode) but still I get only junk.

What is the proper way to read this value into a C++ CString?

Thanks
jkr

Are you sure you are writing the file as UNICODE? The 'µ' character is not specific to Unicode; it also exists in single-byte ANSI code pages such as Windows-1252 (though not in 7-bit ASCII). Can you try to explicitly open the file as either UNICODE or ANSI using Notepad (you can choose the encoding in the 'File Open' dialog)?
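
For what it's worth, the encoding can also be checked programmatically by looking for a byte-order mark (BOM) at the start of the file. The sketch below is only an illustration: `detect_bom` and the `Bom` enum are made-up names, not part of MFC or the STL. Note that a missing BOM does not prove the file is ANSI. As far as I know, a C# StreamWriter created without an explicit encoding writes UTF-8 without a BOM, so such a file would come back as `Bom::None` here.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper type: which byte-order mark, if any, starts the data.
enum class Bom { None, Utf8, Utf16LE, Utf16BE };

// Inspect the first bytes of a file's contents (read in binary mode).
// Caveat: a UTF-32LE BOM (FF FE 00 00) would be reported as Utf16LE here.
Bom detect_bom(const std::string& bytes) {
    auto b = [&](std::size_t i) {
        return static_cast<unsigned char>(bytes[i]);
    };
    if (bytes.size() >= 3 && b(0) == 0xEF && b(1) == 0xBB && b(2) == 0xBF)
        return Bom::Utf8;      // EF BB BF
    if (bytes.size() >= 2 && b(0) == 0xFF && b(1) == 0xFE)
        return Bom::Utf16LE;   // FF FE (what Notepad calls "Unicode")
    if (bytes.size() >= 2 && b(0) == 0xFE && b(1) == 0xFF)
        return Bom::Utf16BE;   // FE FF
    return Bom::None;          // ANSI, or UTF-8 written without a BOM
}
```

To use it, read the first few bytes of the file with `std::ifstream` opened with `std::ios::binary` and pass them in.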
PMH4514

ASKER

Hi jkr -

I had assumed the 'µ' character was Unicode; perhaps that is my first problem.
I tried your suggestion, choosing UNICODE in Notepad's File Open dialog. Opened as Unicode, it appears completely wrong (as I will attempt to paste below):

¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿

So given the text file clearly isn't actually Unicode, why then does my attempt to read the line fail?

When I try the following:

CString strLine = _T("");
CStdioFile file;
if (file.Open(szFilePath, CFile::modeRead))
{
      file.ReadString(strLine);
}

I end up with:

strLine = "10.67Âµm steps to 277.42Âµm"
Is your project set to UNICODE (which I assume, since it is the default)? Try setting it to ASCII by switching "Project Properties|C/C++|General|Use Character Set" from UNICODE to "Multi-Byte".
PMH4514

ASKER

My project is UNICODE. I couldn't set it to MBCS and recompile/run because other libraries in use by it (unrelated to this query) require UNICODE.
OK, so why not use STL's file I/O for that purpose, since it allows you to explicitly open and read ANSI/UNICODE files? E.g.

#include <fstream>
#include <string>

std::string sLine;

std::ifstream is("file.txt"); // hard-code that for testing purposes, we might have to do a conversion from UNICODE here

getline(is,sLine);


BTW, another - maybe easier - option would be to ensure that the C# project writes UNICODE ;o)
PMH4514

ASKER

The ifstream version produces exactly the same result

sLine = "10.67Âµm steps to 277.42Âµm"

The C# side has other dependencies, so I'd rather not open that box.
PMH4514

ASKER

I can "hack" a fix:

strLine.Replace(_T("Â"), _T(""));

and get the line I need, but it's a hack and I don't understand it, so I don't like it :)
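
A less hacky alternative, assuming the file really is UTF-8 (the stray 'Â' points that way, since UTF-8 encodes 'µ' as the two bytes C2 B5 and C2 displays as 'Â' in an ANSI code page), would be to decode the bytes properly instead of deleting one of them. The following is a minimal portable sketch of that decoding; `utf8_to_utf16` is a hypothetical name, and it does not reject overlong or otherwise malformed-but-well-shaped sequences. On Windows, `MultiByteToWideChar` with `CP_UTF8` does the same job and its output can be assigned to a CString in a UNICODE build.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Decode UTF-8 bytes into UTF-16 code units (minimal sketch).
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (i + extra >= in.size())
            throw std::runtime_error("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;

        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points above the BMP become a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

With this, the bytes C2 B5 come out as the single code unit U+00B5 ('µ'), so no 'Â' is left to strip. A lone B5 byte (ANSI 'µ') makes the function throw, which is a hint that the input was not UTF-8 after all.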
Hmm, OK - what is 'Â' in hexadecimal? And, do you find the same value in the file in question when you open that with a hex editor?
PMH4514

ASKER

If I type Â into a new file, and view it in Hex mode, it reads C2
(also verified C2 using the little converter about halfway down this page: http://www.thehistoryprofessor.us/bin/header/ascii.html)

When I view the text file in question in hex mode, I do not see C2 anywhere.
Well, that's right, but what does it evaluate to in your case when you see it in the debugger?
Or, even better: Could you attach the file to this thread?
PMH4514

ASKER

I'm attaching the file.

In the debugger when I roll over the line, and right click and choose Hexadecimal display, nothing changes.
PMH4514

ASKER

woops, didn't attach to previous...
1542.txt
PMH4514

ASKER

I tried making a new text file with notepad, and typed exactly the same values in, copying and pasting 'µ' from character map, and then saving (just to take the C# project out of the equation.)

the C++ attempt to read it still produces the same result.
OK, *now* it is getting weird. Just using the following code:

#include <fstream>
#include <string>
#include <iostream>
using namespace std;

int main () {

  string sLine;

  ifstream is("1542.txt"); // hard-code that for testing purposes, we might have to do a conversion from UNICODE here

  getline(is,sLine);

  cout << sLine << endl;
}



With VC++, I get the same result that you get:

10.67Ám steps to 277.42Ám



Using g++, that is

10.67µm steps to 277.42µm



The VC++ debugger correctly shows that as

           [5]      0xb5 'µ'      char

And when I try the same using

#include <windows.h>

#include <fstream>
#include <string>
#include <iostream>
using namespace std;

int main () {

  string sLine;

  ifstream is("1542.txt"); // hard-code that for testing purposes, we might have to do a conversion from UNICODE here

  getline(is,sLine);

  //cout << sLine << endl;
  MessageBoxA(NULL,sLine.c_str(),"Test",MB_OK); // explicit ANSI version, since sLine holds narrow chars
}
                                            



I get the attached message box with the result you can see, which is the correct one. It seems that all we have here is a console codepage issue, probably not even worth bothering about ;o)
1542.png
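
To illustrate why the console and the message box disagree: the same byte 0xB5 stands for different characters depending on the code page used to display it. The sketch below assumes the console runs under OEM code page 850 (common on Western European systems) while GUI windows use Windows-1252. That is an assumption about the machines involved, not something verified in this thread, and `CodePage` and `byte_b5_as` are made-up names for illustration only.

```cpp
// Hypothetical enum covering the code pages relevant to this thread.
enum class CodePage { Windows1252, Oem850, Oem437 };

// Unicode code point that the single byte 0xB5 represents
// when interpreted under the given code page.
char32_t byte_b5_as(CodePage page) {
    switch (page) {
        case CodePage::Windows1252: return 0x00B5;  // 'µ' micro sign (GUI / ANSI)
        case CodePage::Oem850:      return 0x00C1;  // 'Á' (Western European console)
        case CodePage::Oem437:      return 0x2561;  // box-drawing character (US console)
    }
    return 0xFFFD;  // replacement character for anything unexpected
}
```

So a string holding the byte 0xB5 is fine in itself; only the console's rendering of it as 'Á' is misleading.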
PMH4514

ASKER

Weird indeed!
Unfortunately, my applied problem seems to go beyond a console/debugger codepage issue because I need to use the string to format a path to an actual datafile to load.

For example, I may have several "profile" files (which are just CSV text given an extension of .profile rather than .txt)

c:\profiles\1µm steps to 10µm.profile
c:\profiles\2µm steps to 11µm.profile
c:\profiles\3µm steps to 12µm.profile
c:\profiles\4µm steps to 13µm.profile
c:\profiles\10.67µm steps to 277.42µm.profile  
etc..


The file we are reading (the one attached earlier) holds as its first and only line, the name of the default datafile to load. So I'm using the string I read in to form a fully qualified path after reading sLine:

CString sPath = _T("");
sPath.Format(_T("c:\\profiles\\%s.profile"), sLine);

The problem then is I end up with this path:
sPath = _T("c:\\profiles\\10.67Âµm steps to 277.42Âµm.profile")

Which is not a file that exists.

I check for existence like:

BOOL FileExists(CString path)
{
   CFileStatus status;
   return CFile::GetStatus( path, status );  
}

If I use my earlier described "hack" to strip out that 'Â' character, in order to check for and open the file at:

sPath = _T("c:\\profiles\\10.67µm steps to 277.42µm.profile")

everything works as expected (that is, yes they are weird filenames, but there is no inherent issue opening and reading from them)

Thanks for all your attention!
That's even more odd, since when checking with the debugger, there was no such character at all :-/
PMH4514

ASKER

I guess sometimes hacks have their place :)
the Ám is a typical output for an utf-8 character that was shown by a program that could not handle UTF-8 (such as notepad or windows command interpreter).

the vs editor (and debugger) can handle utf-8 and would silently show the appropriate ansi character (if any). you could verify that i was right by opening the file in visual studio with the hex editor (use the drop-down box at the open button in the open file dialog).

You can recognize utf8 characters by their prefix code which is not printable in ascii.

the common utf-8 characters have 2 bytes and would begin with hex c2, c3, ...

the µ (micro sign, U+00B5) has utf-8 code sequence "C2B5" which you should find in the hex table if it is utf-8. (the greek letter μ, U+03BC, would be "CEBC" instead.)

Sara
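
The hex check described above can also be done in code. This is a rough sketch under the assumption that the character in question is the micro sign U+00B5; `hex_dump`, `find_micro_sign` and `MicroEncoding` are hypothetical names. It only distinguishes "B5 preceded by C2" (UTF-8) from a bare B5 (a single-byte ANSI code page), which is enough to tell the two cases in this thread apart.

```cpp
#include <cstddef>
#include <string>

// Render raw bytes as space-separated uppercase hex, e.g. "C2 B5 6D".
std::string hex_dump(const std::string& bytes) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        const unsigned char b = static_cast<unsigned char>(bytes[i]);
        if (i != 0) out += ' ';
        out += digits[b >> 4];
        out += digits[b & 0x0F];
    }
    return out;
}

enum class MicroEncoding { NotFound, Utf8, SingleByte };

// Look at the first 0xB5 byte and report how the micro sign seems to be encoded.
MicroEncoding find_micro_sign(const std::string& bytes) {
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        if (static_cast<unsigned char>(bytes[i]) != 0xB5) continue;
        const bool after_c2 =
            i > 0 && static_cast<unsigned char>(bytes[i - 1]) == 0xC2;
        return after_c2 ? MicroEncoding::Utf8 : MicroEncoding::SingleByte;
    }
    return MicroEncoding::NotFound;
}
```

Feeding it the raw line read with `std::ifstream` in binary mode shows directly whether a C2 byte is present in front of each B5.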
PMH4514

ASKER

Thank you Sara.
Your comments imply, as JKR had suggested, that the issue is merely the display of the character within the IDE. But this doesn't seem to be the case, given that:

1. I can strip/replace the 'Á' character and be left with the IDE properly displaying 'µm'

2. When I format a string representing a path to a file, if it contains  (er, "displays as") 'Á'  the file is not found (implying the character is real, not just a display thing) whereas if I strip that character, the file can then be found.

this is an odd one for sure.
ASKER CERTIFIED SOLUTION
jkr

This solution is only available to members.
PMH4514

ASKER

sorry for the delay.
I wasn't enclosing the resulting path in quotes!
my mistake, plus debugger code page weirdness is all it was I guess.