  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 698
  • Last Modified:

Read Unicode string from text file into CString

I have a text file which was created with a C# program like this:
using (StreamWriter sw = new StreamWriter(defaultFile, false))
{
    sw.WriteLine(profile_name);
}

Where profile_name = 10.67µm steps to 277.42µm

The resulting file has a single line that reads as follows when I open it in Notepad:
10.67µm steps to 277.42µm

Note the µ character.

I have a C++ (Visual Studio 2010) project that needs to read this line into a CString, but I always get garbage using CStdioFile.

I even tried using that CStdioFileEx  (http://www.codeproject.com/Articles/4119/CStdioFile-derived-class-for-multibyte-and-Unicode) but still I get only junk.

What is the proper way to read this value into a C++ CString?

Thanks
Asked by: PMH4514
1 Solution
 
jkr Commented:
Are you sure you are writing the file as UNICODE? The 'µ' character is not specific to Unicode; it also exists in single-byte ANSI code pages such as Windows-1252 (though not in 7-bit ASCII). Can you try to explicitly open the file as either UNICODE or ANSI using Notepad (you can choose the encoding in the 'File Open' dialog)?
 
PMH4514 (Author) Commented:
Hi jkr -

I had assumed the 'µ' character was Unicode; perhaps that is my first problem.
I tried your suggestion, choosing UNICODE in Notepad during File Open, and as Unicode it appears completely wrong (as I will attempt to paste below)

¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿

So given the text file clearly isn't actually Unicode, why then does my attempt to read the line fail?

When I try the following:

CString strLine = _T("");
CStdioFile file;
if (file.Open(szFilePath, CFile::modeRead))
{
      file.ReadString(strLine);
}

I end up with:

strLine = "10.67Âµm steps to 277.42Âµm"
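
One possible explanation for the stray 'Â', assuming the C# StreamWriter used its default encoding (UTF-8 without a byte-order mark): the micro sign U+00B5 is stored as the two bytes C2 B5, and a reader that treats every byte as one ANSI (Windows-1252) character shows them as 'Â' followed by 'µ'. The helper names below are hypothetical. This is a minimal portable C++ sketch of that effect, not code from the thread.

```cpp
#include <string>

// Encode a code point below U+0800 as UTF-8 (one or two bytes).
// Hypothetical helper for illustration only.
std::string EncodeUtf8Small(unsigned cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);                  // plain ASCII, one byte
    } else {
        out += static_cast<char>(0xC0 | (cp >> 6));    // lead byte 110xxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));  // continuation 10xxxxxx
    }
    return out;
}

// Treat every byte as one character whose code point equals the byte value.
// For bytes 0xA0..0xFF this matches Windows-1252, so it models what an
// ANSI reader does with the bytes of the micro sign.
std::u16string BytesAsLatin1(const std::string& bytes) {
    std::u16string out;
    for (unsigned char b : bytes) {
        out += static_cast<char16_t>(b);
    }
    return out;
}
```

Under that assumption, EncodeUtf8Small(0xB5) yields the bytes C2 B5, and BytesAsLatin1 turns them into U+00C2 U+00B5, which displays as "Âµ".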
 
jkr Commented:
Is your project set to UNICODE (which I assume, since it is the default)? Try setting it to ASCII by switching "Project Properties|C/C++|General|Use Character Set" from UNICODE to "Multi-Byte".
 
PMH4514 (Author) Commented:
My project is UNICODE. I couldn't set it to MBCS and recompile/run because other libraries in use by it (unrelated to this query) require UNICODE.
 
jkr Commented:
OK, so why not use STL's file I/O for that purpose, since it allows you to explicitly open and read ANSI/UNICODE files? E.g.

#include <fstream>
#include <string>

std::string sLine;

std::ifstream is("file.txt"); // hard-code that for testing purposes, we might have to do a conversion from UNICODE here

getline(is,sLine);

 
jkr Commented:
BTW, another - maybe easier - option would be to ensure that the C# project writes UNICODE ;o)
 
PMH4514 (Author) Commented:
The ifstream version produces exactly the same result

sLine = "10.67Âµm steps to 277.42Âµm"

The C# side has other dependencies; I'd rather not open that box.
 
PMH4514 (Author) Commented:
I can "hack" a fix:

strLine.Replace(_T("Â"), _T(""));

and get the line I need, but it's a hack and I don't understand it, so I don't like it :)
 
jkr Commented:
Hmm, OK - what is 'Â' in hexadecimal? And do you find the same value in the file in question when you open that with a hex editor?
 
PMH4514 (Author) Commented:
If I type Â into a new file and view it in hex mode, it reads C2
(also verified C2 using the little converter about halfway down this page: http://www.thehistoryprofessor.us/bin/header/ascii.html)

When I view the text file in question in hex mode, I do not see C2 anywhere.
 
jkr Commented:
Well, that's right, but what does it evaluate to in your case when you see it in the debugger?
 
jkr Commented:
Or, even better: Could you attach the file to this thread?
 
PMH4514 (Author) Commented:
I'm attaching the file.

In the debugger when I roll over the line, and right click and choose Hexadecimal display, nothing changes.
 
PMH4514 (Author) Commented:
Whoops, didn't attach to previous...
1542.txt
 
PMH4514 (Author) Commented:
I tried making a new text file with Notepad, and typed exactly the same values in, copying and pasting 'µ' from Character Map, and then saving (just to take the C# project out of the equation).

The C++ attempt to read it still produces the same result.
 
jkr Commented:
OK, *now* it is getting weird. Just using the following code:

#include <fstream>
#include <string>
#include <iostream>
using namespace std;

int main () {

  string sLine;

  ifstream is("1542.txt"); // hard-code that for testing purposes, we might have to do a conversion from UNICODE here

  getline(is,sLine);

  cout << sLine << endl;
}

With VC++, I get the same result that you get:

10.67Ám steps to 277.42Ám

Using g++, that is

10.67µm steps to 277.42µm

The VC++ debugger correctly shows that as

           [5]      0xb5 'µ'      char

And when I try the same using

#include <windows.h>

#include <fstream>
#include <string>
#include <iostream>
using namespace std;

int main () {

  string sLine;

  ifstream is("1542.txt"); // hard-code that for testing purposes, we might have to do a conversion from UNICODE here

  getline(is,sLine);

  //cout << sLine << endl;
  MessageBoxA(NULL,sLine.c_str(),"Test",MB_OK); // explicit ANSI variant, so this also compiles in a UNICODE build
}
                                            

I get the attached message box with the result you can see - the correct one. It seems that all we have here is a console codepage issue, probably not even worth bothering about ;o)
1542.png
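
A likely reason the console prints 'Á' while the message box shows 'µ' is that the same byte, 0xB5, maps to different characters depending on the code page in use. The sketch below hard-codes that single byte for three code pages. The enum and function names are made up for illustration, and which OEM code page a particular console uses is an assumption (850 is common on Western European systems, 437 on US ones).

```cpp
// Code pages considered in this illustration (hypothetical enum).
enum class CodePage { Windows1252, Oem850, Oem437 };

// Returns the character that the single byte 0xB5 displays as under the
// given code page. Only this one byte is covered; this is not a general
// code page converter.
char16_t DisplayOfByteB5(CodePage cp) {
    switch (cp) {
        case CodePage::Windows1252: return u'\u00B5';  // micro sign, as in GUI (ANSI) text
        case CodePage::Oem850:      return u'\u00C1';  // A with acute, as in the console output above
        case CodePage::Oem437:      return u'\u2561';  // a box-drawing character
    }
    return u'\uFFFD';  // replacement character for an unexpected enum value
}
```

So a string that is correct in memory can still look wrong when written to a console whose code page differs from the ANSI one.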
 
PMH4514 (Author) Commented:
Weird indeed!
Unfortunately, my applied problem seems to go beyond a console/debugger codepage issue because I need to use the string to format a path to an actual datafile to load.

For example, I may have several "profile" files (which are just CSV text given an extension of .profile rather than .txt)

c:\profiles\1µm steps to 10µm.profile
c:\profiles\2µm steps to 11µm.profile
c:\profiles\3µm steps to 12µm.profile
c:\profiles\4µm steps to 13µm.profile
c:\profiles\10.67µm steps to 277.42µm.profile  
etc..


The file we are reading (the one attached earlier) holds as its first and only line, the name of the default datafile to load. So I'm using the string I read in to form a fully qualified path after reading sLine:

CString sPath = _T("");
sPath.Format(_T("c:\\profiles\\%s.profile"), sLine);

The problem then is I end up with this path:
sPath = _T("c:\\profiles\\10.67Âµm steps to 277.42Âµm.profile")

Which is not a file that exists.

I check for existence like:

BOOL FileExists(CString path)
{
   CFileStatus status;
   return CFile::GetStatus( path, status );  
}

If I use my earlier described "hack" to strip out that 'Â' character, in order to check for and open the file at:

sPath = _T("c:\\profiles\\10.67µm steps to 277.42µm.profile")

everything works as expected (that is, yes they are weird filenames, but there is no inherent issue opening and reading from them)

Thanks for all your attention!
 
jkr Commented:
That's even more odd, since when checking with the debugger, there was no such character at all :-/
 
PMH4514 (Author) Commented:
I guess sometimes hacks have their place :)
 
sarabande Commented:
The Ám is typical output for a UTF-8 character shown by a program that could not handle UTF-8 (such as Notepad or the Windows command interpreter).

The VS editor (and debugger) can handle UTF-8 and would silently show the appropriate ANSI character (if any). You can verify this by opening the file in Visual Studio with the hex editor (use the drop-down box at the Open button in the Open File dialog).

You can recognize multi-byte UTF-8 characters by their lead byte, which is outside the printable ASCII range.

The common non-ASCII UTF-8 characters take 2 bytes and begin with hex C2, C3, ...

The micro sign µ (U+00B5) has the UTF-8 byte sequence "C2B5" (the Greek letter μ, U+03BC, would be "CEBC"), which you should find in the hex view if the file is UTF-8.

Sara
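
If the file does turn out to be UTF-8, decoding it is more robust than stripping characters afterwards. What follows is a minimal, portable sketch of such a decoder, assuming the input only needs characters from the Basic Multilingual Plane. The function names are hypothetical, and the code does not reject overlong or surrogate encodings.

```cpp
#include <cstddef>
#include <string>

// True if b has the form 10xxxxxx, i.e. it is a UTF-8 continuation byte.
static bool IsContinuation(unsigned char b) { return (b & 0xC0) == 0x80; }

// Decode UTF-8 into UTF-16. Handles 1- to 3-byte sequences; every byte that
// does not start a well-formed sequence of that kind (including the lead
// bytes of 4-byte sequences) is replaced by U+FFFD.
std::u16string Utf8ToUtf16(const std::string& in) {
    std::u16string out;
    const std::size_t n = in.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        // Look ahead safely; 0 is never a continuation byte.
        const unsigned char b1 = (i + 1 < n) ? static_cast<unsigned char>(in[i + 1]) : 0;
        const unsigned char b2 = (i + 2 < n) ? static_cast<unsigned char>(in[i + 2]) : 0;
        if (b0 < 0x80) {                                   // 0xxxxxxx: ASCII
            out += static_cast<char16_t>(b0);
            i += 1;
        } else if ((b0 & 0xE0) == 0xC0 && IsContinuation(b1)) {   // 110xxxxx 10xxxxxx
            out += static_cast<char16_t>(((b0 & 0x1F) << 6) | (b1 & 0x3F));
            i += 2;
        } else if ((b0 & 0xF0) == 0xE0 && IsContinuation(b1) && IsContinuation(b2)) {
            // 1110xxxx 10xxxxxx 10xxxxxx
            out += static_cast<char16_t>(((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F));
            i += 3;
        } else {                                           // malformed or unsupported
            out += u'\uFFFD';
            i += 1;
        }
    }
    return out;
}
```

On Windows, where wchar_t is 16 bits wide, the Win32 function MultiByteToWideChar with CP_UTF8 performs the same kind of conversion, and its output can be assigned to a CString in a UNICODE build.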
 
PMH4514 (Author) Commented:
Thank you Sara.
Your comments imply, as JKR had suggested, that the issue is merely the display of the character within the IDE. But this doesn't seem to be the case, given that:

1. I can strip/replace the 'Á' character and be left with the IDE properly displaying 'µm'

2. When I format a string representing a path to a file, if it contains  (er, "displays as") 'Á'  the file is not found (implying the character is real, not just a display thing) whereas if I strip that character, the file can then be found.

this is an odd one for sure.
 
jkr Commented:
I'd still check with the debugger what the actual strings are. Also, are you enclosing the resulting path in quotes? Since they'll contain spaces, these are required.
 
PMH4514 (Author) Commented:
sorry for the delay.
I wasn't enclosing the resulting path in quotes!
my mistake, plus debugger code page weirdness is all it was I guess.