How to open a file with Unicode characters (Slovenian) in its name using C++ and std

Hi Experts,

Our application needs to load a file into a std::ifstream object. The filename contains Slovenian characters, e.g. Latin Small Letter Z with Caron (U+017E). Assuming that I have the filename in a std::wstring object, how can I open a std::ifstream object and associate it with that filename?

For example:

std::wstring filename;
std::ifstream sin(filename.c_str(), std::ios::in | std::ios::binary | std::ios::ate);

Doing this gives me a compile error that it cannot convert parameter 1 from const wchar_t * to const char *

What is the best way of doing this?
rrehmatAsked:

jkrCommented:
The STL file streams take their file names as ANSI strings (which I also think should be changed), so a hearty

#include <ctype>

std::wstring filename;
std::ifstream sin(narrow(filename.begin(),filename.end()).c_str(), std::ios::in | std::ios::binary | std::ios::ate);

will help you here and convert the string on the fly.
jkrCommented:
Oh, yet you always can use
string ToAnsi(const string& r) {
 
  return narrow(r.begin(),r.end()).c_str();
}
 
to convert the string.

IchijoCommented:
>> #include <ctype>

I get:

> fatal error C1083: Cannot open include file: 'ctype': No such file or directory

Did you mean either ctype.h or cctype?

>> string ToAnsi(const string& r) {
>> 
>>   return narrow(r.begin(),r.end()).c_str();
>> }

Is that ios::narrow() or ctype::narrow()?

I get:

> error C2228: left of '.c_str' must have class/struct/union
>         type is ''unknown-type''
> error C3861: 'narrow': identifier not found

(This is VS2005.)

Jkr, my STL reference (Josuttis) doesn't even mention a form of narrow() that takes iterators. Can you recommend a book that does?
IchijoCommented:
>> Jkr, my STL reference (Josuttis) doesn't even mention a form of narrow() that takes iterators. Can you recommend a book that does?

I just realized I wasn't even looking in the right book. Josuttis does in fact mention a narrow() with iterators, but it seems to involve some kind of use_facet wizardry.
jkrCommented:
>>Did you mean either ctype.h or cctype?

Sorry, neither. I mixed up a helper of mine with the standard library. The whole thing would look like


#include <iostream>
#include <iterator>
#include <locale>
#include <string>
using namespace std;

// Narrows each wide character in [first, last) with the ctype facet of 'loc';
// characters that have no narrow equivalent are replaced by 'def'.
template <class InputIterator>
std::string narrow(InputIterator first, InputIterator last, const std::locale& loc = std::locale(), char def = '\0')
{
  typedef typename std::iterator_traits<InputIterator>::value_type wcharT;
  const std::ctype<wcharT>& cty = std::use_facet<std::ctype<wcharT> >(loc);
  std::string str;
  str.reserve(std::distance(first, last));
  for ( ; first != last; ++first)
    str.push_back(cty.narrow(*first, def));
  return str;
}

string ToAnsi(const wstring& r) {
  return narrow(r.begin(), r.end());
}

int main () {
  wstring wstr = L"Test";
  wcout << wstr << endl;

  string str = ToAnsi(wstr);
  cout << str << endl;
}

jkrCommented:
Oh, and it seems that you already have the right books ;o)

BTW, for the sake of completeness, the 'old and simple' way would be
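string ToAnsi(const wstring& r) {

  // One byte per wide character plus the terminator (enough for single-byte ANSI code pages)
  size_t len = r.length() + 1;
  char* p = new char[len];

  size_t n = wcstombs(p, r.c_str(), len);

  // wcstombs returns (size_t)-1 if a character cannot be converted
  string s = (n == (size_t)-1) ? string() : string(p, n);
  delete[] p;

  return s;
}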

rrehmatAuthor Commented:
A bit of searching also points me to the function WideCharToMultiByte() from which I can then obtain a LPSTR to use to open the file. Are there any drawbacks in using this approach if this application is specific to windows OS?
itsmeandnobodyelseCommented:
>>>> Are there any drawbacks in using this approach if this application is specific to windows OS?
No. There are various functions for converting wide characters to ANSI (misleadingly called 'multi-byte'). For characters in the ANSI range they simply strip the higher byte of each two-byte UNICODE char (which is 0 for any ANSI character); anything else is converted according to the current locale. The most popular (and portable) function is probably wcstombs:

   std::wstring filename = L"test.txt";
   ....
   std::string ansi_filename(filename.length(), '\0');
   wcstombs(&ansi_filename[0], filename.c_str(), filename.length());
   std::ifstream ifs(ansi_filename.c_str());
rrehmatAuthor Commented:
So specifically, this is what I am running into when using wcstombs().

The user uses a file browser to pick a file. After the file is picked, I have a std::wstring object with the text "C:\Documents and Settings\rahim\Desktop\`C}ac~2.jpg" (note that this filename has Slovenian characters).

When I call wcstombs(), ansi_filename contains only the text "C:\Documents and Settings\rahim\Desktop\". The filename itself is lost in the conversion, so the call to open the file stream fails.

Is this expected? How can I call std::ifstream sin(ansi_filename.c_str()); so that it does not fail?

Thanks,
Rahim

IchijoCommented:
So, the file picker returned a Unicode string but the call to open the file stream is expecting an ANSI (MBCS) string? I would solve this by consistently using ANSI (MBCS), or Unicode, or TCHAR (which compiles to ANSI or Unicode depending on your "Character Set" project setting). Then there would be no need to convert strings at runtime. Is this an option?
// ANSI (MBCS)
std::string filename;
std::ifstream sin(filename.c_str(), std::ios::in | std::ios::binary | std::ios::ate);
 
// Unicode (wide characters)
std::wstring filename;
std::wifstream sin(filename.c_str(), std::ios::in | std::ios::binary | std::ios::ate);
 
// TCHAR (compiles for ANSI or Unicode depending on your "Character Set" project setting)
std::basic_string<TCHAR> filename;
std::basic_ifstream<TCHAR> sin(filename.c_str(), std::ios::in | std::ios::binary | std::ios::ate);

jkrCommented:
Unfortunately, TCHAR won't help either, since you get a 'wstring' when compiling for UNICODE, and both 'wstring::c_str()' and 'std::basic_string<TCHAR>::c_str()' then return a 'const wchar_t*', which you cannot use with streams, since they require ANSI filenames.
rrehmatAuthor Commented:
So we are using Unicode and I have tried:

// Unicode (wide characters)
std::wstring filename;
std::wifstream sin(filename.c_str(), std::ios::in | std::ios::binary | std::ios::ate);

BUT I get a compile error on the line where the wifstream is created:

cannot convert parameter 1 from const wchar_t * to const char *
 
IchijoCommented:
Sorry, as jkr stated, my way won't work. I wonder why the STL committee designed wifstream that way?

Rrehmat, is your current locale set to Slovenian? If not, this is just a guess, but it could explain why wcstombs is having trouble converting a Unicode string containing Slovenian to ANSI. What is the return value of wcstombs?
jkrCommented:
Well, even a 'wifstream' takes a 'const char*' as its argument for 'open()'. I don't know why they made it work that way, honestly. Using the template argument (or a pointer to that type) would seem more logical to me as well.
rrehmatAuthor Commented:
Hi Ichijo, the return value for wcstombs is just the path leading up to the filename but does not include the filename from the original wstring object. Only the filename has the Slovenian characters. As a result, feeding that value into the ifstream constructor fails to create the stream object.
The way I got Windows XP to name the file with Slovenian characters is that I added Slovenian language support, switched to it from the language bar and typed in the special characters: "S with caron", "Z with caron" and "C with caron".

http://www.fileformat.info/info/unicode/char/017d/index.htm  (example of character)
IchijoCommented:
Rrehmat, when wcstombs succeeds the return value is the number of bytes written to the output string, not counting the terminating NULL. When the function fails, it returns -1. I'm guessing that it's returning -1.
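
A minimal sketch of checking that return value before using the converted name. The helper name wide_to_narrow is hypothetical, not something from this thread, and the conversion uses whatever C locale is current when it is called:

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper (not part of any library): converts 'wide' to a narrow
// string using the current C locale. Returns false and leaves 'out' empty if
// wcstombs reports that some character cannot be converted.
bool wide_to_narrow(const std::wstring& wide, std::string& out)
{
    out.clear();

    // Worst case: every wide character needs MB_CUR_MAX bytes, plus the terminator.
    std::string buf(wide.size() * MB_CUR_MAX + 1, '\0');

    std::size_t written = std::wcstombs(&buf[0], wide.c_str(), buf.size());
    if (written == static_cast<std::size_t>(-1))
        return false;   // conversion failed; do not try to open the file

    buf.resize(written); // 'written' excludes the terminating null
    out.swap(buf);
    return true;
}
```

If this returns false for the picked file, the name contains a character that the current locale cannot represent, and passing a partial result on to std::ifstream would not open the intended file.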

In XP, there's a "Regional and Language Options" icon in the control panel, and on its Advanced tab is a setting which says: "Select a language to match the language version of the non-Unicode programs you want to use:". Changing it to Slovenian (if it isn't already) may have an effect.

Some other options include:
1. Use _wcstombs_l to specify which code page to use regardless of the control panel setting above,
2. Rename your files to use only those characters available in 7-bit ASCII,
3. Compile with VC++8 (Visual Studio 2005) which provides an ifstream/wifstream that accepts a wide character string for the filename (I just verified that it works there and not in VC++7.1 (Visual Studio 2003)), or
4. Don't use STL file streams.
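
For option 3, the wide-character constructor in VC++8 is a compiler-specific extension. As a rough sketch of the same idea with a newer toolchain, assuming a C++17 compiler (which postdates this thread), std::filesystem::path can be built from a std::wstring and passed to std::ifstream. The helper name file_size_from_wide_name is hypothetical:

```cpp
#include <filesystem>
#include <fstream>
#include <string>

// Hypothetical helper: opens the file named by a wide string and returns its
// size in bytes, or -1 if the file cannot be opened.
// Assumes C++17, where the file streams accept a std::filesystem::path.
long long file_size_from_wide_name(const std::wstring& filename)
{
    // The path object converts the wide string to the platform's native form.
    std::filesystem::path p(filename);

    // Open at the end so that tellg() reports the file size.
    std::ifstream in(p, std::ios::in | std::ios::binary | std::ios::ate);
    if (!in)
        return -1;

    return static_cast<long long>(in.tellg());
}
```

This keeps the filename wide all the way to the stream, so no lossy conversion to an ANSI code page is involved on Windows.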
rrehmatAuthor Commented:
Hi Ichijo, thanks for the options.
Is there anything special besides including stdlib.h to get _wcstombs_l to work? Right now the compiler complains that the identifier _wcstombs_l is not found.
IchijoCommented:
Hmm, apparently it's only available for Visual Studio 2005 and 2008.

I believe WideCharToMultiByte() is available for earlier versions. I guess you would use the 8859-2 code page (28592).
jkrCommented:
>>Hmm, apparently it's only available for Visual Studio 2005 and 2008.

No, already used that in VC6. Leave out the underscores for that and make it like the code I posted earlier:


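string ToAnsi(const wstring& r) {

  // One byte per wide character plus the terminator (enough for single-byte ANSI code pages)
  size_t len = r.length() + 1;
  char* p = new char[len];

  size_t n = wcstombs(p, r.c_str(), len);

  // wcstombs returns (size_t)-1 if a character cannot be converted
  string s = (n == (size_t)-1) ? string() : string(p, n);
  delete[] p;

  return s;
}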

IchijoCommented:
The one with underscores lets the programmer choose the locale.
jkrCommented:
?
rrehmatAuthor Commented:
Essentially, since our applications would in the end transfer the file to a web server, where supporting Unicode in filenames is even more tricky, we are opting to detect this on the client side and rename the file before transferring it to the server. On the local PC, we decided not to use STL file streams to read in the file.
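
A rough sketch of what such client-side detection and renaming could look like. Both function names and the choice of '_' as the replacement character are assumptions for illustration, not what was actually implemented:

```cpp
#include <string>

// Hypothetical helper: true if every character of 'name' is 7-bit ASCII.
bool is_ascii_name(const std::wstring& name)
{
    for (wchar_t ch : name)
        if (static_cast<unsigned long>(ch) > 0x7F)
            return false;
    return true;
}

// Hypothetical helper: builds an ASCII-only name for the upload by replacing
// every non-ASCII character with '_'. Different originals can map to the same
// result, so a real implementation would also need to handle collisions.
std::string ascii_upload_name(const std::wstring& name)
{
    std::string result;
    result.reserve(name.size());
    for (wchar_t ch : name)
    {
        if (static_cast<unsigned long>(ch) <= 0x7F)
            result.push_back(static_cast<char>(ch));
        else
            result.push_back('_');
    }
    return result;
}
```

The check can run on the bare filename right after the file picker returns, so that only names that actually need it are renamed before the transfer.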