  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 9042

Convert old code to Unicode

Hello;

What are some of the things I can do to convert old code that uses char[] arrays to Unicode?

RJ
0
RJSoft
 
AxterCommented:
Are you targeting standard C/C++ code, or is this for Win32-type code?

For standard C/C++ code, look at the following functions:
mbstowcs
wcstombs

mbstowcs converts an ANSI (multibyte) string to a wide-character (UNICODE) string, and wcstombs converts in the other direction.
0
 
AxterCommented:
For Win32 applications, you also have the option of using the following functions:
WideCharToMultiByte
MultiByteToWideChar

These are Win32 API functions that convert between UNICODE and ANSI strings.
0
 
AxterCommented:
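The following example code uses the standard wcstombs function to convert each line read from a file into an ANSI string.

#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>
using namespace std;

const char * FileWithWideChar = "widecharfile.txt";
const char * FileWithAnsiChar = "ansicharfile.txt";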

void SetupTestEnvironment()
{
    wstring Data1 = L"Hello1, my name is Axter";
    wstring Data2 = L"Hello2, my name is Axter";
    wstring Data3 = L"Hello3, my name is Axter";
    wofstream wfile(FileWithWideChar);
    wfile << Data1 << endl;
    wfile << Data2 << endl;
    wfile << Data3 << endl;
}

int main(int argc, char* argv[])
{
    SetupTestEnvironment();

    wifstream wide_file(FileWithWideChar);
    wstring TmpLineData;
    string CmpFileData_InAnsi, AnsiTmpLine;
    while(getline(wide_file, TmpLineData))
    {
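         // One byte per character is enough for single-byte ANSI text.
         AnsiTmpLine.resize(TmpLineData.size(), 0);
         // wcstombs needs raw pointers; string iterators are not guaranteed to be pointers.
         size_t Converted = wcstombs(&AnsiTmpLine[0], TmpLineData.c_str(), AnsiTmpLine.size());
         if (Converted == (size_t)-1) continue; // line has a character with no ANSI equivalent
         AnsiTmpLine.resize(Converted);
         CmpFileData_InAnsi += AnsiTmpLine + "\n";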
    }

    ofstream ansi_file(FileWithAnsiChar);
    ansi_file.write(CmpFileData_InAnsi.data(), CmpFileData_InAnsi.size());
    ansi_file << endl;
   
     return 0;
}
 
0
 
AxterCommented:
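A Windows project can use the MultiByteToWideChar API function to convert an ANSI string to a UNICODE string.
Example (assumes a UNICODE build, where CString holds wchar_t):

void Function(void)
{
   char dataBuff[] = "abcdefghijklmnopq";
   int srcLen = (int)strlen(dataBuff);

   // First call: ask how many wide characters the result needs.
   int wideLen = MultiByteToWideChar(CP_ACP, 0, dataBuff, srcLen, NULL, 0);

   CString tmpStr;
   // Buffer lengths are counted in characters, not bytes.
   wchar_t* pwsz = tmpStr.GetBufferSetLength(wideLen);
   MultiByteToWideChar(CP_ACP, 0, dataBuff, srcLen, pwsz, wideLen);
   tmpStr.ReleaseBuffer(wideLen);
}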


0
 
AxterCommented:
For Win32 code, the WideCharToMultiByte API function can be used to convert a UNICODE string to an ANSI string.
Example:


      const wchar_t WideChrData[] = L"Hello World";
      char AnsiBuffer[255];
      WideCharToMultiByte(CP_ACP, 0, WideChrData, wcslen(WideChrData)+1, AnsiBuffer , sizeof(AnsiBuffer), NULL, NULL);
      CString Msg = AnsiBuffer;
      AfxMessageBox(Msg);
0
 
AxterCommented:
As you can see in the example above, to declare a wide-character string literal you prefix it with an L:

const wchar_t WideChrData[] = L"Hello World";
wstring Data1 = L"Hello1, my name is Axter";

The L tells the compiler that the string literal is a wide character string.

In a UNICODE project, you do not need to make *ALL* strings UNICODE.
Some strings have to stay ANSI.
For example, the standard file stream constructors take a const char* file name, so even to open a file through a wide-character stream you pass an ANSI file name.

const char * FileWithWideChar = "widecharfile.txt";  //This is an ANSI string
wofstream wfile(FileWithWideChar);
wfile << L"Hello World" << endl;

So although you may have to modify your original code by prefixing the string literals with L, you don't necessarily want to do this to all your string literals.

If this is a Windows application, you may want to consider using the _T() macro and the _tcs??? functions (_tcsstr, _tcscpy, _tcslen, etc.).
This keeps the modifications minimal when converting ANSI string code to a UNICODE project.
0
 
rstaveleyCommented:
Axter, SetupTestEnvironment is a bit misleading, because you don't actually generate a wide character file.
--------8<--------
#include <iostream>
#include <fstream>
#include <string>
#include <sys/types.h>
#include <sys/stat.h>
using namespace std;

int main()
{
      const string filename = "three.dat";
      wstring wstr = L"123";
      wofstream fout(filename.c_str());
      fout << wstr;
      fout.close();
      struct stat s;
      if (stat(filename.c_str(),&s) == 0)
            wcout << L"File for \"" << wstr << L"\" is " << s.st_size << L" bytes long\n";
}
--------8<--------

The external representation is char rather than wchar_t. You only get wchar_t internally. Externally, they are not wide-character files :-(
0
 
msjammuCommented:
Looking at your profile, I assume you are a Windows programmer working in the VC++ environment.

Try using the TEXT() macro and the TCHAR data type. There are also separate functions for UNICODE strings, such as:

size_t __cdecl wcslen(const wchar_t *);

which calculates the length of a Unicode string, and many more.

Otherwise, it depends on what your program requires.

Regards,
msjammu
0
 
RJSoftAuthor Commented:
My apologies, I need to be more specific.

Note: in the post below, when I say wchar, I am not sure whether it should be TCHAR, wchar or wchar_t.
I am using VC++ 6.0 and have seen TCHAR and the L macro, but I am wondering which to use.

What I have is many classes with many functions that use char arrays instead of CString or wchar arrays.
(I know, I know, stupid.)

Since I did not really understand back then how to use CString, I settled for the string.h functions,
mainly strcpy and strcat.

I think I would be happy with replacing the many, many instances of char with wchar or TCHAR or whatever,
as I assume arrays of one of these data types can hold Unicode values.

I want this so my applications can be used internationally.

The main issue is the many dialog classes. Since I did not use CString and I liked controlling the values going to and from dialog child objects, I used GetWindowText or GetText to obtain the value and SetWindowText to set static, edit and listbox controls, etc. This may work to my advantage, though.

example:
MyEditBox.GetWindowText(CharString,MYMAXVALUE);
MyListBox.GetText(Index, CharString);

or

strcpy(ss,"fun fun fun!");
Dlg.MyStaticText.SetWindowText(ss);
Dlg.DoModal();

etc...

MYMAXVALUE is a largish value that I feel confident won't be a problem.

After the dialog instance I would take that value CharString and copy it to some other variable with
strcpy(Other,Dlg.CharString); along with strcat etc...

So, I am looking for a way to simplify the change. I was wondering if I could use typedef somehow?
My other thought is to have my own string.h-style class, a Mystring.h class, in which I would use wchar and the L macro.

Some of my classes output text to a large static area using DrawText. I am wondering if Unicode will affect that. Also, will newlines '\n' be interpreted correctly? (Probably.)

As far as file routines go, I use fopen, fwrite and fread in binary mode only, so I see no problem there.
My applications don't depend on reading any pre-existing file. There are not many file routines, so I could be happy with doing find/replace on them. Most of them write and read a structure I define, so in reality I would only need to change each of the char array structure members to a wchar array.

Also, I am not sure exactly which Windows APIs are affected by changing all the references from char arrays to wchar or TCHAR arrays. I am hoping none.

In order to export my applications to other countries I need to use the Unicode character set, both for input from the user and for output text displayed to the user (my application-specific text).

Appreciate any opinions, gotchas, etc.
RJ
0
 
AxterCommented:
>>So, I am looking for a way to simplify the change. I was wondering if I could use typedef somehow?

The best way to simplify the change is to convert all your char* strings to either CString or TCHAR arrays.
0
 
AxterCommented:
In an MFC-type project, you should try to use CString as much as possible. In general, it's very efficient, and it has a lot of functionality.

You can even use CString with the example code you posted.

example:
CString MyData;
MyEditBox.GetWindowText(MyData.GetBuffer(MYMAXVALUE) ,MYMAXVALUE);
MyData.ReleaseBuffer();

MyListBox.GetText(Index, MyData);//Can be used with CString directly

or

MyData = _T("fun fun fun!"); //Notice the use of _T() macro
Dlg.MyStaticText.SetWindowText(MyData );
Dlg.DoModal();


Using the above method with the _T() macro makes your code easy to compile in both ANSI and UNICODE projects, with little to no modification.
0
 
RJSoftAuthor Commented:
Oops. I should have concentrated on your post better Axter.

>>So although you may have to modify your original code by prefixing the string literals with L, you don't necessarily want to do this to all your string literals.

I don't really see a problem with this unless something like fopen could not take Unicode. Is this correct?

>>If this is a Windows application, in the future you may want to consider using _T() macro and using _tcs??? functions like _tcsstr, _tcscpy, _tcslen, etc.. This allows for minimal modifications when converting ANSI string code to UNICODE projects.

OK. So I should do a simple find/replace here. I guess I could not get away with a typedef here? I don't know if a typedef can turn something like _tcscpy into strcpy. Or a #define?

I have Jeff Prosise's book, so I am familiar with _T(), but the book is not in front of me now.

But mostly I believe you're on track...
Any comments are appreciated.
RJ
0
 
AxterCommented:
I strongly recommend that you convert your code to CString and get a lot more familiar with this class.
I guarantee it will be worth your time, and you'll be surprised by the functionality this class has.

I'm not a big MFC fan, and IMHO most of the MFC classes that emulate classes in the STL are really poor:
CArray, CMap, ...
Their implementation leaves a lot to be desired.

However, IMHO, the CString class is one of the best classes in MFC, and IMHO it's even better than the std::string class.

(IMHO) CString is one class Microsoft got right!
0
 
AxterCommented:
>>I dont really see a problem with this unless something like fopen could not take Unicode. Is this correct?
For the most part, in an MFC application, that should be the only type of code you have to worry about.


>> I dont know if typedef can modify something like _tcscpy to strcpy? or #define???

A typedef can't, and although a #define could, I wouldn't recommend it. You can just search through all your project files using the VC++ Find in Files tool.

I *think* you can also find them all if you temporarily comment out all the str??? functions in the string.h header.
0
 
RJSoftAuthor Commented:
Cool.
Yes. I got familiar with CString along the way and agree with you. But since I had already nailed my coffin shut in the direction I took, I found it hard to change gears. I even had an absolute need to format a string using CString Format to express float values.

Next project for sure.

I guess I am lazy. But I am fairly sure it will take me all day to replace the string literals with a leading L.

It is definitely going to take some time replacing char with wchar. (I assume I should use wchar rather than TCHAR, not sure?)

The char change is gonna be a pain...

But I guess that is the proper way to do it. So I've got a lot of replacing to do...
 
Thanks
RJ
0
 
RJSoftAuthor Commented:
Also another issue that I am reminded of.

When I first started using VC, I was not sure why, but I swear I could not put CString variables in a struct that was read and written using fopen/fread/fwrite.

Ex.

struct SS
{
int a;
int b;
CString as;
CString bs;
int flag;
}Inst;

Then I loaded values into Inst from some user input and stored it in a file. I believe I remember that I could not read the values back correctly.

Inst.a=1;
Inst.b=2;
Inst.as="test";
Inst.bs="this";
Inst.flag=0;

fopen...
fwrite(&Inst,sizeof(Inst),1,fptr);
fclose...
fopen...
fread(&Inst,sizeof(Inst),1,fptr);
fclose...

The data would be corrupt. I was not sure why, but it seemed that the CString was somehow messed up,
perhaps its size. But I could be misquoting myself, because this was a while back when I made the decision to avoid CString.

It could also have been that I was trying to fwrite and fread a single CString with bad results; I don't remember.

This is also important because items are saved and used in structures.

RJ
0
 
AxterCommented:
>> ( I assume I should use wchar rather than TCHAR (not sure)?).

I recommend that you use TCHAR instead of wchar_t if this is an MFC project.

If you use TCHAR, you can easily convert your program from ANSI to UNICODE, and then back to ANSI if need be.

If you use wchar_t, changing your code back to ANSI will require the same amount of work as converting it to UNICODE.

>>The data would be corrupt. I was not sure why but it seemed that the CString was somehow messed up.
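That's correct. You cannot read directly from a file into an object that contains non-POD types.
A CString is a non-POD type: it holds a pointer to character data stored elsewhere, so fwrite saves the pointer rather than the text.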

>>This is also important because items are saved and used in structures.
For this type of requirement, it would not be good to use CString unless you created an operator>>() and operator<<() for your struct.

You could use TCHAR arrays instead of char arrays in your struct.
0
 
RJSoftAuthor Commented:
Thanks.

RJ
0
 
msjammuCommented:
There are, of course, certain disadvantages to using Unicode. First and foremost, every string in your program will occupy twice as much space. In addition, you’ll observe that the functions in the wide-character run-time library are larger than the usual functions. For this reason, you might want to create two versions of your program: one with ASCII strings and the other with UNICODE strings. The best solution would be to maintain a single source code file that you could compile for either ASCII or UNICODE.

That’s a bit of a problem, though, because the run-time library functions have different names, you’re defining characters differently, and then there’s the nuisance of preceding the string literals with an L.

One answer is to use the TCHAR.H header file included with MS VC++. This header file is not part of the ANSI C standard, so every function and macro defined in it is preceded by an underscore. TCHAR.H provides a set of alternative names for the normal run-time library functions that take string parameters, e.g. _tprintf and _tcslen. These are sometimes referred to as generic function names because they can refer to either the UNICODE or non-UNICODE versions of the functions.

If an identifier named _UNICODE is defined and the TCHAR.H header file is included in your program, _tcslen is defined to be wcslen.

If _UNICODE isn’t defined, _tcslen is defined to be strlen.

And so on. TCHAR.H also solves the problem of the two character data types with a new data type named TCHAR. If the _UNICODE identifier is defined, TCHAR is wchar_t; otherwise, TCHAR is simply char.

Therefore, the choice is yours.

Regards,
msjammu
 
0
 
RJSoftAuthor Commented:
Good one, msjammu.
Thanks
RJ
0
 
RJSoftAuthor Commented:
My notes;

POD = plain old data.

Basically, the reason a POD can be written to a file is that all of its data is contiguous, inside the object itself. A non-POD class item (like CString) can have things like user-defined constructors, virtual members, and pointers to data stored elsewhere, so copying the object's bytes does not capture its real contents.

To write a structure to a file with fwrite requires that all of its elements be POD. There are conditions that a class object can meet to be considered a POD.

Ref.
http://www.tempest-sw.com/cpp/draft/ch06-classes.html
0
