Solved

Convert old code to Unicode

Posted on 2004-09-10
Last Modified: 2013-12-14
Hello;

What can I do to convert old code that uses char[] arrays to Unicode?

RJ

Question by:RJSoft

LVL 30

Expert Comment

by:Axter
Are you targeting standard C/C++ code, or is this Win32-type code?

For standard C/C++ code, look at the following functions:
mbstowcs
wcstombs

mbstowcs converts an ANSI (multibyte) string to a UNICODE (wide) string, and wcstombs converts a UNICODE string back to an ANSI string.
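
As a rough illustration of the ANSI-to-wide direction, here is a minimal sketch of a wrapper around mbstowcs. The helper name NarrowToWide is made up for this example, and the conversion follows whatever C locale is currently set, so treat it as a starting point rather than a finished routine.

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical helper: converts a multibyte (narrow) string to a wide string
// using the current C locale. Returns an empty string if the conversion fails.
std::wstring NarrowToWide(const std::string& narrow)
{
    // A wide string never needs more elements than the narrow one has bytes.
    std::vector<wchar_t> buffer(narrow.size() + 1, L'\0');
    std::size_t count = std::mbstowcs(&buffer[0], narrow.c_str(), buffer.size());
    if (count == static_cast<std::size_t>(-1))
        return std::wstring();   // invalid multibyte sequence
    return std::wstring(&buffer[0], count);
}
```

A call such as NarrowToWide("abc") should hand back the wide string L"abc"; text outside plain ASCII depends on the locale selected with setlocale.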

LVL 30

Expert Comment

by:Axter
For Win32 applications, you also have the option of using the following functions:
WideCharToMultiByte
MultiByteToWideChar

These are WIN32 API functions that can be used to convert between UNICODE and ANSI strings.

LVL 30

Expert Comment

by:Axter
The following is example code that uses the wcstombs C/C++ function to convert the strings.


#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>
using namespace std;

const char * FileWithWideChar = "widecharfile.txt";
const char * FileWithAnsiChar = "ansicharfile.txt";

void SetupTestEnvironment()
{
    wstring Data1 = L"Hello1, my name is Axter";
    wstring Data2 = L"Hello2, my name is Axter";
    wstring Data3 = L"Hello3, my name is Axter";
    wofstream wfile(FileWithWideChar);
    wfile << Data1 << endl;
    wfile << Data2 << endl;
    wfile << Data3 << endl;
}

int main(int argc, char* argv[])
{
    SetupTestEnvironment();

    wifstream wide_file(FileWithWideChar);
    wstring TmpLineData;
    string CmpFileData_InAnsi, AnsiTmpLine;
    while(getline(wide_file, TmpLineData))
    {
         // Worst case: each wide character converts to MB_CUR_MAX bytes.
         AnsiTmpLine.resize(TmpLineData.size() * MB_CUR_MAX + 1, 0);
         size_t Count = wcstombs(&AnsiTmpLine[0], TmpLineData.c_str(), AnsiTmpLine.size());
         if (Count == (size_t)-1)
              Count = 0; // a character could not be converted
         AnsiTmpLine.resize(Count); // keep only the bytes actually written
         CmpFileData_InAnsi += AnsiTmpLine + "\n";
    }

    ofstream ansi_file(FileWithAnsiChar);
    ansi_file.write(CmpFileData_InAnsi.c_str(), CmpFileData_InAnsi.size());
    ansi_file << endl;

    return 0;
}
 

LVL 30

Expert Comment

by:Axter
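A Windows project can use the MultiByteToWideChar API function to convert an ANSI string to a UNICODE string.
Example (the project must be a UNICODE build, so that CString holds wchar_t):

void Function(void)
{
   char dataBuff[] = "abcdefghijklmnopq";
   int SrcLen = (int)strlen(dataBuff);

   // First call with no output buffer: returns the number of wide characters needed.
   int Len = MultiByteToWideChar(CP_ACP, 0, dataBuff, SrcLen, NULL, 0);

   CString tmpStr;
   // GetBufferSetLength takes a length in characters, not bytes.
   wchar_t* pwsz = tmpStr.GetBufferSetLength(Len);
   MultiByteToWideChar(CP_ACP, 0, dataBuff, SrcLen, pwsz, Len);
   tmpStr.ReleaseBuffer(Len);
}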



LVL 30

Expert Comment

by:Axter
For WIN32 code, the WideCharToMultiByte API function can be used to convert a UNICODE string to an ANSI string.
Example:


      const wchar_t WideChrData[] = L"Hello World";
      char AnsiBuffer[255];
      WideCharToMultiByte(CP_ACP, 0, WideChrData, wcslen(WideChrData)+1, AnsiBuffer , sizeof(AnsiBuffer), NULL, NULL);
      CString Msg = AnsiBuffer;
      AfxMessageBox(Msg);

LVL 30

Accepted Solution

by:
Axter earned 500 total points
As you can see in the above examples, to declare a wide character string literal, you need to prefix it with an L.

const wchar_t WideChrData[] = L"Hello World";
wstring Data1 = L"Hello1, my name is Axter";

The L tells the compiler that the string literal is a wide character string.

In a UNICODE project, you do not need to make *ALL* strings UNICODE.
Some strings will have to stay ANSI strings.
For example, to open a file that contains wide character strings with the standard stream classes, you need to pass the file name as an ANSI string.

const char * FileWithWideChar = "widecharfile.txt";  //This is an ANSI string
wofstream wfile(FileWithWideChar);
wfile << L"Hello World" << endl;

So although you may have to modify your original code by prefixing the string literals with L, you don't necessarily want to do this to all your string literals.

If this is a Windows application, in the future you may want to consider using the _T() macro and the _tcs??? functions such as _tcsstr, _tcscpy, _tcslen, etc.
This allows for minimal modifications when converting ANSI string code to UNICODE projects.

LVL 17

Expert Comment

by:rstaveley
Axter, SetupTestEnvironment is a bit misleading, because you don't actually generate a wide character file.
--------8<--------
#include <iostream>
#include <fstream>
#include <string>
#include <sys/types.h>
#include <sys/stat.h>
using namespace std;

int main()
{
      const string filename = "three.dat";
      wstring wstr = L"123";
      wofstream fout(filename.c_str());
      fout << wstr;
      fout.close();
      struct stat s;
      if (stat(filename.c_str(),&s) == 0)
            wcout << L"File for \"" << wstr << L"\" is " << s.st_size << L" bytes long\n";
}
--------8<--------

The external representation is char rather than wchar_t. Only internally do you get your wchar_t. Externally they are not wide character files :-(
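
If a file that really holds wchar_t data is wanted, one possible approach is to bypass the wide stream conversion and write the raw elements through a binary narrow stream. The sketch below assumes that is acceptable for the application; the helper names and the file name used with them are invented for illustration, and the resulting file is only readable on a platform with the same wchar_t size and byte order.

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Hypothetical helper: writes the in-memory wchar_t elements of text unchanged.
// Returns the number of bytes written, or 0 if the file could not be written.
std::size_t WriteRawWide(const char* filename, const std::wstring& text)
{
    std::ofstream out(filename, std::ios::binary);
    if (!out)
        return 0;
    const std::size_t bytes = text.size() * sizeof(wchar_t);
    out.write(reinterpret_cast<const char*>(text.data()),
              static_cast<std::streamsize>(bytes));
    return out ? bytes : 0;
}

// Hypothetical helper: reads such a file back, one wchar_t at a time.
std::wstring ReadRawWide(const char* filename)
{
    std::ifstream in(filename, std::ios::binary);
    std::wstring text;
    wchar_t ch;
    while (in.read(reinterpret_cast<char*>(&ch), sizeof ch))
        text += ch;
    return text;
}
```

Writing L"123" this way gives a file of 3 * sizeof(wchar_t) bytes instead of the 3 bytes the wofstream version produces.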

LVL 6

Expert Comment

by:msjammu
Looking at your profile, I assume you are a Windows programmer and are talking about the VC++ environment.

Try using the TEXT() macro and the TCHAR data type. There are also separate functions for UNICODE strings, such as:

size_t __cdecl wcslen(const wchar_t *);

which calculates the length of a Unicode string, and many more.

Otherwise it depends on what your program requires.

Regards,
msjammu

LVL 3

Author Comment

by:RJSoft
My apologies, I need to be more specific.

Note: in the post below, when I say wchar I am not sure whether it should be TCHAR, wchar or wchar_t.
I am using VC++ 6.0 and have seen TCHAR and the L macro, but I am wondering which to use.

What I have is many classes with many functions that use char arrays instead of CString or wchar arrays.
(I know, I know, stupid.)

Since I did not really understand back then how to use CString, I settled for the string.h functions,
mainly strcpy and strcat.

I think that I would be happy with replacing all the many, many instances of char with wchar or TCHAR or whatever,
as I assume arrays of one of these data types can hold Unicode values.

I want this so my applications can be used internationally.

The main issue is the many dialog classes. Since I did not use CString and I liked controlling the values to and from dialog child objects, I used GetWindowText or GetText to obtain the value and SetWindowText to set static, edit, listbox controls, etc. This may work to my advantage, though.

example:
MyEditBox.GetWindowText(CharString,MYMAXVALUE);
MyListBox.GetText(Index, CharString);

or

strcpy(ss,"fun fun fun!");
Dlg.MyStaticText.SetWindowText(ss);
Dlg.DoModal();

etc...

MYMAXVALUE is a largish value that I feel confident won't be a problem.

After the dialog instance I would take that value CharString and copy it to some other variable with
strcpy(Other,Dlg.CharString); along with strcat etc...

So, I am looking for a way to simplify the change. I was wondering if I could use typedef somehow?
My other thought is to have my own string.h replacement, a Mystring.h class in which I would use wchar and the L macro?

Some of my classes output text to a large static area using DrawText. I am wondering if Unicode will affect that. Also, will newlines '\n' be interpreted correctly? (Probably they will.)

As far as file routines go, I use fopen, fwrite and fread, in binary mode only. So I see no problem there.
My applications don't depend on reading any pre-existing file. There are not many file routines, so I would be happy doing find/replace on them. Most of them write and read a structure I define, so in reality I would only need to change each of the char array structure members to a wchar array.
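
To make the structure idea concrete, here is a minimal sketch of a struct whose text members are fixed-size wchar_t arrays, written and read back in binary mode with fwrite and fread. The Record layout, the array sizes and the function names are assumptions made up for this example, not anything taken from the actual project.

```cpp
#include <cstdio>
#include <cwchar>

// Hypothetical record layout: fixed-size wide arrays keep the struct plain data.
struct Record
{
    int     a;
    int     b;
    wchar_t as[32];
    wchar_t bs[32];
    int     flag;
};

// Writes one record in binary mode; returns true on success.
bool SaveRecord(const char* filename, const Record& rec)
{
    std::FILE* fp = std::fopen(filename, "wb");
    if (!fp)
        return false;
    const bool ok = std::fwrite(&rec, sizeof rec, 1, fp) == 1;
    return std::fclose(fp) == 0 && ok;
}

// Reads one record back; returns true only if a whole record was read.
bool LoadRecord(const char* filename, Record& rec)
{
    std::FILE* fp = std::fopen(filename, "rb");
    if (!fp)
        return false;
    const bool ok = std::fread(&rec, sizeof rec, 1, fp) == 1;
    std::fclose(fp);
    return ok;
}
```

One thing to keep in mind with this approach: files written by the old char build will not match the new layout, because the arrays take up more bytes once they hold wchar_t.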

Also, I am not sure exactly which Windows APIs are affected by changing all the references from char arrays to wchar or TCHAR arrays. I am hoping none.

In order to export my applications to other countries I need to use the Unicode character set, both for input from the user and for text displayed to the user (my application-specific text).

I appreciate any opinions, gotchas, etc.
RJ

LVL 30

Expert Comment

by:Axter
>>So, I am looking for a way to simplify the change. I was wondering if I could use typedef somehow?

The best way to simplify the change is to convert all your char* strings to either CString or TCHAR.

LVL 30

Expert Comment

by:Axter
In an MFC-type project, you should try to use CString as much as possible.  In general, it's very efficient, and it has a lot of functionality.

You can even use CString with the example code you posted.

example:
CString MyData;
MyEditBox.GetWindowText(MyData.GetBuffer(MYMAXVALUE) ,MYMAXVALUE);
MyData.ReleaseBuffer();

MyListBox.GetText(Index, MyData);//Can be used with CString directly

or

MyData = _T("fun fun fun!"); //Notice the use of _T() macro
Dlg.MyStaticText.SetWindowText(MyData );
Dlg.DoModal();


Using the above method with the _T() macro makes your code easy to compile in both ANSI and UNICODE projects, with little to no modification.

LVL 3

Author Comment

by:RJSoft
Oops. I should have concentrated on your post better Axter.

>>So although you may have to modify your original code by prefixing the string literals with L, you don't necessarily want to do this to all your string literals.

I don't really see a problem with this unless something like fopen could not take Unicode. Is this correct?

>>If this is a Windows application, in the future you may want to consider using _T() macro and using _tcs??? functions like _tcsstr, _tcscpy, _tcslen, etc.. This allows for minimal modifications when converting ANSI string code to UNICODE projects.

OK. So I should do a simple find/replace here. I guess I could not get away with a typedef here? I don't know if a typedef can change something like _tcscpy to strcpy. Or a #define?

I have Jeff Prosise's book, so I am familiar with _T(), but the book is not in front of me now.

But mostly I believe you're on track...
Any comments are appreciated.
RJ

LVL 30

Expert Comment

by:Axter
I strongly recommend that you convert your code to CString and get a lot more familiar with this class.
I guarantee it will be worth your time, and you'll be surprised at the functionality this class has.

I'm not a big MFC fan, and IMHO most of the MFC classes that emulate the classes in the STL are really poor.
CArray, CMap, .....
Their implementation leaves a lot to be desired.

However, IMHO, the CString class is one of the best classes in MFC, and IMHO it's even better than the std::string class.

(IMHO) CString is one class Microsoft got right!

LVL 30

Expert Comment

by:Axter
>>I don't really see a problem with this unless something like fopen could not take Unicode. Is this correct?
For the most part, in an MFC application, that should be the only type of code you have to worry about.


>> I don't know if typedef can modify something like _tcscpy to strcpy? or #define???

A typedef can't, because it only names types. A #define could (that is how tchar.h maps _tcscpy onto strcpy or wcscpy), but I wouldn't recommend rolling your own.  You can just do a search through all your project files using the VC++ Find in Files tool.

I *think* you can also find them all if you temporarily comment out all the str??? functions in the string.h header.

LVL 3

Author Comment

by:RJSoft
Cool.
Yes. I got familiar with CString along the way and agree with you. But since I had already nailed my coffin shut in the direction I took, I found it hard to change gears. I even had an absolute need to format a string using CString's Format to express float values.

Next project for sure.

I guess I am lazy. But I am mostly sure it will take me all day to replace string literals with a leading L.

It is definitely going to take some time replacing the char with wchar. (I assume I should use wchar rather than TCHAR (not sure)?)

The char change is gonna be a pain...

But I guess that is the proper way to do it. So I've got a lot of replacing to do...
 
Thanks
RJ

LVL 3

Author Comment

by:RJSoft
Also another issue that I am reminded of.

When I first started using VC I was not sure why, but I swear I could not put CString variables in a struct to be read and written using fopen/fread/fwrite.

Ex.

struct SS
{
int a;
int b;
CString as;
CString bs;
int flag;
}Inst;

Then when I loaded values into Inst from some user input and stored it as a file, I believe I remember that I could not read the values back correctly.

Inst.a=1;
Inst.b=2;
Inst.as="test";
Inst.bs="this";
Inst.flag=0;

fopen...
fwrite(&Inst,sizeof(Inst),1,fptr);
fclose...
fopen...
fread(&Inst,sizeof(Inst),1,fptr);
fclose...

The data would be corrupt. I was not sure why, but it seemed that the CString was somehow messed up,
maybe the size. But I could be misquoting myself, because this was a while back when I made the decision to avoid CString.

It could also have been that I was trying to fwrite and fread a single CString with bad results; I don't remember.

This is also important because items are saved and used in structures.

RJ

LVL 30

Expert Comment

by:Axter
>> ( I assume I should use wchar rather than TCHAR (not sure)?).

I recommend that you use TCHAR instead of wchar_t if this is an MFC project.

If you use TCHAR, then you can easily convert your program from ANSI to UNICODE, and then back to ANSI if need be.

If you use wchar_t, changing your code back to ANSI will require the same amount of work as converting it to UNICODE.

>>The data would be corrupt. I was not sure why but it seemed that the CString was somehow messed up.
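That's correct.  You cannot read directly from a file into an object that contains non-POD types.
A CString is a non-POD type: it stores a pointer to its character buffer, so fwrite saves the pointer value rather than the text.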

>>This is also important because items are saved and used in structures.
For this type of requirement, it would not be good to use CString unless you create an operator>>() and operator<<() for your struct.

You could use TCHAR arrays instead of char arrays in your struct.
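
For the case where the struct does hold string objects, the explicit serialization idea might look roughly like the sketch below. It swaps in plain functions built on fopen/fwrite/fread instead of stream operators, and uses std::wstring as a stand-in for CString so that it compiles without MFC; NamedRecord and all the function names are hypothetical.

```cpp
#include <cstdio>
#include <string>

// Hypothetical record that owns a string object (std::wstring stands in for CString).
struct NamedRecord
{
    int          a;
    std::wstring as;
};

// Writes a length prefix followed by the characters themselves.
static bool WriteString(std::FILE* fp, const std::wstring& s)
{
    const unsigned long len = static_cast<unsigned long>(s.size());
    if (std::fwrite(&len, sizeof len, 1, fp) != 1)
        return false;
    return len == 0 || std::fwrite(s.data(), sizeof(wchar_t), len, fp) == len;
}

// Reads the length prefix, then that many characters.
static bool ReadString(std::FILE* fp, std::wstring& s)
{
    unsigned long len = 0;
    if (std::fread(&len, sizeof len, 1, fp) != 1)
        return false;
    s.resize(len);
    return len == 0 || std::fread(&s[0], sizeof(wchar_t), len, fp) == len;
}

bool SaveNamedRecord(const char* filename, const NamedRecord& rec)
{
    std::FILE* fp = std::fopen(filename, "wb");
    if (!fp)
        return false;
    const bool ok = std::fwrite(&rec.a, sizeof rec.a, 1, fp) == 1
                 && WriteString(fp, rec.as);
    return std::fclose(fp) == 0 && ok;
}

bool LoadNamedRecord(const char* filename, NamedRecord& rec)
{
    std::FILE* fp = std::fopen(filename, "rb");
    if (!fp)
        return false;
    const bool ok = std::fread(&rec.a, sizeof rec.a, 1, fp) == 1
                 && ReadString(fp, rec.as);
    std::fclose(fp);
    return ok;
}
```

The point of the design is that each member is written field by field, so the file contains the characters rather than whatever pointer the string object happens to hold.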

LVL 3

Author Comment

by:RJSoft
Thanks.

RJ

LVL 6

Expert Comment

by:msjammu
There are, of course, certain disadvantages to using Unicode. First and foremost, every string in your program will occupy twice as much space. In addition, you’ll observe that the functions in the wide-character run-time library are larger than the usual functions. For this reason, you might want to create two versions of your program, one with ASCII strings and the other with UNICODE strings. The best solution would be to maintain a single source code file that you could compile for either ASCII or UNICODE.

That’s a bit of a problem, though, because the run-time library functions have different names, you’re defining characters differently, and then there’s that nuisance of preceding the string literals with an L.

One answer is to use the TCHAR.H header file included with MS VC++. This header file is not part of the ANSI C standard, so every function and macro defined in it is preceded by an underscore. TCHAR.H provides a set of alternative names for the normal run-time library functions that take string parameters, e.g. _tprintf and _tcslen. These are sometimes referred to as generic function names because they can refer to either the UNICODE or the non-UNICODE versions of the functions.

If an identifier named _UNICODE is defined and the TCHAR.H header file is included in your program, _tcslen is defined to be wcslen.

If _UNICODE isn’t defined, _tcslen is defined to be strlen.

And so on. TCHAR.H also solves the problem of the two character data types with a new data type named TCHAR. If the _UNICODE identifier is defined, TCHAR is wchar_t; otherwise, TCHAR is simply a char.
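
The mechanism described above can be imitated in a few lines, which may make it easier to see what the header is doing. This is a simplified sketch, not the contents of TCHAR.H: every MY_ and my_ name is invented here so the example builds on any compiler, whereas the real header uses _UNICODE, TCHAR, _T and _tcslen.

```cpp
#include <cstddef>
#include <cstring>
#include <cwchar>

// Simplified, hypothetical imitation of the TCHAR.H technique.
#ifdef MY_UNICODE
typedef wchar_t MY_TCHAR;
#define MY_T(x)   L##x            // paste an L in front of the literal
#define my_tcslen std::wcslen
#else
typedef char MY_TCHAR;
#define MY_T(x)   x               // leave the literal narrow
#define my_tcslen std::strlen
#endif

// Source written once against the generic names compiles either way.
std::size_t GreetingLength()
{
    const MY_TCHAR* greeting = MY_T("fun fun fun!");
    return my_tcslen(greeting);
}
```

Compiling once without and once with MY_UNICODE defined (for example -DMY_UNICODE on the command line) switches the character type, the literal and the length function together, and GreetingLength() reports 12 characters in both builds.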

Therefore, the choice is yours.

Regards,
msjammu
 

LVL 3

Author Comment

by:RJSoft
Good one. msjammu.
Thanks
RJ

LVL 3

Author Comment

by:RJSoft
My notes;

POD = plain old data.

Basically the reason it can be written to file is that the data is contiguous, whereas a non-POD class item (like CString) has things like constructors and a pointer to a separately allocated buffer, so its data is not all inside the object.

To write a structure to file this way requires that all elements be contiguous, i.e. be POD. There are conditions that a class object can meet to be considered a POD.
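
On a newer compiler the "is it safe to fwrite this?" question can be put to the compiler directly. The sketch below assumes C++11 or later, so it would not build with VC++ 6.0, and it uses std::string as a stand-in for CString; both struct names are made up for the example.

```cpp
#include <string>
#include <type_traits>

// Plain data: everything the record needs lives inside the struct itself.
struct PlainRecord
{
    int  a;
    char name[32];
};

// Not plain data: std::string (like CString) manages its own character
// storage, so copying the raw bytes of the struct does not copy the text safely.
struct OwningRecord
{
    int         a;
    std::string name;
};

// The compiler rejects the build if either assumption is wrong.
static_assert(std::is_trivially_copyable<PlainRecord>::value,
              "PlainRecord can be written with fwrite");
static_assert(!std::is_trivially_copyable<OwningRecord>::value,
              "OwningRecord needs explicit serialization");
```

Putting a static_assert like the first one next to any struct that gets passed to fwrite would catch the day someone adds a string object to it.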

ref.
http://www.tempest-sw.com/cpp/draft/ch06-classes.html