Convert a C program to Unicode

I need some directional help on how to convert a C program to Unicode. The program has no UI and pulls information from a PC to write to a text file, as a sort of discovery process. However, when it runs on Japanese or Russian machines it doesn't work properly. The spec, roughly, is...

- All incoming strings (from registry, windows api, wmi, WNet functions etc) should be able to cope with unicode
- The program should still work without installation on clean build Windows 95 or later
- No reference to MFC or other DLLs that might not be there is allowed

The existing program is about 4000 lines long, with most output strings set to char*, and uses CRT funcs like strncat, memset, malloc, free, memcpy, strtok, strlen, etc

1. What string types should I use instead of char / char * ? I would like to standardise on a single string type throughout the program if possible.
2. Should I continue to use malloc and just multiply all the string lengths by 2 ?
3. What string functions should I use in place of those CRT functions ?

At the moment I think some of the incoming data to the program (e.g. from WMI) is probably already in unicode or BSTR form. Some functions seem to declare incoming data as BSTR, some as WCHAR_T. It has intermediate calls to wcstombs, SysAllocString and the like

I would also like to know if    

        L"Win32_ComputerSystem"

is actually a unicode string or some other kind of string.

As you can probably tell my biggest problem is only basic knowledge of C programming with unicode so the more help and direction you can give me the better...

And it would be nice if I could wrap up the string conversion into functions away from the core logic, so any suggestions there would be appreciated.

thanks in advance
plq Asked:
Dariusz Dziara (Programmer) Commented:
1. Use the TCHAR, LPCTSTR and LPTSTR types. They evaluate to char, const char * and char *, or to wchar_t, const wchar_t * and wchar_t *, depending on whether the _UNICODE constant is defined.
2. You can also use the new operator, for example:
LPTSTR lpszText = new TCHAR[100];
and it will allocate the right number of bytes for either character type.
3. There are special macros for each (almost each) function.
Instead of strlen() use _tcslen(), which will be replaced by strlen() or wcslen() depending on whether the _UNICODE constant is defined.
And so on (read MSDN to find the correct macro names).
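On question 2 (malloc and multiplying lengths by 2), here is a minimal sketch of the usual pattern: count in characters and multiply by sizeof the character type rather than by a literal 2. It uses plain wchar_t and the wcs* functions so that it compiles without the Windows headers; with TCHAR you would write sizeof(TCHAR) and the _tcs* names instead. The function name wide_concat is made up for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/* Returns a newly malloc'd string holding a followed by b, or NULL if the
   allocation fails. The caller frees the result.
   wide_concat is a hypothetical helper name, not a CRT function. */
wchar_t *wide_concat(const wchar_t *a, const wchar_t *b)
{
    size_t len_a, len_b;
    wchar_t *out;

    assert(a != NULL && b != NULL);
    len_a = wcslen(a);              /* lengths are in characters, not bytes */
    len_b = wcslen(b);

    /* Character count (+1 for the terminator) times sizeof(wchar_t). */
    out = malloc((len_a + len_b + 1) * sizeof(wchar_t));
    if (out == NULL)
        return NULL;

    wcscpy(out, a);                 /* wide counterpart of strcpy  */
    wcsncat(out, b, len_b);         /* wide counterpart of strncat */
    return out;
}
```

The same rule applies to memset and memcpy: their size arguments are in bytes, so a character count has to be multiplied by the size of the character type.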
hoomanv Commented:
could be helpful
http://icu.sourceforge.net/
Dariusz Dziara (Programmer) Commented:
L"Win32_ComputerSystem" is a Unicode (wide-character) string.
BSTR is also a Unicode string, but it additionally stores the length of the string (before the first character, I think, but I am not sure now), so it can contain a '\0' character in the middle, if I remember correctly. I guess this is used for so-called marshalling (COM interfaces), when parameters are passed across a process or machine boundary.
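A small sketch of what the L prefix does, for anyone who wants to check it on their own compiler: the literal becomes an array of wchar_t rather than char. The names kClassName, class_name_chars and class_name_bytes are invented for this example. Note that sizeof(wchar_t) is 2 on Windows compilers but is commonly 4 elsewhere, so the byte size is written in terms of sizeof rather than as a fixed number.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* The L prefix makes this an array of wchar_t, not char. */
static const wchar_t kClassName[] = L"Win32_ComputerSystem";

/* Length in characters, excluding the terminator. */
size_t class_name_chars(void)
{
    size_t n = wcslen(kClassName);
    /* The array holds the characters plus one terminator,
       each occupying sizeof(wchar_t) bytes. */
    assert(sizeof kClassName == (n + 1) * sizeof(wchar_t));
    return n;
}

/* Size of the whole array in bytes, including the terminator. */
size_t class_name_bytes(void)
{
    return sizeof kClassName;
}
```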
Dragon_Krome Commented:
You need to do a little bit of research on internationalisation issues before you start, so you don't run into trouble.

http://en.wikipedia.org/wiki/Internationalization   this page provides more links to valuable resources on this matter.

http://www.i18nfaq.com/
Dariusz Dziara (Programmer) Commented:
There are also macros for converting between ANSI and Unicode strings, like T2, T2BSTR, A2W and so on. Search MSDN for "String Conversion Macros".
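If you would rather keep the conversion in a plain function, away from the core logic, something along these lines is one option. This is only a sketch using the standard mbstowcs, which converts according to the current C locale; on Windows, MultiByteToWideChar is the API-level counterpart and lets you name the code page explicitly. narrow_to_wide is a made-up name.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Converts a narrow (multi-byte) string to a newly malloc'd wide string
   using the current C locale. Returns NULL on allocation failure or if
   the input contains an invalid multi-byte sequence. Caller frees.
   narrow_to_wide is a hypothetical helper name. */
wchar_t *narrow_to_wide(const char *src)
{
    size_t cap, n;
    wchar_t *dst;

    assert(src != NULL);

    /* A wide string never needs more characters than the narrow one has
       bytes, so strlen + 1 is always a large enough capacity. */
    cap = strlen(src) + 1;
    dst = malloc(cap * sizeof(wchar_t));
    if (dst == NULL)
        return NULL;

    n = mbstowcs(dst, src, cap);
    if (n == (size_t)-1) {          /* invalid multi-byte sequence */
        free(dst);
        return NULL;
    }
    dst[cap - 1] = L'\0';           /* defensive: guarantee termination */
    return dst;
}
```

A matching wide_to_narrow built on wcstombs would follow the same shape, which keeps every conversion in one place.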
plq (Author) Commented:
Thanks for the feedback so far..

So I have to #define _UNICODE at the top.. ok. And then change all my char * and char declarations to be TCHAR etc as above .. ok too. And I'm OK with digging out or rewriting replacement functions for the CRT functions currently in use.
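For reference, the switching mechanism behind TCHAR can be sketched roughly as below. MY_UNICODE, my_tchar, MY_T, my_tcslen and text_length are hypothetical stand-ins (for _UNICODE, TCHAR, _T and _tcslen) so that the sketch compiles without the Windows headers; the real definitions live in tchar.h. One detail worth checking in your build: tchar.h keys off _UNICODE while windows.h keys off UNICODE, so both are normally defined together.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical stand-ins for _UNICODE, TCHAR, _T and _tcslen. */
#ifdef MY_UNICODE
typedef wchar_t my_tchar;
#define MY_T(x)   L##x          /* pastes an L prefix onto the literal */
#define my_tcslen wcslen
#else
typedef char my_tchar;
#define MY_T(x)   x
#define my_tcslen strlen
#endif

/* The same source line compiles against char or wchar_t,
   depending on whether MY_UNICODE is defined. */
size_t text_length(const my_tchar *s)
{
    assert(s != NULL);
    return my_tcslen(s);
}
```

Calling text_length(MY_T("Win32_ComputerSystem")) gives the same character count in either build; only the storage size of each character changes.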

Looking around the web it seems that a Unicode character in Japan might have a different meaning to a unicode character in Russia even if they have the same two byte character code? So I presume I need to output the LOCALE as well so the receiving backend can know what locale the incoming data is in ???

Also can you recommend functions to replace sprintf ?

And finally this is our code to write a file containing what up to now has been char data:

SetErrorMode(SEM_NOOPENFILEERRORBOX | SEM_FAILCRITICALERRORS);
HANDLE hFile = CreateFile(obj.outputfile, GENERIC_WRITE, FILE_SHARE_WRITE, NULL, TRUNCATE_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
if (hFile == INVALID_HANDLE_VALUE)
{
      hFile = CreateFile(mtpcacmdline.outputfile, GENERIC_WRITE, FILE_SHARE_WRITE, NULL, CREATE_NEW, FILE_ATTRIBUTE_NORMAL, NULL);
}
if (hFile != INVALID_HANDLE_VALUE)
{
      WriteFile(hFile, sbuffer, strlen(sbuffer), &dwBytesWritten, NULL);
      SetEndOfFile(hFile);
      CloseHandle(hFile);
}

Is WriteFile likely to work OK with UNICODE data, and does "CreateFile" need to specify that the file is a unicode file ?

thanks
hoomanv Commented:
> Also can you recommend functions to replace sprintf ?
if you use ICU library as I have described above it provides all these functionalities
http://icu.sourceforge.net/apiref/icu4c/ustdio_8h.html  ---> for unicode I/O like sprintf
http://icu.sourceforge.net/apiref/icu4c/ustring_8h.html  ---> for Strings and Character Iteration like strlen , strcat
Dariusz Dziara (Programmer) Commented:
Use the _stprintf() macro (just find the sprintf() description in MSDN and below it you will find what's necessary: the _stprintf() macro).
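As a rough illustration of the wide side of that macro, here is a sketch using the standard C swprintf, which takes the buffer capacity (in characters) as its second argument. As far as I know, older Microsoft compilers shipped a swprintf without that size argument, so check the documentation for the compiler you build with. format_os_line is an invented name, and %ls is used because it denotes a wide string in the standard wide printf family.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Formats "name (build N)" into buf, which holds cap characters.
   Returns the number of characters written (terminator excluded),
   or a negative value if the result does not fit.
   format_os_line is a hypothetical helper name. */
int format_os_line(wchar_t *buf, size_t cap, const wchar_t *name, int build)
{
    assert(buf != NULL && name != NULL && cap > 0);
    /* cap counts characters, not bytes. */
    return swprintf(buf, cap, L"%ls (build %d)", name, build);
}
```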

I also think (I hope that I am not wrong) that LOCALE matters only for multi-byte (code page) strings, where the same byte value can mean different characters on different machines. As far as I know this is not the case for UNICODE strings, where each code has one meaning everywhere. Windows stores them as 2-byte units (65,536 values), and characters beyond that range are encoded as a pair of such units (a surrogate pair). Let me know if I am wrong :)

"CreateFile()" works fine with binary data so it will also work fine with UNICODE which is special case of binary data.
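One thing to watch in the WriteFile call from the question is the length argument: it is a byte count, so strlen(sbuffer) has to become a character count multiplied by the character size. The sketch below shows that arithmetic with portable stdio calls instead of WriteFile so it can be tried anywhere; wide_byte_count and write_wide_text are made-up names. It also writes a byte-order mark (U+FEFF) first, which is the usual convention for letting editors recognise the file as UTF-16. This assumes the raw wchar_t data is what you want on disk, which on Windows means 2-byte UTF-16 units.

```c
#include <assert.h>
#include <stdio.h>
#include <wchar.h>

/* Bytes occupied by the characters of s, terminator excluded.
   This is the value to pass where the narrow code passed strlen(). */
size_t wide_byte_count(const wchar_t *s)
{
    assert(s != NULL);
    return wcslen(s) * sizeof(wchar_t);
}

/* Writes a byte-order mark followed by the raw wide characters.
   fp should be opened in binary mode. Returns 0 on success, -1 on error.
   write_wide_text is a hypothetical helper name. */
int write_wide_text(FILE *fp, const wchar_t *s)
{
    const wchar_t bom = 0xFEFF;
    size_t bytes;

    assert(fp != NULL);
    bytes = wide_byte_count(s);

    if (fwrite(&bom, sizeof bom, 1, fp) != 1)
        return -1;
    if (bytes > 0 && fwrite(s, 1, bytes, fp) != bytes)
        return -1;
    return 0;
}
```

With WriteFile the equivalent change would be to pass the result of a byte-count calculation like wide_byte_count in place of strlen(sbuffer).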

plq (Author) Commented:
Excellent responses - thank you

I am sure there will be a few more in this TA from me in the next few days (hours)

thanks very much
plq (Author) Commented:
One more quickie - anything wrong with saying TCHAR * instead of LPTSTR - or does that sound silly to you ?
Dariusz Dziara (Programmer) Commented:
No, it is exactly the same, but frankly I've never used TCHAR * in place of LPTSTR ;)