formula420

asked on

Unicode, UTFs, wchar_t and C programming

I have a few questions about Unicode and the UTFs that I hope some of you can answer.

I think I understand the differences between Unicode and its various encoding forms, but I haven't really grasped how everything fits together.

For example, how do I make my program encode wide-char strings as UTF-16 instead of UTF-8, which it seems to be doing by default (gcc 3.4.6)? On Windows, the same program compiled with Visual Studio .NET 2002 encodes its wide strings as UTF-16 or UCS-2, so how would I make it use UTF-8 or one of the other encodings?

When do I need to use the functions declared in wchar.h in place of the normal string.h functions (e.g. wcslen() vs. strlen())? I've seen code examples where plain old printf() and strcpy() were used with wide strings. When do I need their wide-string equivalents?

I don't think I fully understand the relationship between locales and UTF8, either. Can someone explain it to me?

I'm sorry for the laundry list of questions. I'm very new to Unicode, and I can tell it's a subject every programmer should be confident with, yet it seems fairly complex. I'd really appreciate any help.
bpmurray

Yes, it's quite a list, touching on most aspects of this stuff.

First, on Windows a wide string is UTF-16, i.e. a string of 16-bit code units, although some implementations handle surrogates (where two 16-bit units encode one character) and some don't. Note that UTF-16 and UCS-2 are identical except for the fact that UCS-2 doesn't encode surrogates.

gcc sets the value of a wchar_t to the size of that unit, which is platform-dependent. So it could actually be UTF-32. If gcc is encoding it as UTF-8, it's doing something it shouldn't. Try -fwide-exec-charset=UTF-16 to see if that helps. The MS compilers don't use "wide" for anything but UTF-16 strings, so there you'll have to use the multi-byte functions for UTF-8 or the other encodings.
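A quick way to check what your compiler actually did is to dump the raw bytes of a wide literal. A minimal sketch (standard C; nothing assumed beyond the flag mentioned above):

#include <stdio.h>
#include <wchar.h>

int main(void)
{
    wchar_t ws[] = L"test";
    unsigned char *bytes = (unsigned char *)ws;
    size_t i;

    printf("sizeof(wchar_t) = %zu bytes\n", sizeof(wchar_t));

    /* dump the raw bytes so the storage encoding is visible */
    for (i = 0; i < sizeof ws; i++)
        printf("%02x ", bytes[i]);
    printf("\n");
    return 0;
}

Build it once with plain gcc and once with -fwide-exec-charset=UTF-16 and compare the byte dumps.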

Well, if you're using wide strings, you need the wchar.h functions. If the byte-oriented standard calls are used on Unicode data, the zero bytes inside each wide character act as string terminators, so they won't work reliably.
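To see why, a minimal sketch: the byte-oriented strlen() stops at the first zero byte inside the very first wide character, while wcslen() counts whole wide characters.

#include <stdio.h>
#include <string.h>
#include <wchar.h>

int main(void)
{
    wchar_t wide[] = L"test";

    printf("wcslen: %zu\n", wcslen(wide));         /* 4 */
    printf("strlen: %zu\n", strlen((char *)wide)); /* 1 on little-endian: 't', then a zero byte */
    return 0;
}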

Locales and encodings are quite different. For example, if you're in the US, your currency symbol is "$", the date format is mm/dd/yyyy and 1,234.56 is a legitimate number; in Germany, these would be €, dd.mm.yyyy and 1.234,56. This information is associated with the locale. A character encoding is used to store text: if you use a universal encoding such as UTF-8/16/32 you can encode text from multiple locales, while a codepage such as Shift-JIS is generally specific to Japanese.
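As a minimal sketch of that separation: setlocale() picks up the locale's conventions from the environment without changing how any string is stored.

#include <stdio.h>
#include <locale.h>

int main(void)
{
    struct lconv *lc;

    setlocale(LC_ALL, "");   /* use the environment's locale, e.g. de_DE or en_US.UTF-8 */
    lc = localeconv();

    printf("decimal point: \"%s\"\n", lc->decimal_point);
    printf("thousands sep: \"%s\"\n", lc->thousands_sep);
    return 0;
}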
formula420

ASKER

Thanks a lot for the reply. Just a few more questions, though...


"First, a wide string is always UTF16, i.e. a string of 16-bit characters"
So, UTF8 and UTF32 do not use wchar_t, but just normal c style char* strings with surrogates to bind them into one logical character?

"gcc sets the value of a wchar_t to the size of that unit"
What do you mean by "the size of that unit"? Do you mean the size of a wchar_t on my platform (which is, btw, 4 bytes) or the size of the Unicode character itself?

"Locales and encodings are quite different."
Am I correct in believing that setting the locale has no effect on how the characters are encoded and displayed, but is used to figure out how to format that output based on the customs of the selected locale? I've seen several examples with setlocale(LC_ALL, "") in them, and the explanation is normally along the lines of "make sure to call setlocale() at the beginning of your program".

Again, I really appreciate the help

BigRat

UTF-32 encodes each character in 32 bits and is not supported on any commercial O/S.

On Windows, wchar_t is normally two bytes; the generic TCHAR type falls back to a one-byte char if UNICODE is switched off.

The "locale" concept includes a character set for 8-bit characters. This is known through the code page identifier. On Windows platforms which support Unicode, ie: WinNT, Win2K, Win2K3 and XP 8-bit character strings are translated to Unicode before being displayed by the procedure MultibyteToWideChar. There fore the encoding of the codepage is important.  On the non-Unicode machines, Unicode is supported only in COM not for display and the procedure WideCharToMultiByte translates the Unicode characters back to code page characters before being displayed.

UTF-8 has become an important encoding thanks to XML, since it allows Unicode characters to be carried in 8-bit documents.
bpmurray

The size of the wchar_t is determined by the platform - as you point out, it's 32 bits on your machine and, as BigRat mentions, on Windows you get a simple char if you don't define UNICODE. In C/C++ any memory position can be pointed to by a char*, so you're sort of right, but really you would use an eight-bit value (unsigned char) for a simple 8-bit character set, a 16-bit value (unsigned short) for UTF-16, or a 32-bit value (unsigned int) for UTF-32, so char* isn't really appropriate for those. As a clarification on wchar_t, here's a link to a doc by Markus Scherer, one of IBM's experts in this area: http://icu.sourceforge.net/docs/papers/unicode_wchar_t.html.

While UTF-32 isn't formally supported by any OS in its UI, etc., the fact that wchar_t is often 4 bytes means that many systems are effectively using it by default. Remember that UTF-32 simply zero-extends the UTF-16 characters in the BMP (Basic Multilingual Plane); non-BMP characters are handled differently.

Well, the locale is really about the names of the days, numeric formatting, address format, collation (sorting sequences), etc. The Unicode site has the CLDR, which defines the locale information for POSIX, and these all assume the use of UTF-8. However, a locale can be associated with a default encoding, but that's a hint rather than being normative (mandatory). There are many encoding schemes, but they fall into four basic types (there's a small UTF-8 sketch after this list):
   - A single 8-bit value represents a character, e.g. ASCII (which is really only 7 bits), ANSI Latin-1 or Windows 1252
   - A single fixed-size value represents a character, e.g. Unicode UTF-16, UTF-32, ISO 10646
   - A variable-length sequence of bytes, where characters take one, two or more bytes, mostly used for Asian scripts, e.g. Windows 932 (Japan), UTF-8
   - A switched encoding, where a specific escape sequence or similar switches between encodings, e.g. ISO-2022-JP
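To make the variable-length case concrete, here's a minimal sketch of how UTF-8 packs one code point into one to four bytes (utf8_encode is just an illustrative name, not a library call):

#include <stdio.h>

/* encode one code point; returns the number of bytes written, 0 on error */
static int utf8_encode(unsigned long cp, unsigned char out[4])
{
    if (cp < 0x80) {                 /* 1 byte: plain ASCII */
        out[0] = (unsigned char)cp;
        return 1;
    } else if (cp < 0x800) {         /* 2 bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    } else if (cp < 0x10000) {       /* 3 bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    } else if (cp < 0x110000) {      /* 4 bytes: beyond the BMP */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                        /* not a valid code point */
}

int main(void)
{
    unsigned char buf[4];
    int i, n = utf8_encode(0x20AC, buf);   /* U+20AC, the euro sign */

    for (i = 0; i < n; i++)
        printf("%02X ", buf[i]);           /* prints: E2 82 AC */
    printf("\n");
    return 0;
}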

For good information on this stuff, have a look at the Unicode web site (www.unicode.org) or the ICU site (icu.sourceforge.net).
Great, thanks for all the info.

The problem that prompted me to start this thread came up while sending a wchar_t string from the client of my program (written in C++, using MFC, compiled with UNICODE and _UNICODE) to the server app (written in C, gcc 3.4.6). Whenever I sent a wchar_t string from the client to the server and printed it to the terminal, it showed up as ?'s.

Whenever I concatenated the string from the client onto a wide string on the server and printed it with fwprintf, it showed up as "SELECT * FROM users WHERE name='???????'", unable to display the part sent from the client (even though the string was just "test").

When I print the contents of that query (wchar_t) string in gdb, before concatenating the username string from the client onto it, I get this:
(gdb) print qstr1
$1 = {83, 69, 76, 69, 67, 84, 32, 42, 32, 70, 82, 79, 77, 32, 117, 115, 101, 114, 115, 32, 87, 72, 69, 82, 69, 32, 110, 97, 109, 101, 61, 39, 0}

("SELECT * FROM users WHERE name='")

I'm not exactly sure how to tell which encoding this is, but since I would probably mistake it for simple ASCII if I didn't know better - and seeing as that's the beauty of UTF-8 if you're an English speaker - I assumed in an earlier post that gcc was defaulting to UTF-8. qstr1 is declared and initialized like this:

wchar_t qstr1[]=L"SELECT * FROM users WHERE name='";



Printing the contents of the username sent from the client yields:
(gdb) print username
$2 = {6619252, 7602291, 0, 0, 0, 0, -842150451, -842150451, 0, 134685416, 134686800, 196609, 3223598, 1967406706, 97, 2097154}

This looks like gibberish to me, which is what I would expect from UTF-16. I can still find my text hidden away in there:
(gdb) print ((char*)username)[0]
$3 = 116 't'
(gdb) print ((char*)username)[2]
$4 = 101 'e'
(gdb) print ((char*)username)[4]
$5 = 115 's'
(gdb) print ((char*)username)[6]
$6 = 116 't'


Compiling with -fwide-exec-charset=UTF-16 caused different problems. This code prints out "????????" for some reason:

wchar_t qstr1[]=L"SELECT * FROM users WHERE name='";
flockfile(stdout);
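/* note: with a wchar_t* argument the conversion should be %ls; plain %s
   in fwprintf expects a multibyte char* string, which may contribute to
   the garbled output */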
fwprintf(stdout,L"\n %s",qstr1);
fflush(stdout);
funlockfile(stdout);

And now printing its contents in gdb shows something more like the username sent from the client app than the ASCII text it looked like before:

(gdb) print qstr1
$1 = {5504767, 4980805, 4390981, 2097236, 2097194, 5374022, 5046351, 7667744, 6619251, 7536754, 5701664, 4522056, 4522066, 7208992, 7143521,
  3997797, 39}


bpmurray, I'm gonna give you the points for answering my original questions, but I'll throw in the rest of my points (100) if you can tell me what's causing this problem. Again, I really appreciate the info given so far by both bp and BigRat.
OK, you have:
$1 = {5504767, 4980805, 4390981, 2097236, 2097194, 5374022, 5046351, 7667744, 6619251, 7536754, 5701664, 4522056, 4522066, 7208992, 7143521,
  3997797, 39}

This is (in hex - even though I don't play with charsets any more I still think in hex):
0x0053FEFF, 0x004C0045, 0x00430045, 0x00200054, 0x0020002A,  0x00520046, 0x004D004F, 0x00750020, 0x00650073, 0x00730072, 0x00570020, 0x00450048, 0x00450052, 0x006E0020, 0x006D0061, 0x003D0065, 0x00000027

Can you see the pattern? They're 32-bit values that actually contain 2 x 16-bit values. This ties directly back to Markus's article on the ICU site, where he warns against using wchar_t - this is one of the cases where unsigned short would be more appropriate. Anyway, we have to swab the values (swap each 16-bit half with its neighbour) and we get:
<Byte-Order Mark> SELECT * FROM users WHERE name='

This indicates to me that your wchar_t is 32 bits, but you're telling gcc to use 16-bit values. Have you tried -fwide-exec-charset=UTF-32 instead? I have a sneaking suspicion that it'll make this work much better. Just to verify this, let's look at the username value above:
6619252, 7602291, 0 = 0x00650074, 0x00740073, NULL = et,ts.
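In code, that recovery is trivial; a little sketch (assuming little-endian x86, as the dump implies, so the low 16-bit half of each value comes first in memory):

#include <stdio.h>

int main(void)
{
    /* the first units of the username dump above */
    unsigned long v[] = { 6619252, 7602291, 0 };
    int i;

    for (i = 0; v[i] != 0; i++) {
        unsigned int first  = v[i] & 0xFFFF;          /* low half: first UTF-16 unit in memory */
        unsigned int second = (v[i] >> 16) & 0xFFFF;  /* high half: second unit */
        printf("%c%c", first, second);                /* prints: test */
    }
    printf("\n");
    return 0;
}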

OK, here's a potential solution: on the gcc end, stop using wchar_t. Use typedef unsigned short MyWideChar instead, together with -fwide-exec-charset=UTF-16. That will force compatibility between the two platforms.
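A minimal sketch of that suggestion, assuming gcc's -fshort-wchar flag so that L"..." literals become 16-bit units (MyWideChar and myWideLen are names made up for this post, not standard anything; note the follow-up below about glibc's wide functions not accepting 16-bit strings):

/* sketch only: build with gcc -fshort-wchar -fwide-exec-charset=UTF-16LE */
#include <stdio.h>

typedef unsigned short MyWideChar;

/* hand-rolled length: glibc's wcslen() expects 4-byte wchar_t,
   so it can't be used on these 16-bit strings */
static size_t myWideLen(const MyWideChar *s)
{
    const MyWideChar *p = s;
    while (*p)
        p++;
    return (size_t)(p - s);
}

int main(void)
{
    static const MyWideChar query[] = L"SELECT * FROM users WHERE name='";

    printf("units: %u\n", (unsigned)myWideLen(query));   /* 32 */
    return 0;
}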
OK, using -fwide-exec-charset=UTF-32 fixed the problem I was having with printing wide strings created within the server app.

The problem I'm finding with using unsigned short instead of wchar_t and compiling with -fwide-exec-charset=UTF-16 (I also tried -fshort-wchar while keeping wchar_t instead of unsigned short) is that I can't use the wide-char functions like fwprintf() or wcslen() (I guess my version of glibc always expects wchar_t to be 4 bytes wide). Even simple expressions like fwprintf(stdout, L"test") or fwprintf(stdout, "test") output gibberish.

I've googled around a bit and, from what I've seen, the consensus seems to be that unless you want to write your own Unicode library, changing the size of wchar_t isn't very useful. I started getting all sorts of memory corruption when I used unsigned short* strings instead of wchar_t* and passed them into the wide-character utility functions.

I'm thinking that since the server app rarely needs to treat the data payload in the messages from the client as anything other than a blob of bytes, it'd be easier to just convert the strings from the client that I do need to read and manipulate to UTF-32. I haven't really looked into how to do that yet, but I'm sure it's a problem that's already been solved and documented. If you have any more thoughts on this, please let me know. I really appreciate the help you've given me.
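For what it's worth, a minimal, hypothetical sketch of that conversion (utf16_to_wchar is a made-up name; it assumes the payload arrives as native-byte-order UTF-16 units, i.e. little-endian on both ends, and that dst has room for n+1 wchar_t):

#include <stddef.h>

/* widen UTF-16 units into 4-byte wchar_t, folding surrogate pairs into
   single code points; returns the count written, excluding the terminator */
size_t utf16_to_wchar(const unsigned short *src, size_t n, wchar_t *dst)
{
    size_t out = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        unsigned long u = src[i];

        if (u >= 0xD800 && u <= 0xDBFF && i + 1 < n &&
            src[i + 1] >= 0xDC00 && src[i + 1] <= 0xDFFF) {
            /* high + low surrogate -> one code point beyond the BMP */
            u = 0x10000 + ((u - 0xD800) << 10) + (src[i + 1] - 0xDC00);
            i++;
        }
        dst[out++] = (wchar_t)u;
    }
    dst[out] = L'\0';
    return out;
}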
ASKER CERTIFIED SOLUTION
bpmurray
I've been out of town these past few days, and haven't until now been able to work on the program or test your solution. Works fine as far as I can tell, so I want to thank you again for your help. Definitely cleared up a lot of questions I had.