
wchar_t question

I have a question about how wchar_t works.  I understand that wchar_t is either two or four bytes, depending on your system, and allows a much wider range of characters to be stored than would be permitted by char, which is limited to 00000000 to 11111111 (0 to 255).

But I'm trying to figure out exactly how wchar_t stores the data.  I tried a test case where I assign a value of È to a wide character variable.

When I output the value of the wchar_t, casting it to an int, I see that È is equivalent to 200.

Now, if I use a char* type, and do:

unsigned char* c = (unsigned char*) "È";

I find I am actually assigning two bytes to the character, and indeed strlen(c) will return a value of 2.

So I examine these two bytes individually:

printf("%d\n",*c);
printf("%d\n",*(c+1));

And I get:

195
136

Now, my question(s):

1. How does 195 and 136 combine to give me a value of 200?

2. Since it is possible to store a value of 200 in only one byte, why are the values spread out among two bytes here?

Commented:
>> unsigned char* c = (unsigned char*) "È";
Compiling this with Visual-C++ (VS 2003) gives me a string that has only one char with the value of 200.
It looks like you have saved your source code in UTF-8 format and the compiler read it as ISO-8859-1 or something similar.

Commented:
>> 1. How does 195 and 136 combine to give me a value of 200?
This is a 2 byte UTF-8 encoding (http://en.wikipedia.org/wiki/UTF-8). You can decode it this way:
unsigned char ch0 = 195; // 0xc3
unsigned char ch1 = 136; // 0x88
int ch = ( ( ch0 & 0x1f ) << 6 ) | ( ch1 & 0x3f );
printf( "ch:%d\n", ch ); // prints ch:200

>> 2. Since it is possible to store a value of 200 in only one byte, why are the values spread out among two bytes here?
As I said above, because the source file was obviously saved as UTF-8. In UTF-8 all character codes above 127 get encoded as 2 or more bytes (the highest bit of a byte is used to mark multi-byte encodings in UTF-8).

Commented:
I think the answer of x4u is correct, but as an addition, here is all the information about the È character: http://www.eki.ee/letter/chardata.cgi?dcode=200

Commented:
x4u,

>>  int ch = ( ( ch0 & 0x1f ) << 6 ) | ( ch1 & 0x3f );

Thanks for the information.  Can you explain the above line a little?  It seems this formula uses multiple bitwise operators to get 200 from the two values.  My question is,

What is the significance of 0x1f and 0x3f (31 and 63), and what is the significance of left bitwise shift 6?

Also, does this same formula decode every multi-byte UTF-8 sequence into a single character value?

Commented:
This line is from a function that I use to decode UTF-8 in Java, where it is used to decode the 2 byte encodings. The 0x1f and 0x3f constants are there to extract the data bits of the byte value and to mask out the UTF-8 marker bits. The 6 bit shift is there to shift the bits of the 1st byte into a higher position because the lower 6 bits are taken from the 2nd byte (0x3f is a 6 bit wide mask). These constants are derived from the UTF-8 standard (http://www.ietf.org/rfc/rfc3629.txt).

This table shows how unicode characters are to be encoded in 1 to 4 byte sequences with UTF-8:
Char. number range  |        UTF-8 octet sequence
--------------------+---------------------------------------------
0000 0000-0000 007F | 0xxxxxxx
0000 0080-0000 07FF | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

As you can see the 2 byte case uses 5 bits (0x1f) of the 1st byte and 6 bits of the 2nd (0x3f).

This is the code snippet that I use to decode the 1-3 byte encodings of UTF-8 (Java does not use the 4 byte encodings). It's not entirely standards-compliant as it also accepts and ignores invalid byte sequences.

int ch = buf[ idx++ ];
if( ( ch & ~0x7f ) != 0 )
{
    switch( ch >> 4 )
    {
        case 12: case 13:
            ch = ( ( ch & 0x1f ) << 6 ) | ( buf[ idx++ ] & 0x3f );
            break;
        case 14:
            ch = ( ( ch & 0x0f ) << 12 ) | ( ( buf[ idx++ ] & 0x3f ) << 6 );
            ch |= buf[ idx++ ] & 0x3f;
            break;
        default:
            ch &= 0xff;
            break;
            // throw new UTFDataFormatException( "pos: " + idx + ", max: " + max );
    }
}

Commented:
x4u,

Thank you very much for that code snippet.  You are really helping me to understand how this works.  I just have a few more questions about this.  First let me see if I understand correctly.  Here is your code with my comments/alterations added in to reflect my understanding of it:

c = NEXTBYTE

if ((c & ~0x7f) != 0) {          // check that c is between 128 and 255
    switch (c >> 4) {            // shift bits in c to the right by 4
        case 12:                 // decode for values 192 through 207 (same procedure as 208 through 223)
        case 13: { c = ( ( c & 0x1f ) << 6 ) | ( NEXTBYTE & 0x3f ); break; } // decode for values 208 through 223
        case 14: {               // decode for values 224 through 239
            c = ( ( c & 0x0f ) << 12 ) | ( ( NEXTBYTE & 0x3f ) << 6 );
            c |= NEXTBYTE & 0x3f;
            break;
        }
        default: {
            c &= 0xff;           // make sure c is in range of 0 to 255
            break;
        }
    }
}

Questions:

1. Is my understanding of your code fairly accurate?
2. Doesn't this code require unsigned characters only, otherwise a value > 127 is not possible
3. What about values 240 through 255?
4. This code doesn't address Unicode single byte ASCII-equivalent values?
5. This code can be used to convert multi-byte narrow chars (in utf8) to wide characters?

Commented:
>> 1. Is my understanding of your code fairly accurate?

>> 2. Doesn't this code require unsigned characters only, otherwise a value > 127 is not possible
This is correct, but with the masking of bits (i.e. & 0x3f) all other bits except those in the mask get filtered out anyway, which results in the same values for signed and unsigned integers here. In C I would still have used unsigned ints of course, but Java does not have unsigned types.

>> 3. What about values 240 through 255?
They would be the start of the 4 byte encodings, which are not used by Java but should be treated specially somehow. These values don't fit in 16-bit wchars anymore and need to be stored either as 32 bit values or as two 16 bit wchars. But the Unicode symbols in this range above 2^16 are used for rather exotic symbols or languages and are not yet widely used or supported as far as I know (many of them were only defined in recent revisions of the Unicode standard).

>> 4. This code doesn't address Unicode single byte ASCII-equivalent values?
It does handle them, but it is not necessary to convert them (the c=NEXTBYTE already assigns them the correct value). All US-ASCII 7-bit encodings are always valid Unicode encodings too, which makes single byte encodings by far the most common encoding type for western languages and leads to rather small documents compared to 16 bit or even 32 bit encodings for these languages.

>> 5. This code can be used to convert multi-byte narrow chars (in utf8) to wide characters?
Yes, that's what I use it for. But the UTF-8 encoding has some redundancy in its use of the bits, which makes certain byte sequences illegal, and a really compliant decoder should detect them and issue some warning. My code simply ignores them and even produces garbage instead in rare cases (i.e. 4 and more byte encodings), although it gives correct output for all valid UTF-8 inputs with up to 3 byte encodings.

The original UTF-8 definition also had 5 and 6 byte encodings to cover a range of up to 2^31 code points; these were later removed from the standard. You can see the treatment of all these cases here: http://home.tiscali.nl/t876506/utf8tbl.html


Commented:
Okay thanks for all your help.

Commented:
you're welcome and thanks for the A. ;-)
