
How to convert from UTF8 to UCS2??

Does anyone have an idea how to convert UTF-8 to UCS-2 in C++?
1 Solution
It is not as difficult as you might think. However, it is probably easiest to first convert to UCS-4 and then convert from UCS-4 to UCS-2.

First from UTF-8 to UCS4.

Read the file byte by byte:

0x00 to 0x7f  -> 0x000000..0x00007f
I.e. if the high bit is not set, the byte is the code.

I will use 0b.... to indicate binary (radix 2). I will also use x y z u to indicate either an unspecified binary digit (bit) or an unspecified hex digit 0-9a-f, so 0xax means a byte with the upper bits set to 1010 and the lower 4 bits unspecified.

0b0xxxxxxx -> 0b0 0000 0000 0000 0xxx xxxx

0b10xxxxxx should not occur by itself (it is a byte that can only follow other bytes).

0b110xxxxx 0b10yyyyyy -> 0b0 0000 0000 0xxx xxyy yyyy
so a byte in the range 0xc0..0xdf is followed by a byte in the range 0x80..0xbf,
and those two bytes make a code in the range 0x000080 to 0x0007ff. Note that it is not legal to make any code in the range 0x00..0x7f this way, so 0xc0 and 0xc1 are not valid lead bytes.

0b1110xxxx 0b10yyyyyy 0b10zzzzzz ->
0b0 0000 xxxx yyyy yyzz zzzz

Again, 0xe0 followed by 0x8x or 0x9x is not valid, since that would produce a code below 0x000800 that could have used the shorter 2-byte representation.

0b11110xxx 0b10yyyyyy 0b10zzzzzz 0b10uuuuuu ->
0bx xxyy yyyy zzzz zzuu uuuu

Note that the maximum Unicode code is 0x10ffff, so the highest of the 21 bits can be 1 only if the next 4 bits are all 0. The highest sequence that can occur is therefore 0xf4 0x8f 0xbf 0xbf. The UTF-8 bit pattern could express higher codes, but they are not valid Unicode and cannot be represented in UCS-2. Again, it is illegal to represent a code that could be represented by fewer bytes, so 0xf0 followed by 0x8x is not legal.

To summarize:

0xxx xxxx                                -> 0 0000 0000 0000 0xxx xxxx
110x xxxx  10yy yyyy                     -> 0 0000 0000 0xxx xxyy yyyy
1110 xxxx  10yy yyyy 10zz zzzz           -> 0 0000 xxxx yyyy yyzz zzzz
1111 0xxx  10yy yyyy 10zz zzzz 10uu uuuu -> x xxyy yyyy zzzz zzuu uuuu
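The table above can be turned into a small lead-byte classifier. This is only a sketch: the function name `utf8_sequence_length` is made up for illustration, and it looks at the bit pattern alone, so it does not yet reject overlong lead bytes such as 0xc0 and 0xc1.

```cpp
#include <cstddef>

// Returns the total number of bytes in the UTF-8 sequence that starts with
// `lead`, or 0 if `lead` cannot start a sequence: either a continuation
// byte (10xx xxxx) or a pattern 1111 1xxx that the table does not cover.
// Hypothetical helper name; range checks are deliberately left out.
std::size_t utf8_sequence_length(unsigned char lead)
{
    if ((lead & 0x80) == 0x00) return 1; // 0xxx xxxx
    if ((lead & 0xE0) == 0xC0) return 2; // 110x xxxx
    if ((lead & 0xF0) == 0xE0) return 3; // 1110 xxxx
    if ((lead & 0xF8) == 0xF0) return 4; // 1111 0xxx
    return 0;
}
```

For example, a lead byte of 0xe2 reports a 3-byte sequence, while 0x80 reports 0 because it can only follow another byte.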

Combinations that overlap are illegal, so even though you could specify 0x00003f by 0xc0 0xbf, that sequence is considered invalid since the single byte 0x3f is the correct encoding of 0x00003f.

The bit patterns alone give you a code in the range 0x000000..0x1fffff,
but the valid range for each sequence length is:

0xxx xxxx -> 0x000000..0x00007f
110x xxxx -> 0x000080..0x0007ff
1110 xxxx -> 0x000800..0x00ffff
1111 0xxx -> 0x010000..0x10ffff

With these checks you get a code in the range 0x000000..0x10ffff.

Further, you should check that the code is NOT in the range 0x00d800..0x00dfff. If it is, it is also an invalid code.
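Putting the decoding rules and the validity checks together, a decoder for a single code might look like the sketch below. The name `utf8_decode`, its signature, and the convention of returning 0 on any error are assumptions chosen for illustration, not part of any standard API.

```cpp
#include <cstddef>
#include <cstdint>

// Decodes one UTF-8 sequence from the `len` bytes at `s` into `code`.
// Returns the number of bytes consumed, or 0 if the input is empty,
// truncated, or invalid. (Hypothetical helper; error handling is minimal.)
std::size_t utf8_decode(const unsigned char* s, std::size_t len,
                        std::uint32_t& code)
{
    if (len == 0) return 0;
    const unsigned char b0 = s[0];
    std::size_t n = 0;          // sequence length
    std::uint32_t min_code = 0; // smallest code allowed for this length

    if (b0 < 0x80)                { n = 1; min_code = 0x000000; code = b0; }
    else if ((b0 & 0xE0) == 0xC0) { n = 2; min_code = 0x000080; code = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { n = 3; min_code = 0x000800; code = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { n = 4; min_code = 0x010000; code = b0 & 0x07; }
    else return 0;              // stray 10xx xxxx byte or 1111 1xxx

    if (len < n) return 0;      // sequence is cut off
    for (std::size_t i = 1; i < n; ++i) {
        if ((s[i] & 0xC0) != 0x80) return 0;   // must be 10xx xxxx
        code = (code << 6) | (s[i] & 0x3F);    // append 6 payload bits
    }
    if (code < min_code) return 0;                  // overlong encoding
    if (code > 0x10FFFF) return 0;                  // beyond Unicode
    if (code >= 0xD800 && code <= 0xDFFF) return 0; // reserved range
    return n;
}
```

With this sketch the bytes 0xe2 0x82 0xac decode to 0x0020ac (the euro sign), while 0xc0 0xbf is rejected as overlong.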

You can now convert the code to UCS-2 as follows:

If the code is less than 0x010000 the code is the result:

0x00yyyy -> 0xyyyy as UCS-2 code. The code must NOT be in the range 0xd800..0xdfff.

If the code is greater than or equal to 0x010000, it is in the range 0x010000..0x10ffff. Subtract 0x010000 from this value and you get a 20-bit code in the range 0x00000..0xfffff. This code is split in two:

0bxxxx xxxx xxyy yyyy yyyy

The two halves form a sequence of two UCS-2 codes (a surrogate pair; strictly speaking, this extended form is called UTF-16):

0b1101 10xx xxxx xxxx 0b1101 11yy yyyy yyyy

The first code is in the range 0xd800..0xdbff and the second code is in the range 0xdc00..0xdfff.
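The split into two 16-bit codes can be sketched as follows. The function name `append_utf16` and the choice of a `std::vector<std::uint16_t>` as output are illustrative assumptions.

```cpp
#include <cstdint>
#include <vector>

// Appends `code` to `out` as one or two 16-bit codes.
// Returns false and appends nothing if `code` is not a valid code
// (above 0x10ffff or inside 0xd800..0xdfff). Hypothetical helper.
bool append_utf16(std::uint32_t code, std::vector<std::uint16_t>& out)
{
    if (code > 0x10FFFF || (code >= 0xD800 && code <= 0xDFFF))
        return false;

    if (code < 0x10000) {
        // The code itself is the result.
        out.push_back(static_cast<std::uint16_t>(code));
        return true;
    }

    const std::uint32_t v = code - 0x10000; // 20 bits: xxxxxxxxxx yyyyyyyyyy
    out.push_back(static_cast<std::uint16_t>(0xD800 | (v >> 10)));   // 1101 10xx xxxx xxxx
    out.push_back(static_cast<std::uint16_t>(0xDC00 | (v & 0x3FF))); // 1101 11yy yyyy yyyy
    return true;
}
```

As a check, the code 0x01f600 becomes 0x0f600 after the subtraction, which splits into the pair 0xd83d 0xde00.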

For UTF-8 there are no little endian/big endian problems, since every unit of the encoding is one byte long. For UCS-2, however, there are, and so there are two dialects of UCS-2: one for little endian machines and one for big endian machines.

For a reader to recognize the endianness of written UCS-2 text, you should provide a so-called "byte order mark" (BOM) at the beginning. This is a 'regular' Unicode UCS-2 code, but there are two things that are special about it:

The code is 0x00feff. The code with the bytes switched would be 0x00fffe, and that is an illegal Unicode code. So if the reader reads the first UCS-2 code and gets 0xfffe, it knows it is reading with the wrong endianness and can swap the bytes before passing them to whoever wants to read the text. The code 0xfeff means 'zero width no-break space', which has no specific meaning at the beginning of a text, so it can safely be ignored once the byte order is found. However, it causes no harm if the reader were to actually read and interpret the code.

It does cause problems if the reader expects a specific signature for a header etc. and then gets the BOM first. In that case the reader should be prepared to receive the BOM and discard it before reading the actual header.
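On the reading side, the byte order check can be sketched like this. It assumes the 16-bit codes have already been read into memory in the machine's native byte order; the names `swap_bytes` and `normalize_byte_order` are made up for illustration.

```cpp
#include <cstdint>
#include <vector>

// Exchanges the two bytes of a 16-bit code.
inline std::uint16_t swap_bytes(std::uint16_t u)
{
    return static_cast<std::uint16_t>((u << 8) | (u >> 8));
}

// If the text starts with the byte-swapped BOM (0xfffe), swaps the bytes
// of every code. Then, if the text starts with the BOM (0xfeff), removes
// it so that later header checks see the real first code.
// Text without a BOM is left untouched. (Hypothetical helper.)
void normalize_byte_order(std::vector<std::uint16_t>& units)
{
    if (units.empty()) return;
    if (units[0] == 0xFFFE) {
        for (auto& u : units) u = swap_bytes(u);
    }
    if (units[0] == 0xFEFF) {
        units.erase(units.begin());
    }
}
```

A buffer holding 0xfffe 0x4100 therefore comes out as the single code 0x0041.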

In other words, when you write UCS-2 text it is a good idea to supply the BOM. So before you write the actual text of the file, you might want to consider writing a BOM to the file.

Note that this should only be done at the very start of a file. If you write embedded strings as part of other data in a binary file, or you are appending strings to an existing file, the BOM should never be written. For a binary file with embedded text strings it is assumed that the binary data provides the endianness information, and for appending text to a file it is assumed that the BOM was placed by whoever wrote the original text of the file.


if (write_bom)
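The `if (write_bom)` fragment above appears to be cut off in the original post. As a rough idea of what such a step could look like, the sketch below serializes 16-bit codes to bytes in an explicitly chosen byte order and optionally puts the BOM first. The function name `serialize_utf16` and its parameters are assumptions, not a reconstruction of the missing code.

```cpp
#include <cstdint>
#include <vector>

// Converts 16-bit codes to a byte sequence in the requested byte order.
// When `write_bom` is true, the BOM 0xfeff is emitted before the text.
// (Hypothetical helper; the caller decides whether a BOM is appropriate.)
std::vector<unsigned char> serialize_utf16(
    const std::vector<std::uint16_t>& units, bool big_endian, bool write_bom)
{
    std::vector<unsigned char> bytes;
    auto put = [&](std::uint16_t u) {
        const unsigned char hi = static_cast<unsigned char>(u >> 8);
        const unsigned char lo = static_cast<unsigned char>(u & 0xFF);
        if (big_endian) { bytes.push_back(hi); bytes.push_back(lo); }
        else            { bytes.push_back(lo); bytes.push_back(hi); }
    };

    if (write_bom)
        put(0xFEFF); // only at the very start of a file
    for (std::uint16_t u : units)
        put(u);
    return bytes;
}
```

Writing the single code 0x0041 as little endian with a BOM gives the bytes 0xff 0xfe 0x41 0x00; as big endian it gives 0xfe 0xff 0x00 0x41.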

Given that text may thus contain a BOM, it is also sometimes found in UTF-8 text even though it isn't needed there. The BOM as UTF-8 is easy enough to detect:

0b1111 1110 1111 1111 -> 0b1110 1111 0b1011 1011 0b1011 1111 -> 0xef 0xbb 0xbf.

Don't write two BOMs. If the UTF-8 file starts with that sequence, it decodes to 0xfeff, so copying it through already writes a BOM and you shouldn't add one yourself. On the other hand, if you have already written the BOM before you read the UTF-8 text, don't copy the UTF-8 BOM; just skip it and read and copy the text following it.
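Skipping a leading UTF-8 BOM is a three-byte comparison. A minimal sketch, with `utf8_bom_length` as a made-up name:

```cpp
#include <cstddef>

// Returns 3 if the buffer starts with the UTF-8 form of the BOM
// (0xef 0xbb 0xbf) and 0 otherwise, i.e. the number of bytes to skip
// before converting the text. (Hypothetical helper.)
std::size_t utf8_bom_length(const unsigned char* s, std::size_t len)
{
    if (len >= 3 && s[0] == 0xEF && s[1] == 0xBB && s[2] == 0xBF)
        return 3;
    return 0;
}
```

A converter that writes its own BOM could start reading at offset `utf8_bom_length(data, size)` so that the mark is never emitted twice.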

poor_guy (Author) commented:
Very detail answer, Thanks :)
