Solved

Unicode to ASCII conversion

Posted on 2003-03-10
1,495 Views
Last Modified: 2012-06-27
I am trying to program a command-line utility that will convert a file from Unicode to ASCII. Files are approx 60 MB. I have the command line architecture built, I just can't figure out how to read the unicode file, convert it to ascii and write it back out. Any help would be greatly appreciated.
Question by:ocjared
3 Comments
 
LVL 12

Accepted Solution

by: Salte (earned 1000 total points)
ID: 8105190
Well you need a unicode to ascii conversion function.

The main problem here is that there is no single format called "unicode" per se. Unicode can come in three main flavors, and for some of these there are even subflavors:

UTF-8:
This is the format most often used. When people say "this is a unicode file" there's an 85% chance that they really mean "this is a UTF-8 file". UTF-8 encodes Unicode in such a way that all ASCII codes 0-127 remain single bytes, so you can open the file in a regular editor and read it as regular text. Codes above 127 are encoded in a special manner which I will come back to below.

UTF-16:
This is a format encoding Unicode in 16 bit codes. This used to be a good format back in the days when Unicode was 16 bits. Now that Unicode is really 21 bits it isn't so great anymore, but it is still useful. This format encodes Unicode values in the range 0x000000 through 0x00d7ff (both inclusive) as the value itself. Codes in the range 0x00e000 through 0x00ffff (both ends inclusive) are also coded as the value itself. Values in the range 0x010000 through 0x10ffff are encoded using two 16 bit codes, a so-called surrogate pair, as shown in the sketch below.

Since this is using 2 bytes per code it differs on little endian and big endian systems, so there are two subformats: UTF-16LE and UTF-16BE. The name "UTF-16" is used to denote a UTF-16LE or UTF-16BE stream with a BOM (byte order mark) in front so you can determine which of the two formats it is. This is done by reading the first code (the BOM): if it has the value 0xfeff everything is fine and you are reading the Unicode values correctly; if the value is 0xfffe you are getting the bytes in opposite order and you must swap the bytes of each 16 bit code you read.
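For completeness, here is a minimal sketch of how such a surrogate pair maps back to a single code point. The function name decode_surrogate_pair is my own, not part of any library:

// Sketch: combine a UTF-16 surrogate pair into one Unicode code point.
// hi must be in 0xd800..0xdbff (high surrogate) and
// lo must be in 0xdc00..0xdfff (low surrogate).
unsigned int decode_surrogate_pair(unsigned short hi, unsigned short lo)
{
   // each surrogate carries 10 bits; the pair encodes the range
   // 0x010000 through 0x10ffff
   return 0x10000u + (((unsigned int)(hi - 0xd800) << 10) |
                      (unsigned int)(lo - 0xdc00));
}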

UTF-32 is the format using 32 bits per Unicode character; each value is in the range 0x000000 through 0x00d7ff or in the range 0x00e000 through 0x10ffff.

Values in the range 0x00d800 through 0x00dfff, or above 0x10ffff, are not legal Unicode values.

In addition there's something called "UCS-4", which uses 4 byte values just like UTF-32 but without any restriction other than that the value must be positive as a signed int, i.e. the values are in the range 0x00000000 through 0x7fffffff. A UCS-4 value only properly represents Unicode if it is at the same time a valid UTF-32 value.

Ok, so the question is, do you want to convert from UTF-8 to ascii or from UTF-16 to ascii (LE or BE?) or from UTF-32 to ascii?

Also, as should be obvious, Unicode has just over a million possible code points (0x110000 of them) while ASCII has 128, so obviously you can't represent all possible Unicode values in ASCII. You must therefore have a clear idea of which values you want to convert and which values you should disallow.

If by "ascii" you really mean ASCII (the 7 bit char set), the conversion is easy enough. UTF-32 to ASCII is very simple:

bool uni2ascii(int unichar, char & asc)
{
   if (unichar < 0 || unichar >= 0x80)
      return false;
   asc = char(unichar);
   return true;
}

This function will return true if the unicode code is valid ascii and set the char asc to that value.

If the unicode code is outside the range of valid ascii, the function returns false.

However, it is possible that by "ascii" you mean "some 8 bit charset". Some people confuse the ANSI charset used in Windows, or some other 8 bit charset such as ISO-LATIN-1, with ASCII. Then things get hairier. Basically you have to define which Unicode characters you want to allow and which you don't, then convert the ones you allow to the proper codes and reject the others.
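Just to illustrate one such case: if the 8 bit charset you have in mind happens to be ISO-LATIN-1, the mapping is easy, because the first 256 Unicode code points coincide with ISO-LATIN-1. A sketch along the lines of the function above (uni2latin1 is my own name):

// Sketch: map a Unicode value to ISO-LATIN-1 (ISO 8859-1).
// Works because Unicode code points 0x00..0xff are identical to Latin-1.
bool uni2latin1(int unichar, unsigned char & out)
{
   if (unichar < 0 || unichar > 0xff)
      return false; // not representable in Latin-1
   out = (unsigned char) unichar;
   return true;
}

The Windows ANSI code page (CP1252) is not quite the same thing: codes 0x80..0x9f differ from Latin-1, so there you would need a small translation table instead of the direct cast.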

However, treating that as a separate problem: I have already explained how to convert UTF-32 to ASCII, so the question that remains is what to do if you have UTF-16 or UTF-8 and want to convert that to ASCII.

UTF-8 to ASCII is also very simple. As already said, if the content really is ASCII (the 7 bit char set) then UTF-8 requires no conversion: every code is less than 0x80 and is the plain ASCII code, so the file is both legal UTF-8 (Unicode) and legal ASCII.

If the file contains some bytes that are not valid ASCII and which you want to remove, just strip any bytes with the high bit set and you have only valid ASCII left:

char ch;
while (cin.get(ch)) {
   if ((ch & 0x80) == 0)
      cout << ch;
}

This code converts a UTF-8 file to ascii by stripping off all non-ascii characters from the file.

Normally UTF-8 decoding is harder, but bytes of 0x80 and above only occur in the encodings of code points above 0x7f, and those aren't valid ASCII anyway, so there's no reason to decode them.

If you have UTF-16 then you must first determine the endianness:

unsigned short u;
bool swap = false;

// istream::read takes a char pointer and a byte count
if (cin.read(reinterpret_cast<char *>(&u), sizeof u)) {
   if (u == 0xfffe) { // BOM read with the wrong endianness
      swap = true;
      cin.read(reinterpret_cast<char *>(&u), sizeof u);
   } else if (u != 0xfeff) {
      // no BOM char...
      // in this case the format really should have been
      // UTF-16LE or UTF-16BE since the format itself
      // doesn't identify the endianness.
      // probably you should reject this file as a
      // UTF-16 file and just refuse to translate.
   } else {
      cin.read(reinterpret_cast<char *>(&u), sizeof u);
   }
}
while (cin) {
   if (swap)
      u = (u << 8) | (u >> 8); // swap the bytes.
   if (u < 0x80) // valid ascii.
      cout << char(u);
   cin.read(reinterpret_cast<char *>(&u), sizeof u);
}

If you open the file for binary reading, this should output an ASCII version of the file.

If the format is UTF-16LE or UTF-16BE then you know whether you need to swap not from a BOM but from the format passed in as an argument, so instead of just a bool swap you can do something like this:

enum format_t {
   utf16,
   utf16be,
   utf16le,
};

void convert2ascii(format_t f, ostream & os, istream & is)
{
   unsigned short u;
   bool swap = false;

   if (! is.read(reinterpret_cast<char *>(&u), sizeof u))
      return; // empty input, nothing to do.
   switch (f) {
   case utf16:
      if (u == 0xfffe) {
         swap = true;
         is.read(reinterpret_cast<char *>(&u), sizeof u);
      } else if (u != 0xfeff) { // no bom, panic
         throw std::runtime_error("UTF-16 requires a BOM");
      } else {
         swap = false;
         is.read(reinterpret_cast<char *>(&u), sizeof u);
      }
      break;
   case utf16be:
      // if on a little endian machine set swap to true,
      // if on a big endian machine set swap to false.
      // HOST_IS_BIG_ENDIAN is whatever constant tells you the
      // byte order of the machine you compile for.
      swap = ! HOST_IS_BIG_ENDIAN;
      break;
   case utf16le:
      // if on a little endian machine set swap to false,
      // if on a big endian machine set swap to true.
      swap = HOST_IS_BIG_ENDIAN;
      break;
   }
   while (is) {
      if (swap)
         u = (u << 8) | (u >> 8);
      if (u < 0x80) // ascii
         os << char(u);
      is.read(reinterpret_cast<char *>(&u), sizeof u);
   }
}
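If your compiler doesn't give you a ready-made way to tell the byte order, a small run-time test can stand in for HOST_IS_BIG_ENDIAN. This is just a sketch; the name is_big_endian is my own:

// Sketch: detect the byte order of the machine at run time.
bool is_big_endian()
{
   unsigned short test = 0x0102;
   // on a big endian machine the byte 0x01 comes first in memory
   return *reinterpret_cast<unsigned char *>(&test) == 0x01;
}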

If you have the UTF-32 format then the conversion is trivial as already mentioned, but you need to worry about endianness here as well.

There are machines with different endianness, so the 4 bytes 0x11 0x22 0x33 0x44, read as a 4 byte integer, may come up as 0x11223344 or 0x44332211, or even 0x22114433 or 0x33441122 on some odd machines.

Well, since a UTF-32 value is at most 0x10ffff, a byte-swapped value comes up as something like 0xffff1000 and so is easy to recognize as invalid. Again, the UTF-32 format usually has a BOM code in front, so you should expect to read 0x0000feff. Due to endianness this may come up as 0xfffe0000 instead; if so, you know you need to reverse all four bytes. On some odd machines it may even show up as 0x0000fffe, which is also invalid Unicode, so you know you need to swap the two bytes within each 16 bit half but keep the two halves in place. If you read 0xfeff0000, you know you need to swap the two 16 bit halves but keep the bytes within each half.

Once this decoding is done, the conversion to ASCII is trivial since the first 128 codes of Unicode are identical to ASCII.
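To make that concrete, here is a sketch of the UTF-32 case along the same lines as the UTF-16 loop above. It only handles the common full byte reversal, not the odd mixed byte orders mentioned earlier, and it assumes the BOM has already been read and the swap decision made; utf32_to_ascii is my own name:

// Sketch: read 32 bit code units and emit the ones that are plain ASCII.
void utf32_to_ascii(istream & is, ostream & os, bool swap)
{
   unsigned int u; // assumed to be 32 bits
   while (is.read(reinterpret_cast<char *>(&u), sizeof u)) {
      if (swap) // full byte reversal, e.g. 0xfffe0000 -> 0x0000feff
         u = (u >> 24) | ((u >> 8) & 0xff00) |
             ((u << 8) & 0xff0000) | (u << 24);
      if (u < 0x80) // the first 128 code points are plain ASCII
         os << char(u);
   }
}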

Now, as you can see, it is all rather trivial. However, chances are that you didn't really want to convert to ASCII but rather to some 8 bit char set. The problem there is that I can't second-guess which 8 bit char set you want, so I won't try to show code for that; it is easy enough though, and if you want I can show that too.

Alf
 
LVL 12

Expert Comment

by:Salte
ID: 8105288
I promised I would come back to how codes above 0x80 are encoded, so I will say something about that too, even though you don't need it in your program. A promise is a promise.

Codes from 0x000080 through 0x0007ff are encoded using two bytes. The first byte is in the range 0xc0..0xdf (110xxxxx) and the other byte is in the range 0x80..0xbf (10yyyyyy). Together these two bytes hold 11 bits, and those bits are the 11 bits of the codes from 0x80 through 0x7ff:

110x xxxx, 10yy yyyy -> 0 0000 0000 0xxx xxyy yyyy

Codes from 0x000800 through 0x00d7ff and from 0x00e000 through 0x00ffff are encoded using 3 bytes. The first byte is in the range 0xe0..0xef (1110xxxx) and the next two bytes are both in the range 0x80..0xbf (10yyyyyy and 10zzzzzz). These form the 16 bits of the unicode value:

1110 xxxx, 10yy yyyy, 10zz zzzz -> 0 0000 xxxx yyyy yyzz zzzz

Codes from 0x010000 through 0x10ffff are encoded using 4 bytes. The first byte is in the range 0xf0..0xf4 and the next three bytes are all in the range 0x80..0xbf.

1111 0xxx, 10yy yyyy, 10zz zzzz, 10uu uuuu -> x xxyy yyyy zzzz zzuu uuuu

Note that because Unicode goes up to at most 0x10ffff, the first byte cannot be 0xf5..0xf7, and if it is 0xf4 then the next byte must be in the range 0x80..0x8f.

Note also that UCS-4, which can also be encoded with the same UTF-8 scheme, lacks several of these restrictions: it allows first bytes up to 0xf7, and it additionally allows the first byte to be in the range:

0xf8..0xfb (1111 10xx) followed by 4 bytes in the range 0x80..0xbf and:

1111 10xx, 10yy yyyy, 10zz zzzz, 10uu uuuu, 10vv vvvv -> 0000 00xx yyyy yyzz zzzz uuuu uuvv vvvv

and codes starting with: 0xfc..0xfd (1111 110x) followed by 5 bytes in the range 0x80..0xbf and:

1111 110x, 10yy yyyy, 10zz zzzz, 10uu uuuu, 10vv vvvv, 10ww wwww -> 0xyy yyyy zzzz zzuu uuuu vvvv vvww wwww

Giving a maximum value of 0x7fffffff which is the maximum value for UCS-4.

Since Unicode and UTF-32 have a maximum value of 0x10ffff, they do not allow those 5 and 6 byte sequences, and the longest sequence needed to encode a Unicode value is 4 bytes.
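You don't need any of this for plain ASCII output, but if you ever do want to decode those multi-byte sequences, a sketch of the 1 to 3 byte cases following the bit layouts above could look like this; decode_utf8 is my own name and the error handling is deliberately minimal:

// Sketch: decode one UTF-8 sequence of 1 to 3 bytes into a code point,
// following the bit layouts above. Returns -1 on malformed input.
// The 4 byte case works the same way with three continuation bytes.
int decode_utf8(const unsigned char * p, int len)
{
   if (len >= 1 && p[0] < 0x80)                    // 0xxx xxxx
      return p[0];
   if (len >= 2 && (p[0] & 0xe0) == 0xc0 && (p[1] & 0xc0) == 0x80)
      return ((p[0] & 0x1f) << 6) | (p[1] & 0x3f); // 110x xxxx, 10yy yyyy
   if (len >= 3 && (p[0] & 0xf0) == 0xe0 &&
       (p[1] & 0xc0) == 0x80 && (p[2] & 0xc0) == 0x80)
      return ((p[0] & 0x0f) << 12) |               // 1110 xxxx,
             ((p[1] & 0x3f) << 6)  |               // 10yy yyyy,
             (p[2] & 0x3f);                        // 10zz zzzz
   return -1; // malformed or a longer sequence
}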

Alf
 
LVL 1

Author Comment

by:ocjared
ID: 8105784
Thanks! Much appreciated. -J
