Solved

Unicode to ASCII conversion

Posted on 2003-03-10
1,495 Views
Last Modified: 2012-06-27
I am trying to program a command-line utility that will convert a file from Unicode to ASCII. Files are approx 60 MB. I have the command line architecture built, I just can't figure out how to read the unicode file, convert it to ascii and write it back out. Any help would be greatly appreciated.
Question by:ocjared
3 Comments
 
LVL 12

Accepted Solution

by: Salte (earned 1000 total points)
ID: 8105190
Well you need a unicode to ascii conversion function.

The main problem here is that there is no single format called "unicode" per se. Unicode can come in three main flavors, and for some of these there are even subflavors:

UTF-8:
This is the format most often used. When people say "this is a unicode file" there's an 85% chance that they really mean "this is a UTF-8 file". UTF-8 encodes Unicode in such a way that all ASCII codes 0-127 remain single bytes, so you can open the file in a regular editor and read it as regular text. Codes above 127 are encoded in a special manner which I will come back to below.

UTF-16:
This is a format encoding Unicode in 16 bit codes. This used to be a good format back in the days when Unicode was 16 bits. Now that Unicode is really 21 bits it isn't so great anymore, but it is still useful. This format encodes Unicode values in the range 0x000000 through 0x00d7ff (both inclusive) as the value itself. Codes in the range 0x00e000 through 0x00ffff (both ends inclusive) are also coded as the value itself. Values in the range 0x010000 through 0x10ffff are encoded using two 16 bit codes, a so-called surrogate pair, as shown in the sketch below.

Since this is using 2 bytes per code it differs on little endian and big endian systems, so there are two subformats: UTF-16LE and UTF-16BE. The name "UTF-16" is used to denote a UTF-16LE or UTF-16BE stream with a BOM (byte order mark) in front so you can determine which of the two formats it is. This is done by reading the first code (the BOM): if it has the value 0xfeff everything is fine and you are reading the Unicode values correctly; if the value is 0xfffe you are getting the bytes in opposite order and you must swap the bytes of each 16 bit code you read.
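For completeness, here is a minimal sketch of how such a surrogate pair maps back to a single code point. The function name decode_surrogate_pair is my own, not part of any library:

// Sketch: combine a UTF-16 surrogate pair into one Unicode code point.
// hi must be in 0xd800..0xdbff (high surrogate) and
// lo must be in 0xdc00..0xdfff (low surrogate).
unsigned int decode_surrogate_pair(unsigned short hi, unsigned short lo)
{
   // each surrogate carries 10 bits; the pair encodes the range
   // 0x010000 through 0x10ffff
   return 0x10000u + (((unsigned int)(hi - 0xd800) << 10) |
                      (unsigned int)(lo - 0xdc00));
}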

UTF-32 is the format using 32 bits per Unicode character; each value is in the range 0x000000 through 0x00d7ff or in the range 0x00e000 through 0x10ffff.

Values in the range 0x00d800 through 0x00dfff, or above 0x10ffff, are not legal Unicode values.

In addition there's something called "UCS-4", which uses 4 byte values just like UTF-32 but without any restriction other than that the value must be positive as a signed int, i.e. the values are in the range 0x00000000 through 0x7fffffff. A UCS-4 value only properly represents Unicode if it is at the same time a valid UTF-32 value.

Ok, so the question is, do you want to convert from UTF-8 to ascii or from UTF-16 to ascii (LE or BE?) or from UTF-32 to ascii?

Also, as should be obvious, Unicode has just over a million possible code points (0x110000 of them) while ASCII has 128, so obviously you can't represent all possible Unicode values in ASCII. You must therefore have a clear idea of which values you want to convert and which values you should disallow.

If by "ascii" you really mean ASCII (the 7 bit char set), the conversion is easy enough. UTF-32 to ASCII is very simple:

bool uni2ascii(int unichar, char & asc)
{
   if (unichar < 0 || unichar >= 0x80)
      return false;
   asc = char(unichar);
   return true;
}

This function will return true if the unicode code is valid ascii and set the char asc to that value.

If the unicode code is outside the range of valid ascii, the function returns false.

However, it is possible that by "ascii" you mean "some 8 bit charset". Some people confuse the ANSI charset used in Windows, or some other 8 bit charset such as ISO-LATIN-1, with ASCII. Then things get hairier. Basically you have to define which Unicode characters you want to allow and which you don't, then convert the ones you allow to the proper codes and reject the others.
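Just to illustrate one such case: if the 8 bit charset you have in mind happens to be ISO-LATIN-1, the mapping is easy, because the first 256 Unicode code points coincide with ISO-LATIN-1. A sketch along the lines of the function above (uni2latin1 is my own name):

// Sketch: map a Unicode value to ISO-LATIN-1 (ISO 8859-1).
// Works because Unicode code points 0x00..0xff are identical to Latin-1.
bool uni2latin1(int unichar, unsigned char & out)
{
   if (unichar < 0 || unichar > 0xff)
      return false; // not representable in Latin-1
   out = (unsigned char) unichar;
   return true;
}

The Windows ANSI code page (CP1252) is not quite the same thing: codes 0x80..0x9f differ from Latin-1, so there you would need a small translation table instead of the direct cast.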

However, treating that as a separate problem: I have already explained how to convert UTF-32 to ASCII, so the question that remains is what to do if you have UTF-16 or UTF-8 and want to convert that to ASCII.

UTF-8 to ASCII is also very simple. As already said, if the content really is ASCII (the 7 bit char set) then UTF-8 requires no conversion: every code is less than 0x80 and is the plain ASCII code, so the file is both legal UTF-8 (Unicode) and legal ASCII.

If the file contains some bytes that are not valid ASCII and which you want to remove, just strip any bytes with the high bit set and you have only valid ASCII left:

char ch;
while (cin.get(ch)) {
   if ((ch & 0x80) == 0)
      cout << ch;
}

This code converts a UTF-8 file to ascii by stripping off all non-ascii characters from the file.

Normally UTF-8 decoding is harder, but bytes of 0x80 and above only occur in the encodings of code points above 0x7f, and those aren't valid ASCII anyway, so there's no reason to decode them.

If you have UTF-16 then you must first determine the endianness:

unsigned short u;
bool swap = false;

// istream::read takes a char pointer and a byte count
if (cin.read(reinterpret_cast<char *>(&u), sizeof u)) {
   if (u == 0xfffe) { // BOM read with the wrong endianness
      swap = true;
      cin.read(reinterpret_cast<char *>(&u), sizeof u);
   } else if (u != 0xfeff) {
      // no BOM char...
      // in this case the format really should have been
      // UTF-16LE or UTF-16BE since the format itself
      // doesn't identify the endianness.
      // probably you should reject this file as a
      // UTF-16 file and just refuse to translate.
   } else {
      cin.read(reinterpret_cast<char *>(&u), sizeof u);
   }
}
while (cin) {
   if (swap)
      u = (u << 8) | (u >> 8); // swap the bytes.
   if (u < 0x80) // valid ascii.
      cout << char(u);
   cin.read(reinterpret_cast<char *>(&u), sizeof u);
}

If you open the file for binary reading, this should output an ASCII version of the file.

If the format is UTF-16LE or UTF-16BE then you know whether you need to swap not from a BOM but from the format passed in as an argument, so instead of just a bool swap you can do something like this:

enum format_t {
   utf16,
   utf16be,
   utf16le,
};

void convert2ascii(format_t f, ostream & os, istream & is)
{
   unsigned short u;
   bool swap = false;

   if (! is.read(reinterpret_cast<char *>(&u), sizeof u))
      return; // empty input, nothing to do.
   switch (f) {
   case utf16:
      if (u == 0xfffe) {
         swap = true;
         is.read(reinterpret_cast<char *>(&u), sizeof u);
      } else if (u != 0xfeff) { // no bom, panic
         throw std::runtime_error("UTF-16 requires a BOM");
      } else {
         swap = false;
         is.read(reinterpret_cast<char *>(&u), sizeof u);
      }
      break;
   case utf16be:
      // if on a little endian machine set swap to true,
      // if on a big endian machine set swap to false.
      // HOST_IS_BIG_ENDIAN is whatever constant tells you the
      // byte order of the machine you compile for.
      swap = ! HOST_IS_BIG_ENDIAN;
      break;
   case utf16le:
      // if on a little endian machine set swap to false,
      // if on a big endian machine set swap to true.
      swap = HOST_IS_BIG_ENDIAN;
      break;
   }
   while (is) {
      if (swap)
         u = (u << 8) | (u >> 8);
      if (u < 0x80) // ascii
         os << char(u);
      is.read(reinterpret_cast<char *>(&u), sizeof u);
   }
}
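If your compiler doesn't give you a ready-made way to tell the byte order, a small run-time test can stand in for HOST_IS_BIG_ENDIAN. This is just a sketch; the name is_big_endian is my own:

// Sketch: detect the byte order of the machine at run time.
bool is_big_endian()
{
   unsigned short test = 0x0102;
   // on a big endian machine the byte 0x01 comes first in memory
   return *reinterpret_cast<unsigned char *>(&test) == 0x01;
}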

If you have the UTF-32 format then the conversion is trivial as already mentioned, but you need to worry about endianness here as well.

There are machines with different endianness, so the 4 bytes 0x11 0x22 0x33 0x44, read as a 4 byte integer, may come up as 0x11223344 or 0x44332211, or even 0x22114433 or 0x33441122 on some odd machines.

Well, since a UTF-32 value is at most 0x10ffff, a byte-swapped value comes up as something like 0xffff1000 and so is easy to recognize as invalid. Again, the UTF-32 format usually has a BOM code in front, so you should expect to read 0x0000feff. Due to endianness this may come up as 0xfffe0000 instead; if so, you know you need to reverse all four bytes. On some odd machines it may even show up as 0x0000fffe, which is also invalid Unicode, so you know you need to swap the two bytes within each 16 bit half but keep the two halves in place. If you read 0xfeff0000, you know you need to swap the two 16 bit halves but keep the bytes within each half.

Once this decoding is done, the conversion to ASCII is trivial since the first 128 codes of Unicode are identical to ASCII.
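To make that concrete, here is a sketch of the UTF-32 case along the same lines as the UTF-16 loop above. It only handles the common full byte reversal, not the odd mixed byte orders mentioned earlier, and it assumes the BOM has already been read and the swap decision made; utf32_to_ascii is my own name:

// Sketch: read 32 bit code units and emit the ones that are plain ASCII.
void utf32_to_ascii(istream & is, ostream & os, bool swap)
{
   unsigned int u; // assumed to be 32 bits
   while (is.read(reinterpret_cast<char *>(&u), sizeof u)) {
      if (swap) // full byte reversal, e.g. 0xfffe0000 -> 0x0000feff
         u = (u >> 24) | ((u >> 8) & 0xff00) |
             ((u << 8) & 0xff0000) | (u << 24);
      if (u < 0x80) // the first 128 code points are plain ASCII
         os << char(u);
   }
}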

Now, as you can see, it is all rather trivial. However, chances are that you didn't really want to convert to ASCII but rather to some 8 bit char set. The problem there is that I can't second-guess which 8 bit char set you want, so I won't try to show code for that; it is easy enough though, and if you want I can show that too.

Alf
 
LVL 12

Expert Comment

by:Salte
ID: 8105288
I promised I would come back to how codes above 0x80 are encoded, so I will say something about that too, even though you don't need it in your program. A promise is a promise.

Codes from 0x000080 through 0x0007ff are encoded using two bytes. The first byte is in the range 0xc0..0xdf (110xxxxx) and the other byte is in the range 0x80..0xbf (10yyyyyy). Together these two bytes hold 11 bits, and those bits are the 11 bits of the codes from 0x80 through 0x7ff:

110x xxxx, 10yy yyyy -> 0 0000 0000 0xxx xxyy yyyy

Codes from 0x000800 through 0x00d7ff and from 0x00e000 through 0x00ffff are encoded using 3 bytes. The first byte is in the range 0xe0..0xef (1110xxxx) and the next two bytes are both in the range 0x80..0xbf (10yyyyyy and 10zzzzzz). These form the 16 bits of the unicode value:

1110 xxxx, 10yy yyyy, 10zz zzzz -> 0 0000 xxxx yyyy yyzz zzzz

Codes from 0x010000 through 0x10ffff are encoded using 4 bytes. The first byte is in the range 0xf0..0xf4 and the next three bytes are all in the range 0x80..0xbf.

1111 0xxx, 10yy yyyy, 10zz zzzz, 10uu uuuu -> x xxyy yyyy zzzz zzuu uuuu

Note that because Unicode goes up to at most 0x10ffff, the first byte cannot be 0xf5..0xf7, and if it is 0xf4 then the next byte must be in the range 0x80..0x8f.

Note also that UCS-4, which can also be encoded with the same UTF-8 scheme, lacks several of these restrictions: it allows first bytes up to 0xf7, and it additionally allows the first byte to be in the range:

0xf8..0xfb (1111 10xx) followed by 4 bytes in the range 0x80..0xbf and:

1111 10xx, 10yy yyyy, 10zz zzzz, 10uu uuuu, 10vv vvvv -> 0000 00xx yyyy yyzz zzzz uuuu uuvv vvvv

and codes starting with: 0xfc..0xfd (1111 110x) followed by 5 bytes in the range 0x80..0xbf and:

1111 110x, 10yy yyyy, 10zz zzzz, 10uu uuuu, 10vv vvvv, 10ww wwww -> 0xyy yyyy zzzz zzuu uuuu vvvv vvww wwww

Giving a maximum value of 0x7fffffff which is the maximum value for UCS-4.

Since Unicode and UTF-32 have a maximum value of 0x10ffff, they do not allow those 5 and 6 byte sequences, and the longest sequence needed to encode a Unicode value is 4 bytes.
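You don't need any of this for plain ASCII output, but if you ever do want to decode those multi-byte sequences, a sketch of the 1 to 3 byte cases following the bit layouts above could look like this; decode_utf8 is my own name and the error handling is deliberately minimal:

// Sketch: decode one UTF-8 sequence of 1 to 3 bytes into a code point,
// following the bit layouts above. Returns -1 on malformed input.
// The 4 byte case works the same way with three continuation bytes.
int decode_utf8(const unsigned char * p, int len)
{
   if (len >= 1 && p[0] < 0x80)                    // 0xxx xxxx
      return p[0];
   if (len >= 2 && (p[0] & 0xe0) == 0xc0 && (p[1] & 0xc0) == 0x80)
      return ((p[0] & 0x1f) << 6) | (p[1] & 0x3f); // 110x xxxx, 10yy yyyy
   if (len >= 3 && (p[0] & 0xf0) == 0xe0 &&
       (p[1] & 0xc0) == 0x80 && (p[2] & 0xc0) == 0x80)
      return ((p[0] & 0x0f) << 12) |               // 1110 xxxx,
             ((p[1] & 0x3f) << 6)  |               // 10yy yyyy,
             (p[2] & 0x3f);                        // 10zz zzzz
   return -1; // malformed or a longer sequence
}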

Alf
 
LVL 1

Author Comment

by:ocjared
ID: 8105784
Thanks! Much appreciated. -J
