When is Unicode not Unicode? When Microsoft gets involved!

evilrix
An expert in cross-platform ANSI C/C++ development, specialising in meta-template programming and low latency scalable architecture design.
Windows programmers of the C/C++ variety, how many of you realise that since Windows 9x Microsoft has been lying to you about what constitutes Unicode? They will have you believe that Unicode requires you to use a WCHAR (wide) character type and that Unicode cannot be represented by a CHAR (narrow) character type. In fact, both of these statements are completely and utterly false. Microsoft has misled you in the most egregious way.

Before we go any further, I need to clarify some terminology that is often confused. This is especially true of Windows programmers who, quite often, mistakenly believe that using a wide character type means they are using Unicode:

Character Set: This is a complete set of characters recognized by the computer hardware and software.

Character Encoding: This is a way of encoding a character set, generally to fit within the boundaries of a particular data type. ASCII, ANSI and UTFx are all examples of character encodings.

Character Type: This is a fundamental data type used to represent a character.

These three things are intrinsically related. The character type chosen to represent a character set will have a direct impact on the character encoding used. In C++, the normal fundamental character types are either wchar_t (wide) or char (narrow). The sizes of the narrow and wide types are platform dependent, although C++11 has introduced fixed sized character types. For the purposes of this discussion, it being Windows centric, we will assume wide is 16 bit and narrow is 8 bit.

Unicode code points run from U+0000 to U+10FFFF, which takes 21 bits; in practice that means a 32 bit character type. That's it, the end. If you want to work with raw Unicode code points you have no choice but to use a 32 bit character type. That said, some very clever people who work for The Unicode Consortium realised that the majority of the western world uses the Latin alphabet and most of this can be represented using just 8 bits. The majority of the rest of the world uses characters that can be represented by 16 bits. The remainder requires more than 16 bits. On that basis, forcing the world to adopt 32 bit character types would be, for most of us, completely insane. Of course, the same could be said for 16 bits... eh, Microsoft?

Those clever people went on to invent a number of Unicode Transformation Formats (character encodings) that allow Unicode to be encoded using character types smaller than 32 bits. The most common of these are UTF16 and UTF8, although other less common encodings do exist. These are encoding formats that represent Unicode code points as multi-unit sequences of either 16 or 8 bit character types. Of the two, UTF8 is by far the most efficient for the majority of cases and has the advantage of being directly backwards compatible with systems designed to only use ASCII (meaning all old programs will just work).

Unfortunately, Microsoft decided to jump on the Unicode bandwagon without really thinking things through and, in their infinite wisdom, decided to adopt UTF16 as the standard encoding format for Unicode on the Windows platform. Frankly, this couldn't have been a worse decision, and it is one that has plagued Windows programmers the world over ever since. The rest of the sane world realised that UTF16 was just stupid and decided to use UTF8. Amazingly enough, the rest of the world has no significant problems writing programs that will work with Unicode in a portable fashion. Windows on the other hand... um... no!

The reason UTF16 makes no sense is because not only is it very wasteful for the majority of us who just use plain old ASCII most of the time, it's also a real pain to use, especially if you want to be able to generate data that is portable and can be used cross-platform. You see, each code unit in a UTF16 encoding is larger than a byte and so the storage and retrieval of text encoded in this format requires that the reader be able to identify and (if necessary) convert the encoding format to the correct endianness for the platform.

Also, most legacy programs are written using narrow character types and so to "port" these over to use Unicode means making major changes to the code base to use wide character types. Now, this might sound like a simple "search and replace" but it's really not. In a language such as C or C++, where the programmer, and not the compiler, is completely responsible for preserving the integrity of the memory, introducing a larger data type without reviewing each and every change to make sure it doesn't bust data boundaries is coding suicide. Basically, this decision to use UTF16 meant that all existing code has to be broken and then fixed to be international friendly. That is a huge cost to business, and so most just didn't (and don't) bother!

Further, regardless of what Microsoft would have you believe, UTF16 is still a multi-unit format because you can't represent the full Unicode code-point range using single 16 bit types. Sure, the majority of usable code points will fit into 16 bits, but it is a lie to say that Unicode can be represented by a single wide character type. It's just impossible. 32 bits into 16 bits does not fit! A quart does not fit into a pint pot! The use of "Unicode" in Windows is a broken promise that just makes life oh so unnecessarily hard for the software engineer.

By contrast, other platforms (Linux, for example) use UTF8 natively. This means that all data can be stored and retrieved using narrow types. Because UTF8 is a byte level encoding format it has no sense of endianness and so is easy to port between different platforms. It is also way more efficient than UTF16 because, in the general case of only using the standard ASCII character set, each character requires only 1 byte (rather than 2) to be represented. Even for text that does contain non-ASCII characters it's normally still way more efficient, because only the non-ASCII characters need more than one byte each. UTF8 is a highly efficient encoding format, UTF16 is just not!

When Microsoft talks of Unicode please don't be confused. They are NOT referring to Unicode, they are referring to the UTF16 encoding format. They do this because they want the world to believe that only 16 bits and UTF16 can be used to represent Unicode. They do this because they don't want you to know how stupid they were to decide to use this pointless encoding format. Yes, this is a way to represent Unicode, but it is not the only way and from a software engineering point of view it is quite probably the stupidest way.

Further, whereas the rest of the sensible software engineering world uses UTF8 as a narrow character encoding format to represent Unicode, Microsoft insists on sticking with ANSI Code Pages. Unlike UTF8, these cannot represent the full range of the Unicode character set and, worse, unless you know the original code-page you have absolutely no idea what the encoding format actually represents. You may as well be working with a file of random binary, because that's about as useful as an ANSI format file with no code-page information would be. It wouldn't be so bad if Microsoft offered UTF8 as a native encoding format, but to date, this isn't the case. It's UTF16, ANSI or nothing!

So, Windows programmers, when you start talking about your project being "Unicode" please remember that to the rest of the sane world this phrase is meaningless. All you are saying is that your project uses wide rather than narrow data types for representing characters and you just so happen to have been fooled into using UTF16 when you could quite as easily have used UTF8. That's right, you don't have to use UTF16, even in Windows, to be Unicode friendly, you can use UTF8. There, I said it. The secret is out! I always code all my projects using narrow character types and, internally, I work with UTF8. I only convert (on Windows) to UTF16 when I absolutely have to (at the system API boundary).

But why do this? Doesn't that make life hard? Good question. Yes and no. Yes, because it means at some point I still have to convert to UTF16. No, because C++11 now provides nice, efficient tools to do this conversion process so it is pretty painless. What it does mean is that my code will work on any platform. By using a platform agnostic character encoding, which UTF8 is, it means my code will run just as well on Windows as it will on Linux or OS-X.

For more reading on why we should all be using UTF8, why forcing us to use UTF16 is just silly and why Microsoft owes us all a very large apology for the mess they have made of "Unicode" on Windows, I highly recommend taking a look at the excellent UTF8 Everywhere website.

8 Comments

Expert Comment by LajuanTaylor:
You have an extra "http" in the "UTF8 Everywhere" link.

This was a very informative article.

Administrative Comment by Eric AKA Netminder:
Lajuan,

Don't blame evilrix; when I edited the article, I missed that, but I've now fixed it.

ericpete
Page Editor

Expert Comment by LajuanTaylor:
@ericpete - No blame intended...

Author Comment by evilrix:
LajuanTaylor, thank you for your kind comment.

Expert Comment by Michael Greenspan:
Appreciate the article.  Going through Windows and Visual Studio upgrades and then trying to maintain older software has been a pain, and the UTF-16 issue was certainly a major one.

Appreciate the advice to continue with UTF-8 as you said... where I have done this, there were far fewer problems when updating. But, well, as you say, it is sort of a secret indeed... one has to figure out how to turn off warnings and errors, and find exactly where UTF-16 is required by the Windows API. Perhaps you could add a bit of advice at that point... but I guess there are lots of development tools, and each has its own settings and tricks, so maybe it is not practical.

Author Comment by evilrix:
>> Perhaps you could add a bit of advice at that point
Read the details on the website I linked: http://utf8everywhere.org/

There's a pretty comprehensive guide to avoiding using UTF16 when developing on Windows.

Thanks for your comment.

Expert Comment by DrDamnit:
"When Microsoft gets involved."

You had me at the title.

Excellent work, sir.

Expert Comment by pepr:
+1 ... not because it helped me [I learned it the more painful way], but because articles like this should be spread to improve the future. In the past, there was a lot of discussion about UTF-8 being impractical "because you cannot seek to a position". Actually, languages like Python 3 show that beginners need not care about how it is implemented inside. A Unicode character is represented by a number (as one logical unit). If the programming language gives you tools for accessing the parts of a string easily, you do not need to care about how many details must be solved underneath. You simply enjoy it when it works (and you feel safe when you know it is not an ad-hoc solution with some dark corners).