Have you tried to learn about Unicode, UTF-8, and multibyte text encoding and all the articles are just too "academic" or too technical? This article aims to make the whole topic easy for just about anyone to understand.
It's time to tackle multibyte characters, because I'm still coming across articles on the topic that are confusing. Also, even with a good understanding of UTF-8, I still think the PHP functions for utf8_encode() and utf8_decode() are confusing in their explanations. So let's start with the basics and work down to those functions. If you know some of the basics, well, you should be able to read through them faster.
Encoding 101: You Already Do It All Day Long
Everything around you is made up of different kinds of units. For example, a day is made up of 24 units called hours. At different times in the day, you might change what you're doing. For example, you might be sleeping from 12 AM to 7 AM. From 7 AM to 8 AM, you're getting up and eating breakfast. From 8 AM to 12 PM, you're working at your job. From 12 PM to 1 PM, you're eating lunch. From 1 PM to 5 PM, you're working some more. From 5 PM to 6 PM, you might eat dinner, and then from 6 PM to 10 PM, you might relax and read a book. From 10 PM to 12 AM, you sleep, and then the cycle repeats.
At the same time, a clock is simply working on counting the number of hours in the day. It doesn't really care what you do during each hour - it only cares about counting 24 hours. So in a single day, you have 24 hours but they mean something different to you than they do to a clock. If someone asked you and the clock to describe your days, the clock (if it could talk) might say, "Hour 1, Hour 2, Hour 3, etc..." but you would probably describe your day the way I did above - in chunks of hours.
File data is also -always- made up of units called bytes. You're probably saying, "Yeah, I know THAT!" However, the idea of encoding seems to occasionally screw with people's thinking. It doesn't matter what kind of encoding you're talking about, though. If you're dealing with everyday data and files, you're dealing with bytes.
In the computer world, the process of encoding is simply turning an idea or a concept into one or more bytes of data. Decoding is the process of reading those bytes and turning them back into the idea or concept.
The most basic and common encoding is text encoding. In fact, you're reading decoded bytes right now. Technically speaking, this entire web page is just a continuous series of thousands of regular ol' bytes that each range in value from 0 to 255 (and each byte is made up of 8 bits). The only reason you're able to read this instead of a bunch of raw data is because your web browser has been built to treat the raw data as if they are everyday letters, numbers, and symbols. So your web browser is actively decoding a bunch of raw bytes back into the idea of text, and then showing you the result.
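To make the "text is just bytes" idea concrete, here is a minimal PHP sketch. The function names text_to_byte_values() and byte_values_to_text() are made up for this illustration; ord() and chr() are the built-in PHP functions that convert between a one-byte character and its numeric value.

```php
<?php
// A PHP string is just a sequence of bytes; ord() exposes each byte's value.
function text_to_byte_values(string $text): array
{
    $values = [];
    for ($i = 0; $i < strlen($text); $i++) {
        $values[] = ord($text[$i]);
    }
    return $values;
}

// Decoding is the reverse: turn each byte value back into a character.
function byte_values_to_text(array $values): string
{
    return implode('', array_map('chr', $values));
}

echo implode(' ', text_to_byte_values('Hi!')), "\n"; // 72 105 33
echo byte_values_to_text([72, 105, 33]), "\n";       // Hi!
```

The same three bytes only look like "Hi!" because something decided to draw them as text.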
If you tried to tell your web browser to open up some uncommon file like a WordPerfect document that you created back in 1990, it wouldn't be quite sure what to do with it. You might end up seeing a white web page with gibberish data all over it. That's because a WordPerfect document has its own special encoding, which a web browser doesn't automatically know about. So trying to treat that file's bytes as if they are regular text just won't look right.
Encoding 102: History Explains The "Why"
Can you imagine trying to use a computer without a monitor and a keyboard? Back when computers were just big calculators for corporations, nobody needed to type in letters of the alphabet, nor get them back out. When computers became more accessible to the general public, basic communication between the computer and the user was a pretty important thing. Since computers could store bytes that had values between 0 and 255, someone figured that they could assign meanings to each one of those values. The first 32 values were each used for some basic system controls, like making the internal speaker beep. These weren't really considered "printable" characters - they weren't meant to actually be displayed anywhere.
The concept of printable characters was to display a picture on the screen that looked like something that you and I would easily recognize. Computers had no idea what the letter "A" was - they didn't think like that. A printable character was simply one of those values that could be turned into a specific arrangement of pixels on the screen that LOOKED like the intended character. So when code ran that displayed stuff on the screen, it would see the byte value that meant "A" and then draw a bunch of dots on the screen that looked like the letter "A." Meanwhile, the computer couldn't care less about what those dots looked like.
After those first 32 control characters, the next 16 values started getting into printable characters - things that might show up on a screen. Each of those 16 values represented different symbols, like the % sign, or a simple space, or parentheses. Then the next 10 values would display the numbers 0 through 9. After that were 7 more symbols, like the = equals sign. Then they added in the 26 upper-case letters of the alphabet, then some more symbols, then the lower-case letters and a couple more symbols. At this point, they had assigned various characters / meanings to 128 of the 256 total possible values of any single byte. They figured this was all they would need to be able to write American English documents. Here's what it looked like:
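As a rough stand-in for that layout, the ranges can be sketched in a few lines of PHP. The function name ascii_group() and its group labels are invented for this example; the number ranges are the standard ASCII ones.

```php
<?php
// Classify a byte value according to the ASCII layout (values 0 through 127).
function ascii_group(int $value): string
{
    if ($value < 0 || $value > 127) return 'not ASCII';
    if ($value < 32 || $value === 127) return 'control';
    if ($value >= 48 && $value <= 57) return 'digit';
    if ($value >= 65 && $value <= 90) return 'upper-case letter';
    if ($value >= 97 && $value <= 122) return 'lower-case letter';
    return 'symbol'; // space, punctuation, and so on
}

// Print a few sample values with the character each one stands for.
foreach ([7, 32, 37, 48, 61, 65, 97, 126] as $value) {
    $group = ascii_group($value);
    $shown = $group === 'control' ? '(not printable)' : chr($value);
    echo str_pad((string) $value, 3, ' ', STR_PAD_LEFT), "  $shown  $group\n";
}
```

Value 7 is the "beep" control character mentioned earlier, which is why it has nothing to draw.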
Anyway, some group said, "We're going to take the group of the first 128 values and call it the 'ASCII' text encoding - hurray!" (Yes, it stands for something that you will likely never need to remember.) Everyone rejoiced and was happy with using ASCII for many years.
Then those guys over at the ISO (the organization responsible for creating all those international standards) said, "Hey, we still have 128 characters left - how about we use them to store some common accented letters and symbols to make computers more accessible to people like Juan Pablo, who may want to write a letter about his niños [Spanish for 'children'] or for people who want to type up their résumés?" And so those guys came up with their own version of text encoding called ISO-8859-1 (actually it's ISO-8859 and the -1 is just the main Latin version of it). This thing was basically the original 128 characters of ASCII plus another 128 ones to make it more international:
NOTE: There are other similar variations on all of these things, but we'll stick with ISO-8859-1 for now.
One day, someone realized that there was no way to display Chinese characters.... or Japanese characters... or Hebrew characters... or certain Russian characters..., or Elvish or Klingon letters (God forbid), etc, etc, etc... Every byte was limited to 256 values and each of them was already taken up, and there were simply thousands of other types of characters that needed to be displayed for different reasons. The world was too big for ASCII!
Then someone said, "HEY! What if we..." and they figured out a good way to fix it, the end. Just kidding, keep reading.
Encoding 103: This Probably Isn't Historically Accurate
"...use more than just one byte?" the person finished asking out loud (we'll call him Bob).
"We'll call it... 'multibyte' text encoding! Hmmm... but how many bytes will we need...? Hey, Mary, how many languages do you think there are in the world?"
"Oh, probably about a thousand."
"Good, good... so if each of those languages has 100 characters or less on average, we're probably dealing with about 100,000 new characters. Two bytes together can hold up to 64k different values, but FOUR bytes together could hold up to 4 BILLION different values! That means we could handle up to 4 BILLION different characters! Maniacal laugh!"
"So now every time I write ABC, it's going to take up 12 bytes instead of 3? Way to go, genius."
"I guess we'll need to think about that some more, but it's a great idea! Right? .... Right...?"
"Okay, well, -I- think it's a great idea and I'm going to call this giant table of characters... 'Unicode'... yeah... That sounds cool."
"You just said you were going to call it multibyte text encoding."
"I did... yes, I did. That's because multibyte will refer to the general IDEA of using multiple bytes to store characters, while Unicode will be the actual map of characters."
Encoding 104: More Efficient Multibyte Encoding
Obviously, Mary had a point about the number of bytes. Imagine making the world switch over to some text encoding that required a total of 4 bytes for every single character written. A 1 megabyte text document would instantly become 4 megabytes! Sure, we'd solve the language problem, but it would be terribly inefficient, especially since the vast majority of languages could fit in 2 bytes, and pretty much all the English characters already fit into 1 byte.
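Bob's and Mary's numbers are easy to check. The sketch below assumes a made-up helper, fixed_four_byte_encode(), that stores each ASCII character as a 4-byte big-endian number using PHP's built-in pack(); it only illustrates the size cost and is not a real encoder.

```php
<?php
// Store every character as a fixed 4-byte number, the way Bob proposed.
// Assumes the input is plain ASCII, so each byte is one character.
function fixed_four_byte_encode(string $asciiText): string
{
    $out = '';
    for ($i = 0; $i < strlen($asciiText); $i++) {
        $out .= pack('N', ord($asciiText[$i])); // 'N' = unsigned 32-bit, big-endian
    }
    return $out;
}

echo 256 ** 2, "\n"; // number of values two bytes can hold
echo 256 ** 4, "\n"; // number of values four bytes can hold
echo strlen('ABC'), ' bytes become ', strlen(fixed_four_byte_encode('ABC')), " bytes\n";
```

Every character pays for three extra bytes, whether it needs them or not.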
So someone figured out a way to get the best of both worlds. Some characters would be stored in one byte, while others would be stored in 2 bytes, others in 3 bytes, and some in 4 bytes. So a text document that used nothing but American English characters would end up only using 1 byte for each character. Another document that had a mix of American English and Chinese characters might use 1 byte for the American English characters and 3 bytes for most of the Chinese characters (with 4 bytes for some of the rarer ones). This idea was called UTF-8 and it was a great and joyous thing.
The bigger question was HOW could this work?
I think it's time for another analogy!
Imagine you're watching some kid walk along a sidewalk that is made up of 100 tiles, and he is hopping from one tile to the next, one-by-one. It -should- take him 100 hops to go from the beginning to the end of the sidewalk. Halfway through, he lands on a tile that has a little bit of water in it. He realizes that the next two tiles are completely flooded and he isn't supposed to jump into puddles of water! So he makes a big jump OVER the two flooded tiles, and keeps hopping along. At the end of the sidewalk, he's made a total of 98 hops, even though there were 100 tiles.
This is yet another example of an encoding in real life and it mimics the text encoding we're about to talk about. Normally, every hop is simply one tile, the way that ASCII encoding deals with every byte as one character. When the child started at the beginning of the sidewalk, he already knew that he wasn't supposed to jump into puddles of water. So as long as each tile was dry, he just kept hopping along, one tile per hop.
Once he reached the first wet tile, that little bit of water acted like a signal to him. That signal told him to look ahead for the puddles of water. As a result, he just had to treat those flooded tiles differently - he jumped over them and continued to hop along the sidewalk. Now, if he had not been told how to deal with puddles of water, he would continue hopping tile-by-tile and would splash into each flooded tile and create a mess.
Single-byte encoding, like ASCII or ISO-8859-1, simply treats every byte as one character. If that character is not a normal printable character, then some gibberish character would show up. For example, some of the first 32 control characters will show up as a music symbol or a line corner, or the male or female symbol.
UTF-8 encoding is the most common multi-byte text encoding out there, and it acts like the kid who is jumping over the puddles. When an application tries to read a series of bytes as a UTF-8 string, it starts by reading bytes one-by-one, just like ASCII. In many cases where there are no multibyte characters, it may act EXACTLY like ASCII. However, it is also paying attention to the byte value each time.
If the byte value is within a certain range (128 or higher), the application gets a signal to take a closer look at the byte. If that byte's value is between 192 and 223, then that tells the application that it's dealing with a 2-byte character, so there should be one more extra byte. For example, the letter é is stored as the bytes 195 and 169: the first byte with a value of 195 tells the application that there's one more byte (169). If that first byte value is between 224 and 239, then it's a 3-byte character, so there should be TWO more bytes to read. If the byte value is between 240 and 247, then it's a 4-byte character, so there will be THREE more bytes to read. There are also some rules on the values of the extra bytes (each one must fall between 128 and 191), but the first byte is the one that tells the application how many bytes to read in order to assemble that one character. Once it reads in all the bytes for a character, it strips out the signal bits, combines what's left into a single number, and then finds the correct Unicode character to display for that number.
As soon as it finishes with that character, it goes back to the one-by-one byte reading, and keeps repeating that whole process until everything is read.
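The hop-and-jump process described above can be sketched in PHP roughly like this. The function names utf8_sequence_length() and utf8_to_code_points() are invented for the example, and the decoder is deliberately simplified: a strict decoder rejects a few more malformed sequences than this one does.

```php
<?php
// How many bytes does a character use, judging only by its first byte?
function utf8_sequence_length(int $firstByte): int
{
    if ($firstByte < 128) return 1;                       // plain ASCII
    if ($firstByte >= 192 && $firstByte <= 223) return 2; // signal: one more byte
    if ($firstByte >= 224 && $firstByte <= 239) return 3; // signal: two more bytes
    if ($firstByte >= 240 && $firstByte <= 247) return 4; // signal: three more bytes
    return 0; // 128-191 are "extra" bytes and cannot start a character
}

// Walk a UTF-8 string and return the Unicode number of each character.
function utf8_to_code_points(string $bytes): array
{
    $codePoints = [];
    $i = 0;
    $total = strlen($bytes);
    while ($i < $total) {
        $first = ord($bytes[$i]);
        $length = utf8_sequence_length($first);
        if ($length === 0 || $i + $length > $total) {
            throw new InvalidArgumentException("Invalid UTF-8 at byte offset $i");
        }
        // Keep only the data bits of the first byte (drop the signal bits).
        $value = $length === 1 ? $first : $first & (0xFF >> ($length + 1));
        for ($j = 1; $j < $length; $j++) {
            $next = ord($bytes[$i + $j]);
            if ($next < 128 || $next > 191) {
                throw new InvalidArgumentException('Invalid UTF-8 at byte offset ' . ($i + $j));
            }
            $value = ($value << 6) | ($next & 0x3F); // each extra byte carries 6 bits
        }
        $codePoints[] = $value;
        $i += $length; // hop over the whole character
    }
    return $codePoints;
}

print_r(utf8_to_code_points("A\xC3\xA9")); // "A" followed by bytes 195 and 169
```

For that input the result is 65 (the letter A) and 233 (the letter é): three bytes, two characters.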
If someone used an application to try and read a UTF-8 encoded file as if it were ISO-8859-1 encoded, then they'd likely see some bizarre characters show up, because the application would be treating each of those special UTF-8 signal bytes and extra bytes as if it were a separate character that some weird author had decided to type into the file.
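Here is a small sketch of what that misreading looks like. The helper misread_utf8_as_latin1() is hypothetical: it treats each byte as its own ISO-8859-1 character and writes that character back out as UTF-8, purely so the result can be shown on a UTF-8 screen.

```php
<?php
// Treat every byte as its own ISO-8859-1 character and re-encode it as UTF-8
// for display. Bytes below 128 are identical in both encodings.
function misread_utf8_as_latin1(string $utf8Bytes): string
{
    $shown = '';
    for ($i = 0; $i < strlen($utf8Bytes); $i++) {
        $byte = ord($utf8Bytes[$i]);
        $shown .= $byte < 128
            ? chr($byte)
            : chr(0xC0 | ($byte >> 6)) . chr(0x80 | ($byte & 0x3F));
    }
    return $shown;
}

$resume = "r\xC3\xA9sum\xC3\xA9"; // "résumé" stored as UTF-8 (8 bytes)
echo misread_utf8_as_latin1($resume), "\n"; // shows "rÃ©sumÃ©" on a UTF-8 terminal
```

Each é turns into two characters, because bytes 195 and 169 happen to be "Ã" and "©" in ISO-8859-1.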
Encoding 105: So I've Heard About UTF-16 and UTF-32
I've heard more than one developer incorrectly say that UTF-8 can only display a certain portion of the Unicode table and you need UTF-16 or UTF-32 to display larger characters like Chinese. That's totally wrong and if you hear someone say that after you read this, then you should go correct them to prevent the spread of bad information.
UTF-8, UTF-16, and UTF-32 can all display every Unicode character. So what are the differences?
The differences between them are mostly a matter of performance and programming efficiency. While UTF-8 is fantastic for the vast majority of applications out there, it requires a little more "thinking" to process. Every byte has the potential for four different outcomes (1-byte character up to 4-byte character), and each outcome has slightly different programming (a three-byte character requires code that reads another two bytes, while a two-byte character requires code that reads only one more byte). Since it has to run this logic on every byte, it can be a little slow. You probably won't be able to tell that it's slow, but if you're trying to read a 10-megabyte file that is marked as UTF-8, then that's roughly 10 million bytes to check! Luckily, we have pretty fast processors nowadays that can do it all in the blink of an eye, which makes UTF-8 pretty viable for most applications.
UTF-16 also does the logic checking, but it uses a MINIMUM of 2 bytes per character. So even for the English letter "A", you'll end up with 2 bytes of data if it's encoded with UTF-16. The specifics get a little complex (they're public information), but the gist is that the way that UTF-16 is stored means that it spends less data on the "signal" and "logic" information, allowing more of those precious bits to be used for identifying the right character. If you're dealing with a language that primarily requires two or more bytes anyway for each character, this can lead to a nice performance increase. You can tell the programming language to read the bytes two at a time instead of just one. So for a 10-megabyte file encoded with UTF-16, the entire file can be read in roughly 5 million loops (2 bytes at a time), which means HALF of the number of times that the system has to check for "signals."
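As a rough illustration of the 2-byte minimum, the sketch below encodes a single Unicode number as UTF-16. The function name utf16be_encode_code_point() is made up, big-endian byte order is assumed, and the input is not validated.

```php
<?php
// Encode one Unicode number as UTF-16 (big-endian byte order assumed here).
function utf16be_encode_code_point(int $codePoint): string
{
    if ($codePoint < 0x10000) {
        return pack('n', $codePoint); // one 2-byte unit
    }
    // Characters past 65,535 are split across two 2-byte units (a "surrogate pair").
    $offset = $codePoint - 0x10000;
    $high = 0xD800 | ($offset >> 10);
    $low = 0xDC00 | ($offset & 0x3FF);
    return pack('n2', $high, $low);
}

echo bin2hex(utf16be_encode_code_point(0x41)), "\n";    // 0041: "A" takes 2 bytes
echo bin2hex(utf16be_encode_code_point(0x1F600)), "\n"; // d83dde00: an emoji takes 4 bytes
```

So UTF-16 still has a "signal" of its own: a unit in the D800 to DBFF range means one more unit follows.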
UTF-32 does away with the logic checks altogether and is basically Bob's original idea from Encoding 103 above. Every character takes up 4 bytes, so there is no need for any signal-checking, and an application can literally speed through a file, reading 4 bytes at a time and immediately being able to look up the right character and move onto the next 4 bytes. So UTF-32 is the fastest encoding to read, but it's also the biggest waste of storage space. Given the huge hard drives we have nowadays, UTF-32 can be a good encoding format for situations where you care more about the speed of reading data than about how much space it takes up.
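Because every character is the same width, a UTF-32 reader needs almost no logic. A minimal sketch, assuming big-endian byte order and a made-up function name, utf32be_to_code_points():

```php
<?php
// Decode UTF-32 (big-endian assumed): every character is exactly 4 bytes.
function utf32be_to_code_points(string $bytes): array
{
    if (strlen($bytes) % 4 !== 0) {
        throw new InvalidArgumentException('UTF-32 data must be a multiple of 4 bytes');
    }
    if ($bytes === '') {
        return [];
    }
    // unpack() returns a 1-indexed array; array_values() renumbers it from 0.
    return array_values(unpack('N*', $bytes));
}

$abc = "\x00\x00\x00\x41\x00\x00\x00\x42\x00\x00\x00\x43"; // "ABC" in UTF-32, 12 bytes
print_r(utf32be_to_code_points($abc)); // 65, 66, 67
```

There are no signals to check, but "ABC" costs 12 bytes instead of 3.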
These are not the only differences, but they're typically the biggest ones for deciding which encoding to use (if you're writing your own decoder/encoder from scratch, then you'd want to pay particular attention to the endianness of UTF-16 and UTF-32). There are plenty of other resources that will get into the more technical specifics of the differences, but UTF-8 is often the best choice for most developers (especially since it can read ASCII-encoded files without any extra effort).
There are lots of other text encoding schemes out there, but we'll stop here for now.
Encoding 106: Programming Gotchas
At the very beginning of the article, I mentioned some confusion around PHP's functions for utf8_encode() and utf8_decode(). The motivation for writing this article was that some PHP developers don't understand what these functions are actually doing, leading to incorrect usage and sometimes data corruption or loss! There are also similar routines in other languages.
Unicode kept all of the ISO-8859-1 characters at the same positions in its table, but that doesn't mean UTF-8 stores them the same way. For example, the symbol for Japanese yen is a single byte in the ISO-8859-1 encoding and uses the byte value of 165. This means that if a file is saved with that ISO-8859-1 encoding, then it can store that symbol in just one byte. Great, right?
However, UTF-8 can't store that same character as the single byte 165, because byte values of 128 and higher are reserved for the multibyte signals described earlier. So if you put the yen symbol into a file and save the file with UTF-8 text encoding, it would actually use two bytes when saving the yen symbol - the first byte would have a value of 194 and the second byte would be 165.
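Where do 194 and 165 come from? The sketch below packs a Unicode number into UTF-8 bytes using the standard bit layout. The function name code_point_to_utf8() is invented for this example, and it skips some checks a strict encoder would make.

```php
<?php
// Encode one Unicode number as UTF-8 bytes (simplified: no surrogate check).
function code_point_to_utf8(int $cp): string
{
    if ($cp < 0 || $cp > 0x10FFFF) {
        throw new InvalidArgumentException('Not a Unicode number');
    }
    if ($cp < 0x80) {      // fits in 7 bits: one byte, same as ASCII
        return chr($cp);
    }
    if ($cp < 0x800) {     // two bytes: signal byte plus one extra byte
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {   // three bytes
        return chr(0xE0 | ($cp >> 12))
            . chr(0x80 | (($cp >> 6) & 0x3F))
            . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xF0 | ($cp >> 18)) // four bytes
        . chr(0x80 | (($cp >> 12) & 0x3F))
        . chr(0x80 | (($cp >> 6) & 0x3F))
        . chr(0x80 | ($cp & 0x3F));
}

// The yen symbol is number 165 in the Unicode table.
echo implode(' ', array_map('ord', str_split(code_point_to_utf8(165)))), "\n"; // 194 165
```

The second byte being 165 is a coincidence of the bit layout; for é (number 233) the two bytes are 195 and 169.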
Of course, you can only choose one encoding for text, so if you had a sentence that was encoded with ISO-8859-1 and used the yen symbol, then you would probably run into problems if you tried to give that text to some system or program that was expecting to receive UTF-8-encoded text. It wouldn't be entirely sure what to do with that yen symbol. Many, but not all, systems will use some extra programming logic to automatically figure out how to respond and will convert the encoding on-the-fly. If a system doesn't do this, it might just throw up its hands, claim it's the end of the world, and spit back an error.
So programming languages will usually have a way to convert between encodings. PHP has several different methods, and it even has some special extensions for doing nothing but handling multibyte encodings and converting between them all. One of the most common requests is this conversion between ISO-8859-1 and UTF-8, so this is where the utf8_encode() and utf8_decode() functions come in (or if you use another language, it probably has its own set of functions that work the same way).
Most of these functions aren't smart enough to be able to tell what kind of encoding is already in place, so it's up to you to either know or to figure out (using tools to detect the encoding if you're not sure). Knowing what the encoding currently is - that's the key to using your programming language's text encoding functions correctly.
In PHP, the utf8_encode() function will take a string of text and assume it is encoded as ISO-8859-1. It will then look for any bytes with a value of 128 or higher, which UTF-8 cannot store as a single byte. If it finds them, it swaps out that old single-byte character for the two UTF-8 bytes that represent the same character in the Unicode table.
So if you passed it the single-byte version of the yen symbol (from ISO-8859-1), like this:
$utf8_yen_symbol = utf8_encode(chr(165));
...then $utf8_yen_symbol would be a string containing two bytes (194 and 165).
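For readers who want to see the conversion spelled out, here is a hand-written sketch of the same idea. The name latin1_bytes_to_utf8() is made up; it is meant to mirror what utf8_encode() does, not to replace it. (Worth knowing: utf8_encode() and utf8_decode() are deprecated as of PHP 8.2, which is another reason the example avoids calling them.)

```php
<?php
// Convert ISO-8859-1 bytes to UTF-8, one byte at a time.
// Assumes the input really is ISO-8859-1; the function cannot check that.
function latin1_bytes_to_utf8(string $latin1): string
{
    $utf8 = '';
    for ($i = 0; $i < strlen($latin1); $i++) {
        $byte = ord($latin1[$i]);
        if ($byte < 128) {
            $utf8 .= chr($byte); // ASCII range: the byte is already valid UTF-8
        } else {
            // 128-255 become a signal byte (194 or 195) plus one extra byte.
            $utf8 .= chr(0xC0 | ($byte >> 6)) . chr(0x80 | ($byte & 0x3F));
        }
    }
    return $utf8;
}

$yen = latin1_bytes_to_utf8(chr(165));
echo strlen($yen), ' bytes: ', ord($yen[0]), ' ', ord($yen[1]), "\n"; // 2 bytes: 194 165
```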
You would use utf8_decode() if you were heading in the OPPOSITE direction. Let's say you are exporting data from a system that uses UTF-8 encoding and you are trying to store the data into a system that only supports ISO-8859-1 text. Running utf8_decode() on text would see the two-byte yen character in the original data, and it would try to find the equivalent character for ISO-8859-1, and would then swap out the two-byte version for the one-byte version.
Because UTF-8 is capable of over a million characters while ISO-8859-1 is only capable of 256 (sort of), you can't always go in that direction without the chance of losing data. If utf8_decode() can't find a corresponding character in ISO-8859-1, it'll simply replace the multi-byte character with a simple question mark.
If that didn't make you do a double-take, then nothing will. Yes, utf8_decode() will DESTROY any data that it cannot convert from UTF-8 to ISO-8859-1. It will preserve only the characters that it CAN convert successfully, so you might end up with a paragraph of text that has more question marks in it than you remember. And once it's been converted to a question mark, you can't use utf8_encode() to convert it back to the original value, because the question mark is actually a valid character and is the same in both UTF-8 and ISO-8859-1.
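The data loss is easy to reproduce by hand. This sketch uses a made-up function, utf8_to_latin1_lossy(), that imitates the question-mark behavior described above; it assumes well-formed UTF-8 input and does not validate it.

```php
<?php
// Convert UTF-8 to ISO-8859-1, replacing anything that doesn't fit with "?".
// Simplified: assumes the input is well-formed UTF-8.
function utf8_to_latin1_lossy(string $utf8): string
{
    $out = '';
    $i = 0;
    $total = strlen($utf8);
    while ($i < $total) {
        $first = ord($utf8[$i]);
        if ($first < 128) {
            $out .= chr($first); // ASCII: same byte in both encodings
            $i += 1;
        } elseif (($first === 0xC2 || $first === 0xC3) && $i + 1 < $total) {
            // Two-byte characters starting with 194 or 195 are the only ones
            // that land in ISO-8859-1's 128-255 range.
            $out .= chr((($first & 0x1F) << 6) | (ord($utf8[$i + 1]) & 0x3F));
            $i += 2;
        } else {
            // No ISO-8859-1 equivalent: skip the whole character, keep a "?".
            $out .= '?';
            $i += $first >= 240 ? 4 : ($first >= 224 ? 3 : 2);
        }
    }
    return $out;
}

$original = "\xC2\xA5 and \xE4\xB8\xAD"; // yen symbol, then a Chinese character
$converted = utf8_to_latin1_lossy($original);
echo strlen($original), ' bytes in, ', strlen($converted), " bytes out\n";
echo substr($converted, -1), "\n"; // the Chinese character is now a plain "?"
```

The yen symbol survives as the single byte 165, but nothing in the output records which character the question mark used to be.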
Also, if you already have text with multibyte characters that is encoded as UTF-8, and you run utf8_encode() on the string, it will likely end up adding extra garbage characters into your text, because it starts with the assumption that the text is currently in a single-byte text encoding. So it will ASSUME that all of the bytes in those multibyte characters are actually individual characters from ISO-8859-1 that still need to be converted to UTF-8. D'oh!
So knowing the original text encoding is essential to knowing whether you need to use a UTF-8 encoding or decoding function, or whether you don't need to do anything at all!
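One common safeguard is to check whether the text is already valid UTF-8 before converting it. The sketch below relies on a real PHP behavior, preg_match() with the /u modifier failing on invalid UTF-8, but the function names and the "anything else must be ISO-8859-1" rule are assumptions made for this example. Treat it as a guess, not a detector: some ISO-8859-1 text is also valid UTF-8 by accident.

```php
<?php
// PCRE's /u modifier makes preg_match() fail on bytes that are not valid UTF-8.
function looks_like_utf8(string $text): bool
{
    return preg_match('//u', $text) === 1;
}

// Same conversion idea as utf8_encode(): every byte is one ISO-8859-1 character.
function latin1_to_utf8_once(string $latin1): string
{
    $utf8 = '';
    for ($i = 0; $i < strlen($latin1); $i++) {
        $byte = ord($latin1[$i]);
        $utf8 .= $byte < 128
            ? chr($byte)
            : chr(0xC0 | ($byte >> 6)) . chr(0x80 | ($byte & 0x3F));
    }
    return $utf8;
}

// Assumption: text that is not valid UTF-8 is ISO-8859-1.
function ensure_utf8(string $text): string
{
    return looks_like_utf8($text) ? $text : latin1_to_utf8_once($text);
}

$alreadyUtf8 = "\xC2\xA5"; // yen symbol, already UTF-8
echo bin2hex(latin1_to_utf8_once($alreadyUtf8)), "\n"; // c382c2a5: garbage from double encoding
echo bin2hex(ensure_utf8($alreadyUtf8)), "\n";         // c2a5: left alone
echo bin2hex(ensure_utf8("\xA5")), "\n";               // c2a5: converted once
```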
While Unicode is not specific to any particular programming language, I covered PHP specifically because of the number of people who get into PHP as their initial programming language and then create data import or export scripts that work in one situation but then accidentally end up corrupting data that is in a different initial encoding. That said, I've put together a small PHP script that can be dropped into a simple web server and executed as-is, and should help show how the characters are stored and how they are parsed out. You can download it below:
UTF-8 Example Code
If you want to read more about UTF and get into some of the heavier technical stuff, like byte-order marks (BOM) and so on, the Unicode Consortium (I'm sure they probably employ at least one Bob and one Mary) has an FAQ here:
Thanks for reading!
Copyright © 2016 - Jonathan Hilgeman. All Rights Reserved.