Unicode has been developed to describe all possible characters of all languages, plus a large number of symbols, with one unique number for each character or symbol. Unicode as defined by the Unicode Consortium has become a universal standard: ISO/IEC 10646, describing the 'Universal Multiple-Octet Coded Character Set' (UCS).
It is not always possible to transfer a raw Unicode character number to another computer reliably. For that reason a special encoding scheme has been developed, UTF-8, which stands for UCS Transformation Format, 8-bit form.
On this page you will find an overview of the UTF-8 encoding scheme.
This page is encoded as Windows-1252. Your browser should support this character set.
Let us take for example the trademark sign, which looks something like a higher positioned TM.
On a Macintosh you can produce this sign by taking character position number 170 decimal. On a Windows computer with the CP1252 character set this is position 153 decimal. Unicode is the same for all users, and in this scheme the trademark sign can be found at position 2122 hexadecimal, which is the same as 8482 decimal.
On a web page you could try to encode this character with a named entity like &amp;trade;, but not each and every browser is able to reproduce many of those 'entities'. If your reader has a Netscape or Internet Explorer version 4 browser or better, the best thing you can do is encode the trademark sign with a numerical Unicode entity like &amp;#8482;. In a special META tag your page then has to be declared as a UTF-8 page. This is explained in detail on the page with entity tips.
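Such a numerical entity can be generated mechanically from any character. A minimal Python sketch (the function name is my own):

```python
def numeric_entity(ch):
    """Return the decimal numeric character reference for a character."""
    return "&#{};".format(ord(ch))

print(numeric_entity("\u2122"))  # trademark sign -> &#8482;
```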
If you write an email in, for instance, Microsoft Outlook Express and let the mail program encode your message as UTF-8, then Outlook Express converts the trademark sign to its UTF-8 code. The result in this case is a combination of three bytes with the numerical values 226, 132 and 162. Which characters you see on your screen without UTF-8 decoding depends on your platform and your active character set. A Windows viewer will see a letter a with a circumflex, followed by a double low quotation mark and a cent sign.
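This mis-display can be reproduced directly in Python, which ships codecs for both UTF-8 and code page 1252:

```python
# Encode the trademark sign as UTF-8, then look at those same three bytes
# the way a Windows-1252 viewer would.
tm = "\u2122"                  # TRADE MARK SIGN, U+2122
data = tm.encode("utf-8")
print(list(data))              # [226, 132, 162]
print(data.decode("cp1252"))   # a-circumflex, double low quote, cent sign
```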
How did the encoding program get these numbers?
The proper way to convert between UCS-4 and UTF-8 is to use bitmask (and, or) and bitshift operations. But if you only want to convert a couple of characters by hand, or if your program development environment (scripting language) does not support bit operations, then integer division and multiplication can be used as follows.
From Unicode UCS-4 to UTF-8:
Start with the Unicode number expressed as a decimal number and call this ud.
If ud <128 (7F hex) then UTF-8 is 1 byte long, the value of ud.
If ud >=128 and <=2047 (7FF hex) then UTF-8 is 2 bytes long.
byte 1 = 192 + (ud div 64)
byte 2 = 128 + (ud mod 64)
If ud >=2048 and <=65535 (FFFF hex) then UTF-8 is 3 bytes long.
byte 1 = 224 + (ud div 4096)
byte 2 = 128 + ((ud div 64) mod 64)
byte 3 = 128 + (ud mod 64)
If ud >=65536 and <=2097151 (1FFFFF hex) then UTF-8 is 4 bytes long.
byte 1 = 240 + (ud div 262144)
byte 2 = 128 + ((ud div 4096) mod 64)
byte 3 = 128 + ((ud div 64) mod 64)
byte 4 = 128 + (ud mod 64)
If ud >=2097152 and <=67108863 (3FFFFFF hex) then UTF-8 is 5 bytes long.
byte 1 = 248 + (ud div 16777216)
byte 2 = 128 + ((ud div 262144) mod 64)
byte 3 = 128 + ((ud div 4096) mod 64)
byte 4 = 128 + ((ud div 64) mod 64)
byte 5 = 128 + (ud mod 64)
If ud >=67108864 and <=2147483647 (7FFFFFFF hex) then UTF-8 is 6 bytes long.
byte 1 = 252 + (ud div 1073741824)
byte 2 = 128 + ((ud div 16777216) mod 64)
byte 3 = 128 + ((ud div 262144) mod 64)
byte 4 = 128 + ((ud div 4096) mod 64)
byte 5 = 128 + ((ud div 64) mod 64)
byte 6 = 128 + (ud mod 64)
The operation div means integer division and mod means the remainder after integer division.
For positive numbers a div b = int(a/b) and a mod b = (a/b-int(a/b))*b.
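The encoding tables above translate directly into code. A minimal Python sketch using only integer division (//) and remainder (%), restricted to the 1- to 4-byte forms; the function name is my own:

```python
def ucs4_to_utf8(ud):
    """Encode a decimal code point ud into a list of UTF-8 byte values,
    using only integer division and remainder, as in the tables above."""
    if ud < 128:
        return [ud]
    if ud < 2048:
        return [192 + ud // 64, 128 + ud % 64]
    if ud < 65536:
        return [224 + ud // 4096, 128 + (ud // 64) % 64, 128 + ud % 64]
    if ud < 2097152:
        return [240 + ud // 262144, 128 + (ud // 4096) % 64,
                128 + (ud // 64) % 64, 128 + ud % 64]
    # UTF-8 is only defined for 1- to 4-byte sequences.
    raise ValueError("code point out of range for UTF-8")

print(ucs4_to_utf8(8482))  # [226, 132, 162]
```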
UTF-8 sequences of 5 bytes and longer are at the moment not supported by the regular browsers.
The highest character position defined in Unicode 3.2 is number 10FFFF hex (1114111 dec) in a 'private use' area. The highest character with an actual glyph is number E007F hex (917631 dec), the CANCEL TAG character. In Unicode 6.1 there are still no characters defined above 200000 hex.
Please note that at the moment UTF-8 is only defined for number series from 1 to 4 bytes long. What will happen when the Unicode region above 200000 hex is filled, is not known. It is possible that UTF-8 will be extended to 6 byte series, but this is far from certain. That means that the algorithm given above should throw an error if ud >=2097152.
From UTF-8 to Unicode UCS-4:
Let's take a UTF-8 byte sequence. The first byte in a new sequence will tell us how long the sequence is. Let's call the subsequent decimal bytes z y x w v u.
If z is between 0 and 127 inclusive, there is 1 byte z; the decimal Unicode value ud = z.
If z is between 192 and 223 inclusive, there are 2 bytes z y; ud = (z-192)*64 + (y-128)
If z is between 224 and 239 inclusive, there are 3 bytes z y x; ud = (z-224)*4096 + (y-128)*64 + (x-128)
If z is between 240 and 247 inclusive, there are 4 bytes z y x w; ud = (z-240)*262144 + (y-128)*4096 + (x-128)*64 + (w-128)
If z is between 248 and 251 inclusive, there are 5 bytes z y x w v; ud = (z-248)*16777216 + (y-128)*262144 + (x-128)*4096 + (w-128)*64 + (v-128)
If z is 252 or 253, there are 6 bytes z y x w v u; ud = (z-252)*1073741824 + (y-128)*16777216 + (x-128)*262144 + (w-128)*4096 + (v-128)*64 + (u-128)
If z = 254 or 255 then there is something wrong!
Please note that at the moment UTF-8 is only defined for number series from 1 to 4 bytes long. What will happen when the Unicode region above 200000 hex is filled, is not known. It is possible that UTF-8 will be extended to 6 byte series, but this is far from certain. That means that the algorithm given above should throw an error if z >=248.
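The decoding rules can be sketched in Python the same way, rejecting the lead bytes that the note above rules out; the function name is my own:

```python
def utf8_to_ucs4(seq):
    """Decode one UTF-8 sequence (a list of decimal byte values) back to
    its decimal code point, following the rules above."""
    z = seq[0]
    if z <= 127:
        return z
    if 192 <= z <= 223:
        return (z - 192) * 64 + (seq[1] - 128)
    if 224 <= z <= 239:
        return (z - 224) * 4096 + (seq[1] - 128) * 64 + (seq[2] - 128)
    if 240 <= z <= 247:
        return ((z - 240) * 262144 + (seq[1] - 128) * 4096
                + (seq[2] - 128) * 64 + (seq[3] - 128))
    # 248 and up would start 5- and 6-byte forms, which are not valid UTF-8.
    raise ValueError("invalid lead byte: %d" % z)

print(utf8_to_ucs4([226, 132, 162]))  # 8482
```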
Example: take the decimal Unicode value 8482, which is the trademark sign. This number lies between 2048 and 65535, so we get three numbers.
The first number is 224 + (8482 div 4096) = 224 + 2 = 226.
The second number is 128 + ((8482 div 64) mod 64) = 128 + (132 mod 64) = 128 + 4 = 132.
The third number is 128 + (8482 mod 64) = 128 + 34 = 162.
Now the other way round. We see the numbers 226, 132 and 162. What is the decimal Unicode value?
In this case: (226-224)*4096+(132-128)*64+(162-128) = 8482.
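Python's built-in UTF-8 codec can serve as a quick sanity check on both worked examples above (a cross-check, not part of the hand algorithm):

```python
# Code point to bytes, and bytes back to code point.
assert list(chr(8482).encode("utf-8")) == [226, 132, 162]
assert ord(bytes([226, 132, 162]).decode("utf-8")) == 8482
print("both directions agree with the hand calculation")
```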
And the conversion between hexadecimal and decimal? Come on, this is not a math tutorial! In case you don't know, use a calculator.
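For the record, any scripting language doubles as that calculator; in Python, for instance:

```python
print(int("2122", 16))    # hexadecimal 2122 -> 8482
print(format(8482, "X"))  # decimal 8482 -> '2122'
```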
More information about the UTF-8 encoding can be found in:
RFC 3629, 'UTF-8, a transformation format of ISO 10646'.
The page you are reading now is encoded in the standard Windows Roman encoding, 'code page 1252'. The Unicode mapping for it can be found at:
Windows code page 1252, Unicode encodings
Another encoding scheme is, for instance, the Apple Roman encoding. Its Unicode mapping can also be found at the Unicode organization:
Apple Roman Unicode encoding.
That document describes the latest Apple character set, as used by the Apple Mac OS Text Encoding Converter software version 1.5 and above.
A remark about that encoding: code position 0xDB is now used for the EURO SIGN, but a couple of years ago this position was used for the CURRENCY SIGN, as originally defined.
© Oscar van Vlijmen, June 2000
URL of this page: http://home.kpn.nl/vanadovv/uni/utf8conversion.html
Last update: 2013-12-30