Unicode browser display

Short explanations and usage tips about character entities such as &Egrave; (È) and their relation to Unicode.

Index to this page:
Unicode characters with a high value
UTF-8 once more
Literal characters
Named entities


This page is a complete rewrite of the old page with entities tips. You can find the old page here.

The HTML page at hand must be set to UTF-8. This can be done by setting the character set to UTF-8 in a META instruction in the HEAD section of the HTML code:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
The server sending the page to the reader has to declare the same encoding, and the end user's browser must not enforce another character encoding.
Test: é
All is well if you see a single letter e with an acute accent here. In that case the UTF-8 transfer was done correctly and your browser handles it properly.
If however you see the following 2 characters: Ã© then something went wrong with the UTF-8 mechanism. We cannot give a solution for this situation.
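The two-character result can be reproduced in a few lines. The following is only a sketch of the failure, not a diagnosis tool: it takes the UTF-8 bytes of é and reads each byte as a separate character, which is roughly what a browser does when it assumes a single-byte encoding. The helper name misreadUtf8AsLatin1 is made up for this illustration.

```javascript
// Sketch: why a wrong encoding assumption turns one character into two.
// misreadUtf8AsLatin1 is a hypothetical helper name, not a library function.
function misreadUtf8AsLatin1(text) {
  // TextEncoder always produces UTF-8 bytes.
  const bytes = new TextEncoder().encode(text);
  // Treat every byte as a character of its own (ISO-8859-1 style).
  return Array.from(bytes, (b) => String.fromCharCode(b)).join("");
}

const garbled = misreadUtf8AsLatin1("\u00E9"); // "\u00E9" is e with acute accent
console.log(garbled, garbled.length);
```

The single character é is stored as the two bytes C3 A9, and those two bytes are exactly the two characters shown in the failure case.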

An entity is part of an encoding method that can be used to display 'difficult' characters on, for instance, web pages. 'Normal' characters can be put literally on the page, but 'difficult' characters have to be encoded. An example is the lower case character e with an acute accent. You can always try to put this character literally in the HTML code and see what happens, but if it is encoded with an entity, a correct representation on the page the end user gets is more likely. There are several possibilities: a decimal numerical entity (&#233;), a hexadecimal numerical entity (&#x00E9;) or a named entity (&eacute;). Note that the semicolon (;) is an essential part of the entity encoding.
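The two numerical forms follow directly from the Unicode value of the character. As a minimal sketch (the function name numericEntities is an assumption made for this example), they can be derived in JavaScript like this:

```javascript
// Sketch: derive the decimal and hexadecimal numerical entity of one character.
// numericEntities is a hypothetical name chosen for this illustration.
function numericEntities(ch) {
  const codePoint = ch.codePointAt(0);
  // Pad the hexadecimal form to at least 4 digits, as in &#x00E9;
  const hex = codePoint.toString(16).toUpperCase().padStart(4, "0");
  return {
    decimal: "&#" + codePoint + ";",
    hex: "&#x" + hex + ";",
  };
}

const eAcute = numericEntities("\u00E9"); // e with acute accent
console.log(eAcute.decimal, eAcute.hex);
```

The named form (&eacute;) cannot be computed this way; it has to be looked up in a table of entity names.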

Unicode characters with a high value
A 'high value' means in this case a hexadecimal character value larger than FFFF, so 10000 or larger.
These days most browsers will display Unicode characters with high values correctly, on the condition that suitable fonts are available. It is even possible to work with webfonts: while the page loads, the necessary fonts are sent too, so that the reader doesn't have to download and install them separately. For this story we will assume that a certain font is already installed, namely Akkadian by George Douros.
A nice Unicode character is found on position U+12345 and according to Unicode that is CUNEIFORM SIGN URU TIMES KI.
A picture of the enlarged character:
In HTML this can be coded with a numerical entity like &#x12345; 𒍅
Very likely you'll see nothing or a small rectangle: your browser doesn't automatically select the right font, even if it is installed.
Let's give the correct font designation with a CSS style font-family:Akkadian:
<span style="font-family:Akkadian">&#x12345;</span>: 𒍅

If the font Akkadian is installed, you should see the correct cuneiform character.

It is possible to write something to an HTML page from JavaScript. With the classic \uXXXX escape, which takes exactly four hexadecimal digits, it is not possible to write a Unicode character with a 'high value' in one go. You have to encode Unicode characters with a value of 10000 hex and larger as a so-called surrogate pair, in this case: \uD808\uDF45

The JavaScript code that wrote the above is:

var t = "\uD808\uDF45"; // U+12345 written as a surrogate pair
var st1 = "<span style='font-family:Akkadian'>"; // opening tag that selects the font
var st2 = "</span>";
ECMAScript 6 (ES2015) introduces an escape of the form \u{12345} as a solution for high-valued Unicode characters. When this page was written browsers did not support it yet; current browsers do.
Surrogate pair codes can be calculated with some calculators to be found on-line. How to calculate surrogate pairs is explained in a Unicode FAQ, the UTF-8, UTF-16, UTF-32 & BOM FAQ.
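The calculation itself is short. As a sketch of the arithmetic described in that FAQ (the function names toSurrogatePair and toEscape are assumptions for this example): subtract 10000 hex, put the upper 10 bits on top of D800 and the lower 10 bits on top of DC00.

```javascript
// Sketch: compute the UTF-16 surrogate pair for a code point above U+FFFF.
// toSurrogatePair and toEscape are hypothetical names for this illustration.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("surrogate pairs only exist for U+10000..U+10FFFF");
  }
  const offset = codePoint - 0x10000;     // 20-bit value
  const high = 0xD800 + (offset >> 10);   // upper 10 bits
  const low = 0xDC00 + (offset & 0x3FF);  // lower 10 bits
  return [high, low];
}

// Format one 16-bit unit as a JavaScript escape such as \uD808
function toEscape(unit) {
  return "\\u" + unit.toString(16).toUpperCase();
}

const pair = toSurrogatePair(0x12345);
const escaped = pair.map(toEscape).join("");
// The pair and the code point describe the same one-character string.
const sameString =
  String.fromCharCode(pair[0], pair[1]) === String.fromCodePoint(0x12345);
console.log(escaped, sameString);
```

For U+12345 this gives the pair D808 DF45 used in the code above.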
By the way, what to do if there are spaces in a font name? In a CSS style instruction the font name should be surrounded by quotes, but in the JavaScript code above we'll have to nest 3 types of quotes, whereas we only have 2 types available, i.e. single ' and double ". Solution: encode the quotes with entities in the HTML code:
var t = "\uE209"; // position in the Private Use Area
var st1 = "<span style='font-family:&quot;Segoe UI Symbol&quot;'>"; // &quot; supplies the inner quotes
var st2 = "</span>";
If no 'like' symbol is shown after the word "Result", then the font Segoe UI Symbol is not installed. This font is part of the Microsoft Windows 8+ distribution. The Unicode position E209 is right in the Private Use Area, so it is very likely that only the named font will show the like symbol. Other fonts will display nothing, a rectangle or a completely different symbol.
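Besides entities, a JavaScript string also accepts a quote escaped with a backslash, which gives a third level of quoting. The following sketch shows that variant; the way the pieces are joined and the "Result: " label are assumptions based on the description above, since the original output statement is not shown.

```javascript
// Sketch: nest three levels of quotes by escaping the innermost with a backslash.
const t = "\uE209"; // Private Use Area position, drawn only by a matching font
const st1 = "<span style='font-family:\"Segoe UI Symbol\"'>";
const st2 = "</span>";

// Assumed assembly of the output; the label "Result: " is taken from the text.
const html = "Result: " + st1 + t + st2;
console.log(html);
```

In a browser the string could then be written to the page, for instance by assigning it to the innerHTML of an element.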

UTF-8 once more
Unicode U+12345 is encoded in UTF-8 as the following series of hexadecimal bytes: F0 92 8D 85.
If these bytes are read as individual characters, for instance as Windows-1252, you'll see the following: ð’… (these are 4 characters, of which one, the byte 8D, is probably not visible.)
What can we do with this information? Probably nothing. But if we see those characters instead of one, we know for sure that something went wrong with the display of UTF-8.
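To check such a byte series for any character, the UTF-8 bit layout can be written out by hand. This is a minimal sketch with made-up function names (utf8Bytes, hexBytes); it does not validate its input, so surrogate values and numbers above 10FFFF are not rejected.

```javascript
// Sketch: encode one Unicode code point as UTF-8 bytes, following the bit layout.
// utf8Bytes and hexBytes are hypothetical names for this illustration.
function utf8Bytes(cp) {
  if (cp < 0x80) {
    return [cp];                                   // 0xxxxxxx
  }
  if (cp < 0x800) {
    return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)]; // 110xxxxx 10xxxxxx
  }
  if (cp < 0x10000) {
    return [                                       // 1110xxxx 10xxxxxx 10xxxxxx
      0xE0 | (cp >> 12),
      0x80 | ((cp >> 6) & 0x3F),
      0x80 | (cp & 0x3F),
    ];
  }
  return [                                         // 11110xxx and three 10xxxxxx
    0xF0 | (cp >> 18),
    0x80 | ((cp >> 12) & 0x3F),
    0x80 | ((cp >> 6) & 0x3F),
    0x80 | (cp & 0x3F),
  ];
}

// Format the bytes as two-digit hexadecimal numbers separated by spaces.
function hexBytes(cp) {
  return utf8Bytes(cp)
    .map((b) => b.toString(16).toUpperCase().padStart(2, "0"))
    .join(" ");
}

console.log(hexBytes(0x12345));
console.log(hexBytes(0xE9));
```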

Literal characters
Suppose we have a Unicode capable text editor like Microsoft Word, we somehow produce Unicode character U+12345 and we copy/paste it into an HTML text editor. Does the reader of the HTML page see the intended character?
Let's assume that a suitable font is installed and that all CSS style instructions are correct. All goes well if a couple of conditions are met. The HTML text editor has to be Unicode capable and set to UTF-8 encoding. For instance, in Notepad++, a Windows text editor for programmers, the Encoding has to be set to "Encode in UTF-8 without BOM". The BOM is the byte order mark, which is put at the start of the HTML file. Some of the older browsers don't know how to handle a BOM, so it is safer to leave this mark out. In the text editor the Unicode character is handled internally as a UTF-8 sequence, like the 4 hexadecimal numbers shown in the previous section.
The pasted literal character U+12345 is shown here as: 𒍅 and including CSS style instructions: 𒍅
Pasting literal characters can of course also be used to put a piece of text in another alphabet or script onto an HTML page, like Russian:
Российский язык
In this case it is probably not necessary to give a CSS style instruction. At least Microsoft Windows has good support for Cyrillic characters, and the browser can automatically find one of the fonts with Cyrillic characters that are installed by default.
If you would like to encode the same Russian text with individual entities, you'll end up with for instance:
&#x0420;&#x043E;&#x0441;&#x0441;&#x0438;&#x0439;&#x0441;&#x043A;&#x0438;&#x0439; &#x044F;&#x0437;&#x044B;&#x043A;
For an encoding like this it is better to use a conversion utility.
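Such a utility does not need to be large. A minimal sketch, with the hypothetical function name toHexEntities, that leaves ASCII characters alone and replaces everything else by a hexadecimal entity:

```javascript
// Sketch: replace every non-ASCII character by a hexadecimal numerical entity.
// toHexEntities is a hypothetical name; ASCII (including & and <) is passed through.
function toHexEntities(text) {
  let out = "";
  // for...of walks the string per code point, so surrogate pairs stay together.
  for (const ch of text) {
    const cp = ch.codePointAt(0);
    if (cp < 0x80) {
      out += ch;
    } else {
      out += "&#x" + cp.toString(16).toUpperCase().padStart(4, "0") + ";";
    }
  }
  return out;
}

const encodedRussian = toHexEntities("Российский язык");
console.log(encodedRussian);
```

Because the loop works per code point, a character such as U+12345 comes out as the single entity &#x12345; and not as two surrogate values.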

Named entities
Named entities are defined for a small range of Unicode characters. A decade ago there were only 254 named entities available, amongst others Latin characters with diacritics (e.g. &eacute; é), Greek characters (e.g. &omega; ω), some typographic characters (e.g. &para; ¶) and some mathematical symbols (e.g. &infin; ∞). At the moment (XML, HTML5) there are 2231 named entities defined for 1574 different Unicode positions. Only the most modern browsers can resolve all those entities to a displayable character.
By the way, never forget the semicolon; this is an essential part of the entity encoding.
You could encode the piece of Russian text above with named entities:
&Rcy;&ocy;&scy;&scy;&icy;&jcy;&scy;&kcy;&icy;&jcy; &yacy;&zcy;&ycy;&kcy;
Modern browsers should support this; see what happens:
Российский язык
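Unlike numerical entities, named entities cannot be calculated; they need a lookup table. The sketch below uses a tiny table that covers only the letters of this example, with names taken from the encoded line above. The names CYRILLIC_NAMES and toNamedEntities are made up for the illustration, and a real converter would need the complete entity list.

```javascript
// Sketch: encode text with named entities through a lookup table.
// The table is deliberately incomplete; it only covers this one example.
const CYRILLIC_NAMES = new Map([
  ["Р", "Rcy"], ["о", "ocy"], ["с", "scy"], ["и", "icy"], ["й", "jcy"],
  ["к", "kcy"], ["я", "yacy"], ["з", "zcy"], ["ы", "ycy"],
]);

function toNamedEntities(text) {
  let out = "";
  for (const ch of text) {
    const name = CYRILLIC_NAMES.get(ch);
    // Characters without a table entry are copied unchanged.
    out += name ? "&" + name + ";" : ch;
  }
  return out;
}

const namedRussian = toNamedEntities("Российский язык");
console.log(namedRussian);
```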
The usability of named entities for encoding other languages is very restricted. Named entities are available only for a handful of Latin characters with diacritics, Greek (modern Greek, without accents), some typographic characters and an assortment of Cyrillic characters. There are, however, lots of mathematical named entities defined.


© Oscar van Vlijmen, 2015
URL of this page: http://home.kpn.nl/vanadovv/uni/entitiesTips.html
Last modification date: 2015-05-05
