## THEORY OF FLOATING-POINT NUMBERS IN COMPUTERS

### by:   J.H.M. Bonten

First date of publication: 05 October 2006
Last date of modification: 05 October 2006
A tiny correction: 09 January 2007
A tiny correction: 18 July 2007
Renaming 754r into 754-2008: 23 September 2008


### General

This document describes some numerical and mathematical properties of numbers that are stored in a computer in the floating-point format.

In many computers the non-integer numeric values are stored in the so-called floating-point notation. This notation consists of the +/- sign, the numeric base, the exponent value and the coefficient. Mantissa is the old word for coefficient.

The numeric value is:
sign * coefficient * numeric_base ^ exponent_value

sign
The sign always has one of two values. It is either '+' or '-' (or if you like: '+1' or '-1'). Therefore in a binary computer it can be represented by a single bit, the so-called sign bit. In nearly all computers 0 means '+' and 1 means '-'.

In the notation of the numeric value the sign is always visible. It is not mixed up with the other parts of the number, as is often done in the binary notation of integer numbers (viz. two's-complement notation). Therefore the notation is called 'sign-and-magnitude' notation. It also enables the existence of a signed zero, viz. two zeroes: +0 and -0.

In the further discussion the sign is not important, so from now it will be omitted. Only the absolute value of the numeric value will be dealt with.

exponent base
The numeric base is always an integer value that is at least two. In most computers it is not notated since it is assumed to be the same for all numbers stored in a single computer. Another word for numeric base is 'exponent base'.

digit
A digit is the smallest coherent unit of information that is used in the representation (e.g. storage or display) of a numeric value. It is a symbol that represents a non-negative integer value. This value always lies in the range from zero to numeric_base minus one. Thus it is a kind of unsigned mini-integer. A few series of such digits are used in the representation of the numeric value. All digits in these series operate within the same range. In the coefficient one or two such digit series are used.

coefficient
Generally the coefficient is represented by one or two series of digits. It has a fairly complicated structure, so it is described in a separate chapter.

exponent value
The exponent value is an integer value that can be positive, zero or negative. Since it is an integer it is stored in a binary way in many computers. The binary way is the easiest way of calculation for computers. The number of bits for this storage is always finite, and in many computers it is fixed.

### Limitations

Since both the length of the storage for the exponent value and that for the coefficient are always finite, no computer is ever able to store all numbers that exist along the mathematical axis of real numbers. A computer can store only a limited and predefined set of them. So computers always have the problems of underflow (= value too near to zero), overflow (= value too far away from zero) and accuracy (= only a nearby value can be stored). We need to manage these problems intelligently, otherwise the calculations may give results that are far away from what they should be and so are unreliable. One method to manage these problems is the application of the special values Infinity and NaN (= Not a Number).
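The three problems can be observed directly in a language whose floating-point type follows IEEE 754. The following Python sketch assumes that Python's `float` is an IEEE 754 binary64 number (the usual case, but an assumption here); the variable names are chosen for this illustration only.

```python
import math
import sys

# Assumption: Python's float is an IEEE 754 binary64 number.
largest = sys.float_info.max            # greatest finite value that can be stored
overflowed = largest * 2.0              # too far away from zero: becomes Infinity

smallest = 5e-324                       # smallest non-zero value (not normalized)
underflowed = smallest / 2.0            # too near to zero: becomes zero

inexact = 0.1 + 0.2                     # only nearby values can be stored

not_a_number = overflowed - overflowed  # Infinity - Infinity gives NaN
```

Here `overflowed` is Infinity, `underflowed` is zero, `inexact` differs from the stored value of `0.3`, and `not_a_number` is NaN.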

##### WARNING

Although the normalized values are what you mostly see when your program is working with real data, proper handling of the rest of the values (denorms, error-values, infinities) is vitally important; otherwise you'll get all sorts of horrible results that are difficult to understand and usually impossible to fix.


### General structure

The coefficient is notated as a series of digits in fixed-point notation. The fixed-point notation means that the coefficient is split up into two parts, viz. the integral part and the fractional part. Both parts consist of a contiguous series of digits. A separator stands between the two parts.

The value the integral part stands for is one out of the set of non-negative integral values 0, 1, 2, 3, ... .  The fractional part stands for a value out of the range from 0 to nearly 1, so out of [0,1), or 0 =< frac.p. < 1.  The values of both parts have to be added to get the coefficient value:
coefficient = ( integral_part + fractional_part )
In this formula the addition operator can also be seen as the separator symbol.

The two parts never change in length (= number of digits), so the coefficient has a fixed length and the separator never moves in it, hence the name fixed-point. All numbers in a single computer have the same length for the integral part and have the same length for the fractional part. So in a single computer all coefficients have the same total number of digits and have the separator in the same place inside this series.

Since the separator never moves it does not need to be notated physically. It only needs to be assumed to stand in the right predefined place. Thus it becomes virtual.

In some computers one of the two coefficient parts has no length and so is absent. Its value is said to be zero. In most computers both parts are present. Then often the integral part has only one digit.

### Normalization and zero

Both the coefficient and the whole number are said to be normalized when the coefficient's most significant digit is not zero, i.e. is 1 or more. This digit stands in the integral part or, when that part is absent, in the fractional part. This is dealt with later.

The whole number is zero if and only if the coefficient is zero. This is when both parts are zero. The exponent and the sign may have any value. Thus the zero is signed: +0 and -0. Mathematically both zeroes are equal.
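The signed zero can be made visible with a short Python sketch. It assumes that Python's `float` follows IEEE 754, which has both zeroes; the variable names are illustrative.

```python
import math

plus_zero = 0.0
minus_zero = -0.0

# Mathematically both zeroes are equal, and they compare as equal.
zeroes_are_equal = (plus_zero == minus_zero)

# The sign is nevertheless stored and can be read back.
sign_of_plus_zero = math.copysign(1.0, plus_zero)
sign_of_minus_zero = math.copysign(1.0, minus_zero)
```

`math.copysign` copies the stored sign onto the value 1.0, so the two results differ although the zeroes themselves compare as equal.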


### NAMING CONVENTIONS AND BASIC FORMULAS

In America the separator in the coefficient is often called a 'point' or 'fractional point' and in Europe a 'comma' or 'fractional comma'.

The old word for coefficient is the word mantissa. At present mantissa means the (e.g. physical) space wherein the coefficient is stored.

Some documents on this internet site use the old word mantissa, whilst others use the new word coefficient. However, inside each document the selected word is used consistently.

For the following theories abbreviations are used:

```
+/- sign = S
Numeric base = Exponent base = EB
Exponent value = Exp
Value of a single digit = D
Coefficient = C
Integral part = I
Fractional part = F
Value of the total number = V
```

So it has been shown already that

```
2 =< EB
0 =< D < EB
C = I + F
V = S * C * EB^Exp
```

In the computer the numeric representations are limited in size. So the integral part has a maximum length, the fractional part has a maximum length, and the exponent value can go neither beyond a maximum nor beyond a minimum value.

```
Number of digits in the integral part      = k+1  (-1 < k)
Number of digits in the fractional part    = m    (0 < m)
Number of bits to store the exponent value = n    (0 < n)
Minimum value of the exponent              = ExMin
Maximum value of the exponent              = ExMax
Exponent bias (to be defined later)        = ExBias
The j-th digit in the integral part        = Dj
The j-th digit in the fractional part      = D_j
```

Thus the value of the integral part of the coefficient equals:

```
Dk*EB^k + ... + D3*EB^3 + D2*EB^2 + D1*EB^1 + D0
```

The value of the fractional part equals:

```
D_1*EB^(-1) + D_2*EB^(-2) + D_3*EB^(-3) + ... + D_m*EB^(-m)
```
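The two sums and the formula V = S * C * EB^Exp can be sketched in Python as follows. The function names `coefficient` and `value` are hypothetical and exist only for this illustration. Exact fractions are used so that the result is not itself disturbed by floating-point rounding.

```python
from fractions import Fraction

def coefficient(integral_digits, fractional_digits, base):
    """Return C = I + F for digits given most significant first.

    integral_digits   = [Dk, ..., D1, D0]
    fractional_digits = [D_1, D_2, ..., D_m]
    """
    for d in list(integral_digits) + list(fractional_digits):
        if not 0 <= d < base:
            raise ValueError("a digit must satisfy 0 =< D < EB")
    k = len(integral_digits) - 1
    # I = Dk*EB^k + ... + D1*EB^1 + D0
    integral = sum(Fraction(d) * Fraction(base) ** (k - i)
                   for i, d in enumerate(integral_digits))
    # F = D_1*EB^(-1) + ... + D_m*EB^(-m)
    fractional = sum(Fraction(d) * Fraction(base) ** -j
                     for j, d in enumerate(fractional_digits, start=1))
    return integral + fractional

def value(sign, integral_digits, fractional_digits, base, exponent):
    """Return V = S * C * EB^Exp as an exact fraction."""
    c = coefficient(integral_digits, fractional_digits, base)
    return sign * c * Fraction(base) ** exponent
```

For example, `value(+1, [7], [2, 4, 0, 0], 10, 1)` stands for the decimal number 7.2400E+1 and `coefficient([1], [1, 0, 1], 2)` for the binary coefficient 1.101.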

With all these data we can calculate several properties of the numeric value storage, such as:
• the minimum non-zero value that can be stored,
• the minimum normalized non-zero value that can be stored,
• the maximum value that can be stored,
• several types of accuracy.

In many computers the integral part consists of only one digit. Then k = 0 and the integral part becomes D0. Then normalization means that this single digit is not zero. In this document only these computers are dealt with. So the symbol k will seldom be used.

So in the following discussion the integral part consists of only one digit: D0.


### NORMALIZATION WITH(-OUT) BIT-HIDING

A binary computer is a computer operating with exponent base 2. At present most of them apply digit hiding, i.e. bit hiding. This mechanism was invented by Digital Equipment Corporation, refined and redefined by Intel and standardized by an IEEE committee under the number 754.

In this mechanism at first the coefficient is moulded (see below) such that its most significant bit (which often is the only bit in the integral part) is always one, except when the whole numeric value is zero. Then this bit is hidden, which means it is removed from the bit pattern only to become imaginary and to be assumed 1 always.

The moulding operation is called 'normalization'. Bit hiding makes it obligatory since the bit is always assumed to be 1 (except when the exponent value is at its smallest for the zero value).
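The hidden bit can be made visible by taking a stored number apart. The sketch below assumes the IEEE 754 binary64 layout (1 sign bit, 11 exponent bits with bias 1023, 52 stored fraction bits); these figures belong to that format and are not taken from this document, and the function names are hypothetical.

```python
import struct
from fractions import Fraction

def unpack_binary64(x):
    """Split a float into (sign bit, stored exponent integer, stored fraction bits)."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign_bit = bits >> 63
    exponent_integer = (bits >> 52) & 0x7FF
    fraction_bits = bits & ((1 << 52) - 1)
    return sign_bit, exponent_integer, fraction_bits

def value_of_normalized(sign_bit, exponent_integer, fraction_bits):
    """Rebuild the value of a normalized number by re-inserting the hidden bit."""
    if not 0 < exponent_integer < 0x7FF:
        raise ValueError("not a normalized number")
    # The hidden bit is the integral part 1; only the fraction was stored.
    coefficient = 1 + Fraction(fraction_bits, 1 << 52)
    sign = -1 if sign_bit else 1
    return sign * coefficient * Fraction(2) ** (exponent_integer - 1023)
```

For the number 1.0 the stored fraction bits are all zero; the leading 1 of the coefficient appears nowhere in the bit pattern and is added back only when the value is rebuilt.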

Computers with an exponent base greater than 2 are not able to hide a digit since the actual value of that digit would get lost. So they never apply hiding.

When hiding is absent, normalizing the numbers is not obligatory. Nevertheless in our discussion it is applied often, although only for the ease of the examples. It makes the various cases easier to compare.

Normalizing means that the first digit of the coefficient is made non-zero by shifting the series of digit values to the left until that first digit is non-zero. The exponent is decreased by one for every shifted position to keep the same total value. So the numeric value remains the same, only its notation differs. In a decimal example: 0.0724E+3 becomes the equally-valued 7.2400E+1.  Normalization cannot be applied when the exponent is already at its minimum value or when all digits in the coefficient are zero.
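The shifting procedure can be sketched in Python as follows. The function name `normalize` and its parameters are hypothetical; the digits are given most significant first, with one integral digit followed by the fractional digits.

```python
def normalize(digits, exponent, min_exponent):
    """Shift the coefficient digits left until the first digit is non-zero.

    Returns the new (digits, exponent) pair; the numeric value is unchanged.
    """
    digits = list(digits)
    if not any(digits):
        return digits, exponent        # zero cannot be normalized
    while digits[0] == 0 and exponent > min_exponent:
        digits = digits[1:] + [0]      # shift left, fill with zero on the right
        exponent -= 1                  # compensate so the value stays the same
    return digits, exponent
```

With the decimal example from the text, `normalize([0, 0, 7, 2, 4], 3, -99)` shifts twice and yields the digits 7, 2, 4, 0, 0 with exponent 1, i.e. 7.2400E+1. When the exponent reaches `min_exponent` before the first digit is non-zero, the shifting stops and the number stays unnormalized.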


### Maximum value

The numeric value will be at its maximum when both the exponent and the coefficient are at their maximum values. The latter happens when all coefficient digits are at their maximum value EB - 1.  Since the first digit is not zero the coefficient is normalized.

The value of the integral part is EB-1, and that of the fractional part is (EB-1)*EB^(-1) + (EB-1)*EB^(-2) + (EB-1)*EB^(-3) + ... + (EB-1)*EB^(-m) = 1 - EB^(-m).  Both parts added together result in the maximum coefficient value of EB - EB^(-m).  Addition of 1 to the last digit would result in a value that exactly equals the exponent base.

Attempting to store the exponent base itself in the coefficient leads to an overflow; it is the smallest coefficient value that does so. The storable maximum is the non-overflow value nearest to that base,  EB - EB^(-m).  The more digits the coefficient has, the greater m is and the better this maximum approximates the exponent base EB.

The value of the total maximum number is  EB^ExMax * (EB - EB^(-m)),  which approximates  EB^(ExMax+1).

### Minimum non-zero value

The coefficient has its minimum value when all its digits are zero. Then the value of the whole number equals zero.

The numeric value will be at its non-zero minimum when the exponent is at its minimum value and the coefficient is at its minimum non-zero value. Then all digits in the coefficient are zero, except the least significant one which is 1.  This is digit D_m in the fractional part, and so the minimum non-zero coefficient value becomes  1*EB^(-m)

The value of the total number becomes  EB^ExMin * EB^(-m)

### Minimum normalized value

The number is at its minimum and still normalized when the exponent is at its minimum and the coefficient is at its normalized minimum. The latter happens when the most significant digit is 1 and all other digits are zero. Then the value of the coefficient is 1.

The value of the total number becomes  EB^ExMin
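The three extreme values can be computed together. The following Python sketch applies the formulas above for a coefficient with one integral digit (k = 0); the function name `extremes` is hypothetical and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

def extremes(base, m, ex_min, ex_max):
    """Return (maximum, minimum non-zero, minimum normalized) value for k = 0."""
    eb = Fraction(base)
    maximum = eb ** ex_max * (eb - eb ** -m)      # EB^ExMax * (EB - EB^(-m))
    minimum_nonzero = eb ** ex_min * eb ** -m     # EB^ExMin * EB^(-m)
    minimum_normalized = eb ** ex_min             # EB^ExMin
    return maximum, minimum_nonzero, minimum_normalized
```

For a binary format with 16 fractional bits and exponent values from -16 to +15, `extremes(2, 16, -16, 15)` gives a maximum of 65535.5, a minimum non-zero value of 2^(-32) and a minimum normalized value of 2^(-16).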

### Coefficient ratio

As shown above the minimum value of a normalized coefficient is 1, and its maximum value is  EB - EB^(-m).  So the value range of a normalized coefficient goes from 1 to nearly the exponent base EB. The ratio between both extremes, max/min.norm, also approximates EB. An octal and a decimal example illustrate this.

Octal example:

```
exponent base                  = 8
digit type                     = octal
digit maximum                  = 7
minimum coefficient            = 0.0000...
minimum normalized coefficient = 1.0000...
maximum coefficient            = 7.7777...
ratio  max / min.norm          = 7.7777...
maximum approximates           = 8.0000...
ratio approximates             = 8
```

Decimal example:

```
exponent base                  = 10
digit type                     = decimal
digit maximum                  = 9
minimum coefficient            = 0.0000...
minimum normalized coefficient = 1.0000...
maximum coefficient            = 9.9999...
ratio  max / min.norm          = 9.9999...
maximum approximates           = 10.0000...
ratio approximates             = 10
```


### Absolute accuracy

The accuracy in the representation of the numeric value can be defined in several ways. Nevertheless all notions of it depend, among other things, on the number of digits in the coefficient and the base of the exponent. Two important types of accuracy are in ubiquitous use: the absolute and the relative. Here a third one is introduced: the guaranteed.

At first sight the absolute accuracy is the minimum number one has to add to the value or subtract from it in order to change its digital representation, i.e. to change the last digit of the coefficient without changing any other digit. The minimum size of the number that enables this change increases with the value of the exponent and decreases with the length (= number of digits) of the coefficient.

The above description may lead to an unclear and fuzzy value to be added or subtracted. Should it be 0.5 times the value of this change, or 0.49, or 0.51, or 0.4, or 0.6, or 1.0, or ...?  The next definition is really unambiguous:

• The absolute accuracy of a number is the |change| in the value of that number when the least significant digit of the coefficient is changed by 1.
So the absolute accuracy is the value of the |change| itself. In this definition |change| means the always positive absolute value of the change. And always the least significant digit is the last digit in the coefficient. The value of the |change| equals  EB^Exp * EB^(-m)

In a binary computer the digits in the coefficient are the bits in it. Thus the last bit is the last digit. This bit is changed solely by toggling (= inverting) it. In a decimal computer the digits can have any value from 0 to 9.  When the last digit is 0 then the change of 1 cannot be subtracted, otherwise one or more other digits will be affected. When the last digit is 9 the addition is forbidden.
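The formula EB^Exp * EB^(-m) can be checked against a real machine format. The Python sketch below assumes that Python's `float` is IEEE 754 binary64, so that EB = 2 and m = 52; the function name `absolute_accuracy` is hypothetical, and `math.ulp` requires Python 3.9 or later.

```python
import math

def absolute_accuracy(base, exponent, m):
    """Return EB^Exp * EB^(-m), the change caused by a step of 1 in the last digit."""
    return float(base) ** (exponent - m)

# Assumption: Python's float is IEEE 754 binary64, so EB = 2 and m = 52.
x = 1000.0
# math.frexp returns a coefficient in [0.5, 1); subtracting 1 gives the
# exponent that belongs to a coefficient in [1, 2), as used in this document.
exponent_of_x = math.frexp(x)[1] - 1
step = absolute_accuracy(2, exponent_of_x, 52)

# Cross-check against the standard library (math.ulp needs Python 3.9 or later).
same_as_library = (step == math.ulp(x))
```

Since 1000 lies between 2^9 and 2^10 its exponent is 9, so the step is 2^(9-52) = 2^(-43); adding this step to `x` changes only the last bit of the coefficient.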

### Relative accuracy

The relative accuracy is the ratio between the absolute accuracy and the actual numeric value. It is the absolute accuracy divided by the actual value. Thus this division blots out the influence of the actual value of the exponent. Its value is  (EB^Exp * EB^(-m)) / (EB^Exp * C) = EB^(-m)/C.  It equals the absolute accuracy when the numeric value is 1.

This type of accuracy makes much sense when the number is normalized. Then this accuracy depends mainly on the length of the coefficient (= its number of digits), less on the value of the coefficient (= its actual digit pattern), and of course not at all on the value of the exponent. Also the influence of the coefficient value can be blotted out. This is done in the next discussion by the introduction of the guaranteed accuracy.


### Varying relative accuracy

The relative accuracy varies with the value of the coefficient. It is at its best when the coefficient has its maximum value. This is when all digits have their maximum value. The relative accuracy of a normalized number is at its worst when the coefficient has the minimum allowed value. This occurs when all digits are at their allowed minimum. This means: when the first digit equals 1 and all others equal 0.

The formula for the relative accuracy is:  EB^(-m)/C
The best relative accuracy is:  EB^(-m)/(EB-EB^(-m))  = approximately =  EB^(-m-1)
The worst relative accuracy is:  EB^(-m)
Ratio worst/best is:  EB-EB^(-m) = approximately = EB
The ratio between both accuracies equals the ratio between the coefficient extremes, and thus approximates the exponent base.
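These formulas can be evaluated for concrete formats with a small Python sketch. The function names are hypothetical; exact fractions are used so that the ratio is not blurred by rounding.

```python
from fractions import Fraction

def relative_accuracy(base, m, coefficient):
    """Return EB^(-m) / C for a normalized coefficient C (1 =< C < EB)."""
    return Fraction(base) ** -m / Fraction(coefficient)

def worst_and_best(base, m):
    """Return (worst, best, worst/best) relative accuracy of normalized numbers."""
    eb = Fraction(base)
    worst = relative_accuracy(base, m, 1)             # C at its normalized minimum
    best = relative_accuracy(base, m, eb - eb ** -m)  # C at its maximum
    return worst, best, worst / best
```

For a decimal coefficient with four fractional digits, `worst_and_best(10, 4)` gives the ratio 9.9999; for a binary coefficient with four fractional bits, `worst_and_best(2, 4)` gives 1.9375. Both approximate the exponent base.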

So the greater the exponent base is the greater its influence is on the relative accuracy. Thus this influence in a decimal computer (base = 10) is great compared with that in a binary computer (base = 2). That giant factor of 10 cannot be overlooked in the calculation of the relative accuracy whilst the factor of 2 sometimes may.

### Guaranteed accuracy

To fully blot out even this undesired dependence on the coefficient extremes, and thus on the actual coefficient value, the concept of the guaranteed relative accuracy is introduced. It is the relative accuracy at its worst, i.e. when the value of the normalized coefficient is at its minimum. This accuracy does not depend on any actual value at all. It depends solely on the exponent base and the coefficient length.

By this total elimination of the value dependence the concept of the guaranteed relative accuracy becomes more important when the computer uses a greater exponent base. It makes more sense in a decimal computer than in a binary computer.

So the value of the guaranteed relative accuracy is:  EB^(-m)

When the integral part of the coefficient has fewer or more digits than one, the formula for this type of accuracy becomes  EB^(-k-m). Note that k+m is the total number of digits in the coefficient minus one. So in every floating-point representation one coefficient digit should not be taken into account. In the case of the binary hidden-bit notation one simply has to 'forget' the hidden bit.

### Decimal and absolute accuracy

The guaranteed relative accuracy determines the number of decimal digits that can be represented reliably by the sequence of coefficient digits. The longer the coefficient is, the more decimal digits this sequence can represent well. This so-called 'decimal accuracy' equals  - 10_log (guaranteed_accuracy),  where 10_log is the logarithm to base 10. Thus it equals  (k+m)*10_log(EB),  or in ordinary language:  (coefficient_length - 1) * 10_log (exponent_base).  Consequently the decimal accuracy of a decimal coefficient equals  coefficient_length - 1.  Since 0.30103000 is an extremely good approximation of 10_log(2), the number of decimal digits that fit well in a binary coefficient of 102 bits equals 101 * 0.30103 = 30.4 (note the fractional digit!).
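The decimal accuracy is easily computed. The Python sketch below uses the hypothetical function name `decimal_accuracy`. The 53-bit coefficient length is that of IEEE 754 binary64 including its hidden bit, and the 16-digit decimal coefficient is simply an example; neither figure comes from this document.

```python
import math

def decimal_accuracy(base, coefficient_length):
    """Return (coefficient_length - 1) * log10(base), i.e. (k+m) * 10_log(EB)."""
    return (coefficient_length - 1) * math.log10(base)

digits_102_bits = decimal_accuracy(2, 102)    # the example from the text, about 30.4
digits_binary64 = decimal_accuracy(2, 53)     # 53 bits including the hidden bit, about 15.65
digits_decimal16 = decimal_accuracy(10, 16)   # 16 decimal digits give 15
```

The result for binary64 agrees with Python's own `sys.float_info.dig`, which reports 15 as the number of decimal digits that can be represented faithfully.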

The absolute accuracy depends on the exponent value and on the length of the fractional part. It does not depend on the length of the integral part. Thus it is  EB^Exp * EB^(-m)  irrespective of the length of the integral part.


### Ways of storage

The exponent notation represents the exponent value. In theory this value is allowed to be non-integral. In practice it must be integral, otherwise the calculations would slow down and become too difficult even for a computer.

Therefore the exponent value is an integer value that can be positive, zero or negative. Since it is an integer many computers store it in a binary way. For computers this is the easiest way of calculation. The number of bits for this storage is always finite, and in many computers it is fixed.

All computers store integer values in the same way when zero or positive. When negative they apply different ways of storage. Four commonly used ways are: ones' complement, two's complement, excess-bias and sign+magnitude. At present the two's-complement method is often used for the ordinary integer numbers and the excess-bias method is often used for the exponent.

In the excess-bias method the exponent part is always a non-negative binary integer without sign. A predefined constant value, the bias, has to be subtracted from this integer value in order to find the actual value it stands for. In this way negative exponent notations are avoided although the resulting exponent value can be negative. Thus the exponent has got rid of its sign bit.

The storage with the bias has one advantage over the other methods of storing negative exponent values. The bias value can be chosen arbitrarily, so the range of the exponent values can be located non-symmetrically around zero. E.g. two-thirds of the available values are greater than zero and only one-third are below zero. The other notations always have the zero in or near the middle of the value range.
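The excess-bias method amounts to one addition when storing and one subtraction when reading. A minimal Python sketch, with hypothetical function names:

```python
def encode_exponent(exponent_value, bias, n_bits):
    """Return the stored non-negative integer: exponent value plus bias."""
    stored = exponent_value + bias
    if not 0 <= stored < 2 ** n_bits:
        raise OverflowError("exponent value does not fit in the exponent field")
    return stored

def decode_exponent(stored, bias):
    """Return the exponent value: stored integer minus bias."""
    return stored - bias
```

With 5 bits and a bias of 16 the exponent values -16 to +15 are stored as the integers 0 to 31. Choosing a bias of 10 instead shifts the same 32 stored integers to the non-symmetrical range -10 to +21.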

### Choosing the exponent bias

The exponent bias can be chosen arbitrarily when a new type of floating-point numbers is designed. A good choice gives a balance between the small and the great numeric values. When the bias is too small, too few small values are available. And when it is too great, too few great values are available. Thus the concept of symmetry in the normalized values can be very helpful for the right choice.

At first the exponent bias should be chosen such that the values of the whole number with normalized coefficient are located (more or less) symmetrically around the value 1.  So there are as many values between 0 and 1 as there are above 1.  This occurs when:
SmallestNormalizedValue = 1 / GreatestValue
or written differently:
SmallestNormalizedValue * GreatestValue = 1

This formula does not hold exactly. It can only be an aim, since the inverse of the smallest normalized value equals the value that the maximum approximates (of course with the same +/- sign), not the maximum itself. Moreover the value 1 itself belongs to the series of values above 1.  And the approximated value is the smallest value that gives an overflow when attempted to be stored. So the symmetry is not exact, although it is the best available.

This good symmetry occurs when the bias is chosen to be half the size of the range of the exponent integer. Example for an exponent of 5 bits with binary base:

```
--- facts: ---
exponent base = 2 (binary)
exponent length = 5 bits
minimum normalized coefficient = 1.000...
maximum coefficient = 1.111...
minimum exponent integer = 0
maximum exponent integer = 31
size of exponent range = 32
half size of range = 16
--- when: ---
selected bias = 16
--- then: ---
minimum exponent value = -16
maximum exponent value = +15
minimum normalized value of whole number =
    = 1.000... * 2^(-16) = 2^(-16)
maximum value of whole number =
    = 1.111... * 2^(+15) = 65535.5 (when the coefficient has 17 bits)
maximum value of whole number approximates =
    = 2 * 2^(+15) = 2^(+16) = 65536
smallest value that gives an overflow when stored =
    = 2 * 2^(+15) = 2^(+16) = 65536
```

Idem with decimal base:

```
--- facts: ---
exponent base = 10 (decimal)
exponent length = 5 bits
minimum normalized coefficient = 1.000...
maximum coefficient = 9.999...
minimum exponent integer = 0
maximum exponent integer = 31
size of exponent range = 32
half size of range = 16
--- when: ---
selected bias = 16
--- then: ---
minimum exponent value = -16
maximum exponent value = +15
minimum normalized value of whole number =
    = 1.000... * 10^(-16) = 10^(-16)
maximum value of whole number =
    = 9.999... * 10^(+15)
maximum value of whole number approximates =
    = 10 * 10^(+15) = 10^(+16)
smallest value that gives an overflow when stored =
    = 10 * 10^(+15) = 10^(+16)
```

Note that the symmetry (or the attempt at it) is never obligatory!  It can be helpful in the design of a new type of floating-point numbers since it gives a good first idea for a useful exponent bias. Often it is not implemented in actual machines. There the exponent bias is lower or higher than half of the exponent range.

The next table lists, for some machines and definitions, the product of the smallest normalized value and the value approximated by the maximum. For every machine and definition in this table the product happens to be the same for all its types of floating-point numbers.

The table also lists the exponent base of the machines and definitions and the way they store the exponent value. In all of them the exponent is a signed binary integer value.

```
definition              multiplication       e x p o n e n t
or machine              max * norm.min       base      storage

DEC  PDP11 c.s.                2               2       excess-bias
IEEE-754  (binary)       2^2 = 4               2       excess-bias
IEEE-754r (decimal)     10^2 = 100            10       excess-bias
Burroughs 6700 c.s.     8^25 = 3.8E+22         8       sign+magnitude
'binary' hypotheticals         1            various    excess-bias
decimal hypothetical      1 or 4 or 10        10       excess-bias
```

In the definitions of IEEE-754 (binary) and IEEE-754r (= IEEE-754-2008, decimal) the bias is lowered by one for all types of floating-point numbers. The Burroughs machines are very asymmetrical.
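The products for the symmetrical example and for the IEEE definitions can be verified with exact arithmetic. The Python sketch below uses the hypothetical function name `symmetry_product`. It assumes that Python's `float` is IEEE 754 binary64, and it uses the exponent range -383 to +384 of the decimal64 format of IEEE 754-2008; both are facts about those formats, not taken from this document.

```python
import sys
from fractions import Fraction

def symmetry_product(base, ex_min, ex_max):
    """Smallest normalized value times the value that the maximum approximates."""
    eb = Fraction(base)
    return eb ** ex_min * eb ** (ex_max + 1)

# The 5-bit example with bias 16: exponent values run from -16 to +15.
product_symmetrical = symmetry_product(2, -16, 15)

# Assumption: Python's float is IEEE 754 binary64.
# sys.float_info.min is the smallest normalized value, 2^(-1022);
# sys.float_info.max_exp is 1024, and 2^1024 is the value the maximum approximates.
product_binary64 = Fraction(sys.float_info.min) * Fraction(2) ** sys.float_info.max_exp

# Assumption: decimal64 of IEEE 754-2008 has exponent values from -383 to +384.
product_decimal64 = symmetry_product(10, -383, 384)
```

The first product is 1, the second is 4 and the third is 100, in agreement with the table.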

The hypotheticals are machines that are introduced in another document for educational purposes. They do not actually exist. In these machines the symmetry is maintained as well as possible.


### LISTING OF THE FORMULAS

The formulas that describe the properties of the floating-point numbers in computers are collected in a listing:

```
--- GENERAL ---
+/- sign = S                              +1 or -1
Numeric base = Exponent base = EB         2 =< EB
Exponent value = Exp
Value of a single digit = D               0 =< D < EB

--- COEFFICIENT ---
Number of digits in the integral part =   k+1     -1 < k
The j-th digit in the integral part =     Dj
Integral part = I                         I = sum(Dj*EB^j)       k >= j >= 0
Normalization means                       Dk >= 1
Number of digits in the fractional part = m       0 < m
The j-th digit in the fractional part =   D_j
Fractional part = F                       F = sum(D_j*EB^(-j))   m >= j >= 1
Normalization if I-part absent means      D_1 >= 1
Coefficient = C                           C = I + F
Value of the total number = V             V = S * C * EB^Exp

--- EXPONENT ---
Generally the exponent is stored as a binary integer = ExpInt
Number of bits to store this integer =    n       (n > 0)
Range of exponent integer = Rng =         2^n     (Rng > ExpInt >= 0)
Exponent bias = ExBias = advice: =        Rng/2
Exponent value = Exp =                    ExpInt-ExBias
- when special numbers are not in use: -
Maximum value of the exponent = ExMax =   Rng-1-ExBias
Minimum value of the exponent = ExMin =   -ExBias

--- EXTREMES AND ACCURACY ---
Absolute accuracy =             EB^Exp * EB^(-m)
Guaranteed relative accuracy =  EB^(-k-m)
Decimal accuracy =              (k+m)*10_log(EB)

- when k = 0 then: -

The extreme values of the total number are:
Maximum value =                 EB^ExMax * (EB - EB^(-m))
                                = approximately = EB^(ExMax+1)
Minimum non-zero value =        EB^ExMin * EB^(-m)
Minimum normalized value =      EB^ExMin

Coefficient ratio =             EB-EB^(-m)
                                = approximately = EB
Relative accuracy =             EB^(-m)/C
Best relative accuracy =        EB^(-m)/(EB-EB^(-m))
                                = approximately = EB^(-m-1)
Worst relative accuracy =       EB^(-m)
Ratio worst/best =              EB-EB^(-m) = approximately = EB
```
