# EXPLANATORY NOTES

### by:   J.H.M. Bonten

First publication date: 8 March 2009
Representations of numbers added: 21 April 2009
Tiny correction: 30 April 2009


## Different numeric representations

### General introduction

This internet site contains a set of descriptions of the memory-word formats for the so-called floating-point numbers used in some extinct, old and modern species of the computer fauna. It also contains some theory about the storage of these numbers, and it gives some proposals to improve the use of decimal numbers, both in Cobol and in Fortran/C/C++.

### Categories and exponent bases

The word 'computer' has been derived from the Latin word 'computare', which means to calculate: to handle numeric values and juggle with them correctly. As we know, in the early days of the computer its main purpose was this juggling with numbers, hence its name. Since this juggling was still under development, each manufacturer designed its own method for storing the numbers. The method differed even between computer models of the same brand, and sometimes even the same model could apply different methods. Unification and standardization were far away.

Yet most of these methods had some properties in common that the modern methods of storage still have today. Therefore this internet site describes several of the methods for storing numbers that were used in history and the few methods that are in common use today.

In these methods the numbers are stored as a sequence of digits. These digits are binary (= bits), octal, hexadecimal or decimal. Only very seldom is another type of digit used than one of these four. The radix of the digit type is often called the 'exponent base', so the four commonly used bases have the (here decimally written) values 2, 8, 10 and 16.  Nowadays the bases 8 and 16 are becoming obsolete; only 2 and 10 stay in use.

Besides the digit type, the storage methods can be grouped into three main categories: integral, fixed-point and floating-point. Their notations are discussed below.

### Integral numbers

In the integral method the number consists of a +/- sign and a bare series of digits, without any punctuation mark in between. This method is suited only to storing integral values. Non-integral values have to be truncated or rounded to their nearest integral neighbour before storage. The structure looks like:

```
      ,-----  +/- sign
      |
      V

      +     series of digits
```

All computers use this method, since integral values are required for pointing at the locations in the main memory. Because they must steadily calculate the addresses of these locations, all computer models have hardware to perform the integral calculations quickly.
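The truncation and rounding mentioned above can be sketched with Python's standard functions:

```python
import math

# A non-integral value must be made integral before it can be stored
# in an integer field. Two common choices:
x = -2.7
truncated = math.trunc(x)   # toward zero
rounded   = round(x)        # to the nearest integral neighbour

print(truncated, rounded)   # -2 -3
```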

The main memory itself is called the Random Access Memory (RAM) or Core Memory. The latter name is derived from the magnetic ferrite cores that were used for the RAM in the early days. It stands in opposition to the peripheral memories like drums, tapes, disks, CDs, USB sticks and networks (e.g. ethernet), and to the surrounding one-way data communicators like punch cards, keyboards, mice, printers, plotters and CRT screens.

### Fixed-point numbers

In the fixed-point method the numbers look like the integral numbers, but with a fractional point somewhere in the series of digits. When the digits are decimal, the fractional point is called the decimal point. This point need not be fixed on one location; it can stand everywhere inside the series and can even float over the number during the calculations!  The structure looks like:

```
    +/- sign          fractional point
       |                    |
       V                    V
       +   integer part     ,   fractional part
```

This way of storage is generally used for numbers wherein the series of digits has a variable length. It is also used often in the language Cobol, but then with the fractional point fixed firmly on one location, hence the name of the method. Common users of this notation are accountants and banks. Therefore it is used mostly in computers that work decimally.

Special hardware to handle these numbers is seldom used. Many computers apply software that regularly invokes the hardware for the integers. Therefore very often the fractional point is not physically present in the number; its location is remembered in a separate place by the running program. Then the point is called 'virtual'.

Other computers translate the number into a floating-point number (see below) before handling it. When the mathematical operations are finished the numeric result is translated back into the fixed-point notation.
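The virtual-point technique can be sketched in Python; the scale factor and the function names below are hypothetical, chosen only for illustration, with the point fixed two decimal digits from the right:

```python
SCALE = 100   # the virtual point stands 2 decimal digits from the right

def to_fixed(value):
    """Store a value as a plain scaled integer; the point itself is not
    physically present, only its location (SCALE) is remembered."""
    return round(value * SCALE)

def add_fixed(a, b):
    """Addition simply invokes the integer hardware."""
    return a + b

def to_value(fixed):
    return fixed / SCALE

price = to_fixed(19.95)          # stored as the integer 1995
tax   = to_fixed(4.05)           # stored as the integer  405
print(to_value(add_fixed(price, tax)))   # 24.0
```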

### Floating-point numbers

In a floating-point number the series of digits is interspersed with more punctuation marks and so broken into three parts. First, the exponent mark separates the series of digits into the coefficient and the exponent. Then, like above, the fractional point divides the coefficient into an integral and a fractional part. The exponent part is always integral and gets its own +/- sign, so the exponent can be negative, zero or positive on its own. The sign of the coefficient determines the sign of the whole number. In scheme the structure looks like:

```
    coefficient    fractional        exponent    exponent
    sign           point             marker      sign
       |              |                 |           |
       V              V                 V           V
       +   int.part   ,   fract.part    E           +   integer
       \____________________________/   \___________________/
                coefficient                 exponent part
```

Thus the floating-point number looks like a fixed-point number padded with the notation of an exponent. The value the whole number stands for is calculated as the value represented by the coefficient times the exponent base to the power of the exponent value. In mathematical formula:

```
   numeric   =   (+/- coeff.integer , coeff.fraction)  *  expon.base ^ (+/- expon.integer)
   value
```
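Assuming a decimal exponent base, this formula can be replayed in a small Python sketch (the function name is hypothetical):

```python
def fp_value(coeff_sign, coeff_int, coeff_fract, expon_sign, expon_int, base=10):
    """numeric value = (+/- coefficient) * base ** (+/- exponent)"""
    coefficient = coeff_sign * (coeff_int + coeff_fract)
    return coefficient * base ** (expon_sign * expon_int)

# The number -1,5 E +3 stands for -1500.0:
print(fp_value(-1, 1, 0.5, +1, 3))
```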

The accuracy of the notation is determined by the length of the whole coefficient. It is expressed as the count of digits of the applied type. The so-called 'guaranteed accuracy' equals this length minus one digit. These types of accuracy are explained in more detail later in this document and elsewhere in this internet site.

Common users of this notation are scientists and designers. Therefore it is used mostly in computers that work binarily. Because of the presence of the exponent, Konrad Zuse, the constructor of the world's first programmable computer, called it the 'semilogarithmic notation'.

### Floating-point notation and hardware

Normalization occurs when the series of digits in the coefficient is shifted such that the coefficient's first digit is not zero. Simultaneously the exponent is adapted such that the value of the whole number is not changed by this shifting. Many computers can work with normalized numbers only; otherwise they will produce errors.
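A minimal Python sketch of normalization, assuming a decimal exponent base and a coefficient kept in the interval [1, base):

```python
def normalize(coefficient, exponent, base=10):
    """Shift the coefficient until its first digit is not zero,
    adapting the exponent so the represented value is unchanged."""
    if coefficient == 0:
        return 0.0, 0                 # zero cannot be normalized
    while abs(coefficient) >= base:
        coefficient /= base
        exponent += 1
    while abs(coefficient) < 1:
        coefficient *= base
        exponent -= 1
    return coefficient, exponent

# 0.042 * 10**5 is normalized into about 4.2 * 10**3:
print(normalize(0.042, 5))
```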

Generally the three parts in the series of digits have fixed lengths, i.e. fixed counts of digits. Thus the exponent mark and the fractional point are fixed on their locations. So they can become virtual in order to save space. Only the two +/- signs are physically present.

Actually, in most computers, even the oldest ones, the exponent part together with its +/- sign is placed between the coefficient and the coefficient's +/- sign. Thus the actual lay-out becomes:

```
    coeff   expon
    sign    sign
      |       |
      V       V
      +       +    expon.int    coeff.int    coeff.fract
```

In several computers a constant value is added to the exponent, thus making its value always non-negative. Then the exponent sign can be omitted. This steadily added constant is called the excess bias of the exponent. This weird contraption enables the computer to compare two floating-point numbers by using the simple and fast comparator made for the integer numbers.
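A small Python sketch of the excess bias, assuming the 8-bit exponent field and the constant 127 of the IEEE-754 single format:

```python
# The constant added to the exponent makes the stored value
# always non-negative, so the exponent's +/- sign can be omitted.
BIAS = 127            # excess bias for an 8-bit exponent field

def encode_exponent(exponent):
    """Store an exponent as a non-negative biased integer."""
    stored = exponent + BIAS
    assert 0 <= stored <= 255, "exponent out of range"
    return stored

# Biased exponents compare correctly with the plain integer comparator:
print(encode_exponent(-3))                           # 124
print(encode_exponent(+5))                           # 132
print(encode_exponent(-3) < encode_exponent(+5))     # True
```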

Because of their concise notation and their enormous range of values, floating-point numbers are in widespread use. Alas, their calculations require many steps and so are very tedious. When done by software they require very much time. To speed them up, special hardware has been designed, dedicated solely to handling these numbers.

Most of the computers are furnished with this special hardware, even the oldest ones. This hardware has got several names, all meaning the same:
- Floating-point processor (= FPP)
- Floating-point unit (= FPU)
- Mathematical co-processor (= MCP)

### Floating-point standards

For the sake of compatibility, in the modern days all computer brands tend to use the same standards for storing the floating-point numbers. These standards are logged by the Institute of Electrical and Electronics Engineers (IEEE) in definition 754.  This definition falls into two halves:

Binary part:
- exponent base = 2
- definition number = IEEE-754 or IEEE-754-1985
- name = Hidden Bit notation
- driving force = William Kahan
- remark: the coefficient's first bit is hidden

Decimal part:
- exponent base = 10
- definition number = IEEE-754r or IEEE-754-2008
- name = Packed Decimal Encoding
- driving force = Mike Cowlishaw
- remark: the exponent integer is stored binarily
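The hidden bit of the binary part can be made visible by unpacking a single-precision number with Python's standard struct module; the field widths (1 sign bit, 8 exponent bits, 23 fraction bits) are those of the IEEE-754 single format:

```python
import struct

def decode_single(x):
    """Split an IEEE-754 single-precision number into its fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign     = bits >> 31             # coefficient sign
    exponent = (bits >> 23) & 0xFF    # excess-biased by 127
    fraction = bits & 0x7FFFFF        # 23 stored coefficient bits
    # The coefficient's first bit is not stored: for a normalized
    # number it is always 1, so it stays hidden and is re-inserted.
    coefficient = 1 + fraction / 2**23
    return sign, exponent - 127, coefficient

print(decode_single(6.5))   # (0, 2, 1.625), since 6.5 = +1.625 * 2**2
```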

Both definition halves are described in this internet site. Besides these two, this internet site also describes a lot of other floating-point formats that have been designed in older times. Some of them are still in use.


## Some tiny history

### Number crunching versus text processing

The era of the computers started with the design of machines that could perform many arithmetic calculations in a short period of time and do this work automatically according to a prescribed series of commands. Three of these computers, viz. the Cray-1, CDC-6600 and Zuse-Z1, plus their direct descendants, were the fastest programmable calculators of their eras.

During the 1950s a different kind of computer was designed. Its main purpose was data ordering (e.g. data-base manipulation) and text processing, not arithmetic. Hence the French name 'ordinateur' for such a machine. These machines store and process the numbers in a decimal way, often address the memory decimally, and have small RAMemory words, e.g. of six bits. In general their numbers do not have an exponent part. The coefficient can have an arbitrary length, and its (often virtual) decimal period can be located everywhere inside it, not obligatorily before or after the first digit. Examples are the IBM-1401 and the IBM-1620, both designed in the late 1950s.

In a generalizing listing the two kinds have the properties:

```
kind                     NUMBER CRUNCHER        ORDINATOR
----                     ---------------        ---------
purpose                  scientific and         data ordering
                           technical              and text
                           calculations           processing
memory word size         long (e.g. 25 bits)    short (e.g. 5)
size of the numbers      fixed length           variable length
notation of numbers      with exponent          with free period
way of storage           binary                 decimal
  and calculation
most apt programming     Fortran                Cobol, RPG
  language
input and output and     by auxiliary device    by machine itself
  long-term storage
customer                 university, labs       bank, insurance
usage                    research & design      accountancy
```

### Combination machines

Even so, the IBM-1620 was intended primarily for scientific applications. Therefore it was equipped with Fortran and it had a special type of floating-point format. Its RAMemory words are built up of two (very) small words.

In the early 1960s IBM decided to make a computer that is not a refurbished ordinator, but really combines both categories of machines, and so is a machine that can handle the different data kinds simultaneously in an easy way. Sometimes its memory must behave as having long words; at other times it must act as consisting of short words. Also the machine should have two arithmetic processors, a decimal and a binary one. IBM's effort resulted in the IBM-360 computer. This machine revolutionized the arrangement of the computer's memory.

At present nearly all computers of nearly all brands, from palmtop upto giant server mainframe, except some small embedded chips, are combination machines that apply a memory structure based on that of the IBM-360.  Therefore this structure is described in more detail.

In such a structure the length of the memory words is a power of two. These words can be coupled two by two in order to make double-length words, quadruple-length words, and so on. The size of the elementary (= smallest, undividable) word can be 1 bit, 2 bits, 4 bits (= nibble), 8 bits (= byte), 16 bits, and so on. In the Intel 4004 it is 4 bits. In its successor, the Intel 8008, it is 8 bits. The memory of a 'medium sized' 8-bit byte based machine, the Digital PDP-11, is described.

Also the modern-day byte-based computers apply the same methods for the mathematical calculations. Their floating-point numbers are written in the hidden-bit notation that has been defined as the standard IEEE-754. This internet site shows a big list of many actual numeric formats.

In spite of the present-day consensus of applying memory words based on bytes and with the same mathematical operations there is no consensus about the names given to the words that result from the two-by-two coupling. Every company uses its own set of names. This internet site proposes a standardization for this naming.

In the early 1960s several other companies, like Univac, Digital and Burroughs, also began to make combination machines, but let them work in a different way. The memory is not a multi-word memory, but has long words only; often its word length is 36 bits. The short words are gained by a secondary addressing system that extracts parts out of these long words. Thus each memory word embraces a small row of short words. This internet site describes some of these computers.

Recently IBM has designed a way to store a decimal number in a long computer word, in the same way as a binary number with a fixed length. This hybrid contraption is called Packed Decimal Encoding. It has become the standard IEEE-754-2008.  Its preliminary name was IEEE-754r. This internet site explains it thoroughly.


## Conventions in the documents

### Names: Mantissa = Coefficient

In many documents about numeric storage in computers the words 'Mantissa', 'Coefficient' and 'Significand' are used. In many cases they have the same meaning: a part of a number. In older times this part was called the 'mantissa'. In modern times this word has been replaced by 'coefficient' or 'significand', and 'mantissa' sometimes has come to mean the (physical) space wherein this number-part is stored.

In some documents in this internet site the old word mantissa is still used for that number-part, in other documents the word coefficient. Inside each document this usage is consistent. The word significand is not used at all.

A number has two kinds of size. One kind is the length of its notation and the other is the numeric value it stands for. For example: the number 0,30102999566398119521373889472449E-983 has a very long notation (i.e. uses many bits or characters) but stands for a very small value. The reverse holds for the number 7E98.

In this internet site different adjectives are used for both kinds of size consistently according to this table:

```
         |   length of     numeric
   SIZE  |   nOtation       vAlue
  -------+--------------------------
         |
  little |     short        small
         |
  big    |     long         large
         |
  corresponding vowel:
     I   |       O            A
```

To remember this tiny table easily one should memorize the most ImpOrtAnt vowel in each word. Together these vowels can be pronounced as 'iowa'.

### Accuracy: Guaranteed or Number of digits

Generally, in mathematical calculations the accuracy of the computer is very important. It is the number of digits that can be represented reliably by the machine when it stores a numeric value. Stated simply: the longer the field wherein the numeric value has to be stored, the greater the number of digits that can be stored in that field, and the better the accuracy will be. Actually, the length of the coefficient determines the accuracy of the whole representation of the numeric value.

It is important whether the smallest or the largest value of the coefficient is used for the accuracy calculation, since the largest value of a normalized coefficient is bigger by nearly a factor of the exponent base than the smallest value. It is even bigger when the coefficient can be un-normalized.

Therefore when the computer representation of the numeric value is used as a floating-point number, it is defined as the relative accuracy that occurs when the coefficient has its smallest possible value, but still is normalized. This normalized minimum often has the bit pattern = 1000...  The accuracy thus gained is named the guaranteed relative accuracy, or in short the guaranteed accuracy.

When the representation is used as an integer number, it is defined as the relative accuracy that occurs when the coefficient has its largest possible value, which of course is normalized always. Very often this maximum has the bit pattern = 1111...  Now the accuracy is the relative accuracy for one plus this maximum. It is named the number of bits, or the number of computer digits.

Both accuracy figures are expressed as a count of bits or computer digits. For the same length of the coefficient both figures differ by one. In formula:  # digits = guar.acc. + 1

### Decimal figures

The count in bits or computer digits can be translated into the number of decimal digits the coefficient can stand for. The resulting figure becomes  10_log(exponent_base) * accuracy.  Consequently in every computer the guaranteed relative accuracy becomes decimally:
10_log (minimum_normalized_coefficient).
The maximum number of digits becomes decimally:
10_log (maximum_coefficient+1).
For the same length of coefficient both figures differ by:
10_log (exponent_base).

Most computers operate binarily, so they have an exponent base of 2.  Therefore, to determine this decimal accuracy, the value of 10_log(2) is very important:

```
   10_log(2) = 0,30102999566398119521373889472449......
             = 0,30103000  (approx.)
```
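This translation can be sketched in Python (the function name is hypothetical):

```python
import math

def decimal_digits(length, exponent_base=2):
    """Translate a coefficient length, counted in computer digits,
    into the count of decimal digits it can stand for:
    10_log(exponent_base) * accuracy."""
    return math.log10(exponent_base) * length

# A 24-bit binary coefficient stands for about 7.22 decimal digits:
print(decimal_digits(24))
```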

### Rounding in the tables

In many tables three or four decimal figures are used to show the properties of a numeric field. These are:
- maximum value the field can contain
- minimum normalized value that field can contain
- minimum un-normalized value (only when un-norm is allowed)
- accuracy, either guaranteed or number of digits

For a good readability these decimal figures are approximated. The rounding protocols applied on them are chosen such that the figures can be used with full safety. The protocols are:
- for minimum value: always rounding up
- for maximum value: always rounding down
- for accuracy and #digits: always rounding down

### The prepositions Until and Upto

The linguistic preposition Until means 'going to the edge, but excluding the edge itself'. Thus the edge is approached only. The linguistic preposition Upto means 'going to the edge and including the edge itself'. Thus the edge is touched.

So, when A and B are numbers (with A < B), then the sentence
From A until B
means the set of all numeric values between A and B plus the value A itself. The value B is not included. In the scientific mathematical notation this is written as [A,B).

The sentence
From A upto B
means the set of all numeric values between A and B plus the values A and B.  Now the value B is included too. In scientific mathematical notation this is written as [A,B].  Thus this set has one more value than the previous set.

When A and B are integral numbers and also only the integral numbers between them are taken into account, then each of both sets contains a finite series of numbers. Then the sentence 'from A upto B' equals the sentence 'from A until B+1'.

Example that uses integers:
From 3 until 8 means the collection of 3, 4, 5, 6 and 7.
From 3 upto 8 means the collection of 3, 4, 5, 6, 7 and 8.
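Mapped onto Python, 'from A until B' behaves exactly like the built-in range, and 'from A upto B' like range with B+1:

```python
# 'From A until B' excludes the edge B, exactly like Python's range:
until = list(range(3, 8))        # [3, 4, 5, 6, 7]

# 'From A upto B' includes B, i.e. it equals 'from A until B+1':
upto = list(range(3, 8 + 1))     # [3, 4, 5, 6, 7, 8]

print(until)
print(upto)
```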

Throughout the whole internet site these two prepositions are used consistently with their assigned meaning.

### Indexing the bits

Like every house in a street, every bit in a computer word has to be identified. This can be done by assigning a unique name to each bit. The name of a bit should be chosen such that it clarifies the use of that bit. It is practical to let this name shrink into a mnemonic or even into one single letter. For example: Parity-check bit => Parchk => P.

When the word is used only for the binary representation of a numeric value, then each bit can be identified by the part of that value it can stand for. Since the basis of every computer is binarily operating circuitry, the value-parts of the consecutive bits are 1, 2, 4, 8, 16, 32, and so on. When the bits named 1, 4, 8 and 32 are set to one and all other bits are zero, then the resulting numeric value is 1+4+8+32 = 45.
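The little sum above can be replayed in Python, whose binary literals make the value-parts visible:

```python
# Bits with the value-parts 1, 4, 8 and 32 are set to one,
# all other bits are zero:
value_parts = [1, 4, 8, 32]
number = sum(value_parts)

print(number)               # 45
print(number == 0b101101)   # True: exactly bits 0, 2, 3 and 5 are set
```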

Most computer manufacturers identify the bits simply by counting them consecutively from left to right in ascending or descending order. The lowest index is chosen either zero or one. Thus one gets four different index sequences, e.g. for a byte, viz:  0 1 2 3 4 5 6 7  and  1 2 3 4 5 6 7 8  and  7 6 5 4 3 2 1 0  and  8 7 6 5 4 3 2 1.  In these cases a mathematical formula can be used to derive from the index the value-part the bit stands for in a number.

Not only does every computer manufacturer index the bits in its own way, sometimes even the very same firm applies different indexing methods for its different computer models. For example: IBM identifies the bits differently between its 1620 model and its 7094 model. Similarly, DEC indexes differently between its PDP-10 and PDP-11.

Thus the identification of the bits in the computer words is quite a mess. Therefore, to avoid confusion, in all documents in this internet site the bits in all computer words are indexed in the same way. So in several cases they are re-indexed prior to the discussion of their meaning in the word structure.

The index increases from the right side to the left side of the word. The rightmost bit, which is the LSB (= least significant bit), gets the index 0, whilst the leftmost bit, which is the MSB (= most significant bit), gets the highest index. This index equals the total length of the word in bits minus one. The drawing elucidates this way of indexing.

```              Computer word with length of N bits
+----------------------------------------------+
|N-1 <------------- bit index --------------< 0|
|                                              |
|left side                           right side|
|                                              |
|MSB <--------- bit significance ---------< LSB|
+----------------------------------------------+
```
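A small Python sketch of this indexing convention (the helper name is hypothetical):

```python
def bit(word, index):
    """Return the bit with the given index; index 0 is the rightmost
    bit (LSB), index N-1 is the leftmost bit (MSB) of an N-bit word.
    The bit at a given index stands for the value-part 2**index."""
    return (word >> index) & 1

word = 0b10010110            # an 8-bit word, so the MSB has index 7

print(bit(word, 0))          # 0  (LSB, rightmost)
print(bit(word, 7))          # 1  (MSB, leftmost)
print([bit(word, i) for i in range(7, -1, -1)])   # [1, 0, 0, 1, 0, 1, 1, 0]
```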
