A NEW DATA TYPE FOR

DOUBLING THE ACCURACY

A PROPOSAL

Date of first publication: 18 April 2009

Multiplication added: 4 May 2009

At present most computers can handle floating-point numbers with a precision of approximately 7 ('single precision') or 15 ('double precision') decimal digits. A few models can even handle an accuracy of 33 digits, but most cannot. For those an emulation can be made that allows them to perform additions with an accuracy of 2 x 15 = 30 digits. This can be important in a serial addition of many small and large numbers that occur intermixed in the series.

To this end the resulting sum must be stored in a row of two double-precision numbers, here called the 'header' (= HEAD) and the 'trailer' (= TAIL, without 'r'). The exponent in the trailer is lower by approximately 15 decimals than the exponent in the header. This construction is similar to the one applied for double precision in the old CDC-6600 computer, which is described elsewhere on this website.

The combined number might be named a TWIN number. In fact it is a 'quadruple-precision' number. Thus a series of ordinary double-precision numbers can be added to such a quadruple-precision number with less loss of accuracy than to an ordinary double-precision number. The hardware and the firmware (= micro-programs) of the computer do not need to be adapted at all; only a small software module needs to be designed. The only drawback is that every addition requires much more time (up to a factor of ten) than the addition to a double-precision number.

In the following description the working of this emulator is demonstrated on a non-existent computer that calculates decimally with an accuracy of exactly six digits. Since it does not have double-precision numbers, these are emulated for the addition.

For ease of discussion we introduce a small hypothetical computer. This computer works with decimal digits, not binary. The non-integral numbers are written in floating-point notation. The length of the coefficient is six digits. The virtual decimal point stands after the first digit. Normalization always occurs, so the coefficient is zero (when all digits are zero) or has a value in the range from 1.00000 up to 9.99999 (i.e. just below 10.0). So its 'ordinary' accuracy is six digits, its guaranteed accuracy five digits.

The six-digit approximation of Euler's gamma constant is 0.577216 = 5.77216E-1. In the hypothetical computer its representation will look like 577216@-1. Herein the at-sign acts as the virtual separator between the coefficient and the exponent part. The values and representations of some other numbers are:

NAME      VALUE      IN COMPUTER   REMARK
sqrt2     1.41421    141421@0
pi        3.14159    314159@0      (Archimedes' constant)
e         2.71828    271828@0      (Napier's constant)
gamma     0.577216   577216@-1     (Euler's constant)
bignum    100 000    100000@+5     ('big number')
midnum5   49 999.5   499995@+4     ('middle-big' number)
midnum6   49 999.6   499996@+4
midnum7   49 999.7   499997@+4
midnum8   49 999.8   499998@+4
midnum9   49 999.9   499999@+4
midnum0   50 000.0   500000@+4
midnum1   50 000.1   500001@+4
midnum2   50 000.2   500002@+4
midnum3   50 000.3   500003@+4
midnum4   50 000.4   500004@+4
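These conventions can be mimicked with Python's decimal module set to six significant digits; the notation helper below is my own sketch, not part of the proposal:

```python
from decimal import Decimal, getcontext

getcontext().prec = 6   # the hypothetical six-digit machine

def notation(d: Decimal) -> str:
    """Render a machine number in the coefficient@exponent style above."""
    t = d.as_tuple()
    sign = "-" if t.sign else ""
    coeff = "".join(map(str, t.digits)).ljust(6, "0")   # pad to six digits
    return f"{sign}{coeff}@{d.adjusted()}"

pi = +Decimal("3.14159265")            # unary + rounds to the context precision
print(notation(pi))                    # 314159@0
print(notation(Decimal("0.577216")))   # 577216@-1
```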

Since the coefficient is fairly short, the numbers are said to be in 'single precision'. In the following text a software emulator will be designed to enable this computer to perform additions with double-precision accuracy.

A double-precision number is stored as a series of two single-precision numbers. The number that contains the large part is called the Header (= HEAD), and the number with the small part is called the Trailer (= TAIL, without 'r').

When pi is added to Bignum the result should be 100003.14159. Since the computer can handle only six digits, the result will actually become 100003 = 100003@+5. The five trailing digits get lost. These should be saved by storing them in a second number, thus: 0.14159 = 141590@-1. One can see that the exponents of the two numbers differ by 6, which is the accuracy of one such number.

In the CDC-6600 the hardware of the floating-point processor can act in double-precision mode. Then it produces the two resulting numbers in one stroke. But the FPP of our simple hypothetical computer can operate in single precision mode only. So it can produce only one resulting number at a time. Of course in an addition this will be the large number, 100003. The small number 0.14159 will get lost.

The trick of the emulator is that this second number can be retrieved by a cascade of subtractions wherein both input numbers are subtracted from the resulting large number. The remainder is the negated version of the value to be stored in the second output number. Thus:

100003  - bignum => 3.00000
3.00000 - pi     => -0.141590
store in the tail:  +0.141590

Thus the total addition-result becomes: 100003@5 + 141590@-1.
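This recipe can be replayed with Python's decimal module set to six digits; a sketch (decimal rounds to nearest by default, which here yields the same header):

```python
from decimal import Decimal, getcontext

getcontext().prec = 6            # six-digit machine

bignum = Decimal("100000")
pi     = Decimal("3.14159")

head = bignum + pi               # 100003.14159 rounds to 100003
r    = head - bignum             # 3.00000   (largest input first)
r    = r - pi                    # -0.14159
tail = -r                        # the recovered second number: 0.14159
```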

So the basic idea is that the emulator adds the two numbers and then subtracts them in a cascade from the result. The remainder thus obtained is put in a second result-number. Alas, this basic idea has to be refined, otherwise it will still give unreliable results.

First, the order in which the two subtractions are performed is very important; otherwise the value in the second number will not get the maximally available accuracy and thus will become unreliable. The largest input must be subtracted first. The next example shows what happens with the wrong order:

100003 - pi     => 100000
100000 - bignum => 0
store in the tail:  0.00000

Now the total addition-result becomes: 100003@5 + 000000@0, which is no more accurate than any single number.
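The wrong order can also be replayed in Python's decimal module (a sketch; decimal keeps guard digits internally, so the intermediate becomes 99999.9 rather than 100000, but the recovered tail is still far off the true value):

```python
from decimal import Decimal, getcontext

getcontext().prec = 6

bignum = Decimal("100000")
pi     = Decimal("3.14159")
head   = bignum + pi             # 100003

r    = head - pi                 # wrong: smallest input subtracted first
r    = r - bignum
tail = -r                        # 0.1 -- far off the true tail 0.14159
```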

Secondly, some registers inside the floating-point processor (= FPP) must be more accurate by at least one digit (or bit in a binary computer) than the input and output values of the FPP. This will be discussed next.

The numbers 100003 and 99999.8 have roughly the same value, but their notation suggests they are totally different. In spite of their numeric near-equality they are notated with different exponents, thus 100003@5 and 999998@4.

When they are added or subtracted, the coefficient of the smaller one must be shifted to the right in order to get equal exponents. Consequently its rightmost digit, the 8, is pushed out and lost. Thus the accuracy of the notation decreases to five digits. This jump in accuracy introduces a big aberration and therefore should be avoided.

So the two numbers are notated either with a different accuracy or with a different exponent, although they have nearly the same value. This weird idiosyncrasy should be avoided.

Therefore at least one of the registers inside the FPP, which are not accessible to the user, should have room for an additional digit (or bit in a binary computer). Thus the accuracy of both numbers can be kept at six digits whilst they have the same exponent. This makes the big aberration disappear.

In an addition of two arbitrary numbers the exponent in the result will often have the same value as the exponent in the largest input number. But sometimes it will increase by one. In that case it decreases by one in the following cascade subtraction. Correspondingly one of the coefficients is shifted to the right and later back to the left over one digit (or bit in a binary computer). This shift can introduce such a big aberration.

To perform these shifts correctly, at least one of the FPP's internal registers must be able to contain a coefficient that is longer by one or more digits (or bits) than the coefficients in the input and output registers. The next examples show that this additional digit keeps the accuracy at the expected level; without it the accuracy would be lower than expected. Luckily most modern-day computers have this extra digit or bit. The world's first computer, the Zuse Z1, already had this bit.

Without the extra digit:

Addition: midnum8 + midnum2
  499998@4 + 500002@4
  result has 7 digits: 1000000@4  ----> Overflow error!
  to avoid overflow, perform shifts to the right:
  049999@5 + 050000@5
  result inside the FPP has five digits: 099999@5
  shift to the left for the final 6-digit result: 999990@4
  which is quite inaccurate.

Subtraction: bignum - midnum2
  100000@5 - 500002@4
  shift the smaller number to the right:
  100000@5 - 050000@5
  intermediate result inside the FPP is: 050000@5
  final 6-digit result: 500000@4
  which is fairly inaccurate.

With the extra digit:

Addition: midnum8 + midnum2
  499998@4 + 500002@4
  intermediate 7-digit result inside the FPP is: 1000000@4
  final 6-digit result: 100000@5

Subtraction: bignum - midnum2
  100000@5 - 500002@4
  intermediate 7-digit result inside the FPP is: 0499998@5
  final 6-digit result: 499998@4
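Python's decimal module carries such guard digits internally, so it reproduces the results of the extra-digit case (a sketch):

```python
from decimal import Decimal, getcontext

getcontext().prec = 6    # six visible digits; guard digits are kept internally

midnum8 = Decimal("49999.8")
midnum2 = Decimal("50000.2")
bignum  = Decimal("100000")

s = midnum8 + midnum2    # 100000, exactly as in the addition example
d = bignum - midnum2     # 49999.8, keeping the full six-digit accuracy
```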

Nearly always the coefficient in the result of an addition or subtraction must be shortened to fit in the result register. This shortening is done either by truncation or by rounding, depending on the model or the setting of the computer. For the accuracy of the result it does not matter which mode is used, but the notations of the result may differ significantly. To show this, let us add Napier's number to Bignum.

Truncation mode:
  bignum + e
  100000@5 + 271828@0
  resulting first number = 100002@5
  subtraction cascade:
    100002@5 - 100000@5 =>  200000@0
    200000@0 - 271828@0 => -718280@-1
  resulting second number = +718280@-1
  final result: 100002@5 + 718280@-1
  Both numbers have the same +/- sign.

Rounding mode:
  bignum + e
  100000@5 + 271828@0
  resulting first number = 100003@5
  subtraction cascade:
    100003@5 - 100000@5 => 300000@0
    300000@0 - 271828@0 => +281720@-1
  resulting second number = -281720@-1
  final result: 100003@5 - 281720@-1
  The +/- sign is opposite in the two numbers.
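Both modes can be imitated by switching the rounding of the decimal context (a sketch; ROUND_DOWN plays the truncation mode, ROUND_HALF_UP the rounding mode):

```python
from decimal import Decimal, getcontext, ROUND_DOWN, ROUND_HALF_UP

ctx = getcontext()
ctx.prec = 6
bignum = Decimal("100000")
e      = Decimal("2.71828")

ctx.rounding = ROUND_DOWN            # truncation mode
head_t = bignum + e                  # 100002
tail_t = -((head_t - bignum) - e)    # +0.71828, same sign as the header

ctx.rounding = ROUND_HALF_UP         # rounding mode
head_r = bignum + e                  # 100003
tail_r = -((head_r - bignum) - e)    # -0.28172, opposite sign
```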

The CDC-6600 computer delivers the result of the truncation mode. For the ordinary user this result is easier to understand than the result of the rounding mode. In the case of a decimal computer it is also easier to print.

Adding a single number to a twin number requires a series of additions and subtractions. For example, let us add Napier's number to Bignum+Sqrt2, first in truncation mode, then in rounding mode:

Truncation mode:
  Bignum&Sqrt2 = 100001@5 + 414210@-1
  add the input to the header, making a new twin:
    100001@5 + 271828@0 => 100003@5 + 718280@-1
  add the new trailer to the trailer of Bignum&Sqrt2:
    414210@-1 + 718280@-1 => 113249@0
  this result is added to the new header in twin mode:
    100003@5 + 113249@0 => 100004@5 + 132490@-1

Rounding mode:
  Bignum&Sqrt2 = 100001@5 + 414210@-1
  add the input to the header, making a new twin:
    100001@5 + 271828@0 => 100004@5 - 281720@-1
  add the new trailer to the trailer of Bignum&Sqrt2:
    414210@-1 - 281720@-1 => 132490@-1
  this result is added to the new header in twin mode:
    100004@5 + 132490@-1 => 100004@5 + 132490@-1

In both cases the result of the last line is the final result of the whole addition. Its value is 100004.13249 = 1.0000413249E+5. As one can see, truncation or rounding does not significantly affect the value the result stands for.
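The cascade steps above can be condensed into one routine. Here is a sketch in Python's decimal module, in truncation mode, reproducing the example above (the function name twin_add is my own):

```python
from decimal import Decimal, getcontext, ROUND_DOWN

ctx = getcontext()
ctx.prec = 6
ctx.rounding = ROUND_DOWN            # truncation mode, as on the CDC-6600

def twin_add(head, tail, addend):
    """Add a single number to a (head, tail) twin via the cascade."""
    h1 = head + addend
    if abs(head) > abs(addend):      # subtract the largest input first
        t1 = (h1 - head) - addend
    else:
        t1 = (h1 - addend) - head
    t1 = tail - t1
    h2 = h1 + t1
    if abs(h1) > abs(t1):
        t2 = (h2 - h1) - t1
    else:
        t2 = (h2 - t1) - h1
    return h2, -t2

head, tail = twin_add(Decimal("100001"), Decimal("0.41421"), Decimal("2.71828"))
# head = 100004, tail = 0.13249
```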

In general the double-precision value cannot be multiplied or divided by another value. There is one exception: when that value is an integral power of the exponent base.

When a single precision value is multiplied or divided, then in general the last digit in the result's coefficient is not accurate, i.e. the actual value of that digit should be higher or lower by one. But nobody knows its right value. Sometimes even the last two or three digits in the coefficient are inaccurate.

A double-precision value should not be multiplied or divided by multiplying or dividing the header and the trailer separately. The inaccuracy thus gained in the last digit of the result's header stands for a greater value than the whole of the result's trailer. Thus this trailer becomes useless and the accuracy of the whole double number is merely a fake. Therefore such a double number should never be multiplied or divided in this simple way.

There is one exception. When the multiplier or divider is an integral power of the exponent base then the multiplication or division reduces merely into an increase or decrease of the exponents in both the header and the trailer. Both coefficients stay unaffected. No inaccuracy is introduced into the header, and so the trailer keeps its sense. Consequently the result is as accurate as the input double number.

Thus in a decimally calculating computer the double number can be multiplied or divided by 10^n, wherein n is an integral number, either negative or positive or zero. Some examples are:

HEAD      . TAIL        *  MULTIP  ==>  HEAD2     . TAIL2
+100001@5   +414210@-1     100          +100001@7   +414210@+1
+100001@5   +414210@-1     1/1000       +100001@2   +414210@-4
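In Python's decimal module this exact scaling is available as the scaleb method, which shifts only the exponent (a sketch):

```python
from decimal import Decimal, getcontext

getcontext().prec = 6

head, tail = Decimal("100001"), Decimal("0.41421")    # Bignum&Sqrt2

# Multiplication by 10**2: only the exponents shift; both coefficients
# stay unaffected, so no accuracy is lost.
head2, tail2 = head.scaleb(2), tail.scaleb(2)
```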

The absolute value of the integer n should not be too large. Otherwise, when the integer's sign is positive an overflow condition might occur. When it is negative the accuracy might deteriorate, as will be demonstrated below.

Since the right multiplier or divider depends on the exponent base, programs that use this operation explicitly may not be portable between different computer models. They are portable only between models with the same exponent base.

The multiplication by an ordinary number can be implemented, although in a very cumbersome way. It is a fairly long and tedious series of these special multiplications and divisions plus additions and subtractions. Thus it consumes an inordinate amount of time and computer power.

The accuracy of the double number decreases when the value of that number approaches zero. This effect will be shown first for the ubiquitously used definition of floating-point numbers and later for the definitions used in some older computers.

The definition IEEE-754 for binary floats states that the coefficient is always normalized, except when the exponent has its lowest value. The latter results in a deterioration of the accuracy when the number is divided further. Consequently the accuracy of a double number is less when the trailer's exponent has its lowest possible value (bit pattern = 000...).

This phenomenon will be elucidated by the next example in our decimal computer. Therein the lowest possible value of the exponent is -99. Only with that exponent value can the coefficient be unnormalized. The example shows the values of a double number that is divided by 10 repeatedly until it becomes zero. The series of results is:

HEAD         TAIL         ACCUR   REMARKS
+300001@-90  +414210@-96    12    both numeric parts are
+300001@-91  +414210@-97    12    normalized
+300001@-92  +414210@-98    12
+300001@-93  +414210@-99    12
+300001@-94  +041421@-99    11    trailer is not normalized
+300001@-95  +004142@-99    10
+300001@-96  +000414@-99     9
+300001@-97  +000041@-99     8
+300001@-98  +000004@-99     7
+300001@-99  +000000@-99     6    trailer is zero
+030000@-99  +000000@-99     5    header is not normalized
+003000@-99  +000000@-99     4
+000300@-99  +000000@-99     3
+000030@-99  +000000@-99     2
+000003@-99  +000000@-99     1
+000000@-99  +000000@-99     0    header is zero
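Python's decimal module supports exactly this kind of gradual underflow, so the decay can be replayed by setting Emin to -99 (a sketch; the variable names are mine):

```python
from decimal import Decimal, getcontext

ctx = getcontext()
ctx.prec = 6
ctx.Emin = -99          # lowest exponent, as in the decimal example above
ctx.Emax = 99

x = Decimal("4.14210E-99")    # still normalized: six digits
y = x / 10                    # subnormal: only five digits survive
z = y / 10 / 10 / 10 / 10     # one more digit is lost per division
# y = 4.1421E-100, z = 4E-104: a single significant digit left
```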

In some computer models the value of a float is treated as zero when the exponent has the lowest possible value, irrespective of the value of the coefficient. In our decimal example this means that the number is interpreted as zero when its exponent has the value -99. Thus: ******@-99 ==> 0.0. Computer models with this way of interpretation are the DEC PDP-11 pedigree, the Zuse Z1 and Z3, and the CDC-6600 pedigree.

Therein the unnormalization of the coefficient does not make any sense. Consequently a gap occurs near zero along the axis of values. This gap reduces the accuracy of the double number even faster in the neighbourhood of zero. In our similar decimal example the repeated division results in:

HEAD         TAIL         ACCUR   REMARKS
+300001@-90  +414210@-96    12    both numeric parts are
+300001@-91  +414210@-97    12    normalized
+300001@-92  +414210@-98    12
+300001@-93  +414210@-99     6    trailer value is zero
+300001@-94  +******@-99     6
+300001@-95  +******@-99     6
+300001@-96  +******@-99     6
+300001@-97  +******@-99     6
+300001@-98  +******@-99     6
+300001@-99  +******@-99     0    header value is zero

A double-precision adder can be emulated on a single-precision computer. Consequently a quadruple-precision addition can be performed on a computer that can handle double-precision numbers by its hardware. The only drawback is that the addition takes much more time since it consists of a lot of separate single-precision additions and subtractions and some if-clauses. To enhance speed this adder should be written in assembler code.

A double-precision multiplication or division is possible, although it will consume a great deal of computer power and time. So an actual computer can do these operations with quadruple accuracy, but at a high cost.

Note that when a double- (or: quadruple-) precision value is very close to zero, its accuracy is less than one might expect.

The emulated adder must be programmed into the computer. For the ease of the user, this program is given here in a symbolic high-level language. First the names of the variables and values are described, all of which are of single precision. The executable statements are notated under these descriptions.

The double-precision value to which the single-precision value ADDEND is added consists of the two numbers HEAD and TAIL.

The result of the addition is stored in HEAD2 and TAIL2.

The auxiliary variables for temporary use are H1, T1, H2 and T2.

In the executables |DATUM| means the absolute value of DATUM.

H1 = HEAD + ADDEND
if |HEAD| > |ADDEND|
then T1 = H1 - HEAD
     T1 = T1 - ADDEND
else T1 = H1 - ADDEND
     T1 = T1 - HEAD
endif
T1 = TAIL - T1
H2 = H1 + T1
if |H1| > |T1|
then T2 = H2 - H1
     T2 = T2 - T1
else T2 = H2 - T1
     T2 = T2 - H1
endif
HEAD2 = H2
TAIL2 = -T2
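The same symbolic program runs unchanged on IEEE binary doubles; a direct transcription to Python floats (a sketch, with the function name twin_add chosen here):

```python
def twin_add(head, tail, addend):
    """The symbolic Twin adder above, transcribed to IEEE double floats."""
    h1 = head + addend
    if abs(head) > abs(addend):
        t1 = (h1 - head) - addend
    else:
        t1 = (h1 - addend) - head
    t1 = tail - t1
    h2 = h1 + t1
    if abs(h1) > abs(t1):
        t2 = (h2 - h1) - t1
    else:
        t2 = (h2 - t1) - h1
    return h2, -t2

head, tail = twin_add(1.0, 0.0, 2.0**-60)
# head = 1.0; tail = 2.0**-60: the tiny addend survives in the trailer
```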

The user can check that this program works even when HEAD and ADDEND have roughly the same size but opposite signs. For example, when HEAD.TAIL = 100000@5-141592@-1 and ADDEND = -100003@5, then the result HEAD2.TAIL2 will become -314159@0-200000@-6. The following table gives several examples in the truncation mode:

HEAD      . TAIL       + ADDEND    ==>  HEAD2     . TAIL2      (REMARK)
+100001@5   +414210@-1   +271828@0      +100004@5   +132490@-1
+100000@5   -141592@-1   -100003@5      -314159@0   -200000@-6
+999998@4   +584074@-2   -100003@5      -314159@0   -260000@-6   1)
+999998@4   +584074@-2   -100003@5      -314160@0   +700000@-6   2)

REMARKS:
1) When the FPP's internal registers have an extra digit for keeping
   the accuracy during the shift over one place.
2) When the FPP's internal registers do not have this digit.

In 1965 William Kahan designed a mechanism to enhance the precision of the computer in an addition. It was a subroutine that adds together all values in an array. From its source text the essential adder can be derived. When Kahan's names 'Sum' and 'Compensator' are replaced by Header and Trailer, this adder looks like:

Y0 = ADDEND - TAIL
H1 = HEAD
H2 = H1 + Y0
T2 = H2 - H1
HEAD2 = H2
TAIL2 = T2 - Y0
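Kahan's original mechanism sums a whole array; a sketch of that loop in Python, with the single-step adder above as its body:

```python
def kahan_sum(values):
    """Kahan's array summation; 'total' is the Header, 'comp' the Trailer."""
    total = 0.0
    comp = 0.0                       # Kahan's 'Compensator'
    for v in values:
        y = v - comp                 # Y0 = ADDEND - TAIL
        t = total + y                # H2 = H1 + Y0
        comp = (t - total) - y       # TAIL2 = (H2 - H1) - Y0
        total = t
    return total

naive = sum([1e16] + [1.0] * 16)              # the ones are lost: 1e16
compensated = kahan_sum([1e16] + [1.0] * 16)  # 1e16 + 16, exactly
```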

This adder is approximately twice as fast as the one shown above. But alas, it is incomplete and unreliable. It should not be used when the number to be added has roughly the same value as, or a larger value than, the twin number it is added to. Then the trailer value is blurred, and so the desired accuracy gets lost.

Kahan's adder works correctly only when ADDEND is very small compared to the sum to which it is added. Nevertheless this adder gave the inspiration to design the adder above.

The above proposals give rise to a new data type with a small set of accompanying operators. Let us name this data type TWIN. Since it will consist of two floating-point numbers it is a double-value numeric, like a complex number.

The data type COMPLEX is a class of double-value numerics. Two independent numbers of equal length and structure are stored, transferred and updated together by single arithmetic commands. These two numbers are named the REAL and IMAGINARY parts. For example, the value-copy statement CB = CA means that both the real part and the imaginary part of the complex number CA are copied into the corresponding parts of the complex number CB. Thus two values are copied by one single statement.

The proposed type TWIN is a new class of double-value numerics that consist of two equally shaped numbers. The first number is called the 'header', and the second one is called the 'trailer'. For the moment the access to these individual numbers is given by the suffixes [HEAD] and [TAIL].

This new type of data must get four intrinsic functions:

- ALTERATION:

A Twin number is copied into another one. The original
contents of that other number are lost.

- ASSIGNMENT:

A single float is assigned to a Twin number. This serves
for initializing the Twin number.

The header gets the value of the float. The trailer
becomes zero.

- ADDITION:

A single float is added to a Twin number.

This addition is performed according to the descriptions
given in the text above.

Note that both numeric parts of the Twin number change.

- SUBTRACTION:

It negates the single float before adding it to the Twin
number.

When TWN and TWN2 are Twin-numbers and FL is an ordinary float then these four intrinsic functions work like:

- alteration:
    TWN[HEAD] = TWN2[HEAD]
    TWN[TAIL] = TWN2[TAIL]
- assignment:
    TWN[HEAD] = FL
    TWN[TAIL] = 0.0
- addition:
    ADDEND = FL
    Add ADDEND to TWN in the way shown above
- subtraction:
    ADDEND = -FL
    Add ADDEND to TWN in the way shown above
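A sketch of these four intrinsics as a small Python class (all names are illustrative, not part of the proposal):

```python
def _twin_add(head, tail, addend):
    # the cascade adder described earlier in the text
    h1 = head + addend
    if abs(head) > abs(addend):
        t1 = (h1 - head) - addend
    else:
        t1 = (h1 - addend) - head
    t1 = tail - t1
    h2 = h1 + t1
    if abs(h1) > abs(t1):
        t2 = (h2 - h1) - t1
    else:
        t2 = (h2 - t1) - h1
    return h2, -t2

class Twin:
    """Sketch of the proposed TWIN type with its four intrinsics."""

    def __init__(self, value=0.0):
        self.head = float(value)     # assignment: trailer becomes zero
        self.tail = 0.0

    def copy_from(self, other):      # alteration: both parts are copied
        self.head, self.tail = other.head, other.tail

    def __add__(self, fl):           # addition of an ordinary float
        result = Twin()
        result.head, result.tail = _twin_add(self.head, self.tail, float(fl))
        return result

    def __sub__(self, fl):           # subtraction: negate, then add
        return self + (-float(fl))
```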

In these texts the three external variables TWN, TWN2 and FL, the local variable ADDEND and all other local variables in the adder have the same computer-word size and structure. The adder and the subtractor differ only in the way ADDEND is filled.

For these four intrinsics the infix operator notations should be allowed:

- alteration:  TWN = TWN2
- assignment:  TWN = FL
- addition:    TWN2 = TWN + FL
- subtraction: TWN2 = TWN - FL

Note that Twin numbers are not numbers in the ordinary sense; they only contain such numbers. They cannot be used as an input for any arithmetic function or operator. Only their separate parts can. Therefore two Twin numbers also cannot be added together.

Since serial additions occur often (e.g. in accountancy) it might be useful to implement the data type Twin in actual languages like Cobol and Fortran. Here an idea for the version in Fortran is given.

The new data type TWIN can be implemented in the programming language Fortran easily. Most important is that its structure, memory storage and data transfer to and from a subprogram can equal those of the COMPLEX data type. Consequently it can be treated fairly similar to that COMPLEX type. The HEAD-part in the Twin type corresponds with the Real part in the Complex type and the TAIL-part corresponds with the Imaginary part.

The declaration statement should be analogous to that of the type Complex:

TWIN var1, var2, ... etcetera

All this means that a value of the Twin type can be transferred from a function to the calling program via the name of that function. The header of such a function looks like:

TWIN FUNCTION funcname ( argument_list )

In contrast with a Complex variable or value a Twin variable or value cannot be used as list item in the I/O list of a READ or WRITE statement. The reason is that a Twin data item is not a number in the ordinary sense. Moreover, the +/- sign of the trailer can be opposite to the sign of the header.

In a DATA statement a Twin variable can be initialized. The corresponding value in the list of values is a Real number. The trailer-part is always set to zero implicitly. A series of two float values between parentheses similar to a Complex value is forbidden. Constant literals of type Twin do not exist.

The alteration and the assignment are exactly the same as those for the Complex type. So for them the same operator notation can be used: TWN = TWN2 and TWN = FL.

Both data parts can be read separately by intrinsic functions similar to REAL and AIMAG. Here they might be named HEADTWN and TAILTWN, and their results are ordinary float numbers. But the analogue of the composing function CMPLX must be absent. The user should not be allowed to change the header and the trailer directly, since the exponents of both parts might not match, e.g. the exponent in the trailer becomes too great. Nevertheless, the user still can do so via the EQUIVALENCE statement.

For the addition and the subtraction the ordinary plus and minus symbols can be applied. Therefore it should be allowed that both symbols can have a Twin value on their left side; their output is then of Twin type. These operators never accept a right operand of Twin type. That operand must be an ordinary real or integer number. Twin values can never act as inputs for multiplication, division or other arithmetic operators. Only their separate parts can.

An example of calling a Twin function in an expression is:

TWN2 = funcTWNmaker (7.5E-4, int_num) + 3.1415926

The user/programmer is always allowed to use the same variable both for input and for output of a Twin-changing statement. For example: TWN1 = TWN1 + FL.

The above discussion suggests that the Twin variables are made of single-precision float numbers. Of course, they actually are not, since all computers can handle double-precision numbers directly. So the Twin numbers should be organized like the DOUBLE PRECISION COMPLEX numbers. Then they emulate a quadruple precision.

As long as the Twin data type is not a part of the Fortran language, the user/programmer can emulate it himself by using the existing data type DOUBLE PRECISION COMPLEX. The operators for the assignment and the alteration can be borrowed directly from this complex type.

Of course in this case the compiler cannot check whether the user is applying the data of type Double-Precision-Complex as actually complex data or as twin data. The user/programmer must take care by himself that he does not use the whole Twin number as an ordinary numeric input for an arithmetic operator or a function. He only should use each of its two parts separately as such input. The HEAD part is selected by calling the Real part, and the TAIL part is selected by calling the Imaginary part.

The user only needs to write the two functions ADDTWIN and SUBTRTWIN. He can easily write these functions on the basis of the symbolic program of the Twin adder. Their function headers should look like:

DOUBLE PRECISION COMPLEX FUNCTION funcname (TWN1, FL)

DOUBLE PRECISION COMPLEX TWN1

DOUBLE PRECISION FL

IMPORTANT NOTE:

Neither the user nor the compiler should optimize any part of the code as written in the symbolic program of the Twin adder. It is even better to 'copy' this code in assembler language and not in Fortran; then the user can be sure that no optimization has taken place. This subprogram should be linked with the ordinary (and perhaps optimized) Fortran programs that call it.