This repository addresses usnistgov/discuss#8 by exploring floating-point rounding errors through the lens of the associative property of addition. The programs herein demonstrate that deviations from mathematical precision are due to the standard representation of floating-point numbers in a binary data type, in compliance with IEEE 754. The evidence presented here does not support any role of the CPU, instruction sets, or dynamic execution in the deviations.
tl;dr: If strict adherence to mathematical law is required, use a high-precision math library such as GNU MPFR instead of the built-in data types. If performance is needed, and some accuracy can be sacrificed, build your C programs with "unsafe" math optimizations:

gcc -funsafe-math-optimizations
This primer was written by Trevor Keller trevor.keller@nist.gov. It is intended for educational purposes only. Please cite this work:
Trevor Keller, "A primer on floating-point arithmetic and IEEE 754." National Institute of Standards and Technology (2018): DOI 10.5281/zenodo.6524704.
To build the C and C++ programs in this repository, you will need:
- C and C++ compilers with standard libraries, e.g. gcc
- Make build tool
- MPFR library and headers, version 4 or greater
On Debian derivatives (Ubuntu, Mint), these are all packaged:
apt install gcc libmpfr6 libmpfr-dev make
Addition of real numbers is [associative][add]: inserting parentheses must not change the computed sum. For example, (a + b) + c = a + (b + c).
The two's complement representation of integers naturally supports associativity within the representable range of an n-bit integer. However, fractional real numbers incur round-off error, since infinitely many digits cannot practically be stored. The chief exception to this general rule is that real fractions formed by sums of powers of two are represented exactly, a natural consequence of the data format. Note that two's complement is not the internal representation used for floating-point numbers, but it is a useful way to visualize them.
When addition is associative, we expect the sum of three terms to satisfy (a + b) + c = a + (b + c) exactly.
The code has three variants: `std` (standard representation), `mpf` (GNU MPFR representation), and `unsafe` (`std` with unsafe optimizations).
Use the included Makefile or the commands in the following sections to compile the executables.
Each program will output the value of a, its binary representation, the binary value of the sum, the two groupings (a + b) + c and a + (b + c), and whether the two sums are equal.
$ make std
gcc -O3 -Wall addition.c -o std -lm && ./std
a | bin(a) | bin(a+b+c) | (a+b)+c | a+(b+c) | equal |
---|---|---|---|---|---|
1.000000000 | 1.0 | 1.0 | 1.000000000 | 1.000000000 | 1 |
0.500000000 | 0.1 | 0.1 | 0.500000000 | 0.500000000 | 1 |
0.333333343 | 0.0101010101010101010101011 | 0.01010101010101010101011 | 0.333333373 | 0.333333343 | 0 |
0.250000000 | 0.01 | 0.01 | 0.250000000 | 0.250000000 | 1 |
0.200000003 | 0.00110011001100110011001101 | 0.0011001100110011001101 | 0.200000048 | 0.200000003 | 0 |
0.166666672 | 0.00101010101010101010101011 | 0.00101010101010101010101 | 0.166666627 | 0.166666672 | 0 |
0.142857149 | 0.00100100100100100100100101 | 0.00100100100100100100101 | 0.142857194 | 0.142857149 | 0 |
0.125000000 | 0.001 | 0.001 | 0.125000000 | 0.125000000 | 1 |
Using MPFR configured with precision comparable to the built-in float, we see the same pattern of round-off errors:
$ make spf
gcc -O3 -Wall -D_SINGLE_ -pedantic -include "mpfr.h" addition.c -o spf -lm -lmpfr && ./spf
a | bin(a) | bin(a+b+c) | (a+b)+c | a+(b+c) | equal |
---|---|---|---|---|---|
1.000000000 | 1.0 | 1.0 | 1.000000000 | 1.000000000 | 1 |
0.500000000 | 0.1 | 0.1 | 0.500000000 | 0.500000000 | 1 |
0.333328247 | 0.0101010101010101 | 0.01010101010101 | 0.333312988 | 0.333328247 | 0 |
0.250000000 | 0.01 | 0.01 | 0.250000000 | 0.250000000 | 1 |
0.199996948 | 0.0011001100110011 | 0.001100110011001 | 0.199981689 | 0.199996948 | 0 |
0.166664124 | 0.00101010101010101 | 0.001010101010101 | 0.166656494 | 0.166664124 | 0 |
0.142856598 | 0.001001001001001001 | 0.001001001001001 | 0.142852783 | 0.142856598 | 0 |
0.125000000 | 0.001 | 0.001 | 0.125000000 | 0.125000000 | 1 |
The MPFR library allows us to increase the precision from double (64-bit) to quadruple (128-bit):
$ make mpf
gcc -O3 -Wall -pedantic -include "mpfr.h" addition.c -o mpf -lm -lmpfr && ./mpf
a | bin(a) | bin(a+b+c) | (a+b)+c | a+(b+c) | equal |
---|---|---|---|---|---|
1.000000000 | 1.0 | 1.0 | 1.000000000 | 1.000000000 | 1 |
0.500000000 | 0.1 | 0.1 | 0.500000000 | 0.500000000 | 1 |
0.333333313 | 0.010101010101010101010101 | 0.010101010101010101010101 | 0.333333313 | 0.333333313 | 1 |
0.250000000 | 0.01 | 0.01 | 0.250000000 | 0.250000000 | 1 |
0.199999988 | 0.001100110011001100110011 | 0.001100110011001100110011 | 0.199999988 | 0.199999988 | 1 |
0.166666657 | 0.0010101010101010101010101 | 0.0010101010101010101010101 | 0.166666657 | 0.166666657 | 1 |
0.142857134 | 0.001001001001001001001001 | 0.001001001001001001001001 | 0.142857134 | 0.142857134 | 1 |
0.125000000 | 0.001 | 0.001 | 0.125000000 | 0.125000000 | 1 |
The increased precision is implemented in software; libraries like MPFR allow the programmer to choose an appropriate level of precision, the rounding scheme, and a deterministic order of operations. However, they are necessarily slower than numerical formats supported in hardware (double-, single-, and half-precision floats).
Per *Using the GNU Compiler Collection (GCC)*, §3.10, `-funsafe-math-optimizations` enables optimizations that:
- assume both arguments and results are valid
- violate IEEE and ANSI standards
- change the floating-point unit control word
- cause programs that rely on exact implementation of IEEE or ISO rules to fail
- may yield faster code for programs that do not require the guarantees of these specifications.
In other words, it is ill-advised for production code. However, for this test case, it restores the appearance of associativity:
$ make unsafe
gcc -O3 -Wall -funsafe-math-optimizations addition.c -o unsafe -lm && ./unsafe
a | bin(a) | bin(a+b+c) | (a+b)+c | a+(b+c) | equal |
---|---|---|---|---|---|
1.000000000 | 1.0 | 1.0 | 1.000000000 | 1.000000000 | 1 |
0.500000000 | 0.1 | 0.1 | 0.500000000 | 0.500000000 | 1 |
0.333333343 | 0.0101010101010101010101011 | 0.0101010101010101010101011 | 0.333333343 | 0.333333343 | 1 |
0.250000000 | 0.01 | 0.01 | 0.250000000 | 0.250000000 | 1 |
0.200000003 | 0.00110011001100110011001101 | 0.00110011001100110011001101 | 0.200000003 | 0.200000003 | 1 |
0.166666672 | 0.00101010101010101010101011 | 0.00101010101010101010101011 | 0.166666672 | 0.166666672 | 1 |
0.142857149 | 0.00100100100100100100100101 | 0.00100100100100100100100101 | 0.142857149 | 0.142857149 | 1 |
0.125000000 | 0.001 | 0.001 | 0.125000000 | 0.125000000 | 1 |
As @JRMatey-NIST noted, this is a straightforward byproduct of the binary representation of floating-point numbers. Exact representation is possible for sums of integer powers of 2; any other number incurs round-off error. This is good to know, especially for codes that frequently increment large values by small amounts.
Since many scientific computing applications model diffusive processes (heat transfer, mass diffusion, etc.), the effect is expected to be small: the perturbations caused by round-off error will be smoothed out by the stable numerical scheme without any additional effort on the part of the programmer.
Ideally, the sequence of decimals (powers of 10) would sum to the same value regardless of the order of its terms. However, due to the same floating-point representation problem, variations arise from the order of summation. As a demonstration, this program generates a vector of 64 numbers, shuffles it, and accumulates the sum, repeating one million times. Similarly, the sequence of binaries (powers of 2) should sum identically in any order; here the program generates a vector of 65 numbers and repeats the same experiment.
There are two variants, `shuffle` and `shuffle10`, which can be built using the Makefile or the commands listed below.
$ make shuffle10
g++ -O3 -Wall -pedantic -std=c++11 -DDECIMAL summation.cpp -o shuffle10 && ./shuffle10
Value | Proportion |
---|---|
9999.99414062500000000000000000 | 0.033300001 % |
9999.99511718750000000000000000 | 0.624599993 % |
9999.99609375000000000000000000 | 4.240699768 % |
9999.99707031250000000000000000 | 15.360699654 % |
9999.99804687500000000000000000 | 34.903400421 % |
9999.99902343750000000000000000 | 33.413898468 % |
10000.00000000000000000000000000 | 11.187700272 % |
10000.00097656250000000000000000 | 0.235699996 % |
The `shuffle` variant, summing the binary sequence, produces a single value:
Value | Proportion |
---|---|
128.00000000000000000000000000 | 100.000000000 % |
The sequence composed exclusively of powers of 2 is represented exactly: one million repetitions produce the same result, exactly 128, every time. The sequence of powers of 10 is approximate, with the exact result, 10,000, computed in only 11 % of the million repetitions.
Out-of-order execution is a common method to avoid processor pipeline stalls: instructions are fetched in the order provided by the compiler, but executed in a way that minimizes wasted processor cycles. While modern CPUs implement out-of-order execution with in-order completion, a few architectures feature out-of-order completion of the compiler-supplied instructions, e.g. Intel's P6 (Pentium Pro/II/III), Core, and Silvermont (Atom, Knights Landing). Since the sequence of mathematical operations would be non-deterministic, these processors could exhibit poor reproducibility of floating-point arithmetic.
Variants of the shuffled summation and three-term addition codes can be used to compare results between Intel Xeon (Sandy Bridge) and Xeon Phi Knights Landing (Silvermont) processors. Since the KNL hardware is novel, the Intel C++ Compiler is preferred over GCC; it sets `-fp-model fast=1` by default, which is equivalent to `-Ofast` in GCC and must be disabled to provide a valid comparison using standards-compliant arithmetic. The variants, `phi` and `shufflePhi`, can be built using the Makefile or the commands below.
$ make phi
icc -O3 -Wall -xmic-avx512 -fp-model precise addition.c -o phi && ./phi
a | bin(a) | bin(a+b+c) | (a+b)+c | a+(b+c) | equal |
---|---|---|---|---|---|
1.000000000 | 1.0 | 1.0 | 1.000000000 | 1.000000000 | 1 |
0.500000000 | 0.1 | 0.1 | 0.500000000 | 0.500000000 | 1 |
0.333333343 | 0.0101010101010101010101011 | 0.01010101010101010101011 | 0.333333373 | 0.333333343 | 0 |
0.250000000 | 0.01 | 0.01 | 0.250000000 | 0.250000000 | 1 |
0.200000003 | 0.00110011001100110011001101 | 0.0011001100110011001101 | 0.200000048 | 0.200000003 | 0 |
0.166666672 | 0.00101010101010101010101011 | 0.00101010101010101010101 | 0.166666627 | 0.166666672 | 0 |
0.142857149 | 0.00100100100100100100100101 | 0.00100100100100100100101 | 0.142857194 | 0.142857149 | 0 |
0.125000000 | 0.001 | 0.001 | 0.125000000 | 0.125000000 | 1 |
0.111111112 | 0.000111000111000111000111001 | 0.000111000111000111001 | 0.111111164 | 0.111111112 | 0 |
0.100000001 | 0.000110011001100110011001101 | 0.00011001100110011001101 | 0.100000024 | 0.100000001 | 0 |
0.090909094 | 0.0001011101000101110100011 | 0.00010111010001011101001 | 0.090909123 | 0.090909094 | 0 |
0.083333336 | 0.000101010101010101010101011 | 0.00010101010101010101011 | 0.083333373 | 0.083333336 | 0 |
0.076923080 | 0.000100111011000100111011001 | 0.0001001110110001001111 | 0.076923132 | 0.076923080 | 0 |
0.071428575 | 0.000100100100100100100100101 | 0.0001001001001001001001 | 0.071428537 | 0.071428575 | 0 |
0.066666670 | 0.000100010001000100010001001 | 0.00010001000100010001001 | 0.066666722 | 0.066666670 | 0 |
0.062500000 | 0.0001 | 0.0001 | 0.062500000 | 0.062500000 | 1 |
$ make shufflePhi
icc -O3 -Wall -pedantic -std=c++11 -DDECIMAL -xmic-avx512 -fp-model strict summation.cpp -o shufflePhi && ./shufflePhi
Value | Proportion |
---|---|
9999.99414062500000000000000000 | 0.033300001 % |
9999.99511718750000000000000000 | 0.624599993 % |
9999.99609375000000000000000000 | 4.240699768 % |
9999.99707031250000000000000000 | 15.360699654 % |
9999.99804687500000000000000000 | 34.903400421 % |
9999.99902343750000000000000000 | 33.413898468 % |
10000.00000000000000000000000000 | 11.187700272 % |
10000.00097656250000000000000000 | 0.235699996 % |
The results of shuffled summation, with trailing zeros truncated, show no difference between the microarchitectures beyond the least significant printed digits:
-------------------------------- --------------------------------
Xeon E5-1650 Xeon Phi 7210
-------------------------------- --------------------------------
9999.9941406250: 0.033300000 % 9999.9941406250: 0.033300001 %
9999.9951171875: 0.624600000 % 9999.9951171875: 0.624599993 %
9999.9960937500: 4.240700000 % 9999.9960937500: 4.240699768 %
9999.9970703125: 15.360700000 % 9999.9970703125: 15.360699654 %
9999.9980468750: 34.903400000 % 9999.9980468750: 34.903400421 %
9999.9990234375: 33.413900000 % 9999.9990234375: 33.413898468 %
10000.0000000000: 11.187700000 % 10000.0000000000: 11.187700272 %
10000.0009765625: 0.235700000 % 10000.0009765625: 0.235699996 %
-------------------------------- --------------------------------
For both test cases (associativity of three-term addition and order-dependence of shuffled summation), the results are consistent across microarchitectures, pointing to the floating-point representation, rather than the hardware, as the source of the deviations.
The following resources helped me to understand floating-point arithmetic.
- David Goldberg (1991): "What Every Computer Scientist Should Know About Floating-Point Arithmetic." Concise summary of how the standard floating-point representation works, and pitfalls to avoid.
- Donald Knuth, *The Art of Computer Programming*, Vol. 2, Ed. 3 (1997), §4.2.2: "Accuracy of Floating Point Arithmetic." Philosophical review of floating-point representations with discussion of the compromises made to arrive at a working standard.
- Erik Cheever (2001): "Representation of Numbers." Summary, with exercises, of how the standard represents real numbers.
- Rick Regan (2012): "Why 0.1 Does Not Exist In Floating-Point." Excellent illustration of why round-off error is a necessary evil in the standard representation.
- James Demmel and Hong Diep Nguyen (2013): "Numerical Reproducibility and Accuracy at Exascale." Short paper summarizing reproducibility of summation algorithms.
- James Demmel, Hong Diep Nguyen, and Peter Ahrens (2015): "Cost of Floating-Point Reproducibility." Detailed slides discussing sources of non-reproducibility and the computational cost of correcting the deficiencies.
- Jeff Arnold (2017): "An Introduction to Floating-Point Arithmetic and Computation." Thorough slide deck highlighting problem areas for scientific computing.
- Thomas Risse (2017): "Better is the Enemy of Good: Unums — An Alternative to IEEE 754 Floats and Doubles." Concise discussion of an alternative floating-point representation.
- James Matey (2018): "Re: Floating-Point Repeatability Issues on Modern Processors." Insightful comment on the GitHub Issue with helpful references.