
Two schools

There are two schools of thought on how to write numerical code; in this document we will call them the IEEE school and the Fortran school.

At one end of the spectrum: the compiler makes no changes to floating point operations, and the developer must express exact intent. This is the IEEE school.

At the other end of the spectrum: the developer specifies the mathematics, and the compiler is free to do any rearrangement, no matter how big. This is SymPy. It only works with infinite precision, not finite precision.

In the middle is the Fortran school: the compiler is free to do small rearrangements, as well as arbitrary rearrangements in loops / arrays for performance. The developer must express intent by structuring the algorithm into loops and array operations. Individual expressions and statements can be rearranged locally (to an extent on the order of 10 operations or so), but not globally (changing the order of operations ~1000 instructions apart). Such a rearrangement does not affect accuracy significantly, and it can improve performance.

IEEE School

The IEEE school relies on the semantics of the IEEE Standard 754 for Floating-Point Arithmetic, assumes they always hold, and forbids the compiler from doing any kind of optimization that would change the order of floating point operations. It requires that the compiler evaluate the expression exactly as written.

A given expression in the code, say (a+redux)-redux, must be evaluated exactly as written. For small redux, the answer will be close to a. For redux=0x1.8p52 / 256, the answer will be round(256*a)/256, but the compiler is not allowed to do any transformations like that, because in corner cases the transformed expressions might not be equal, or there might be some loss or gain of accuracy. The compiler is only allowed to do transformations that do not change the answer bit-wise.
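
For concreteness, here is a minimal Fortran sketch of the idiom (the program and variable names are illustrative, not from the original text); it must be compiled without -ffast-math, so that the IEEE order of operations is preserved:

program redux_demo
implicit none
integer, parameter :: dp = kind(0.d0)
real(dp) :: a, redux
a = -0.7_dp
redux = 1.5_dp * 2.0_dp**52 / 256   ! the constant 0x1.8p52 / 256
print *, (a + redux) - redux        ! rounds a to the nearest multiple of 1/256
print *, nint(a*256)/256._dp        ! reference answer: round(256*a)/256
end program

Both lines should print the same value, -0.69921875; with -ffast-math the compiler may instead simplify the first line to just a and print -0.7.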

It is always exactly defined what floating point number a given expression will produce. But it is not always clear what exact mathematics the expression solves.

Code written in the IEEE school of thought can't use any optimizations, such as -ffast-math, that rewrite expressions, change the order of evaluation, or break IEEE compliance in some other way.

One must study the details of how floating point numbers work according to the IEEE 754 standard. A good starting point is the essay "What Every Computer Scientist Should Know About Floating-Point Arithmetic" by D. Goldberg.

Fortran School

In the Fortran school, the expression is written as the math formula that it solves, and the compiler is allowed to do any kind of transformation that is mathematically equivalent (assuming infinite precision). The final code that will be used to evaluate the expression is not precisely specified (and as such will numerically differ depending on the platform, compiler and compiler options). Instead, it is assumed that the expression as written in the code is well conditioned, and the compiler is expected to do only transformations that improve performance while not significantly worsening the conditioning.

As an example, (a+redux)-redux is mathematically equivalent to a, and the compiler can numerically evaluate either (a+redux)-redux or a. This transformation is valid, because the compiler assumes that the expression (a+redux)-redux is well conditioned, and so it cannot happen that, say, a=1 and redux=2**52.

The compiler can't replace a with (a+redux)-redux, because the compiler can only assume that the expression a is well-conditioned, but the developer has not expressed anything about the redux variable (which can indeed be equal to redux=2**52), and as such, the final mathematically equivalent expression (a+redux)-redux might not be well conditioned.

The compiler can replace

y = x + b
use(y)
<do other stuff>
use(x, y, b)

with

y = x + b
use(y)
<do other stuff>
use(y - b, y, b)

i.e., doing a substitution x -> x+b-b = y-b, because the developer wrote y=x+b, and so the compiler knows that x+b is well conditioned, and as such can then replace x with x+b-b. If the developer didn't write x+b, then the compiler can't replace x with x+b-b.

If an expression is well-conditioned, here are a few examples of the most common allowed transformations that preserve conditioning:

x + 0.0 -> x
x - x -> 0.0
x / C -> x * (1.0/C)
(a+b)+c -> a+(b+c)
(a*b)*c -> a*(b*c)
a*(b+c) -> a*b + a*c
a*b + a*c -> a*(b+c)

For example, (a+redux)-redux -> a + (redux-redux) -> a + 0.0 -> a.

One transformation that is forbidden is to introduce other variables, unless the compiler knows from previous expressions that the new variables will not worsen the conditioning.

Here is the list of the most common transformations that gcc does: https://gcc.gnu.org/wiki/FloatingPointMath in the "Transformations" section, and all of them preserve conditioning.

It is always exactly defined what exact mathematics the expression solves. But it is not always clear what exact floating point number a given expression will evaluate to, as that depends on how the compiler rewrites the expression.

Code written in the Fortran school can be compiled with any compiler, no matter how aggressively it optimizes the code. In particular, it is encouraged to use -ffast-math to obtain good performance.

One does not need to study the details of how floating point numbers work. But one must follow a certain set of rules when writing numerical code. One such set of rules is given in the next section.

Rules for writing correct code in the Fortran school

Here is a simple set of consistent rules that, when followed, will guarantee that the code is valid in the Fortran school of thought, and consequently that the code will always work no matter how aggressively the compiler optimizes it. These rules are not unique; one can design other sets of rules, or relax them in various ways. The advantage of this set is that the rules are general, simple and easy to follow.

  1. Conditioning (order of operations): Ensure all expressions are well conditioned, i.e., the expression will evaluate with acceptable (for the application domain) accuracy compared to the exact (infinite precision) answer. For example, in a+b-b, both a and b must have similar magnitude, and not a=1 and b=2**52. In a-b, if both a and b have similar magnitude, there will be numerical cancellation, and one must ensure that the final accuracy is enough for the application domain. When summing an array sum([....]), ensure the sum is well conditioned, i.e., all array elements have similar magnitude. Assume the compiler will rewrite the expression (do not assume associativity or other properties). As long as the expression is well conditioned, the compiler will preserve the conditioning. The order of operations is not defined; since all expressions/sums/etc. are well conditioned per the previous points, the order of operations does not matter, and one always obtains the same answer to the accuracy of the original expression.

  2. Array/Loops: Expressions, statements and unrolled loops (either by hand or automatically by the compiler) are all assumed to be well conditioned, and the compiler can rearrange the order of operations arbitrarily, as long as the changes are local (on the order of 10 elements), as that does not change the accuracy significantly. It cannot rearrange instructions globally (more than ~1000 operations apart), as that could change the conditioning significantly.

    Array operations and loops do not have any specific order of operations on the array elements. However, array operations like sum(v1)+sum(v2) cannot be merged into sum([v1, v2]) if either of the arrays is bigger than ~10 elements (and the same applies to loops, which also cannot be merged). That follows from the previous paragraph, as loops can be unrolled by the compiler. However, loops and array operations have the extra property that if the developer has written them in the code, then the developer is telling the compiler that the array or loop operation is well conditioned no matter how large the array or loop is, and as such the compiler can rearrange the iterations arbitrarily. E.g., sum(v1) can internally be implemented in any order. This is consistent with the BLAS standard.

  3. Comparison: Compare floating point numbers as abs(a-b) < eps, not as a == b.

  4. Singularities: When a function has a singularity, such as when dividing 1/x, check that x is not zero by comparing it (using the previous point): abs(x) > eps, and choose eps such that 1/x does not evaluate to infinity (which happens if x is a denormal number). One can use neither a test like x == 0 nor a test like x < 0 to guard 1/x: in both cases use if (abs(x) > eps) then 1/x instead, and choose a large enough eps. The same rule applies to log(x) and other functions with singularities. (Rules 3 and 4 are illustrated in the sketch after this list.)

  5. Accuracy: Assume all functions and special functions like sin, cos, bessel will return an answer with accuracy comparable to machine precision, e.g., ~1e-14 for double precision and ~1e-7 for single precision. The compiler is free to choose faster and a tiny bit less accurate implementations of these when optimizations are turned on. Write the code in such a way that when a sin implementation is changed or the accuracy changes from 1e-14 to 5e-14, the algorithm will still work.

  6. NaN/Inf: Make sure never to divide by 0, or to obtain NaN or Inf with any floating point operation. That way one does not have to special-case all functions to handle those. We simply assume that NaN and Inf do not exist, and the code should abort if they are created.

  7. Debug/Release modes: Have Debug and Release modes. In Debug mode, enable exceptions for NaN and Inf, and initialize all floating point variables to NaN; e.g., in gfortran one can use -ffpe-trap=invalid,zero,overflow -finit-real=snan -finit-integer=-99999999. Integers are initialized to some large negative value, to increase the chances that the code will crash if an integer is used without being initialized explicitly in the code. The Release mode can enable -ffast-math and all other optimizations. Then ensure that the code never crashes in Debug mode for any user input. One can then use the Release mode for production runs.

  8. Properties of floating point numbers: When writing numerical algorithms, assume that the floating point representation allows a certain number of significant digits, and assume each operation has a certain accuracy. For builtin operations (such as arithmetic, special functions, etc.) one can assume close to machine accuracy; for user defined functions one must consult the documentation for the accuracy. Write your code in such a way that it does not depend on any particular implementation of the functions you call. Use Fortran primitives such as tiny, epsilon, huge, selected_real_kind, precision, nearest, spacing, ..., to query the floating point representation. Do not depend on any other platform dependent property. Do not assume IEEE compliance.
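
To illustrate rules 3 and 4, here is a minimal sketch (the value of eps and the names are illustrative; a real eps must be chosen for the application domain):

program rules_demo
implicit none
integer, parameter :: dp = kind(0.d0)
real(dp), parameter :: eps = 1e-12_dp
real(dp) :: a, b, x
a = 0.1_dp + 0.2_dp
b = 0.3_dp
! rule 3: compare with a tolerance, never with ==
if (abs(a - b) < eps) print *, "a and b are equal to within eps"
! rule 4: guard the 1/x singularity with abs(x) > eps, not with x /= 0
x = a - b
if (abs(x) > eps) then
    print *, 1/x
else
    print *, "x is too close to zero, skipping 1/x"
end if
end program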

Examples

Error of a floating-point addition

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
        if (argc < 3) return 1;  /* expects two numbers as arguments */
        double a = atof(argv[1]);
        double b = atof(argv[2]);
        if (fabs(a) < fabs(b)) { double t = b; b = a; a = t; }
        double e = ((a + b) - a) - b;
        printf("addition error for `%g + %g`: %g\n", a, b, e);
        return 0;
}

IEEE school:

$ gcc -O3 fastmath.c -o ieeemath
$ ./ieeemath 1.12e17 12.34
addition error for `1.12e+17 + 12.34`: 3.66
$ ./ieeemath 150.123 12.34
addition error for `150.123 + 12.34`: 3.55271e-15

The expression ((a+b)-a)-b evaluates as written, and as such it computes the error that an addition makes. The error will not be small if the expression is badly conditioned, such as for the numbers in the first example, but it will be small for well conditioned numbers.

Fortran school:

$ gcc -O3 -ffast-math fastmath.c -o fastmath
$ ./fastmath 1.12e17 12.34
addition error for `1.12e+17 + 12.34`: 0
$ ./fastmath 150.123 12.34
addition error for `150.123 + 12.34`: 0

The expression ((a+b)-a)-b is assumed to be well conditioned, and mathematically it is equal to 0, so a good optimizing compiler will make this transformation. The first example then evaluates it with numbers that are not well conditioned (i.e., against rule 1), and as such we obtain a different answer than in the IEEE case. For the well conditioned numbers in the second example, both the IEEE and Fortran school versions give the same answer to machine precision.

How does one find the error of a single floating-point addition in the Fortran school, then? In the Fortran school such a subroutine must be an intrinsic, implemented by the compiler (or a library), and code written in the Fortran school then just uses such intrinsics. Fortran already provides many such intrinsics, such as huge, tiny and epsilon. This would be another one, called for example add_error. One can implement it in any language and compile it with an IEEE compiler that preserves the order of operations and guarantees the necessary platform dependent floating point implementation details. The rest of the code then just uses add_error.
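
A minimal sketch of how such an add_error helper could look (the module and file names are hypothetical; add_error is not an existing intrinsic). This file is the one place compiled in the IEEE school, i.e., without -ffast-math:

! add_error.f90 -- compile without -ffast-math (IEEE school)
module ieee_helpers
implicit none
integer, parameter :: dp = kind(0.d0)
contains

    ! error committed by the floating point addition a + b,
    ! assuming abs(a) >= abs(b); relies on IEEE semantics
    real(dp) function add_error(a, b) result(e)
    real(dp), intent(in) :: a, b
    e = ((a + b) - a) - b
    end function

end module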

Division by zero

if (a < b) then
    x = 1/(b-a)
else
    x = -1
end if

IEEE school: this expression is well defined at all times. For small (b-a) one can obtain an infinity.

Fortran school: per rule 4, one must use the abs(b-a) > eps idiom to make sure that the term (b-a) is not close enough to zero to produce a division by zero or an infinity in x. The code written in the Fortran school would look like this:

if (abs(b-a) > eps) then
    x = 1/(b-a)
else
    x = -1
end if

And eps is chosen such that the value of x will be what is required by the application domain.

Rounding

redux = float.fromhex("0x1.8p52") / 256
x = -0.7
((x+redux)-redux), round(x*256)/256

IEEE school: the left expression is well defined and is essentially equivalent to round(x*256)/256, so the left and right expressions should be equal.

Fortran school: the left expression is not well conditioned (x=-0.7 and redux ~ 2**52). The result is then not well defined per rule 1; a compiler can reduce the expression to just x. The right expression is well defined, and does the intended operation that the IEEE school version does. That is how one would write such an operation in the Fortran school. If the round(x*256)/256 operation were not fast enough, one could provide an optimized implementation as an intrinsic round256(x), implement it using an IEEE compiler and the (x+redux)-redux idiom, and then just call it from the rest of the Fortran school code.
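
A sketch of how the hypothetical round256 intrinsic could be implemented (the names are illustrative); like add_error above, it relies on IEEE semantics, so this file must be compiled without -ffast-math:

! round256.f90 -- compile without -ffast-math (IEEE school)
module round_mod
implicit none
integer, parameter :: dp = kind(0.d0)
contains

    ! round(x*256)/256 via the (x+redux)-redux idiom
    real(dp) function round256(x) result(r)
    real(dp), intent(in) :: x
    real(dp), parameter :: redux = 1.5_dp * 2.0_dp**52 / 256   ! 0x1.8p52 / 256
    r = (x + redux) - redux
    end function

end module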

Bisection stopping condition

Fortran school:

When writing a stopping condition for bisection, simply use x2 - x1 > tol, assuming x2 > x1. Do not use x1 < (x1 + x2)/2 < x2, as the compiler can optimize it out, per rules 1, 3 and 8.

IEEE school:

In the IEEE school, one can use x1 < (x1 + x2)/2 < x2. The way it works is that eventually (x1 + x2)/2 will be exactly equal to either x1 or x2, when there are no more floating point numbers to bisect to.
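
For concreteness, here is a minimal Fortran school bisection sketch (the function f and the tolerance are illustrative) using the x2 - x1 > tol stopping condition:

program bisect
implicit none
integer, parameter :: dp = kind(0.d0)
real(dp), parameter :: tol = 1e-12_dp
real(dp) :: x1, x2, xm
x1 = 0; x2 = 2
do while (x2 - x1 > tol)
    xm = (x1 + x2)/2
    if (f(x1)*f(xm) <= 0) then
        x2 = xm    ! root is in [x1, xm]
    else
        x1 = xm    ! root is in [xm, x2]
    end if
end do
print *, "root ~", (x1 + x2)/2

contains

    real(dp) function f(x) result(r)
    real(dp), intent(in) :: x
    r = x**2 - 2   ! root at sqrt(2)
    end function

end program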

Kahan summation

The [Kahan Summation](https://en.wikipedia.org/wiki/Kahan_summation_algorithm) looks like this:

function KahanSum(input)
    var sum = 0.0
    var c = 0.0
    for i = 1 to input.length do
        var y = input[i] - c
        var t = sum + y
        c = (t - sum) - y
        sum = t
    next i
    return sum

IEEE school: The error term c is well defined and will reduce the numerical error in summation for badly conditioned input. The compiler will strictly follow the order of operation.

Fortran school: the input is assumed to be well conditioned, and the error term is then mathematically (and numerically) zero, per rule 1. An optimizing compiler will automatically rewrite c = (t - sum) - y -> ((sum+y)-sum)-y -> 0. If one wanted to implement Kahan summation in the Fortran school, one must first write an intrinsic add_error and then use it to compute the error term, as that operation depends on the particular implementation of floating point numbers.
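
A sketch of what that could look like, reusing the hypothetical add_error helper from the earlier example. Assuming add_error is compiled separately (without -ffast-math) and not inlined, this file itself can safely be compiled with -ffast-math, because the compensation now hides behind a function call:

function kahan_sum(v) result(s)
use ieee_helpers, only: dp, add_error
implicit none
real(dp), intent(in) :: v(:)
real(dp) :: s, c, y
integer :: i
s = 0; c = 0
do i = 1, size(v)
    y = v(i) - c          ! compensated term
    c = add_error(s, y)   ! rounding error of the addition s + y
    s = s + y
end do
end function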

Summation

Algorithms to sum an array of floating point numbers: sum([...]).

IEEE school: left-to-right, right-to-left, Kahan, pairwise and other summation algorithms will all produce different results for badly conditioned input (some algorithms will be more accurate than others).

Fortran school: The input is assumed to be well conditioned per rules 1 and 2, and then all the algorithms will produce the same answer to machine accuracy. This is consistent with the BLAS standard, which also does not guarantee any particular order of operations. For badly conditioned input, the Fortran school requires one to rewrite the summation algorithm using smaller, order independent loops based on the structure of the input data, as shown in the next two examples.

Pairwise summation

Code for pairwise summation (https://en.wikipedia.org/wiki/Pairwise_summation) is:

s = pairwise(x[1…n])
    if n ≤ N
        s = x[1]
        for i = 2 to n
            s = s + x[i]
    else
        m = floor(n / 2)
        s = pairwise(x[1…m]) + pairwise(x[m+1…n])
    endif

IEEE school: this algorithm is well defined.

Fortran school: this algorithm is well defined, because it uses naive summation for n ≤ N, so the loop tells the compiler that the order is independent, and the compiler cannot transform the pairwise(x[1…m]) + pairwise(x[m+1…n]) expression into pairwise(x[1…n]) per rule 2 (unless n ~ 10).
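
A minimal runnable Fortran translation of the pseudocode above (the module name and the base-case size nbase = 8 are illustrative choices):

module pairwise_mod
implicit none
integer, parameter :: dp = kind(0.d0)
contains

    recursive function pairwise(x) result(s)
    real(dp), intent(in) :: x(:)
    real(dp) :: s
    integer, parameter :: nbase = 8   ! illustrative base-case size
    integer :: i, m
    if (size(x) <= nbase) then
        s = 0                         ! naive summation for the base case
        do i = 1, size(x)
            s = s + x(i)
        end do
    else
        m = size(x)/2                 ! rule 2: the halves cannot be merged back
        s = pairwise(x(1:m)) + pairwise(x(m+1:))
    end if
    end function

end module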

Terms summation

Page 7 in https://people.eecs.berkeley.edu/~wkahan/Mindless.pdf shows a problem that corresponds to the following code:

program kahan
implicit none
integer, parameter :: dp = kind(0.d0)
real(dp) :: s, os, comp
integer :: k
s = 0; os = -1; comp = 0; k = 0;
do while (s > os)
        k = k + 1; os = s; comp = comp + term(k)
        s = comp + os
        comp = (os-s)+comp
end do
print *, k, s, term(k)
s = s + (tail(k) + comp)
print *, s

contains

        real(dp) function term(k) result(r)
        integer, intent(in) :: k
        r = 3465/(real(k,dp)**2-1/16._dp) + 3465/((k+0.5_dp)**2-1/16._dp)
        end function

        real(dp) function tail(k) result(r)
        integer, intent(in) :: k
        r = 3465/(k+0.5_dp) + 3465/(k+1._dp)
        end function

end

When compiled and run without and with -ffast-math, it produces:

$ gfortran -O3 -march=native -funroll-loops kahan.f90 && time ./a.out
        61728404   9239.9998877340167        1.8187086585703921E-012
   9240.0000000000000

real    0m0.263s
user    0m0.260s
sys     0m0.000s
$ gfortran -O3 -march=native -ffast-math -funroll-loops kahan.f90 && time ./a.out
        87290410   9239.9999320850657        9.0949468492785197E-013
   9240.0000114752293

real    0m0.204s
user    0m0.200s
sys     0m0.000s

IEEE School: the (implicit) array of terms is badly conditioned: the first term is ~5000, the last one is ~1e-12. We must use the Kahan algorithm to get good accuracy. We can't use -ffast-math, because the Kahan algorithm depends on IEEE semantics, and using -ffast-math in the IEEE school is forbidden.

Fortran school: the input array is badly conditioned, so this code does not conform to the Fortran school per rules 1 and 2. We must improve the conditioning of the implicit input array by summing the first 600 terms and the rest separately. We obtain two separate well conditioned arrays that we sum separately. The compiler can sum each loop in any order per rule 2, but it cannot merge the two loops into one, also per rule 2, and so the algorithm will always work. The code becomes:

program kahan
implicit none
integer, parameter :: dp = kind(0.d0)
real(dp) :: sum_, sum1, sum2, t
integer :: k, k_split
sum1 = 0
k_split = 600
do k = 1, k_split
    sum1 = sum1 + term(k)
end do
sum2 = 0; k = k_split + 1; t = term(k)
do while (t > 1.8187e-12_dp)
    sum2 = sum2 + t
    k = k + 1; t = term(k)
end do
print *, k, sum1+sum2, t
sum_ = sum1 + sum2 + tail(k)
print *, sum_

contains

    real(dp) function term(k) result(r)
    integer, intent(in) :: k
    r = 3465/(real(k,dp)**2-1/16._dp) + 3465/((k+0.5_dp)**2-1/16._dp)
    end function

    real(dp) function tail(k) result(r)
    integer, intent(in) :: k
    r = 3465/(k+0.5_dp) + 3465/(k+1._dp)
    end function

end

When we run it, we obtain high accuracy results with and without -ffast-math:

$ gfortran -O3 -march=native -funroll-loops kahan.f90 && time ./a.out
    61728551   9239.9998877342841        1.8186999964570246E-012
   9240.0000000000000

real    0m0.126s
user    0m0.124s
sys     0m0.000s
$ gfortran -O3 -march=native -ffast-math -funroll-loops kahan.f90 && time ./a.out
    61728551   9239.9998877342823        1.8186999964570246E-012
   9239.9999999999982

real    0m0.126s
user    0m0.124s
sys     0m0.000s

Concern: what if the compiler with -ffast-math decides to join the two loops into one (then the code would produce an inaccurate answer again)? It can't merge the loops, per rule 2: the length of (especially) the second loop is far more than 10 iterations (it is over 60,000,000 iterations), so merging would be a global reordering of the order of operations, which is forbidden.

Integrating sin on [0, pi]

This code:

program integral
implicit none
! single precision:
!integer, parameter :: dp = kind(0.0), n_lim = 5000
! double precision:
integer, parameter :: dp = kind(0.d0), n_lim = 10000000
real(dp), parameter :: pi = 3.1415926535897932384626433832795_dp
real(dp), parameter :: a = pi, dx = (a-0) / n_lim
real(dp) :: sinx, r, cr, yr, tr, cs, ys, ts
integer :: n

sinx = 0.5_dp * ( sin(0._dp) + sin(a) )
r = dx; cs = 0; cr = 0
do n = 2, n_lim
    ys = sin(r) - cs
    ts = sinx + ys
    cs = (ts - sinx) - ys
    sinx = ts

    yr = dx - cr
    tr = r + yr
    cr = (tr - r) - yr
    r = tr
end do
sinx = sinx * dx

print *, sinx, abs(sinx-2)
end program

produces in double precision:

$ gfortran -O3 -march=native -funroll-loops int0.f90 && ./a.out
   1.9999999999999836        1.6431300764452317E-014
$ gfortran -O3 -march=native -funroll-loops -ffast-math int0.f90 && ./a.out
   2.0000000004135496        4.1354963897788366E-010

and in single precision (by uncommenting the 'single precision' line in the source code and commenting out the 'double precision' one):

$ gfortran -O3 -march=native -funroll-loops int0.f90 && ./a.out
   2.00000000       0.00000000
$ gfortran -O3 -march=native -funroll-loops -ffast-math int0.f90 && ./a.out
   2.00004196       4.19616699E-05

IEEE school: the code is using Kahan summation to accurately sum all the terms and one gets high accuracy in both single and double precision. -ffast-math cannot be used (it breaks the Kahan summation code and as a result one loses 4 significant digits in double precision).

Fortran school: the code calculates the r coordinate using dx+dx+dx+... "n" times, which loses accuracy due to round-off errors. One must use n*dx instead, which is always accurate. The compiler cannot rewrite n*dx into dx+dx+dx+..., due to rule 2, as it can only rearrange instructions on the order of 10 or so, not on the order of 10000000. One can apply the following patch:

--- a/src/int0.f90
+++ b/src/int0.f90
@@ -17,10 +17,7 @@ do n = 2, n_lim
     cs = (ts - sinx) - ys
     sinx = ts

-    yr = dx - cr
-    tr = r + yr
-    cr = (tr - r) - yr
-    r = tr
+    r = n*dx
 end do
 sinx = sinx * dx

and rerun:

$ gfortran -O3 -march=native -funroll-loops int0.f90 && ./a.out
   1.9999999999999836        1.6431300764452317E-014
$ gfortran -O3 -march=native -funroll-loops -ffast-math int0.f90 && ./a.out
   1.9999999999998086        1.9140244944537699E-013

The accuracy of the IEEE school (i.e., without -ffast-math) didn't change, and the code now follows the Fortran school rules, so it delivers high accuracy, as expected. The Fortran school code can now be written in a much simpler, equivalent way:

program integral
implicit none
! single precision:
!integer, parameter :: dp = kind(0.0), n_lim = 3000
! double precision:
integer, parameter :: dp = kind(0.d0), n_lim = 10000000

real(dp), parameter :: pi = 3.1415926535897932384626433832795_dp
real(dp), parameter :: a = pi, dx = a / n_lim
real(dp) :: sinx
integer :: n
sinx = (sin(0._dp) + sin(a))/2
do n = 1, n_lim-1
    sinx = sinx + sin(n*dx)
end do
sinx = sinx * dx
print *, sinx, abs(sinx-2)
end program

the results did not change:

$ gfortran -O3 -march=native -funroll-loops -ffast-math int3.f90 && ./a.out
       1.9999999999998086        1.9140244944537699E-013

For single precision we get:

$ gfortran -O3 -march=native -funroll-loops -ffast-math int3.f90 && ./a.out
   1.99999976       2.38418579E-07

The single precision result is already as accurate as it can get, since single precision can represent ~7 significant digits and we got accuracy of about 2e-7.

The Fortran school code still loses about 1 significant digit in double precision compared to the IEEE code. This is because we are summing 10,000,000 numbers, some of which are around 0 (for x=0 and x=pi) and some around 1 (for x=pi/2).

One must decide for the given application whether the accuracy is enough. If it is not, then one should use a method like Gaussian quadrature, which converges quickly and thus is faster and more accurate, or one can use a more sophisticated summation method, like pairwise summation. As an example, for this particular case we can simply sum the numbers around x=0 and x=pi separately, and then we recover the IEEE accuracy:

program integral
implicit none
! single precision:
!integer, parameter :: dp = kind(0.0), n_lim = 3000
! double precision:
integer, parameter :: dp = kind(0.d0), n_lim = 10000000

real(dp), parameter :: pi = 3.1415926535897932384626433832795_dp
real(dp), parameter ::  a = pi, dx = a / n_lim
real(dp) :: sinx, sinx1, sinx2, sinx3
integer :: n
integer :: n_min
n_min = int(0.4_dp / dx)
sinx1 = 0
do n = 1, n_min
    sinx1 = sinx1 + sin(n*dx)
end do
sinx2 = 0
do n = n_min+1, n_lim-n_min
    sinx2 = sinx2 + sin(n*dx)
end do
sinx3 = 0
do n = n_lim-n_min+1, n_lim-1
    sinx3 = sinx3 + sin(n*dx)
end do
sinx = (sin(0._dp) + sin(a))/2 + sinx1 + sinx2 + sinx3
sinx = sinx * dx
print *, sinx, abs(sinx-2)
end program

And when we run it:

$ gfortran -O3 -march=native -funroll-loops -ffast-math int2.f90 && ./a.out
   1.9999999999999942        5.7731597280508140E-015

The compiler cannot fuse the loops due to rule 2, so this will always work.

But this is obviously not portable to other functions (for that one should use a Gaussian quadrature or a pairwise summation); it is just to show where the one significant digit was lost in double precision and how to recover it if that is required by the application.
