
Two schools

There are two schools of thought on how to write numerical code; in this document we will call them the IEEE school and the Fortran school.

At one end of the spectrum: the compiler makes no changes to floating point operations, and the developer must express exact intent. This is the IEEE school.

At the other end of the spectrum: the developer specifies the mathematics, and the compiler is free to do any rearrangement, no matter how big. This is SymPy. It only works with infinite precision, not finite precision.

In the middle is the Fortran school: the compiler is free to do small rearrangements, as well as arbitrary rearrangements in loops / arrays for performance. The developer must express intent by structuring the algorithm into loops and array operations. Individual expressions and statements can be rearranged locally (to an extent on the order of 10 operations or so), but not globally (changing the order of operations ~1000 instructions apart). Such a rearrangement does not affect accuracy significantly, and it can improve performance.

IEEE School

The IEEE school relies on the semantics of the IEEE Standard 754 for Floating-Point Arithmetic, assumes they always hold, and forbids the compiler from doing any kind of optimization that would change the order of floating point operations. It requires that the compiler evaluate the expression exactly as written.

A given expression in the code, say (a+redux)-redux, must be evaluated exactly as written. For small redux, the answer will be close to a. For redux=0x1.8p52 / 256, the answer will be round(256*a)/256, but the compiler is not allowed to do any transformations like that, because in corner cases the transformed expressions might not be equal, or there might be some loss or gain of accuracy. The compiler is only allowed to do transformations that do not change the answer bit-wise.
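
For concreteness, here is a minimal Fortran sketch of the idiom (the program and variable names are illustrative, not from the original text); it must be compiled without -ffast-math, so that the IEEE order of operations is preserved:

program redux_demo
implicit none
integer, parameter :: dp = kind(0.d0)
real(dp) :: a, redux
a = -0.7_dp
redux = 1.5_dp * 2.0_dp**52 / 256   ! the constant 0x1.8p52 / 256
print *, (a + redux) - redux        ! rounds a to the nearest multiple of 1/256
print *, nint(a*256)/256._dp        ! reference answer: round(256*a)/256
end program

Both lines should print the same value, -0.69921875; with -ffast-math the compiler may instead simplify the first line to just a and print -0.7.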

It is always exactly defined what floating point number a given expression will produce. But it is not always clear what exact mathematics the expression solves.

Code written in the IEEE school of thought can't use any optimizations, such as -ffast-math, that rewrite expressions, change the order of evaluation, or break IEEE compliance in some other way.

One must study the details of how floating point numbers work according to the IEEE 754 standard. A good starting point is the essay "What Every Computer Scientist Should Know About Floating-Point Arithmetic" by D. Goldberg.

Fortran School

In the Fortran school, the expression is written as the math formula that it solves, and the compiler is allowed to do any kind of transformation that is mathematically equivalent (assuming infinite precision). The final code that will be used to evaluate the expression is not precisely specified (and as such will numerically differ depending on the platform, compiler and compiler options). Instead, it is assumed that the expression as written in the code is well conditioned, and the compiler is expected to do only transformations that improve performance while not significantly worsening the conditioning.

As an example, (a+redux)-redux is mathematically equivalent to a, and the compiler can numerically evaluate either (a+redux)-redux or a. This transformation is valid, because the compiler assumes that the expression (a+redux)-redux is well conditioned, and so it cannot happen that, say, a=1 and redux=2**52.

The compiler can't replace a with (a+redux)-redux, because the compiler can only assume that the expression a is well-conditioned, but the developer has not expressed anything about the redux variable (which can indeed be equal to redux=2**52), and as such, the final mathematically equivalent expression (a+redux)-redux might not be well conditioned.

The compiler can replace

y = x + b
use(y)
<do other stuff>
use(x, y, b)

with

y = x + b
use(y)
<do other stuff>
use(y - b, y, b)

i.e., doing a substitution x -> x+b-b = y-b, because the developer wrote y=x+b, and so the compiler knows that x+b is well conditioned, and as such can then replace x with x+b-b. If the developer didn't write x+b, then the compiler can't replace x with x+b-b.

If an expression is well-conditioned, here are a few examples of the most common allowed transformations that preserve conditioning:

x + 0.0 -> x
x - x -> 0.0
x / C -> x * (1.0/C)
(a+b)+c -> a+(b+c)
(a*b)*c -> a*(b*c)
a*(b+c) -> a*b + a*c
a*b + a*c -> a*(b+c)

For example, (a+redux)-redux -> a + (redux-redux) -> a + 0.0 -> a.

One transformation that is forbidden is to introduce other variables, unless the compiler knows from previous expressions that the new variables will not worsen the conditioning.

Here is the list of the most common transformations that gcc does: https://gcc.gnu.org/wiki/FloatingPointMath in the "Transformations" section, and all of them preserve conditioning.

It is always exactly defined what exact mathematics the expression solves. But it is not always clear what exact floating point number a given expression will evaluate to, as that depends on how the compiler rewrites the expression.

Code written in the Fortran school can be compiled with any compiler, no matter how aggressively it optimizes the code. In particular, it is encouraged to use -ffast-math to obtain good performance.

One does not need to study the details of how floating point numbers work. But one must follow a certain set of rules when writing numerical code. One such set of rules is given in the next section.

Rules for writing correct code in the Fortran school

Here is a simple set of consistent rules that, when followed, will guarantee that the code is valid in the Fortran school of thought, and consequently that the code will always work no matter how aggressively the compiler optimizes it. These rules are not unique; one can design other sets of rules, or relax them in various ways. The advantage of this set is that the rules are general, simple and easy to follow.

  1. Conditioning (order of operations): Ensure all expressions are well conditioned, i.e., the expression will evaluate with acceptable (for the application domain) accuracy compared to the exact (infinite precision) answer. For example, in a+b-b, both a and b must have similar magnitude, and not a=1 and b=2**52. In a-b, if both a and b have similar magnitude, there will be numerical cancellation, and one must ensure that the final accuracy is enough for the application domain. When summing an array sum([....]), ensure the sum is well conditioned, i.e., all array elements have similar magnitude. Assume the compiler will rewrite the expression (do not assume associativity or other properties). As long as the expression is well conditioned, the compiler will preserve the conditioning. The order of operations is not defined; since all expressions/sums/etc. are well conditioned per the previous points, the order of operations does not matter, and one always obtains the same answer to the accuracy of the original expression.

  2. Array/Loops: Expressions, statements and unrolled loops (either by hand or automatically by the compiler) are all assumed to be well conditioned, and the compiler can rearrange the order of operations arbitrarily, as long as the changes are local (on the order of 10 elements), as that does not change the accuracy significantly. It cannot rearrange instructions globally (more than ~1000 operations apart), as that could change the conditioning significantly.

    Array operations and loops do not have any specific order of operations on the array elements. However, array operations like sum(v1)+sum(v2) cannot be merged into sum([v1, v2]) if either of the arrays is bigger than ~10 elements (and the same applies to loops, which also cannot be merged). That follows from the previous paragraph, as loops can be unrolled by the compiler. However, loops and array operations have the extra property that if the developer has written them in the code, then the developer is telling the compiler that the array or loop operation is well conditioned no matter how large the array or loop is, and as such the compiler can rearrange the iterations arbitrarily. E.g., sum(v1) can internally be implemented in any order. This is consistent with the BLAS standard.

  3. Comparison: Compare floating point numbers as abs(a-b) < eps, not as a == b.

  4. Singularities: When a function has a singularity, such as when dividing 1/x, check that x is not zero by comparing it (using the previous point): abs(x) > eps, and choose eps such that 1/x does not evaluate to infinity (which happens if x is a denormal number). One can use neither a test like x == 0 nor a test like x < 0 to guard 1/x: in both cases use if (abs(x) > eps) then 1/x instead, and choose a large enough eps. The same rule applies to log(x) and other functions with singularities. (Rules 3 and 4 are illustrated in the sketch after this list.)

  5. Accuracy: Assume all functions and special functions like sin, cos, bessel will return an answer with accuracy comparable to machine precision, e.g., ~1e-14 for double precision and ~1e-7 for single precision. The compiler is free to choose faster and a tiny bit less accurate implementations of these when optimizations are turned on. Write the code in such a way that when a sin implementation is changed or the accuracy changes from 1e-14 to 5e-14, the algorithm will still work.

  6. NaN/Inf: Make sure never to divide by 0, or to obtain NaN or Inf with any floating point operation. That way one does not have to special-case all functions to handle those. We simply assume that NaN and Inf do not exist, and the code should abort if they are created.

  7. Debug/Release modes: Have Debug and Release modes. In Debug mode, enable exceptions for NaN and Inf, and initialize all floating point variables to NaN; e.g., in gfortran one can use -ffpe-trap=invalid,zero,overflow -finit-real=snan -finit-integer=-99999999. Integers are initialized to some large negative value, to increase the chances that the code will crash if an integer is used without being initialized explicitly in the code. The Release mode can enable -ffast-math and all other optimizations. Then ensure that the code never crashes in Debug mode for any user input. One can then use the Release mode for production runs.

  8. Properties of floating point numbers: When writing numerical algorithms, assume that the floating point representation allows a certain number of significant digits, and assume each operation has a certain accuracy. For builtin operations (such as arithmetic, special functions, etc.) one can assume close to machine accuracy; for user defined functions one must consult the documentation for the accuracy. Write your code in such a way that it does not depend on any particular implementation of the functions you call. Use Fortran primitives such as tiny, epsilon, huge, selected_real_kind, precision, nearest, spacing, ..., to query the floating point representation. Do not depend on any other platform dependent property. Do not assume IEEE compliance.
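
To illustrate rules 3 and 4, here is a minimal sketch (the value of eps and the names are illustrative; a real eps must be chosen for the application domain):

program rules_demo
implicit none
integer, parameter :: dp = kind(0.d0)
real(dp), parameter :: eps = 1e-12_dp
real(dp) :: a, b, x
a = 0.1_dp + 0.2_dp
b = 0.3_dp
! rule 3: compare with a tolerance, never with ==
if (abs(a - b) < eps) print *, "a and b are equal to within eps"
! rule 4: guard the 1/x singularity with abs(x) > eps, not with x /= 0
x = a - b
if (abs(x) > eps) then
    print *, 1/x
else
    print *, "x is too close to zero, skipping 1/x"
end if
end program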

Examples

Error of a floating-point addition

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
        if (argc < 3) return 1;  /* expects two numbers as arguments */
        double a = atof(argv[1]);
        double b = atof(argv[2]);
        if (fabs(a) < fabs(b)) { double t = b; b = a; a = t; }
        double e = ((a + b) - a) - b;
        printf("addition error for `%g + %g`: %g\n", a, b, e);
        return 0;
}

IEEE school:

$ gcc -O3 fastmath.c -o ieeemath
$ ./ieeemath 1.12e17 12.34
addition error for `1.12e+17 + 12.34`: 3.66
$ ./ieeemath 150.123 12.34
addition error for `150.123 + 12.34`: 3.55271e-15

The expression ((a+b)-a)-b evaluates as written, and as such it computes the error that an addition makes. The error will not be small if the expression is badly conditioned, such as for the numbers in the first example, but it will be small for well conditioned numbers.

Fortran school:

$ gcc -O3 -ffast-math fastmath.c -o fastmath
$ ./fastmath 1.12e17 12.34
addition error for `1.12e+17 + 12.34`: 0
$ ./fastmath 150.123 12.34
addition error for `150.123 + 12.34`: 0

The expression ((a+b)-a)-b is assumed to be well conditioned, and mathematically it is equal to 0, so a good optimizing compiler will make this transformation. The first example then evaluates it with numbers that are not well conditioned (i.e., against rule 1), and as such we obtain a different answer than in the IEEE case. For the well conditioned numbers in the second example, both the IEEE and Fortran school versions give the same answer to machine precision.

How does one find the error of a single floating-point addition in the Fortran school, then? In the Fortran school such a subroutine must be an intrinsic, implemented by the compiler (or a library), and code written in the Fortran school then just uses such intrinsics. Fortran already provides many such intrinsics, such as huge, tiny and epsilon. This would be another one, called for example add_error. One can implement it in any language and compile it with an IEEE compiler that preserves the order of operations and guarantees the necessary platform dependent floating point implementation details. The rest of the code then just uses add_error.
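
A minimal sketch of how such an add_error helper could look (the module and file names are hypothetical; add_error is not an existing intrinsic). This file is the one place compiled in the IEEE school, i.e., without -ffast-math:

! add_error.f90 -- compile without -ffast-math (IEEE school)
module ieee_helpers
implicit none
integer, parameter :: dp = kind(0.d0)
contains

    ! error committed by the floating point addition a + b,
    ! assuming abs(a) >= abs(b); relies on IEEE semantics
    real(dp) function add_error(a, b) result(e)
    real(dp), intent(in) :: a, b
    e = ((a + b) - a) - b
    end function

end module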

Division by zero

if (a < b) then
    x = 1/(b-a)
else
    x = -1
end if

IEEE school: this expression is well defined at all times. For small (b-a) one can obtain an infinity.

Fortran school: per rule 4, one must use the abs(b-a) > eps idiom to make sure that the term (b-a) is not close enough to zero to produce a division by zero or an infinity in x. The code written in the Fortran school would look like this:

if (abs(b-a) > eps) then
    x = 1/(b-a)
else
    x = -1
end if

And eps is chosen such that the value of x will be what is required by the application domain.

Rounding

redux = float.fromhex("0x1.8p52") / 256
x = -0.7
((x+redux)-redux), round(x*256)/256

IEEE school: the left expression is well defined and is essentially equivalent to round(x*256)/256, so the left and right expressions should be equal.

Fortran school: the left expression is not well conditioned (x=-0.7 and redux ~ 2**52). The result is then not well defined per rule 1; a compiler can reduce the expression to just x. The right expression is well defined, and does the intended operation that the IEEE school version does. That is how one would write such an operation in the Fortran school. If the round(x*256)/256 operation were not fast enough, one could provide an optimized implementation as an intrinsic round256(x), implement it using an IEEE compiler and the (x+redux)-redux idiom, and then just call it from the rest of the Fortran school code.
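
A sketch of how the hypothetical round256 intrinsic could be implemented (the names are illustrative); like add_error above, it relies on IEEE semantics, so this file must be compiled without -ffast-math:

! round256.f90 -- compile without -ffast-math (IEEE school)
module round_mod
implicit none
integer, parameter :: dp = kind(0.d0)
contains

    ! round(x*256)/256 via the (x+redux)-redux idiom
    real(dp) function round256(x) result(r)
    real(dp), intent(in) :: x
    real(dp), parameter :: redux = 1.5_dp * 2.0_dp**52 / 256   ! 0x1.8p52 / 256
    r = (x + redux) - redux
    end function

end module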

Bisection stopping condition

Fortran school:

When writing a stopping condition for bisection, simply use x2 - x1 > tol, assuming x2 > x1. Do not use x1 < (x1 + x2)/2 < x2, as the compiler can optimize it out, per rules 1, 3 and 8.

IEEE school:

In the IEEE school, one can use x1 < (x1 + x2)/2 < x2. The way it works is that eventually (x1 + x2)/2 will be exactly equal to either x1 or x2, when there are no more floating point numbers to bisect to.
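
For concreteness, here is a minimal Fortran school bisection sketch (the function f and the tolerance are illustrative) using the x2 - x1 > tol stopping condition:

program bisect
implicit none
integer, parameter :: dp = kind(0.d0)
real(dp), parameter :: tol = 1e-12_dp
real(dp) :: x1, x2, xm
x1 = 0; x2 = 2
do while (x2 - x1 > tol)
    xm = (x1 + x2)/2
    if (f(x1)*f(xm) <= 0) then
        x2 = xm    ! root is in [x1, xm]
    else
        x1 = xm    ! root is in [xm, x2]
    end if
end do
print *, "root ~", (x1 + x2)/2

contains

    real(dp) function f(x) result(r)
    real(dp), intent(in) :: x
    r = x**2 - 2   ! root at sqrt(2)
    end function

end program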

Kahan summation

The [Kahan Summation](https://en.wikipedia.org/wiki/Kahan_summation_algorithm) looks like this:

function KahanSum(input)
    var sum = 0.0
    var c = 0.0
    for i = 1 to input.length do
        var y = input[i] - c
        var t = sum + y
        c = (t - sum) - y
        sum = t
    next i
    return sum

IEEE school: The error term c is well defined and will reduce the numerical error in summation for badly conditioned input. The compiler will strictly follow the order of operation.

Fortran school: the input is assumed to be well conditioned, and the error term is then mathematically (and numerically) zero, per rule 1. An optimizing compiler will automatically rewrite c = (t - sum) - y -> ((sum+y)-sum)-y -> 0. If one wanted to implement Kahan summation in the Fortran school, one must first write an intrinsic add_error and then use it to compute the error term, as that operation depends on the particular implementation of floating point numbers.
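
A sketch of what that could look like, reusing the hypothetical add_error helper from the earlier example. Assuming add_error is compiled separately (without -ffast-math) and not inlined, this file itself can safely be compiled with -ffast-math, because the compensation now hides behind a function call:

function kahan_sum(v) result(s)
use ieee_helpers, only: dp, add_error
implicit none
real(dp), intent(in) :: v(:)
real(dp) :: s, c, y
integer :: i
s = 0; c = 0
do i = 1, size(v)
    y = v(i) - c          ! compensated term
    c = add_error(s, y)   ! rounding error of the addition s + y
    s = s + y
end do
end function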

Summation

Algorithms to sum an array of floating point numbers: sum([...]).

IEEE school: left-to-right, right-to-left, Kahan, pairwise and other summation algorithms will all produce different results for badly conditioned input (some algorithms will be more accurate than others).

Fortran school: The input is assumed to be well conditioned per rules 1 and 2, and then all the algorithms will produce the same answer to machine accuracy. This is consistent with the BLAS standard, which also does not guarantee any particular order of operations. For badly conditioned input, the Fortran school requires one to rewrite the summation algorithm using smaller, order independent loops based on the structure of the input data, as shown in the next two examples.

Pairwise summation

Code for pairwise summation (https://en.wikipedia.org/wiki/Pairwise_summation) is:

s = pairwise(x[1…n])
    if n ≤ N
        s = x[1]
        for i = 2 to n
            s = s + x[i]
    else
        m = floor(n / 2)
        s = pairwise(x[1…m]) + pairwise(x[m+1…n])
    endif

IEEE school: this algorithm is well defined.

Fortran school: this algorithm is well defined, because it uses naive summation for n ≤ N, so the loop tells the compiler that the order is independent, and the compiler cannot transform the pairwise(x[1…m]) + pairwise(x[m+1…n]) expression into pairwise(x[1…n]) per rule 2 (unless n ~ 10).
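
A minimal runnable Fortran translation of the pseudocode above (the module name and the base-case size nbase = 8 are illustrative choices):

module pairwise_mod
implicit none
integer, parameter :: dp = kind(0.d0)
contains

    recursive function pairwise(x) result(s)
    real(dp), intent(in) :: x(:)
    real(dp) :: s
    integer, parameter :: nbase = 8   ! illustrative base-case size
    integer :: i, m
    if (size(x) <= nbase) then
        s = 0                         ! naive summation for the base case
        do i = 1, size(x)
            s = s + x(i)
        end do
    else
        m = size(x)/2                 ! rule 2: the halves cannot be merged back
        s = pairwise(x(1:m)) + pairwise(x(m+1:))
    end if
    end function

end module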

Terms summation

Page 7 in https://people.eecs.berkeley.edu/~wkahan/Mindless.pdf shows a problem that corresponds to the following code:

program kahan
implicit none
integer, parameter :: dp = kind(0.d0)
real(dp) :: s, os, comp
integer :: k
s = 0; os = -1; comp = 0; k = 0;
do while (s > os)
        k = k + 1; os = s; comp = comp + term(k)
        s = comp + os
        comp = (os-s)+comp
end do
print *, k, s, term(k)
s = s + (tail(k) + comp)
print *, s

contains

        real(dp) function term(k) result(r)
        integer, intent(in) :: k
        r = 3465/(real(k,dp)**2-1/16._dp) + 3465/((k+0.5_dp)**2-1/16._dp)
        end function

        real(dp) function tail(k) result(r)
        integer, intent(in) :: k
        r = 3465/(k+0.5_dp) + 3465/(k+1._dp)
        end function

end

When compiled and run without and with -ffast-math, it produces:

$ gfortran -O3 -march=native -funroll-loops kahan.f90 && time ./a.out
        61728404   9239.9998877340167        1.8187086585703921E-012
   9240.0000000000000

real    0m0.263s
user    0m0.260s
sys     0m0.000s
$ gfortran -O3 -march=native -ffast-math -funroll-loops kahan.f90 && time ./a.out
        87290410   9239.9999320850657        9.0949468492785197E-013
   9240.0000114752293

real    0m0.204s
user    0m0.200s
sys     0m0.000s

IEEE School: the (implicit) array of terms is badly conditioned: the first term is ~5000, the last one is ~1e-12. We must use the Kahan algorithm to get good accuracy. We can't use -ffast-math, because the Kahan algorithm depends on IEEE semantics, and using -ffast-math in the IEEE school is forbidden.

Fortran school: the input array is badly conditioned, so this code does not conform to the Fortran school per rules 1 and 2. We must improve the conditioning of the implicit input array by summing the first 600 terms and the rest separately. We obtain two separate well conditioned arrays that we sum separately. The compiler can sum each loop in any order per rule 2, but it cannot merge the two loops into one, also per rule 2, and so the algorithm will always work. The code becomes:

program kahan
implicit none
integer, parameter :: dp = kind(0.d0)
real(dp) :: sum_, sum1, sum2, t
integer :: k, k_split
sum1 = 0
k_split = 600
do k = 1, k_split
    sum1 = sum1 + term(k)
end do
sum2 = 0; k = k_split + 1; t = term(k)
do while (t > 1.8187e-12_dp)
    sum2 = sum2 + t
    k = k + 1; t = term(k)
end do
print *, k, sum1+sum2, t
sum_ = sum1 + sum2 + tail(k)
print *, sum_

contains

    real(dp) function term(k) result(r)
    integer, intent(in) :: k
    r = 3465/(real(k,dp)**2-1/16._dp) + 3465/((k+0.5_dp)**2-1/16._dp)
    end function

    real(dp) function tail(k) result(r)
    integer, intent(in) :: k
    r = 3465/(k+0.5_dp) + 3465/(k+1._dp)
    end function

end

When we run it, we obtain high accuracy results with and without -ffast-math:

$ gfortran -O3 -march=native -funroll-loops kahan.f90 && time ./a.out
    61728551   9239.9998877342841        1.8186999964570246E-012
   9240.0000000000000

real    0m0.126s
user    0m0.124s
sys     0m0.000s
$ gfortran -O3 -march=native -ffast-math -funroll-loops kahan.f90 && time ./a.out
    61728551   9239.9998877342823        1.8186999964570246E-012
   9239.9999999999982

real    0m0.126s
user    0m0.124s
sys     0m0.000s

Concern: what if the compiler with -ffast-math decides to join the two loops into one (then the code would produce an inaccurate answer again)? It can't merge the loops, per rule 2: the length of (especially) the second loop is far more than 10 iterations (it is over 60,000,000 iterations), so merging would be a global reordering of the order of operations, which is forbidden.

Integrating sin on [0, pi]

This code:

program integral
implicit none
! single precision:
!integer, parameter :: dp = kind(0.0), n_lim = 5000
! double precision:
integer, parameter :: dp = kind(0.d0), n_lim = 10000000
real(dp), parameter :: pi = 3.1415926535897932384626433832795_dp
real(dp), parameter :: a = pi, dx = (a-0) / n_lim
real(dp) :: sinx, r, cr, yr, tr, cs, ys, ts
integer :: n

sinx = 0.5_dp * ( sin(0._dp) + sin(a) )
r = dx; cs = 0; cr = 0
do n = 2, n_lim
    ys = sin(r) - cs
    ts = sinx + ys
    cs = (ts - sinx) - ys
    sinx = ts

    yr = dx - cr
    tr = r + yr
    cr = (tr - r) - yr
    r = tr
end do
sinx = sinx * dx

print *, sinx, abs(sinx-2)
end program

produces in double precision:

$ gfortran -O3 -march=native -funroll-loops int0.f90 && ./a.out
   1.9999999999999836        1.6431300764452317E-014
$ gfortran -O3 -march=native -funroll-loops -ffast-math int0.f90 && ./a.out
   2.0000000004135496        4.1354963897788366E-010

and in single precision (by uncommenting the 'single precision' line in the source code and commenting out the 'double precision' one):

$ gfortran -O3 -march=native -funroll-loops int0.f90 && ./a.out
   2.00000000       0.00000000
$ gfortran -O3 -march=native -funroll-loops -ffast-math int0.f90 && ./a.out
   2.00004196       4.19616699E-05

IEEE school: the code is using Kahan summation to accurately sum all the terms and one gets high accuracy in both single and double precision. -ffast-math cannot be used (it breaks the Kahan summation code and as a result one loses 4 significant digits in double precision).

Fortran school: the code calculates the r coordinate using dx+dx+dx+... "n" times, which loses accuracy due to round-off errors. One must use n*dx instead, which is always accurate. The compiler cannot rewrite n*dx into dx+dx+dx+..., due to rule 2, as it can only rearrange instructions on the order of 10 or so, not on the order of 10000000. One can apply the following patch:

--- a/src/int0.f90
+++ b/src/int0.f90
@@ -17,10 +17,7 @@ do n = 2, n_lim
     cs = (ts - sinx) - ys
     sinx = ts

-    yr = dx - cr
-    tr = r + yr
-    cr = (tr - r) - yr
-    r = tr
+    r = n*dx
 end do
 sinx = sinx * dx

and rerun:

$ gfortran -O3 -march=native -funroll-loops int0.f90 && ./a.out
   1.9999999999999836        1.6431300764452317E-014
$ gfortran -O3 -march=native -funroll-loops -ffast-math int0.f90 && ./a.out
   1.9999999999998086        1.9140244944537699E-013

The accuracy of the IEEE school (i.e., without -ffast-math) didn't change, and the code now follows the Fortran school rules, so it delivers high accuracy, as expected. The Fortran school code can now be written in a much simpler, equivalent way:

program integral
implicit none
! single precision:
!integer, parameter :: dp = kind(0.0), n_lim = 3000
! double precision:
integer, parameter :: dp = kind(0.d0), n_lim = 10000000

real(dp), parameter :: pi = 3.1415926535897932384626433832795_dp
real(dp), parameter :: a = pi, dx = a / n_lim
real(dp) :: sinx
integer :: n
sinx = (sin(0._dp) + sin(a))/2
do n = 1, n_lim-1
    sinx = sinx + sin(n*dx)
end do
sinx = sinx * dx
print *, sinx, abs(sinx-2)
end program

the results did not change:

$ gfortran -O3 -march=native -funroll-loops -ffast-math int3.f90 && ./a.out
       1.9999999999998086        1.9140244944537699E-013

For single precision we get:

$ gfortran -O3 -march=native -funroll-loops -ffast-math int3.f90 && ./a.out
   1.99999976       2.38418579E-07

The single precision result is already as accurate as it can get, since single precision can represent ~7 significant digits and we got accuracy of about 2e-7.

The Fortran school code still loses about 1 significant digit in double precision compared to the IEEE code. This is because we are summing 10,000,000 numbers, some of which are around 0 (for x=0 and x=pi) and some around 1 (for x=pi/2).

One must decide for the given application whether the accuracy is enough. If it is not, then one should use a method like Gaussian quadrature, which converges quickly and thus is faster and more accurate, or one can use a more sophisticated summation method, like pairwise summation. As an example, for this particular case we can simply sum the numbers around x=0 and x=pi separately, and then we recover the IEEE accuracy:

program integral
implicit none
! single precision:
!integer, parameter :: dp = kind(0.0), n_lim = 3000
! double precision:
integer, parameter :: dp = kind(0.d0), n_lim = 10000000

real(dp), parameter :: pi = 3.1415926535897932384626433832795_dp
real(dp), parameter ::  a = pi, dx = a / n_lim
real(dp) :: sinx, sinx1, sinx2, sinx3
integer :: n
integer :: n_min
n_min = int(0.4_dp / dx)
sinx1 = 0
do n = 1, n_min
    sinx1 = sinx1 + sin(n*dx)
end do
sinx2 = 0
do n = n_min+1, n_lim-n_min
    sinx2 = sinx2 + sin(n*dx)
end do
sinx3 = 0
do n = n_lim-n_min+1, n_lim-1
    sinx3 = sinx3 + sin(n*dx)
end do
sinx = (sin(0._dp) + sin(a))/2 + sinx1 + sinx2 + sinx3
sinx = sinx * dx
print *, sinx, abs(sinx-2)
end program

And when we run it:

$ gfortran -O3 -march=native -funroll-loops -ffast-math int2.f90 && ./a.out
   1.9999999999999942        5.7731597280508140E-015

The compiler cannot fuse the loops due to rule 2, so this will always work.

But this is obviously not portable to other functions (for that one should use a Gaussian quadrature or a pairwise summation); it is just to show where the one significant digit was lost in double precision and how to recover it if that is required by the application.
