An error-free transformation [EFT] is a transformation of a floating point realization of a mathematical operation that preserves the information inherent in the operation. Ordinary floating point addition of two Float64 values returns only that information about the sum which fits within a Float64 variable. In the example below, `flsum` is short for the ordinary Float64 sum, and `flsum_errorfree` is `flsum` together with the remaining value that usually is lost in the calculation.
flsum = 1.0 + 2.0^(-60)
flsum == 1.0
An error-free transformation of the floating point sum preserves the full value by returning two quantities.
flsum_errorfree = errorfree_sum(1.0, 2.0^(-60))
flsum_errorfree == (1.0, 2.0^(-60))
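An implementation of `errorfree_sum` is not shown here; the standard technique is the Møller–Knuth TwoSum, six ordinary additions and subtractions that work for either argument order (barring overflow). A sketch in Python, whose `float` is the same IEEE binary64 format as Float64 (the function name `two_sum` is mine):

```python
def two_sum(a: float, b: float) -> tuple[float, float]:
    """Moller-Knuth TwoSum: return (hi, lo) with hi == fl(a + b)
    and hi + lo == a + b exactly (barring overflow)."""
    hi = a + b
    bb = hi - a          # the portion of hi contributed by b
    aa = hi - bb         # the portion of hi contributed by a
    return hi, (a - aa) + (b - bb)

hi, lo = two_sum(1.0, 2.0**-60)
# hi == 1.0 and lo == 2.0**-60, so no information was lost
```

The same six operations recover the rounding error of any single addition, which is what makes the transformation error-free.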
Here is another example. We will check the result, finding it accurate through the last bit of the `lo` part, and we will verify that all of the non-zero value is captured and preserved.
flsum_errorfree = errorfree_sum(sqrt(3.0), sqrt(3.0)/1024)
flsum_errorfree == (1.7337422634356436, 3.686287386450715e-17)
# we can check that the result is as accurate as possible
# (here, through the last bit of the `lo` part)
bigsum = BigFloat(sqrt(3.0)) + BigFloat(sqrt(3.0)/1024);
hi = Float64(bigsum); lo = Float64(bigsum - hi); hi, lo
# (1.7337422634356436, 3.686287386450715e-17)
# we can verify that we captured and preserved all the non-zero value
Float64(bigsum - hi - lo) # all non-zero value has been preserved
# 0.0
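BigFloat is one way to check; another, sketched below in Python, uses exact rational arithmetic: every binary64 value is a dyadic rational, so `fractions.Fraction` can confirm the identity hi + lo == x + y with no rounding at all (`two_sum` here stands in for `errorfree_sum`, an assumption on my part):

```python
import math
from fractions import Fraction

def two_sum(a, b):
    # Moller-Knuth TwoSum (assumed equivalent to errorfree_sum)
    hi = a + b
    bb = hi - a
    aa = hi - bb
    return hi, (a - aa) + (b - bb)

x, y = math.sqrt(3.0), math.sqrt(3.0) / 1024
hi, lo = two_sum(x, y)
# Fraction(float) is exact, so this comparison involves no rounding at all
assert Fraction(hi) + Fraction(lo) == Fraction(x) + Fraction(y)
```

Unlike the BigFloat check, which is very accurate but still rounded, the rational check is exact by construction.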
In the literature, the value that disappears in the usual floating point addition is referred to as the error [of the sum]. That terminology seems slightly misleading to me. Instead, I use `hi` and `lo` to designate the high order and low order values (the most significant and least significant parts). Basic arithmetic error-free transformations (for the 2-argument forms of `+`, `-`, and `*`), where the result does not underflow or overflow, share the following property, called non-overlap or something that sounds similar. Imagine you expand out the exponents so the `hi` and `lo` significands are each shifted somewhere in their respective long sequences of bits (hypothetical registers). Now find the least significant nonzero bit in the register holding the `hi` part and the most significant nonzero bit in the register holding the `lo` part. Those bit positions do not overlap: the least significant nonzero `hi` bit is strictly more significant than the most significant nonzero `lo` bit.
hi, lo = errorfree_transformation(x, y)
hi === arithmetic_operation(x, y)
hi + lo === hi # hi, lo do not overlap
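The non-overlap property can also be checked directly by locating each part's extreme nonzero bits. In the Python sketch below (the helper names are mine), `frexp` gives the position of the leading bit and the 53-bit integer significand gives the trailing bit; the final assertion is exactly the property described above:

```python
import math

def two_sum(a, b):
    # Moller-Knuth TwoSum: hi + lo == a + b exactly (barring overflow)
    hi = a + b
    bb = hi - a
    aa = hi - bb
    return hi, (a - aa) + (b - bb)

def bit_span(x: float) -> tuple[int, int]:
    """For nonzero x, return (msb, lsb): the exponents of the most and
    least significant set bits, so x occupies weights 2**msb .. 2**lsb."""
    m, e = math.frexp(abs(x))        # x = +/- m * 2**e with 0.5 <= m < 1
    sig = int(m * 2**53)             # integer significand, up to 53 bits
    msb = e - 1
    lsb = e - 53 + ((sig & -sig).bit_length() - 1)
    return msb, lsb

hi, lo = two_sum(math.sqrt(3.0), math.sqrt(3.0) / 1024)
hi_msb, hi_lsb = bit_span(hi)
lo_msb, lo_lsb = bit_span(lo)
# strict non-overlap: hi's lowest set bit outranks lo's highest set bit
assert hi_lsb > lo_msb
```

`bit_span` assumes a nonzero argument; when `lo` is zero the non-overlap property holds vacuously.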
Just so you know: for extremely large values with opposite signs, those of magnitudes very near `floatmax(T)`, a more careful (slower) algorithm is needed to preserve the `lo` part if we really want an error-free result. This pair of values is used as a test case in a Rust implementation: (3.5630624444874539e+307, -1.7976931348623157e+308) https://docs.rs/safeeft/0.0.5/src/safeeft/twosum.rs.html#118. Changing the signs of those values gives another corner case. The usual algorithm does not work properly when the lesser of the two extremely large magnitudes is given as the first argument and the greater magnitude is given as the second. In those situations, the `lo` part of the result becomes ±Inf or NaN, depending on the underlying approach. For reference, the correct results for these corner cases are:
safe_two_sum( 3.5630624444874539e+307, -1.7976931348623157e+308) == (-1.4413868904135704e308, 9.9792015476736e291)
# ~ -1.4413868904135703_227177074629942291243e+308
safe_two_sum(-3.5630624444874539e+307, 1.7976931348623157e+308) == ( 1.4413868904135704e308, -9.9792015476736e291)
# ~ 1.4413868904135703_227177074629942291243e+308
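I have not reproduced the linked crate's exact algorithm; one well-known way to make these corner cases safe (my assumption, not necessarily what safeeft does) is to branch on magnitude and apply Dekker's Fast2Sum with the larger-magnitude argument first, so every intermediate stays in range whenever the rounded sum itself is finite. A Python sketch:

```python
import math

def naive_two_sum(a, b):
    # branch-free Moller-Knuth TwoSum; `hi - a` can overflow spuriously
    # when |a| is the lesser of two huge opposite-signed magnitudes
    hi = a + b
    bb = hi - a
    aa = hi - bb
    return hi, (a - aa) + (b - bb)

def safe_two_sum(a: float, b: float) -> tuple[float, float]:
    """Order by magnitude, then Dekker's Fast2Sum.
    hi + lo == a + b exactly whenever fl(a + b) is finite."""
    if abs(b) > abs(a):
        a, b = b, a
    hi = a + b
    lo = b - (hi - a)    # (hi - a) is exact when |a| >= |b|
    return hi, lo

x, y = 3.5630624444874539e+307, -1.7976931348623157e+308
_, bad_lo = naive_two_sum(x, y)   # here the lo part degenerates to NaN
hi, lo = safe_two_sum(x, y)       # matches the corner-case results above
```

The extra comparison is what makes this version slower than the branch-free TwoSum on typical hardware.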
There are error-free transformations that return more than 2 values. The EFTs for the sum of three values and for fma return 3 values, and the EFT for the sum of four values returns 4 values. Most of this discussion focuses on the 2-argument functions.
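As an illustration of the 3-value case, a three-term sum EFT can be assembled by cascading two 2-term EFTs (a common construction; note that the two error terms it returns may overlap each other unless renormalized further). A Python sketch, again verified with exact rational arithmetic:

```python
from fractions import Fraction

def two_sum(a, b):
    # Moller-Knuth TwoSum: hi + lo == a + b exactly (barring overflow)
    hi = a + b
    bb = hi - a
    aa = hi - bb
    return hi, (a - aa) + (b - bb)

def three_sum(a, b, c):
    """EFT for a 3-term sum: the three returned floats
    sum to a + b + c exactly (barring overflow)."""
    t, e1 = two_sum(a, b)   # t + e1 == a + b exactly
    s, e2 = two_sum(t, c)   # s + e2 == t + c exactly
    return s, e2, e1        # so s + e2 + e1 == a + b + c exactly

s, e2, e1 = three_sum(1.0, 2.0**-30, 2.0**-60)
exact = Fraction(1.0) + Fraction(2.0**-30) + Fraction(2.0**-60)
assert Fraction(s) + Fraction(e2) + Fraction(e1) == exact
```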