Last active
December 10, 2015 23:28
-
-
Save cdeil/4509330 to your computer and use it in GitHub Desktop.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
"""Test the speed and precision of the sum of many values. | |
When fitting data with many events or bins (say 1e6 or 1e9 or more), the fit statistic is first | |
computed per event or bin and then summed. | |
The sum (and thus the fit statistic and fit results) can become incorrect due to rounding errors | |
(see e.g. http://en.wikipedia.org/wiki/Kahan_summation_algorithm) | |
There are clever and slower methods to avoid this problem (like the Kahan summation algorithm), | |
but they are not readily available in stdlib C / C++ or numpy | |
(there's an open feature request for numpy though: https://github.com/numpy/numpy/issues/2448) | |
The simplest solution is to use 128 bit precision for the accumulator in numpy: | |
In [7]: np.sum([1e10, -1e10, 1e-6] * int(1e6)) | |
Out[7]: 1.9073477254638671 | |
In [8]: np.sum([1e10, -1e10, 1e-6] * int(1e6), dtype='float128') | |
Out[8]: 1.0002404448965787888 | |
Sherpa uses Kahan summation: | |
$ASCDS_INSTALL/src/pkg/sherpa/sherpa/sherpa/include/sherpa/stats.hh | |
This is a quick benchmarking of the precision, speed of sum as a function of these parameters: | |
* Accumulator bit size: 32, 64 or 128 | |
* Number of elements: 1e6, 1e9 | |
* TODO: how to choose values in a meaningful way? Precision results will completely depend on this. | |
Should be chosen similar to typical / extreme fitting cases with CASH, CSTAT, CHI2 fits | |
For now we only check the speed. | |
* TODO: Check against C and Cython implementation | |
""" | |
from timeit import Timer | |
dtypes = ['float32', 'float64', 'float128', 'longdouble'] | |
sizes = [int(1e6), int(1e9)] | |
def setup(size, dtype): | |
return """ | |
import numpy as np | |
data = np.zeros({size}, dtype='{dtype}') | |
"""[1:-1].format(**locals()) | |
def statement(dtype): | |
return """data.sum(dtype='{dtype}')""".format(**locals()) | |
for data_dtype in dtypes: | |
for accumulator_dtype in dtypes: | |
for size in sizes: | |
try: | |
timer = Timer(statement(accumulator_dtype), setup(size, data_dtype)) | |
time = min(timer.repeat(repeat=3, number=1)) | |
# Let's use the frequency in GHz of summed elements as our measure of speed | |
speed = 1e-9 * (size / time) | |
except: | |
speed = -1 | |
print('%10s %10s %10d %10.5f' % | |
(data_dtype, accumulator_dtype, size, speed)) | |
""" | |
On my 2.6 GHz Intel Core I7 Macbook the speed doesn't depend on data or accumulator dtype at all. | |
This is weird, because it's a 64 bit machine, so 128 bit addition should be slower. | |
Also for such a simple computation as sum the limiting factor should be memory loading speed, | |
so 128 bit data should be slower to process than 64 bit data? | |
In [53]: run sum_benchmark.py | |
f32 f32 1000000 0.82793 | |
f32 f32 1000000000 1.12276 | |
f32 f64 1000000 1.12207 | |
f32 f64 1000000000 1.10964 | |
f32 f128 1000000 1.04155 | |
f32 f128 1000000000 1.12900 | |
f64 f32 1000000 1.10609 | |
f64 f32 1000000000 1.12823 | |
f64 f64 1000000 1.10493 | |
f64 f64 1000000000 1.11920 | |
f64 f128 1000000 1.15450 | |
f64 f128 1000000000 1.11794 | |
f128 f32 1000000 1.12087 | |
f128 f32 1000000000 1.12223 | |
f128 f64 1000000 1.09885 | |
f128 f64 1000000000 1.11911 | |
f128 f128 1000000 1.06943 | |
f128 f128 1000000000 1.12578 | |
On an Intel Xeon CPU E5-2670 0 @ 2.60GHz running Scientific Linux I get: | |
float32 float32 1000000 0.94064 | |
float32 float32 1000000000 0.94639 | |
float32 float64 1000000 0.18847 | |
float32 float64 1000000000 0.18533 | |
float32 float128 1000000 0.13255 | |
float32 float128 1000000000 0.13377 | |
float32 longdouble 1000000 0.13289 | |
float32 longdouble 1000000000 0.13301 | |
float64 float32 1000000 0.17677 | |
float64 float32 1000000000 0.18126 | |
float64 float64 1000000 0.84172 | |
float64 float64 1000000000 0.88184 | |
float64 float128 1000000 0.12883 | |
float64 float128 1000000000 0.13193 | |
float64 longdouble 1000000 0.13350 | |
float64 longdouble 1000000000 0.13206 | |
float128 float32 1000000 0.15288 | |
float128 float32 1000000000 0.15494 | |
float128 float64 1000000 0.15133 | |
float128 float64 1000000000 0.15463 | |
float128 float128 1000000 0.60717 | |
float128 float128 1000000000 0.57651 | |
float128 longdouble 1000000 0.63454 | |
float128 longdouble 1000000000 0.57431 | |
longdouble float32 1000000 0.15543 | |
longdouble float32 1000000000 0.15688 | |
longdouble float64 1000000 0.14881 | |
longdouble float64 1000000000 0.15330 | |
longdouble float128 1000000 0.61881 | |
longdouble float128 1000000000 0.57245 | |
longdouble longdouble 1000000 0.63167 | |
longdouble longdouble 1000000000 0.58152 | |
""" |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment