cdeil/sum_benchmark.py

## sum_benchmark.py
"""Test the speed and precision of the sum of many values.

When fitting data with many events or bins (say 1e6 or 1e9 or more), the fit statistic is first
computed per event or bin and then summed.

The sum (and thus the fit statistic and fit results) can become incorrect due to rounding errors
(see e.g. http://en.wikipedia.org/wiki/Kahan_summation_algorithm)

There are clever and slower methods to avoid this problem (like the Kahan summation algorithm),
but they are not readily available in stdlib C / C++ or numpy
(there's an open feature request for numpy though: https://github.com/numpy/numpy/issues/2448)

The simplest solution is to use 128 bit precision for the accumulator in numpy:
In [7]: np.sum([1e10, -1e10, 1e-6] * int(1e6))
Out[7]: 1.9073477254638671
In [8]: np.sum([1e10, -1e10, 1e-6] * int(1e6), dtype='float128')
Out[8]: 1.0002404448965787888

Sherpa uses Kahan summation:
$ASCDS_INSTALL/src/pkg/sherpa/sherpa/sherpa/include/sherpa/stats.hh

This is a quick benchmarking of the precision, speed of sum as a function of these parameters:
* Accumulator bit size: 32, 64 or 128
* Number of elements: 1e6, 1e9
* TODO: how to choose values in a meaningful way? Precision results will completely depend on this.
  Should be chosen similar to typical / extreme fitting cases with CASH, CSTAT, CHI2 fits
  For now we only check the speed.
* TODO: Check against C and Cython implementation
"""

from timeit import Timer

dtypes = ['float32', 'float64', 'float128', 'longdouble']
sizes = [int(1e6), int(1e9)]

def setup(size, dtype):
    return """
import numpy as np
data = np.zeros({size}, dtype='{dtype}')
"""[1:-1].format(**locals())

def statement(dtype):
    return """data.sum(dtype='{dtype}')""".format(**locals())

for data_dtype in dtypes:
    for accumulator_dtype in dtypes:
        for size in sizes:
            try:
                timer = Timer(statement(accumulator_dtype), setup(size, data_dtype))
                time = min(timer.repeat(repeat=3, number=1))
                # Let's use the frequency in GHz of summed elements as our measure of speed
                speed = 1e-9 * (size / time)
            except:
                speed = -1
            print('%10s %10s %10d %10.5f' %
                  (data_dtype, accumulator_dtype, size, speed))

"""
On my 2.6 GHz Intel Core I7 Macbook the speed doesn't depend on data or accumulator dtype at all.
This is weird, because it's a 64 bit machine, so 128 bit addition should be slower.
Also for such a simple computation as sum the limiting factor should be memory loading speed,
so 128 bit data should be slower to process than 64 bit data?

In [53]: run sum_benchmark.py
       f32        f32    1000000    0.82793
       f32        f32 1000000000    1.12276
       f32        f64    1000000    1.12207
       f32        f64 1000000000    1.10964
       f32       f128    1000000    1.04155
       f32       f128 1000000000    1.12900
       f64        f32    1000000    1.10609
       f64        f32 1000000000    1.12823
       f64        f64    1000000    1.10493
       f64        f64 1000000000    1.11920
       f64       f128    1000000    1.15450
       f64       f128 1000000000    1.11794
      f128        f32    1000000    1.12087
      f128        f32 1000000000    1.12223
      f128        f64    1000000    1.09885
      f128        f64 1000000000    1.11911
      f128       f128    1000000    1.06943
      f128       f128 1000000000    1.12578

On an Intel Xeon CPU E5-2670 0 @ 2.60GHz running Scientific Linux I get:
   float32    float32    1000000    0.94064
   float32    float32 1000000000    0.94639
   float32    float64    1000000    0.18847
   float32    float64 1000000000    0.18533
   float32   float128    1000000    0.13255
   float32   float128 1000000000    0.13377
   float32 longdouble    1000000    0.13289
   float32 longdouble 1000000000    0.13301
   float64    float32    1000000    0.17677
   float64    float32 1000000000    0.18126
   float64    float64    1000000    0.84172
   float64    float64 1000000000    0.88184
   float64   float128    1000000    0.12883
   float64   float128 1000000000    0.13193
   float64 longdouble    1000000    0.13350
   float64 longdouble 1000000000    0.13206
  float128    float32    1000000    0.15288
  float128    float32 1000000000    0.15494
  float128    float64    1000000    0.15133
  float128    float64 1000000000    0.15463
  float128   float128    1000000    0.60717
  float128   float128 1000000000    0.57651
  float128 longdouble    1000000    0.63454
  float128 longdouble 1000000000    0.57431
longdouble    float32    1000000    0.15543
longdouble    float32 1000000000    0.15688
longdouble    float64    1000000    0.14881
longdouble    float64 1000000000    0.15330
longdouble   float128    1000000    0.61881
longdouble   float128 1000000000    0.57245
longdouble longdouble    1000000    0.63167
longdouble longdouble 1000000000    0.58152
"""
	"""Test the speed and precision of the sum of many values.

	When fitting data with many events or bins (say 1e6 or 1e9 or more), the fit statistic is first
	computed per event or bin and then summed.

	The sum (and thus the fit statistic and fit results) can become incorrect due to rounding errors
	(see e.g. http://en.wikipedia.org/wiki/Kahan_summation_algorithm)

	There are clever and slower methods to avoid this problem (like the Kahan summation algorithm),
	but they are not readily available in stdlib C / C++ or numpy
	(there's an open feature request for numpy though: https://github.com/numpy/numpy/issues/2448)

	The simplest solution is to use 128 bit precision for the accumulator in numpy:
	In [7]: np.sum([1e10, -1e10, 1e-6] * int(1e6))
	Out[7]: 1.9073477254638671
	In [8]: np.sum([1e10, -1e10, 1e-6] * int(1e6), dtype='float128')
	Out[8]: 1.0002404448965787888

	Sherpa uses Kahan summation:
	$ASCDS_INSTALL/src/pkg/sherpa/sherpa/sherpa/include/sherpa/stats.hh

	This is a quick benchmarking of the precision, speed of sum as a function of these parameters:
	* Accumulator bit size: 32, 64 or 128
	* Number of elements: 1e6, 1e9
	* TODO: how to choose values in a meaningful way? Precision results will completely depend on this.
	Should be chosen similar to typical / extreme fitting cases with CASH, CSTAT, CHI2 fits
	For now we only check the speed.
	* TODO: Check against C and Cython implementation
	"""

	from timeit import Timer

	dtypes = ['float32', 'float64', 'float128', 'longdouble']
	sizes = [int(1e6), int(1e9)]

	def setup(size, dtype):
	return """
	import numpy as np
	data = np.zeros({size}, dtype='{dtype}')
	"""[1:-1].format(**locals())

	def statement(dtype):
	return """data.sum(dtype='{dtype}')""".format(**locals())

	for data_dtype in dtypes:
	for accumulator_dtype in dtypes:
	for size in sizes:
	try:
	timer = Timer(statement(accumulator_dtype), setup(size, data_dtype))
	time = min(timer.repeat(repeat=3, number=1))
	# Let's use the frequency in GHz of summed elements as our measure of speed
	speed = 1e-9 * (size / time)
	except:
	speed = -1
	print('%10s %10s %10d %10.5f' %
	(data_dtype, accumulator_dtype, size, speed))

	"""
	On my 2.6 GHz Intel Core I7 Macbook the speed doesn't depend on data or accumulator dtype at all.
	This is weird, because it's a 64 bit machine, so 128 bit addition should be slower.
	Also for such a simple computation as sum the limiting factor should be memory loading speed,
	so 128 bit data should be slower to process than 64 bit data?

	In [53]: run sum_benchmark.py
	f32 f32 1000000 0.82793
	f32 f32 1000000000 1.12276
	f32 f64 1000000 1.12207
	f32 f64 1000000000 1.10964
	f32 f128 1000000 1.04155
	f32 f128 1000000000 1.12900
	f64 f32 1000000 1.10609
	f64 f32 1000000000 1.12823
	f64 f64 1000000 1.10493
	f64 f64 1000000000 1.11920
	f64 f128 1000000 1.15450
	f64 f128 1000000000 1.11794
	f128 f32 1000000 1.12087
	f128 f32 1000000000 1.12223
	f128 f64 1000000 1.09885
	f128 f64 1000000000 1.11911
	f128 f128 1000000 1.06943
	f128 f128 1000000000 1.12578

	On an Intel Xeon CPU E5-2670 0 @ 2.60GHz running Scientific Linux I get:
	float32 float32 1000000 0.94064
	float32 float32 1000000000 0.94639
	float32 float64 1000000 0.18847
	float32 float64 1000000000 0.18533
	float32 float128 1000000 0.13255
	float32 float128 1000000000 0.13377
	float32 longdouble 1000000 0.13289
	float32 longdouble 1000000000 0.13301
	float64 float32 1000000 0.17677
	float64 float32 1000000000 0.18126
	float64 float64 1000000 0.84172
	float64 float64 1000000000 0.88184
	float64 float128 1000000 0.12883
	float64 float128 1000000000 0.13193
	float64 longdouble 1000000 0.13350
	float64 longdouble 1000000000 0.13206
	float128 float32 1000000 0.15288
	float128 float32 1000000000 0.15494
	float128 float64 1000000 0.15133
	float128 float64 1000000000 0.15463
	float128 float128 1000000 0.60717
	float128 float128 1000000000 0.57651
	float128 longdouble 1000000 0.63454
	float128 longdouble 1000000000 0.57431
	longdouble float32 1000000 0.15543
	longdouble float32 1000000000 0.15688
	longdouble float64 1000000 0.14881
	longdouble float64 1000000000 0.15330
	longdouble float128 1000000 0.61881
	longdouble float128 1000000000 0.57245
	longdouble longdouble 1000000 0.63167
	longdouble longdouble 1000000000 0.58152
	"""