cfbolz/README.md Secret

## README.md

      
    Raw
  

              README.md
            
          
    The C code is in mm_build.py, the Python code in mm.py. To run the
experiment, use runner.py (this takes a night on my laptop), to produce the
results report.py. The raw data of the run that I did is in data.json.

  
## data.json
{"totaltime": 26292.353258132935, "results": {"python": [517.691975117, 520.333083153, 511.448441982, 515.494250059, 506.672176123, 514.22834897, 505.720468044, 506.864086151, 515.821257114, 516.192405939, 505.791134834, 533.7814641, 517.232166767, 513.63011694, 510.345921993, 506.653492928, 506.376320124, 504.348473072, 513.866441011, 506.332873821, 515.717053175, 520.538961172, 515.469412088, 517.487001896, 516.497164965, 504.021055937, 503.471532106, 525.189102888, 506.854137897, 514.394644976, 508.216271877, 504.955558062, 516.272284985, 509.004635096, 505.126121044, 510.817576885, 515.186355829, 517.686940908, 516.346740007, 516.169267893, 506.458129883, 504.289647102, 525.733214855, 505.447642088, 514.629616976, 507.585628986, 517.344475031, 517.974838972, 505.141008139, 516.538213968], "pypy": [8.22706985474, 8.20490908623, 8.1530559063, 8.16551494598, 8.15609288216, 8.1767039299, 8.14587807655, 8.19161105156, 8.18439888954, 8.17135596275, 8.21987700462, 8.17626714706, 8.14857006073, 8.16234707832, 8.18099403381, 8.16414093971, 8.14245200157, 8.15396404266, 8.15938210487, 8.16832613945, 8.16422104836, 8.18561697006, 8.1420378685, 8.17985486984, 8.17965102196, 8.18215608597, 8.16230511665, 8.16200494766, 8.16942405701, 8.17053818703, 8.14694786072, 8.15651202202, 8.15218901634, 8.17239809036, 8.14671993256, 8.1669960022, 8.16178917885, 8.14932703972, 8.16051602364, 8.21270394325, 8.1445980072, 8.17343306541, 8.14998602867, 8.15777206421, 8.15343904495, 8.17025089264, 8.15408015251, 8.15962600708, 8.15500617027, 8.16067314148], "c": [2.19682002068, 2.09115219116, 2.08114385605, 2.07064008713, 2.07012295723, 2.09699106216, 2.21557402611, 2.10430884361, 2.09098100662, 2.10097098351, 2.07317018509, 2.09088897705, 2.08370494843, 2.21196484566, 2.08613705635, 2.06726098061, 2.19589710236, 2.08736300468, 2.07264590263, 2.18075013161, 2.16967606544, 2.15311193466, 2.16501998901, 2.08497905731, 2.09225296974, 2.17023515701, 2.22405099869, 2.08495020866, 2.13236093521, 2.16504502296, 2.11265397072, 2.2247030735, 2.24440002441, 2.24582791328, 2.24160599709, 2.24124002457, 2.24960708618, 2.24201798439, 2.24800515175, 2.24004483223, 2.24525809288, 2.1488878727, 2.24572300911, 2.2334690094, 2.24044394493, 2.27632403374, 2.23374199867, 2.1650660038, 2.24659991264, 2.18787097931], "numpy": [0.170726060867, 0.167151927948, 0.170216083527, 0.169290065765, 0.167598962784, 0.174237012863, 0.198158979416, 0.172521114349, 0.17888712883, 0.167629003525, 0.169377803802, 0.168882131577, 0.167588949203, 0.170213937759, 0.169671058655, 0.169701099396, 0.164860963821, 0.171133995056, 0.166845083237, 0.175467014313, 0.192009210587, 0.16708111763, 0.171844005585, 0.167322874069, 0.166540145874, 0.188297986984, 0.171252012253, 0.170870065689, 0.17115521431, 0.166064977646, 0.16881608963, 0.168605804443, 0.165496826172, 0.169096946716, 0.166150093079, 0.176074981689, 0.171052932739, 0.172950983047, 0.165434122086, 0.169627904892, 0.167983055115, 0.174736022949, 0.165910959244, 0.172050952911, 0.171135902405, 0.170073032379, 0.169723987579, 0.172288179398, 0.178520202637, 0.169209003448]}, "size": 1500}

## matmul.md

      
    Raw
  

              matmul.md
            
          
    Repeating a Matrix Multiplication Benchmark

I watched the Hennessy & Patterson's Turing award lecture recently:
<iframe width="560" height="315"
src="https://www.youtube-nocookie.com/embed/3LVeEjsn8Ts?rel=0" frameborder="0"
allow="autoplay; encrypted-media" allowfullscreen></iframe>
In it, there's a slide comparing the performance of various matrix
multiplication implementations, using Python (presumably CPython) as a baseline
and comparing that against various C implementations (I couldn't find the
linked paper yet).

I expected the baseline speedup of switching from CPython to C to be much
higher and I also wanted to know what performance PyPy gets, so I did my own
benchmarks. This is a problem that Python is completely unsuited for, so it
should give very exaggerated results.
The usual disclaimers apply: All benchmarks are lies, benchmarking of
synthetic workloads even more so. My implementation is really naive (though I
did optimize it a little bit to help CPython), don't use any
of this code
for anything real. The benchmarks ran on my rather old Intel i5-3230M laptop
under Ubuntu 17.10.
With that said, my results are as follows:


Implementation
time
speedup over CPython
speedup over PyPy


CPython
512.588 ± 2.362 s
1 ×


PyPy
8.167 ± 0.007 s
62.761 ±  0.295 ×
1 ×


'naive' C
2.164 ± 0.025 s
236.817 ±  2.918 ×
3.773 ± 0.044 ×


NumPy
0.171 ± 0.002 s
2992.286 ± 42.308 ×
47.678 ± 0.634 ×


This is running 1500x1500 matrix multiplications with (the same) random
matrices. Every implementation is run 50 times in a fresh process. The results
are averaged, the errors are bootstrapped 99% confidence intervals.
So indeed the speedup of switching from CPython to C is quite a bit higher than
47x! PyPy is much better than CPython, but of course can't really compete
against GCC. And then the real professionals (numpy/OpenBLAS) are in a whole
'nother league. The speedup of the AVX numbers in the slide above is even
higher than my NumPy numbers, which I assume is the result of my old CPU with
two cores, vs. the 18 core CPU with AVX support. Lesson confirmed: leave matrix
multiplication to people who actually know what they are doing.

  
## matmul.png

      
    Raw
  

              matmul.png
            
          
## mm.py
import array
import numpy
from _mmc import ffi, lib

class M(object):
    def __init__(self, data, shape):
        # use an array for CPython's benefit, for PyPy a list would be fine too
        self.data = array.array("d", data)
        self.data = data
        self.shape = shape
        assert len(shape) == 2
        assert len(data) == shape[0] * shape[1]

    @staticmethod
    def fromshape(shape):
        return M([0.0] * (shape[0] * shape[1]), shape)

    def __getitem__(self, index):
        return self.data[index[0] * self.shape[1] + index[1]]

    def __setitem__(self, index, value):
        self.data[index[0] * self.shape[1] + index[1]] = value

    def matmul(self, other):
        # approach from:
        # https://tavianator.com/a-quick-trick-for-faster-naive-matrix-multiplication/
        assert self.shape[1] == other.shape[0]
        rows = self.shape[0]
        mid = self.shape[1]
        columns = other.shape[1]

        result = M.fromshape((rows, columns))

        selfdata = self.data
        otherdata = other.data
        resultdata = result.data
        baseresultindex = 0
        baseselfindex = 0
        for i in range(rows):
            baseotherindex = 0
            for k in range(mid):
                selfik = selfdata[baseselfindex + k]
                j = 0
                while j < columns:
                    resultdata[baseresultindex + j] += selfik * otherdata[baseotherindex + j]
                    j += 1
                baseotherindex += columns
            baseresultindex += columns
            baseselfindex += mid
        return result

def test_index():
    x = M([1, 2, 3, 4, 5, 6], (2, 3))
    assert x[0, 0] == 1
    assert x[0, 1] == 2
    assert x[0, 2] == 3
    assert x[1, 0] == 4
    assert x[1, 1] == 5
    assert x[1, 2] == 6


def test_matmul():
    x = M([1, 2, 1, 2, 2, 3], (3, 2))
    y = M([1, 2, 3, 7, 4, 5, 6, 2], (2, 4))
    assert x.matmul(y).data == [9, 12, 15, 11,  9, 12, 15, 11, 14, 19, 24, 20]

def main():
    import time, random, sys
    random.seed(1)
    method = sys.argv[1]
    size = int(sys.argv[2])

    d1 = [random.random() for i in range(size**2)]
    d2 = [random.random() for i in range(size**2)]

    if method == "python":
        m1 = M(d1, (size, size))
        m2 = M(d2, (size, size))
        t1 = time.time()
        m1.matmul(m2)
        t2 = time.time()
        print t2 - t1

    elif method == "numpy":
        m1 = numpy.array(d1)
        m1.shape = (size, size)
        m2 = numpy.array(d2)
        m2.shape = (size, size)
        t1 = time.time()
        numpy.matmul(m1, m2)
        t2 = time.time()
        print t2 - t1

    elif method == "c":
        m1 = ffi.new("double[]", size*size)
        m2 = ffi.new("double[]", size*size)
        m3 = ffi.new("double[]", size*size)
        for i, x in enumerate(d1):
            m1[i] = x
        for i, x in enumerate(d2):
            m2[i] = x
        t1 = time.time()
        lib.matmul(m3, m1, m2, size, size, size)
        t2 = time.time()
        print t2 - t1


if __name__ == '__main__':
    main()

## mm_build.py
from cffi import FFI
ffibuilder = FFI()

# C code from: https://tavianator.com/a-quick-trick-for-faster-naive-matrix-multiplication/

ffibuilder.set_source("_mmc",
   r"""
void matmul(double *dest, const double *lhs, const double *rhs,
            size_t rows, size_t mid, size_t cols) {
    memset(dest, 0, rows * cols * sizeof(double));

    for (size_t i = 0; i < rows; ++i) {
        const double *rhs_row = rhs;
        for (size_t j = 0; j < mid; ++j) {
            for (size_t k = 0; k < cols; ++k) {
                dest[k] += lhs[j] * rhs_row[k];
            }
            rhs_row += cols;
        }
        dest += cols;
        lhs += mid;
    }
}
    """,
    libraries=[],
    extra_compile_args=["-O3", "-mtune=native"])

ffibuilder.cdef("""
    // declarations that are shared between Python and C
void matmul(double *dest, const double *lhs, const double *rhs,
            size_t rows, size_t mid, size_t cols);
""")

if __name__ == "__main__":
    ffibuilder.compile(verbose=True)

## report.py
import json
import random
import math
from decimal import Decimal, ROUND_DOWN, ROUND_UP

REPETITIONS = 100000

# some copy-pasted old code

def _mean(l):
    return math.fsum(l) / float(len(l))

def bootstrap_sample(values):
    return [random.choice(values) for i in range(len(values))]

def confidence_slice(means, confidence="0.99"):
    means = sorted(means)
    # There may be >1 median indices, i.e. data is even-sized.
    lower, middle_indicies, upper = _confidence_slice_indicies(len(means), confidence)
    median = _mean([means[i] for i in middle_indicies])
    return means[lower], median, means[upper - 1] # upper is *exclusive*

def _confidence_slice_indicies(length, confidence_level=Decimal('0.95')):
    """Returns a triple (lower, mean_indicies, upper) so that l[lower:upper]
    gives confidence_level of all samples. Mean_indicies is a tuple of one or
    two indicies that correspond to the mean position

    Keyword arguments:
    confidence_level -- desired level of confidence as a Decimal instance.
    """

    assert not isinstance(confidence_level, float)
    confidence_level = Decimal(confidence_level)
    assert isinstance(confidence_level, Decimal)
    exclude = (1 - confidence_level) / 2

    if length % 2 == 0:
        mean_indicies = (length // 2 - 1, length // 2)
    else:
        mean_indicies = (length // 2, )

    lower_index = int(
        (exclude * length).quantize(Decimal('1.'), rounding=ROUND_DOWN)
    )

    upper_index = int(
        ((1 - exclude) * length).quantize(Decimal('1.'), rounding=ROUND_UP)
    )

    return lower_index, mean_indicies, upper_index

def format_error(val, err, unit='s'):
    return "%s±%s %s" % (round(val, 3), round(err, 3), unit)

def compute_speedup(l, baseline):
    speedups = []
    for i in range(REPETITIONS):
        speedups.append(_mean(bootstrap_sample(baseline)) / _mean(bootstrap_sample(l)))
    lower, mid, upper = confidence_slice(speedups)
    return mid, _mean([mid - lower, upper - mid])

NICE_NAMES = dict(python="CPython", pypy="PyPy", numpy="NumPy", c="'naive' C")

def main():
    with file("data.json") as f:
        data = json.load(f)

    print "matrix size", data["size"]
    print " " * 20, "time".rjust(30), "speedup over CPython".rjust(30), "speedup over PyPy".rjust(30)
    for name in "python pypy c numpy".split():
        l = data["results"][name]
        samples = [_mean(bootstrap_sample(l)) for i in range(REPETITIONS)]
        lower, mid, upper = confidence_slice(samples)
        print NICE_NAMES[name].rjust(20), format_error(_mean(l), _mean([mid - lower, upper - mid])).rjust(30),
        if name == "python":
            print "1 ×".rjust(30)
        else:
            speedup, speeduperr = compute_speedup(l, data["results"]["python"])
            print format_error(speedup, speeduperr, "×").rjust(30),
            if name == "pypy":
                print "1 ×".rjust(30)
            else:
                speedup, speeduperr = compute_speedup(l, data["results"]["pypy"])
                print format_error(speedup, speeduperr, "×").rjust(30)

if __name__ == '__main__':
    main()

## runner.py
import subprocess
import time
import json

SIZE = 1500
TIMES = 50

def run_repeatedly(cmd, times=TIMES):
    print cmd
    res = []
    for i in range(times):
        output = subprocess.check_output(cmd, shell=True)
        res.append(float(output.strip()))
        print i, res[-1]
    return res

def main():
    t1 = time.time()
    d = dict(size=SIZE, results={})
    try:
        for name, cmd in [
                ("numpy", "OPENBLAS_NUM_THREADS=2 python mm.py numpy"),
                ("c", "python mm.py c"),
                ("pypy", "pypy mm.py python"),
                ("python", "python mm.py python"),
                ]:
            res = run_repeatedly("%s %s" % (cmd, SIZE))
            print name, res
            d["results"][name] = res
            with file("data.json", "w") as f:
                f.write(json.dumps(d))
    finally:
        t2 = time.time()
        print "total time", t2 - t1
        d["totaltime"] = t2 - t1
        with file("data.json", "w") as f:
            f.write(json.dumps(d))


if __name__ == '__main__':
    main()
Implementation	time	speedup over CPython	speedup over PyPy
CPython	512.588 ± 2.362 s	1 ×
PyPy	8.167 ± 0.007 s	62.761 ± 0.295 ×	1 ×
'naive' C	2.164 ± 0.025 s	236.817 ± 2.918 ×	3.773 ± 0.044 ×
NumPy	0.171 ± 0.002 s	2992.286 ± 42.308 ×	47.678 ± 0.634 ×
	import array
	import numpy
	from _mmc import ffi, lib

	class M(object):
	def __init__(self, data, shape):
	# use an array for CPython's benefit, for PyPy a list would be fine too
	self.data = array.array("d", data)
	self.data = data
	self.shape = shape
	assert len(shape) == 2
	assert len(data) == shape[0] * shape[1]

	@staticmethod
	def fromshape(shape):
	return M([0.0] * (shape[0] * shape[1]), shape)

	def __getitem__(self, index):
	return self.data[index[0] * self.shape[1] + index[1]]

	def __setitem__(self, index, value):
	self.data[index[0] * self.shape[1] + index[1]] = value

	def matmul(self, other):
	# approach from:
	# https://tavianator.com/a-quick-trick-for-faster-naive-matrix-multiplication/
	assert self.shape[1] == other.shape[0]
	rows = self.shape[0]
	mid = self.shape[1]
	columns = other.shape[1]

	result = M.fromshape((rows, columns))

	selfdata = self.data
	otherdata = other.data
	resultdata = result.data
	baseresultindex = 0
	baseselfindex = 0
	for i in range(rows):
	baseotherindex = 0
	for k in range(mid):
	selfik = selfdata[baseselfindex + k]
	j = 0
	while j < columns:
	resultdata[baseresultindex + j] += selfik * otherdata[baseotherindex + j]
	j += 1
	baseotherindex += columns
	baseresultindex += columns
	baseselfindex += mid
	return result

	def test_index():
	x = M([1, 2, 3, 4, 5, 6], (2, 3))
	assert x[0, 0] == 1
	assert x[0, 1] == 2
	assert x[0, 2] == 3
	assert x[1, 0] == 4
	assert x[1, 1] == 5
	assert x[1, 2] == 6


	def test_matmul():
	x = M([1, 2, 1, 2, 2, 3], (3, 2))
	y = M([1, 2, 3, 7, 4, 5, 6, 2], (2, 4))
	assert x.matmul(y).data == [9, 12, 15, 11, 9, 12, 15, 11, 14, 19, 24, 20]

	def main():
	import time, random, sys
	random.seed(1)
	method = sys.argv[1]
	size = int(sys.argv[2])

	d1 = [random.random() for i in range(size**2)]
	d2 = [random.random() for i in range(size**2)]

	if method == "python":
	m1 = M(d1, (size, size))
	m2 = M(d2, (size, size))
	t1 = time.time()
	m1.matmul(m2)
	t2 = time.time()
	print t2 - t1

	elif method == "numpy":
	m1 = numpy.array(d1)
	m1.shape = (size, size)
	m2 = numpy.array(d2)
	m2.shape = (size, size)
	t1 = time.time()
	numpy.matmul(m1, m2)
	t2 = time.time()
	print t2 - t1

	elif method == "c":
	m1 = ffi.new("double[]", size*size)
	m2 = ffi.new("double[]", size*size)
	m3 = ffi.new("double[]", size*size)
	for i, x in enumerate(d1):
	m1[i] = x
	for i, x in enumerate(d2):
	m2[i] = x
	t1 = time.time()
	lib.matmul(m3, m1, m2, size, size, size)
	t2 = time.time()
	print t2 - t1


	if __name__ == '__main__':
	main()
	from cffi import FFI
	ffibuilder = FFI()

	# C code from: https://tavianator.com/a-quick-trick-for-faster-naive-matrix-multiplication/

	ffibuilder.set_source("_mmc",
	r"""
	void matmul(double dest, const double lhs, const double *rhs,
	size_t rows, size_t mid, size_t cols) {
	memset(dest, 0, rows * cols * sizeof(double));

	for (size_t i = 0; i < rows; ++i) {
	const double *rhs_row = rhs;
	for (size_t j = 0; j < mid; ++j) {
	for (size_t k = 0; k < cols; ++k) {
	dest[k] += lhs[j] * rhs_row[k];
	}
	rhs_row += cols;
	}
	dest += cols;
	lhs += mid;
	}
	}
	""",
	libraries=[],
	extra_compile_args=["-O3", "-mtune=native"])

	ffibuilder.cdef("""
	// declarations that are shared between Python and C
	void matmul(double dest, const double lhs, const double *rhs,
	size_t rows, size_t mid, size_t cols);
	""")

	if __name__ == "__main__":
	ffibuilder.compile(verbose=True)
	import json
	import random
	import math
	from decimal import Decimal, ROUND_DOWN, ROUND_UP

	REPETITIONS = 100000

	# some copy-pasted old code

	def _mean(l):
	return math.fsum(l) / float(len(l))

	def bootstrap_sample(values):
	return [random.choice(values) for i in range(len(values))]

	def confidence_slice(means, confidence="0.99"):
	means = sorted(means)
	# There may be >1 median indices, i.e. data is even-sized.
	lower, middle_indicies, upper = _confidence_slice_indicies(len(means), confidence)
	median = _mean([means[i] for i in middle_indicies])
	return means[lower], median, means[upper - 1] # upper is exclusive

	def _confidence_slice_indicies(length, confidence_level=Decimal('0.95')):
	"""Returns a triple (lower, mean_indicies, upper) so that l[lower:upper]
	gives confidence_level of all samples. Mean_indicies is a tuple of one or
	two indicies that correspond to the mean position

	Keyword arguments:
	confidence_level -- desired level of confidence as a Decimal instance.
	"""

	assert not isinstance(confidence_level, float)
	confidence_level = Decimal(confidence_level)
	assert isinstance(confidence_level, Decimal)
	exclude = (1 - confidence_level) / 2

	if length % 2 == 0:
	mean_indicies = (length // 2 - 1, length // 2)
	else:
	mean_indicies = (length // 2, )

	lower_index = int(
	(exclude * length).quantize(Decimal('1.'), rounding=ROUND_DOWN)
	)

	upper_index = int(
	((1 - exclude) * length).quantize(Decimal('1.'), rounding=ROUND_UP)
	)

	return lower_index, mean_indicies, upper_index

	def format_error(val, err, unit='s'):
	return "%s±%s %s" % (round(val, 3), round(err, 3), unit)

	def compute_speedup(l, baseline):
	speedups = []
	for i in range(REPETITIONS):
	speedups.append(_mean(bootstrap_sample(baseline)) / _mean(bootstrap_sample(l)))
	lower, mid, upper = confidence_slice(speedups)
	return mid, _mean([mid - lower, upper - mid])

	NICE_NAMES = dict(python="CPython", pypy="PyPy", numpy="NumPy", c="'naive' C")

	def main():
	with file("data.json") as f:
	data = json.load(f)

	print "matrix size", data["size"]
	print " " * 20, "time".rjust(30), "speedup over CPython".rjust(30), "speedup over PyPy".rjust(30)
	for name in "python pypy c numpy".split():
	l = data["results"][name]
	samples = [_mean(bootstrap_sample(l)) for i in range(REPETITIONS)]
	lower, mid, upper = confidence_slice(samples)
	print NICE_NAMES[name].rjust(20), format_error(_mean(l), _mean([mid - lower, upper - mid])).rjust(30),
	if name == "python":
	print "1 ×".rjust(30)
	else:
	speedup, speeduperr = compute_speedup(l, data["results"]["python"])
	print format_error(speedup, speeduperr, "×").rjust(30),
	if name == "pypy":
	print "1 ×".rjust(30)
	else:
	speedup, speeduperr = compute_speedup(l, data["results"]["pypy"])
	print format_error(speedup, speeduperr, "×").rjust(30)

	if __name__ == '__main__':
	main()
	import subprocess
	import time
	import json

	SIZE = 1500
	TIMES = 50

	def run_repeatedly(cmd, times=TIMES):
	print cmd
	res = []
	for i in range(times):
	output = subprocess.check_output(cmd, shell=True)
	res.append(float(output.strip()))
	print i, res[-1]
	return res

	def main():
	t1 = time.time()
	d = dict(size=SIZE, results={})
	try:
	for name, cmd in [
	("numpy", "OPENBLAS_NUM_THREADS=2 python mm.py numpy"),
	("c", "python mm.py c"),
	("pypy", "pypy mm.py python"),
	("python", "python mm.py python"),
	]:
	res = run_repeatedly("%s %s" % (cmd, SIZE))
	print name, res
	d["results"][name] = res
	with file("data.json", "w") as f:
	f.write(json.dumps(d))
	finally:
	t2 = time.time()
	print "total time", t2 - t1
	d["totaltime"] = t2 - t1
	with file("data.json", "w") as f:
	f.write(json.dumps(d))


	if __name__ == '__main__':
	main()