Instantly share code, notes, and snippets.

@cfbolz /README.md Secret
Last active Jul 14, 2018

Embed
What would you like to do?
matrix multiplication pure python benchmarks

The C code is in mm_build.py, the Python code in mm.py. To run the experiment, use runner.py (this takes a night on my laptop), to produce the results report.py. The raw data of the run that I did is in data.json.

{"totaltime": 26292.353258132935, "results": {"python": [517.691975117, 520.333083153, 511.448441982, 515.494250059, 506.672176123, 514.22834897, 505.720468044, 506.864086151, 515.821257114, 516.192405939, 505.791134834, 533.7814641, 517.232166767, 513.63011694, 510.345921993, 506.653492928, 506.376320124, 504.348473072, 513.866441011, 506.332873821, 515.717053175, 520.538961172, 515.469412088, 517.487001896, 516.497164965, 504.021055937, 503.471532106, 525.189102888, 506.854137897, 514.394644976, 508.216271877, 504.955558062, 516.272284985, 509.004635096, 505.126121044, 510.817576885, 515.186355829, 517.686940908, 516.346740007, 516.169267893, 506.458129883, 504.289647102, 525.733214855, 505.447642088, 514.629616976, 507.585628986, 517.344475031, 517.974838972, 505.141008139, 516.538213968], "pypy": [8.22706985474, 8.20490908623, 8.1530559063, 8.16551494598, 8.15609288216, 8.1767039299, 8.14587807655, 8.19161105156, 8.18439888954, 8.17135596275, 8.21987700462, 8.17626714706, 8.14857006073, 8.16234707832, 8.18099403381, 8.16414093971, 8.14245200157, 8.15396404266, 8.15938210487, 8.16832613945, 8.16422104836, 8.18561697006, 8.1420378685, 8.17985486984, 8.17965102196, 8.18215608597, 8.16230511665, 8.16200494766, 8.16942405701, 8.17053818703, 8.14694786072, 8.15651202202, 8.15218901634, 8.17239809036, 8.14671993256, 8.1669960022, 8.16178917885, 8.14932703972, 8.16051602364, 8.21270394325, 8.1445980072, 8.17343306541, 8.14998602867, 8.15777206421, 8.15343904495, 8.17025089264, 8.15408015251, 8.15962600708, 8.15500617027, 8.16067314148], "c": [2.19682002068, 2.09115219116, 2.08114385605, 2.07064008713, 2.07012295723, 2.09699106216, 2.21557402611, 2.10430884361, 2.09098100662, 2.10097098351, 2.07317018509, 2.09088897705, 2.08370494843, 2.21196484566, 2.08613705635, 2.06726098061, 2.19589710236, 2.08736300468, 2.07264590263, 2.18075013161, 2.16967606544, 2.15311193466, 2.16501998901, 2.08497905731, 2.09225296974, 2.17023515701, 2.22405099869, 2.08495020866, 2.13236093521, 2.16504502296, 2.11265397072, 2.2247030735, 2.24440002441, 2.24582791328, 2.24160599709, 2.24124002457, 2.24960708618, 2.24201798439, 2.24800515175, 2.24004483223, 2.24525809288, 2.1488878727, 2.24572300911, 2.2334690094, 2.24044394493, 2.27632403374, 2.23374199867, 2.1650660038, 2.24659991264, 2.18787097931], "numpy": [0.170726060867, 0.167151927948, 0.170216083527, 0.169290065765, 0.167598962784, 0.174237012863, 0.198158979416, 0.172521114349, 0.17888712883, 0.167629003525, 0.169377803802, 0.168882131577, 0.167588949203, 0.170213937759, 0.169671058655, 0.169701099396, 0.164860963821, 0.171133995056, 0.166845083237, 0.175467014313, 0.192009210587, 0.16708111763, 0.171844005585, 0.167322874069, 0.166540145874, 0.188297986984, 0.171252012253, 0.170870065689, 0.17115521431, 0.166064977646, 0.16881608963, 0.168605804443, 0.165496826172, 0.169096946716, 0.166150093079, 0.176074981689, 0.171052932739, 0.172950983047, 0.165434122086, 0.169627904892, 0.167983055115, 0.174736022949, 0.165910959244, 0.172050952911, 0.171135902405, 0.170073032379, 0.169723987579, 0.172288179398, 0.178520202637, 0.169209003448]}, "size": 1500}

Repeating a Matrix Multiplication Benchmark

I watched the Hennessy & Patterson's Turing award lecture recently:

<iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/3LVeEjsn8Ts?rel=0" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>

In it, there's a slide comparing the performance of various matrix multiplication implementations, using Python (presumably CPython) as a baseline and comparing that against various C implementations (I couldn't find the linked paper yet). slide

I expected the baseline speedup of switching from CPython to C to be much higher and I also wanted to know what performance PyPy gets, so I did my own benchmarks. This is a problem that Python is completely unsuited for, so it should give very exaggerated results.

The usual disclaimers apply: All benchmarks are lies, benchmarking of synthetic workloads even more so. My implementation is really naive (though I did optimize it a little bit to help CPython), don't use any of this code for anything real. The benchmarks ran on my rather old Intel i5-3230M laptop under Ubuntu 17.10.

With that said, my results are as follows:

Implementation time speedup over CPython speedup over PyPy
CPython 512.588 ± 2.362 s 1 ×
PyPy 8.167 ± 0.007 s 62.761 ± 0.295 × 1 ×
'naive' C 2.164 ± 0.025 s 236.817 ± 2.918 × 3.773 ± 0.044 ×
NumPy 0.171 ± 0.002 s 2992.286 ± 42.308 × 47.678 ± 0.634 ×

This is running 1500x1500 matrix multiplications with (the same) random matrices. Every implementation is run 50 times in a fresh process. The results are averaged, the errors are bootstrapped 99% confidence intervals.

So indeed the speedup of switching from CPython to C is quite a bit higher than 47x! PyPy is much better than CPython, but of course can't really compete against GCC. And then the real professionals (numpy/OpenBLAS) are in a whole 'nother league. The speedup of the AVX numbers in the slide above is even higher than my NumPy numbers, which I assume is the result of my old CPU with two cores, vs. the 18 core CPU with AVX support. Lesson confirmed: leave matrix multiplication to people who actually know what they are doing.

import array
import numpy
from _mmc import ffi, lib
class M(object):
def __init__(self, data, shape):
# use an array for CPython's benefit, for PyPy a list would be fine too
self.data = array.array("d", data)
self.data = data
self.shape = shape
assert len(shape) == 2
assert len(data) == shape[0] * shape[1]
@staticmethod
def fromshape(shape):
return M([0.0] * (shape[0] * shape[1]), shape)
def __getitem__(self, index):
return self.data[index[0] * self.shape[1] + index[1]]
def __setitem__(self, index, value):
self.data[index[0] * self.shape[1] + index[1]] = value
def matmul(self, other):
# approach from:
# https://tavianator.com/a-quick-trick-for-faster-naive-matrix-multiplication/
assert self.shape[1] == other.shape[0]
rows = self.shape[0]
mid = self.shape[1]
columns = other.shape[1]
result = M.fromshape((rows, columns))
selfdata = self.data
otherdata = other.data
resultdata = result.data
baseresultindex = 0
baseselfindex = 0
for i in range(rows):
baseotherindex = 0
for k in range(mid):
selfik = selfdata[baseselfindex + k]
j = 0
while j < columns:
resultdata[baseresultindex + j] += selfik * otherdata[baseotherindex + j]
j += 1
baseotherindex += columns
baseresultindex += columns
baseselfindex += mid
return result
def test_index():
x = M([1, 2, 3, 4, 5, 6], (2, 3))
assert x[0, 0] == 1
assert x[0, 1] == 2
assert x[0, 2] == 3
assert x[1, 0] == 4
assert x[1, 1] == 5
assert x[1, 2] == 6
def test_matmul():
x = M([1, 2, 1, 2, 2, 3], (3, 2))
y = M([1, 2, 3, 7, 4, 5, 6, 2], (2, 4))
assert x.matmul(y).data == [9, 12, 15, 11, 9, 12, 15, 11, 14, 19, 24, 20]
def main():
import time, random, sys
random.seed(1)
method = sys.argv[1]
size = int(sys.argv[2])
d1 = [random.random() for i in range(size**2)]
d2 = [random.random() for i in range(size**2)]
if method == "python":
m1 = M(d1, (size, size))
m2 = M(d2, (size, size))
t1 = time.time()
m1.matmul(m2)
t2 = time.time()
print t2 - t1
elif method == "numpy":
m1 = numpy.array(d1)
m1.shape = (size, size)
m2 = numpy.array(d2)
m2.shape = (size, size)
t1 = time.time()
numpy.matmul(m1, m2)
t2 = time.time()
print t2 - t1
elif method == "c":
m1 = ffi.new("double[]", size*size)
m2 = ffi.new("double[]", size*size)
m3 = ffi.new("double[]", size*size)
for i, x in enumerate(d1):
m1[i] = x
for i, x in enumerate(d2):
m2[i] = x
t1 = time.time()
lib.matmul(m3, m1, m2, size, size, size)
t2 = time.time()
print t2 - t1
if __name__ == '__main__':
main()
from cffi import FFI
ffibuilder = FFI()
# C code from: https://tavianator.com/a-quick-trick-for-faster-naive-matrix-multiplication/
ffibuilder.set_source("_mmc",
r"""
void matmul(double *dest, const double *lhs, const double *rhs,
size_t rows, size_t mid, size_t cols) {
memset(dest, 0, rows * cols * sizeof(double));
for (size_t i = 0; i < rows; ++i) {
const double *rhs_row = rhs;
for (size_t j = 0; j < mid; ++j) {
for (size_t k = 0; k < cols; ++k) {
dest[k] += lhs[j] * rhs_row[k];
}
rhs_row += cols;
}
dest += cols;
lhs += mid;
}
}
""",
libraries=[],
extra_compile_args=["-O3", "-mtune=native"])
ffibuilder.cdef("""
// declarations that are shared between Python and C
void matmul(double *dest, const double *lhs, const double *rhs,
size_t rows, size_t mid, size_t cols);
""")
if __name__ == "__main__":
ffibuilder.compile(verbose=True)
import json
import random
import math
from decimal import Decimal, ROUND_DOWN, ROUND_UP
REPETITIONS = 100000
# some copy-pasted old code
def _mean(l):
return math.fsum(l) / float(len(l))
def bootstrap_sample(values):
return [random.choice(values) for i in range(len(values))]
def confidence_slice(means, confidence="0.99"):
means = sorted(means)
# There may be >1 median indices, i.e. data is even-sized.
lower, middle_indicies, upper = _confidence_slice_indicies(len(means), confidence)
median = _mean([means[i] for i in middle_indicies])
return means[lower], median, means[upper - 1] # upper is *exclusive*
def _confidence_slice_indicies(length, confidence_level=Decimal('0.95')):
"""Returns a triple (lower, mean_indicies, upper) so that l[lower:upper]
gives confidence_level of all samples. Mean_indicies is a tuple of one or
two indicies that correspond to the mean position
Keyword arguments:
confidence_level -- desired level of confidence as a Decimal instance.
"""
assert not isinstance(confidence_level, float)
confidence_level = Decimal(confidence_level)
assert isinstance(confidence_level, Decimal)
exclude = (1 - confidence_level) / 2
if length % 2 == 0:
mean_indicies = (length // 2 - 1, length // 2)
else:
mean_indicies = (length // 2, )
lower_index = int(
(exclude * length).quantize(Decimal('1.'), rounding=ROUND_DOWN)
)
upper_index = int(
((1 - exclude) * length).quantize(Decimal('1.'), rounding=ROUND_UP)
)
return lower_index, mean_indicies, upper_index
def format_error(val, err, unit='s'):
return "%s±%s %s" % (round(val, 3), round(err, 3), unit)
def compute_speedup(l, baseline):
speedups = []
for i in range(REPETITIONS):
speedups.append(_mean(bootstrap_sample(baseline)) / _mean(bootstrap_sample(l)))
lower, mid, upper = confidence_slice(speedups)
return mid, _mean([mid - lower, upper - mid])
NICE_NAMES = dict(python="CPython", pypy="PyPy", numpy="NumPy", c="'naive' C")
def main():
with file("data.json") as f:
data = json.load(f)
print "matrix size", data["size"]
print " " * 20, "time".rjust(30), "speedup over CPython".rjust(30), "speedup over PyPy".rjust(30)
for name in "python pypy c numpy".split():
l = data["results"][name]
samples = [_mean(bootstrap_sample(l)) for i in range(REPETITIONS)]
lower, mid, upper = confidence_slice(samples)
print NICE_NAMES[name].rjust(20), format_error(_mean(l), _mean([mid - lower, upper - mid])).rjust(30),
if name == "python":
print "1 ×".rjust(30)
else:
speedup, speeduperr = compute_speedup(l, data["results"]["python"])
print format_error(speedup, speeduperr, "×").rjust(30),
if name == "pypy":
print "1 ×".rjust(30)
else:
speedup, speeduperr = compute_speedup(l, data["results"]["pypy"])
print format_error(speedup, speeduperr, "×").rjust(30)
if __name__ == '__main__':
main()
import subprocess
import time
import json
SIZE = 1500
TIMES = 50
def run_repeatedly(cmd, times=TIMES):
print cmd
res = []
for i in range(times):
output = subprocess.check_output(cmd, shell=True)
res.append(float(output.strip()))
print i, res[-1]
return res
def main():
t1 = time.time()
d = dict(size=SIZE, results={})
try:
for name, cmd in [
("numpy", "OPENBLAS_NUM_THREADS=2 python mm.py numpy"),
("c", "python mm.py c"),
("pypy", "pypy mm.py python"),
("python", "python mm.py python"),
]:
res = run_repeatedly("%s %s" % (cmd, SIZE))
print name, res
d["results"][name] = res
with file("data.json", "w") as f:
f.write(json.dumps(d))
finally:
t2 = time.time()
print "total time", t2 - t1
d["totaltime"] = t2 - t1
with file("data.json", "w") as f:
f.write(json.dumps(d))
if __name__ == '__main__':
main()
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment