The C code is in mm_build.py
, the Python code in mm.py
. To run the
experiment, use runner.py
(this takes a night on my laptop), to produce the
results report.py
. The raw data of the run that I did is in data.json
.
-
-
Save cfbolz/caa8299ebd5ab4e1203be1614a64bb54 to your computer and use it in GitHub Desktop.
{"totaltime": 26292.353258132935, "results": {"python": [517.691975117, 520.333083153, 511.448441982, 515.494250059, 506.672176123, 514.22834897, 505.720468044, 506.864086151, 515.821257114, 516.192405939, 505.791134834, 533.7814641, 517.232166767, 513.63011694, 510.345921993, 506.653492928, 506.376320124, 504.348473072, 513.866441011, 506.332873821, 515.717053175, 520.538961172, 515.469412088, 517.487001896, 516.497164965, 504.021055937, 503.471532106, 525.189102888, 506.854137897, 514.394644976, 508.216271877, 504.955558062, 516.272284985, 509.004635096, 505.126121044, 510.817576885, 515.186355829, 517.686940908, 516.346740007, 516.169267893, 506.458129883, 504.289647102, 525.733214855, 505.447642088, 514.629616976, 507.585628986, 517.344475031, 517.974838972, 505.141008139, 516.538213968], "pypy": [8.22706985474, 8.20490908623, 8.1530559063, 8.16551494598, 8.15609288216, 8.1767039299, 8.14587807655, 8.19161105156, 8.18439888954, 8.17135596275, 8.21987700462, 8.17626714706, 8.14857006073, 8.16234707832, 8.18099403381, 8.16414093971, 8.14245200157, 8.15396404266, 8.15938210487, 8.16832613945, 8.16422104836, 8.18561697006, 8.1420378685, 8.17985486984, 8.17965102196, 8.18215608597, 8.16230511665, 8.16200494766, 8.16942405701, 8.17053818703, 8.14694786072, 8.15651202202, 8.15218901634, 8.17239809036, 8.14671993256, 8.1669960022, 8.16178917885, 8.14932703972, 8.16051602364, 8.21270394325, 8.1445980072, 8.17343306541, 8.14998602867, 8.15777206421, 8.15343904495, 8.17025089264, 8.15408015251, 8.15962600708, 8.15500617027, 8.16067314148], "c": [2.19682002068, 2.09115219116, 2.08114385605, 2.07064008713, 2.07012295723, 2.09699106216, 2.21557402611, 2.10430884361, 2.09098100662, 2.10097098351, 2.07317018509, 2.09088897705, 2.08370494843, 2.21196484566, 2.08613705635, 2.06726098061, 2.19589710236, 2.08736300468, 2.07264590263, 2.18075013161, 2.16967606544, 2.15311193466, 2.16501998901, 2.08497905731, 2.09225296974, 2.17023515701, 2.22405099869, 2.08495020866, 2.13236093521, 2.16504502296, 2.11265397072, 2.2247030735, 2.24440002441, 2.24582791328, 2.24160599709, 2.24124002457, 2.24960708618, 2.24201798439, 2.24800515175, 2.24004483223, 2.24525809288, 2.1488878727, 2.24572300911, 2.2334690094, 2.24044394493, 2.27632403374, 2.23374199867, 2.1650660038, 2.24659991264, 2.18787097931], "numpy": [0.170726060867, 0.167151927948, 0.170216083527, 0.169290065765, 0.167598962784, 0.174237012863, 0.198158979416, 0.172521114349, 0.17888712883, 0.167629003525, 0.169377803802, 0.168882131577, 0.167588949203, 0.170213937759, 0.169671058655, 0.169701099396, 0.164860963821, 0.171133995056, 0.166845083237, 0.175467014313, 0.192009210587, 0.16708111763, 0.171844005585, 0.167322874069, 0.166540145874, 0.188297986984, 0.171252012253, 0.170870065689, 0.17115521431, 0.166064977646, 0.16881608963, 0.168605804443, 0.165496826172, 0.169096946716, 0.166150093079, 0.176074981689, 0.171052932739, 0.172950983047, 0.165434122086, 0.169627904892, 0.167983055115, 0.174736022949, 0.165910959244, 0.172050952911, 0.171135902405, 0.170073032379, 0.169723987579, 0.172288179398, 0.178520202637, 0.169209003448]}, "size": 1500} |
I watched the Hennessy & Patterson's Turing award lecture recently:
<iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/3LVeEjsn8Ts?rel=0" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>In it, there's a slide comparing the performance of various matrix multiplication implementations, using Python (presumably CPython) as a baseline and comparing that against various C implementations (I couldn't find the linked paper yet).
I expected the baseline speedup of switching from CPython to C to be much higher and I also wanted to know what performance PyPy gets, so I did my own benchmarks. This is a problem that Python is completely unsuited for, so it should give very exaggerated results.
The usual disclaimers apply: All benchmarks are lies, benchmarking of synthetic workloads even more so. My implementation is really naive (though I did optimize it a little bit to help CPython), don't use any of this code for anything real. The benchmarks ran on my rather old Intel i5-3230M laptop under Ubuntu 17.10.
With that said, my results are as follows:
Implementation | time | speedup over CPython | speedup over PyPy |
---|---|---|---|
CPython | 512.588 ± 2.362 s | 1 × | |
PyPy | 8.167 ± 0.007 s | 62.761 ± 0.295 × | 1 × |
'naive' C | 2.164 ± 0.025 s | 236.817 ± 2.918 × | 3.773 ± 0.044 × |
NumPy | 0.171 ± 0.002 s | 2992.286 ± 42.308 × | 47.678 ± 0.634 × |
This is running 1500x1500 matrix multiplications with (the same) random matrices. Every implementation is run 50 times in a fresh process. The results are averaged, the errors are bootstrapped 99% confidence intervals.
So indeed the speedup of switching from CPython to C is quite a bit higher than 47x! PyPy is much better than CPython, but of course can't really compete against GCC. And then the real professionals (numpy/OpenBLAS) are in a whole 'nother league. The speedup of the AVX numbers in the slide above is even higher than my NumPy numbers, which I assume is the result of my old CPU with two cores, vs. the 18 core CPU with AVX support. Lesson confirmed: leave matrix multiplication to people who actually know what they are doing.
import array | |
import numpy | |
from _mmc import ffi, lib | |
class M(object): | |
def __init__(self, data, shape): | |
# use an array for CPython's benefit, for PyPy a list would be fine too | |
self.data = array.array("d", data) | |
self.data = data | |
self.shape = shape | |
assert len(shape) == 2 | |
assert len(data) == shape[0] * shape[1] | |
@staticmethod | |
def fromshape(shape): | |
return M([0.0] * (shape[0] * shape[1]), shape) | |
def __getitem__(self, index): | |
return self.data[index[0] * self.shape[1] + index[1]] | |
def __setitem__(self, index, value): | |
self.data[index[0] * self.shape[1] + index[1]] = value | |
def matmul(self, other): | |
# approach from: | |
# https://tavianator.com/a-quick-trick-for-faster-naive-matrix-multiplication/ | |
assert self.shape[1] == other.shape[0] | |
rows = self.shape[0] | |
mid = self.shape[1] | |
columns = other.shape[1] | |
result = M.fromshape((rows, columns)) | |
selfdata = self.data | |
otherdata = other.data | |
resultdata = result.data | |
baseresultindex = 0 | |
baseselfindex = 0 | |
for i in range(rows): | |
baseotherindex = 0 | |
for k in range(mid): | |
selfik = selfdata[baseselfindex + k] | |
j = 0 | |
while j < columns: | |
resultdata[baseresultindex + j] += selfik * otherdata[baseotherindex + j] | |
j += 1 | |
baseotherindex += columns | |
baseresultindex += columns | |
baseselfindex += mid | |
return result | |
def test_index(): | |
x = M([1, 2, 3, 4, 5, 6], (2, 3)) | |
assert x[0, 0] == 1 | |
assert x[0, 1] == 2 | |
assert x[0, 2] == 3 | |
assert x[1, 0] == 4 | |
assert x[1, 1] == 5 | |
assert x[1, 2] == 6 | |
def test_matmul(): | |
x = M([1, 2, 1, 2, 2, 3], (3, 2)) | |
y = M([1, 2, 3, 7, 4, 5, 6, 2], (2, 4)) | |
assert x.matmul(y).data == [9, 12, 15, 11, 9, 12, 15, 11, 14, 19, 24, 20] | |
def main(): | |
import time, random, sys | |
random.seed(1) | |
method = sys.argv[1] | |
size = int(sys.argv[2]) | |
d1 = [random.random() for i in range(size**2)] | |
d2 = [random.random() for i in range(size**2)] | |
if method == "python": | |
m1 = M(d1, (size, size)) | |
m2 = M(d2, (size, size)) | |
t1 = time.time() | |
m1.matmul(m2) | |
t2 = time.time() | |
print t2 - t1 | |
elif method == "numpy": | |
m1 = numpy.array(d1) | |
m1.shape = (size, size) | |
m2 = numpy.array(d2) | |
m2.shape = (size, size) | |
t1 = time.time() | |
numpy.matmul(m1, m2) | |
t2 = time.time() | |
print t2 - t1 | |
elif method == "c": | |
m1 = ffi.new("double[]", size*size) | |
m2 = ffi.new("double[]", size*size) | |
m3 = ffi.new("double[]", size*size) | |
for i, x in enumerate(d1): | |
m1[i] = x | |
for i, x in enumerate(d2): | |
m2[i] = x | |
t1 = time.time() | |
lib.matmul(m3, m1, m2, size, size, size) | |
t2 = time.time() | |
print t2 - t1 | |
if __name__ == '__main__': | |
main() |
from cffi import FFI | |
ffibuilder = FFI() | |
# C code from: https://tavianator.com/a-quick-trick-for-faster-naive-matrix-multiplication/ | |
ffibuilder.set_source("_mmc", | |
r""" | |
void matmul(double *dest, const double *lhs, const double *rhs, | |
size_t rows, size_t mid, size_t cols) { | |
memset(dest, 0, rows * cols * sizeof(double)); | |
for (size_t i = 0; i < rows; ++i) { | |
const double *rhs_row = rhs; | |
for (size_t j = 0; j < mid; ++j) { | |
for (size_t k = 0; k < cols; ++k) { | |
dest[k] += lhs[j] * rhs_row[k]; | |
} | |
rhs_row += cols; | |
} | |
dest += cols; | |
lhs += mid; | |
} | |
} | |
""", | |
libraries=[], | |
extra_compile_args=["-O3", "-mtune=native"]) | |
ffibuilder.cdef(""" | |
// declarations that are shared between Python and C | |
void matmul(double *dest, const double *lhs, const double *rhs, | |
size_t rows, size_t mid, size_t cols); | |
""") | |
if __name__ == "__main__": | |
ffibuilder.compile(verbose=True) |
import json | |
import random | |
import math | |
from decimal import Decimal, ROUND_DOWN, ROUND_UP | |
REPETITIONS = 100000 | |
# some copy-pasted old code | |
def _mean(l): | |
return math.fsum(l) / float(len(l)) | |
def bootstrap_sample(values): | |
return [random.choice(values) for i in range(len(values))] | |
def confidence_slice(means, confidence="0.99"): | |
means = sorted(means) | |
# There may be >1 median indices, i.e. data is even-sized. | |
lower, middle_indicies, upper = _confidence_slice_indicies(len(means), confidence) | |
median = _mean([means[i] for i in middle_indicies]) | |
return means[lower], median, means[upper - 1] # upper is *exclusive* | |
def _confidence_slice_indicies(length, confidence_level=Decimal('0.95')): | |
"""Returns a triple (lower, mean_indicies, upper) so that l[lower:upper] | |
gives confidence_level of all samples. Mean_indicies is a tuple of one or | |
two indicies that correspond to the mean position | |
Keyword arguments: | |
confidence_level -- desired level of confidence as a Decimal instance. | |
""" | |
assert not isinstance(confidence_level, float) | |
confidence_level = Decimal(confidence_level) | |
assert isinstance(confidence_level, Decimal) | |
exclude = (1 - confidence_level) / 2 | |
if length % 2 == 0: | |
mean_indicies = (length // 2 - 1, length // 2) | |
else: | |
mean_indicies = (length // 2, ) | |
lower_index = int( | |
(exclude * length).quantize(Decimal('1.'), rounding=ROUND_DOWN) | |
) | |
upper_index = int( | |
((1 - exclude) * length).quantize(Decimal('1.'), rounding=ROUND_UP) | |
) | |
return lower_index, mean_indicies, upper_index | |
def format_error(val, err, unit='s'): | |
return "%s±%s %s" % (round(val, 3), round(err, 3), unit) | |
def compute_speedup(l, baseline): | |
speedups = [] | |
for i in range(REPETITIONS): | |
speedups.append(_mean(bootstrap_sample(baseline)) / _mean(bootstrap_sample(l))) | |
lower, mid, upper = confidence_slice(speedups) | |
return mid, _mean([mid - lower, upper - mid]) | |
NICE_NAMES = dict(python="CPython", pypy="PyPy", numpy="NumPy", c="'naive' C") | |
def main(): | |
with file("data.json") as f: | |
data = json.load(f) | |
print "matrix size", data["size"] | |
print " " * 20, "time".rjust(30), "speedup over CPython".rjust(30), "speedup over PyPy".rjust(30) | |
for name in "python pypy c numpy".split(): | |
l = data["results"][name] | |
samples = [_mean(bootstrap_sample(l)) for i in range(REPETITIONS)] | |
lower, mid, upper = confidence_slice(samples) | |
print NICE_NAMES[name].rjust(20), format_error(_mean(l), _mean([mid - lower, upper - mid])).rjust(30), | |
if name == "python": | |
print "1 ×".rjust(30) | |
else: | |
speedup, speeduperr = compute_speedup(l, data["results"]["python"]) | |
print format_error(speedup, speeduperr, "×").rjust(30), | |
if name == "pypy": | |
print "1 ×".rjust(30) | |
else: | |
speedup, speeduperr = compute_speedup(l, data["results"]["pypy"]) | |
print format_error(speedup, speeduperr, "×").rjust(30) | |
if __name__ == '__main__': | |
main() |
import subprocess | |
import time | |
import json | |
SIZE = 1500 | |
TIMES = 50 | |
def run_repeatedly(cmd, times=TIMES): | |
print cmd | |
res = [] | |
for i in range(times): | |
output = subprocess.check_output(cmd, shell=True) | |
res.append(float(output.strip())) | |
print i, res[-1] | |
return res | |
def main(): | |
t1 = time.time() | |
d = dict(size=SIZE, results={}) | |
try: | |
for name, cmd in [ | |
("numpy", "OPENBLAS_NUM_THREADS=2 python mm.py numpy"), | |
("c", "python mm.py c"), | |
("pypy", "pypy mm.py python"), | |
("python", "python mm.py python"), | |
]: | |
res = run_repeatedly("%s %s" % (cmd, SIZE)) | |
print name, res | |
d["results"][name] = res | |
with file("data.json", "w") as f: | |
f.write(json.dumps(d)) | |
finally: | |
t2 = time.time() | |
print "total time", t2 - t1 | |
d["totaltime"] = t2 - t1 | |
with file("data.json", "w") as f: | |
f.write(json.dumps(d)) | |
if __name__ == '__main__': | |
main() |