@Veedrac
Last active May 5, 2019 12:02
Speed test. Warning: Please read first comment
just_some_cplusplus.cpp

#include <cstdint>
#include <iostream>
#include <iterator>
#include <unordered_set>
#include <vector>

int32_t some_calculations(int32_t number) {
    std::vector<int32_t> a;
    std::unordered_set<int32_t> s;

    // This is the fraction that PyPy uses:
    // http://stackoverflow.com/q/25968487/1763356
    //
    // You can go faster by making this even smaller,
    // but I'm already letting C++ use 32 bit integers
    // and calling reserve on the vector.
    s.max_load_factor(2./3);
    a.reserve(number);

    int32_t x = 0;
    for (int32_t i=0; i<number; ++i) {
        x += i;
        int32_t item = i%2 ? -x : x;
        s.insert(item);
        a.emplace_back(item);
    }

    int32_t tot = 0;
    for (auto x=std::begin(a); x != std::end(a); ++x) {
        for (auto y=std::next(x); y != std::end(a); ++y) {
            if (-(*x+*y) != *x && -(*x+*y) != *y && s.find(-(*x+*y)) != std::end(s)) {
                ++tot;
            }
        }
    }
    return tot / 3;
}

int main(int, char **) {
    int32_t tot = 0;
    for (int i=0; i<500; ++i) {
        tot += some_calculations(i);
    }
    std::cout << tot << std::endl;
}
just_some_python.py

def some_calculations(number):
    a = []
    x = 0
    for i in range(number):
        x += i
        a.append(-x if i%2 else x)
    s = set(a)

    tot = 0
    for i, x in enumerate(a):
        for y in a[i+1:]:
            if -(x + y) not in (x, y) and -(x + y) in s:
                tot += 1
    return tot // 3

print(sum(map(some_calculations, range(500))))
"""
Graph the time to run the input commands.
Usage:
plot_times.py <n> <command>...
"""
import docopt
import numpy
import resource
import seaborn
import shlex
import subprocess
import sys
from matplotlib import pyplot
options = docopt.docopt(__doc__)
try:
repeats = int(options["<n>"])
except ValueError:
print("<n> has to be an integer.")
raise SystemExit
datas = []
# Time
for raw_command in options["<command>"]:
command = shlex.split(raw_command)
data = [resource.getrusage(resource.RUSAGE_CHILDREN).ru_utime]
for i in range(repeats):
print("\r{}: {} of {}".format(raw_command, i+1, repeats), end="")
sys.stdout.flush()
subprocess.check_output(command)
data.append(resource.getrusage(resource.RUSAGE_CHILDREN).ru_utime)
print()
datas.append(data)
# Plot
figure = pyplot.figure(figsize=(24, 6))
seaborn.set(style="white", font_scale=2)
for command, data in zip(options["<command>"], datas):
times = numpy.diff(data)
seaborn.distplot(
times,
label=command,
bins=len(set(times.round(2))),
norm_hist=True,
kde_kws={"clip": (min(times), max(times))},
hist_kws={"histtype": "stepfilled", "alpha": 0.2}
)
seaborn.despine()
pyplot.title("Time to run")
pyplot.legend()
figure.savefig("time_to_run.png")
Veedrac commented Sep 22, 2014

Edit:

Like all benchmarks, this one is flawed. PyPy does resize at a load factor of 2/3, yes, but it resizes by a factor of 4, not a factor of 2 as C++ does. Accounting for this,

// Just doing "s.max_load_factor(2./3)" isn't enough
// as PyPy resizes by a factor of 4
int32_t reserved = 2;
while (2 * reserved <= 3 * number) { reserved <<= 2; }
s.reserve(reserved);

makes C++ come out a few percentage points faster than PyPy. Sadly, one can no longer claim PyPy to be faster, although getting C++ to a competitive point took a lot more effort.

I wouldn't just dismiss parity with heavily-optimized C++, though!


>>> time pypy just_some_python.py
52676
pypy just_some_python.py  0.34s user 0.01s system 99% cpu 0.348 total
>>> g++ -std=c++14 -O3 just_some_cplusplus.cpp -o just_some_cplusplus.out
>>> time ./just_some_cplusplus.out
52676
./just_some_cplusplus.out  0.44s user 0.00s system 99% cpu 0.437 total

Versions, on reminder from wyldphyre:

>>> pypy --version
Python 2.7.6 (3cf384e86ef7, Jul 09 2014, 04:28:24)
[PyPy 2.4.0-alpha0 with GCC 4.9.0 20140604 (prerelease)]
>>> g++ --version
g++ (GCC) 4.9.1

perf stat output, on request from Imxset21:

>>> perf stat -B pypy just_some_python.py
52676

 Performance counter stats for 'pypy just_some_python.py':

        356.810792      task-clock (msec)         #    1.000 CPUs utilized          
                 1      context-switches          #    0.003 K/sec                  
                 1      cpu-migrations            #    0.003 K/sec                  
             7,747      page-faults               #    0.022 M/sec                  
     1,122,951,081      cycles                    #    3.147 GHz                    
   <not supported>      stalled-cycles-frontend  
   <not supported>      stalled-cycles-backend   
     1,842,013,319      instructions              #    1.64  insns per cycle        
       426,880,039      branches                  # 1196.376 M/sec                  
        11,606,870      branch-misses             #    2.72% of all branches        

       0.356679893 seconds time elapsed
>>> perf stat -B ./just_some_cplusplus.out
52676

 Performance counter stats for './just_some_cplusplus.out':

        440.630517      task-clock (msec)         #    1.001 CPUs utilized          
                 1      context-switches          #    0.002 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
               102      page-faults               #    0.231 K/sec                  
     1,390,791,156      cycles                    #    3.156 GHz                    
   <not supported>      stalled-cycles-frontend  
   <not supported>      stalled-cycles-backend   
       553,660,623      instructions              #    0.40  insns per cycle        
       126,883,541      branches                  #  287.959 M/sec                  
        11,647,020      branch-misses             #    9.18% of all branches        

       0.440386974 seconds time elapsed

msiemens wanted more data than just a single time, so I plotted it:

python plot_times.py 128 "./just_some_cplusplus.out" "./just_some_cplusplus_default_load_factor.out" "pypy just_some_python.py" "pypy3 just_some_python.py"

[Figure: density estimate and histogram of time to run, comparing C++ with both a default load factor and a load factor of 2/3, and PyPy and PyPy3]

@jkholodnov

I suspect this is due to Python storing smaller numbers as singletons. Thoughts?


alex commented Sep 23, 2014

@jkholodnov That wouldn't be it: a) PyPy doesn't have that optimization, and b) the reason CPython has that optimization is that its int objects are full Python objects allocated on the heap. In C++, int and friends are just words on the stack, so there is no heap allocation and therefore no allocation to optimize away.
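For reference, the CPython small-int cache being discussed (a CPython implementation detail, commonly covering -5 through 256, and not present in PyPy) can be observed like this; int(str) is used to defeat compile-time constant folding:

```python
# CPython pre-allocates one singleton per small integer (roughly -5..256),
# so any computation producing such a value returns the cached object.
small_a = int("124")
small_b = int("124")
print(small_a is small_b)   # cached: both names point at the same singleton

big_a = int("1024")
big_b = int("1024")
print(big_a is big_b)       # outside the cache: usually two distinct objects
```

This only makes sense in a runtime where every int is a heap object, which is exactly alex's point about why C++ needs no such cache.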


AKJ7 commented May 5, 2019

I know that this gist is old, but why didn't you use the optimized C++ compilation?
