This test was run on my desktop development machine, which is fairly beefy but not particularly new anymore. It has four physical cores and four virtual (hyperthreaded) ones, and runs at ~3.2 GHz. Memory isn't relevant to this test.
The machine runs Windows 10, but the shell used here was Cygwin64.
coriolinus@robocomp ~
$ rustc --version && cargo --version && python3 --version
rustc 1.4.0 (8ab8581f6 2015-10-27)
cargo 0.5.0-nightly (833b947 2015-09-13)
Python 3.4.3
All code used in this test is available here. This test was run on commit 2d6fddc.
Rust is a systems language priding itself on its speed and safety. I'm just learning it, so I'm using the adventofcode.com exercises as learning projects. I was pretty happy with the code I came up with here: it uses lightweight threads to eat up as much processor power as you're willing to give it. I'm not convinced the code is completely optimized--in particular, there's a collect phase, which ensures that an opportunistic thread working on a higher unit of work doesn't return an incorrect answer, that I'm fairly confident could be optimized away somehow. At any rate, development of this program in Rust took about 7 hours all told, for 176 lines of code. The build is in release mode, which enables compiler optimizations.
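The chunk-plus-collect idea can be sketched in Python (this is an illustrative sketch of the scheme, not the repo's actual Rust code; the chunk size of 1024 matches the unit of work discussed below):

```python
import hashlib

CHUNK = 1024  # work-unit size, matching the one discussed in the text

def search_chunk(key, chunk_index, prefix="00000"):
    """Scan one chunk of candidate suffixes; return the first hit, or None."""
    start = chunk_index * CHUNK
    for n in range(start, start + CHUNK):
        digest = hashlib.md5((key + str(n)).encode()).hexdigest()
        if digest.startswith(prefix):
            return n
    return None

def collect(candidates):
    """Collect phase over all workers' results.

    An opportunistic worker on a later chunk can finish first, so only
    the numerically smallest hit across all chunks is the correct answer.
    """
    hits = [c for c in candidates if c is not None]
    return min(hits) if hits else None
```

The collect phase is exactly the part described above: without the final `min`, a fast worker on a higher chunk could report a larger (wrong) coin first.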
Python is a dynamic, interpreted language priding itself on ease of development. I'm fairly experienced with it. Thanks to its Global Interpreter Lock, it's effectively a single-threaded language for CPU-bound work, and I didn't want to spend the time using the multiprocessing module to work around the GIL, so my implementation here uses only the main thread. Total development time was around 20 minutes, for 24 lines of code.
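For reference, the core of a single-threaded solution looks roughly like this (a sketch, not the repo's exact code; the five-zero prefix is the puzzle's target):

```python
import hashlib
from itertools import count

def mine(key, prefix="00000"):
    """Return the lowest positive n such that md5(key + str(n)) starts with prefix."""
    for n in count(1):
        if hashlib.md5((key + str(n)).encode()).hexdigest().startswith(prefix):
            return n
```

With the puzzle statement's sample key "abcdef", this returns 609043, as the problem text specifies.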
coriolinus@robocomp ~
$ git clone https://github.com/coriolinus/adventofcode-2015.git
Cloning into 'adventofcode-2015'...
remote: Counting objects: 174, done.
remote: Compressing objects: 100% (87/87), done.
remote: Total 174 (delta 74), reused 172 (delta 72), pack-reused 0
Receiving objects: 100% (174/174), 31.09 KiB | 0 bytes/s, done.
Resolving deltas: 100% (74/74), done.
Checking connectivity... done.
coriolinus@robocomp ~
$ cd adventofcode-2015/day4
coriolinus@robocomp ~/adventofcode-2015/day4
$ cargo build --release
Compiling libc v0.1.12
Compiling winapi-build v0.1.1
Compiling winapi v0.2.5
Compiling util v0.1.0 (file:///D:/cygwin64/home/coriolinus/adventofcode-2015/day4)
Compiling rustc-serialize v0.3.16
Compiling libc v0.2.2
Compiling advapi32-sys v0.1.2
Compiling kernel32-sys v0.2.1
Compiling gcc v0.3.20
Compiling rand v0.3.12
Compiling num_cpus v0.2.10
Compiling time v0.1.34
Compiling rust-crypto v0.2.34
Compiling day4 v0.1.0 (file:///D:/cygwin64/home/coriolinus/adventofcode-2015/day4)
coriolinus@robocomp ~/adventofcode-2015/day4
$ time echo coriolinus | ./target/release/day4bin.exe
Enter your secret key here: Coin: 2715135
real 0m3.038s
user 0m0.000s
sys 0m0.000s
coriolinus@robocomp ~/adventofcode-2015/day4
$ time echo coriolinus | python3 ./py/day4.py
Enter your secret key here: Coin: 2715135
real 0m8.285s
user 0m8.234s
sys 0m0.000s
The nature of the test is such that the Coin value is a lower bound on the number of MD5 hashes computed. Both programs agreed on the correct answer, meaning each of them computed at least 2.7 million hashes.
The nature of the implementations is a confounding factor: the Python version can return immediately once it has the answer, while the Rust version has to wait for all eight threads to complete their last units of work. That said, the unit of work chosen was only 1024 hashes: in the absolute worst case, the program must wait for only 7168 additional hashes (1024 from each of the seven other threads). Given the overall throughput, that represents only about 8 additional milliseconds.
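The worst-case wait works out as follows, using the numbers from the eight-thread run above:

```python
CHUNK = 1024      # work-unit size
THREADS = 8
COIN = 2_715_135  # hashes computed, at minimum
ELAPSED = 3.038   # seconds, from the timing above

# Worst case: the answer arrives just as the other seven threads start a chunk.
extra_hashes = (THREADS - 1) * CHUNK          # 7168
throughput = COIN / ELAPSED                   # ~894,000 hashes/sec overall
extra_ms = extra_hashes / throughput * 1000   # ~8 ms of extra latency
```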
The results here shocked me: yes, Rust is faster--but remember, it has eight times the threads to use. Rust's per-thread hash rate was only 112,000/sec, while Python's was 328,000/sec. Why is Python so much faster per thread?
I don't know. I suspect that the hashlib.md5() function used in the Python version is more highly optimized than the rust_crypto::md5::Md5 object used in the Rust version. It's fairly common in Python to encounter standard library functions with optimized C implementations rather than pure Python. The Rust library, on the other hand, was pure Rust and looked fairly straightforward at a glance, which implies that it's not playing many optimization tricks.
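One way to probe the hypothesis is to benchmark hashlib.md5 in isolation (a rough sketch; the key and iteration count are arbitrary, and absolute numbers will vary by machine):

```python
import hashlib
import time

def md5_rate(key=b"coriolinus", n=100_000):
    """Return single-threaded MD5 hashes per second over key+counter inputs."""
    start = time.perf_counter()
    for i in range(n):
        hashlib.md5(key + str(i).encode()).hexdigest()
    return n / (time.perf_counter() - start)
```

On hardware like the machine described above, this should land in the same few-hundred-thousand-hashes-per-second range as the measured Python run, since the mining loop does little besides hash.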
The other possibility, of course, is that my Rust code kind of sucks and I've introduced inadvertent delays. In that case, I'd love to see patch requests, particularly if they include explanations of where I went wrong!
As redditor merdiocracy points out, threading doesn't scale linearly. In fact, in this case, it significantly affected the results. Running with a single core gives these results:
coriolinus@robocomp ~/adventofcode-2015/day4
$ cat src/bin.rs | grep cpus
CoinMiningConfig::default().cpus(1)) {
coriolinus@robocomp ~/adventofcode-2015/day4
$ time echo coriolinus | ./target/release/day4bin.exe
Enter your secret key here: Coin: 2715135
real 0m2.562s
user 0m0.015s
sys 0m0.015s
Not only is that more than three times as fast as the Python version, it also shaves half a second off the eight-thread Rust run! What's going on with that?
coriolinus@robocomp ~/adventofcode-2015/day4
$ cat src/bin.rs | grep cpus
CoinMiningConfig::default().cpus(4)) {
coriolinus@robocomp ~/adventofcode-2015/day4
$ time echo coriolinus | ./target/release/day4bin.exe
Enter your secret key here: Coin: 2715135
real 0m1.390s
user 0m0.000s
sys 0m0.000s
OK, that actually makes it clearer. Single core, Rust is pushing 1.05 million hashes per second. Using four cores, that improves to 1.95 million hashes per second: not quite double the total speed. In fact, the per-core speed is halved, down to 0.49 million hashes per second. You get improvements as you increase the number of cores, but they are in fact very nonlinear.
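The per-core figures above come straight from the measured timings:

```python
COIN = 2_715_135  # hashes computed, at minimum

one_core = COIN / 2.562    # ~1.06 million hashes/sec on a single core
four_core = COIN / 1.390   # ~1.95 million hashes/sec total on four cores
per_core = four_core / 4   # ~488,000 hashes/sec per core: roughly halved
```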
The second part of the explanation is that of the eight cores on my machine, only four are physical. The other four are virtual, implemented via hyperthreading. Normally that's a good thing: it helps when a system spends more time task-switching among mostly-idle processes than actually executing code in any of them. This, however, is a CPU-bound task that easily saturates every core. Given that, the overhead of all that task-switching makes the run that saturates the virtual cores as well as the real ones more than twice as slow as the four-core run.
When used properly, Rust does in fact hash faster than Python. In a fair, single-core comparison, I measured 1.05 million hashes/sec in Rust vs. Python's 328,000/sec. Using all four physical cores, Rust still had a per-core hash rate of 488,000/sec, which beats Python on a per-core basis and significantly improves the overall runtime.
Surprisingly, there's no simple way in Rust to programmatically determine how many physical processors a machine has. The best solution at this time appears to be accessing hwloc via FFI, which is kind of a hassle, to be honest. A compromise might be to second-guess the result of num_cpus::get(): trust and use the result if it is 1 or 2, or >= 32; halve it otherwise.
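That heuristic is trivial to express (illustrative Python using os.cpu_count(); a Rust version would treat the result of num_cpus::get() the same way):

```python
import os

def physical_cpu_guess(logical=None):
    """Guess the physical core count from the logical one.

    Trust small counts (little chance of hyperthreading) and very large
    ones (likely a many-socket machine); otherwise assume 2-way
    hyperthreading and halve.
    """
    n = logical if logical is not None else (os.cpu_count() or 1)
    if n <= 2 or n >= 32:
        return n
    return n // 2
```

On the eight-logical-core machine above, this would correctly suggest four worker threads.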
Redditor sellibitze found both a moderately serious bug and a potential optimization in the code. Much thanks to them for that! Using the fixed version at commit b09a76f, we get even better results:
coriolinus@robocomp ~/adventofcode-2015/day4
$ time echo coriolinus | target/release/day4bin.exe
Enter your secret key here: Coin: 2715135
real 0m0.413s
user 0m0.000s
sys 0m0.015s
That's a throughput of 1.6 million hashes/sec, per thread! Not too shabby. It looks like this experiment has shown that Rust is, in fact, faster than Python at hashing--though only after fixing the user errors.
https://crates.io/crates/num_cpus