Skip to content

Instantly share code, notes, and snippets.

@coriolinus
Last active June 6, 2022 05:48
Show Gist options
  • Star 2 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save coriolinus/f94748a0d232d32e4eb1 to your computer and use it in GitHub Desktop.
Save coriolinus/f94748a0d232d32e4eb1 to your computer and use it in GitHub Desktop.
Rust threaded code vs. Python's GIL: benchmarking using code from Advent of Code 2015
//! Edited version of bin.rs, hard-coding a number of cores to use instead of depending on the default,
//! which reports the total number of cores including virtual ones.
extern crate util;
use util::get_line_input;
extern crate day4lib;
use day4lib::mine_coin_with_conf;
use day4lib::CoinMiningConfig;
fn main() {
if let Ok(input) = get_line_input("Enter your secret key here: ") {
let input = input.trim();
if let Some(result) = mine_coin_with_conf(&input,
CoinMiningConfig::default().cpus(4)) {
println!("Coin: {}", result);
} else {
println!("No coin found")
}
}
}

System information

This test was run on my desktop development machine, which is fairly beefy but not particularly new anymore. It has four real cores and four virtual, and runs at ~3.2 Ghz. Memory isn't relevant to this test.

The machine runs Windows 10, but shell used here was Cygwin64.

coriolinus@robocomp ~
$ rustc --version && cargo --version && python3 --version
rustc 1.4.0 (8ab8581f6 2015-10-27)
cargo 0.5.0-nightly (833b947 2015-09-13)
Python 3.4.3

Setup

All code used in this test is available here. This test was run on commit 2d6fddc.

Rust is a systems language priding itself on its speed and safety. I'm just learning it, so I'm using the adventofcode.com exercises as learning projects. I was pretty happy with the code I came up with here: it uses lightweight threads to eat up as much processor power as you're willing to give it. I'm not convinced that the code is completely optimized--in particular, there's a collect phase to ensure that an opportunistic thread with a higher unit of work doesn't accidentally return an incorrect answer, that I'm fairly confident could be optimized away somehow. At any rate, development of this program in Rust took about 7 hours all told for 176 lines of code. The build is in release mode, which enables compiler optimizations.

Python is a dynamic, interpreted language priding itself on ease of development. I'm fairly experienced with it. It's fundamentally a single-threaded language given that it has a Global Interpreter Lock, and I didn't want to spend the time to use the multiprocessing module to work around the GIL, so my implementation here uses only the main thread. Total development time was around 20 minutes, for 24 lines of code.

coriolinus@robocomp ~
$ git clone https://github.com/coriolinus/adventofcode-2015.git
Cloning into 'adventofcode-2015'...
remote: Counting objects: 174, done.
remote: Compressing objects: 100% (87/87), done.
remote: Total 174 (delta 74), reused 172Rec (deleivingta  ob72), pjects:ack  34-Ke% (60/used174)  0
Receiving objects: 100% (174/174), 31.09 KiB | 0 bytes/s, done.
Resolving deltas: 100% (74/74), done.
Checking connectivity... done.

coriolinus@robocomp ~
$ cd adventofcode-2015/day4

coriolinus@robocomp ~/adventofcode-2015/day4
$ cargo build --release
   Compiling libc v0.1.12
   Compiling winapi-build v0.1.1
   Compiling winapi v0.2.5
   Compiling util v0.1.0 (file:///D:/cygwin64/home/coriolinus/adventofcode-2015/day4)
   Compiling rustc-serialize v0.3.16
   Compiling libc v0.2.2
   Compiling advapi32-sys v0.1.2
   Compiling kernel32-sys v0.2.1
   Compiling gcc v0.3.20
   Compiling rand v0.3.12
   Compiling num_cpus v0.2.10
   Compiling time v0.1.34
   Compiling rust-crypto v0.2.34
   Compiling day4 v0.1.0 (file:///D:/cygwin64/home/coriolinus/adventofcode-2015/day4)

Testing

coriolinus@robocomp ~/adventofcode-2015/day4
$ time echo coriolinus | ./target/release/day4bin.exe
Enter your secret key here: Coin: 2715135

real    0m3.038s
user    0m0.000s
sys     0m0.000s

coriolinus@robocomp ~/adventofcode-2015/day4
$ time echo coriolinus | python3 ./py/day4.py
Enter your secret key here: Coin: 2715135

real    0m8.285s
user    0m8.234s
sys     0m0.000s

The nature of the test is such that the number returned in the Coin value is a minimum bound on the number of MD5 hashes computed. Both programs agreed on the correct answer, meaning each of them computed at minimum 2.7 million hashes.

The nature of the implementations is a confounding factor: the Python version can return immediately once it has the answer, while the Rust version has to wait for all eight threads to complete their last units of work. That said, the unit of work chosen was only 1024 hashes: in the absolute worst case, the program must wait for only 7168 additional hashes. Given the overall throughput, that should represent only 8 additional milliseconds.

The results here shocked me: yes, Rust is faster--but remember, it has eight times the cores to use. Rust's per-thread hash rate was only 112,000/sec, while Python's was 328,000/sec. How come Python is so much faster?

I don't know. I suspect that the hashlib.md5() function used in the Python version is more highly optimized than the rust_crypto::md5::Md5 object used in the Rust version. It's fairly common in Python to encounter standard library functions which have optimized C implementations, instead of sticking to pure Python. The Rust library, on the other hand, was pure-native and looked fairly straightforward at a glance-over, which implies that it's not playing many optimization tricks.

The other possibility, of course, is that my Rust code kind of sucks and I've introduced inadvertent delays. In that case, I'd love to see patch requests, particularly if they include explanations of where I went wrong!

Some Ways I Went Wrong

As redditor merdiocracy points out, threading doesn't scale linearly. In fact, in this case, it significantly affected the results. Running with a single core gives these results:

coriolinus@robocomp ~/adventofcode-2015/day4
$ cat src/bin.rs | grep cpus
          CoinMiningConfig::default().cpus(1)) {

coriolinus@robocomp ~/adventofcode-2015/day4
$ time echo coriolinus | ./target/release/day4bin.exe
Enter your secret key here: Coin: 2715135

real    0m2.562s
user    0m0.015s
sys     0m0.015s

Not only is that almost four times faster than the Python version, it's shaved half a second from the eight-core Rust version! What's going on with that?

coriolinus@robocomp ~/adventofcode-2015/day4
$ cat src/bin.rs | grep cpus
          CoinMiningConfig::default().cpus(4)) {

coriolinus@robocomp ~/adventofcode-2015/day4
$ time echo coriolinus | ./target/release/day4bin.exe
Enter your secret key here: Coin: 2715135

real    0m1.390s
user    0m0.000s
sys     0m0.000s

OK, that actually makes it clearer. Single core, Rust is pushing 1.05 million hashes per second. Using four cores, that improves to 1.95 million hashes per second: not quite double the total speed. In fact, the per-core speed is halved, down to 0.49 million hashes per second. You get improvements as you increase the number of cores, but they are in fact very nonlinear.

The second part of this is to remember that of the eight cores on my machine, only four are actually there. The other half are virtually implemented via hyperthreading. Normally, this is a good thing, when a system spends more time task-switching among mostly-idle processes than actually executing code in any of them. However, this is a CPU-heavy task which easily saturates all the cores. Given that, the performance hit from all that task-switching means there's a 2.5x slowdown when we attempt to saturate the virtual cores as well as the real ones.

Conclusion

When used properly, Rust does in fact hash faster than Python. In a fair, single-core comparison, I found 1.05 mio/sec in Rust vs Python's 328,000/sec. Using all four cores of the machine, Rust still had a per-core hash speed of 488,000/sec, which still beat Python on a per-core basis and significantly improved the overall runtime.

Surprisingly, there's no simple way to programmatically determine how many physical processors exist on a given machine in Rust. The best solution at this time appears to be to access hwloc via FFI. That is kind of a hassle, to be honest. A compromise might be to overthink the result of num_cpus::get(): Trust and use the result if it is 1 or 2, or >= 32; halve the result otherwise.

Postscript

Redditor sellibitze found both a moderately serious bug and a potential optimization in the code. Much thanks to them for that! Using the fixed version at commit b09a76f, we get even better results:

coriolinus@robocomp ~/adventofcode-2015/day4
$ time echo coriolinus | target/release/day4bin.exe
Enter your secret key here: Coin: 2715135

real    0m0.413s
user    0m0.000s
sys     0m0.015s

That's a throughput of 1.6 mio hashes/sec, per thread! Not too shabby. It looks like this experiment has proven that Rust is, in fact, faster than Python at hashing--though only after fixing the user errors.

@Manishearth
Copy link

Surprisingly, there's no simple way to programmatically determine how many physical processors exist on a given machine in Rust.

https://crates.io/crates/num_cpus

@mbrubeck
Copy link

I think that num_cpus only returns the number of logical cores/threads, not the number of physical cores.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment