@miguelraz
Created February 29, 2024 16:34
perf notes for Rust

Rust Speed Optimizations

From setting core_affinity to measuring L1i cache misses with good ol' perf or poking at hardware counters: if there's standard perf tooling you expect to work, it should work well in Rust too (though it might look a bit different from what you expect).

The Rust community has a healthy monitoring suite covering hundreds of perf metrics, see perf.rust-lang.org.

Let's look at some Rust-specific flavors of what you should know.

Prior art: training material

We won't cover optimizing your compilation time here, but you can check out our slides on that over here.

Tooling to measure speed

Sometimes the fanciest algorithms don't beat the fact that 99% of an application's hot path is spent creating empty vectors - learning to detect that, and to design for it with specialized data structures like ThinVec or SmallVec and friends, is key!

Profilers

The Rust performance book lists several options for profiling, among others.

Statistical (sampling) profilers pay off best when a single function or loop dominates the runtime, because they sample at intervals and thus gain depth with time.

"Flatter" profiles that don't have a single focal point (like a multipass compilation workload) are less likely to show useful data in their entirety and may require bespoke workloads that stress particular code paths.

measureme and other profilers may allow for SQL-like queries to be given to the UI in order to better filter the polled data.

For memory/heap profiling, check

  • bytehound / dhat - allow testing for specific allocation counts and sizes.

Very useful for hunting allocations and memory leaks. See this case study for a thorough walkthrough of bytehound's capabilities.

Honourable mention - rustc-perf just gained a collector for binary_sizes to compare different compiler versions and track regressions.

Causal Profilers

  • coz

Only on Linux, but offers extremely powerful "causal analysis" in multithreaded workloads.

coz requires one to instrument Rust code with some lightweight macros; it then produces a graph with information like "if we speed up this line by 37%, we will see a 24% speedup in our program".

This is especially important because sometimes speeding up parallel code can lead to a global slowdown! (Hint - think of the case where a busy thread obtains the global lock sooner and prevents progress for other threads.)

LLVM tooling:

Rust can only do so much - after monomorphization and some profitable transformations on Rust's MIR, all code is handed off to LLVM to be optimized.

Rust has a ...tendency to stress LLVM in unfamiliar ways, so poking at what happens inside the machine is necessary for deep introspection.

  • llvm-mca - via GodBolt. About as low-level as you can get when measuring machine cycles. Very architecture-dependent, but pays off in very hot loops of compute-dominated code. Also useful to figure out how many cycles per character your string processing algorithms are spending.
  • llvm instrumentation coverage
  • cargo-llvm-lines - to see which functions expand into the most lines of LLVM IR, a good proxy for monomorphization bloat and compile-time cost.
  • cargo-show-asm - to inspect the outgoing assembly from specific functions in your crate. Very useful for a quick and dirty "did it produce lots of SIMD instructions or did something fail".
  • llvm opt - a lot of LLVM IR passes come with many heuristics. opt lets you include your own passes in the pipeline for analysis, reorder the passes, and much more.

If you're puzzled about some optimizations not firing and you're this deep in the stack, you should reach out to some devs and likely file an issue.

Benchmarking Frameworks

  • criterion vs divan for Rust code, hyperfine for CLI benchmarking of whole programs.

divan is what 99% of people should reach for when timing their code.

If you need a more robust timing framework, consider criterion.

hyperfine is a useful tool for timing programs on the command line, with a small colorful TUI displaying the diff.

From the bleeding edge to mainstream

You can't get into the Linux kernel if you're a slouch. Here are some cutting-edge implementations breaking ground and using Rust all the way, from STEM to Unicode.

  • RustFFT - a Rust-based implementation of the Fast Fourier Transform, an algorithm needed in basically anything with an antenna or a PET scanner, and much more.
  • egraphs - a resurrected data structure that solves the phase ordering problem for term rewriting with fine-grained concurrency. This implementation alone is guiding loads of compiler research avenues.
  • ICU4X - official implementation of many Unicode internationalization algorithms, written in Rust.
  • Rust atomics - to write this book, Rust library lead Mara Bos implemented futexes on macOS and many other techniques.
  • Compiler tech
    • Salsa - a generic framework for on-demand incrementalized computation. Spiritually equivalent to rustc's query driver, or, how a compiler is a funny kind of database.
    • MIRI - Are you sure your unsafe code is actually sound and will give you the perf benefits you require? MIRI (available in the Rust Playground) will bail out if it detects any UB or soundness issues in your hand-made "optimizations".
    • MMTK - Pluggable and tuneable GC backends. This framework is the result of a top-tier research group's GC innovations, available for all those who want to try them out.
    • Enzyme and MLIR - a framework to autodifferentiate (obtain the gradients of) LLVM IR code. This work has been upstreamed into the Rust compiler and is fundamental for gaining ground in the ML/AI space.
    • YJIT - a JIT for Ruby was rewritten in Rust and blew the benchmarks out of the water.
    • portable-simd - ambitious and progressing, book here

Idiomatic Rust

  • Remember to use --release for the best results, and -C target-cpu=native (or your deployment target) to make the most of your architecture
  • FizzBuzz actually reuses the same byte strings
  • Non-generic inner function - notable example: the function that calls MIR optimization passes
  • iterators and perf (sized iterators)
  • stdlib idioms from STL and how they differ
  • const fns - notably missing: const functions on many floating point operations.
  • const generic - specialize heavily on integer parameters. Incredibly useful for embedded usecases and for finicky numerical codes that demand precise loop unrolling semantics.
  • byte arrays, bstr - Rust's standard library offers a myriad of string-processing functions, but it's easy to default to validated UTF-8 instead of using plain byte arrays (via b'a' for a byte literal or b"asdf" for a byte string). The bstr crate offers APIs to operate on not-necessarily-valid UTF-8 strings (which arise surprisingly often when dealing with arbitrary file names.)
  • panic! branches - branches that may panic! can wreak havoc on your optimizations. Setting panic = "abort" in your Cargo.toml profile will reduce your code size and may also reduce pressure on the instruction cache.
  • mem::take - In proper Ferrous Systems style, we have a killer blogpost on it.
  • codegen units - consider setting #[inline] on important functions, using lto = "fat" in your Cargo.toml and codegen-units = 1 so the optimization passes can pull in beefier global analyses (at the cost of heftier compile times.) See the rustc compiler dev guide for more details.
  • Compile time perfect hashmaps - construct your fancy data structures at compile time and use the optimized version at run time!
  • BufReader - a stdlib type that can save you lots of perf headaches if your bigger bottlenecks are I/O-bound. Wrap your file in BufReader::new(file) and just iterate on it as usual.
  • Scoped threads - dirt-cheap and easy parallelism by spawning threads that can borrow from the stack. Note: spawning a thread takes about ~10µs, so measure your workload to know whether you can gain by using them.
  • struct-of-arrays - Rust and the borrow checker will push you toward a data-oriented Struct-of-Arrays architecture. Bevy's Entity Component System was written to leverage this, and the data locality benefits are worth learning about.
  • Cow - Work with both owned and borrowed data!
  • Avoid dynamic dispatch - use enum_dispatch if possible.

Important crates

  • rayon

    (&a, &mut b, &c).into_par_iter().for_each(|(a, b, c)| todo!()); // via `MultiZip`

    vs

    Zip::from(&mut a)
        .and(&b)
        .and(&c)
        .and(&d)
        .for_each(|w, &x, &y, &z| {
            *w += x + y + z;
        });
    • par!
    • notably missing: parallel advisory
  • memchr, ripgrep

  • polars

  • bytemuck

Perf walkthroughs - measure, measure, measure
