A bunch of stuff I learned while reading up on GHC garbage collection tuning.
- GHC ticket 9221 (still open)
- GHC ticket 8224 (closed as "worksforme" - unclear why though; Simon Marlow commented "I don't think anyone has fully investigated what's going on [with that ticket]")
- GHC ticket 3758 (closed as apparently fixed once a deadlock was fixed; calling it fixed may have been premature)
I read through all of these threads and learned a lot.
A prerequisite for following them is understanding that you can control the behavior of the Haskell runtime system using RTS flags. The way those work is you run `ghc +RTS some-flags-go-here -RTS` - the `+RTS` means "arguments after this one should be sent to the Run Time System" and `-RTS` means "now we're done passing arguments to the RTS". You don't need the `-RTS` if you have `+RTS` and only RTS flags after it.

These can also be baked into a binary at compile time by adding `-with-rtsopts` to a `.cabal` file.
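As a concrete sketch (the program and stanza names here are placeholders, not from the tickets), RTS flags can be passed on the command line - this requires the program to have been compiled with the `-rtsopts` GHC option:

```sh
# GHC is itself a Haskell program, so it accepts RTS flags directly:
ghc +RTS -A32m -RTS --make Main.hs

# Passing RTS flags to your own compiled program ("myprog" is a
# placeholder name); program arguments go outside the +RTS/-RTS pair:
./myprog +RTS -A32m -qg -RTS some-program-args
```

Or baked in via a `.cabal` file (an illustrative stanza):

```cabal
executable myprog
  main-is:     Main.hs
  ghc-options: -rtsopts "-with-rtsopts=-N -A32m"
```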
There are a bunch of RTS flags you can send.
Someone on a later ticket found that disabling parallel GC helped a lot with higher CPU counts:
it appears that +RTS -qg (disable parallel GC) helps a lot with the superlinear overhead. For example, the benchmark above with jobs=24 & caps=24 without -qg took:
```
real    1m3.596s
user    6m31.072s
sys     3m10.732s
```
With -qg:
```
real    0m47.747s
user    1m33.352s
sys     0m2.024s
```
Simon Marlow responded to this comment by saying: (emphasis mine)
We probably want to be running with larger heap sizes when there are lots of cores, to counteract the synchronization overhead of stop-the-world GC across many cores. e.g. +RTS -A32m at least.
`+RTS -A32m` means "reserve 32MB for the nursery" - more precisely, the allocation area for each capability. By default, GHC reserves 1 megabyte.
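Since `-A` is per capability, total nursery memory scales with the `-N` core count. A hedged example (the program name is a placeholder, and it must have been compiled with `-rtsopts`):

```sh
# 8 capabilities x 32MB allocation area = 256MB of nursery in total.
# -s prints GC statistics on exit, so you can see the effect.
./myprog +RTS -N8 -A32m -s -RTS
```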
Someone experimented with tweaking the -A setting and found that increasing it substantially decreased execution time, but also substantially increased memory usage.
I ran some experiments with -A and it does help a lot with performance, but also increases peak memory usage. I observed continuous improvement all the way from -A1m to -A128m in terms of walltime (41s to 36s), but "total memory in use" also went up from 265MB to 2182MB. Not sure where the sweet spot is.
-A seems to help especially if the number of capabilities exceeds the number of cores. With 32 capabilities on a 16 core machine, a -qg run took 50s, -A128m took 41s (still a penalty over 36s but not nearly as bad) and a vanilla run took almost 2min. Of course, total memory use with -A128m went up to 4388m...
To this, Simon Marlow commented:
It's hard to know where to set the default on the memory-vs-time tradeoff curve. GHC has typically been quite conservative here.
A bit later, someone said something which I didn't understand, but which both SPJ and Simon Marlow agreed with, so I'm linking to it in case someone else can figure out what it means.
Someone read a paper and made a discovery:
The crucial detail: work stealing is enabled by default for gen=1 and upper. As default nursery is tiny we don't do stealing from it. That's why I see poor GC parallelism on large nurseries for this compilation workload
To this Simon Marlow responded:
Yes, perhaps we should default to -qb0 when -A is larger than some threshold.
`-qb0` means "use load-balancing in the parallel GC in generation 0 and higher" (the default is generation 1 and higher, so this says "use load balancing in the nursery"). Based on what the linked docs say, it would also potentially be beneficial to experiment with `-qb` (omitting the zero), which disables load balancing entirely.
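A sketch of combining a large nursery with nursery load balancing, per Marlow's suggestion (the flag values and program name are illustrative):

```sh
# Large per-capability nursery plus load balancing in generation 0:
./myprog +RTS -N16 -A64m -qb0 -RTS
```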
Ultimately someone on the GHC team benchmarked this and confirmed a speedup. On GHC 8+, using `-A32M` or higher enables `-qb0` automatically - but we can also enable it manually.
The `-n` flag, which in current GHC defaults to `-n4m` when `-A16M` or higher is set, improves how the megabytes of memory specified by `-A` get distributed among cores.
Without `-n`, each core gets a fixed-size allocation area specified by `-A`, and the first core to exhaust its allocation area triggers a GC across all the cores. This can result in a collection happening when the allocation areas of some cores are only partially full, so the purpose of `-n` is to allow cores that are allocating faster to get more of the allocation area. This means less frequent GC, leading to a lower GC overhead for the same heap size.
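Putting that together, a hedged example of chunked nurseries (flag values and program name illustrative):

```sh
# Divide the 64MB-per-capability allocation area into 4MB chunks;
# faster-allocating cores claim more chunks before a GC is triggered.
./myprog +RTS -N16 -A64m -n4m -RTS
```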