A bunch of stuff I learned while reading up on GHC garbage collection tuning.
- GHC ticket 9221 (still open)
- GHC ticket 8224 (closed as "worksforme" - unclear why though; Simon Marlow commented "I don't think anyone has fully investigated what's going on [with that ticket]")
- GHC ticket 3758 (closed as apparently fixed once a deadlock was fixed; calling it fixed may have been premature)
I read through all of these threads and learned a lot.
A prerequisite for following them is understanding that you can control the behavior of the Haskell runtime system using RTS flags. The way those work is you run `ghc +RTS some-flags-go-here -RTS` - the `+RTS` means "arguments after this one should be sent to the Run Time System" and `-RTS` means "now we're done passing arguments to the RTS". You don't need the `-RTS` if you have `+RTS` and only RTS flags after it.

These can also be baked into a binary at compile time by adding `-with-rtsopts` to a `.cabal` file.
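As a concrete sketch (the program and stanza names here are placeholders, not from the tickets), RTS flags can be passed on the command line - this requires the program to have been compiled with the `-rtsopts` GHC option:

```sh
# GHC is itself a Haskell program, so it accepts RTS flags directly:
ghc +RTS -A32m -RTS --make Main.hs

# Passing RTS flags to your own compiled program ("myprog" is a
# placeholder name); program arguments go outside the +RTS/-RTS pair:
./myprog +RTS -A32m -qg -RTS some-program-args
```

Or baked in via a `.cabal` file (an illustrative stanza):

```cabal
executable myprog
  main-is:     Main.hs
  ghc-options: -rtsopts "-with-rtsopts=-N -A32m"
```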
There are a bunch of RTS flags you can send.
Someone on a later ticket found that disabling parallel GC helped a lot with higher CPU counts:
it appears that +RTS -qg (disable parallel GC) helps a lot with the superlinear overhead. For example, the benchmark above with jobs=24 & caps=24 without -qg took:
```
real    1m3.596s
user    6m31.072s
sys     3m10.732s
```
With -qg:
```
real    0m47.747s
user    1m33.352s
sys     0m2.024s
```
Simon Marlow responded to this comment by saying: (emphasis mine)
We probably want to be running with larger heap sizes when there are lots of cores, to counteract the synchronization overhead of stop-the-world GC across many cores. e.g. +RTS -A32m at least.
`+RTS -A32m` means "reserve 32MB for the nursery" - more precisely, the allocation area for each capability. By default, GHC reserves 1 megabyte.
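Since `-A` is per capability, total nursery memory scales with the `-N` core count. A hedged example (the program name is a placeholder, and it must have been compiled with `-rtsopts`):

```sh
# 8 capabilities x 32MB allocation area = 256MB of nursery in total.
# -s prints GC statistics on exit, so you can see the effect.
./myprog +RTS -N8 -A32m -s -RTS
```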
Someone experimented with tweaking the -A setting and found that increasing it substantially decreased execution time, but also substantially increased memory usage.
I ran some experiments with -A and it does help a lot with performance, but also increases peak memory usage. I observed continuous improvement all the way from -A1m to -A128m in terms of walltime (41s to 36s), but "total memory in use" also went up from 265MB to 2182MB. Not sure where the sweet spot is.
-A seems to help especially if the number of capabilities exceeds the number of cores. With 32 capabilities on a 16 core machine, a -qg run took 50s, -A128m took 41s (still a penalty over 36s but not nearly as bad) and a vanilla run took almost 2min. Of course, total memory use with -A128m went up to 4388m...
To this, Simon Marlow commented:
It's hard to know where to set the default on the memory-vs-time tradeoff curve. GHC has typically been quite conservative here.
A bit later, someone said something which I didn't understand, but which both SPJ and Simon Marlow agreed with, so I'm linking to it in case someone else can figure out what it means.
Someone read a paper and made a discovery:
The crucial detail: work stealing is enabled by default for gen=1 and upper. As default nursery is tiny we don't do stealing from it. That's why I see poor GC parallelism on large nurseries for this compilation workload
To this Simon Marlow responded:
Yes, perhaps we should default to -qb0 when -A is larger than some threshold.
`-qb0` means "use load-balancing in the parallel GC in generation 0 and higher" (the default is generation 1 and higher, so this says "use load balancing in the nursery"). Based on what the linked docs say, it would also potentially be beneficial to experiment with `-qb` (omitting the zero), which disables load balancing entirely.
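A sketch of combining a large nursery with nursery load balancing, per Marlow's suggestion (the flag values and program name are illustrative):

```sh
# Large per-capability nursery plus load balancing in generation 0:
./myprog +RTS -N16 -A64m -qb0 -RTS
```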
Ultimately someone on the GHC team benchmarked this and confirmed a speedup. On GHC 8+, using `-A32M` or higher enables `-qb0` automatically - but we can also enable it manually.
The `-n` flag, which in current GHC defaults to `-n4m` when `-A16M` or higher is set, improves how the megabytes of memory specified by `-A` get distributed among cores.
Without `-n`, each core gets a fixed-size allocation area specified by `-A`, and the first core to exhaust its allocation area triggers a GC across all the cores. This can result in a collection happening when the allocation areas of some cores are only partially full, so the purpose of `-n` is to allow cores that are allocating faster to get more of the allocation area. This means less frequent GC, leading to a lower GC overhead for the same heap size.
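Putting that together, a hedged example of chunked nurseries (flag values and program name illustrative):

```sh
# Divide the 64MB-per-capability allocation area into 4MB chunks;
# faster-allocating cores claim more chunks before a GC is triggered.
./myprog +RTS -N16 -A64m -n4m -RTS
```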