Skip to content

Instantly share code, notes, and snippets.

@zeux
Last active February 23, 2022 22:53
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save zeux/6e79f5d14a9f5a80bad9db85aa9a744f to your computer and use it in GitHub Desktop.
Save zeux/6e79f5d14a9f5a80bad9db85aa9a744f to your computer and use it in GitHub Desktop.
Luau GC exploration: doing some math on relationship between S / G / P

GC pacing

This document tries to establish a mathematical formulation for GC pacing in Luau GC, assuming a uniform rate of allocation in an application with steady live set.

GC algorithm assumptions

  • GC proceeds in three phases: mark, atomic, sweep
  • During mark, the heap size only grows as we don't deallocate memory short of table resize
  • During sweep, the heap size grows due to new allocations and shrinks due to swept objects
  • Live set is fixed at atomic time (between mark & sweep)
  • Mark needs to traverse all live objects
  • Sweep needs to traverse all objects in existence at the time sweep started, but not objects allocated since

The latter is true if mutator allocates aggressively, because new pages get prepended to the total page set, so they are not visited during sweep cycle. It's not quite precise because if an allocation happens in a page that existed at start of sweep cycle and hasn't been visited yet, we will still traverse the object during sweep, but this complicates the accounting.

GC phase: atomic

We will try to compute the delta between two atomic steps by modeling the behavior of a steadily allocating application. The reason why doing the delta between two atomic steps is convenient is that during atomic we have just traversed the entire live set, so we know the size of live set as well as the size of the total heap exactly.

Let:

  • Ha = total heap size during atomic
  • R = ratio of live objects in the heap (live set = R*Ha bytes)
  • A = allocation rate (bytes/second)

We will also use two GC tunables: mark speed (# of bytes marked per byte of allocation) Sm and sweep speed (# of bytes swept per byte of allocation) Ss.

GC phase: sweep

During the sweep phase we need to sweep the entire heap as seen during atomic phase. Note that during sweep the mutator will continue allocating new objects at a constant rate. Assuming Ts is the total time sweep takes (note, here and elsewhere time is measured in total application seconds and doesn't imply anything about sweep cost), sweep ends when we've traversed the entire heap, and the heap has grown during that time as the application keeps allocating:

Ha = Ts * A * Ss

Thus:

Ts = Ha / (A * Ss)

At the end of the sweep cycle, the application will have allocated Ts * A bytes, but also the sweeper will have deallocated Ha * (1 - R) bytes, leaving no garbage from the original heap. We can now compute the heap size at the end of the sweep cycle:

Hs = Ha * R + (Ha / (A * Ss)) * A = Ha * R + Ha / Ss

Thus, at the end of the sweep cycle we know the predicted heap size:

Hs = Ha * R + Ha / Ss

We keep the sum as two terms: the first term is all live objects, and we don't know the liveness of the second term. We'll come back to this.

GC phase: pause

This is an opportunity for us to take a pause and not collect anything for a while after sweeping finished. Let's arbitrarily say that we are willing to let the application grow the heap size by an extra fraction of P of the "atomic" heap size before starting the next mark. Thus we now know the heap size at the end of the pause:

Hp = Ha * R + Ha / Ss + Ha * P

GC phase: mark

Now that we've started marking, we'll note that during marking, we don't free memory, but we do traverse all newly allocated objects. We're interested in the mark phase ending, which is a little more complex than sweep because we're chasing an ever growing live set: if Tm is the mark duration, then we've scanned Tm * A * Sm during that; and we've also allocated Tm * A extra objects some of which might be live.

This is where the analysis gets a little difficult because we don't know how many of the newly allocated objects are garbage. It's tempting to assume that the ratio Ra that we recorded at the start of atomic holds here as well, but the reality is that Ra at the end of atomic phase isn't the same as at the end of sweep! (because we've allocated more garbage)

For now let's make an assumption that we're only allocating garbage during both mark and sweep. This may seem weird, but it makes sense for a steady state of a running application - after all, if the heap doesn't grow or shrink, we might as well assume that some objects never die and all other objects are garbage. It might be that we're allocating some garbage, and replacing some existing objects with it, but this needlessly complicates the analysis.

As such, the total live set we need to scan during mark is the same as the live set we've started with: Ha * R bytes. Everything we've allocated since is garbage.

Thus we can compute Tm:

Tm * A * Sm = Ha * R

Tm = Ha * R / (A * Sm)

During this we've allocated Tm * A objects which brings us to the total heap size after mark:

Hm = Hp + Tm * A = Ha * R + Ha / Ss + Ha * P + Ha * R / (A * Sm) * A

Hm = Ha * R + Ha / Ss + Ha * P + Ha * R / Sm

Conveniently, all these terms are factors of Ha, so we can simplify this by looking at the heap growth ratio:

Hm / Ha = R + 1 / Ss + P + R / Sm

GC: steady state

Note that Hm is the total heap size at the end of mark, which is also the beginning of the next atomic phase. During atomic phase the state doesn't change. For the heap to have a stable size at steady state, we naturally want the ratio to be 1 - if the ratio is consistently above 1, this means we're effectively leaking memory (assuming live set doesn't grow). So:

1 = R + 1 / Ss + P + R / Sm

This is a really important equation, because there's a deep meaning to the constant R. We would normally expect the heap size to be largest during atomic phase, as we haven't deallocated any objects during the pause/mark phase, and right after atomic we're starting to sweep.

As such, 1 / R is otherwise known as "heap size goal": the ratio of total heap size to live state. For example, if heap size goal is 200%, during atomic we'd expect ~half of the objects to be garbage and ~half of the objects to be live.

So we can assume 1 / R is the configuration constant - we know it because it's given to us. Let's call it G and rework the equation above:

1 = 1 / G + 1 / Ss + P + 1 / G / Sm

G = 1 + G / Ss + G * P + 1 / Sm

G * (1 - 1 / Ss - P) = 1 + 1 / Sm

G = (1 + 1 / Sm) / (1 - 1 / Ss - P)

Repeating: we now have the Fundamental Equation of GC:

G = (1 + 1 / Sm) / (1 - 1 / Ss - P)

Note that this equation assumes we don't do background GC, and are purely operating in assist mode. It also assumes a bunch of other things that may not hold true.

Digression: validating full gc trigger code

Now that we're here let's make a brief detour and double check our pause estimation after the full GC (see luaC_fullgc). This is mostly to validate that the reasoning above is roughly on the same wavelength as some basic code.

Assuming we've just done a full GC - a synchronous mark & sweep, which obviously changes the calculations above - what should P be for us to reach the heap goal G given S?

After a full GC, only objects remaining are live.

After this we have a pause, during which we allocate H * P bytes (all garbage per assumptions above), followed by mark during which we must traverse H bytes of live data, and that causes us to allocate H / S bytes.

Thus total heap size is H + H * P + H / S, which we'd like to be equal to H * G.

G = 1 + P + 1 / S

1 + P = G - 1 / S = (G * S - 1) / S

Which is the same as our code (note that GCthreshold is H * (1 + P) because the delta between threshold and totalbytes is pause size = H*P).

g->GCthreshold = g->totalbytes * (g->gcgoal * g->gcstepmul / 100 - 100) / g->gcstepmul;

So far so good! Our code is correct. Yay.

G can't possibly be 1.25 with Sm at 2

This leads us to the next conclusion: we're setting GC goal to 1.25 but if we are indeed using Sm=2, it's impossible for us to ever reach this goal.

Indeed, (G * S - 1) / S = 0.75. So the equation above heroically sets g->GCthreshold to 0.75 of the current heap size, with the code right after stating:

    // but it might be impossible to satisfy that directly
    if (g->GCthreshold < g->totalbytes)
        g->GCthreshold = g->totalbytes;

Well, duh. In fact, the formula above (which effectively assumes infinitely fast sweep as it runs right after full GC) with S=2 produces the result of 1 (no pause, just able to meet the goal) at G=1.5. So even if sweep is infinitely fast, G=1.25 is not reachable unless marking is actually much faster than 2x.

Experimental evidence that pacing formulas above are accurate

Let's run this contrived program that has a baseline heap cost and continues to generate garbage:

local N = 1000000 
local M = 100000000

local t = {}
for i=1,N do
	t = {t=t} -- each table is 56 byte header + 32 byte single-node allocation = 88 bytes
end

print(gcinfo())

for i=1,M do
	local _ = {}
end

print(gcinfo())

Let's run this with heap goal of 1.25 and mark speed 2, and measure the peak memory consumption relative to stable. Using latest Luau CLI we get:

  • baseline heap: 88*10^6 bytes
  • peak heap: 136*10^6 bytes
  • G = 1.545

G here is higher than the heap goal, but does it match our model? Let's see:

G = (1 + 1 / Sm) / (1 - 1 / Ss - P)

We know Sm is 2. We can assume P is 0 because we're failiing to reach goal as it is. What's our actual sweep speed?

Paged sweeper multiplies the object count it looked at by 4 to compute the sweeping cost. The unit here is "bytes", and mark directly returns object size as cost; since the only object we have here is a table, which has a 56 byte header, we are effectively being asked to sweep A*2 bytes for every A bytes allocated, but instead we sweep A*2/4 56-byte tables, so our actual sweeping speed is 28! (note, this will depend on the average size of a GCO in the heap and as such may fluctuate, but in our benchmark the sizes are flat since we're only dealing with tables).

Plugging this into our formula:

G = (1 + 1 / 2) / (1 - 1 / 28) = 1.555

Close enough. (actually this gets more nuanced because our heap consists of a bunch of non-GCO state that the sweeper doesn't care about...)

Optimistic computation of mark speed

The analysis above tells us that we need to be adjusting S according to G, which we knew.

One other observation that's interesting is that Ss is wicked fast in practice. We said above it's 28, but actually it's even higher than that, because the total heap size (and mark cost) takes non-GCO data into account, whereas sweep doesn't.

It feels like we might as well assume that Ss is infinite. This gives us a simplified heap formula, where S is the mark speed:

G = (1 + 1 / S) / (1 - P)

We now have a choice: we can scale S with G, leaving P as fixed 0, scale P with G, leaving S as fixed 2, or some combination thereof. For example, to reach G=1.5, we have the following options:

  • S=2 P=0
  • S=2.85 P=0.1
  • S=5 P=0.2

I'm going to make the claim that P=0 is always the best outcome. This is because we're fundamentally optimizing for worst case pauses, and increasing P will increase S - this will eliminate GC work completely from some frames, but it will also mean that the GC work during other frames will be more significant.

As such, the correct decision is to fix P at 0 for the purpose of this modeling, and thus we can derive S trivially from G:

S = 1 / (G - 1)

Indeed, to reach 1.25x heap size we need to set S=4, and doing so in the test example above results in the actual goal reached of 1.26.

Note that this formula can lead to S that is smaller than the 2x we use today. For example, our stock settings outside of Roblox are G=2 S=2, which are inconsistent with the formula above; setting G=2x only requires S=1x. However:

  • This formula now assumes infinite sweep speed; this is likely to be a faulty assumption
  • This entire modeling assumed a simple world where mark phase doesn't need to catch up to new allocations because new allocations immediately become garbage. There might be contrived programs that effectively replace the live portion of the heap with newly allocated objects and then make all existing objects dead by detaching them. This technically preserves the live heap size in the average sense, but it does require marking to traverse up to the full heap size, not just the live objects.

Pessimistic computation of mark speed

The analysis above assumed a well-formed application that generates garbage, and derived S = 1 / (G - 1) as the optimal mark speed.

The problem is that for high values of G, this speed may result in certain mutation patterns that extend mark stage indefinitely. The analysis so far assumed that all new allocations are garbage and as such it's sufficient for us to mark the live set that mark stage starts with. However, it is possible to devise applications that aggressively allocate new reachable state ahead of the collector, and thus the mark process would converge very slowly.

Let's model this situation separately. Assume mark stage starts from heap size H, and as before mutator continues to allocate at a steady rate of A bytes/second, but all allocated objects are immediately reachable, for example because the mutator refers to them from reachable but not-yet-unallocated objects.

In such a scenario, mark speed S must be more aggressive. Specifically:

  • At time T, the heap size is H + T * A
  • At time T, the collector marked T * A * S
  • For these to converge, the collector must catch up to the allocation rate, and as such:

H + T * A = T * A * S

We'd like to incorporate heap goal into this calculation. To do that, we're going to claim that we'd not only like to catch up in general, but we'd also like to catch up once heap H grows by up to G from its initial (presumed live) state. This should intuitively make sense - end of mark should result in the peak watermark, so if the collector follows instructions it should try to end every mark cycle at G*H where H is the live set size, and in this experiment we are assuming the entire heap is live.

Therefore:

G * H = H + T * A

T * A = (G - 1) * H

G * H = T * A * S

G * H = (G - 1) * H * S

S = G / (G - 1)

This establishes the upper bound for the collection speed; the previous section established the lower bound 1 / (G - 1), so the upper bound is larger than the lower bound by exactly one.

For example:

  • For G=1.25, the lower bound S is 4, the upper bound S is 5
  • For G=2, the lower bound S is 1, the upper bound S is 2 (this happens to be the default setting!)

Notice that for high factors of G, the lower bound is actually less than one. This makes sense under the assumption that live set doesn't grow (no need to mark faster than you allocate if all allocations are garbage), but with a constantly growing heap this means we never reach mark termination, whereas with the conservative upper bound this problem doesn't exist.

Note that for some reasonable values of G, the extra overhead of pessimism is substantial: for G=1.5, lower bound for S is 2 and upper bound is 3, which is 50% more GC work than lower bound. In applications that obey generational hypothesis the lower bound is likely to be good enough.

Effect of background GC

TODO

Note that all of this doesn't add up as we can't reach the goal of 1.25 experimentally, at least on large heaps with fixed Sm=2. This suggests one of two things:

  • Either G can't actually get to 1.25 in practice
  • Or something is inflating the rate at which we mark dramatically.

Given the GC formula:

G = (1 + 1 / Sm) / (1 - 1 / Ss - P)

For us to reach G=1.25, even with P=0 we must have Sm be at least 4 (with infinitely fast sweep).

How can this be? It is probably due to background GC:

int gcRequest = std::max(1, std::min(DFInt::LuaGcMaxKb, int(globalState->gcAllocAvg.value() * DFInt::LuaGcBoost)));
lua_gc(globalState->state, LUA_GCSTEP, gcRequest);

lua_gc treats this as as simulated allocation of gcRequest KB, and runs the collector as usual. The issue though is that GcBoost is set to 1, so with this code, unless our code here is just completely bogus, seems to indicate that we're simply asking the GC to collect as much data as it normally would have collected under normal allocation rate. Note that because explicit GC acts as credit for the assist GC work, this should not normally change the relationship between S and G - the math in this document so far has assumed we do all the work during assists, but the idea of assist credits is that background GC simply does the work assists would have done anyway.

So now I am confused.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment