Minimize allocations in Go

A collection of tips for when you need to minimize the number of allocations in your Go programs.

Use the Go profiler (pprof) to identify which parts of your program are responsible for most allocations.
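For example, a minimal sketch of gathering an allocation profile from a benchmark (BenchmarkBuildMessage and buildMessage are placeholders for your own code):

func BenchmarkBuildMessage(b *testing.B) {
	b.ReportAllocs() // report allocs/op alongside ns/op
	for i := 0; i < b.N; i++ {
		buildMessage() // the code path you want to profile
	}
}

Running go test -bench=BuildMessage -memprofile=mem.out followed by go tool pprof -sample_index=alloc_objects mem.out shows which call sites perform the most allocations.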

⚠️ Never apply these tricks blindly (i.e. without measuring the actual performance benefit/impact). Most of these tricks cause a tradeoff between reducing memory allocations and other aspects (e.g. higher peak memory usage, higher CPU usage, lower maintainability, higher probability of introducing subtle bugs). Only apply these tricks if the tradeoff in each specific case is globally positive.

Protobuf

Working with protobuf in Go puts significant load on the memory subsystem, as protobuf-generated structures often contain a large number of pointers.

One way to minimize the number of allocations is to allocate all the fields at the same time, and then use internal pointers to wire them up.

Let's pretend you have (the generated Go code for) four messages A, B, C, and D (note: for clarity this is not really protobuf-generated code, but the technique generalizes trivially):

type A struct {
	b *B
	c *C
	d *D
}

type B struct {
	int
}

type C struct {
	string
}

type D struct {
	c *C
}

Normally, assuming you need to populate all of them, you would do something like:

a := &A{
  b: &B{42},
  c: &C{"hello"},
  d: &D{
    c: &C{"world"},
  },
}

This is very idiomatic and compact, but has the downside of requiring 5 allocations just for the message structure.

Storage packing

Can we do better? Ideally we would like to perform a single allocation, instead of having to allocate every single field. Luckily, yes, we can.

To do this we can build an auxiliary struct that will provide the storage for all the fields and then wire up the pointers appropriately. To allocate an appropriate storage structure, we can do e.g.

s := &struct {
	A
	_b   B
	_c   C
	_d   D
	_d_c C
}{}

Then, to wire up all the internal pointers (in order to maintain compatibility with callers that are unaware of the change, like the protobuf ecosystem) we can do:

// none of these allocate
a := &s.A
a.b = &s._b
a.c = &s._c
a.d = &s._d
a.d.c = &s._d_c

At this point we can initialize the values as required:

*a.b = B{42}
*a.c = C{"hello"}
*a.d.c = C{"world"}

Note that a and *a are, respectively, an *A and an A indistinguishable from any other *A and A allocated in the regular way.

This playground link shows the complete example, and includes a test that shows that a single allocation is performed: https://play.golang.org/p/R8DBVe7jaeX. Note that all the scaffolding for allocating the full structure has been moved to a separate function getA that returns a plain *A.
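For reference, getA looks roughly like this (a sketch assembled from the snippets above; the field values could equally be set by the caller):

func getA() *A {
	s := &struct {
		A
		_b   B
		_c   C
		_d   D
		_d_c C
	}{}
	a := &s.A
	a.b = &s._b
	a.c = &s._c
	a.d = &s._d
	a.d.c = &s._d_c
	*a.b = B{42}
	*a.c = C{"hello"}
	*a.d.c = C{"world"}
	return a
}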

There are also some possible variations on the generic approach above that can help in specific cases, e.g.:

  • only pack a subset of fields and manually allocate the others when needed
  • reorder the fields for better locality
  • reserve storage in each field instead of in the top level (e.g. create a second packing structure to replace D, containing both *C and the storage for C, and include this structure in the top level storage structure; sketched below).
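As an illustration of the last variation (names are illustrative), D can be replaced in the storage struct by a packing structure that carries the storage for the C it points to:

// packedD embeds D and reserves the storage for the C that D points to.
type packedD struct {
	D
	_c C
}

In the top-level storage struct, a single _d packedD field then replaces the _d D / _d_c C pair, and the wiring becomes a.d = &s._d.D and a.d.c = &s._d._c.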

For the record, I also suggested to implement this (as an option) in gogoproto: gogo/protobuf#555.

Upsides

  • Better memory locality (as all fields are contiguous to each other)
  • Better cache utilization (this depends on your access patterns, it is possible that in certain cases cache utilization will be worse)
  • Lower number of memory allocations, and therefore lower CPU usage
  • The mechanism is generic, and can be used not just in case of protobuf messages

Downsides

Are there any downsides to this approach, or can it be safely used in all cases? Glad you asked!

There are some downsides (suggestions welcome!):

  • You need to decide at compile-time which fields to allocate; if at runtime you then don't need all of them, you will be wasting the memory for the fields you are not using.
  • Related to the previous point: because all fields are allocated in a single object, as long as any field is alive all the other fields can not be garbage collected.
  • This approach only reduces the number of allocations, not the amount of memory allocated.

None of this is a deal-breaker, but depending on your workload and use cases it may cause memory consumption to increase.

Closures

If you create closures on a hot path and the closure captures more than one variable from its enclosing scope, each captured variable escapes to the heap and therefore requires a dedicated allocation. This means that a closure with N captures will normally require N+1 allocations.

func foo(a *bar, b int) {
	// This requires 3 allocations: 1 for the closure itself, and 2 for the
	// captured variables a and b.
	go func() {
		doSomething(a, b)
	}()
}

You can cut this down to 2 allocations by grouping the captured variables in a temporary struct, and capturing the struct instead:

func foo(a *bar, b int) {
	// This requires 2 allocations: 1 for the closure itself, and 1 for the
	// captured variable x (regardless of how many fields the structure contains).
	x := struct{a *bar; b int}{a, b}
	go func() {
		doSomething(x.a, x.b)
	}()
}

Slices of pointers

You can use something like the following generic function to reduce the number of allocations of a []*T from N+1 down to 2:

// NewPtrSlice allocates a slice of n *T, each pointing to
// a newly-allocated T.
//
// It performs at most 2 allocations, regardless of n.
func NewPtrSlice[T any](n int) []*T {
  // could also be just
  // return GrowPtrSlice[T](nil, n)
  s := make([]*T, n)
  p := make([]T, n)
  for i := range s {
    s[i] = &p[i]
  }
  return s
}

// GrowPtrSlice appends n elements to the slice, each 
// pointing to a newly-allocated T.
// The resulting slice has length equal to len(s)+n.
//
// It performs at most 2 allocations, regardless of n.
func GrowPtrSlice[T any](s []*T, n int) []*T {
  s = slices.Grow(s, n)
  p := make([]T, n)
  for i := 0; i < n; i++ {
    s = append(s, &p[i])
  }
  return s
}

The important thing to consider is that as long as any pointer to any element of the p slice remains alive, the whole p slice will remain alive (and therefore so will all of the elements of p). Use this approach only when you know that the lifetime of all elements of the slice is approximately the same.

Preallocate slices

Use slices.Grow (or manually preallocate slices using make([]T, 0, N)) whenever you can cheaply predict how many entries the slice will contain. This prevents repeated slice growth (allocation+copy).

Use the lower-bound capacity allocation idiom (s := append([]T(nil), make([]T, N)...)) whenever your application logic does not depend on the exact slice capacity for correctness.
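A sketch of both patterns (the names, element types, and n are illustrative):

// Preallocating when the number of entries is cheaply predictable:
ids := make([]int64, 0, len(users)) // capacity hint avoids repeated growth
for _, u := range users {
	ids = append(ids, u.ID)
}

// Lower-bound capacity idiom: the slice has length n, and the runtime is free
// to round the capacity up to the next size class, so no space is wasted.
buf := append([]byte(nil), make([]byte, n)...)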

Preallocate maps

Provide a size hint when allocating maps (with make(map[...]..., N)) whenever you can cheaply predict how many entries the map will contain. This prevents repeated map growth (allocation+copy).
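For example (a sketch; names are illustrative):

// The size hint lets the runtime allocate enough buckets up front,
// avoiding repeated map growth while the map is being populated.
index := make(map[string]int, len(lines))
for i, line := range lines {
	index[line] = i
}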

Shrink slices

Re-slicing an existing slice does not allow the garbage collector to collect the part of the backing array that has been left out, nor any object pointed to by the slice elements that have been left out. Consider the following:

s := []*Foo{
	&Foo{ID: 1},
	&Foo{ID: 2},
}
// s contains two pointers, each pointing to an instance of Foo.

s = s[0:1]
// s is now equivalent to []*Foo{ &Foo{ID: 1} }, but it is still referring
// to the same backing array

Intuitively one may expect the second Foo object (ID:2) to be garbage collectible at this point, but this is not the case. As long as the backing array of s is not collected, and as long as the second pointer is not somehow overwritten, the second Foo object will remain alive.

The simplest and most general way to ensure that this does not happen is to use something like slices.Clone to obtain a new shallow copy of the slice, with both minimal capacity and no old values being kept alive.
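Continuing the example above (a sketch):

s = slices.Clone(s[:1])
// s now has its own minimal backing array; the old backing array, and the
// Foo with ID 2 (assuming nothing else references it), become collectible.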

This is especially important if you need to keep a slice alive for an extended period of time, even if no re-slicing was performed (during slice construction the slice may have been overallocated, e.g. by append). Cloning the slice as suggested above ensures that the slice uses as little memory as possible.

Shrink maps

Maps currently are never shrunk by the runtime when entries are removed, so their memory usage is tied to the peak number of entries that are added.

If you have a long-lived map whose size fluctuates over time, it can be a good idea to periodically use maps.Clone to create a shallow copy of the original map, replacing it, and allowing the original map (and its peak memory usage) to be garbage collected. This also applies if you have a map created without a precise size hint that never has keys added/removed once it has been initialized: in this case, using maps.Clone immediately after initialization has completed will almost certainly yield a map that consumes less memory than the original one.
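A sketch (sessions is a hypothetical long-lived map):

// Replace the map with a clone sized for its current contents, so the
// buckets that backed its peak size can be garbage collected.
sessions = maps.Clone(sessions)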

Design APIs that minimize allocations

  • When your function returns a slice, it should normally also accept a slice as an argument that, if provided, will be appended to instead of allocating a new slice.
  • A similar point applies in general whenever you return a newly-allocated object *T: your function should also accept a *T argument that, if not nil, is used instead of allocating a new T (see the sketch below).
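A sketch of both patterns (Message, its fields, and the function names are hypothetical):

// AppendEncode appends the encoded form of m to dst and returns the extended
// slice; callers that already have a buffer with spare capacity avoid a new
// allocation entirely.
func (m *Message) AppendEncode(dst []byte) []byte {
	dst = append(dst, m.header...)
	dst = append(dst, m.body...)
	return dst
}

// Decode decodes data into dst if dst is not nil, and only allocates a new
// Message when dst is nil.
func Decode(data []byte, dst *Message) (*Message, error) {
	if dst == nil {
		dst = &Message{}
	}
	// ... decode data into dst (omitted) ...
	return dst, nil
}

This mirrors the append-style functions in the standard library (e.g. strconv.AppendInt), which let callers reuse buffers across calls.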

Use smaller types

If you can use smaller types, you can reduce the amount of memory allocated per allocation, e.g.

  • int8 instead of int
  • byte or rune instead of a string that is always one character long
  • an N-byte array (not slice!) [N]byte containing the binary representation of numbers, UUIDs, or IPs, instead of a string representing the same value in human-readable form

Small strings (up to 16 bytes on 64-bit platforms) are more compactly represented as an N-byte array (not a slice!) [N]byte than as a string.
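For example (a sketch; field and type names are illustrative):

// Larger representation: each field is a string header plus heap-allocated bytes.
type recordLarge struct {
	kind string // always a single character
	id   string // a UUID in its textual form
}

// Smaller representation of the same data, stored inline with no extra allocations:
type recordSmall struct {
	kind byte     // the single character
	id   [16]byte // the UUID in binary form
}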

Reorder fields in structs

Space may be wasted in structs in some cases (most often, and simplifying, because smaller fields are placed before larger fields). You can use static analyzers such as structslop to detect this. Note though that reordering fields may cause other performance issues (e.g. in case the reordering makes fields that are frequently set/cleared together not contiguous, or in case they are made to straddle a cache line boundary).
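For example, on a 64-bit platform (a sketch):

// 24 bytes: 7 bytes of padding after a and 7 after c.
type padded struct {
	a bool
	b int64
	c bool
}

// 16 bytes: the two bools fit in the space after b.
type packed struct {
	b int64
	a bool
	c bool
}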
