musl mallocng

Introduced in musl v1.2.1. Design goal: "Strong hardening against memory usage errors by the caller, including detection of overflows, double-free, and use-after-free, and does not admit corruption of allocator state via these errors."

context

// Singleton: ctx
struct malloc_context {
  uint64_t secret;
#ifndef PAGESIZE
  size_t pagesize;
#endif
  // When alloc_meta is invoked for the first time, initialize secret and pagesize.
  int init_done;
  unsigned mmap_counter;
  struct meta *free_meta_head;
  // [avail_meta, avail_meta+avail_meta_count) are available meta objects.
  struct meta *avail_meta;
  // When a new page is allocated, (4096-sizeof(struct meta_area))/sizeof(struct meta) meta objects become newly available.
  size_t avail_meta_count, avail_meta_area_count, meta_alloc_shift;
  struct meta_area *meta_area_head, *meta_area_tail;
  // [avail_meta_areas,avail_meta_areas+avail_meta_area_count) are available meta areas.
  unsigned char *avail_meta_areas;
  // Doubly linked meta lists, one per size class.
  struct meta *active[48];
  size_t usage_by_class[48];
  uint8_t unmap_seq[32], bounces[32];
  uint8_t seq;
  // [initial brk(0)+pagesize, brk) stores available meta areas.
  uintptr_t brk;
};
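
ctx.active[sc] heads a circular doubly linked list of meta objects for size class sc, from which allocations are served. A minimal sketch of inserting a meta into such a list (illustrative only, not musl's exact queue()):

// Illustrative sketch: insert m at the tail of the circular doubly linked
// list headed by *phead (e.g. &ctx.active[sc]). Not musl's exact code.
static void queue_meta(struct meta **phead, struct meta *m)
{
  if (!*phead) {
    m->prev = m->next = m;       // first element points to itself
    *phead = m;
  } else {
    struct meta *head = *phead;
    m->next = head;              // insert just before the head, i.e. at the tail
    m->prev = head->prev;
    head->prev->next = m;
    head->prev = m;
  }
}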

meta area

A meta area is a container of meta objects. By default, musl places a guard page at the initial brk(0) (usually the page-aligned _end) and allocates meta area pages after the guard page. Each meta area occupies 4096 bytes. Meta areas form a singly linked list headed by ctx.meta_area_head; ctx.meta_area_tail points to the last area so that new areas can be appended.

struct meta_area {
  // ctx.secret
  uint64_t check;
  struct meta_area *next;
  // (4096-sizeof(struct meta_area))/sizeof(struct meta);
  int nslots;
  struct meta slots[];
};
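
Because each meta object lives inside a page-aligned, page-sized meta area, a meta pointer recovered from in-band data can be validated by rounding it down to the page boundary and comparing check against ctx.secret. Roughly (simplified from what musl's get_meta does):

// Sketch of the out-of-band metadata validation: a forged meta pointer lands
// in a page whose check field does not match ctx.secret and crashes.
static void validate_meta(const struct meta *meta)
{
  const struct meta_area *area = (const void *)((uintptr_t)meta & -4096);
  if (area->check != ctx.secret)
    a_crash();  // musl's abort-on-corruption primitive
}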

meta and group

A group contains a header (in-band metadata) and 1~32 slots for allocation. Each group is paired with a meta object (out-of-band metadata) which references back to the group.

The slot size (stride) is decided by the size class. The number of slots is decided by the size class and its usage.

For a size request of at least MMAP_THRESHOLD (131052 = 2**17-IB-UNIT), the group is individually mmapped and has a single slot.
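
Conceptually, a request below this threshold is assigned the smallest size class whose stride can hold the chunk; the real size_to_class uses a precomputed table plus bit tricks rather than a loop. A rough sketch (size_classes stands in for musl's table of strides measured in UNITs; the constants are the values quoted in the text):

#define UNIT 16
#define IB 4
#define MMAP_THRESHOLD 131052  // 2**17 - IB - UNIT

// Conceptual sketch only, not musl's implementation.
static int conceptual_size_to_class(size_t n, const uint16_t size_classes[48])
{
  if (n >= MMAP_THRESHOLD) return -1;  // individually mmapped, single-slot group
  for (int sc = 0; sc < 48; sc++)
    if ((size_t)size_classes[sc]*UNIT >= n + IB)
      return sc;                       // smallest class whose stride fits n+IB
  return 47;                           // not reached for n < MMAP_THRESHOLD
}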

When a group is allocated, last_idx is set to the number of slots minus 1. The not-in-use slots are partitioned into two bitmasks, avail_mask (active, never-allocated slots) and freed_mask (freed slots plus inactive never-allocated slots). Depending on the size class, they are initialized differently, as sketched after the list below.

  • If size*cnt+UNIT <= pagesize/2 (a small size class), mem->active_idx = last_idx; avail_mask = (2u<<last_idx)-1; freed_mask = 0. The group is nested in a slot of a larger group, and that slot is not freeable: its chunk::idx>>5 is 6, so attempting to free it will assert.
  • If size*cnt+UNIT > pagesize/2 (a large size class), mem->active_idx = min(last_idx, max((4096-UNIT)/size-1,0)); avail_mask = (2u<<mem->active_idx)-1; freed_mask = (2u<<last_idx)-1 - avail_mask. active_idx is chosen so that only the slots covering the first 4096 bytes are active initially; the inactive slots are recorded in freed_mask.
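
A minimal sketch of the two initialization cases above (a restatement of the rules, not musl's alloc_group; size is the stride in bytes, cnt = last_idx+1):

static void init_group_masks(struct meta *m, size_t size, int cnt, size_t pagesize)
{
  int last_idx = cnt - 1;
  int active_idx;
  if (size*cnt + UNIT <= pagesize/2) {
    active_idx = last_idx;                    // small: every slot active
  } else {
    int k = (int)((4096 - UNIT)/size) - 1;    // slots covering the first page
    if (k < 0) k = 0;
    active_idx = k < last_idx ? k : last_idx;
  }
  unsigned avail = (2u << active_idx) - 1;
  m->mem->active_idx = active_idx;
  m->avail_mask = avail;                           // active, never allocated
  m->freed_mask = ((2u << last_idx) - 1) - avail;  // inactive tail
}
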
struct meta {
  struct meta *prev, *next;
  struct group *mem;
  // Two bitmasks representing not-in-use slots. They have no intersection.
  // Initially their union is (2u<<last_idx)-1.
  volatile int avail_mask, freed_mask;
  // The index of the last slot, i.e. the number of slots minus 1.
  uintptr_t last_idx:5;
  uintptr_t freeable:1;
  uintptr_t sizeclass:6;
  uintptr_t maplen:8*sizeof(uintptr_t)-12;
};

struct group {
  struct meta *meta;
  // Slots 0~active_idx are active.
  unsigned char active_idx:5;
  char pad[UNIT - sizeof(struct meta *) - 1];
  // last_idx+1 slots. Slot i starts at storage + stride*i, i.e. at byte offset UNIT + stride*i from the group start.
  unsigned char storage[];
};
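
The stride can be recovered from the meta object alone. A sketch close to musl's get_stride (size_classes is the table of strides in UNITs; the first case covers an individually mmapped single-slot group):

// Derive the slot size in bytes from a meta object. For a one-slot group in
// its own mapping the stride is the whole mapping minus the UNIT occupied by
// the group header; otherwise it comes from the size class table.
static size_t stride_of(const struct meta *g, const uint16_t size_classes[48])
{
  if (!g->last_idx && g->maplen)
    return g->maplen*4096UL - UNIT;
  return UNIT*size_classes[g->sizeclass];
}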

chunk

Each chunk is user data preceded by a 4-byte header ($IB = 4$). For a requested size $n$, the chunk size is $n+4$. The slot size needs to be at least as large as $n+4$.

The canonical placement starts the user data at the slot start (in-slot offset 0) and makes its header overlap the last 4 bytes of the previous slot. (The last 4 bytes of the current slot are likewise reserved for the next chunk's header.) This placement can be used for an allocation with no special alignment requirement.

If the slot is of size $n+UNIT*slack$, the chunk has $slack+1$ candidate in-slot offsets. When a non-zero offset is chosen, the header at the slot start records it: its 16-bit offset field stores the in-slot offset and its idx byte is set to 7<<5 to sabotage an invalid free of the slot-start pointer. The chunk's own header stores the offset to storage, (p-g->mem->storage)/UNIT, in its offset field (*(uint16_t *)(p-2)). The code below is a bit involved, but the idea is to replace a slow modular reduction, off%(slack+1), with a cheap approximation: mask off by pow2ceil(slack+1)-1, then subtract slack+1 once if the result still exceeds slack. The result is not an exact modulo, but it always lands in [0, slack].

  // cycle offset within slot to increase interval to address
  // reuse, facilitate trapping double-free.
  int off = (p[-3] ? *(uint16_t *)(p-2) + 1 : ctr) & 255;
  assert(!p[-4]);
  if (off > slack) {
    size_t m = slack;
    m |= m>>1; m |= m>>2; m |= m>>4;
    off &= m;
    if (off > slack) off -= slack+1;
    assert(off <= slack);
  }
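
A self-contained restatement of that reduction; note that it is not an exact off%(slack+1), it only guarantees a result in [0, slack]:

// For example, with slack=5 the smeared mask m is 7, so off=13 maps to
// 13&7 = 5 (kept, since 5 <= slack), whereas 13%6 would be 1.
static int reduce_off(int off, int slack)
{
  if (off > slack) {
    int m = slack;
    m |= m>>1; m |= m>>2; m |= m>>4;  // m = pow2ceil(slack+1)-1 (for slack < 256)
    off &= m;                         // off mod pow2ceil(slack+1)
    if (off > slack) off -= slack+1;  // fold back into [0, slack]
  }
  return off;
}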

For debugging aid and hardening, musl cycles through the $slack+1$ offsets as the same slot gets reused. By handing out a different address for each reuse, a double free of a stale pointer can be distinguished from a free of the current allocation.

For aligned_alloc, musl raises the alignment to at least UNIT and increases the requested size to n+align-UNIT; this range guarantees an in-slot position that satisfies the alignment. While malloc ensures (p-g->mem->storage)/UNIT fits in 16 bits, rounding p up to the alignment can produce a larger offset. There are two cases (a decoding sketch follows the list):

  • offset <= 0xffff: the 16-bit field *(uint16_t *)(p-2) (chunk::offset) stores the offset, and p[-4] is 0.
  • offset > 0xffff: *(uint16_t *)(p-2) is 0, p[-4] is non-zero, and *(uint32_t *)(p-8) stores the 32-bit offset (the chunk32 layout below).
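
When such a chunk is later looked up, the offset is decoded by checking p[-4] first. A sketch consistent with the two cases above (simplified from musl's get_meta):

// Decode the slot offset (in UNITs) from a chunk header. The group header
// then sits at p - UNIT*off - UNIT, i.e. one UNIT before slot 0.
static size_t decode_offset(const unsigned char *p)
{
  size_t off = *(const uint16_t *)(p-2);
  if (p[-4]) {                         // chunk32 layout from aligned_alloc
    off = *(const uint32_t *)(p-8);    // the 16-bit field is zero in this case
  }
  return off;
}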

A chunk may have some trailing spare bytes, named reserved. If reserved >= 5, the last 5 bytes store a zero canary byte and 32-bit reserved. get_nominal_size (called by malloc and free) checks the zero canary byte and *(uint32_t*)(end-4) to detect out-of-bounds writes.
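
A sketch of how the spare bytes are recorded when a chunk is handed out, derived from the description above (not copied from musl):

// Record min(reserved,5) in the idx byte; when reserved >= 5, spill the exact
// count and a zero canary into the last 5 bytes of the slot.
static void set_spare(unsigned char *p, unsigned char *end, size_t n)
{
  size_t reserved = end - p - n;           // spare bytes after user data
  if (reserved >= 5) {
    *(uint32_t *)(end-4) = reserved;       // exact count in the last 4 bytes
    end[-5] = 0;                           // zero canary checked on free
    reserved = 5;                          // the header field saturates at 5
  }
  p[-3] = (p[-3] & 31) | (reserved << 5);  // keep slot_index, set reserved
}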

struct chunk {
  uint8_t zero;
  // slot_index | (min(reserved,5)<<5)
  // If this belongs to a group nested in a larger group (not freeable), slot_index | 6<<5.
  // If the real chunk sits at a non-zero in-slot offset, the header at the slot start is set to 7<<5 (and its 16-bit field holds the in-slot offset).
  uint8_t idx;
  uint16_t offset; // (p-g->mem->storage) / UNIT
  // If reserved bytes >= 5, end[-5] is 0 and *(uint32_t*)(end-4) stores reserved.
  char p[stride-IB];
};

struct chunk32 {
  uint32_t offset; // (p-g->mem->storage) / UNIT
  uint8_t non_zero;
  uint8_t idx; // slot_index | (min(reserved,5)<<5)
  uint16_t zero;
  // If reserved bytes >= 5, end[-5] is 0 and *(uint32_t*)(end-4) stores reserved.
  char p[stride-IB];
};

free

To free a chunk, extract the offset from the chunk header, compute the group header address from it, then fetch the associated meta object and verify integrity.

If the group has no freed slot yet, or freeing this slot would make every slot either unused or free, take the writer lock and perform a nontrivial free; otherwise, atomically OR the slot's bit into meta::freed_mask with a CAS loop.
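
A sketch of that decision (simplified; wrlock/unlock, nontrivial_free, a_cas and a_crash are musl-internal names used here for illustration):

static void free_slot(struct meta *g, int idx)
{
  uint32_t self = 1u << idx, all = (2u << g->last_idx) - 1;
  for (;;) {
    uint32_t freed = g->freed_mask, avail = g->avail_mask;
    if ((freed | avail) & self) a_crash();     // slot already not-in-use: double free
    if (!freed || ((freed | avail) + self) == all)
      break;                                   // slow path below
    if (a_cas(&g->freed_mask, freed, freed | self) == (int)freed)
      return;                                  // fast path: freed bit published atomically
  }
  // First freed slot in the group, or the group becomes entirely unused/free:
  // it may need to be dequeued, returned to its parent group, or unmapped.
  wrlock();
  nontrivial_free(g, idx);                     // the real code also munmaps any
  unlock();                                    // returned mapping after unlock
}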

Counters

size_t usage_by_class[48]

malloc has a coarse size class heuristic to prefer odd size classes. To be constructed.

uint8_t seq

A sequence number, used together with unmap_seq to decide when a bounce should be counted. Incremented on each heavy event (mmap/munmap).

uint8_t bounces[32]

For size classes 24 and up, the program may get into a situation where the same group is mapped and unmapped over and over again. The bounce counter (incremented when mmap is needed) cuts off perpetual mmap/munmap cycling by keeping fully-free groups live if we expect them to be reused in the near future.

A bounce counter decays when a slot is allocated while m->avail_mask is 0 (the allocation has to fall back to freed or newly activated slots). This periodically probes whether retention of fully-free groups is still needed.

uint8_t unmap_seq[32]

Used to decide when to count bounces: a bounce is counted only when few heavy events have intervened since the last unmap in the same size class.

When ctx.seq reaches 255, all unmap_seq entries are reset to 0 and ctx.seq restarts at 1.

record_seq: unmap_seq[sc] is set to ctx.seq when a group of size class sc is freed (unmapped).
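
The three counters interact roughly as follows (an illustrative sketch; the function names and the BOUNCE_WINDOW threshold are assumptions, not musl's exact code):

#define BOUNCE_WINDOW 10  // assumption: what counts as "few" heavy events

static void heavy_event(void)                 // an mmap or munmap happened
{
  if (ctx.seq == 255) {                       // wrap: forget history, restart at 1
    for (int i = 0; i < 32; i++) ctx.unmap_seq[i] = 0;
    ctx.seq = 1;
  } else {
    ctx.seq++;
  }
}

static void group_unmapped(int sc)            // record_seq
{
  if (sc < 32) ctx.unmap_seq[sc] = ctx.seq;
}

static void group_mapped_again(int sc)        // count a bounce if the remap
{                                             // closely follows the unmap
  if (sc < 32 && ctx.unmap_seq[sc]
      && ctx.seq - ctx.unmap_seq[sc] < BOUNCE_WINDOW
      && ctx.bounces[sc] < 255)
    ctx.bounces[sc]++;
}

static void alloc_with_empty_avail_mask(int sc)  // decay: periodically re-probe
{                                                // whether retention is needed
  if (sc < 32 && ctx.bounces[sc]) ctx.bounces[sc]--;
}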

Characteristics

  • malloc fast path (avail_mask!=0): reader lock+atomic cas
  • malloc slow path: writer lock
  • free fast path (freed_mask!=0 && (avail_mask|freed_mask|self)!=all, where self is the slot's bit): atomic cas
  • free slow path: writer lock

Currently, reader-writer lock (upgradelock) is not implemented.

Debugging aid and hardening:

  • malloc after free: first-fit effects, similar to glibc (LIFO in tcache/fastbin; FIFO in the unsorted bin). However, randomness could be added cheaply by changing the current lowest-set-bit slot selection. Another way is to add headroom and reuse the non-zero-offset enframing code.
  • The main metadata is stored out of band. An out-of-bounds write can forge the in-band header of the next slot, but that is hard to turn into anything more harmful.
  • double free/invalid free: guaranteed crash due to rigorous verification. In glibc, a double free of a non-top fastbin chunk is not detected (only the top chunk triggers double free or corruption (fasttop)); other bins have more checks.
  • For a group allocated inside another group's slot, its group header is accessible. musl uses an invalid reserved (6<<5) to sabotage free attempts.
  • No malloc_set_zero_contents/malloc_set_pattern_fill_contents (scudo). Data leak from freed memory is possible.
  • No __malloc_hook or __free_hook (exploit targets in glibc).