@dannas
Created April 23, 2020 16:12

Notes for Adhoc event April 23rd

Adhoc: a virtual event to explore system engineering

https://adhoc.community/#

Why mmap is faster than system calls

Questions for the talk:

  • From the MDB post: some ops benefit from mmap at the expense of others. Any more insight here?
  • How do you deal with I/O errors? They cause SIGSEGV, right?
  • Did you have any problems with network drives? They may give SIGBUS instead of SIGSEGV.
  • Does WiredTiger support Windows? Any problems using MapViewOfFile instead of mmap? I've heard MapViewOfFile takes a lock?
  • Do you run WiredTiger in 32-bit configurations? Any problems finding contiguous regions due to address space fragmentation?
  • Have you had any problems with page flushing in the kernel? How can you be sure the data written reaches the disk?

Getting storage engines ready for fast storage devices

  • HDD => SSD improved performance 10x
  • SATA => PCIe improved performance 2x
  • Improvements/innovations in SSDs
  • Intel Optane
  • Less wait time means software stack overhead starts to dominate
  • One solution: specialized file systems like SplitFS, but they are neither battle-tested nor portable
  • Instead: Use mmap and batched syscalls in MongoDB storage engine WiredTiger
  • 63% improvement for 19 out of 65 benchmarks on mainstream SSDs
  • mmap means putting N pages of file content into the buffer cache and mapping virtual addresses to them
  • Access to mmapped pages falls into three scenarios
    1. phy page in buffer cache and translation in TLB: no kernel involvement
    2. phy page in buffer cache, but translation not in TLB: transition into kernel, walk the page table, install the entry
    3. phy page not in buffer cache: trap into the OS, ask the fs to fetch the page, set up the page table entry and proceed as in 2.
  • mmap is faster due to fewer kernel transitions and because it can use AVX
  • Appending to a file means metadata writes. Lots of overhead
  • Amortize that cost by pre-allocating file space => batch fs ops to reduce their overhead
  • First prototype: fixed mmap size. If shrink/grow: fall back to system calls
  • Resizing problem: two threads where (1) copies from the mapping and (2) truncates the file. Then (1) would cause a segfault
  • Solution: use a lock when accessing the file
  • But that would serialize I/O.
  • So instead, separate threads into writers (writes beyond EOF, fallocate or truncate) and readers.
  • While writer active: route I/O to system calls which are properly synchronized
  • Do it without locking, via a solution inspired by RCU (see the C sketch after the pseudocode below)
    1. mmap_resizing: writer sets this flag before op
    2. mmap_use_count: reader increments. writer waits until zero
  • Writer pseudocode, executed before a write op:
wait:
	while mmap_resizing
		spin_backoff(...)
	if !cas(mmap_resizing, 0, 1)	# lost the race to claim the flag, retry
		goto wait
	while mmap_use_count		# wait for in-flight readers to drain
		spin_backoff(...)
	do_write(...)
	unmap(file)
	remap(file, ...)
	mmap_resizing = 0

  • Reader pseudocode (writers need something similar to check whether they can use mmap):
atomic_add(mmap_use_count, 1)
if mmap_resizing
	atomic_decr(mmap_use_count, 1)
	read_syscall(...)
else
	memcpy(dst_buf, mapped_buffer, ...)
	atomic_decr(mmap_use_count, 1)
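
  • A compilable C11 sketch of the same protocol. The flag and counter names come from the talk; the function names and plain spin loops are my own assumptions:

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

atomic_bool mmap_resizing;	/* set while a writer resizes the mapping */
atomic_int  mmap_use_count;	/* readers currently using the mapping */

/* Writer side: claim the flag, drain readers, then resize. */
void resize_and_write(void)
{
	bool expected;
	do {
		while (atomic_load(&mmap_resizing))
			;	/* spin_backoff */
		expected = false;
	} while (!atomic_compare_exchange_weak(&mmap_resizing, &expected, true));
	while (atomic_load(&mmap_use_count))
		;	/* spin_backoff: wait for in-flight readers */
	/* do_write(...); unmap(file); remap(file, ...); */
	atomic_store(&mmap_resizing, false);
}

/* Reader side: returns false if the caller must fall back to a read() syscall. */
bool read_via_mmap(char *dst, const char *mapped_buffer, size_t len)
{
	atomic_fetch_add(&mmap_use_count, 1);
	if (atomic_load(&mmap_resizing)) {
		atomic_fetch_sub(&mmap_use_count, 1);
		return false;
	}
	memcpy(dst, mapped_buffer, len);
	atomic_fetch_sub(&mmap_use_count, 1);
	return true;
}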

Why mmap is faster than system calls

  • Copy experiment with read and mmap syscalls (a code sketch follows this list): 1. Read to a buffer in userspace. 2. Copy from an mmapped buffer
  • The kernel and userspace both use virtual addresses, but neither can safely dereference the other's
  • If the kernel dereferenced userspace pointers directly, a NULL ptr would crash the kernel
  • If userspace could touch kernel memory => many safety problems
  • mmap means that userspace gets virtual addresses that point into kernel buffers
  • Read traps explicitly for each syscall and implicitly for page faults
  • Read needs to copy memory from kernel to userspace
  • The kernel tries to read from the buffer cache. If the page is not there, a page fault is generated and the fs fetches it
  • The same number of page faults is generated for read and mmap.
  • Many more traps for read due to crossing the syscall boundary.
  • Result: mmap is 2-6x faster than read, but syscall overhead wasn't the reason
  • read: 60% of time spent in copy_to_user
  • mmap: 60% of time spent in memmove
  • mmap uses memmove, which is much faster than copy_to_user due to AVX
  • Reason: copy_to_user can't use vector instructions because of the overhead of saving extra architectural state
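
  • A minimal sketch of the experiment as I understand it (file name, chunk size and the lack of error handling are my own simplifications):

#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define CHUNK (1 << 20)

int main(void)
{
	int fd = open("data.bin", O_RDONLY);
	struct stat st;
	fstat(fd, &st);
	char *buf = malloc(CHUNK);

	/* Strategy 1: read() - one explicit trap per chunk, copy_to_user inside the kernel. */
	while (read(fd, buf, CHUNK) > 0)
		;	/* consume buf */

	/* Strategy 2: mmap() - pages fault in implicitly, the copy is a plain memcpy/memmove. */
	char *map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
	for (off_t off = 0; off < st.st_size; off += CHUNK) {
		size_t len = st.st_size - off < CHUNK ? (size_t)(st.st_size - off) : CHUNK;
		memcpy(buf, map + off, len);	/* consume buf */
	}

	munmap(map, st.st_size);
	free(buf);
	close(fd);
	return 0;
}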

Making safe memory reclamation a feature of your memory allocator

Making Lockless Synchronization Fast: Performance Implications of Memory Reclamation (Hart, McKenney, and Brown, 2006)

  • Memory reclamation problem: for lock-free dynamic data structures, when is it safe to delete an object?
  • Blocking strategies: Quiescent State Based Reclamation (QSBR) and Epoch Based Reclamation (EBR), both of which use a grace period
  • A grace period is a time interval [t0, t1] such that after t1 all nodes removed before t0 can be reclaimed
  • QSBR uses quiescent states to detect grace periods.
  • A quiescent state is a time period for a thread T where T holds no references to shared nodes
  • The OS uses voluntary context switches as quiescent states
  • QSBR can block if one thread does not go into a quiescent state
  • EBR uses global epochs but I don't understand the details of it
  • Non-blocking: Hazard Pointer Based Reclamation (HPBR) and Lock-Free Reference Counting (LFRC)
  • HPBR: a reader sets a hazard ptr to the node it is accessing. If a node has been removed and no hazard ptr points to it, it can be reclaimed. I don't understand when the node ptr is cleared (the sketch after this list shows one convention).
  • For LFRC, threads track the number of references to nodes. Nodes are reclaimed when their count drops to zero.
  • QSBR: Usually fastest, but can suffer in the face of thread preemption
  • Blocking reclamation can be useful with non-blocking algorithms
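
  • A minimal single-slot hazard pointer sketch in C. This is my own illustration of the HPBR idea, not code from the paper; real implementations keep one hazard slot per thread plus a retire list:

#include <stdatomic.h>
#include <stdlib.h>

struct node { int value; _Atomic(struct node *) next; };

_Atomic(struct node *) hazard_ptr;	/* one slot; real impls: one per thread */

/* Reader: publish the pointer before dereferencing, re-check it is still current. */
struct node *protect(_Atomic(struct node *) *src)
{
	struct node *p;
	do {
		p = atomic_load(src);
		atomic_store(&hazard_ptr, p);
	} while (p != atomic_load(src));	/* the node may have been swapped out meanwhile */
	return p;
}

/* Reader: clear the slot when done with the node. */
void release(void)
{
	atomic_store(&hazard_ptr, NULL);
}

/* Reclaimer: a node already unlinked from the structure may be freed
 * only if no hazard pointer covers it. */
void try_reclaim(struct node *removed)
{
	if (atomic_load(&hazard_ptr) != removed)
		free(removed);
	/* else: keep it on a retire list and retry later */
}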

The Slab Allocator: An Object-Caching Kernel Memory Allocator (Bonwick, 1994)

  • Often the cost of initializing/destroying an object exceeds the cost of allocating and freeing memory for it
  • Object caching: preserve the invariant portion of an object - its constructed state - between uses
  • High level design of object cache
    • allocate: if obj in cache: take it else alloc mem and construct
    • free: return to the cache (no destruction)
    • reclaim memory: take some obj from cache. Destroy the objs. Free underlying mem
  • Why put the object cache in the central allocator?
    • Knowledge of who needs mem
    • better accounting and debugging
    • smaller code size with 1 obj cache instead of N
  • Interface
struct kmem_cache *kmem_cache_create(char *name, size_t size, int align,
	void (*constructor)(void *, size_t), void (*destructor)(void *, size_t));
void kmem_cache_destroy(struct kmem_cache *cp);

void *kmem_cache_alloc(struct kmem_cache *cp, int flags);
void kmem_cache_free(struct kmem_cache *cp, void *buf);
  • Example client usage
void foo_constructor(void *buf, size_t size) {
	struct foo *foo = buf;
	mutex_init(&foo->foo_lock, ...);
	cv_init(&foo->foo_cv, ...);
	foo->foo_refcnt = 0;
	foo->foo_barlist = NULL;
}

void foo_destructor(void *buf, size_t size) {
	struct foo *foo = buf;
	assert(foo->foo_barlist == NULL);
	assert(foo->foo_refcnt == 0);
	cv_destroy(&foo->foo_cv);
	mutex_destroy(&foo->foo_lock);
}

void usage() {
	struct kmem_cache *foo_cache = kmem_cache_create("foo_cache",
		sizeof(struct foo), 0, foo_constructor, foo_destructor);
	struct foo *foo = kmem_cache_alloc(foo_cache, KM_SLEEP);
	use(foo);
	kmem_cache_free(foo_cache, foo);
}
  • The allocation of foo will be fast, since the allocator will usually just fetch an already-constructed foo from the cache.
  • foo_constructor and foo_destructor will only be called to populate and drain the cache
  • positive side-effect: smaller icache footprint, since ctor/dtor are so seldom called
  • The slab impl (backend) consists of kmem_cache_grow and kmem_cache_reap
  • Backend driven by memory pressure
  • Slab allocator consists of many independent caches that have individual locks
  • Each cache maintains its own statistics
  • Standard non-caching alloc routines, kmem_alloc/kmem_free, use object caches internally
  • Create ~30 kmem caches at startup, ranging from 8 bytes to 9K
  • Slab = one or more pages of virtually contiguous mem, carved into equal-size chunks, with a refcount (sketched below)
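
  • A sketch of that slab bookkeeping (field names are my assumption, loosely after the paper):

struct kmem_slab {
	struct kmem_slab *next, *prev;	/* linkage in the cache's slab list */
	void *base;		/* first buffer of the slab's page(s) */
	void *freelist;		/* free buffers within this slab */
	int refcount;		/* buffers currently allocated from this slab */
};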

... To be continued

Magazines and Vmem: Extending the Slab Allocator to Many CPUs and Arbitrary Resources (Bonwick and Adams, 2001)

subr_smr.c

smr.h

A 15-minute history of task isolation on Linux

Patch series: Task isolation mode

LWN: A full task-isolation mode for the kernel

LWN: Dropping the timer tick - for real this time

SUSE Linux's shielding Linux resources

Jitmap, an execution engine for bitmap expressions

Bitmap Index Design and Evaluation (Chan and Ioannidis, 1998)

The Jamming section of A Catalogue of Optimizing Transformations (Allen and Cocke, 1971)

jitmap code

Faster Population Counts using AVX2 instructions (Mula, Kurz and Lemire, 2018)

LLVM Tree-Height-Reduction optimization pass
