Questions from the MDB post:
- Q: Some ops benefit from mmap at the expense of others: any more insight here?
- Q: How do you deal with I/O errors? They cause SIGSEGV, right?
- Q: Did you have any problems with network drives? They may give SIGBUS instead of SIGSEGV
- Q: Does WiredTiger support Windows? Any problems using MapViewOfFile instead of mmap? I've heard MapViewOfFile takes a lock?
- Q: Do you run WiredTiger in 32-bit configurations? Any problems finding contiguous regions due to address space fragmentation?
- Q: Have you had any problems with page flushing in the kernel? How can you be sure the data written reaches the disk?
- HDD => SSD improved performance 10x
- SATA => PCIe improved performance 2x
- Improvements/innovations in SSDs:
- Intel Optane
- Less wait time means software stack overhead starts to dominate
- One solution: specialized file systems like SplitFS, but these are neither battle-tested nor portable
- Instead: Use mmap and batched syscalls in MongoDB storage engine WiredTiger
- 63% improvement for 19 out of 65 benchmarks on mainstream SSDs
- mmap means: put N pages of file content into the buffer cache and map virtual addresses to them
- Access to mmapped pages falls into three scenarios:
    1. phys page in buffer cache and mapping in TLB: no kernel involvement
    2. phys page in buffer cache, but mapping not in TLB: trap into the kernel, walk the page table, install the TLB entry
    3. phys page not in buffer cache: trap into the OS, ask the fs to fetch the page, set up the page table entry, then proceed as in 2
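- Minimal sketch of mapping a file and touching its pages (hypothetical file name, error handling trimmed):
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("data.bin", O_RDONLY);   /* hypothetical input file */
        struct stat st;
        fstat(fd, &st);
        /* map the whole file; pages enter the buffer cache lazily */
        char *buf = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (buf == MAP_FAILED) return 1;
        long sum = 0;
        for (off_t i = 0; i < st.st_size; i++)
            sum += buf[i];   /* first touch of a page is scenario 3, later touches 1 or 2 */
        printf("%ld\n", sum);
        munmap(buf, st.st_size);
        close(fd);
        return 0;
    }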
- Mmap is faster due to fewer kernel transitions, and its copy path can use AVX
- Appending to a file means metadata writes: lots of overhead
- Amortize that cost by pre-allocating file space => batch fs ops to reduce their overhead (sketch below)
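- Sketch of the pre-allocation idea (chunk size made up): grow the file in large steps so most appends hit already-allocated blocks:
    #include <fcntl.h>

    #define CHUNK (64 * 1024 * 1024)   /* made-up pre-allocation step */

    /* Ensure the file has allocated space for [0, needed): allocate in big
     * chunks, so most appends avoid a per-write fs metadata operation. */
    static int ensure_capacity(int fd, off_t needed, off_t *capacity) {
        if (needed <= *capacity)
            return 0;                   /* common case: no fs metadata op */
        off_t new_cap = ((needed / CHUNK) + 1) * CHUNK;
        if (posix_fallocate(fd, 0, new_cap) != 0)
            return -1;
        *capacity = new_cap;
        return 0;
    }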
- First prototype: fixed mmap size. If shrink/grow: fall back to system calls
- Resizing problem: with two threads, where (1) copies from the mapped buffer and (2) truncates the file, thread (1) would segfault
- Solution: use a lock when accessing the file
- But that would serialize I/O.
- Instead, separate threads into writers (ops that resize the file: writes beyond EOF, fallocate, or truncate) and readers
- While a writer is active: route I/O through system calls, which are properly synchronized
- Do it without locking, via a solution inspired by RCU
- mmap_resizing: writer sets this flag before op
- mmap_use_count: readers increment it; the writer waits until it drops to zero
- Writer pseudocode, executed around a resizing write op:
    wait:
        /* wait until no other writer is resizing */
        while mmap_resizing
            spin_backoff(...)
        /* try to claim the flag atomically; if we lose the race, retry */
        if !cas(&mmap_resizing, 0, 1)
            goto wait
        /* wait until all readers have drained out of the mapped buffer */
        while mmap_use_count
            spin_backoff(...)
        do_write(...)
        unmap(file)
        remap(file, new_size)
        mmap_resizing = 0
- Reader pseudocode (writers that don't resize the file must do the same check before using mmap):
    atomic_add(&mmap_use_count, 1)
    if mmap_resizing
        /* a resize is in flight: back out and take the syscall path */
        atomic_decr(&mmap_use_count, 1)
        read_syscall(...)
    else
        memcpy(dst_buf, mapped_buffer, ...)
        atomic_decr(&mmap_use_count, 1)
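- A runnable C11 sketch of the same protocol (the actual I/O calls are stand-ins; relies on seq_cst atomics for the flag/count handshake):
    #include <stdatomic.h>
    #include <sched.h>

    static atomic_int mmap_resizing;    /* 1 while a writer resizes + remaps */
    static atomic_int mmap_use_count;   /* readers currently inside the mapping */

    static void spin_backoff(void) { sched_yield(); }

    /* Writer: claim the flag, drain readers, resize and remap, release. */
    void resize_file(void) {
        for (;;) {
            while (atomic_load(&mmap_resizing))
                spin_backoff();
            int expected = 0;
            if (atomic_compare_exchange_strong(&mmap_resizing, &expected, 1))
                break;                  /* we own the flag now */
        }
        while (atomic_load(&mmap_use_count))
            spin_backoff();             /* wait for readers to drain */
        /* ... do_write(), munmap(), mmap() at the new size ... */
        atomic_store(&mmap_resizing, 0);
    }

    /* Reader: announce ourselves first, then re-check the flag. */
    void read_data(void) {
        atomic_fetch_add(&mmap_use_count, 1);
        if (atomic_load(&mmap_resizing)) {
            atomic_fetch_sub(&mmap_use_count, 1);
            /* ... fall back to pread() ... */
        } else {
            /* ... memcpy() from the mapped buffer ... */
            atomic_fetch_sub(&mmap_use_count, 1);
        }
    }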
- Copy experiment with read and mmap: 1. read() into a userspace buffer. 2. memcpy from an mmapped buffer (sketch below)
- Kernel and userspace both use virtual addresses, but neither side's addresses are directly reachable from the other
- If the kernel used userspace memory directly, a NULL ptr would crash the kernel
- If userspace could use kernel memory => many safety problems
- mmap means that userspace gets virtual addresses that point into kernel buffers
- Read traps explicitly for each syscall and implicitly for page faults
- Read needs to copy memory from kernel to userspace
- The kernel tries to read from the buffer cache; if the page is not there, a page fault is generated and the page is read in
- The same number of page faults is generated for read and mmap
- Many more traps for read due to crossing the syscall boundary.
- Result: mmap is 2-6x faster than read, but syscall overhead wasn't the reason
- read: 60% of time spent in copy_to_user
- mmap: 60% of time spent in memmove
- mmap uses memmove, which is much faster than copy_to_user thanks to AVX
- Reason: copy_to_user can't use vector instructions, due to the overhead of saving the extra architectural state in the kernel
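- Sketch of that copy experiment (error handling and timing trimmed; assumes size is a multiple of the block size):
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define BLK (64 * 1024)

    /* Variant 1: read() copies buffer-cache pages into dst via copy_to_user. */
    static void copy_with_read(int fd, char *dst, size_t size) {
        for (size_t off = 0; off < size; off += BLK)
            pread(fd, dst + off, BLK, (off_t)off);
    }

    /* Variant 2: map the file, then memcpy straight out of the buffer cache
     * (the AVX-capable memmove path). */
    static void copy_with_mmap(int fd, char *dst, size_t size) {
        char *src = mmap(NULL, size, PROT_READ, MAP_SHARED, fd, 0);
        if (src == MAP_FAILED)
            return;
        for (size_t off = 0; off < size; off += BLK)
            memcpy(dst + off, src + off, BLK);
        munmap(src, size);
    }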
Making Lockless Synchronization Fast: Performance Implications of Memory Reclamation (Hart, McKenney, and Brown, 2006)
- Memory reclamation problem: for lock-free dynamic data structures, when is it safe to delete an object?
- Blocking strategies: Quiescent State Based Reclamation (QSBR) and Epoch Based Reclamation (EBR), both of which use grace periods
- A grace period is a time interval [t0, t1] such that after t1 all nodes removed before t0 can be reclaimed
- QSBR uses quiescent states to detect grace periods.
- A quiescent state is a time period for a thread T where T holds no references to shared nodes
- The OS uses voluntary context switches as quiescent states
- QSBR can block if one thread does not go into a quiescent state
- EBR uses global epochs: each thread enters the current epoch at the start of an operation; the global epoch advances once all active threads have seen it, and nodes removed two epochs back can be reclaimed
- Non-blocking: Hazard Pointer Based Reclamation (HPBR) and Lock-Free Reference Counting (LFRC)
- HPBR: before dereferencing a node, a thread publishes a hazard pointer to it, and clears the pointer once it no longer holds a reference (e.g. after moving on to the next node). A removed node can be reclaimed only when no thread's hazard pointer points to it
- For LFRC, threads track the number of references to each node. Nodes are reclaimed when their count drops to zero
- QSBR: Usually fastest, but can suffer in the face of thread preemption
- Blocking reclamation can be useful with non-blocking algorithms
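- Minimal QSBR grace-period sketch (fixed thread count; illustrative, not the paper's implementation):
    #include <stdatomic.h>
    #include <sched.h>

    #define NTHREADS 4

    /* Each thread bumps its counter whenever it holds no shared references. */
    static atomic_ulong quiescent_ctr[NTHREADS];

    void quiescent_state(int tid) {
        atomic_fetch_add(&quiescent_ctr[tid], 1);
    }

    /* Grace period: wait until every thread has passed a quiescent state
     * since we started waiting; nodes removed before the call are then safe
     * to reclaim. Blocks forever if one thread never goes quiescent. */
    void synchronize(void) {
        unsigned long snap[NTHREADS];
        for (int i = 0; i < NTHREADS; i++)
            snap[i] = atomic_load(&quiescent_ctr[i]);
        for (int i = 0; i < NTHREADS; i++)
            while (atomic_load(&quiescent_ctr[i]) == snap[i])
                sched_yield();
    }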
- Often the cost of initializing/destroying an object exceeds the cost of allocating and freeing memory for it
- Object caching: preserve the invariant portion of an object (its constructed state) between uses
- High level design of object cache
- allocate: if obj in cache: take it else alloc mem and construct
- free: return to the cache (no destruction)
- reclaim memory: take some objs from the cache, destroy them, free the underlying mem (sketch below)
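- Illustrative freelist version of that design (hypothetical names; not Bonwick's actual implementation):
    #include <stdlib.h>

    struct obj_cache {
        void **freelist;          /* stack of constructed, unused objects */
        int nfree, cap;
        size_t objsize;
        void (*ctor)(void *);
        void (*dtor)(void *);
    };

    void *cache_alloc(struct obj_cache *c) {
        if (c->nfree > 0)
            return c->freelist[--c->nfree];   /* already constructed */
        void *p = malloc(c->objsize);
        if (p)
            c->ctor(p);                       /* construct only on a miss */
        return p;
    }

    void cache_free(struct obj_cache *c, void *p) {
        if (c->nfree < c->cap)
            c->freelist[c->nfree++] = p;      /* keep constructed state */
        else {
            c->dtor(p);
            free(p);
        }
    }

    /* Under memory pressure: destroy cached objects, release their memory. */
    void cache_reclaim(struct obj_cache *c) {
        while (c->nfree > 0) {
            void *p = c->freelist[--c->nfree];
            c->dtor(p);
            free(p);
        }
    }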
- Why put the object cache in the central allocator?
- Knowledge of who needs mem
- better accounting and debugging
- smaller code size with 1 obj cache instead of N
- Interface
struct kmem_cache *kmem_cache_create(char *name, size_t size, int align,
        void (*constructor)(void *, size_t),
        void (*destructor)(void *, size_t));
void kmem_cache_destroy(struct kmem_cache *cp);
void *kmem_cache_alloc(struct kmem_cache *cp, int flags);
void kmem_cache_free(struct kmem_cache *cp, void *buf);
- Example client usage
void foo_constructor(void *buf, size_t size) {
    struct foo *foo = buf;
    mutex_init(&foo->foo_lock, ...);
    cv_init(&foo->foo_cv, ...);
    foo->foo_refcnt = 0;
    foo->foo_barlist = NULL;
}

void foo_destructor(void *buf, size_t size) {
    struct foo *foo = buf;
    assert(foo->foo_barlist == NULL);
    assert(foo->foo_refcnt == 0);
    cv_destroy(&foo->foo_cv);
    mutex_destroy(&foo->foo_lock);
}
void usage() {
    foo_cache = kmem_cache_create("foo_cache", sizeof(struct foo), 0,
            foo_constructor, foo_destructor);
    foo = kmem_cache_alloc(foo_cache, KM_SLEEP);
    use(foo);
    kmem_cache_free(foo_cache, foo);
}
- The allocation of foo will be fast, since the allocator will usually just fetch an already-constructed foo from the cache.
- foo_constructor and foo_destructor will only be called to populate and drain the cache
- Positive side effect: smaller icache footprint, since the ctor/dtor are so seldom called
- The slab impl (backend) consists of kmem_cache_grow and kmem_cache_reap
- Backend driven by memory pressure
- Slab allocator consists of many independent caches that have individual locks
- Each cache maintains its own statistics
- Standard non-caching alloc routines, kmem_alloc/kmem_free, use object caches internally
- Creates ~30 standard kmem caches at startup, ranging from 8 bytes to 9K
- Slab = one or more pages of virtually contiguous memory carved into equal-size chunks, with a reference count (sketch below)
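- Hypothetical per-slab bookkeeping to make that concrete (field names made up):
    struct slab {
        struct slab *next, *prev;   /* linkage in the cache's slab list */
        char *base;                 /* start of the page(s) backing the chunks */
        void *freelist;             /* free chunks within this slab */
        unsigned int refcount;      /* chunks handed out; 0 => slab reclaimable */
    };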
... To be continued