@dannas
Created April 23, 2020 16:12

Notes for Adhoc event April 23rd

Adhoc: a virtual event to explore system engineering

https://adhoc.community/#

Why mmap is faster than system calls

Questions for the talk:

  • From the MDB post: some ops benefit from mmap at the expense of others. Any more insight here?
  • How do you deal with I/O errors? They cause SIGSEGV, right?
  • Did you have any problems with network drives? They may give SIGBUS instead of SIGSEGV.
  • Does WiredTiger support Windows? Any problems using MapViewOfFile instead of mmap? I've heard MapViewOfFile takes a lock?
  • Do you run WiredTiger in 32-bit configurations? Any problems finding contiguous regions due to address space fragmentation?
  • Have you had any problems with page flushing in the kernel? How can you be sure the data written reaches the disk?

Getting storage engines ready for fast storage devices

  • HDD => SSD improved performance 10x
  • SATA => PCIe improved performance 2x
  • Improvements/innovations in SSDs
  • Intel Optane
  • Less wait time means software stack overhead starts to dominate
  • One solution: specialized file systems like SplitFS, but they are neither battle-tested nor portable
  • Instead: Use mmap and batched syscalls in MongoDB storage engine WiredTiger
  • 63% improvement for 19 out of 65 benchmarks on mainstream SSDs
  • mmap means putting N pages of file content into the buffer cache and mapping virtual addresses to them
  • Access to mmapped pages falls into three scenarios
    1. phy page in buffer cache and translation in TLB: no kernel involvement
    2. phy page in buffer cache, but translation not in TLB: transition into kernel, walk the page table, install the entry
    3. phy page not in buffer cache: trap into the OS, ask the fs to fetch the page, set up the page table entry and proceed as in 2.
  • mmap is faster due to fewer kernel transitions and because it can use AVX
  • Appending to a file means metadata writes. Lots of overhead
  • Amortize that cost by pre-allocating file space => batch fs ops to reduce their overhead
  • First prototype: fixed mmap size. If shrink/grow: fall back to system calls
  • Resizing problem: two threads where (1) copies from the mapping and (2) truncates the file. Then (1) would cause a segfault
  • Solution: use a lock when accessing the file
  • But that would serialize I/O.
  • So instead, separate threads into writers (writes beyond EOF, fallocate or truncate) and readers.
  • While writer active: route I/O to system calls which are properly synchronized
  • Do it without locking, via a solution inspired by RCU (see the C sketch after the pseudocode below)
    1. mmap_resizing: writer sets this flag before op
    2. mmap_use_count: reader increments. writer waits until zero
  • Writer pseudocode, executed before a write op:
wait:
	while mmap_resizing
		spin_backoff(...)
	if !cas(mmap_resizing, 0, 1)	# lost the race to claim the flag, retry
		goto wait
	while mmap_use_count		# wait for in-flight readers to drain
		spin_backoff(...)
	do_write(...)
	unmap(file)
	remap(file, ...)
	mmap_resizing = 0

  • Reader pseudocode (writers need something similar to check whether they can use mmap):
atomic_add(mmap_use_count, 1)
if mmap_resizing
	atomic_decr(mmap_use_count, 1)
	read_syscall(...)
else
	memcpy(dst_buf, mapped_buffer, ...)
	atomic_decr(mmap_use_count, 1)
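
  • A compilable C11 sketch of the same protocol. The flag and counter names come from the talk; the function names and plain spin loops are my own assumptions:

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

atomic_bool mmap_resizing;	/* set while a writer resizes the mapping */
atomic_int  mmap_use_count;	/* readers currently using the mapping */

/* Writer side: claim the flag, drain readers, then resize. */
void resize_and_write(void)
{
	bool expected;
	do {
		while (atomic_load(&mmap_resizing))
			;	/* spin_backoff */
		expected = false;
	} while (!atomic_compare_exchange_weak(&mmap_resizing, &expected, true));
	while (atomic_load(&mmap_use_count))
		;	/* spin_backoff: wait for in-flight readers */
	/* do_write(...); unmap(file); remap(file, ...); */
	atomic_store(&mmap_resizing, false);
}

/* Reader side: returns false if the caller must fall back to a read() syscall. */
bool read_via_mmap(char *dst, const char *mapped_buffer, size_t len)
{
	atomic_fetch_add(&mmap_use_count, 1);
	if (atomic_load(&mmap_resizing)) {
		atomic_fetch_sub(&mmap_use_count, 1);
		return false;
	}
	memcpy(dst, mapped_buffer, len);
	atomic_fetch_sub(&mmap_use_count, 1);
	return true;
}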

Why mmap is faster than system calls

  • Copy experiment with read and mmap syscalls (a code sketch follows this list): 1. Read to a buffer in userspace. 2. Copy from an mmapped buffer
  • The kernel and userspace both use virtual addresses, but neither can safely dereference the other's
  • If the kernel dereferenced userspace pointers directly, a NULL ptr would crash the kernel
  • If userspace could touch kernel memory => many safety problems
  • mmap means that userspace gets virtual addresses that point into kernel buffers
  • Read traps explicitly for each syscall and implicitly for page faults
  • Read needs to copy memory from kernel to userspace
  • The kernel tries to read from the buffer cache. If the page is not there, a page fault is generated and the fs fetches it
  • The same number of page faults is generated for read and mmap.
  • Many more traps for read due to crossing the syscall boundary.
  • Result: mmap is 2-6x faster than read, but syscall overhead wasn't the reason
  • read: 60% of time spent in copy_to_user
  • mmap: 60% of time spent in memmove
  • mmap uses memmove, which is much faster than copy_to_user due to AVX
  • Reason: copy_to_user can't use vector instructions because of the overhead of saving extra architectural state
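
  • A minimal sketch of the experiment as I understand it (file name, chunk size and the lack of error handling are my own simplifications):

#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define CHUNK (1 << 20)

int main(void)
{
	int fd = open("data.bin", O_RDONLY);
	struct stat st;
	fstat(fd, &st);
	char *buf = malloc(CHUNK);

	/* Strategy 1: read() - one explicit trap per chunk, copy_to_user inside the kernel. */
	while (read(fd, buf, CHUNK) > 0)
		;	/* consume buf */

	/* Strategy 2: mmap() - pages fault in implicitly, the copy is a plain memcpy/memmove. */
	char *map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
	for (off_t off = 0; off < st.st_size; off += CHUNK) {
		size_t len = st.st_size - off < CHUNK ? (size_t)(st.st_size - off) : CHUNK;
		memcpy(buf, map + off, len);	/* consume buf */
	}

	munmap(map, st.st_size);
	free(buf);
	close(fd);
	return 0;
}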

Making safe memory reclamation a feature of your memory allocator

Making Lockless Synchronization Fast: Performance Implications of Memory Reclamation (Hart, McKenney, and Brown, 2006)

  • Memory reclamation problem: for lock-free dynamic data structures, when is it safe to delete an object?
  • Blocking strategies: Quiescent State Based Reclamation (QSBR) and Epoch Based Reclamation (EBR), both of which use a grace period
  • A grace period is a time interval [t0, t1] such that after t1 all nodes removed before t0 can be reclaimed
  • QSBR uses quiescent states to detect grace periods.
  • A quiescent state is a time period for a thread T where T holds no references to shared nodes
  • The OS uses voluntary context switches as quiescent states
  • QSBR can block if one thread does not go into a quiescent state
  • EBR uses global epochs but I don't understand the details of it
  • Non-blocking: Hazard Pointer Based Reclamation (HPBR) and Lock-Free Reference Counting (LFRC)
  • HPBR: a reader sets a hazard ptr to the node it is accessing. If a node has been removed and no hazard ptr points to it, it can be reclaimed. I don't understand when the node ptr is cleared (the sketch after this list shows one convention).
  • For LFRC, threads track the number of references to nodes. Nodes are reclaimed when their count drops to zero.
  • QSBR: Usually fastest, but can suffer in the face of thread preemption
  • Blocking reclamation can be useful with non-blocking algorithms
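
  • A minimal single-slot hazard pointer sketch in C. This is my own illustration of the HPBR idea, not code from the paper; real implementations keep one hazard slot per thread plus a retire list:

#include <stdatomic.h>
#include <stdlib.h>

struct node { int value; _Atomic(struct node *) next; };

_Atomic(struct node *) hazard_ptr;	/* one slot; real impls: one per thread */

/* Reader: publish the pointer before dereferencing, re-check it is still current. */
struct node *protect(_Atomic(struct node *) *src)
{
	struct node *p;
	do {
		p = atomic_load(src);
		atomic_store(&hazard_ptr, p);
	} while (p != atomic_load(src));	/* the node may have been swapped out meanwhile */
	return p;
}

/* Reader: clear the slot when done with the node. */
void release(void)
{
	atomic_store(&hazard_ptr, NULL);
}

/* Reclaimer: a node already unlinked from the structure may be freed
 * only if no hazard pointer covers it. */
void try_reclaim(struct node *removed)
{
	if (atomic_load(&hazard_ptr) != removed)
		free(removed);
	/* else: keep it on a retire list and retry later */
}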

The Slab Allocator: An Object-Caching Kernel Memory Allocator (Bonwick, 1994)

  • Often the cost of initializing/destroying an object exceeds the cost of allocating and freeing memory for it
  • Object caching: preserve the invariant portion of an object - its constructed state - between uses
  • High level design of object cache
    • allocate: if obj in cache: take it else alloc mem and construct
    • free: return to the cache (no destruction)
    • reclaim memory: take some obj from cache. Destroy the objs. Free underlying mem
  • Why put the object cache in the central allocator?
    • Knowledge of who needs mem
    • better accounting and debugging
    • smaller code size with 1 obj cache instead of N
  • Interface
struct kmem_cache *kmem_cache_create(char *name, size_t size, int align,
	void (*constructor)(void *, size_t), void (*destructor)(void *, size_t));
void kmem_cache_destroy(struct kmem_cache *cp);

void *kmem_cache_alloc(struct kmem_cache *cp, int flags);
void kmem_cache_free(struct kmem_cache *cp, void *buf);
  • Example client usage
void foo_constructor(void *buf, size_t size) {
	struct foo *foo = buf;
	mutex_init(&foo->foo_lock, ...);
	cv_init(&foo->foo_cv, ...);
	foo->foo_refcnt = 0;
	foo->foo_barlist = NULL;
}

void foo_destructor(void *buf, size_t size) {
	struct foo *foo = buf;
	assert(foo->foo_barlist == NULL);
	assert(foo->foo_refcnt == 0);
	cv_destroy(&foo->foo_cv);
	mutex_destroy(&foo->foo_lock);
}

void usage() {
	struct kmem_cache *foo_cache = kmem_cache_create("foo_cache",
		sizeof(struct foo), 0, foo_constructor, foo_destructor);
	struct foo *foo = kmem_cache_alloc(foo_cache, KM_SLEEP);
	use(foo);
	kmem_cache_free(foo_cache, foo);
}
  • The allocation of foo will be fast, since the allocator will usually just fetch an already-constructed foo from the cache.
  • foo_constructor and foo_destructor will only be called to populate and drain the cache
  • positive side-effect: smaller icache footprint, since ctor/dtor are so seldom called
  • The slab impl (backend) consists of kmem_cache_grow and kmem_cache_reap
  • Backend driven by memory pressure
  • Slab allocator consists of many independent caches that have individual locks
  • Each cache maintains its own statistics
  • Standard non-caching alloc routines, kmem_alloc/kmem_free, use object caches internally
  • Create ~30 kmem caches at startup, ranging from 8 bytes to 9K
  • Slab = one or more pages of virtually contiguous mem, carved into equal-size chunks, with a refcount (sketched below)
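
  • A sketch of that slab bookkeeping (field names are my assumption, loosely after the paper):

struct kmem_slab {
	struct kmem_slab *next, *prev;	/* linkage in the cache's slab list */
	void *base;		/* first buffer of the slab's page(s) */
	void *freelist;		/* free buffers within this slab */
	int refcount;		/* buffers currently allocated from this slab */
};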

... To be continued

Magazines and Vmem: Extending the Slab Allocator to Many CPUs and Arbitrary Resources (Bonwick and Adams, 2001)

subr_smr.c

smr.h

A 15-minute history of task isolation on Linux

Patch series: Task isolation mode

LWN: A full task-isolation mode for the kernel

LWN: Dropping the timer tick - for real this time

SUSE Linux's shielding Linux resources

Jitmap, an execution engine for bitmap expressions

Bitmap Index Design and Evaluation (Chan and Ioannidis, 1998)

The Jamming section of A Catalogue of Optimizing Transformations (Allen and Cocke, 1971)

jitmap code

Faster Population Counts using AVX2 instructions (Mula, Kurz and Lemire, 2018)

LLVM Tree-Height-Reduction optimization pass
