Systems Performance book by B. Gregg

Summary of the Systems Performance book by Brendan Gregg.

The book itself on O'Reilly.

OS-related things

  • Kernel -- software that manages the system by providing access to hardware and other resources (memory, network stack, CPU scheduling, etc.). Runs in a privileged CPU mode -- kernel mode -- and uses a kernel-space stack.

  • Process -- an OS abstraction for a running program. Runs in user mode, with access to kernel mode via system calls. It consists of: a memory address space, file descriptors, thread stacks, registers. A process uses user-space stacks; syscalls, for example, use a kernel exception stack associated with each thread.

  • System call -- a protocol for user programs to request that the kernel perform privileged operations (see the sketch after this list)

  • Trap -- a signal sent to the kernel to request a system routine (system calls, processor exceptions, interrupts)

  • Hardware interrupt -- a signal sent by a device to the kernel; it is a type of trap

  • Socket -- an endpoint by which user-level applications access the network

  • Network interface -- a physical device used to connect to a network

  • IPIs -- inter-processor interrupts, a way to coordinate processors in a multiprocessor system

  • systemd -- the service manager on Linux. Features include dependency-aware service startup and analytics (systemd-analyze)

  • KPTI -- kernel page table isolation, patches that mitigate the "Meltdown" vulnerability. The cost is 0.1%-6% depending on the number of syscalls (context switches become more expensive)
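
A minimal sketch of the system-call boundary in C (Linux): the libc wrapper getpid() and a raw syscall(SYS_getpid) both trap into kernel mode; only standard glibc/Linux names are used here.

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void) {
    pid_t a = getpid();                   // libc wrapper around the syscall
    pid_t b = (pid_t)syscall(SYS_getpid); // raw syscall: traps into kernel mode
    printf("%d %d\n", (int)a, (int)b);    // prints the same pid twice
    return 0;
}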

CPU

  • The CPU's functional units do the following: fetch instruction, decode instruction, execute, (optionally) access memory, write back to registers.

  • The user-time/kernel-time ratio is high for computationally intensive applications and low for I/O-bound ones (web servers, etc.).

  • MMU -- memory management unit. Performs virtual memory management. Memory is paged, and the cache for page translations is the TLB (translation lookaside buffer). On a TLB miss, the MMU walks the page tables in main memory.

Performance tools

Some tips and tricks

  • futex -- fast userspace mutex, which uses an atomic integer under the hood. Basic operations are WAIT(addr, val) (if *addr == val, the current thread sleeps) and WAKE(addr, num) (wakes up num threads waiting on the address addr); see the lock sketch after this list

  • toplev -- a tool to find the bottleneck in software: https://github.com/andikleen/pmu-tools/wiki/toplev-manual
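
A minimal futex-based lock sketch in C (Linux, C11 atomics). Real implementations such as glibc's track contention so the uncontended unlock can skip the wake syscall; this sketch always issues it.

#define _GNU_SOURCE
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

static atomic_int lock_word; // 0 = unlocked, 1 = locked

static long futex(atomic_int *addr, int op, int val) {
    return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

void lock(void) {
    int expected = 0;
    // Fast path: an uncontended CAS in user space, no syscall.
    while (!atomic_compare_exchange_strong(&lock_word, &expected, 1)) {
        // Slow path: sleep in the kernel, but only while the word is still 1.
        futex(&lock_word, FUTEX_WAIT, 1);
        expected = 0; // CAS overwrote expected on failure; reset it
    }
}

void unlock(void) {
    atomic_store(&lock_word, 0);
    futex(&lock_word, FUTEX_WAKE, 1); // wake one waiter
}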


https://www.scylladb.com/2017/07/06/scyllas-approach-improve-performance-cpu-bound-workloads/

An interesting article about icache load and optimizations. It sounds like futures and postponed, grouped computations might be worth considering.


KirillLykov commented Oct 11, 2021

An article about why the CPU utilization reported by top is actually non-idle time, which means it includes both "busy" time and "waiting for memory/etc." time.

The only way to understand your CPU utilization is to look at IPC (using perf or other tools).
Modern processors can retire 4-5 instructions per cycle.
The tiptop command is like top but shows IPC.

If IPC < 1.0, the software is likely memory stalled;
if IPC > 1.0, it is likely instruction bound.

The problem with IPC is that a spinlock shows good numbers even though no progress is being made.

https://www.brendangregg.com/blog/2017-05-09/cpu-utilization-is-wrong.html


Performance reading list


KirillLykov commented Oct 18, 2021

Plan for the week before interview

CPU architecture (book by Denis Bakhvalov)

Pipelining stages

  1. Instruction fetch
  2. Instruction decode
  3. Execute
  4. Memory access
  5. Write back

Basic machine cycle, or clock -- the time to move an instruction from one stage to the next; it is set by the slowest stage.

Pipelining hazards (lead to stalls)

  • structural -- resource conflicts
  • data hazards -- data dependencies: read-after-write (overcome by bypassing/forwarding), write-after-read (register renaming)
  • control hazards -- branching, etc.

Pipelining enables instruction-level parallelism.

Superscalar CPU -- can issue more than one instruction per cycle. Typical issue width is 2-6.
Branch misprediction penalty; the reorder buffer (ROB) is a CPU structure that holds the results of speculative execution until instructions retire in order.

The CPU memory hierarchy exploits two fundamental properties (see the traversal sketch after this list):

  • Temporal locality -- recently accessed memory is likely to be accessed again soon
  • Spatial locality -- related data is placed next to each other
  • set-associative cache -- cache lines are grouped in sets (2-8 lines). A given address is first mapped to a set, and then it can be placed anywhere within this set.
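
A classic illustration of spatial locality (a sketch, not from the book): row-major traversal walks memory linearly and uses every byte of each fetched cache line, while column-major traversal strides a whole row per access and wastes them.

enum { N = 1024 };

long sum_rows(int a[N][N]) { // cache friendly: consecutive accesses
    long s = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) s += a[i][j];
    return s;
}

long sum_cols(int a[N][N]) { // cache hostile: stride of N * sizeof(int) bytes
    long s = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++) s += a[i][j];
    return s;
}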

Memory (RAM) characteristics:

  1. latency
    1.1 memory access time
    1.2 memory cycle time -- min time between two accesses
  2. bandwidth
  3. capacity

page table -- the table mapping virtual memory to physical memory
virtual address = page number + page offset
page fault -- raised when the requested page is not in main memory. The OS handles it by bringing the page in, or by denying the access if it is restricted.
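
A small experiment (a sketch, Linux-specific) showing that the first touch of anonymous memory is served by page faults:

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/resource.h>

int main(void) {
    struct rusage before, after;
    size_t len = 1 << 20; // 1 MB of anonymous memory, not yet backed by frames
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return 1;
    getrusage(RUSAGE_SELF, &before);
    memset(p, 1, len); // first touch: each page triggers a minor page fault
    getrusage(RUSAGE_SELF, &after);
    printf("minor faults: %ld\n", after.ru_minflt - before.ru_minflt);
    munmap(p, len);
    return 0;
}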

How a virtual address maps onto the cache
Input:

  • L1d size 8 KB = 2^13
  • cache line size 64 B = 2^6
  • 8-way associativity (cache lines are grouped by 8 in sets)

Output:

  • Offset -- position of a byte within the cache line. Addressing 64 B requires 6 bits
  • Index -- selects the set: 2^13 / (8 * 2^6) = 2^4 sets, so 4 bits
  • Tag -- stores the 38 MSBs of the physical address plus some additional bits (dirty, valid, etc.)
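
The same breakdown in code (a sketch; the address is made up, the constants match the example above):

#include <stdint.h>
#include <stdio.h>

// 64 B lines => 6 offset bits; 2^13 / (8 * 2^6) = 16 sets => 4 index bits
enum { LINE_BITS = 6, SET_BITS = 4 };

int main(void) {
    uint64_t addr = 0xf0d1234abcdeULL; // hypothetical physical address
    uint64_t offset = addr & ((1ULL << LINE_BITS) - 1);
    uint64_t index  = (addr >> LINE_BITS) & ((1ULL << SET_BITS) - 1);
    uint64_t tag    = addr >> (LINE_BITS + SET_BITS); // the remaining MSBs
    printf("offset=%llu index=%llu tag=0x%llx\n",
           (unsigned long long)offset, (unsigned long long)index,
           (unsigned long long)tag);
    return 0;
}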

CPU

  1. Front end: fetches and decodes instructions
  • fetches 16 bytes per cycle from L1i (shared among threads)
  • pre-fetch -> instruction queue (split between 2 threads) -> decoder (up to 5 instructions) -> fixed-length UOPs
  • Decoded stream buffer (DSB) -- a UOP cache holding the macro-op to UOP mapping
  • Complex operations cannot be handled by the decoder; they are tackled by the microcode sequencer
  • Instruction decode queue (IDQ) -- the interface between front end and back end. Holds 128 UOPs.
  2. Back end: executes instructions and stores the results
  • Reorder buffer (ROB) -- handles data dependencies. Maps architecturally visible registers to physical ones, does reordering, etc.
  • Reservation station/scheduler (RS) -- a structure that tracks the availability of resources for a given UOP and dispatches it once ready; dispatches 8 UOPs per cycle. The execution units are grouped by ports.
  • Performance monitoring unit (PMU) -- there are not that many PMCs, so they are not enough for all the performance events; profile over several runs (or multiplex)

Terminology and metrics

  • retired instructions -- the number of committed instructions (instructions squashed by mispredictions are not counted): perf stat -e instructions <exec>
  • CPU utilization:
perf stat -- a.exe
0.634874 task-clock (msec) # 0.773 CPUs utilized
  • IPC:
perf stat -e cycles,instructions -- <exec>
2369632 cycles
1725916 instructions
=> IPC = 1725916 / 2369632 = 0.73

OS (signals, terminology, how to implement malloc)

@KirillLykov
Copy link
Author

KirillLykov commented Oct 21, 2021

ASM, control flow, etc

Watch all of these https://www.youtube.com/channel/UCmf3tLU4WzOnriEQXa638Bw

Procedure control flow

https://www.youtube.com/watch?v=XbZQ-EonR_I&t=354s

call label // push the return address onto the stack, jump to label
ret // pop the return address from the stack, jump to it

Control constructs that compile to conditional jumps (if, while, for, ...) and to unconditional ones (break, continue)

How do condition codes work?
The following flags store the results of operations:

  • CF carry flag
  • ZF zero flag
  • SF sign flag
  • OF overflow flag

These are set implicitly by add and other operations, for example,
and they are used as implicit operands by some jump instructions (see the sketch after this list):

  • jmp label -- doesn't use condition codes
  • je label -- uses ZF; jumps if the result was zero
  • jl label -- jumps if less (uses SF^OF)
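
A sketch of how a compiler lowers a comparison into a flag-setting cmp plus a flag-consuming jump; the asm in the comments is representative unoptimized x86-64 output, not an exact listing:

int min(int a, int b) {
    if (a < b)    // cmpl %esi, %edi -- computes a - b, sets SF/OF/ZF
        return a; // jl  .return_a   -- taken if SF ^ OF (signed less)
    return b;
}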

32-bit registers:
%esp -- stack pointer (the 32-bit half of %rsp)
%eip -- instruction pointer
%eax -- used for the function's return value
So the e prefix is for 32-bit and r is for 64-bit.

some asm notes

movl src, dst (AT&T syntax, x86-64)

operand types: (1) immediate ($0x400), (2) register (%eax), (3) memory (-4(%eax))
Cannot do a mem-to-mem move in one instruction -- need to use an intermediate register
Memory addressing modes:

  • indirect: movl (%ecx), %eax
  • displacement: movl 8(%ebp), %edx

The suffix l on mov means 4 bytes; for 8 bytes use q
For function calls, the first six integer args are passed in registers (rdi, rsi, rdx, rcx, r8, r9); all the other args go on the stack

Different address manipulations. Assume that %edx = 0xf000 and %ecx = 0x100:

  • 0x8(%edx) = 0xf000 + 0x8
  • (%edx,%ecx) = 0xf000 + 0x100
  • (%edx,%ecx,4) = 0xf000 + 4 * 0x100

leal src, dst -- an instruction that calculates the effective address (by doing this affine arithmetic) without accessing memory
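
A sketch of lea used for pure arithmetic; the asm in the comments is typical gcc -O2 output for x86-64, shown for illustration:

// x * 5 = x + 4*x fits lea's base + scale*index form, no memory access
int times5(int x) { return x * 5; } // leal (%rdi,%rdi,4), %eax; ret

// array indexing uses the same addressing mode, but with a real load
int elem(int *base, long i) { return base[i]; } // movl (%rdi,%rsi,4), %eax; ret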

TODO https://web.stanford.edu/class/archive/cs/cs107/cs107.1222/assign1/ -- assignments from Stanford's CS107


KirillLykov commented Oct 29, 2021

A series of lectures on threads: https://www.youtube.com/watch?v=36LMjAFbFfg&list=PL4_hYwCyhAvb7P8guwSTaaUS8EcOaWjxF&index=2

  • if two threads do not interact with each other (e.g., each handles its own socket) -- what shared variables do they have? Memory allocation -- it lives in user space
  • race condition -- when program behavior depends on an unpredictable execution order
  • data race -- when several threads read/write memory that is not protected by locks
  • mutual exclusion -- mutex: only one thread can be between the lock/unlock calls at a time
  • clang supports annotating variables with GUARDED_BY(mutex name) and methods with REQUIRES(mutex):
Mutex mu;
int bla GUARDED_BY(mu);

void withdraw(int amount) REQUIRES(mu) {}
  • mutex properties:
  1. safety -- mutual exclusion
  2. liveness -- no mutual blocking
  • deadlock -- the threads cannot proceed under any scheduler behavior; livelock -- the threads can proceed if the scheduler orders things right
  • starvation -- a thread gets unlucky with the scheduler for a long time
  • atomic operations: store (write), fetch_add (read-modify-write), exchange (read, write), load (read)
  • contention -- a situation where many threads fight over one resource. Busy waiting -- burns resources while doing nothing; it can be addressed at the CPU or OS level: at the CPU level with the PAUSE instruction, at the OS level with sched_yield, which signals the scheduler that we are giving up the core (see the spinlock sketch after this list)
  • futex -- a queue of threads in the OS kernel bound to a memory cell. FutexWait(addr, value) -- if *addr == value, park the thread in the queue; FutexWake(addr, count) -- wake count threads in the queue bound to the cell at address addr. In C++20: atomic::wait/notify_one
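
A busy-waiting spinlock sketch in C11 using the CPU-level PAUSE hint mentioned above (x86-specific via _mm_pause; illustrative, not production code):

#include <stdatomic.h>
#include <immintrin.h> // _mm_pause

typedef struct { atomic_flag flag; } spinlock; // init with ATOMIC_FLAG_INIT

void spin_lock(spinlock *l) {
    while (atomic_flag_test_and_set_explicit(&l->flag, memory_order_acquire))
        _mm_pause(); // tells the core we are spinning: saves power, helps the SMT sibling
}

void spin_unlock(spinlock *l) {
    atomic_flag_clear_explicit(&l->flag, memory_order_release);
}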


Some notes from the lecture on memory models, part II

  • program order -- a partial order of the data accesses within one thread (instruction order)
  • synchronizes-with -- a partial order of reads/writes in different threads (atomics, for instance, can provide this)

=> happens-before -- the transitive closure of these two
data race -- two data accesses that are not ordered by the happens-before relation

So happens-before is our intuition about how the program should execute (sequentially).
For example, if there are two threads

T1 -- x = 5 -- myatomic.store(true)
T2 ---------- if (myatomic == true) -- print x ---

then x is ordered by happens-before


KirillLykov commented May 24, 2022

Reader/Writer problem and its solution

https://gclub.com/when-to-use-reader-writer

As the article shows, the writer uses the atomic_store_explicit function with memory_order_release to set the guard variable. This ensures that the write to the data completes before the write to the guard variable completes, while still allowing later memory operations to hoist above the barrier.

The reader uses the atomic_load_explicit function with memory_order_acquire to load the guard variable. This ensures that the load of the data starts only after the load of the guard variable completes, while still allowing earlier operations to sink below the barrier.

The atomic_store_explicit and atomic_load_explicit functions ensure that the operations are atomic and implicitly enforce the required compiler barriers.
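
The article's code is not reproduced in this gist; here is a minimal C11 sketch of the described pattern (the names data and guard are made up):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

int data;          // plain, non-atomic payload
atomic_bool guard; // guard variable, initially false

void writer(void) {
    data = 42;                                   // 1) write the payload
    atomic_store_explicit(&guard, true,
                          memory_order_release); // 2) publish it
}

void reader(void) {
    // Acquire pairs with the writer's release: if we observe guard == true,
    // we are guaranteed to observe data == 42 as well.
    if (atomic_load_explicit(&guard, memory_order_acquire))
        printf("%d\n", data);
}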


KirillLykov commented Sep 11, 2023

Concurrency models

Why: to reduce the incidental complexity of multi-threaded code
How: by introducing a handful of abstractions

This article introduces 3 models:

  1. Parallel workers (use mutex)
  2. Assembly line (channels or actors)
  3. Functional parallelism (stateless)

Another article introduces 4:

  1. Synchronization primitives for sharing memory (e.g., sync.Mutex)
  2. Synchronization via communicating (e.g., channels)
  3. Immutable data
  4. Data protected by confinement

Here they claim just two:

  1. shared state (mutex)
  2. no shared state (channels)

This talk

  1. CAS (atomics and so on -- lock-free data structures)
  2. agents (like channels)
  3. STM (MVCC), persistent data structures, and functional programming
  4. locks


KirillLykov commented Sep 11, 2023

Disruptor pattern

https://www.infoq.com/presentations/LMAX/
Fowlers description https://martinfowler.com/articles/lmax.html

  • important to model the problem

Basically, the idea is to avoid locks: use CAS only occasionally and mostly rely on memory barriers. They achieve this with their ring buffer (a sketch of the general technique follows below).

The ABA problem in CAS: https://youtu.be/_qaKkHuHYE0?t=1623
In this talk they used a lock-free ring buffer but lost to a mutex :'-)
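
A minimal lock-free SPSC ring buffer sketch in C11 in the spirit of the above (not the LMAX implementation): no locks, no CAS, only acquire/release on the two indices; the capacity must be a power of two.

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define CAP 1024 // power of two, so "& (CAP - 1)" wraps the index

typedef struct {
    int buf[CAP];
    atomic_size_t head; // next slot to write, advanced only by the producer
    atomic_size_t tail; // next slot to read, advanced only by the consumer
} spsc;

bool spsc_push(spsc *q, int v) { // producer thread only
    size_t h = atomic_load_explicit(&q->head, memory_order_relaxed);
    size_t t = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (h - t == CAP) return false; // full
    q->buf[h & (CAP - 1)] = v;
    atomic_store_explicit(&q->head, h + 1, memory_order_release); // publish
    return true;
}

bool spsc_pop(spsc *q, int *v) { // consumer thread only
    size_t t = atomic_load_explicit(&q->tail, memory_order_relaxed);
    size_t h = atomic_load_explicit(&q->head, memory_order_acquire);
    if (t == h) return false; // empty
    *v = q->buf[t & (CAP - 1)];
    atomic_store_explicit(&q->tail, t + 1, memory_order_release); // free the slot
    return true;
}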


KirillLykov commented Sep 15, 2023

An alternative way of proceeding: mgeier/rtrb#39

But as of now, crossbeam-queue is as efficient as his SPSC.
