Created February 16, 2023 19:38

Systems Performance 2nd edition

See synthesized write-up here

  • Do a quick performance check in 60 seconds

  • Use a number of different tools available in Unix

  • Use flamegraphs of the callstack if you have access to them

  • The best performance wins come from eliminating unnecessary work, for example a thread stuck in a loop, or from eliminating bad config

  • Mantras: Don't do it (eliminate); do it, but don't do it again (caching); do it less (e.g. poll less often); do it when they're not looking; do it concurrently; do it more cheaply

  • Latency is an essential performance metric - the time for an operation to complete

  • Operation request
  • Database query
  • File system operation
  • We can improve latency by reducing disk reads, for example through caching
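
As a rough illustration of latency as "time for an operation to complete", and of how a cache removes repeated slow reads, here is a minimal Python sketch. `slow_read` is a hypothetical stand-in for a disk read (the 10 ms sleep is an assumed device latency, not a measured one), and `functools.lru_cache` plays the role of the cache.

```python
import time
from functools import lru_cache

calls = 0  # counts how often the slow path actually runs


def slow_read(block_id: int) -> bytes:
    """Hypothetical stand-in for a disk read; the sleep simulates device latency."""
    global calls
    calls += 1
    time.sleep(0.01)  # assumed 10 ms latency, for illustration only
    return b"x" * 16


@lru_cache(maxsize=128)
def cached_read(block_id: int) -> bytes:
    """Same read, but repeated requests are served from memory."""
    return slow_read(block_id)


def timed(fn, *args):
    """Return (result, latency in seconds) for one operation."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


_, miss_latency = timed(cached_read, 7)  # first access goes to the "disk"
_, hit_latency = timed(cached_read, 7)   # second access is a cache hit
print(f"miss: {miss_latency * 1e3:.2f} ms, hit: {hit_latency * 1e3:.4f} ms")
```

The second call never reaches `slow_read`, which is the "don't do it again" mantra in practice.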

Actionable Chain of Events

Counter --> Statistics --> Metrics --> Alerts
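
One way to picture this chain is the sketch below. It is only an illustration: the `Sample` type, the sample values, and the `IOPS_THRESHOLD` limit are all made up for the example. A raw counter is turned into a rate (a statistic), one rate is selected as the metric to watch, and an alert fires when it crosses a threshold.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    timestamp: float  # seconds since start of observation
    counter: int      # monotonically increasing count, e.g. disk I/Os completed


def rate(prev: Sample, curr: Sample) -> float:
    """Statistic: events per second derived from two counter readings."""
    elapsed = curr.timestamp - prev.timestamp
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    return (curr.counter - prev.counter) / elapsed


def should_alert(metric_value: float, threshold: float) -> bool:
    """Alert: fires when the chosen metric crosses its threshold."""
    return metric_value > threshold


# Hypothetical counter readings taken one second apart.
samples = [Sample(0.0, 1000), Sample(1.0, 1500), Sample(2.0, 2900)]
rates = [rate(a, b) for a, b in zip(samples, samples[1:])]  # statistics
disk_iops = rates[-1]                                       # metric selected for monitoring
IOPS_THRESHOLD = 1200.0                                     # assumed limit for this example
print(rates, should_alert(disk_iops, IOPS_THRESHOLD))
```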

Profiling tools sample CPU activity. The samples can be rendered as flame graphs, which show where CPU time is being spent.

The x-axis shows the stack profile population, sorted alphabetically (it is not the passage of time), and the y-axis shows stack depth, counting from zero at the bottom. Each rectangle represents a stack frame. The wider a frame is, the more often it was present in the stacks. The top edge shows what is on-CPU, and beneath it is its ancestry. Original flame graphs use random colors to help visually differentiate adjacent frames. Variations include inverting the y-axis (an "icicle graph"), changing the hue to indicate code type, and using a color spectrum to convey an additional dimension.
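
To make the width and depth rules concrete, here is a small Python sketch that computes frame widths from a handful of made-up stack samples. The sample data and the `frame_widths` helper are hypothetical; real flame graph tooling works from profiler output, which this does not parse.

```python
from collections import Counter

# Hypothetical sampled stacks, root first; each tuple is one profiler sample.
samples = [
    ("main", "parse", "read"),
    ("main", "parse", "read"),
    ("main", "parse", "tokenize"),
    ("main", "render"),
]


def frame_widths(stacks):
    """Count how many samples contain each stack prefix.

    Each prefix identifies one rectangle: its length minus one is the depth
    on the y-axis, and its count is the rectangle's width.
    """
    widths = Counter()
    for stack in stacks:
        for depth in range(1, len(stack) + 1):
            widths[stack[:depth]] += 1
    return widths


widths = frame_widths(samples)
total = len(samples)

# Sorting the prefixes alphabetically gives the left-to-right x-axis order,
# which is why the x-axis does not represent the passage of time.
for prefix in sorted(widths):
    depth = len(prefix) - 1
    print(f"{'  ' * depth}{prefix[-1]}: {widths[prefix] / total:.0%} of samples")
```

Here `main` spans the full width because it appears in every sample, while `read` is on the top edge of half of them.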

Tracing - Event-based recording where data is saved for later analysis.

Linux 60-second checklist

Also here: if you only have a bit of time to profile your system.

In 60 seconds you can get a high-level idea of system resource usage and running processes by running the following ten commands. Look for errors and saturation metrics, as they are both easy to interpret, and then resource utilization. Saturation is where a resource has more load than it can handle, and can be exposed either as the length of a request queue, or time spent waiting.

Don't rely only on top just because it is the tool you know; that creates a streetlight effect.

uptime
dmesg | tail
vmstat 1
mpstat -P ALL 1
pidstat 1
iostat -xz 1
free -m
sar -n DEV 1
sar -n TCP,ETCP 1
top
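
As an example of reading a saturation metric from this checklist, the sketch below parses captured `vmstat 1` text and flags rows where the run queue (`r`) exceeds the CPU count, and rows with swap activity (`si`/`so`). The sample output and the `CPU_COUNT` of 32 are made up for illustration, and the parser assumes the usual column header line that starts with `r`.

```python
SAMPLE = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
34  0      0 200889792  73708 591828    0    0     0     5    6   10 96  1  3  0  0
 2  0      0 200889920  73708 591860    0    0     0   592 13284 4282 12  1 87  0  0
"""


def parse_vmstat(output: str) -> list[dict[str, int]]:
    """Parse vmstat text into one dict per data row, keyed by column name."""
    lines = [line.split() for line in output.strip().splitlines()]
    # The column header is the line whose first field is "r".
    header = next(fields for fields in lines if fields and fields[0] == "r")
    rows = []
    for fields in lines:
        if len(fields) == len(header) and all(f.isdigit() for f in fields):
            rows.append(dict(zip(header, map(int, fields))))
    return rows


def cpu_saturated(row: dict[str, int], cpu_count: int) -> bool:
    """More runnable tasks than CPUs means tasks are waiting for a turn."""
    return row["r"] > cpu_count


def swapping(row: dict[str, int]) -> bool:
    """Non-zero swap-in or swap-out suggests memory pressure."""
    return row["si"] > 0 or row["so"] > 0


CPU_COUNT = 32  # assumed machine size for this example
rows = parse_vmstat(SAMPLE)
for row in rows:
    print(row["r"], cpu_saturated(row, CPU_COUNT), swapping(row))
```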

High-level terminology

  • IOPS - input/output operations per second, a measure of the rate of data transfer operations
  • Latency - a measure of the time an operation spends waiting to be serviced
  • Saturation - the degree to which a resource has queued work it cannot service
  • Hit ratio - the number of times needed data is found in the cache (hits) versus total accesses (hits + misses)
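
The hit ratio definition can be written out directly. The sketch below also estimates average access time as a weighted mix of hit and miss latency; the latencies used (0.1 ms for a hit, 10 ms for a miss) are assumed values for illustration, not measurements.

```python
def hit_ratio(hits: int, misses: int) -> float:
    """hits / (hits + misses); returns 0.0 when there have been no accesses."""
    total = hits + misses
    if total == 0:
        return 0.0
    return hits / total


def average_access_time(ratio: float, hit_latency: float, miss_latency: float) -> float:
    """Expected time per access, weighting each path by how often it is taken."""
    return ratio * hit_latency + (1.0 - ratio) * miss_latency


ratio = hit_ratio(hits=980, misses=20)
# Assumed latencies in seconds: 0.1 ms for a cache hit, 10 ms for a miss.
avg = average_access_time(ratio, hit_latency=0.0001, miss_latency=0.010)
print(f"hit ratio {ratio:.2%}, average access {avg * 1e3:.3f} ms")
```

Because misses are so much slower than hits, the last few percent of hit ratio matter more than the first few.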

Performance tradeoffs:

Good -- Fast -- Cheap (pick two); for IT projects: High-performance -- On-time -- Inexpensive

File system record size: small record sizes perform better for random I/O; larger record sizes will improve streaming workloads

Types of caches

  • Performance tuning is most effective when done closest to the work performed

  • MRU - most recently used

  • LRU - least recently used

  • MFU - most frequently used

  • LFU - least frequently used
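
As an example of one of these policies, here is a minimal LRU cache sketch in Python built on `collections.OrderedDict`. The `LRUCache` class is written for illustration only; it evicts the least recently used entry when capacity is exceeded.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used entry."""

    def __init__(self, capacity: int):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._items = OrderedDict()  # oldest access first, newest last

    def get(self, key):
        if key not in self._items:
            return None  # cache miss
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the least recently used entry


cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")     # touching "a" makes "b" the least recently used
cache.put("c", 3)  # over capacity, so "b" is evicted
print(cache.get("a"), cache.get("b"), cache.get("c"))
```

The frequency-based policies (MFU, LFU) would need a per-key access count instead of an access order.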

Cold cache - empty, or populated with unwanted data. The hit ratio is zero, or near zero as it begins to warm up. Warm cache - populated with useful data, but its hit ratio is not yet high enough to be considered hot.

Cold --> Warm --> Hot
Ratio improving
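
The warm-up progression can be simulated with a toy workload. In the sketch below, the access pattern (ten keys requested round-robin) and the window size are made up for the example. The cache starts empty and the hit ratio is reported per window of accesses.

```python
def windowed_hit_ratios(keys, window: int) -> list[float]:
    """Replay accesses against an initially empty cache, returning the hit ratio per window."""
    cache = set()  # unbounded cache, so nothing is ever evicted
    ratios = []
    hits = 0
    for i, key in enumerate(keys, start=1):
        if key in cache:
            hits += 1
        else:
            cache.add(key)  # miss: populate the cache
        if i % window == 0:
            ratios.append(hits / window)
            hits = 0
    return ratios


# Hypothetical workload: ten distinct keys requested round-robin, 100 accesses.
keys = [i % 10 for i in range(100)]
ratios = windowed_hit_ratios(keys, window=20)
print(ratios)
```

The first window pays for all the misses; once the working set is resident, every later window is served entirely from cache.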

Cache tuning: Aim to cache as high in the stack as possible, closer to where the work is performed, which reduces the operational overhead of cache hits.

p. 61: performance Mantras

  1. State the goals of the study and define system boundaries
  2. List system services and possible outcomes
  3. Select performance metrics
  4. List system and workload parameters
  5. Select factors and their values
  6. Select the workload
  7. Design the experiments
  8. Analyze and interpret the data
  9. Present the results
  10. If necessary, start over

Disk Utilization (p. 65)

Disk utilization can become a problem even before it hits 100%. To find the bottleneck:

  1. Measure the rate of server requests, and monitor this rate over time
  2. Measure hardware and software resource usage
  3. Express server requests in terms of resources used
  4. Extrapolate server requests to the known limit of each resource


Hardware:

  • CPU Utilization
  • Memory Usage
  • Disk IOPS
  • Disk Throughput
  • Disk Capacity

Software:

  • Virtual memory usage
  • Processes/tasks
  • File descriptors
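
The four steps above can be sketched as a small calculation. This is a simplification that assumes resource usage scales linearly with request rate; the observed rate and utilization figures are made up, and `max_request_rate` is a hypothetical helper. The `limit` parameter allows planning against a ceiling below 100%, since a resource can become a problem before it is fully utilized.

```python
def max_request_rate(request_rate: float, utilization: float, limit: float = 1.0) -> float:
    """Extrapolate the request rate at which a resource would reach its limit.

    Assumes utilization grows linearly with request rate.
    """
    if not 0 < utilization <= limit:
        raise ValueError("utilization must be above zero and at or below the limit")
    return request_rate * limit / utilization


# Steps 1 and 2: hypothetical measurements at the current load.
observed_rate = 1000.0  # server requests per second
usage = {"cpu": 0.40, "disk_iops": 0.50, "memory": 0.25}  # fraction of each resource used

# Steps 3 and 4: express usage per request rate and extrapolate to each limit.
limits = {name: max_request_rate(observed_rate, used) for name, used in usage.items()}
bottleneck = min(limits, key=limits.get)
print(limits, bottleneck)
```

The resource with the lowest extrapolated rate is the first expected bottleneck, here `disk_iops`.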

Sharding - a common strategy for databases where data is split into logical components, each managed by its own database
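
A minimal sketch of how requests might be routed to shards is shown below, using a hash of the key. This is one possible scheme, not necessarily the one the book describes; the shard names in `SHARDS` and the key format are hypothetical.

```python
import hashlib

SHARDS = ["db-0", "db-1", "db-2", "db-3"]  # hypothetical database names


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to a shard index using a stable hash.

    hashlib is used instead of the built-in hash(), which is randomized
    per process for strings and so would not route consistently.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def route(key: str) -> str:
    """Return the name of the database responsible for this key."""
    return SHARDS[shard_for(key, len(SHARDS))]


for user in ("user:1", "user:2", "user:42"):
    print(user, "->", route(user))
```

A known drawback of plain modulo hashing is that changing the number of shards remaps most keys, so real deployments often use a different mapping.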

p. 106 - CPU versus IO bound:

  • CPU-bound: performing heavy compute, such as scientific and mathematical analysis
  • I/O-bound: performing I/O, such as web servers and file servers, where low latency is important
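
One rough way to tell the two apart for a piece of code is to compare CPU time with wall-clock time. In the sketch below, `compute` and `wait` are hypothetical workloads, and the sleep stands in for time blocked on I/O. A share near 1 suggests CPU-bound work and a share near 0 suggests time spent waiting.

```python
import time


def cpu_share(fn) -> float:
    """Fraction of wall-clock time that fn spent on-CPU in this process."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    fn()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return cpu / wall if wall > 0 else 0.0


def compute():
    """CPU-bound stand-in: pure arithmetic."""
    sum(i * i for i in range(300_000))


def wait():
    """I/O-bound stand-in: the sleep simulates blocking on a device or network."""
    time.sleep(0.2)


compute_share = cpu_share(compute)
wait_share = cpu_share(wait)
print(f"compute: {compute_share:.0%} on-CPU, wait: {wait_share:.0%} on-CPU")
```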