pankkor/linux_perf_tools.md

## linux_perf_tools.md

      
Linux Perf Tools

Materials used to write this gist.


Linux Performance Analysis by Brendan Gregg
perf tutorial
perf examples

TL;DR

uptime
dmesg | tail
vmstat 1
mpstat -P ALL 1
pidstat 1
iostat -xz 1
free -m
sar -n DEV 1
sar -n TCP,ETCP 1
top
perf

Tools

uptime for high-level load average.

Number of processes wanting to run. Includes processes blocked in uninterruptible IO.
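
As a rough sketch of how to read the number, the load average can be compared with the CPU count. The helper name `load_status` is hypothetical, and because the load average also counts tasks in uninterruptible IO, a "saturated" result is only a hint to look closer.

```shell
# Hypothetical helper: classify a load average against the number of CPUs.
# Usage: load_status <load_avg> <num_cpus>
load_status() {
    awk -v load="$1" -v cpus="$2" 'BEGIN {
        # "+ 0" forces a numeric comparison
        if (load + 0 > cpus + 0) print "saturated"; else print "ok"
    }'
}
```

On a live system it could be fed with `load_status "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"`, which uses the 1-minute load average.
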
dmesg system messages/errors.

Look for errors that can cause performance problems. "Out of memory", "TCP: ... dropping request", etc.
dmesg | grep oom-killer

vmstat for virtual memory statistics.


r: Number of processes running or waiting. Doesn't include IO. IF r > num_cpu THEN saturation.
b: Number of processes blocked by IO.
free: Free memory in KiB.
buffers: buffer cache, used for block device I/O.
cached: page cache, used by file systems.
si, so: Swap-ins and swap-outs. IF si,so != 0 THEN out_of_memory.
us, sy, id, wa, st: CPU time average

us: user time.
sy: system time. IF sy > 20% THEN the kernel may be processing IO inefficiently.
id: idle.
wa: IO wait.
st: stolen time, time spent by the hypervisor on other VMs.
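
The rules of thumb above can be sketched as an awk filter over `vmstat 1` output. The function name `vmstat_check` is hypothetical, and the field positions assume the classic column order (`r b swpd free buff cache si so bi bo in cs us sy id wa st`); check the header on your system before relying on it.

```shell
# Hypothetical filter: read `vmstat 1` output on stdin and print a warning
# for every sample that trips one of the rules of thumb.
# Usage: vmstat 1 | vmstat_check <num_cpus>
vmstat_check() {
    awk -v cpus="$1" '
        $1 !~ /^[0-9]+$/     { next }   # skip the two header lines
        $1 + 0 > cpus + 0    { print "cpu saturation: r=" $1 }
        $7 + $8 > 0          { print "swapping: si=" $7 " so=" $8 }
        $14 + 0 > 20         { print "high system time: sy=" $14 "%" }
    '
}
```

A typical invocation would be `vmstat 1 | vmstat_check "$(nproc)"`. Quiet samples print nothing.
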


mpstat -P ALL 1 for CPU time breakdown per CPU.

Look out for a single hot CPU.
pidstat 1 rolling process summary.

Like top but prints a rolling summary instead of clearing the screen.
iostat -xz 1 disk usage analysis.


r/s: delivered reads per second
rkB/s: kB read per second
w/s: delivered writes per second
wkB/s: kB write per second
await: the average time for the IO in milliseconds. Includes both time queued and time being serviced.
IF high THEN device_saturation | device_problems
avgqu-sz: the average queue length of the requests issued to the device (shown as aqu-sz in newer sysstat versions).
IF avgqu-sz > 1 THEN could_be saturation. Still, multiple back-end disk devices can operate on requests in parallel.
%util: device utilization (busy %).
IF %util > 60% THEN poor_performance (double check with await).
IF %util ~= 100% THEN saturation
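
A minimal sketch of the `%util` rule as an awk filter. Because `iostat -x` column layouts differ between sysstat versions, it looks up the `%util` column by name in the header row instead of hard-coding a position. The function name `iostat_check` is hypothetical.

```shell
# Hypothetical filter: read `iostat -xz 1` output on stdin and print every
# device sample whose %util exceeds 60.
iostat_check() {
    awk '
        $1 ~ /^Device/ {                 # header row: find the %util column
            util = 0
            for (i = 1; i <= NF; i++) if ($i == "%util") util = i
            next
        }
        # avg-cpu rows have fewer fields than util, so NF >= util skips them
        util && NF >= util && $util + 0 > 60 { print $1 ": %util=" $util }
    '
}
```

Used as `iostat -xz 1 | iostat_check`. Any device it reports is worth cross-checking against `await`, as noted above.
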

free -m -s 1 for memory usage (-m displays MiB, -h displays human-readable units, -s 1 repeats every second).


buffers: buffer cache, used for block device I/O.
cached: page cache, used by file systems.
buff/cache: sum of buffers and cached.
available: estimate of memory that applications can use without swapping. Includes free memory plus cache that can be reclaimed quickly.
IF buffers or cached ~= 0 THEN higher disk IO
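
`free` takes its numbers from `/proc/meminfo`, so the `available` column can also be tracked as a share of total memory. This is a small sketch. The helper name `mem_available_pct` is hypothetical, and it only relies on the `MemTotal` and `MemAvailable` fields.

```shell
# Hypothetical helper: read /proc/meminfo-style text on stdin and print
# MemAvailable as a whole-number percentage of MemTotal.
mem_available_pct() {
    awk '
        $1 == "MemTotal:"     { total = $2 }
        $1 == "MemAvailable:" { avail = $2 }
        END { if (total > 0) printf "%d\n", avail * 100 / total }
    '
}
```

Run it as `mem_available_pct < /proc/meminfo`.
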

sar -n DEV 1 for network interface throughput.


rxpck/s: number of packets received per second.
txpck/s: number of packets transmitted per second.
rxkB/s: number of kilobytes received per second.
txkB/s: number of kilobytes transmitted per second.
%ifutil: utilization percentage of the network interface. Could be unreliable.

sar -n TCP,ETCP 1 for summarized view of some key TCP metrics.


active/s: number of locally-initiated TCP connections per second (e.g., via connect()). (~ outbound)
passive/s: number of remotely-initiated TCP connections per second (e.g., via accept()). (~ inbound)
retrans/s: number of TCP retransmits per second. Sign of network or server issue.

perf sampling profiler.

You'll need a call graph:

--call-graph lbr - Last Branch Record: uses special hardware registers to store a limited call graph of the most recent branch instructions (expect about 32 entries). Very fast, but requires modern hardware (Intel Haswell or newer, ARMv9.2-A or newer).
--call-graph fp - use frame pointer to determine call graph, use if your binary is built with frame pointer (-fno-omit-frame-pointer)
--call-graph dwarf - saves 8 KiB (by default) of the call stack per sample, to be unwound later using debug info. Produces large perf.data records that are extremely slow to process with perf report. Impractical at high sampling rates, so limit the sampling rate to 99 Hz with -F99.
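
The trade-off between call-graph mode and sampling rate can be sketched as a small helper that prints a `perf record` command line. The function name `perf_record_cmd` is hypothetical, and using 1000 Hz for `fp` is an assumption (this gist only pairs 1000 Hz with `lbr`).

```shell
# Hypothetical helper: print a `perf record` command for a call-graph mode,
# capping dwarf at 99 Hz because of its large per-sample stack dumps.
# Usage: perf_record_cmd <lbr|fp|dwarf> <pid> [seconds]
perf_record_cmd() {
    mode="$1"; pid="$2"; secs="${3:-10}"
    case "$mode" in
        dwarf)  freq=99 ;;      # expensive unwinding: keep the sample rate low
        lbr|fp) freq=1000 ;;    # cheap unwinding: a higher rate is affordable (assumed for fp)
        *)      echo "unknown call-graph mode: $mode" >&2; return 1 ;;
    esac
    echo "perf record -p $pid --call-graph $mode -F$freq -- sleep $secs"
}
```

It only prints the command, so the line can be reviewed before it is run.
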

Example commands:
Attach to running process to sample it for 10 seconds with 1000 Hz sample rate and LBR call-graph. Creates perf.data record.
perf record -p <pid> --call-graph lbr -F1000 -- sleep 10

Sample the whole system for 10 seconds with DWARF debug info, limiting the sampling rate to 99 Hz.
perf record -a --call-graph dwarf -F99 -- sleep 10

If run on remote system, pack all necessary information to be analyzed later on a host system. This will create a .tar archive. Copy it together with perf.data to the host system.
perf archive

On the host system, unpack the .tar archive. This extracts it to ~/.debug
perf archive --unpack

You can later generate a report with perf.data from remote system
perf report

perf-archive for Ubuntu

perf-archive is missing from all the Ubuntu perf packages. Get one from Linux source:
mkdir -p /usr/libexec/perf-core/
wget -O /usr/libexec/perf-core/perf-archive https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/plain/tools/perf/perf-archive.sh
chmod +x /usr/libexec/perf-core/perf-archive