Brendan Gregg brendangregg

## dockerpsns.sh
#!/bin/bash
#
# dockerpsns - proof of concept for a "docker ps --namespaces".
#
# USAGE: ./dockerpsns.sh
#
# This lists containers, their init PIDs, and namespace IDs. If container
# namespaces equal the host namespace, they are colored red (this can be
# disabled by setting color=0 below).
#

## chaintest.py
#!/usr/bin/python
#
# chaintest   Summarize off-CPU time by kernel stack + 2 waker stacks
#             WORK IN PROGRESS. For Linux, uses BCC, eBPF.
#
# USAGE: chaintest [-h] [-u] [-p PID] [-i INTERVAL] [-T] [duration]
#
# PLEASE DO NOT RUN THIS IN PRODUCTION! This is a work in progress, intended to
# explore chain graphs on Linux, using eBPF capabilities from a particular
# kernel version (4.3ish). This tool will eventually get much better.

## gist:eebe3455fd8e528bb14d193a93d43b59
Using my perf-tools just to wrap ftrace:

# ./perf-tools/bin/kprobe 'p:tcp_v4_connect skc_dport=+2(%si):u16'
Tracing kprobe tcp_v4_connect. Ctrl-C to end.
          telnet-9723  [001] d... 62326244.175951: tcp_v4_connect: (tcp_v4_connect+0x0/0x480) skc_dport=1600
          telnet-9725  [001] d... 62326246.502760: tcp_v4_connect: (tcp_v4_connect+0x0/0x480) skc_dport=1700
          telnet-9726  [001] d... 62326247.861937: tcp_v4_connect: (tcp_v4_connect+0x0/0x480) skc_dport=100
          telnet-9727  [001] d... 62326249.220740: tcp_v4_connect: (tcp_v4_connect+0x0/0x480) skc_dport=e803

Now a crappy ntohs() to process the dport string:
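
The gist's own snippet is not shown in this preview. As a minimal sketch of the same idea, assuming the trace was taken on a little-endian host (so the big-endian port bytes appear swapped) and that ftrace printed the u16 as hex without a prefix, a byte swap in Python recovers the port numbers. The function name `dport_from_ftrace` is hypothetical:

```python
def dport_from_ftrace(hex_str):
    """Byte-swap a 16-bit hex value (a poor man's ntohs on a little-endian host)."""
    value = int(hex_str, 16) & 0xFFFF
    # Move the low byte to the high position and the high byte to the low position.
    return ((value & 0xFF) << 8) | (value >> 8)


# The four skc_dport values from the kprobe output above.
for raw in ("1600", "1700", "100", "e803"):
    print(raw, "->", dport_from_ftrace(raw))
```

Under those assumptions the four telnet connections decode to ports 22, 23, 1, and 1000.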

## sysbench-analysis.txt
Some rough notes from an analysis of sysbench cpu on x86 vs ARM, which showed it was 2.6x faster on ARM only
because of a faster div instruction, which did not translate to a production win. The benchmark was misleading.
I also talked about this topic in my IntelON 2021 talk.
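
For context on why one instruction can dominate: the sysbench cpu test counts primes by trial division, so its inner loop is mostly integer modulo (division). A rough Python sketch of that style of loop follows. The name `count_primes` is made up for illustration, and this is not sysbench's actual C source:

```python
import math


def count_primes(max_prime):
    """Count primes in [3, max_prime) by trial division, one modulo per candidate divisor."""
    found = 0
    for candidate in range(3, max_prime):
        limit = math.isqrt(candidate)
        is_prime = True
        for divisor in range(2, limit + 1):
            # This modulo is the hot operation: it needs an integer divide.
            if candidate % divisor == 0:
                is_prime = False
                break
        if is_prime:
            found += 1
    return found


print(count_primes(100))
```

A benchmark shaped like this mostly measures divide latency, which is unlikely to be representative of a production workload.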

m6g.4xl
=======
test bgregg-focal-arm-v000 us-east-1 i-0d6f5a0062ee66c4e
(root) ~ # sysbench --max-requests=10000000 --max-time=60 --test=cpu --cpu-max-prime=100000 run
WARNING: the --test option is deprecated. You can pass a script name or path on the command line without any options.
WARNING: --max-requests is deprecated, use --events instead

## decade.md

      
decade quick benchmarks (brendangregg / decade.md, last active April 15, 2022 04:08)

These are some quick benchmarks for the "Decade of Wasted Cores" patches on Linux 4.1.

I had to add "extern int sched_max_numa_distance;" to arch/x86/kernel/smpboot.c for Linux 4.1 to compile.

I did a brief analysis during the benchmarks using time(1) and mpstat(1) to check runtimes, usr/sys time, and per-CPU balance, and iostat(1) to check for disk bottlenecks.

Summary: no significant difference seen in these tests.

c3.8xlarge (32 CPU) PV 1-node NUMA

The patch shouldn't make a difference to this 1-node system, but I felt it worth checking, especially since most of our systems are 1-node.

  
## gist:f8ed5345cfc903599a60

      
dynamic tracing of ZFS on Linux, on Linux (brendangregg / gist:f8ed5345cfc903599a60, created August 5, 2014 01:08)

So I just found ZFS on my test Ubuntu Linux system, and gave my perf-tools (https://github.com/brendangregg/perf-tools) a spin.
Per-second zfs* calls:
# ./funccount -Ti 1 'zfs*'
Tracing "zfs*"... Ctrl-C to end.

Tue Aug  5 00:51:41 UTC 2014
FUNC                              COUNT


## fsmicrobench.md

      
some FS micro-benchmarks (brendangregg / fsmicrobench.md, last active February 16, 2022 08:25)

F1. FS 128k streaming writes

Benchmark: fio write
Command: fio --name=seqwrite --rw=write --bs=128k --size=4g --end_fsync=1 --loops=4 # aggrb tput
Rationale: Measure the performance of a single threaded streaming write of a reasonably large file. The aim is to measure how well the file system and platform can sustain a write workload, which will depend on how well it can group and dispatch writes. It's difficult to benchmark buffered file system writes in both a short duration and in a repeatable way, as performance greatly depends on if and when the pagecache begins to flush dirty data. As a workaround, an fsync() at the end of the benchmark is called to ensure that flushing will always occur, and the benchmark also repeats four times. While this provides a much more reliable measurement, it is somewhat worst-case (applications don't always fsync), providing closer to a minimum rate – rather than a maximum rate – that you should expect.
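
To make the measured operation concrete, here is a minimal Python sketch of the same pattern: sequential 128k buffered writes followed by a single fsync(), timed end to end. It is not a replacement for fio, and its numbers should not be compared with fio's. The function name `streaming_write` and the small demo size are assumptions for illustration:

```python
import os
import tempfile
import time


def streaming_write(path, size_bytes, block_size=128 * 1024):
    """Write size_bytes sequentially in block_size chunks, fsync once at the end.

    Returns (bytes_written, elapsed_seconds), with the fsync included in the timing.
    """
    buf = b"\0" * block_size
    written = 0
    start = time.monotonic()
    with open(path, "wb") as f:
        while written < size_bytes:
            chunk = min(block_size, size_bytes - written)
            f.write(buf[:chunk])
            written += chunk
        f.flush()
        os.fsync(f.fileno())  # force dirty pages to be flushed before the clock stops
    return written, time.monotonic() - start


if __name__ == "__main__":
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        nbytes, secs = streaming_write(path, 8 * 1024 * 1024)  # 8 MiB demo, not 4g
        print("%.1f MB/s" % (nbytes / 1e6 / max(secs, 1e-9)))
    finally:
        os.remove(path)
```

Including the fsync in the timed region is what pushes the result toward a minimum sustained rate, as the rationale describes.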
F2. FS cached 4k random reads


## usdt
#!/bin/bash
#
# usdt - trace user statically defined tracepoints. User-level dynamic tracing.
#        Written using Linux ftrace. Experimental.
#
# WARNING: This is a proof of concept for USDT tracing from Linux ftrace, and
# is not safe to use in production environments. In particular, the -i option
# sets memory semaphores by piping the output of printf through dd and then
# to process memory via /proc/PID/mem. Yes, this program pipes the output of
# the shell directly over top of live process memory. If you don't understand

## cpuunclaimed.py
#!/usr/bin/python
# @lint-avoid-python-3-compatibility-imports
#
# cpuunclaimed   Sample CPU run queues and calculate unclaimed idle CPU.
#                For Linux, uses BCC, eBPF.
#
# This samples the length of the run queues and determines when there are idle
# CPUs, yet queued threads waiting their turn. Report the amount of idle
# (yet unclaimed by waiting threads) CPU as a system-wide percentage.
#
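
One way to read the calculation described above, as a sketch only: for each sample, count idle CPUs and queued (waiting, not running) threads, and treat the smaller of the two as unclaimed idle CPU. The function name `unclaimed_idle_pct`, the input format, and the assumption that each run queue length includes the currently running thread are all hypothetical and not taken from the tool's source:

```python
def unclaimed_idle_pct(samples):
    """Estimate unclaimed idle CPU as a system-wide percentage.

    samples: list of samples, each a list of per-CPU run queue lengths.
    Assumes a length of N means 1 running thread plus N - 1 waiting threads.
    """
    unclaimed = 0
    total_cpu_samples = 0
    for lengths in samples:
        idle_cpus = sum(1 for n in lengths if n == 0)
        queued_threads = sum(n - 1 for n in lengths if n > 1)
        # An idle CPU is only "unclaimed" if a queued thread could have used it.
        unclaimed += min(idle_cpus, queued_threads)
        total_cpu_samples += len(lengths)
    if total_cpu_samples == 0:
        return 0.0
    return 100.0 * unclaimed / total_cpu_samples


# Two samples on a 4-CPU system: the first has 2 idle CPUs and 2 queued threads.
print(unclaimed_idle_pct([[0, 3, 1, 0], [1, 1, 1, 1]]))
```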

## biosnoop.py
#!/usr/bin/python
# @lint-avoid-python-3-compatibility-imports
#
# biosnoop  Trace block device I/O and print details including issuing PID.
#       For Linux, uses BCC, eBPF.
#
# This uses in-kernel eBPF maps to cache process details (PID and comm) by I/O
# request, as well as a starting timestamp for calculating I/O latency.
#
# Copyright (c) 2015 Brendan Gregg.