sample perf problem answers

G'Day, I didn't have an email address so I pasted this into a gist. I just wanted to explain things a bit better than I could on Twitter. Good luck!

1. CPU-bound

It depends on what you mean by CPU-bound: bound by CPU availability, or bound by CPU speed.

If the CPUs themselves are hot, then this is easy: "mpstat -P ALL 1" will show the hot CPUs. If single threads are causing it, then "pidstat -t 1" will identify them (although it could also be a large thread pool competing). This approach identifies whether something is resource constrained by CPU availability, but not whether it is bound by CPU speed.
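For example, a minimal sketch of both checks (the one-second intervals are just illustrative; these assume the sysstat versions of mpstat and pidstat):

    # per-CPU utilization at 1-second intervals: look for individual hot CPUs
    mpstat -P ALL 1

    # per-thread (-t) CPU usage at 1-second intervals: look for hot threads
    pidstat -t 1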

Imagine a single CPU system running at 10% utilization, with an application processing 1 request per second, where each request takes 100 ms, all of it CPU time. The application's performance is CPU (speed) bound, but the system looks mostly idle. This can be identified by comparing walltime vs CPU time for the request. Most languages have a way to get the CPU time (something getrusage() related). Eg, if you measured that the request took 100 ms of walltime, and 100 ms of CPU time was consumed, then you know it was on-CPU the entire time.
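As a rough sketch of that comparison from the shell, assuming ./request is a hypothetical program that processes one request and exits:

    # compare walltime ("real") against CPU time ("user" + "sys");
    # if user + sys is close to real, the request was on-CPU nearly the whole time,
    # i.e. bound by CPU speed; if real is much larger, it was waiting on something else
    time ./request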

2. waiting for slow network responses

This gets tricky (it shouldn't be, we should fix the kernel, but anyway).

Sometimes it might be easy: the application blocks synchronously during connect()/send()/recv(). If the app isn't doing a high rate of syscalls, then one could even use strace to analyze it, but I'd generally avoid that approach (until strace is fixed to use a tracing interface instead of ptrace()).
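If you do go the strace route, a minimal sketch (high overhead, so use with care; <PID> is a placeholder):

    # attach to a process and show network-related syscalls, with timestamps (-tt)
    # and the time spent in each syscall (-T)
    strace -tt -T -e trace=network -p <PID>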

Sometimes you can just take a series of stack snapshots (pstack, jstack) and notice that threads are often waiting on network I/O. That might be enough of a clue.
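Eg, a simple way to take that series, assuming a Java app and a hypothetical <PID>:

    # take a jstack snapshot every second for 10 seconds (use pstack for non-JVM processes)
    for i in $(seq 1 10); do
        jstack <PID> > stacks.$i.txt
        sleep 1
    done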

I like to look at thread blocking events by tracing context switches, since it's generic and matches everything: off-CPU analysis. On Linux, that's "perf record -e cs -a -g -- sleep 10", or something similar. That command will give you counts and stacks, but not time spent while blocked.
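In full, a sketch of that perf workflow (the "sleep 10" just bounds the recording window):

    # record context-switch events (-e cs) system-wide (-a) with stack traces (-g) for 10 seconds
    perf record -e cs -a -g -- sleep 10

    # then summarize the recorded stacks and their counts
    perf report --stdio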

I wrote a couple of blog posts recently to show how to measure time-while-blocked, for the off-CPU approach:

Linux: http://www.brendangregg.com/blog/2015-02-26/linux-perf-off-cpu-flame-graph.html
FreeBSD: http://www.brendangregg.com/blog/2015-03-12/freebsd-offcpu-flame-graphs.html

Both of these have overheads that I think are too high, in general. We'll get better (on Linux, via eBPF; on FreeBSD, we just need to fix that symbol resolution path).

Those approaches work for simple applications and thread pools. Modern architectures using event worker threads (e.g., Node.js, RxNetty) need a different approach, since network I/O is issued asynchronously by I/O threads, which don't block. So you won't see the time in either syscalls or thread blocking. For that, you'll need to go into the application and instrument it there, or go into the kernel and instrument it by socket.

3. writing a lot to disk

Many of the same off-CPU blocking approaches from (2) work here as well, which would measure whether the application is suffering or not. A lot of disk writes (depending on the open flags) will be asynchronous and flushed to disk later, so they won't hurt application performance directly.

To get a general idea of which application is doing writes, "pidstat -d 1" can be helpful. Plus Linux delay accounting has a type for block I/O, which might work for writes (I'd need to test); I had an example on page 130 of the Systems Performance book, which used the sample getdelays program from the kernel source.
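Eg, a quick sketch of both (pidstat is from sysstat; getdelays.c ships with the kernel source, under Documentation/accounting/ or tools/accounting/ depending on the kernel version, and needs to be built first; <PID> is a placeholder):

    # per-process disk read/write throughput at 1-second intervals
    pidstat -d 1

    # Linux delay accounting: -d prints delay stats (including block I/O delay) for a PID
    ./getdelays -d -p <PID>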

Brendan
