sample perf problem answers

G'Day, I didn't have an email address so I pasted this into a gist. I just wanted to explain things a bit better than I could on Twitter. Good luck!

1. CPU-bound

It depends on what you mean by CPU-bound: bound by CPU availability, or bound by CPU speed.

If the CPUs themselves are hot, then this is easy: "mpstat -P ALL 1" will show the hot CPUs. If single threads are causing it, then "pidstat -t 1" will identify them (although it could also be a large thread pool competing). This approach identifies whether something is resource constrained by CPU availability, but not whether it is bound by CPU speed.
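For example, a minimal sketch of both checks (the one-second intervals are just illustrative; these assume the sysstat versions of mpstat and pidstat):

    # per-CPU utilization at 1-second intervals: look for individual hot CPUs
    mpstat -P ALL 1

    # per-thread (-t) CPU usage at 1-second intervals: look for hot threads
    pidstat -t 1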

Imagine a single CPU system running at 10% utilization, with an application processing 1 request per second, where each request takes 100 ms, all of it CPU time. The application's performance is CPU (speed) bound, but the system looks mostly idle. This can be identified by comparing walltime vs CPU time for the request. Most languages have a way to get the CPU time (something getrusage() related). Eg, if you measured that the request took 100 ms of walltime, and 100 ms of CPU time was consumed, then you know it was on-CPU the entire time.
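As a rough sketch of that comparison from the shell, assuming ./request is a hypothetical program that processes one request and exits:

    # compare walltime ("real") against CPU time ("user" + "sys");
    # if user + sys is close to real, the request was on-CPU nearly the whole time,
    # i.e. bound by CPU speed; if real is much larger, it was waiting on something else
    time ./request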

2. waiting for slow network responses

This gets tricky (it shouldn't be, we should fix the kernel, but anyway).

Sometimes it might be easy: the application blocks synchronously during connect()/send()/recv(). If the app isn't doing a high rate of syscalls, then one could even use strace to analyze it, but I'd generally avoid that approach (until strace is fixed to use a tracing interface instead of ptrace()).
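If you do go the strace route, a minimal sketch (high overhead, so use with care; <PID> is a placeholder):

    # attach to a process and show network-related syscalls, with timestamps (-tt)
    # and the time spent in each syscall (-T)
    strace -tt -T -e trace=network -p <PID>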

Sometimes you can just take a series of stack snapshots (pstack, jstack) and notice that threads are often waiting on network I/O. That might be enough of a clue.
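Eg, a simple way to take that series, assuming a Java app and a hypothetical <PID>:

    # take a jstack snapshot every second for 10 seconds (use pstack for non-JVM processes)
    for i in $(seq 1 10); do
        jstack <PID> > stacks.$i.txt
        sleep 1
    done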

I like to look at thread blocking events by tracing context switches, since it's generic and matches everything: off-CPU analysis. On Linux, that's "perf record -e cs -a -g -- sleep 10", or something similar. That command will give you counts and stacks, but not time spent while blocked.
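In full, a sketch of that perf workflow (the "sleep 10" just bounds the recording window):

    # record context-switch events (-e cs) system-wide (-a) with stack traces (-g) for 10 seconds
    perf record -e cs -a -g -- sleep 10

    # then summarize the recorded stacks and their counts
    perf report --stdio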

I wrote a couple of blog posts recently to show how to measure time-while-blocked, for the off-CPU approach:

Linux: http://www.brendangregg.com/blog/2015-02-26/linux-perf-off-cpu-flame-graph.html
FreeBSD: http://www.brendangregg.com/blog/2015-03-12/freebsd-offcpu-flame-graphs.html

Both of these have overheads that I think are too high, in general. We'll get better (on Linux, via eBPF; on FreeBSD, we just need to fix that symbol resolution path).

Those approaches work for simple applications and thread pools. Modern architectures using event worker threads (e.g., Node.js, RxNetty) need a different approach, since network I/O is issued asynchronously by I/O threads, which don't block. So you won't see the time in either syscalls or thread blocking. For that, you'll need to go into the application and instrument it there, or go into the kernel and instrument it by socket.

3. writing a lot to disk

Many of the same off-CPU blocking approaches from (2) work here as well, which would measure whether the application is suffering or not. A lot of disk writes (depending on the open flags) will be asynchronous and flushed to disk later, so they won't hurt application performance directly.

To get a general idea of which application is doing writes, "pidstat -d 1" can be helpful. Plus Linux delay accounting has a type for block I/O, which might work for writes (I'd need to test); I had an example on page 130 of the Systems Performance book, which used the sample getdelays program from the kernel source.
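Eg, a quick sketch of both (pidstat is from sysstat; getdelays.c ships with the kernel source, under Documentation/accounting/ or tools/accounting/ depending on the kernel version, and needs to be built first; <PID> is a placeholder):

    # per-process disk read/write throughput at 1-second intervals
    pidstat -d 1

    # Linux delay accounting: -d prints delay stats (including block I/O delay) for a PID
    ./getdelays -d -p <PID>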

Brendan
