hashbrowncipher/how-flamegraphs.md

But actually, how do flamegraphs work?

Q: So when I make a flamegraph, how does that work?

The specific type of flamegraph you're seeing is an on-cpu flamegraph. It is
synthesized by periodically stopping the CPU, asking "what are you doing right
now", and storing the answer. The flamegraph itself is a convenient
representation of the data produced by doing that very frequently (see the -F
argument to perf record) over some period of time.
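
As a minimal sketch of that "convenient representation": the samples are just a list of call stacks, and the flamegraph is drawn from how often each distinct stack was seen. The code below counts identical stacks and prints them in the semicolon-separated "folded" form that flamegraph generators typically consume. The function name `fold_stacks` and the sample data are made up for illustration.

```python
from collections import Counter


def fold_stacks(samples):
    """Count identical stacks and render one 'root;...;leaf count' line each."""
    counts = Counter(";".join(stack) for stack in samples)
    return [f"{stack} {count}" for stack, count in sorted(counts.items())]


# Hypothetical samples, each listed root first (outermost caller first).
samples = [
    ["main", "parse", "read_token"],
    ["main", "parse", "read_token"],
    ["main", "render"],
    ["main", "parse"],
]

for line in fold_stacks(samples):
    print(line)
# main;parse 1
# main;parse;read_token 2
# main;render 1
```

The width of each box in the final graph is proportional to these counts, which is why more samples (a higher -F or a longer recording) give a smoother picture.
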

Q: How does perf stop the CPU?

perf calls perf_event_open(2). On amd64, the syscall
configures the processor's performance monitoring unit (PMU) to send an NMI
after some number of events have been counted. The CPU has three fixed-function
counters that can drive this (see "Table 20-2. Association of
Fixed-Function Performance Counters with Architectural Performance Events" in
Volume 3B of the Intel SDM):

- instructions retired (INST_RETIRED.ANY)
- actual clock cycles (CPU_CLK_UNHALTED.CORE)
- "clock cycles" as measured against a fixed reference clock (CPU_CLK_UNHALTED.REF_TSC)

The fixed reference clock runs at the rate of the CPU timestamp counter (TSC).
Figuring out that rate involves reading a lot of model-specific registers, but
I just had the turbostat command do it for me:
`sudo turbostat --num_iterations 1 --interval 1 2>&1 | sed -ne '/^TSC:/p;/^TSC:/q'`
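
To make "some number of clock cycles" concrete, here is a back-of-the-envelope sketch relating a clock rate to a sampling frequency. The 3 GHz rate and the 99 Hz frequency are assumed values, not measurements, and `cycles_per_sample` is a hypothetical helper. When perf is given a frequency, the kernel adjusts the period on the fly, so treat the result as an average rather than a fixed setting.

```python
def cycles_per_sample(clock_hz, sample_hz):
    """Average number of clock cycles between two samples."""
    return round(clock_hz / sample_hz)


clock_hz = 3_000_000_000  # assumed clock rate (3 GHz); substitute your own
sample_hz = 99            # assumed sampling frequency, as in `perf record -F 99`

print(cycles_per_sample(clock_hz, sample_hz))  # 30303030
print(sample_hz * 30)  # samples per busy CPU over a 30 second recording: 2970
```
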

The actual configuration is just a wrmsr instruction on
the appropriate register. It becomes quite obvious when perf is in use,
because the NMI and PMI ("performance monitoring interrupt") rows of
/proc/interrupts start incrementing like crazy.

The kernel installs a handler to process the NMI produced by the PMU. Each NMI it
receives becomes a stack sample.
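
One way to watch those counters move is to total the NMI and PMI rows of /proc/interrupts before and after a recording. The sketch below parses text in that layout; `interrupt_totals` is a hypothetical helper and the sample text contains invented numbers, not real output.

```python
def interrupt_totals(text, labels=("NMI", "PMI")):
    """Sum the per-CPU counts for the requested rows of /proc/interrupts."""
    totals = {}
    for line in text.splitlines():
        label, sep, rest = line.partition(":")
        label = label.strip()
        if not sep or label not in labels:
            continue
        counts = []
        for field in rest.split():
            if not field.isdigit():
                break  # the per-CPU counts end where the description begins
            counts.append(int(field))
        totals[label] = sum(counts)
    return totals


# Invented excerpt in the layout of /proc/interrupts on a two-CPU machine.
SAMPLE = """\
           CPU0       CPU1
  0:         10          0   IO-APIC   2-edge      timer
NMI:        120        118   Non-maskable interrupts
PMI:        120        118   Performance monitoring interrupts
"""

print(interrupt_totals(SAMPLE))  # {'NMI': 238, 'PMI': 238}
```

On a Linux machine, passing `open("/proc/interrupts").read()` twice, a second apart, and subtracting the totals gives a rough interrupt rate.
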

Q: What happens while the CPU is stopped?

The kernel copies various data into a ring buffer of memory pages
that are readable by the profiler (often perf). Since we're using
flamegraphs, our profiler must have requested PERF_SAMPLE_STACK_USER in the
attributes it passed to perf_event_open. This induces the kernel to copy a
snapshot of the user stack (up to a configured size) into the ring buffer. In
this way, the kernel can collect many stack+register samples from the profiled
process, without having to context switch into the profiler itself (as one
would with a ptrace-based profiler).
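
The request for stacks and registers is expressed as a bitmask in the `sample_type` field of the attribute struct handed to perf_event_open. The sketch below only illustrates how such a mask is composed. The bit positions are written from my reading of linux/perf_event.h and should be checked against your own header, the `SampleType` class is a hypothetical wrapper, and the particular combination is an assumption about what a stack-copying profiler would ask for.

```python
from enum import IntFlag


class SampleType(IntFlag):
    """A few PERF_SAMPLE_* bits (assumed values; verify in linux/perf_event.h)."""
    IP = 1 << 0           # instruction pointer at the time of the sample
    CALLCHAIN = 1 << 5    # kernel-walked call chain
    REGS_USER = 1 << 12   # user-space register values
    STACK_USER = 1 << 13  # raw copy of the user stack


# Assumed request: instruction pointer plus registers and a stack copy.
sample_type = SampleType.IP | SampleType.REGS_USER | SampleType.STACK_USER

print(hex(sample_type))                     # 0x3001
print(SampleType.STACK_USER in sample_type)  # True
print(SampleType.CALLCHAIN in sample_type)   # False
```
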

Q: What does the profiler do?

The profiler usually sleeps on the file descriptor created by perf_event_open
with one of the poll/select/epoll syscalls. When it wakes up, it copies data
out of the memory pages populated by the kernel. If it fails to process data
fast enough, segments of the ring buffer will get overwritten and events will
be dropped.
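
The producer/consumer relationship described above can be modeled with a toy ring buffer. This is a deliberately simplified sketch, not the kernel's actual data structure: `ToyRingBuffer` is a hypothetical class, samples are plain integers, and it models only the overwrite-and-lose behavior described in the text.

```python
from collections import deque


class ToyRingBuffer:
    """Fixed-capacity buffer: the writer never waits, so a slow reader loses data."""

    def __init__(self, capacity):
        self.slots = deque(maxlen=capacity)
        self.lost = 0

    def write(self, sample):
        # Called by the "kernel" side on every sample.
        if len(self.slots) == self.slots.maxlen:
            self.lost += 1  # the oldest unread sample is about to be overwritten
        self.slots.append(sample)

    def drain(self):
        # Called by the "profiler" side whenever it wakes up.
        drained = list(self.slots)
        self.slots.clear()
        return drained


buf = ToyRingBuffer(capacity=4)
for sample in range(6):  # six samples arrive before the profiler wakes up
    buf.write(sample)

print(buf.drain())  # [2, 3, 4, 5]
print(buf.lost)     # 2
```

A profiler that drains more often, or maps a larger buffer, sees fewer lost samples.
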

To make a flamegraph, the profiler must unwind each stack and determine the name
of the function executing in each frame. This is the same job a debugger
(e.g. gdb) does, except perf is restricted to reading only data present in the
stack: it cannot refer to other data structures in the program's memory. This
makes it less meaningful to profile coroutine or event-loop based
programs, because much of their state resides on the heap.

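
The naming half of that job amounts to mapping each return address to the function whose code contains it. A minimal sketch, assuming a sorted table of function start addresses: the addresses and names are invented, and real symbol tables also record sizes, so this lookup would wrongly attribute an address beyond the last function to that function.

```python
import bisect

# Hypothetical symbol table: (start address, function name), sorted by address.
SYMBOLS = [
    (0x1000, "main"),
    (0x1400, "parse"),
    (0x1900, "read_token"),
]
STARTS = [start for start, _ in SYMBOLS]


def symbolize(address):
    """Return the name of the last function starting at or before address."""
    index = bisect.bisect_right(STARTS, address) - 1
    if index < 0:
        return "[unknown]"
    return SYMBOLS[index][1]


def symbolize_stack(addresses):
    """Unwound addresses arrive leaf first; emit names root first."""
    return [symbolize(address) for address in reversed(addresses)]


print(symbolize_stack([0x1910, 0x1410, 0x1004]))
# ['main', 'parse', 'read_token']
```
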
The profiler (or some combination of helper programs) then passes a list of
annotated stacks to a flamegraph generator program, which
outputs an SVG.

Q: Is it fast?

It seems like the fastest form of profiling available. The profiled process
continues running as normal; it just experiences a higher-than-normal quantity
of CPU interrupts, as if the system were receiving network traffic.

The actual data collection doesn't require:

- sending/receiving any signals (e.g. SIGPROF)
- waking up a separate process (as with ptrace)
- performing any syscalls

The profiler never blocks the profiled process, can run on an entirely different CPU, and only needs to wake up occasionally to process or persist the collected data.