Skip to content

Instantly share code, notes, and snippets.

@stsquad
Created September 18, 2018 20:56
Show Gist options
  • Save stsquad/41794ab268013b2eb0d304df14744eaf to your computer and use it in GitHub Desktop.
Save stsquad/41794ab268013b2eb0d304df14744eaf to your computer and use it in GitHub Desktop.

My other machine is virtual

Introduction

Who Am I

  • Alex Bennée
  • Senior Virtualisation Engineer @ Linaro
  • QEMU’s TCG (emulation)/KVM
  • IRC: stsquad/ajb-linaro

Notes on code

  • using Python
  • <<noweb inclusions>>
  • source for slides on request

Why use VMs for experiments?

Measurement

generated/usb-logic-analyser.jpg

Control

external/neo-examine-bullet.png

Analysis

generated/analysis.jpg

Example Tools

What about QEMU?

“QEMU is a generic and open source machine emulator and virtualizer.” - from https://www.qemu.org

generated/processor_logos.png

Linux User Emulation

      +------------------------------+
      | cBLU                         |
      | +--------------------------+ |
      | | cYEL                     | |
      | |     Guest Linux App      | |
      | |    (ARM/Linux binary)    | |
      | |                          | |
      | +--------------------------+ |
      |                              |
      | +--------------------------+ |
      | | cPNK                     | |
      | |       JIT Code (x86)     | |
      | |                          | |
      | +--------------------------+ |
      |                              |
      | +--------------------------+ |
      | | cGRE                     | |
      | |         QEMU             | |
      | |                          | |
      | +--------------------------+ |
      +------------------------------+
                     ^
--=------------------|-------------------=----
                     V
      +------------------------------+
      | cBLU                         |
      |    Host Linux Kernel (x86)   |
      |                              |
      +------------------------------+

Full System Emulation

       +------------------------------+
       | cBLU                         |
       | +--------------------------+ |
       | | cRED                     | |
       | |      Guest Memory        | |
       | |          (ARM)           | |
       | | +---------+  +---------+ | |
       | | |cYEL     |  |cYEL     | | |
       | | |   App   |  |   App   | | |
       | | |         |  |         | | |
       | | +---------+  +---------+ | |
       | |------=-------------------| |
       | | +----------------------+ | |
(User) | | |cBLU                  | | |
       | | |     Guest Kernel     | | |
       | | |                      | | |
       | | +----------------------+ | |
       | +--------------------------+ |
       |                              |
       | +--------------------------+ |
       | | cPNK                     | |
       | |       JIT Code (x86)     | |
       | |                          | |
       | +--------------------------+ |
       |                              |
       | +--------------------------+ |
       | | cGRE    QEMU             | |
       | +--------------------------+ |
       +------------------------------+
 ----------------------=-----------------------
       +------------------------------+
       | cBLU                         |
 (Kern)|    Host Linux Kernel (x86)   |
       |                              |
       +------------------------------+

Simple Tracing with logs

:header-args+: :dir ~/lsrc/qemu.git :header-args+: :var QEMU=”./aarch64-linux-user/qemu-aarch64” :header-args+: :var BINARY=”./aarch64-linux-user/tests/sha1” :header-args+: :eval no-export :header-args+: :noweb no-export

Very Simple Trace

:header-args+: :var trace=”trace.out”

Running

echo "BINARY=${BINARY}"
${QEMU} -d cpu -D $trace ${BINARY}

Gives

BINARY=./aarch64-linux-user/tests/sha1
SHA1=15dd99a1991e0b3826fede3deffc1feba42278e6

Trace File

head -n 12 $trace
 PC=00000000004003f0 X00=0000000000000000 X01=0000000000000000
X02=0000000000000000 X03=0000000000000000 X04=0000000000000000
X05=0000000000000000 X06=0000000000000000 X07=0000000000000000
X08=0000000000000000 X09=0000000000000000 X10=0000000000000000
X11=0000000000000000 X12=0000000000000000 X13=0000000000000000
X14=0000000000000000 X15=0000000000000000 X16=0000000000000000
X17=0000000000000000 X18=0000000000000000 X19=0000000000000000
X20=0000000000000000 X21=0000000000000000 X22=0000000000000000
X23=0000000000000000 X24=0000000000000000 X25=0000000000000000
X26=0000000000000000 X27=0000000000000000 X28=0000000000000000
X29=0000000000000000 X30=0000000000000000  SP=00000040008515a0
PSTATE=40000000 -Z-- S EL0t

Data Points

Running:

echo $(du -h $trace)
echo $(ag -i "PC=" $trace | wc -l) PCs

Gives

65K trace.out
799 PCs

Data Analysis

:header-args+: :var trace=”trace.out” :header-args+: :noweb no-export :header-args+: :var heatmap=generated-filepath(file=”graphs/default-execution-heatmap.png”)

Load Trace Data

import numpy as np
from re import compile

pcs = []
extract_re = compile("\s+PC=([0-9A-Fa-f]{4,16})")
with open(trace, "r") as data:
    for line in data:
        match = extract_re.match(line)
        if match and match.group(1):
            pc = int(match.group(1), 16)
            pcs.append(pc)

array = np.array(pcs)
print ("Collected %d samples from %s" % (len(pcs), trace))

Result

Collected 799 samples from trace.out

Examine Data

<<load-trace-data>>

base = np.amin(array, axis=0)
top = np.amax(array, axis=0)

base_page = base >> 12
top_page = top >> 12

print("Active region is %d pages from %#x to %#x" %
      (top_page - base_page, base, top))

Result

Collected 799 samples from trace.out
Active region is 97 pages from 0x4001d8 to 0x461f50

Examine Frequency

<<examine-data>>

counts = np.bincount(array)
mf = counts.argmax()
print("Most frequent PC = %#x (%d hits)" % (mf, counts[mf]))

Result

Collected 799 samples from trace.out
Active region is 97 pages from 0x4001d8 to 0x461f50
Most frequent PC = 0x41b600 (12 hits)

Setup MatplotLib

import matplotlib as mpl
mpl.use('Agg')
import matplotlib.pyplot as plt

Prepare Data

<<examine-frequency>>

from math import sqrt

# trim low addresses and pad array to page size
active = counts[base_page<<12:]
extra = 4096 - (len(active) % 4096)
padded = np.pad(active, (0, extra),
                'constant', constant_values=(0))

# reduce the map (4 insns/bucket)
bytes_bucket = 16
reduced = np.add.reduce(padded.reshape((-1, bytes_bucket)), 1)

# format as a 2d grid
side = int(sqrt(len(reduced)))
result = reduced[0:side ** 2].reshape((-1, side))

print("Reduced to %d x %d grid (%d bytes/bucket)"
      % (side, len(result), bytes_bucket))

Plot

<<setup-matplotlib>>
<<prepare-data>>

ax = plt.subplot(111)
im = plt.imshow(result, cmap="hot")

<<tweak-labels>>

# add "temp" bar to right
divider = make_axes_locatable(ax)
cax = divider.append_axes("right", size="5%", pad=0.05)
plt.colorbar(im, cax=cax)

plt.savefig(heatmap)

Execution Heatmap

generated/graphs/default-execution-heatmap.png

TCG Run Loop

:header-args+: :dir ~/org/presentations
           Run Loop
     qemu_tcg_cpu_thread_fn()
+--------------+
|              |
|              V
|    +--------------------+
|    | Find Next Block    |
|    |cGRE                |
|    |   +----------------+
|    |   | Generate Block |
|    |   |cRED            |             Add new code to buffer
|    |   |                +--------------------------------------=---+
|    |   |                |                                          |
|    |   +----------------+                                          |
|    |                    |                                          |
|    |                    |                                          |
|    +--------------------+         +---------------------------+    |
|    | Run Block          +-------->|cBLU                       |    |
|    |cGRE                |         |  Translated Code Buffer   |    |
|    |                    |         |                           |    |
|    |                    |         |                           |    |
|    |                    |<--------+                           |    |
|    +---------+----------+         +------------+--------------+    |
|              |                    :                           :    |
|              |                    :                           :<---+
+--------------+                    :                           :
                                    :                           :
                                    :                           :

TCG Operation

:header-args+: :dir ~/org/presentations
                 Guest Instructions                    TCG Micro Ops                     Host Instructions
                        (ARM)                                                                  (x86)
            +--------------------------+    +------------------------------+     +----------------------------+
            | cYEL                     |    | cGRE                         |     |cPNK                        |
Guest PC -->| ldr q0, [x21, x0]        | +->| mov_i64 tmp2,x21             |  +->| movl     -0x14(%r14), %ebp |
            | ldr q1, [x20, x0]        | |  | mov_i64 tmp3,x0              |  |  | testl    %ebp, %ebp        |
            | fmul v0.4s, v0.4s, v1.4s | |  | add_i64 tmp2,tmp2,tmp3       |  |  | jl       0x5555559b87af    |
            | str q0, [x19, x0]        | |  | movi_i64 tmp7,$0x8           |  |  | movq     0xe8(%r14), %rbp  |
            | add x0, x0, #0x10 (16)   | |  | add_i64 tmp6,tmp2,tmp7       |  |  | movq     0x40(%r14), %rbx  |
            | cmp x0, #0x400000        | |  | qemu_ld_i64 tmp4,tmp2,leq,0  |  |  | addq     %rbx, %rbp        |
            | b.ne loop                | |  | qemu_ld_i64 tmp5,tmp6,leq,0  |  |  | leaq     8(%rbp), %r12     |
            |                          | |  | st_i64 tmp4,env,$0x898       |  |  | movq     (%rbp), %rbp      |
            +------------+-------------+ |  | st_i64 tmp5,env,$0x8a0       |  |  | movq     (%r12), %r12      |
                         |               |  | ...                          |  |  | movq     %rbp, 0x898(%r14) |
                         |               |  | ...                          |  |  | movq     %r12, 0x8a0(%r14) |
                         |               |  |                              |  |  | movq     0xe0(%r14), %rbp  |
                         |               |  +----+-------------------------+  |  | addq     %rbx, %rbp        |
                         |               |       |                            |  | leaq     8(%rbp), %rbx     |
                         v               |       v                            |  | movq     (%rbp), %rbp      |
                  +------+-------+       |  +------------+    +------------+  |  | movq     (%rbx), %rbx      |
                  |{io}      cGRE|       |  |{io}    cGRE|    |{io}    cGRE|  |  | movq     %rbp, 0x8a8(%r14) |
                  |              |       |  |            |    |            |  |  | ...                        |
                  |  TCG Disas   +-------+  |  TCG Opt   +--->|  TCG Gen   +--+  | ...                        |
                  |              |          |            |    |            |     |                            |
                  |              |          |            |    |            |     |                            |
                  +--------------+          +------------+    +------------+     +----------------------------+

Block Chaining

:header-args+: :dir ~/org/presentations

generated/diagrams/block-chaining.svg

Disable Chaining

:header-args+: :var trace=”trace-nochain.out”
${QEMU} -d cpu,nochain -D $trace ${BINARY}

Result

36M trace-nochain.out
859943 PCs

Analyse No Chain

Collected 859943 samples from trace-nochain.out
Active region is 97 pages from 0x4001d8 to 0x461f50
Most frequent PC = 0x417e40 (64003 hits)

No Chain Heatmap

:header-args+: :var heatmap=generated-filepath(file=”graphs/nochain-execution-heatmap.png”)

generated/graphs/nochain-execution-heatmap.png

Generate (log) heatmap

:header-args+: :var heatmap=generated-filepath(file=”graphs/nochain-execution-heatmap-log.png”)
<<setup-matplotlib>>
<<prepare-data>>

from matplotlib.colors import LogNorm
log = LogNorm(vmin=1, vmax=np.max(result))

ax = plt.subplot(111)
ax.set_facecolor("black")

im = plt.imshow(result, cmap="hot", norm=log)

<<tweak-labels>>

# add "temp" bar to right with 2n ticks
from matplotlib.ticker import LogLocator, LogFormatterSciNotation
divider = make_axes_locatable(ax)
cax = divider.append_axes("right", size="5%", pad=0.05)
locator = LogLocator(base=2)
formatter = LogFormatterSciNotation(base=2)
plt.colorbar(im, cax=cax, ticks=locator, format=formatter)

plt.savefig(heatmap)

No Chain Heatmap (log)

generated/graphs/nochain-execution-heatmap-log.png

Single Step

:header-args+: :var trace=”trace-singlestep.out”
${QEMU} -singlestep -d cpu,nochain -D $trace ${BINARY}

-ENOSPC

dfilter

:header-args+: :var trace=”trace-nochain-dfilter.out”
${QEMU} -d cpu,nochain -D $trace -dfilter 0x400580..0x402fdc ${BINARY}

Result

Collected 589537 samples from trace-nochain-dfilter.out
Active region is 2 pages from 0x400580 to 0x402fcc
Most frequent PC = 0x400580 (64001 hits)

dfilter Heat Map (log)

:header-args+: :var heatmap=generated-filepath(file=”graphs/dfilter-execution-heatmap.png”)

generated/graphs/dfilter-execution-heatmap.png

Summary of Log Options

-d cpu-d cpu,nochain-singlestep -d cpu,nochain
Timequickslowvery slow
Sizesmallmediumhuuuuge
Covrtransexec (tb)insn exec

Using QEMU Trace Facility

:header-args+: :dir ~/lsrc/qemu.git :header-args+: :var QEMU=”./aarch64-linux-user/qemu-aarch64” :header-args+: :var BINARY=”./aarch64-linux-user/tests/sha1” :header-args+: :eval no-export :header-args+: :noweb no-export

Tracepoint API

exec_tb(void *tb, uintptr_t pc) "tb:%p pc=0x%"PRIxPTR
trace_exec_tb(tb, tb->pc);

Trace Output

  • Plain text (textual log file)
  • Simple (binary log file)
  • dtrace/ftrace
  • ust (LTTNG)
  • syslog
./configure --enable-trace-backends=simple,ftrace,log

Why Trace Files?

  • compact files
  • faster output
  • lots of interesting trace points
  • TCG integration

Using Log Trace

:header-args+: :var trace=”trace-event-nochain.out”
${QEMU} -d nochain,trace:exec_tb -D $trace ${BINARY}

Gives

8.4M	trace-event-nochain.out

Using Binary Trace

:header-args+: :python python3 :header-args+: :var trace=”trace-memory-events.out” :header-args+: :var BINARY=”./aarch64-linux-user/tests/fcvt”
${QEMU} -trace enable=guest_mem_before_exec,file=$trace ${BINARY}

Examine Binary Trace

./scripts/simpletrace.py trace-events-all $trace | head -n 3
guest_mem_before_exec 0.000 pid=29877 __cpu=0xptr vaddr=0x4000851590 info=0x3
guest_mem_before_exec 3.981 pid=29877 __cpu=0xptr vaddr=0x400420 info=0x3
guest_mem_before_exec 0.798 pid=29877 __cpu=0xptr vaddr=0x400428 info=0x3

Define Python Analyser API

import sys
sys.path.append("scripts/") # in QEMU_SRC
import simpletrace

class MemAnalyser(simpletrace.Analyzer):

    def __init__(self):
        self.stores = []
        self.scounts = { 0:0, 1:0, 2:0, 3:0 }
        self.loads = []
        self.lcounts = { 0:0, 1:0, 2:0, 3:0 }

    def guest_mem_before_exec(self, ts, cpu, vaddr, info):
        store = info & (1 << 5)
        size = info & 0x7
        if store:
            self.stores.append((ts, vaddr, size))
            self.scounts[size] += 1
        else:
            self.loads.append((ts, vaddr, size))
            self.lcounts[size] += 1

Analyse Memory Patterns

<<define-memtrace-analyser>>

# Gather data from the trace
mema = MemAnalyser()
simpletrace.process("trace-events-all", trace, mema)

print("Loads: %d -> %s" % (len(mema.loads), mema.load_counts))
print("Stores: %d -> %s" % (len(mema.stores), mema.store_counts))

Gives

Loads: 4064403 -> {0: 1082887, 1: 5561, 2: 646870, 3: 2329085}
Stores: 2377689 -> {0: 179729, 1: 4970, 2: 462047, 3: 1730943}

Dynamic Tracing via HMP/Monitor

(qemu) info trace-events bdrv_co_p*
bdrv_co_pwrite_zeroes : state 0
bdrv_co_pwritev : state 0
bdrv_co_preadv : state 0
(qemu) trace-event bdrv_co_pwritev on
(qemu) trace-event bdrv_co_preadv on
(qemu) trace-file set blocks.out

Analyse Block Events

<<simpletrace-import>>

class BlockAnalyser(simpletrace.Analyzer):
    def __init__(self):
        self.rtotal = 0
        self.wtotal = 0
        self.reads = []
        self.writes = []

    def bdrv_co_pwritev(self, ts, pid, bs, offset, nbytes, flgs):
        self.wtotal += nbytes
        self.writes.append((ts, self.wtotal))

    def bdrv_co_preadv(self, ts, pid, bs, offset, nbytes, flgs):
        self.rtotal += nbytes
        self.reads.append((ts, self.rtotal))

cold = BlockAnalyser()
hot = BlockAnalyser()
simpletrace.process("trace-events-all", "coldblocks.out", cold)
simpletrace.process("trace-events-all", "hotblocks.out", hot)

Graph Block Events

:header-args+: :var plot=generated-filepath(file=”hot-cold-build.png”)
<<get-block-data>>

plt.subplot(2, 1, 1)
plt.plot(cold_reads[:,0], cold_reads[:,1], color="blue")
plt.plot(cold_writes[:,0], cold_writes[:,1], color="red")
plt.title('QEMU Compile')
plt.ylabel('Cold Cache')
plt.yticks([])
plt.xticks([])
plt.legend(labels=("read", "write"))

plt.subplot(2, 1, 2)
plt.plot(hot_reads[:,0], hot_reads[:,1], color="blue")
plt.plot(hot_writes[:,0], hot_writes[:,1], color="red")
plt.ylabel('Hot Cache')
plt.yticks([])
plt.xlabel('time')
plt.xticks([])
plt.legend(labels=("read", "write"))

plt.savefig(plot)

Hot vs Cold Cache Compile

generated/hot-cold-build.png

Timing Issues

:header-args+: :dir ~/lsrc/qemu.git :header-args+: :eval no-export :header-args+: :noweb no-export
  • latency (TCG code is slower)
  • icount pegs time to insns
    • not cycle accurate!
    • system emulation only
${QEMU} ${QEMU_ARGS} -icount 1

Collect Data

:header-args+: :var QEMU=”./aarch64-softmmu/qemu-system-aarch64” :header-args+: :var QEMU_MACHINE=”-machine virt,graphics=on,gic-version=3,virtualization=on -cpu cortex-a53 -m 4096 -serial none -smp 1 -display none” :header-args+: :var QEMU_NETWORK=”-nic user,model=virtio-net-pci,hostfwd=tcp::2222-:22” :header-args+: :var QEMU_DISK=”-device virtio-scsi-pci -drive file=/dev/zvol/hackpool-0/debian-stretch-arm64,id=hd0,index=0,if=none,format=raw,snapshot=on -device scsi-hd,drive=hd0”

With icount

Analyse Exit Data

<<simpletrace-import>>

class MyAnalyser(simpletrace.Analyzer):
    def __init__(self):
        self.irqs = []
        self.reads = []
        self.rtotal = 0

    def tcg_irq(self, ts, cpu, cpu_id):
        self.irqs.append(ts)

    def bdrv_co_preadv(self, ts, pid, bs, offset, nbytes, flgs):
        self.rtotal += nbytes
        self.reads.append((ts, self.rtotal))

tickless = MyAnalyser()
ticked = MyAnalyser()
simpletrace.process("trace-events-all", "tickless-irqs.out", tickless)
simpletrace.process("trace-events-all", "ticked-irqs.out", ticked)

ic_tickless = MyAnalyser()
ic_ticked = MyAnalyser()
simpletrace.process("trace-events-all", "tickless-icount-irqs.out", ic_tickless)
simpletrace.process("trace-events-all", "ticked-icount-irqs.out", ic_ticked)

IRQ events default vs icount

generated/graphs/ticks-vs-tickless-icount.png

In Development

Reverse Debugging

  • builds on record/replay
  • uses gdbstub
  • see {PATCH v6 00/25} Fixing record/replay and adding reverse debugging

Plugins

  • in-situ analysis
    • per instruction
  • debate around API/ABI scope
  • see {Qemu-devel} {RFC PATCH v2 0/7} QEMU binary instrumentation prototype
  • better memory logging
  • icount for user-mode
  • block execution counts
  • ThreadSanitizer?

Summary

Why QEMU?

  • Cross-platform
  • System Emulation
  • Open platform

Tracing API

  • Trace points > debug logs
  • Active Development

Feedback?

  • Use cases
  • Questions?

Extra Slides

Tweaking the Labels

Code to tweak the ticks to match the actual address of the hit.

from mpl_toolkits.axes_grid1 import make_axes_locatable

ylabels=ax.get_yticks().tolist()
addr_labels = []
for l in ylabels:
    offset = l * side * bytes_bucket
    addr_labels.append("%#x" % ((base_page << 12) + offset))
ax.set_yticklabels(addr_labels)
ax.set_xticks([])

Ephemera

Helper Functions

(concat default-directory "generated/" file)

Publish

publish_org.sh

Graph Memory Patterns

:header-args+: :var plot=generated-filepath(file=”load-store-scatter.png”)
<<setup-matplotlib>>
<<analyse-memory-pattern>>
import numpy as np

loads = np.array(analyser.loads)
laddrs = loads[:,1]
counts = np.bincount(laddrs)
mf = counts.argmax()
print("Most frequent addr = %#x (%d hits)" % (mf, counts[mf]))


#plt.savefig(plot)
#return plot
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment