@captivus
Created May 28, 2026 19:31
Handy memory leak bugfix methodology

A Methodology for Identifying and Fixing Hard Bugs in Complex Multi-Process Software

A reproducible procedure for diagnosing bugs that resist normal debugging: bugs that span process boundaries, manifest over hours, or have no obvious cause from logs alone. Worked example throughout: a sustained memory leak in WebKitWebProcess on Linux in a Tauri 2 desktop application (issue #1279 in the cjpais/Handy repository), eventually narrowed to WebKit C++ allocations driven by high-frequency JS event traffic, then fixed with a three-part patch validated by bisect.

The procedure is built around a single core idea: build a complete map of the system's execution flow before investigating the bug, then use the bug's symptoms as constraints against that map to generate candidates, then narrow with iterative measurement, then validate with bisect controls. It deliberately resists the more common "form a hypothesis, write a test, hope it confirms" pattern, which works well for simple bugs and fails silently for hard ones.

This document is self-contained. A reader unfamiliar with the case study can use it as a procedure for their own project. The example uses Linux, Rust, and Tauri/WebKitGTK as concrete tooling; the methodology is tool- and stack-agnostic. A substitution table appears at the end.



0. When to use this methodology

This methodology is for hard bugs in complex software, characterized by at least one of:

  • Manifests over long time scales (hours, days) or only under specific configurations, making fast reproduction impossible.
  • Spans multiple processes, threads, or runtime boundaries; logs from any single component don't tell the full story.
  • The reporter (or you) has a hypothesis but cannot confirm or refute it with the tools currently in hand.
  • The system has accumulated enough emergent complexity (multiple subsystems, multiple runtimes, IPC layers, external dependencies) that no individual maintainer holds the whole behavior in their head.
  • The fix, once landed, will become a load-bearing claim in a public artifact (PR description, blog post, post-mortem) and therefore needs to be defensible under scrutiny.

If your bug is "the test fails because of a typo," this methodology is overkill. If your bug is "users intermittently report OOMs after long sessions and we can't reproduce locally," it is the right shape.

Total effort: the worked example below took roughly 2-3 days of focused work, including ~6 hours of CPU time across captures. Plan for a comparable budget when investigating a similarly complex bug. The investment pays back in (a) a defensible fix and (b) instrumentation infrastructure that's reusable for the next hard bug.

Prerequisites:

  • Access to a development workstation that can run the target software under instrumentation. (Cloud / production access is rarely sufficient — instrumentation overhead and intrusive observation typically rule it out.)
  • A version-controlled checkout of the project's source.
  • Permission to install instrumentation tooling locally (you will set up reversible install scripts in Phase 2, so this is short-term).
  • The bug report or symptom set is concrete: at minimum, observed behavior, measured numbers if available, and the affected configuration.

1. Phase 1 — Understand the project before instrumenting

Why this matters

The instinct to "just attach a profiler" skips the foundation that makes profiler output interpretable. A hot function in your profile means nothing if you don't know what subsystem it belongs to, what it's supposed to do, or how it relates to the bug's surface.

Worse: if you treat the project as a black box and start measuring, you will measure the wrong things. You won't know which thread is which. You won't know which IPC channel carries the event you care about. You won't know whether a path that's not firing in your capture is broken (your trigger missed) or correct (the path is conditional and the condition wasn't met).

Mechanics

Read, in order:

  1. The project's top-level documentation — README, AGENTS.md or CONTRIBUTING.md, BUILD.md or equivalent. These tell you what the project IS, the dependency stack, and the conventions.
  2. The architecture / overview document if one exists, or reconstruct it from the code organization. Map:
    • The major subsystems (modules, services, layers)
    • The threading model (which subsystem runs in which threads)
    • The process model (which subsystems are in which OS processes)
    • The IPC / communication layer (how subsystems talk)
    • The lifecycle (startup, steady state, shutdown)
  3. The relevant source code for the area the bug touches. Read it, don't just grep. Comments and structure matter.
  4. The bug report itself, twice. First pass for the symptom set; second pass after you have the architectural model, to see what the reporter's observations imply about which subsystems are involved.

For each major subsystem, you should be able to answer:

  • What threads does it run in?
  • What state does it own (in-memory, on-disk, in shared memory)?
  • What other subsystems read or write its state?
  • What events does it produce? Consume?
  • What lifecycle hooks does it have (init, shutdown, error recovery)?

Output

A working mental model — or, better, a written one — that contains:

  • A subsystem diagram or list, even if rough
  • A thread/process map
  • A note on the IPC mechanisms in use
  • A list of the subsystems most likely involved in the bug, with reasoning

Failure modes to avoid

  • Skipping straight to instrumentation. You will measure the wrong things and waste hours debugging measurement artifacts. Resist.
  • Reading only the file the bug mentions. Bugs span boundaries; the fix often lives several modules away from the symptom.
  • Trusting comments more than code. Comments can rot. Validate by reading the call sites.
  • Assuming the architecture document is current. It might be a year out of date. Cross-check with the code.

Worked example: Handy

The Handy project is a desktop speech-to-text application built with Tauri 2 (Rust backend, React/TypeScript frontend served by an embedded WebKit/WebView2 webview, with native shell binding via wry). The AGENTS.md in the repository describes the architecture in roughly two pages: a Rust process with four "manager" modules (audio, model, transcription, history), a global-shortcut listener thread, an audio-callback thread driven by the cross-platform audio library cpal, and a Tauri IPC bridge to one or more WebKit subprocesses on Linux (one per Tauri window).

We used an exploration agent to walk the source and produce a more detailed architectural report than the docs offered, but the bones were in AGENTS.md. The key facts that ended up mattering for the bug:

  • On Linux, every Tauri window is hosted in its own WebKitWebProcess subprocess (separate OS process from the Rust main).
  • The recording_overlay window is created at boot regardless of whether the user has the overlay enabled — it's just hidden by default.
  • An audio-callback closure inside cpal::build_stream<T> emits mic-level events at the audio-frame rate; this closure body is not visible to function-level tracers because closures don't have stable symbols.
  • Tauri's emit method on a WebviewWindow is a global broadcast in Tauri 2, not a window-targeted send.

The architectural model was load-bearing for everything that followed.


2. Phase 2 — Build reversible tooling

Why this matters

You are about to install dependencies, modify source files, spawn background processes, and create state on the user's system (or yours) that the system did not have before. Some of that state can survive your workflow's exit. Some can survive a reboot. Some, if left behind, can subtly damage other software the user runs.

The cost of "I'll clean it up later" is silent contamination: a dependency you forgot about gets shadowed by a future install; a process you forgot about consumes resources for weeks; a configuration you forgot about pollutes the next measurement and you reach a wrong conclusion based on dirty data.

Reversibility is therefore not a nice-to-have. It is a prerequisite.

The snapshot-and-diff pattern

The foundational pattern: before you change the system, snapshot its state. To uninstall, snapshot again, diff, and remove only the delta.

This is simple in concept and the implementation is small. The script that takes a snapshot captures everything you might modify:

  • Installed packages (apt-mark showmanual, dpkg-query -W -f=...)
  • Tool directories you might create (~/.cargo, ~/.bun, ~/.uv, ~/.local/bin, etc. — list with EXISTS / ABSENT markers)
  • Hashes of files you might edit (shell rc files, source files you may patch, configuration)
  • Lists of process-level state you might leave behind (loaded audio modules, virtual displays, background daemons)
  • A machine identifier (/etc/machine-id on Linux) so you can detect if a snapshot was taken on a different host than where you're uninstalling — a real failure mode if snapshots get copied with the project tree.

The uninstall script reads the pre-install snapshot, takes a current snapshot, computes the delta, and reverses ONLY the items in the delta. The diff is the source of truth; you never hard-code "things to remove" because that grows stale.
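A minimal sketch of the diff step, assuming each snapshot has already been parsed into sets. The `Snapshot` class, its field names, and the sample values are hypothetical; a real snapshot would cover more categories (file hashes, loaded modules, processes).

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical in-memory form of what a snapshot script writes to disk."""
    machine_id: str
    packages: frozenset = field(default_factory=frozenset)
    tool_dirs: frozenset = field(default_factory=frozenset)  # dirs marked EXISTS


def compute_delta(pre: Snapshot, now: Snapshot) -> dict:
    """Return only what was added since the pre-install snapshot."""
    if pre.machine_id != now.machine_id:
        # A snapshot copied from another host must never drive an uninstall.
        raise ValueError("snapshot was taken on a different machine")
    return {
        "packages": sorted(now.packages - pre.packages),
        "tool_dirs": sorted(now.tool_dirs - pre.tool_dirs),
    }


# Illustrative values only.
pre = Snapshot("abc123", frozenset({"git", "curl"}), frozenset({"~/.local/bin"}))
now = Snapshot("abc123", frozenset({"git", "curl", "uftrace"}),
               frozenset({"~/.local/bin", "~/.cargo"}))
delta = compute_delta(pre, now)
print(delta)
```

Because the removal list is derived from the two snapshots, anything that existed before the install (here `git`, `curl`, `~/.local/bin`) can never end up in it.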

The load-bearing safety property

For every reversible-tooling setup, identify the one piece of state whose silent persistence would cause the worst harm. Name it explicitly. Build extra defense around it.

Ranking failures by cost lets you prioritize. Disk-space leaks hurt; runtime process leaks hurt more (they keep consuming resources); configuration leaks hurt most when they silently break OTHER software.

Example: in our case, the instrumentation procedure loaded a module-null-sink into PipeWire so we could inject deterministic test audio. If we left it loaded, the system's audio routing manager (WirePlumber) would cache the routing for our test sink. The next time the user launched their installed copy of the application, WirePlumber would try to route its microphone to a sink that no longer exists, and voice-to-text would silently fail with no error message.

That's a category of harm — silently breaking unrelated software — that matters more than any other failure mode in this workflow. We named it explicitly in script comments, gave the cleanup script three layers of defense for that one resource, and made the verify-clean script flag a hard FAIL on any residual.

Layers of defense

For load-bearing state, use three defenses:

  1. Trap on EXIT in the run script. The first line of every script that creates ephemeral state: trap "cleanup '$RUN_DIR'" EXIT INT TERM. It fires on normal exit, script errors, and catchable signals such as a manual interrupt. It does not fire on SIGKILL, which is what the second layer is for.
  2. A standalone cleanup script invocable manually. When the trap fails (e.g., the process is killed in a way that bypasses traps, like kill -9), the user can run the cleanup script by hand. It reads state files written by the run script to know exactly what to reverse. It is idempotent: safe to run repeatedly.
  3. A final-gate verify-clean script. Re-snapshots the system, diffs against the pre-install snapshot, and FAILs if anything load-bearing remains.

For non-load-bearing state, two defenses are usually enough (trap + manual cleanup). The verify-clean script also covers it but doesn't need to FAIL on it — it can report as INFO.

Per-run unique resource names

Where the tooling creates state with names (a PA module, a network port, a temp directory, a Docker container), use per-run-unique names that include a timestamp or random suffix. This:

  • Lets cleanup scripts identify their own resources unambiguously
  • Prevents collisions with concurrent or prior runs
  • Makes leaks visible (anything matching the pattern that you didn't create is an old leak)
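One way to get all three properties is to generate names from a fixed prefix plus a timestamp and a random suffix, then match on that pattern later. This is a sketch only; the prefix `instr-sink` and both helper names are hypothetical.

```python
import re
import secrets
import time

PREFIX = "instr-sink"  # hypothetical resource-name prefix


def make_run_name(now=None) -> str:
    """Build a name such as instr-sink-20260528T193100-a1b2c3 (UTC timestamp)."""
    stamp = time.strftime("%Y%m%dT%H%M%S", time.gmtime(now))
    return f"{PREFIX}-{stamp}-{secrets.token_hex(3)}"


NAME_RE = re.compile(rf"^{re.escape(PREFIX)}-\d{{8}}T\d{{6}}-[0-9a-f]{{6}}$")


def find_leaks(existing_names, own_name):
    """Anything matching the pattern that this run did not create is an old leak."""
    return [n for n in existing_names if NAME_RE.match(n) and n != own_name]
```

A cleanup script would call `find_leaks` with the names currently present on the system and the name recorded in the run's state file; unrelated resources never match the pattern and are left alone.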

Verify-clean as a binary gate

The verify-clean script is a yes/no: "is the system back to pre-install state for everything load-bearing?" If yes, exit 0. If no, list what's different, exit non-zero.

Do not soften the failure mode. A "warning" verify-clean is a passing verify-clean that lies.
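The gate logic can be sketched as a function that maps a snapshot delta to an exit code. The category names in `LOAD_BEARING` are hypothetical placeholders; the point is that a load-bearing residual can only produce a non-zero result, never a warning.

```python
LOAD_BEARING = {"pa_modules", "background_processes"}  # hypothetical category names


def verify_clean(delta: dict) -> int:
    """Return a process exit code: 0 only if nothing load-bearing remains."""
    failed = False
    for category, residual in sorted(delta.items()):
        if not residual:
            print(f"OK    {category}")
        elif category in LOAD_BEARING:
            print(f"FAIL  {category}: {residual}")
            failed = True
        else:
            # Non-load-bearing leftovers are reported but do not fail the gate.
            print(f"INFO  {category}: {residual}")
    return 1 if failed else 0
```

A script entry point would pass the result straight to `sys.exit(...)` so that callers and CI see the failure.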

Output

A small directory of scripts (4-6 files, ~500-1500 lines of bash) that collectively provide:

  • snapshot.sh <dir> — captures system state
  • install.sh — installs dependencies, takes pre-install snapshot first
  • cleanup-run.sh [<run-dir>] — reverses per-run state, idempotent
  • uninstall.sh — diff-based, removes only what was added
  • verify-clean.sh — final gate, FAILs on load-bearing residuals
  • activate.sh — adds the tooling to the current shell's PATH without modifying any persistent shell rc files

Failure modes to avoid

  • Modifying shell configuration files (.zshrc, .bashrc). Shell rc files are often symlinks into a user's version-controlled dotfiles repo. Letting an installer append source X lines dirties versioned config. Solution: install with --no-modify-path flags, provide an activate.sh that sets PATH for the current shell only.
  • Hard-coding the list of things to remove in uninstall. It grows stale. Always compute from a snapshot diff.
  • Skipping the load-bearing audit. If you don't know what the worst leak is, you can't defend against it. Pause and identify it before you write any script.
  • Carrying snapshots across machines. A snapshot is valid only on the host where it was taken. The machine-id check prevents silent reuse.

Worked example: Handy

In our case, we built a small env/ directory with:

  • snapshot.sh — captured apt manual+all packages, tool dirs (cargo/rustup/bun/uv), shell config hashes, dotfiles git status, source file SHAs, PA modules, Xvfb processes (validated via /proc/<pid>/exe to avoid false-positive matches on shells whose argv mentions "Xvfb"), /etc/machine-id.
  • install.sh with a --instr flag that adds the instrumentation tooling on top of the base build deps.
  • cleanup-run.sh — six numbered steps including the three-layer defense for PA modules: primary unload by recorded module ID, fallback scan-and-unload by name pattern, final assertion that no matching modules remain.
  • verify-clean.sh — explicit OK/FAIL messaging per category, INFO on artifacts.

The load-bearing safety property — the WirePlumber stream-restore cache hazard — was documented in the cleanup script's header comment and in the env README. Three layers of defense around PA module unload were built deliberately for it.

The reversible-tooling investment took about 4 hours of focused work. It saved more than that in the first instrumented run, when a script crash mid-capture would have left state behind if not for the trap.


3. Phase 3 — Build a complete execution-flow map of the system

Why a map first, not hypotheses first

The traditional "hypothesize then test" debugging pattern works for simple bugs where you can guess the cause. For hard bugs, your guess is probably wrong, and proving it wrong costs as much as proving the right cause right. You end up debugging your wrong guess.

The map flips this. You build, once, a complete picture of every code path that fires in the system's lifecycle. Then bug investigation becomes: "which of these paths is the wrong one for the symptoms I'm seeing?" That's a constraint-satisfaction problem against a finite search space, not a creative-guessing exercise.

You also get a side benefit: the map is reusable across bugs. The investment amortizes.

Multi-tool composition: no single tool sees everything

A complete map of a modern application's execution requires multiple observation tools. Each sees a different slice. No single tool covers the whole surface.

You will use, at minimum, three categories of tool, plus a fourth (source-level counters) for paths the first three cannot reach:

Function-level tracing — every function entry and exit in your language's runtime. This is the spine of the map.

  • Linux Rust/C/C++: uftrace with mcount instrumentation.
    • For Rust, this requires nightly toolchain and RUSTFLAGS="-Z instrument-mcount". On stable, the flag doesn't exist.
    • For C/C++, compile with -pg or -finstrument-functions.
  • macOS: dtrace, Instruments (slower but works for shipped binaries).
  • JVM: async-profiler, JFR.
  • Go: runtime/pprof, gops.
  • Python: cProfile, py-spy, viztracer.

The function tracer gives you:

  • Call counts per function over the capture window
  • Caller→callee edges (the call graph)
  • Total time and self time per function
  • Cross-thread timing

What it cannot see:

  • Closure bodies that don't have their own symbol
  • Functions inlined away by the optimizer
  • C/C++ internals of dynamically-linked libraries (unless you explicitly opt in to library calls, which floods the trace)

Syscall and IPC observation — everything that crosses the kernel boundary. The function tracer sees in-process activity; the syscall tracer sees inter-process activity.

  • Linux: strace -f -yy -s 16384.
    • -f follows forks. Without it, you miss child processes spawned by the target.
    • -yy annotates file descriptors with their type and peer. Critical for identifying which socket carries which IPC channel.
    • -s 16384 sets the per-syscall string-capture limit. The default is so small (32 bytes) that IPC payloads are truncated past recognition.
  • macOS: dtruss.
  • Windows: procmon.

The syscall tracer gives you:

  • Every execve (every subprocess invocation)
  • Every sendmsg/recvmsg on Unix sockets (the most common IPC primitive for Linux desktop applications)
  • Every file open, read, write
  • Every mmap

What it cannot see:

  • Pre-syscall in-process work
  • Userspace-only computation

UI / runtime observation for embedded webviews. If the application embeds a web runtime (WebView2 on Windows, WebKitGTK on Linux, WKWebView on macOS), the function tracer doesn't see what happens inside the web runtime. You need the web runtime's remote inspector.

  • WebKitGTK: WEBKIT_INSPECTOR_HTTP_SERVER=host:port (NOT WEBKIT_INSPECTOR_SERVER — the latter is a different protocol with a binary handshake that doesn't accept HTTP).
  • Chromium / WebView2: --remote-debugging-port=....
  • WKWebView: Safari's Develop menu.

The inspector gives you:

  • Timeline events (rendering frames, JS execution, GC)
  • ScriptProfiler samples (which JS functions consumed time)
  • Heap domain (snapshots, GC events, allocation tracking)
  • Console messages

What it cannot see:

  • Native code inside the web runtime process (DOM/style/layout in C++)
  • Memory below the JS heap (the native side of the runtime)

Source-level counters — for paths none of the above can see (closures without symbols, inlined helpers, paths where you need a specific event count and the surrounding function fires for many other reasons).

  • Add a static AtomicU64 for each path you want to count.
  • Bump it from inside the path.
  • Print periodically with timestamps from a background thread.
  • Gate behind a feature flag so shipping builds don't pay the cost.

Source counters are the most intrusive of the four tools. Use them sparingly and only when the other three provably can't reach the path.

Static enumeration: the coverage map

Before you capture anything, walk the source tree and build a coverage map — a CSV (or equivalent) listing every code path you expect to fire during the bug's lifecycle. Columns:

  • Function name (fully qualified)
  • File and line
  • Lifecycle phase(s) it fires in
  • Which observation tool can see it
  • Notes (especially: "closure body, only counter can see")

Why bother with a static map when you're about to capture? Because a capture-only methodology has a fatal blind spot: paths that didn't fire in your one capture are invisible in your results. You cannot distinguish:

  • "This path is correct and the condition that triggers it wasn't met in this capture" (e.g., feature flag off, alternative branch taken)
  • "This path should have fired but didn't because the bug or your trigger broke it"

The map makes this distinction visible. You compare your capture against the map; missing paths are accounted for one way or the other.

You generate the map by:

  • Walking every source file and listing every function definition with its file/line
  • Running grep for cross-cutting patterns: event emissions, event listeners, IPC commands, thread spawns, subprocess invocations, timer registrations, async tasks, custom URI scheme handlers
  • Assigning each path to one or more lifecycle phases (boot, idle, trigger, steady-state action, finalization, etc.)
  • Marking each path with the tool(s) that could observe it

The map's size depends on the project. For a desktop app of moderate complexity, expect 500-2000 rows. For a microservice, 100-500. For a library, 50-200.
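The first step, listing function definitions with file and line, can be sketched with a regular expression. This is a simplified first pass for Rust source under stated assumptions: it records bare names rather than fully qualified ones, and it misses macros, closures, and multi-line signatures. The sample source and the name `start_recording` are made up for illustration.

```python
import csv
import io
import re

# Simplified pattern for Rust function definitions (an assumption, not a parser).
FN_RE = re.compile(
    r"^\s*(?:pub(?:\([^)]*\))?\s+)?(?:const\s+)?(?:async\s+)?(?:unsafe\s+)?fn\s+(\w+)"
)
FIELDS = ["function", "file", "line", "phases", "tool", "notes"]


def enumerate_functions(path: str, source: str):
    """Yield one coverage-map row per function definition found in source."""
    for lineno, line in enumerate(source.splitlines(), start=1):
        match = FN_RE.match(line)
        if match:
            yield {"function": match.group(1), "file": path, "line": lineno,
                   "phases": "", "tool": "", "notes": ""}


def to_csv(rows) -> str:
    """Serialize rows with the coverage-map columns; phases/tool/notes start empty."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


SAMPLE = """\
pub fn start_recording(app: &AppHandle) {
    let cb = move |data: &[f32]| { emit_levels(data) };
}
async fn load_model() {}
"""
rows = list(enumerate_functions("src/example.rs", SAMPLE))
print(to_csv(rows))
```

Note that the closure on the second line of the sample produces no row. That is exactly the kind of path that has to be added by hand with a "closure body, only counter can see" note.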

Synchronized capture

Run all your observation tools simultaneously, against one execution of the system, through a real workflow. The capture must produce:

  • One raw artifact per tool (uftrace.data, strace.log, inspector JSON, counter log, application stdout)
  • A wall-clock-anchored phase timeline — a file where each significant boundary in the workflow is logged with a Unix timestamp. This is what lets you cross-correlate the tools.
  • Process-level state (PIDs, FDs) captured at known points, so the syscall tracer's per-PID lines can be mapped to subsystems.
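A sketch of the phase timeline, assuming the capture harness can call a helper at each workflow boundary. The class and method names are hypothetical. `phase_at` is what lets a timestamped record from any tool be assigned to a phase afterwards.

```python
import bisect
import time


class PhaseTimeline:
    """Wall-clock-anchored list of (unix_timestamp, phase_name) boundaries."""

    def __init__(self):
        self.boundaries = []

    def mark(self, phase: str, now=None) -> None:
        """Record that `phase` begins at `now` (defaults to the current time)."""
        ts = time.time() if now is None else now
        if self.boundaries and ts < self.boundaries[-1][0]:
            raise ValueError("phase timestamps must not go backwards")
        self.boundaries.append((ts, phase))

    def phase_at(self, ts: float):
        """Return the phase active at ts, or None if ts precedes the first mark."""
        times = [t for t, _ in self.boundaries]
        idx = bisect.bisect_right(times, ts) - 1
        return self.boundaries[idx][1] if idx >= 0 else None


timeline = PhaseTimeline()
timeline.mark("boot", now=1000.0)       # illustrative timestamps
timeline.mark("idle", now=1005.0)
timeline.mark("recording", now=1020.0)
print(timeline.phase_at(1012.3))
```

In a real harness the marks would be written to a file as they happen, so the timeline survives a crash of the harness itself.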

The launch wrapping matters. If you nest the tools wrong (e.g., the function tracer outside the syscall tracer instead of inside), the outer tool will inspect the inner tool's binary and fail, never reaching the target. The general rule is: the most-intrusive tool should be innermost so it sees the target's actual exec.

For input injection, if needed (in the worked example, audio), replay a known input to drive the system deterministically: pre-stage the input data before triggering the workflow, then trigger via a method that does not spawn a second instance of the process under test. On a system with single-instance enforcement, a CLI flag that hands off to the running instance launches a second process that then exits, polluting the trace. Prefer signal-based or in-process triggers.

Validation contracts

Every captured artifact has a quantitative validation contract — a quick check that confirms the artifact is non-empty, well-formed, and content-rich enough to be useful.

For the function tracer: size threshold (megabytes per minute of capture), minimum distinct symbol count, non-zero call count for specific known functions.

For the syscall tracer: size threshold, presence of specific syscall names, presence of literal strings from known event names.

For the inspector: minimum record count, presence of specific event-method names (e.g., at least one Timeline.eventRecorded, at least one Heap.garbageCollected).

For the application log: presence of a known boot-complete marker.

For the phase timeline: monotonically increasing timestamps, expected phase boundary count.

A "PASS" without a measured property is not really a PASS. If you record a phase as PASS because "no errors were observed," you have recorded zero observations as success — which they aren't. The validation contract forces you to name what success looks like quantitatively.
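The contracts above can be sketched as small checks that each return a verdict together with the measured property, so a PASS always carries a number or a list. The function names and thresholds are hypothetical; the event names in the usage example are the ones mentioned for the inspector contract.

```python
def check_min_size(data: bytes, min_bytes: int):
    """Artifact must be at least min_bytes long."""
    ok = len(data) >= min_bytes
    return ok, f"size={len(data)} bytes (need >= {min_bytes})"


def check_contains(text: str, required):
    """Artifact must mention every required literal string."""
    missing = [s for s in required if s not in text]
    return not missing, f"missing={missing}"


def check_monotonic(timestamps, expected_count: int):
    """Timestamps must be non-decreasing and match the expected boundary count."""
    ordered = all(a <= b for a, b in zip(timestamps, timestamps[1:]))
    ok = ordered and len(timestamps) == expected_count
    return ok, f"count={len(timestamps)} ordered={ordered}"


def run_contract(name: str, checks) -> bool:
    """PASS only if every check returns a measured, satisfied property."""
    results = [check() for check in checks]
    for ok, measured in results:
        print(f"{name}: {'PASS' if ok else 'FAIL'} {measured}")
    return all(ok for ok, _ in results)


inspector_log = "Timeline.eventRecorded ... Heap.garbageCollected ..."  # stand-in text
inspector_ok = run_contract("inspector", [
    lambda: check_contains(inspector_log,
                           ["Timeline.eventRecorded", "Heap.garbageCollected"]),
])
```

An empty artifact fails `check_min_size` rather than passing by default, which is the behavior the "no errors observed" shortcut gets wrong.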

Gap audits

Document what your capture composition CANNOT capture. This is not weakness; it is honesty. Categories typically include:

  • C/C++ internals of dynamically-linked libraries (need different tools: heaptrack, perf with frame pointers, or a debug build of the shared library)
  • Web-runtime native code (the C++ side of the embedded web runtime)
  • Kernel below the syscall boundary (need eBPF / bpftrace)
  • Allocation patterns inside the JS engine's heap profiler that the inspector doesn't surface in standard responses
  • Sub-millisecond work that misses the sampling profiler's tick

A gap audit serves two purposes:

  • Future-you knows what to add next if a follow-up question requires it
  • Reviewers of your work know what your claims can and cannot support

Synthesis

After capture, post-process the raw artifacts into normalized form:

  • execution-trace.csv — every observation joined into one table (tool, phase, function/event/syscall name, count, total time, etc.)
  • call-graph.csv — caller→callee edges with counts (from the function tracer's chrome-trace dump)
  • coverage.csv — every row of the coverage map annotated with fired/not-fired and a reason
  • A sequence diagram (Mermaid is convenient) showing the high-level flow across actor lanes

These derived artifacts are what you actually use during bug investigation. The raw artifacts are the audit trail; the synthesis is the working dataset.
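As an illustration of the coverage.csv step, the sketch below annotates each coverage-map row with fired/not-fired and a reason. It assumes each row names a single observing tool and that call counts have already been extracted from the capture; the function names in the example are placeholders.

```python
def annotate_coverage(coverage_rows, fired_counts, active_tools):
    """Mark each coverage-map row fired / not-fired and attach a reason."""
    annotated = []
    for row in coverage_rows:
        count = fired_counts.get(row["function"], 0)
        if count > 0:
            status, reason = "fired", f"{count} calls observed"
        elif row["tool"] not in active_tools:
            # The path could not have been seen by this capture at all.
            status, reason = "not-fired", f"not observable: needs {row['tool']}"
        else:
            # Observable but absent: conditional path, missed trigger, or the bug.
            status, reason = "not-fired", "observable but absent: investigate"
        annotated.append({**row, "status": status, "reason": reason})
    return annotated


coverage = [
    {"function": "f_a", "tool": "uftrace"},   # placeholder names
    {"function": "f_b", "tool": "counter"},
    {"function": "f_c", "tool": "uftrace"},
]
annotated = annotate_coverage(coverage, {"f_a": 3}, active_tools={"uftrace"})
for row in annotated:
    print(row["function"], row["status"], row["reason"])
```

The "observable but absent" rows still need a human decision; the code only guarantees that none of them goes unaccounted for.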

Output

After Phase 3 you have:

  • A pre-validated procedure (something like an INSTRUMENTATION.md)
  • A coverage map
  • One synchronized capture's worth of raw artifacts
  • Synthesized CSVs and a diagram
  • A gap audit
  • A status summary showing which phases PASSed with measured evidence

Failure modes to avoid

  • Skipping the coverage map. Without it, your capture has invisible blind spots.
  • Using the wrong tool composition. Function tracer outside the syscall tracer: function tracer inspects the wrong binary and fails. Inspector connected via the wrong env var (WEBKIT_INSPECTOR_SERVER instead of WEBKIT_INSPECTOR_HTTP_SERVER, on WebKitGTK): silent protocol mismatch, no records collected.
  • Trusting bun run tauri dev (or equivalent dev-server mode) as representative of the production build. Dev servers (Vite, Webpack, etc.) inject runtime machinery — HMR clients, source-map references, dev-mode fetch interceptors — that don't ship to users. Your dev capture is not your shipping capture. Build the production bundle and capture against that.
  • Trusting a "completed" status without verifying activity. A long-running capture may report an intermediate status that is mistaken for final completion. Before invoking any destructive cleanup, verify the run is actually finished by checking observable artifacts (file mtimes, process tree, log files that are still growing).
  • Truncated string capture. The syscall tracer's default string-capture length is too small to see IPC payloads. Set -s 16384 (or larger).
  • Filtering by PID at capture time. If you strace -p $PID, you miss forks before that PID was discovered. Launch the target under strace from the start.

Worked example: Handy

Our project came with a pre-validated procedure document (INSTRUMENTATION.md, 1685 lines) that specified the four-tool composition exactly: uftrace for Rust function tracing, strace for syscall + IPC, WebKit Remote Inspector for the JS-side activity in each of the two WebKit subprocesses (one per Tauri window), and source-level counters for the cpal audio-callback closure body (invisible to uftrace because it has no symbol).

We executed Phase 3 across multiple runs:

  • Run 1: First successful synchronized capture, three 10-second recording cycles. Coverage map ~1080 rows. Caught three procedural bugs in the documented procedure during execution (-C instrument-mcount is wrong on stable Rust — needs -Z on nightly; the mcount-symbol verification regex was too strict; a grep -q under set -uo pipefail triggers SIGPIPE false-fatals). Documented all three in the procedure for future runs.
  • Run 2: Refined harness with the procedural fixes and a corrected WebKit Inspector target-name filter (WebKit exposes the human-readable window title, not the Tauri internal window label). Reclassified nine misclassified paths in the coverage map. Recording phase coverage rose from 73.5% to 100%.
  • Run 3: Discovered the build was loading JS from Vite (dev server) instead of the bundled dist/. The procedure warns about this pitfall, but the agent's build path triggered it anyway. Root cause: a missing custom-protocol Cargo feature on the Tauri dependency. With that fixed, the binary embedded the production JS and the captures were valid for production-behavior analysis.

Each run produced a complete set of synthesis artifacts. Across them we accumulated: ~134 MB of function trace data, ~73 MB of syscall trace, ~33 MB of WebKit Inspector records, ~50 KB of RSS-growth samples, a Mermaid execution-flow diagram cross-validated against the data, and a gap audit listing six categories of observation the composition couldn't reach.

The coverage map and synthesis artifacts became the search space for bug investigation. Without them, what followed would have been guesswork.


4. Phase 4 — Frame the bug from the data

Why this matters

You have the bug report. You have the execution-flow map. You are tempted to start with the reporter's hypothesis. Resist.

The reporter's hypothesis is, by definition, a claim about the system made WITHOUT access to the tools you just spent Phase 3 building. It may be right. It is almost as likely to be wrong. Anchoring on it biases your search toward confirming it rather than identifying the actual cause. This is a recurring failure mode in debugging work and has its own name: inherited frames. An inherited frame is any claim about the system that arrived in your context from somewhere else — a bug report, a paper, a prior session, your own past self ten minutes ago. Until tested, it is a hypothesis. Treat it as such.

This is doubly true when YOU are the reporter. Careful reporters hedge ("I don't know that this is the cause, but..."). The hedge is authoritative. Honor it.

Mechanics

Read the bug report twice. First pass: what is observed. Second pass: what does the reporter think it means. Separate these two categories rigorously.

Symptoms are facts. Things that have been measured or observed directly. Examples:

  • "Memory grew from 249 MB at 09:15 to 1,230 MB at 11:07."
  • "Growth is in Private_Dirty anonymous memory per smaps_rollup."
  • "OOM killer terminated the process at ~9.5 GB."

Hypotheses are claims. Things the reporter or you have inferred about the cause. Examples:

  • "The high-frequency emit_levels function is the cause."
  • "JavaScript-side event listener accumulation is the mechanism."
  • "Reducing the event rate would fix it."

Symptoms and conditions go into the constraint set. Hypotheses go into the candidate set, treated as one option among many.

Build the constraint set

For each symptom or condition, identify the dimension it constrains and the value it pins:

  • Process of residence — where (in which OS process) the symptom manifests
  • Memory class — kind of allocation that grows (heap? mmap? shared memory? specific region per smaps_rollup?)
  • Temporal correlation — what events correlate with the growth (specific actions, specific time-of-day, idle vs. active?)
  • Idle behavior — does the symptom decrement, hold flat, or continue during periods of inactivity?
  • Scale — quantitative rate of the symptom
  • Configuration — what settings, OS, hardware are required to reproduce
  • Cross-system signal — has this been reported on other platforms? In other versions? In other applications using the same dependency?

Each constraint is a filter you can apply against the execution-flow map.

Build the candidate set from the map

Walk the execution surface. For each code path in the map, ask of each constraint:

  1. Is this path's firing pattern consistent with the constraint?
  2. Could this path's mechanism cause the symptom?

Paths that survive all the filters form the candidate set. Paths that fail any filter are eliminated. The reporter's hypothesis enters this list like any other candidate.

A useful sanity check: at the end of this walk, you should be able to say of each ELIMINATED candidate why it was eliminated (which constraint failed). And of each SURVIVING candidate, which constraints it satisfies and what data would distinguish it from the others.
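The walk can be sketched as predicate filtering that records which constraint eliminated each path. Everything in the example is hypothetical: the path names, the two attributes, and the constraint values stand in for rows of a real coverage map and a real constraint set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodePath:
    name: str
    affects_process: str      # OS process whose state this path can change
    fires_when_idle: bool


# Each constraint is (label, predicate); the predicate is True when the
# path is consistent with the observed symptom.
CONSTRAINTS = [
    ("process of residence", lambda p: p.affects_process == "webview"),
    ("idle behavior: holds flat", lambda p: not p.fires_when_idle),
]


def filter_candidates(paths, constraints):
    """Return (survivors, eliminated) where eliminated maps name -> failed labels."""
    survivors, eliminated = [], {}
    for path in paths:
        failed = [label for label, ok in constraints if not ok(path)]
        if failed:
            eliminated[path.name] = failed   # record WHY it was eliminated
        else:
            survivors.append(path.name)
    return survivors, eliminated


paths = [
    CodePath("event_fanout", "webview", fires_when_idle=False),
    CodePath("idle_timer", "webview", fires_when_idle=True),
    CodePath("backend_cache", "backend", fires_when_idle=False),
]
survivors, eliminated = filter_candidates(paths, CONSTRAINTS)
print(survivors, eliminated)
```

Note that the first attribute is the process a path affects, not the one it runs in: a path executing in one process can drive growth in another across IPC, and filtering on where the code runs would eliminate it wrongly.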

Rank the candidates

The candidate set is unordered until you have a sense of which paths best match the data. Rough ranking is fine. The goal is to know which candidate the cheapest next measurement would distinguish.

Output

After Phase 4 you have:

  • A constraint-set.md listing every constraint pulled from the bug report with its value
  • A candidates.md listing surviving candidates with:
    • Why each survives (which constraints it satisfies)
    • What measurement would confirm or refute it
    • A rough rank

These two documents are the working basis for Phase 5.

Failure modes to avoid

  • Anchoring on the reporter's hypothesis. The most common failure. Reframe explicitly: "the reporter says X; the data says Y; X is one of N candidates, Y is the actual constraint set."
  • Conflating symptoms with hypotheses. "Memory grows during transcription" is a symptom. "emit_levels is the cause" is a hypothesis. Don't fold them.
  • Filtering candidates by the reporter's named mechanism. If the reporter says "JS heap accumulation," do NOT pre-filter to only JS-heap candidates. Test that hypothesis on its merits, alongside others.
  • Pre-deciding which fix to apply. The fix follows the cause; the cause follows the data. Don't decide on a fix before the measurements support it.

Worked example: Handy

The bug report (issue #1279) described:

Symptoms:

  • WebKitWebProcess memory accumulates during transcription on Linux
  • Private_Dirty anonymous memory specifically (per smaps_rollup)
  • ~6.8 to 12.8 MB per minute during active transcription
  • Holds flat during idle, never decrements
  • Reaches ~9.5 GB after ~36 hours, OOM kills
  • Pattern: staircase (grow then hold)
  • Config: overlay_position: none, always_on_microphone: false

Hypotheses (from the reporter, explicitly hedged):

  • emit_levels fan-out at ~94 Hz is fueling the leak
  • Double-emit pattern (one global emit + one window-scoped emit, both global in Tauri 2) doubles the effect
  • The overlay window is created at boot regardless of overlay_position, so the leak occurs even when the user has the overlay disabled
  • Three suggested fixes: throttle, single targeted emit, skip when overlay disabled

Constraint set (the filtering tool):

  1. Process of residence: WebKitWebProcess (eliminates everything Rust-side — audio toolkit, transcription, history, paste pipeline, model manager)
  2. Memory class: Private_Dirty anonymous (rules out file-backed accumulation, shared-memory growth)
  3. Temporal correlation: transcription activity, not wall time (rules out periodic background work)
  4. Idle behavior: holds (rules out short-lived allocations that GC would reclaim quickly)
  5. Scale: 6.8-12.8 MB/min during active transcription

Candidate set (8 candidates after applying constraints):

  1. JSC compiled-script cache from per-emit unique JS sources
  2. Tauri JS-side event registry / payload retention
  3. React reconciler / fiber accumulation
  4. WebKit CSS transition tracker
  5. wry IPC bridge state
  6. GLib GTask result accumulation (ruled out by reporter's standalone test)
  7. GTK widget pool / cached DOM render objects
  8. Console message buffer (only in dev mode, ruled out for production)

The reporter's hypothesis (emit_levels chain) was structurally consistent with candidates 1, 2, 3, 4, 7 (any of them could be driven by the emit_levels chain even if the leak itself was elsewhere in the dependency stack). We carried all 8 candidates forward without prioritizing the reporter's framing.


5. Phase 5 — Iteratively narrow with measurement

Why this matters

The candidate set after Phase 4 is too large to fix all at once and too small to ignore. You need a sequence of measurements that each either confirm a candidate, refute a candidate, or narrow the search. This is the bulk of investigation time.

The shape of the loop is invariant:

  1. Pick the cheapest measurement that distinguishes among current candidates
  2. Run it
  3. Update the candidate set with the result
  4. Stop when one candidate remains, OR when the cost of further narrowing exceeds the cost of validating multiple candidates simultaneously (Phase 6)

Cheap measurements first

In rough order from cheapest to most expensive:

  • Re-reading data you already have (free)
  • Running a different query against existing data (cheap)
  • Running a code grep or static read (cheap)
  • Adding a quick logging line and re-running a known-cheap capture
  • Running a new capture with the same instrumentation (medium)
  • Building an instrumented variant and capturing (expensive)
  • Running a long-duration capture (expensive)
  • Setting up a parallel measurement methodology (very expensive)

Always pick the cheapest measurement that will distinguish among current candidates. If you have data in hand that you haven't mined, mine it. Don't capture again until existing data is exhausted.
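One way to make "cheapest measurement that distinguishes" concrete is to write down, for each measurement, the outcome each live candidate predicts. A measurement for which every candidate predicts the same outcome cannot narrow anything, however cheap it is. The sketch below assumes hypothetical measurement names, costs, and predicted outcomes.

```python
from dataclasses import dataclass


@dataclass
class Measurement:
    name: str
    cost: int        # relative cost; the unit (here: minutes of effort) is an assumption
    predicts: dict   # candidate name -> outcome that candidate predicts


def distinguishes(m, candidates):
    """Informative only if the live candidates predict different outcomes."""
    outcomes = {m.predicts[c] for c in candidates if c in m.predicts}
    return len(outcomes) > 1


def cheapest_distinguishing(measurements, candidates):
    useful = [m for m in measurements if distinguishes(m, candidates)]
    return min(useful, key=lambda m: m.cost) if useful else None


candidates = ["js_heap_retention", "native_allocation"]
measurements = [
    # Free, but both candidates predict the same thing: it cannot separate them.
    Measurement("reread_rss_log", 0,
                {"js_heap_retention": "rss grows", "native_allocation": "rss grows"}),
    Measurement("heap_snapshot_diff", 30,
                {"js_heap_retention": "js heap grows", "native_allocation": "js heap flat"}),
    Measurement("instrumented_native_build", 240,
                {"js_heap_retention": "no native growth", "native_allocation": "native growth"}),
]

pick = cheapest_distinguishing(measurements, candidates)
```

Here the free measurement is skipped because it cannot separate the two candidates, and the 30-minute snapshot diff is chosen over the 240-minute build.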

Following the data when it contradicts your framing

Data is ground truth. Your framing is an inference about the data. When the two conflict, the data wins.

Patterns that signal "your framing is wrong":

  • You expected a quantity to be Q based on your hypothesis; the measurement says it's Q/10 or Q*10.
  • A class of objects you predicted would dominate turns out to be a small fraction of the observed activity.
  • A subsystem you thought was implicated turns out to be uninvolved.
  • A subsystem you thought was uninvolved turns out to be dominant.

When this happens:

  • Resist the urge to salvage your framing with epicycles
  • Acknowledge the new constraint
  • Re-rank the candidate set
  • Pick the next cheapest measurement

Distinguishing warmup from sustained state

A common failure: confusing one-time initialization growth with a sustained leak. The same process can:

  • Allocate X MB of state during startup (Whisper model load, Tokio worker pool, audio buffer pools, JIT compile caches) — front-loaded
  • Hold flat thereafter
  • Add Y MB/min of leaked state during steady-state activity — sustained

A measurement that averages across both windows reports (X + Y*T) / T, which depends on T and conflates the two phenomena.

The fix: explicitly separate warmup from sustained-phase analysis. After the system reaches steady state (typically a few cycles of representative work), compute growth rates from that point forward, not from boot. Run the workload long enough that the sustained-phase window dominates.
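A synthetic series shows how far apart the two numbers can be. The sketch below assumes a process that allocates X KB during a short warmup and leaks Y KB/min throughout; the values of X, Y, and the warmup length are invented for illustration. It uses a simple endpoint estimate; a least-squares fit would be more robust on noisy data.

```python
def full_run_rate(samples):
    """Average growth from the first to the last sample, in KB/min."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


def sustained_rate(samples, warmup_end_min):
    """Same estimate, restricted to samples at or after the warmup cutoff."""
    window = [(t, v) for t, v in samples if t >= warmup_end_min]
    return full_run_rate(window)


# Invented process: X KB allocated linearly over the first WARMUP minutes,
# plus a constant leak of Y KB/min for the whole run.
X, Y, WARMUP, T = 100_000, 500, 2, 12


def rss_kb(t_min):
    return X * min(t_min, WARMUP) / WARMUP + Y * t_min


samples = [(t, rss_kb(t)) for t in range(0, T + 1)]  # one sample per minute

naive = full_run_rate(samples)            # (X + Y*T) / T, about 8,833 KB/min
steady = sustained_rate(samples, WARMUP)  # Y = 500 KB/min
```

The naive figure is more than 17 times the real leak rate, and it shrinks toward Y only as T grows, which is why the window boundaries belong in the write-up next to the rate.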

Matching data type to claim type

Be careful about substituting one measurement for another:

  • "The JS heap is leaking" → measure the JS heap (inspector Heap.snapshot).
  • "Memory grows" → measure RSS over time (smaps_rollup).
  • "The CPU is busy" → measure CPU utilization, not request count.

Substituting a related-but-different measurement is a category error. When the related measurement appears to support your claim, double-check that the claim is actually about the related measurement, not the one you set out to test.

Output

A narrowing-log.md (or per-iteration narrowing notes) that documents:

  • Each measurement run
  • What it measured
  • What the result was
  • How the candidate set changed
  • What the next measurement would be

This is your working journal and your audit trail. When the eventual fix lands, this log is the evidence chain for the claim that the fix addresses the right cause.

Failure modes to avoid

  • Confirmation bias. Designing measurements that, regardless of outcome, can be read as supporting your favored hypothesis. Test for this by asking: "If hypothesis X is wrong, what would this measurement show?" If the answer is "the same as if it's right," pick a different measurement.
  • Conflating warmup with sustained behavior. Discussed above.
  • Single-tool myopia. Drawing strong conclusions from one tool's view when other tools could cross-check.
  • Stopping too early. Declaring one candidate confirmed because it passed one filter, when several remaining candidates also pass that filter.
  • Stopping too late. Continuing to narrow when you already have enough to validate. Every iteration has a cost.
  • Acting on a sub-agent's metric without translating it. A metric computed by a delegated subagent satisfies that agent's narrow brief, not necessarily the question driving your investigation. If the metric is "per-event accuracy" but the question is "user-visible quality over a session," restate the metric in user-meaningful terms before letting it shape conclusions.

Worked example: Handy

The narrowing took five iterative measurements; three of them are summarized here:

Measurement 1 (mine existing data): Cross-referenced the function tracer call counts, syscall counts, and inspector record counts against the candidate set. Confirmed:

  • emit_levels actual rate is ~24 Hz, not the reporter's predicted 94 Hz
  • Double-emit is real (1430 emit calls per 30s of recording = exactly 2× the emit_levels call count of 715)
  • Rust allocator is balanced (alloc within 2% of dealloc — no Rust-side leak)
  • Listeners::emit cumulative time is 296 ms / 30s = ~10 ms/sec — real work per emit

This didn't distinguish among the 8 candidates but did eliminate Rust-side leaks and confirmed the volume mathematics of the reporter's chain.

Measurement 2 (long-duration capture with RSS sampling, 50 cycles): Built a 30-minute capture with a /proc/<pid>/smaps_rollup sampler running every 5 seconds against the Handy main process and the two WebKit subprocesses.
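A sampler of this kind can be small. The sketch below is not the harness used in the investigation; the function names and the choice of columns are assumptions. Rss and Private_Dirty are real field names in /proc/<pid>/smaps_rollup on Linux.

```python
import re
import time
from pathlib import Path

# Matches lines such as "Private_Dirty:      1024 kB"; the address-range
# header line at the top of the file does not match and is skipped.
FIELD_RE = re.compile(r"^(\w+):\s+(\d+)\s+kB$")


def parse_smaps_rollup(text):
    """Return {field: kB} for every 'Name:  123 kB' line."""
    fields = {}
    for line in text.splitlines():
        m = FIELD_RE.match(line.strip())
        if m:
            fields[m.group(1)] = int(m.group(2))
    return fields


def sample(pid):
    """Read one sample for a live process (Linux only)."""
    text = Path(f"/proc/{pid}/smaps_rollup").read_text()
    return time.time(), parse_smaps_rollup(text)


def run_sampler(pids, interval_s=5.0, count=3):
    """Collect `count` rounds of samples, one row per PID per round."""
    rows = []
    for _ in range(count):
        for pid in pids:
            ts, fields = sample(pid)
            rows.append((ts, pid, fields.get("Rss"), fields.get("Private_Dirty")))
        time.sleep(interval_s)
    return rows
```

Recording the timestamp with each row, rather than assuming an exact 5-second cadence, keeps later rate computations honest when the sampler itself is delayed.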

The sub-agent that ran the analysis initially claimed that "the Rust main process is the dominant grower at 8.76 MB/min, contradicting the issue's framing." That claim was based on full-run averages that included warmup.

Independent recomputation against the sustained-phase window (cycle 10 to cycle 50) told a different story:

  • handy_main: 466 KB/min sustained (95% of its 146 MB total growth happened in cycles 1-10 — pure warmup)
  • webkit_2 (overlay): 3,017 KB/min sustained
  • webkit_1 (settings): flat (no leak)

True conclusion: the leak is in the recording_overlay WebKitWebProcess, sustained, matching the reporter's qualitative report. The Rust process growth is one-time initialization.

Several candidates pruned. Remaining: candidates that operate in the overlay WebKit subprocess.

Measurement 3 (Heap.snapshot diff at baseline, cycle 10, cycle 50): Added three heap snapshots to the harness. After 50 cycles, computed class-by-class growth.

Warmup-phase (baseline → cycle 10): FunctionCodeBlock grew +414 instances / +1.93 MB, dominating ~94% of total heap growth. This is JSC's compiled bytecode for JS functions — strong support for the candidate "JSC compiled-script cache growth from per-emit unique JS sources."

But: sustained-phase (cycle 10 → cycle 50): FunctionCodeBlock grew only +32 instances / +60 KB. Plateau, ~130× reduction in rate from warmup.

Total JS heap growth in sustained phase: ~437 KB across 40 cycles (11 KB/cycle, ~36 KB/min). The overlay process RSS grew at 3,017 KB/min in the same window.

JS heap accounts for 1.2% of the observed leak.

This refuted ALL JS-side candidates as the primary mechanism. The remaining 98.8% of the leak must be in native memory not visible to the JS heap inspector.
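The share figures follow directly from the numbers quoted above; the variable names in this check are illustrative.

```python
js_heap_growth_kb = 437          # sustained-phase JS heap growth, cycle 10 -> 50
cycles = 40
js_heap_rate_kb_min = 36         # the same growth expressed per minute
overlay_rss_rate_kb_min = 3017   # overlay process RSS growth in the same window

per_cycle_kb = js_heap_growth_kb / cycles                  # about 11 KB/cycle
js_share = js_heap_rate_kb_min / overlay_rss_rate_kb_min   # fraction of the leak
native_share = 1 - js_share                                # not visible to the JS heap inspector
```

Both rates are taken over the same sustained-phase window, which is what makes the ratio meaningful.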

Conclusion at end of Phase 5: the leak lives in WebKit's C++ allocations in the recording_overlay subprocess, driven by the high-frequency mic-level event traffic. Specific C++ mechanism unknown (would require heaptrack on the WebKit subprocess to identify further), but the path of causation is clear: emit_levels at 24 Hz → double broadcast → IPC packets to overlay → WebKit processes events → DOM mutations and style recalculations → C++ allocations in the WebKit process.

The reporter's three suggested fixes all reduce the upstream event volume that drives the C++ allocation rate. They were structurally right even though the mechanism turned out to be different from the hypothesized JS-side accumulation.


6. Phase 6 — Write the fix and validate with bisect controls

Why this matters

A fix that "seems to make things better" is not enough. You need:

  • Quantitative evidence that the leak rate dropped
  • Confidence the drop is from the fix, not coincidence
  • For multi-component fixes, confidence about WHICH part did what

A fix without bisect validation is a hope. A fix WITH bisect validation is a defensible claim — exactly the kind of claim that needs to be defensible because it's going into a public PR.

Co-specify validation with the fix

Before writing fix code: state the success criterion in measurable terms. "The bug is fixed" is not measurable. "The sustained-phase RSS growth rate of the affected process drops below X KB/min under workload Y, measured by the same harness used to characterize the bug" is measurable.

You will frequently want a control reproduction too: "the unfixed baseline measures Z KB/min under the same workload."

Design the fix to be bisect-friendly

If your fix has multiple components, build them so each component is individually toggleable at runtime. Options:

  • Env-var flags. Each component checks an env var at startup. Flags default to "fix enabled"; setting any flag disables that one component, reproducing original behavior for it.
  • Feature flags. Compile-time toggles (Cargo features, build variables). More work to bisect (one rebuild per variant) but cleaner isolation.
  • Runtime configuration. A settings file with toggles. Easiest to flip at runtime but requires a settings-layer change.

Env-vars are usually best: one build, runtime variants. Cache the env-var reads once at startup to avoid per-call overhead.

The bisect-friendly machinery is for validation only. Strip it for the PR (or feature-gate it behind a --features bisect-toggles Cargo flag) so the shipped code is clean.
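The env-var pattern is language-independent. A minimal Python sketch follows, with hypothetical flag names (MYAPP_FIX_DISABLE_*) and a cache that fills on first use rather than strictly at startup:

```python
import os
from functools import lru_cache

# Hypothetical flag names. Setting one to "1" disables that fix component;
# leaving it unset keeps the component enabled (the shipping behavior).
FLAGS = {
    "skip_when_disabled": "MYAPP_FIX_DISABLE_SKIP",
    "targeted_emit": "MYAPP_FIX_DISABLE_TARGETED_EMIT",
    "throttle": "MYAPP_FIX_DISABLE_THROTTLE",
}


@lru_cache(maxsize=None)
def fix_enabled(component):
    """Read the env var once per component; later calls return the cached value."""
    return os.environ.get(FLAGS[component], "") != "1"
```

Because the value is cached, changing the environment mid-run has no effect. That is the intended behavior for a bisect variant, where each run should hold one configuration from start to finish.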

The control variant

Run a variant with all fix components disabled. This should reproduce the original behavior under the same instrumentation. The control's measured rate is your comparison anchor.

If the control DOESN'T match the original (pre-patch) measurement within ~10%, your bisect mechanism is broken — the env-var toggles don't fully disable, or some other factor differs. Don't proceed until the control reproduces.

Per-component isolated variants

Run a variant with only ONE fix component enabled, others disabled. Repeat for each component. The measured rate per variant tells you that component's isolated effect.

The combined variant

Run a variant with all components enabled. This is the production PR behavior. Compare against the control AND against the sum of isolated effects.

If the combined reduction from control is smaller than the sum of the isolated reductions, the components overlap (they mask each other). If it is larger than the sum, the components compound (positive interaction). Both are informative; neither is necessarily a problem.
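The control check and the interaction comparison reduce to a few lines of arithmetic. The sketch below uses invented rates, not measured Handy data, and the function names are illustrative.

```python
def reduction(control, rate):
    """Fractional reduction relative to control (0.25 means 25% lower)."""
    return (control - rate) / control


def control_reproduces(control, baseline, tolerance=0.10):
    """True when the all-disabled variant lands within tolerance of the pre-patch rate."""
    return abs(control - baseline) <= tolerance * baseline


def classify(control, isolated, combined, eps=1e-9):
    """Compare the combined reduction with the sum of isolated reductions."""
    iso = {name: reduction(control, rate) for name, rate in isolated.items()}
    total = sum(iso.values())
    comb = reduction(control, combined)
    if comb < total - eps:
        verdict = "overlap"    # components mask each other
    elif comb > total + eps:
        verdict = "compound"   # components reinforce each other
    else:
        verdict = "additive"
    return iso, comb, verdict


# Invented rates in KB/min.
iso, comb, verdict = classify(
    control=1000,
    isolated={"fix_a": 600, "fix_b": 700},   # 40% and 30% reductions
    combined=500,                             # 50% reduction
)
```

In this example the isolated reductions sum to 70% but the combined variant achieves 50%, so the verdict is "overlap". Isolated reductions that sum to more than 100% always overlap, since no variant can remove more than the whole rate.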

Predicting before measuring

For each variant, write down the expected effect BEFORE running. After measurement, compare expectation to observation.

  • Match within 10%: model is correct
  • Off by 50%: model is approximately correct
  • Off by 10x: model is wrong, investigate

Predictions that match build confidence in the model. Predictions that miss reveal model bugs you can fix.
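The grading can be automated so that it is applied the same way to every variant. In this sketch the 10%, 50%, and 10x thresholds come from the list above; the "suspect" band between 50% and 10x and the noise_floor parameter (needed because several predictions are close to zero) are assumptions added for illustration.

```python
def grade(predicted, observed, noise_floor=1.0):
    """Classify a prediction against an observation.

    noise_floor is in the same unit as the rates; values below it are
    treated as equal to it so that predictions near zero can be graded.
    """
    p = max(abs(predicted), noise_floor)
    o = max(abs(observed), noise_floor)
    if max(p, o) / min(p, o) >= 10:
        return "wrong"
    rel_err = abs(o - p) / p
    if rel_err <= 0.10:
        return "correct"
    if rel_err <= 0.50:
        return "approximately correct"
    return "suspect"  # assumed label for the band the thresholds leave open
```

Recording the grade next to each prediction in the bisect notes makes it harder to rationalize a miss after the fact.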

Output

A bisect-analysis.md containing:

  • Each variant's name, configuration, measured rate
  • Each component's isolated effect (percentage reduction from control)
  • The combined effect
  • Comparison of measured to predicted per variant
  • Decision: which components are essential, which are nice-to-have

Failure modes to avoid

  • Skipping the control. Without a control reproduction, you cannot attribute the rate change to the fix.
  • Bisecting too early. If you bisect before the bug's mechanism is understood, you measure side effects of multiple changes at once.
  • Trusting a sub-agent's aggregate rate. Independently recompute per-variant rates from the raw data. The agent's framing may conflate warmup and sustained, or apply different window boundaries per variant.
  • Declaring victory based on full-run averages. Use sustained-phase windows so warmup growth doesn't dominate.
  • Shipping the bisect-toggle code. Strip or feature-gate before the PR.

Worked example: Handy

The fix has three components corresponding to the reporter's three suggested fixes (which we adopted not because the reporter suggested them but because the data confirmed they all reduce the upstream event volume driving the WebKit C++ leak):

  1. Skip emit when overlay disabled. Cache overlay_position != None in an AtomicBool, updated from the settings-change handler. Skip the emit entirely when the cache says overlay is disabled. For users with overlay_position: none (the Linux default), this eliminates the leak's driver completely.

  2. Use targeted emit_to instead of dual broadcast. Replace app_handle.emit("mic-level", levels) + overlay_window.emit(...) with a single app_handle.emit_to("recording_overlay", "mic-level", levels). Halves the IPC traffic and the evaluate_script calls in WebKit.

  3. Throttle to 20 Hz. Capture a last_emit_ms in an AtomicU64; skip emits arriving sooner than 50 ms after the last one. Caps the per-event WebKit work irrespective of audio callback rate.

Each component is gated by an env var (HANDY_LEAK_FIX_DISABLE_SKIP_WHEN_NONE, _DISABLE_EMIT_TO, _DISABLE_THROTTLE). All env vars are read once, at first call, via OnceLock, so the per-call cost is a cached lookup rather than an environment read.
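The skip and throttle decisions can be modelled in a few lines to see what rate they let through. This is a Python stand-in for the logic described above, not the Rust code from the patch; the class and function names are hypothetical, arrivals are assumed to be perfectly evenly spaced, and the emit_to change (component 2) is not modelled.

```python
class EmitGate:
    """Decides whether one mic-level event should be emitted (illustrative model)."""

    def __init__(self, overlay_enabled, min_interval_us=50_000,
                 skip_when_disabled=True, throttle=True):
        self.overlay_enabled = overlay_enabled
        self.min_interval_us = min_interval_us
        self.skip_when_disabled = skip_when_disabled
        self.throttle = throttle
        self.last_emit_us = None

    def should_emit(self, now_us):
        # Component 1: nothing to draw, so nothing to send.
        if self.skip_when_disabled and not self.overlay_enabled:
            return False
        # Component 3: drop events arriving sooner than min_interval_us
        # after the last emitted one.
        if (self.throttle and self.last_emit_us is not None
                and now_us - self.last_emit_us < self.min_interval_us):
            return False
        self.last_emit_us = now_us
        return True


def simulate(gate, rate_hz, seconds):
    """Count emitted events for evenly spaced arrivals at rate_hz."""
    period_us = 1_000_000 // rate_hz
    return sum(gate.should_emit(i * period_us) for i in range(rate_hz * seconds))


disabled = simulate(EmitGate(overlay_enabled=False), 24, 30)
unthrottled = simulate(EmitGate(True, throttle=False), 24, 30)
throttled_24 = simulate(EmitGate(True), 24, 30)
throttled_94 = simulate(EmitGate(True), 94, 30)
```

Under the even-spacing assumption, a 50 ms minimum interval is a ceiling rather than a target: a 24 Hz source passes every other event (about 12 Hz) and a 94 Hz source passes every fifth (about 19 Hz). Real audio callbacks are not perfectly periodic, so the measured rate may differ, which is one more reason to write the prediction down before running the variant.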

The bisect harness runs five variants:

  • bisect-control: all three disabled — should reproduce the baseline
  • bisect-fix1, bisect-fix2, bisect-fix3: each component in isolation
  • bisect-all: no env vars — the PR behavior

Each variant runs the same 50-cycle workload as the bug-characterization measurement (Run 5). Measured: sustained-phase overlay WebKitWebProcess RSS growth rate (cycle 10 → cycle 50 window).

Predictions:

  • Control: 3,017 KB/min (matches Run 5's measured baseline ±10%)
  • Fix 1 only: ~0 KB/min (eliminates the emit path under user's config)
  • Fix 2 only: ~1,500 KB/min (50% of control — halves the per-emit WebKit work)
  • Fix 3 only: ~2,500 KB/min (about 17% reduction: caps 24 Hz to 20 Hz, and 3,017 × 20/24 ≈ 2,514)
  • All three: ~0 KB/min for overlay_position: none (Fix 1 dominates)

[The bisect run is in progress at time of writing this document; the five-variant comparison will populate the methodology's worked example after completion.]


7. Phase 7 — Adversarial review before publishing

Why this matters

The fix you're about to publish will become a load-bearing claim in a public artifact: a PR description, an issue comment, an internal write-up. Once published, the claim outlives its reasoning — readers will treat it as fact without the caveats that produced it.

This is the moment to test the claim adversarially before it leaves your context. Mistakes caught now are cheap. Mistakes caught after publication propagate.

Mechanics

For each load-bearing claim in your draft PR description, ask:

  • Is this claim a measurement or an inference? If a measurement, cite the data. If an inference, name the underlying measurements and the reasoning step from data to claim.
  • What would invalidate this claim? If you can't articulate a failure mode, the claim is too vague or too broad.
  • Have I measured every quantity I'm asserting? Or am I rounding, approximating, or substituting a related-but-different measurement?
  • Does the data I'm citing actually distinguish my claim from alternative hypotheses? Or is it consistent with multiple hypotheses (in which case the citation is weaker than implied)?
  • What's the smallest, most testable, most-likely-to-be-wrong claim I'm making? Stress-test that one first.

Reverse engineer the claim

A useful exercise: read your draft as a hostile reviewer would. Look for:

  • Strongest claim (the one a skeptical reader will challenge first)
  • Weakest evidence (the one the skeptic will find first)
  • Mismatch (a claim whose strength exceeds its evidence)

Fix the mismatches before publishing. Either soften the claim, or strengthen the evidence.

Output

A pr-description-draft.md with claims labeled by evidence type:

  • [MEASURED] — direct measurement
  • [INFERRED] — reasoning step from measurements
  • [ESTIMATED] — order-of-magnitude calculation
  • [ASSERTED] — claim without explicit support (rare; if you have any of these, ask why)

The reviewer can use these labels to evaluate the claim chain.
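The labels are easy to check mechanically before the draft goes out. A minimal sketch follows; the claim texts and source file names are invented examples.

```python
from dataclasses import dataclass
from enum import Enum


class Evidence(Enum):
    MEASURED = "MEASURED"
    INFERRED = "INFERRED"
    ESTIMATED = "ESTIMATED"
    ASSERTED = "ASSERTED"


@dataclass
class Claim:
    text: str
    evidence: Evidence
    sources: tuple = ()   # files or measurement ids backing the claim


def needs_attention(claims):
    """Flag ASSERTED claims and any other claim that cites no source."""
    return [c for c in claims
            if c.evidence is Evidence.ASSERTED or not c.sources]


claims = [
    Claim("Overlay process RSS grows during recording.",
          Evidence.MEASURED, ("rss-samples.csv",)),
    Claim("The growth is in native memory.",
          Evidence.INFERRED, ("rss-samples.csv", "heap-diff.json")),
    Claim("The fix has no user-visible side effects.", Evidence.ASSERTED),
]

flagged = needs_attention(claims)
```

Anything returned by needs_attention is either softened or backed with a measurement before publishing.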

Worked example: Handy (anticipated)

The PR description will claim, roughly: "The recording overlay's mic-level event traffic causes a sustained ~3 MB/min memory leak in the WebKitWebProcess on Linux. Reducing the event volume mitigates the leak proportionally; fully gating emission on overlay_position != None eliminates it for users who have the overlay disabled. Measured directly via smaps_rollup over controlled-workload captures; bisect validation isolates each fix's contribution."

Stress-testing:

  • The "3 MB/min" number is from a single 17-minute capture against an instrumented binary. Is one capture enough? Counter: we have consistent rates across multiple captures with similar workloads.
  • "Reducing the event volume mitigates proportionally" — the bisect results will either support or refute this.
  • "Eliminates" is a strong word — would 99% reduction satisfy the claim, or must it literally go to zero? The bisect variant with overlay_position: none + Fix 1 measures the answer.
  • "WebKit C++ allocations" is unsourced beyond the inference from "JS heap accounts for 1.2%". The description should state that it rests on the Heap.snapshot diff data, which bounds the JS-side contribution.

Each claim gets a label and a citation. The hostile reviewer sees the evidence chain and can challenge specific links.


8. Cross-cutting principles

These appear repeatedly across the phases. Calling them out:

Reversibility is a property of the tooling, not a hope

If your tooling requires "I'll remember to clean it up later," it will fail. Build the cleanup automatically. Verify it ran.

Symptoms vs. hypotheses

Symptoms are measurements or direct observations. Hypotheses are inferences. Keep them in separate columns.

Inherited frames are claims to test

Every framing that arrived from outside your context — a bug report, a paper, a prior session's notes, your own previous inference — is a claim about the system, not a fact. Test it. The same applies when the source is authoritative (a senior engineer, a vendor, a famous paper).

Follow the data when it contradicts your framing

When measurement says X and your model says Y, the model is wrong. Resist the urge to add epicycles to save the model. Update it.

Match the data type to the claim type

"The JS heap is leaking" requires JS-heap measurement. "Memory grows" requires RSS measurement. Don't substitute.

Distinguish warmup from sustained

One-time initialization growth is not a leak. Compute rates from a sustained-phase window, not a full-run average.

Distinguish process of residence

A leak in process A is not a leak in process B even if the symptom looks similar. Identify which process is growing before identifying which subsystem.

A "PASS" without measured properties is not a PASS

Validation requires evidence. If your phase report says PASS because "no errors were observed," count that as a FAIL until a quantitative property is verified.

Cheap measurements first

Always pick the cheapest distinguishing test next. Sunk capture data is free to re-mine.

Bisect to validate multi-component fixes

A fix that works overall but whose components you don't understand is fragile. Isolate each.

Document mistakes when they happen

When you make a mistake — terminate a healthy run, misread a status notification, anchor on a wrong hypothesis — name it explicitly, document the failure mode, save a feedback memory or note. The cost is small; the next investigation benefits.

Be honest about what's not yet known

The end of an investigation has known root cause AND known unanswered questions. Document both. Future you (or future others) will need to know what was left open.


9. Adapting to other stacks

The methodology is invariant; the tools change. Substitutions:

Function-level tracing

| Stack | Tool |
| --- | --- |
| Rust on Linux | uftrace + nightly + -Z instrument-mcount; or perf with frame pointers |
| C/C++ on Linux | uftrace with -pg or -finstrument-functions; perf |
| C/C++ on macOS | dtrace; Instruments |
| JVM | async-profiler; JFR (Java Flight Recorder) |
| Go | runtime/pprof; gops; trace |
| Python | cProfile; py-spy; viztracer |
| Node.js | clinic.js; --inspect protocols; V8 profiler |
| .NET | dotnet-trace; PerfView |

Syscall and IPC observation

| OS | Tool |
| --- | --- |
| Linux | strace -f -yy -s 16384 |
| macOS | dtruss; opensnoop |
| Windows | procmon; WPR/WPA |
| Containers | Same as host; mount /proc as needed |

UI / web-runtime observation

| Runtime | Tool |
| --- | --- |
| WebKitGTK | WEBKIT_INSPECTOR_HTTP_SERVER + WS protocol |
| Chromium (Electron, WebView2, CEF) | --remote-debugging-port=NNNN |
| WKWebView (Safari, macOS apps) | Safari Develop menu |
| Firefox-based | about:debugging |
| Native (no webview) | This category doesn't apply |

Memory profiling beyond what the runtime exposes

| Process kind | Tool |
| --- | --- |
| Native C/C++ allocations | heaptrack; valgrind massif; jemalloc profiler |
| JS heap | Inspector Heap domain (per runtime) |
| JVM heap | JProfiler; YourKit; jcmd GC.heap_dump |
| Go heap | pprof |
| Python heap | tracemalloc; memray |
| Memory at OS level | smaps_rollup (Linux); vmmap (macOS); RAMMap (Windows) |

IPC observation specifically

| Mechanism | Tool |
| --- | --- |
| Unix sockets | strace -e trace=sendmsg,recvmsg -yy |
| Windows pipes / RPC | procmon; ETW |
| HTTP/gRPC | mitmproxy; tcpdump; Wireshark; or the framework's tracing |
| Shared memory | strace -e trace=mmap; lsof |
| D-Bus | dbus-monitor |

Reversibility scripts

The pattern is the same on every OS: snapshot, install, diff, uninstall, verify-clean. The implementation language varies:

  • Linux/macOS: bash, with a snapshot dir of plain text files
  • Windows: PowerShell, with snapshot files in JSON
  • Cross-platform: Python with platform-specific shells
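The snapshot, diff, and verify-clean steps can be sketched in cross-platform Python. This is a minimal illustration that tracks only file contents under one directory; a real implementation would also cover installed packages, environment variables, and anything else the tooling touches. The function names and demo file names are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def snapshot(root):
    """Map each file under root (by relative path) to a SHA-256 of its contents."""
    root = Path(root)
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*")) if p.is_file()
    }


def diff(before, after):
    """Report what was added, removed, or changed between two snapshots."""
    return {
        "added": sorted(set(after) - set(before)),
        "removed": sorted(set(before) - set(after)),
        "changed": sorted(k for k in before.keys() & after.keys()
                          if before[k] != after[k]),
    }


def verify_clean(before, after):
    """True only when the current state matches the pre-install snapshot."""
    return not any(diff(before, after).values())


# Demo in a throwaway directory: snapshot, "install", diff, "uninstall", verify.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "existing.conf").write_text("original")
    before = snapshot(root)
    (root / "tool.bin").write_text("installed")   # stand-in for the install step
    install_diff = diff(before, snapshot(root))
    (root / "tool.bin").unlink()                  # stand-in for the uninstall step
    clean = verify_clean(before, snapshot(root))
```

The value of verify_clean is that it is computed from state, not from the uninstall script's exit code, so a partial uninstall shows up as a non-empty diff.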

When you can't reproduce the exact procedure

Some projects can't run the exact procedure (e.g., production-only state, hardware-specific behavior). Adaptations:

  • Production-only state: spend more effort on the synchronized capture, less on the static enumeration. Production capture is expensive; make it count.
  • Hardware-specific: identify the exact hardware the bug requires, acquire access (cloud GPU instances; loan equipment), or scope the investigation to the part of the bug that's hardware-independent.
  • Closed-source dependencies: gap-audit them aggressively; identify what observation tool reaches inside (heaptrack often does for closed-source native dylibs).

10. When this methodology is the wrong tool

This is a heavy methodology. Don't apply it when a lighter one suffices.

Use a lighter approach when:

  • The bug reproduces in under a minute and has a clear repro recipe → just run the repro under a debugger
  • The bug is a single-process logic error → unit tests + targeted logging
  • The fix is obvious and trivial (a typo, a missing null check, an off-by-one) and the only question is verification → write the test, write the fix, verify
  • You don't yet know if the bug is real (single report, no reproduction by anyone else, no logs) → reproduce first, then decide

Use a HEAVIER approach when:

  • This methodology gets you to a candidate set but you can't distinguish among the survivors → escalate to specialized tools (heaptrack, eBPF, kernel tracing, custom dylib builds with debug symbols)
  • The bug requires reproducing user input you don't have access to (audio recordings, video, behavioral patterns) → capture from real users with their consent
  • The bug requires a specific combination of timing, concurrency, or hardware that you can't reliably reproduce → chaos engineering, fuzzing, or moving the investigation to production with appropriate safeguards

11. Closing

The methodology in this document is a procedure, not a recipe. Apply the principles; adapt the mechanics. The goal is not to follow these steps mechanically; it is to investigate hard bugs with the rigor that hard bugs require.

The investment in Phase 2 (reversible tooling) and Phase 3 (the execution-flow map) is substantial — typically a day or two of focused work. It pays back the first time you avoid a wasted day chasing the wrong hypothesis, and it amortizes across every subsequent hard bug in the same project.

The investment in Phase 4 (framing) and Phase 5 (narrowing) is the research core. It is where the bug is actually found. Treat it as research, not as engineering. The pace will feel slow; the pace is right.

The investment in Phase 6 (fix and validate) and Phase 7 (adversarial review) is what makes the fix defensible after it leaves your context. Skipping these is the most common way good investigations produce bad PRs.

Above all: the data is ground truth. Every step of this methodology exists to remove the layers of inference and framing that stand between you and the data. When you do that successfully, hard bugs become tractable.


Appendix: minimum-viable instrumentation directory structure

For a fresh project where you want to start applying this methodology today, here's a directory skeleton:

your-project/
├── env/                        # Reversible tooling (Phase 2)
│   ├── snapshot.sh
│   ├── install.sh
│   ├── uninstall.sh
│   ├── cleanup-run.sh          # Load-bearing safety script
│   ├── verify-clean.sh
│   ├── activate.sh
│   ├── README.md               # Names the load-bearing safety property
│   └── snapshots/
│       └── pre-install/        # Created by install.sh on first run
├── INSTRUMENTATION.md          # Phase 3 procedure (project-specific)
├── METHODOLOGY.md              # This document (or an adapted version)
└── instrumentation-output/     # Phase 3 capture artifacts
    └── <UTC-timestamp>/        # One dir per capture run
        ├── state/              # Per-run state files (cleanup inputs)
        ├── harness/            # Run script + post-processors
        ├── heap-snapshots/     # If applicable
        ├── coverage-map.csv    # Static enumeration
        ├── uftrace.data/       # Function trace raw data
        ├── strace.log          # Syscall trace
        ├── *.json              # Inspector records
        ├── handy.stdout.log    # Application output
        ├── phase-timeline.log  # Wall-clock anchors
        ├── execution-trace.csv # Synthesis
        ├── call-graph.csv      # Synthesis
        ├── coverage.csv        # Synthesis
        ├── execution-flow.mmd  # Mermaid diagram
        ├── gap-audit.md        # What we can't see
        └── status-summary.md   # Per-tool validation contracts

Adapt the file extensions and naming to your stack. The structure is the methodology made concrete.


End of document.
