@devdanzin
Created April 23, 2026 08:48
Feasibility of making the tools standalone

Good question — I've been thinking about this across the frozendict / couchbase / VTK reviews. Honest breakdown of where the value actually came from:

What mechanizes cleanly (60-70% of findings across reviews so far):

  • Private API / _Py* grep, deprecated API grep, non-limited macro grep, static PyTypeObject detection, direct struct access (->ob_refcnt, ->tp_*), CPython-internal header includes — the entire stable-abi-checker and version-compat-scanner output is just table lookups.
  • Refcount: new/borrowed/stolen lifecycle for known APIs, missing NULL checks, unchecked allocations, return-without-exception.
  • Borrowed-ref-across-call, PyErr state at return, lock/unlock CFG matching: these are proper dataflow problems that our Tree-sitter scripts only approximate; a real dataflow-analysis (DFA) framework would handle them better, not worse.
  • Missing Py_MOD_GIL_NOT_USED, missing traverse/clear for types with PyObject* fields, dead pre-3.9 version guards.
  • Cluster aggregation — "2,257 static types → 3 emission sites" is mechanical once you have the AST.
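
As a rough illustration of the "just table lookups" point, here is a minimal sketch in Python. The pattern table and the `scan_source` helper are hypothetical stand-ins, not the actual stable-abi-checker code. A regex pass like this also matches inside comments and string literals, which a Tree-sitter pass would avoid.

```python
import re

# Hypothetical lookup table: check name -> regex. Real tables would be larger.
PATTERNS = {
    "private-api": re.compile(r"\b_Py[A-Za-z_]\w*\s*\("),
    "direct-struct-access": re.compile(r"->\s*(ob_refcnt|ob_type|tp_\w+)\b"),
    "static-type": re.compile(r"\bstatic\s+PyTypeObject\b"),
}


def scan_source(text):
    """Return (line_number, check_name, matched_text) for every table hit."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                findings.append((lineno, name, match.group(0).strip()))
    return findings


# Small made-up C fragment used only to exercise the scanner.
SAMPLE = """\
static PyTypeObject FooType;
Py_ssize_t n = obj->ob_refcnt;
PyObject *r = _PyObject_CallNoArgs(func);
"""

for finding in scan_source(SAMPLE):
    print(finding)
```

Each finding carries a line number and a check name, so the cluster aggregation step can group on them later without re-parsing the source.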

What genuinely needed the agents (the remaining 30-40%):

  • Novel reproducer construction (T16, T26-T29, F1 on 3.14t, F16 EvilSeq, F25 EvilBool) — crafting these requires "what if…" reasoning and Python-runtime intuition. A compiler can't do this.
  • Cross-finding synthesis — recognizing F8+F9+F10 as the PyVTKObject heap-type trio, or diagnosing PEP 683 immortal objects as what's masking an F23/F24 refcount signal. This is reading-with-imagination, not pattern matching.
  • Triage of the scanners' 20-40% false-positive rate: the "is this deref actually protected by a check 40 lines up" question. A real dataflow-analysis framework shrinks this substantially but doesn't close it.
  • Report prose, classification rationale, stakeholder-facing synthesis — turning findings into something Ben wants to read.
  • Git-history correlation — "did we miss any similar bugs after fix commit X?" requires diff-diffing, not AST.
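
To make the reproducer item concrete: the actual reproducers (F16 EvilSeq, F25 EvilBool, and so on) are not shown here, so the following is only a sketch of their general shape, in Python. `EvilIter`, `consume`, and `exception_propagates` are hypothetical names, and `consume` is a pure-Python stand-in for the C function under test. The idea is to hand the extension an argument whose protocol methods misbehave at a chosen moment, then check that the error reaches the caller instead of being swallowed.

```python
class EvilIter:
    """Iterable that yields `good` items, then raises from __next__."""

    def __init__(self, good, exc_type=RuntimeError):
        self.remaining = good
        self.exc_type = exc_type

    def __iter__(self):
        return self

    def __next__(self):
        if self.remaining == 0:
            # Fail partway through, when the callee may hold partial state.
            raise self.exc_type("boom")
        self.remaining -= 1
        return self.remaining


def consume(iterable):
    # Hypothetical stand-in for the extension function under test.
    return list(iterable)


def exception_propagates(func, evil):
    """True if `func` lets the exception raised by `evil` reach the caller."""
    try:
        func(evil)
    except evil.exc_type:
        return True
    return False


print(exception_propagates(consume, EvilIter(good=2)))
```

Choosing which method should fail, and after how many successful calls, is the "what if…" step that does not fall out of a table.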

GCC -fanalyzer feasibility

High feasibility for most of the mechanical 60-70%, because David Malcolm has been building the foundation. -fanalyzer already has:

  • Symbolic execution + region-based memory model (the hard part)
  • Python C API awareness at a basic level — refcount and state-machine tracking exists, it just needs API-table expansion
  • CFG/error-path infrastructure

What we could contribute directly (highest ROI first):

  1. Our API tables (api_tables.json, stable_abi.json, limited_api_headers.json, deprecated_apis.json) as analyzer rule inputs. Our tables have ~100 refcount entries; -fanalyzer needs ~2000 for full coverage, so ours are a starting corpus rather than a complete rule set.
  2. Our verified-bug test corpus: every FIX from the frozendict / couchbase / zstandard / VTK reviews becomes a regression test. These are real-world positives, not synthetic ones.
  3. Specific checks missing from -fanalyzer — static-type-under-Py_LIMITED_API, PyErr_Clear in exception-carrying context, Py_MOD_GIL_NOT_USED correctness under free-threading, dead pre-3.N version guards.
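
The schema of api_tables.json is not shown here, so the sketch below assumes a hypothetical shape (API name mapped to the ownership of the returned reference and the argument positions whose references are stolen) to illustrate what "tables as analyzer rule inputs" and the coverage gap mean. The four entries follow documented CPython behavior; `REFCOUNT_TABLE` and `coverage` are illustrative names only.

```python
# Hypothetical table shape; the four entries follow the CPython C API docs.
REFCOUNT_TABLE = {
    "PyLong_FromLong": {"returns": "new", "steals": []},
    "PyList_GetItem": {"returns": "borrowed", "steals": []},
    "PyDict_GetItem": {"returns": "borrowed", "steals": []},
    # PyList_SetItem(list, index, item) steals the reference to `item`
    # (argument position 2, counting from 0) and returns an int status.
    "PyList_SetItem": {"returns": None, "steals": [2]},
}


def coverage(called_apis, table=REFCOUNT_TABLE):
    """Split the APIs a codebase calls into covered and uncovered lists."""
    called = set(called_apis)
    covered = called & set(table)
    return sorted(covered), sorted(called - covered)


covered, missing = coverage(
    ["PyList_GetItem", "PyLong_FromLong", "PyObject_GetAttrString"]
)
print("covered:", covered)
print("missing:", missing)
```

Running a coverage pass like this over the reviewed codebases would show which missing entries matter most in practice, which is one way to prioritize table expansion.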

What doesn't port to GCC (keep in the LLM tier):

  • Reproducer generation
  • Synthesis / reports / classification rationale
  • Novel findings from "what if the user passes an iterable that raises"
  • Git-history fix-propagation detection
  • Cross-TU generator-vs-consumer reasoning (e.g., "this fprintf in vtkWrapPython*.c emits the bug into 1,830 bindings") — possibly doable with whole-program -fanalyzer but hard
  • Runtime/dynamic confirmation (libfiu OOM, 3.14t race hammering)

Recommendation

Three-tier, not two-tier:

  1. GCC -fanalyzer plugin — the cheap mechanical checks. Runs at build time, zero per-check cost, surfaces in IDEs. This is where most of the "every PR" defensive work should live long-term.
  2. Our Tree-sitter toolkit — stays relevant for: codebase-wide surveys, cluster/hotspot analysis, git-history correlation, cross-TU reasoning the compiler doesn't do, and anything where we want to ship a plugin that works without a GCC build.
  3. LLM agents — promoted to the genuinely-scarce top of the stack: reproducer construction, synthesis, report writing, maintainer-facing communication.

Practical next step for Ben's question: rather than try to port our entire stack, contribute targeted checks + API tables + the verified-bug regression corpus upstream to -fanalyzer. That's a realistic 3-6 month effort (one check category at a time), keeps us relevant as the orchestration layer, and makes detection actually cheap for every Python C extension on the planet — not just the ones we review. The toolkit shifts from "find bugs" toward "drive reviews using cheap GCC output + add the novel/synthesis work on top."

Rough ROI: if -fanalyzer catches 60% of our findings at build time, our per-audit LLM cost drops by roughly the same 60% (assuming that cost scales with the number of findings the agents have to handle), and maintainers get immediate feedback instead of waiting for review. That's a better multiplier than trying to make our scripts more precise.
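
The arithmetic behind that estimate, as a sketch. It assumes per-audit LLM cost scales linearly with the number of findings the agents handle, which is an assumption of this example rather than a measured property. `llm_cost_after` and `fixed_fraction` are hypothetical names; `fixed_fraction` models the share of LLM work (reproducers, synthesis, reports) that does not shrink when the compiler catches more.

```python
def llm_cost_after(baseline_cost, catch_rate, fixed_fraction=0.0):
    """Estimate per-audit LLM cost once the compiler catches `catch_rate`.

    baseline_cost:  current per-audit cost, in any unit
    catch_rate:     fraction of findings caught at build time (0.0 to 1.0)
    fixed_fraction: fraction of the cost that does not scale with findings
    """
    fixed = baseline_cost * fixed_fraction
    variable = baseline_cost - fixed
    return fixed + variable * (1.0 - catch_rate)


# With no fixed share, a 60% catch rate removes 60% of the cost.
print(llm_cost_after(100.0, 0.6))

# If 30% of the cost is fixed, the same catch rate saves less.
print(llm_cost_after(100.0, 0.6, fixed_fraction=0.3))
```

The second call shows why the 60% figure is an upper bound: any LLM work that stays in the top tier regardless of detection cost dilutes the saving.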
