Good question — I've been thinking about this across the frozendict / couchbase / VTK reviews. Honest breakdown of where the value actually came from:
What mechanizes cleanly (60-70% of findings across reviews so far):
- Private API / `_Py*` grep, deprecated API grep, non-limited macro grep, static `PyTypeObject` detection, direct struct access (`->ob_refcnt`, `->tp_*`), CPython-internal header includes: the entire stable-abi-checker and version-compat-scanner output is just table lookups.
- Refcount: new/borrowed/stolen lifecycle for known APIs, missing NULL checks, unchecked allocations, return-without-exception.
- Borrowed-ref-across-call, PyErr state at return, lock/unlock CFG matching — these are proper dataflow problems our Tree-sitter scripts approximate; they'd be better in a real DFA framework, not worse.
- Missing `Py_MOD_GIL_NOT_USED`, missing traverse/clear for types with `PyObject*` fields, dead pre-3.9 version guards.
- Cluster aggregation: "2,257 static types → 3 emission sites" is mechanical once you have the AST.
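To make "just table lookups" concrete, here is a minimal sketch of that kind of check. It is not the toolkit's code: the real scanners work on a Tree-sitter AST, whereas this uses plain regexes so it stays self-contained, and the rule names and patterns in `RULES` are illustrative assumptions rather than the contents of our JSON tables.

```python
import re

# Illustrative rule table. Names and patterns are assumptions made for this
# sketch, not entries copied from the toolkit's JSON tables.
RULES = [
    ("private-api", re.compile(r"\b_Py[A-Za-z_0-9]*\s*\(")),
    ("direct-struct-access", re.compile(r"->\s*(ob_refcnt|ob_type|tp_[a-z_]+)\b")),
    ("static-type-object", re.compile(r"\bstatic\s+PyTypeObject\b")),
]


def scan(source):
    """Return (line_number, rule_name) for every rule hit in C source text."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for name, pattern in RULES:
            if pattern.search(line):
                hits.append((lineno, name))
    return hits


SAMPLE = """\
static PyTypeObject FooType;
static void f(PyObject *o) {
    o->ob_refcnt++;
    _PyObject_Dump(o);
}
"""

for lineno, name in scan(SAMPLE):
    print(lineno, name)
```

A line-based regex also fires inside comments and string literals, which is one reason the real checks sit on an AST. The point is only that each rule is a lookup with no reasoning involved.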
What genuinely needed the agents (the remaining 30-40%):
- Novel reproducer construction (T16, T26-T29, F1 on 3.14t, F16 `EvilSeq`, F25 `EvilBool`): crafting these requires "what if…" reasoning and Python-runtime intuition. A compiler can't do this.
- Cross-finding synthesis: recognizing F8+F9+F10 as the PyVTKObject heap-type trio, or diagnosing PEP 683 immortal objects as what's masking an F23/F24 refcount signal. This is reading-with-imagination, not pattern matching.
- Triage on the scanners' 20-40% FP rate — the "is this deref actually protected by a check 40 lines up" question. A real DFA framework shrinks this substantially but doesn't close it.
- Report prose, classification rationale, stakeholder-facing synthesis — turning findings into something Ben wants to read.
- Git-history correlation — "did we miss any similar bugs after fix commit X?" requires diff-diffing, not AST.
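For a feel of what the reproducer work looks like, here is a rough sketch of an adversarial sequence. The actual F16/F25 reproducers are not reproduced in this message, so this `EvilSeq` is a hypothetical stand-in showing the general shape: an object that reports a normal length and then raises partway through iteration, so the error path runs after some items have already been consumed. `consume` is likewise a made-up pure-Python stand-in for a C function that loops over a sequence.

```python
class EvilSeq:
    """Sequence that reports a length, then raises at a chosen index.

    Hypothetical stand-in for the reproducer classes named above; the real
    ones are not shown in this thread.
    """

    def __init__(self, items, fail_at):
        self._items = list(items)
        self._fail_at = fail_at

    def __len__(self):
        return len(self._items)

    def __getitem__(self, index):
        if index == self._fail_at:
            raise RuntimeError(f"injected failure at index {index}")
        return self._items[index]


def consume(seq):
    """Stand-in for extension code that indexes a sequence item by item."""
    out = []
    for i in range(len(seq)):
        out.append(seq[i])
    return out


try:
    consume(EvilSeq(["a", "b", "c"], fail_at=1))
except RuntimeError as exc:
    print("raised:", exc)
```

Against a real extension you would pass the object to the C entry point instead of `consume` and watch for leaks or crashes on the partial-result path. Choosing which index, which exception, and which entry point is the part that needs judgment.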
Porting most of the mechanical 60-70% to GCC's -fanalyzer looks highly feasible, because David Malcolm has been building the foundation. -fanalyzer already has:
- Symbolic execution + region-based memory model (the hard part)
- Python C API awareness at a basic level — refcount and state-machine tracking exists, it just needs API-table expansion
- CFG/error-path infrastructure
What we could contribute directly (highest ROI first):
- Our API tables (`api_tables.json`, `stable_abi.json`, `limited_api_headers.json`, `deprecated_apis.json`) as analyzer rule inputs. Ours has ~100 refcount entries; -fanalyzer needs ~2000 for full coverage. Our tables are a starting corpus.
- Our verified-bug test corpus: every FIX from the frozendict / couchbase / zstandard / VTK reviews becomes a regression test. We have real-world positives, not synthetic.
- Specific checks missing from -fanalyzer: static-type-under-`Py_LIMITED_API`, `PyErr_Clear` in exception-carrying context, `Py_MOD_GIL_NOT_USED` correctness under free-threading, dead pre-3.N version guards.
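As an example of how small one of these checks is, here is a sketch of dead-version-guard detection. It assumes the project declares a minimum supported Python version and only handles a bare `#if PY_VERSION_HEX <op> 0x...` comparison on its own line; compound conditions are skipped because they need a real preprocessor-expression parser. The function names are made up for this sketch.

```python
import re

# Matches only a bare comparison such as: #if PY_VERSION_HEX < 0x03090000
GUARD = re.compile(
    r"^\s*#\s*(?:if|elif)\s+PY_VERSION_HEX\s*(<=|<|>=|>)\s*(0x[0-9A-Fa-f]{8})\s*$"
)


def hex_for(major, minor):
    """Lowest PY_VERSION_HEX for major.minor (micro, level, serial all 0)."""
    return (major << 24) | (minor << 16)


def dead_guards(source, min_supported):
    """Yield (lineno, verdict) for guards already decided by the version floor."""
    floor = hex_for(*min_supported)
    for lineno, line in enumerate(source.splitlines(), start=1):
        match = GUARD.match(line)
        if not match:
            continue
        op, bound = match.group(1), int(match.group(2), 16)
        # Every supported interpreter has PY_VERSION_HEX >= floor.
        decided = bound <= floor if op in ("<", ">=") else bound < floor
        if decided:
            yield lineno, "always false" if op in ("<", "<=") else "always true"


SAMPLE = """\
#if PY_VERSION_HEX < 0x03090000
#define COMPAT_SHIM 1
#endif
#if PY_VERSION_HEX >= 0x030D0000
#define HAS_NEW_API 1
#endif
"""

for lineno, verdict in dead_guards(SAMPLE, min_supported=(3, 10)):
    print(lineno, verdict)
```

With a 3.10 floor, the first guard can never be true, so the code it protects is dead. The second guard still distinguishes supported versions and is left alone.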
What doesn't port to GCC (keep in the LLM tier):
- Reproducer generation
- Synthesis / reports / classification rationale
- Novel findings from "what if the user passes an iterable that raises"
- Git-history fix-propagation detection
- Cross-TU generator-vs-consumer reasoning (e.g., "this fprintf in `vtkWrapPython*.c` emits the bug into 1,830 bindings"): possibly doable with whole-program -fanalyzer but hard
- Runtime/dynamic confirmation (libfiu OOM, 3.14t race hammering)
Three-tier, not two-tier:
- GCC -fanalyzer plugin — the cheap mechanical checks. Runs at build time, zero per-check cost, surfaces in IDEs. This is where most of the "every PR" defensive work should live long-term.
- Our Tree-sitter toolkit — stays relevant for: codebase-wide surveys, cluster/hotspot analysis, git-history correlation, cross-TU reasoning the compiler doesn't do, and anything where we want to ship a plugin that works without a GCC build.
- LLM agents — promoted to the genuinely-scarce top of the stack: reproducer construction, synthesis, report writing, maintainer-facing communication.
Practical next step for Ben's question: rather than try to port our entire stack, contribute targeted checks + API tables + the verified-bug regression corpus upstream to -fanalyzer. That's a realistic 3-6 month effort (one check category at a time), keeps us relevant as the orchestration layer, and makes detection actually cheap for every Python C extension on the planet — not just the ones we review. The toolkit shifts from "find bugs" toward "drive reviews using cheap GCC output + add the novel/synthesis work on top."
Rough ROI: if -fanalyzer catches 60% of our findings at build time, our per-audit LLM cost drops by roughly the same ~60% (assuming cost scales with the number of findings the agents handle), and maintainers get immediate feedback instead of waiting for review. That's a better multiplier than trying to make our scripts more precise.
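The ROI estimate is back-of-envelope arithmetic, and a short sketch makes its assumptions visible. The function and its `fixed_fraction` parameter are hypothetical: the estimate above effectively treats the whole LLM cost as proportional to finding count, and `fixed_fraction` models whatever share of the cost does not scale that way (synthesis and report writing, for instance).

```python
def llm_cost_after(baseline_cost, caught_fraction, fixed_fraction=0.0):
    """Per-audit LLM cost once a fraction of findings is caught at build time.

    fixed_fraction is the share of baseline cost that does not scale with
    the number of findings (an assumption of this sketch).
    """
    fixed = baseline_cost * fixed_fraction
    variable = baseline_cost - fixed
    return fixed + variable * (1.0 - caught_fraction)


# Fully proportional cost: a 60% catch rate removes about 60% of the cost.
print(llm_cost_after(100.0, 0.60))

# If a quarter of the cost is fixed per audit, the saving is smaller.
print(llm_cost_after(100.0, 0.60, fixed_fraction=0.25))
```

So the ~60% figure reads best as an upper bound, and the real saving depends on how much of an audit's LLM spend is per-finding triage versus per-audit synthesis.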