
@YogirajA
Created March 31, 2026 18:19
Latest glitched-api report
AWAF -- Agent Well-Architected Framework
AWAF Assessment: glitched-api
AWAF v1.0 | 2026-03-29 | anthropic / claude-sonnet-4-6
========================================
Overall Score: 89/100 -- Production Ready
Fully ready. Variance within this band is noise.
Scale: Production Ready 85-100 | Near Ready 70-84 | Needs Work 50-69
High Risk 25-49 | Not Ready 0-24
Foundation <40 = automatic FAIL. Tier 2 pillars carry 1.5x weight.
----------------------------------------
+======================+==========+==============+==============+=========+
| Pillar               | Score    | Progress     | Confidence   | Status  |
+======================+==========+==============+==============+=========+
| TIER 0 -- FOUNDATION                                                    |
+----------------------+----------+--------------+--------------+---------+
| Foundation           |  90/100  | [######### ] | verified     | PASS    |
+======================+==========+==============+==============+=========+
| TIER 1 -- CLOUD WAF ADAPTED                                             |
+----------------------+----------+--------------+--------------+---------+
| Op. Excellence       | 100/100  | [##########] | partial      |         |
| Security             |  86/100  | [######### ] | verified     |         |
| Reliability          |  96/100  | [##########] | partial      |         |
| Performance          |  83/100  | [########  ] | partial      |         |
| Cost Optim.          | 100/100  | [##########] | verified     |         |
| Sustainability       | 100/100  | [##########] | partial      |         |
+======================+==========+==============+==============+=========+
| TIER 2 -- AGENT-NATIVE (1.5x weight)                                    |
+----------------------+----------+--------------+--------------+---------+
| Reasoning Integ.     |  91/100  | [######### ] | verified     | 1.5x    |
| Controllability      |  83/100  | [########  ] | verified     | 1.5x    |
| Context Integrity    |  74/100  | [#######   ] | verified     | 1.5x    |
+======================+==========+==============+==============+=========+
----------------------------------------
FILES ANALYZED: 24
----------------------------------------
FINDINGS (ordered by severity)
[High ] Op. Excellence Eval scheduling not evidenced:
tests/evals/layer2.py header says 'Run
nightly or before deploys' but no CI/CD
config (GitHub Actions workflow, Railway
cron, or equivalent) is provided showing
layer2.py is actually scheduled. The intent
is documented but execution is unverified.
[High ] Op. Excellence docs/slo.md and docs/postmortem_template.md
are referenced in RUNBOOK.md but not
provided as artifacts. SLO targets (p50 ≤
25s, p95 ≤ 60s) appear only inline in
RUNBOOK.md section 5, not in a canonical SLO
document. Actual SLO thresholds, error
budget policy, and burn rate alerting rules
cannot be verified.
[High ] Reliability Global daily cost counter
(_global_daily_cost in agents/_base.py) is
an in-process dict that resets on process
restart and is not shared across workers.
RUNBOOK.md explicitly documents this: 'a
restart mid-day can allow spend beyond the
daily cap for the remainder of that day.'
This means the GlobalBudgetExceededError
guard can be bypassed by a process
crash/redeploy, creating a reliability gap
in the cost-as-circuit-breaker pattern.
[High ] Reliability Checkpoint store (_checkpoint_store in
pipeline.py) is also in-process and lost on
restart. A pipeline that checkpointed
'profile' and 'paths' but not
'plan'/'happiness' will restart from scratch
after a process restart, re-spending tokens
for already-completed slices and potentially
re-triggering budget guards.
[High ] Cost Optim. Global daily budget counter
(_global_daily_cost in agents/_base.py) is
in-process only and resets on process
restart. RUNBOOK.md explicitly documents
this: a mid-day restart allows spend up to
GLOBAL_DAILY_BUDGET_USD again for the
remainder of the day. With Railway single-
worker constraint, a crash+restart can
silently double the effective daily cap.
[High ] Cost Optim. Global daily budget counter is not shared
across workers. RUNBOOK.md documents that
running N workers allows N ×
GLOBAL_DAILY_BUDGET_USD spend before any
single worker fires the alert. Railway must
be manually constrained to 1 replica --
there is no infrastructure-level enforcement
of this constraint.
[High ] Context Integrity Checkpoint replay does not validate
staleness of resumed slice data beyond a TTL
wall-clock check. In pipeline.py
_load_checkpoint(), a profile or paths
result stored up to _CHECKPOINT_TTL_S
(3600s) ago is replayed verbatim into
downstream agents. If the economic index
briefing or market conditions changed
between the original run and the resume, the
MarketAgent output embedded in the
checkpointed profile is stale but treated as
current. No re-fetch or re-validation occurs
on resume.
[High ] Context Integrity No context pruning, summarization, or
offloading before window saturation.
_base.py aborts at 90% utilisation
(ContextWindowExceededError) but takes no
graceful bounding action before that
threshold. Between the 75% warning and 90%
abort, the pipeline continues accumulating
tokens with no mitigation. For long retry
chains, this means the abort is the only
defense, resulting in a hard 503 rather than
a graceful degradation.
[Medium ] Foundation Slice boundary enforcement is informal:
slice_num is passed as a plain integer to
run_agent() and logged, but there is no
typed SliceBoundary contract, no enforcement
that slice numbers are unique or sequential,
and no documentation of what each slice
number means. If a new agent is added with a
duplicate or out-of-order slice_num, the
system will not detect it. Evidence:
agents/_base.py run_agent() signature; all
five agent files pass slice_num=1 through
slice_num=4 without a registry.
[Medium ] Foundation MarketAgent and IntakeAgent share
slice_num=1 (both pass slice_num=1 to
run_agent()), which means their per-slice
log entries and slice_scores list entries
are indistinguishable by slice number alone.
This makes post-incident forensics harder
and could cause confusion if slice_scores
are used for SLO attribution. Evidence:
agents/intake_agent.py line 'slice_num=1';
agents/market_agent.py line 'slice_num=1'.
[Medium ] Foundation The _checkpoint_store in pipeline.py uses
plain string keys ('profile', 'paths') with
no versioning or schema validation on
retrieval. A resumed pipeline could load a
stale checkpoint whose Pydantic model shape
has changed after a deploy, silently passing
an incompatible object to downstream agents.
Evidence: pipeline.py _save_checkpoint /
_load_checkpoint; no model version tag on
stored objects.
[Medium ] Op. Excellence Alert coverage is incomplete: fire_alert()
fires only on budget events (global daily
exhaustion and 80% warning). No alerts are
configured for agent retry exhaustion rate,
hallucination rate spikes
(get_hallucination_stats() is exposed on
/metrics but no threshold-based alerting),
circuit breaker state changes (logged but
not alerted), or p95 latency breaches.
RUNBOOK.md section 5 describes latency SLO
but no corresponding alert fires.
[Medium ] Op. Excellence Observability dashboard not evidenced:
/metrics endpoint exposes
hallucination_rates and
circuit_breaker_state, and /cost exposes
daily spend, but no dashboard (Datadog,
Grafana, Railway metrics, CloudWatch) is
shown aggregating these signals. Operators
must poll endpoints manually rather than
receiving proactive visibility.
[Medium ] Op. Excellence Single-worker constraint is a documented
operational risk (RUNBOOK.md section 6) with
no automated enforcement: Railway replica
count is not enforced via IaC or deployment
config. A misconfigured scale-out would
silently multiply the effective daily budget
cap without any alert firing.
[Medium ] Security HuggingFace API URL host is asserted once at
module import time (utils/economic_index.py,
_hf_parsed_host check) but the
httpx.AsyncClient in
fetch_economic_index_briefing() constructs
the request from the module-level
_HF_API_URL constant without runtime
validation. If _HF_API_URL were mutated at
runtime (e.g., via a monkey-patch in tests
or a future dynamic config path), the host
assertion would not re-fire. Additionally,
no TLS certificate pinning or response size
cap is applied to the HF fetch, meaning a
compromised or spoofed HF response could
inject arbitrarily large content into the
market agent's system prompt.
[Medium ] Security The in-process _response_cache and
_idempotency_cache in pipeline.py store full
AnalyzeResponse objects keyed by
sha256(role+want+fear) and
(api_key_prefix+idempotency_key)
respectively. The api_key_prefix used as the
idempotency cache key is only the first 8
characters of the raw API key
(x_api_key[:8]). With a 64-character hex key
space, 8 chars = 32 bits of prefix --
sufficient for collision resistance in low-
volume use, but if the key space is shorter
(e.g., a 16-char secret), prefix collisions
between different legitimate keys could
cause one user's cached response to be
served to another.
[Medium ] Security The _INJECTION_RE pattern in
utils/sanitize.py replaces matched injection
phrases with '[removed]' but does not re-
validate the sanitized string for minimum
length after replacement. A carefully
crafted input that is exactly at
_MIN_FIELD_LEN (10 chars) and consists
entirely of an injection pattern would pass
the pre-injection length check, be reduced
to '[removed]' (9 chars), and then fail the
post-sanitization length check -- raising
ValueError and returning a 422. This is the
correct outcome, but the error path leaks
that the input was detected as an injection
attempt (via the 422 detail message 'Input
could not be processed'), which could aid an
attacker in calibrating bypass attempts.
[Medium ] Reliability The circuit breaker state (_circuit_breaker
in agents/_base.py) is in-process and not
persisted. After a process restart following
a Claude API outage, the breaker resets to
CLOSED with failure_count=0, potentially
sending a burst of requests to a still-
degraded API before the breaker re-opens.
[Medium ] Reliability _parse_hf_rows() in utils/economic_index.py
raises NotImplementedError unconditionally,
meaning the live Anthropic Economic Index
fetch always fails and falls back to
STATIC_BRIEFING (Jan 2026 data). The
fallback works correctly, but the live data
path is permanently broken -- any market
intelligence dependent on current data is
silently stale.
[Medium ] Performance No context pruning between pipeline slices:
session total_tokens accumulates across all
4 agents (agents/_base.py
_session_state['total_tokens']), but there
is no mechanism to prune or summarize
earlier agent outputs before passing them to
later agents. The profile summary passed to
PathAgent, PlanAgent, and HappinessAgent
includes the full UserProfile and CareerPath
objects verbatim. At 200k context, this is
unlikely to cause truncation under normal
inputs, but the context.window_warning
threshold (75%) could be reached on verbose
inputs with retries, and there is no
proactive pruning to prevent it.
[Medium ] Performance Intake and Market agents run sequentially
despite partial independence:
run_intake_agent and run_market_agent are
called sequentially in pipeline.py (lines
~90-95). MarketAgent only needs
profile.sector, profile.seniority,
profile.transferable_skills, and
profile.primary_goal -- all of which are
available immediately after IntakeAgent
completes. However, the current design
awaits the full IntakeAgent result before
starting MarketAgent, adding one full LLM
round-trip to the critical path
unnecessarily.
[Medium ] Performance In-process response cache is exact-match
only with no semantic deduplication:
pipeline.py cache key is
sha256(role+want+fear), meaning any
character difference (capitalization,
punctuation, extra space) produces a cache
miss and a full pipeline re-run. Users who
resubmit with minor edits (e.g., fixing a
typo) will incur full cost. No semantic
similarity layer exists. This is noted as a
known limitation but has measurable cost
efficiency impact.
[Medium ] Cost Optim. Idempotency cache (_idempotency_cache in
main.py) and response cache (_response_cache
in pipeline.py) are both in-process with no
eviction beyond TTL expiry. Under sustained
load, these dicts grow unboundedly in
memory, which could cause OOM restarts --
which in turn reset the daily cost counter
(see finding 1), creating a cost-control
gap.
[Medium ] Sustainability In-process caches (_response_cache,
_idempotency_cache, _checkpoint_store,
_global_daily_cost) are not persisted across
process restarts. RUNBOOK.md explicitly
acknowledges this for the daily cost counter
(section 6), but the same constraint applies
to the response cache and checkpoints. A
Railway redeploy mid-day silently resets all
caches, eliminating deduplication benefits
and potentially allowing re-spend on
identical inputs.
[Medium ] Sustainability The economic index briefing (_cache in
utils/economic_index.py) uses a 24-hour in-
memory TTL but falls back to a static Jan
2026 snapshot when the HF fetch fails or
_parse_hf_rows raises NotImplementedError
(which it always does -- the function
unconditionally raises). This means every
process start fetches the static fallback,
and the 24h cache only persists within a
single process lifetime. The HF fetch is
effectively dead code, wasting ~5s of
network latency per cold start.
[Medium ] Sustainability No cache eviction policy exists for
_response_cache or _idempotency_cache beyond
TTL expiry on read. Under sustained load,
these dicts grow unboundedly in memory until
process restart. A long-running single-
worker process (as mandated by RUNBOOK.md
section 6) could accumulate significant
memory pressure over days.
[Medium ] Reasoning Integ. Uncertainty is captured internally
(inference_basis logged in intake_agent.py,
contradiction detection in market_agent.py)
but never surfaced to API consumers.
AnalyzeResponse has no field indicating low-
confidence inferences (e.g. seniority
inferred from thin evidence). A client
receiving a 'Junior' seniority for EC-001's
vague input has no signal that this was
inferred, not stated.
[Medium ] Reasoning Integ. Reasoning traces (thinking_blocks) are
logged at DEBUG level only and are in-
process -- they are not persisted to a
durable store or tracing backend (no
Langfuse, Arize, or equivalent). Post-
incident forensics require the Railway log
buffer to still contain the relevant trace,
which is not guaranteed for incidents
discovered hours later.
[Medium ] Reasoning Integ. Layer 2 evals (tests/evals/layer2.py)
contain unfinished stubs -- the
resolve_field_path and
evaluate_assert_fields functions are
described in docstrings but the
implementation bodies are absent (file shows
docstring-only stubs).
ADV-001/ADV-002/ADV-003 assert_fields will
not actually execute in CI, leaving
argument-accuracy regressions undetected.
[Medium ] Controllability No human-in-the-loop checkpoint before the
pipeline auto-selects path[0] and proceeds
to generate a 90-day action plan. The
path_confidence annotation is computed and
logged but never surfaced to a human
approver -- the pipeline proceeds
unconditionally even when confidence is low
(e.g. 0.8 when fit gap < 10 points).
pipeline.py lines 97-110.
[Medium ] Controllability The /cancel endpoint is unauthenticated
beyond the shared API_SECRET_KEY -- any
holder of the key can cancel any trace_id,
including traces belonging to other users.
There is no per-user or per-session
ownership check on cancellation. main.py
lines 148-155.
[Medium ] Context Integrity In-process checkpoint store
(_checkpoint_store in pipeline.py) is not
durably persisted. A process restart (crash,
Railway redeploy) silently discards all in-
flight checkpoints. A user whose pipeline
was mid-execution during a restart receives
no resume benefit and must restart from
slice 1, re-spending tokens. The RUNBOOK
documents this gap but no mitigation is
implemented.
----------------------------------------
RECOMMENDATIONS
Foundation Define a SliceRegistry (e.g. an Enum or dict in
agents/_base.py) mapping agent names to unique slice
numbers, and validate at import time that no two
agents share a slice_num. Assign intake=1, market=2,
path=3, plan=4, happiness=5 (or similar) and update
all agent files accordingly. This makes slice
boundaries explicit and detectable.
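
A minimal sketch of such a registry, assuming the suggested intake=1 through happiness=5 assignment (the names and numbers are illustrative, not the project's actual values):

```python
# Hypothetical SliceRegistry for agents/_base.py; agent names and slice
# numbers follow the recommendation's suggested assignment.
SLICE_REGISTRY: dict[str, int] = {
    "intake": 1,
    "market": 2,
    "path": 3,
    "plan": 4,
    "happiness": 5,
}

def validate_slice_registry(registry: dict[str, int]) -> None:
    """Fail fast at import time if slice numbers are duplicated or non-sequential."""
    nums = sorted(registry.values())
    if len(nums) != len(set(nums)):
        raise ValueError(f"Duplicate slice_num in registry: {registry}")
    if nums != list(range(1, len(nums) + 1)):
        raise ValueError(f"slice_num values must be 1..{len(nums)}, got {nums}")

validate_slice_registry(SLICE_REGISTRY)  # runs once, at module import
```

Agents would then pass SLICE_REGISTRY["market"] instead of a bare integer, so a duplicate or out-of-order assignment fails at startup rather than corrupting slice_scores silently.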
Foundation In pipeline.py _save_checkpoint(), store a
model_version tag alongside the serialized object
(e.g. the Pydantic model's __name__ and a schema
hash). In _load_checkpoint(), validate the tag matches
the current model before returning the object; if
mismatched, discard the checkpoint and re-run the
slice. This prevents silent schema-mismatch bugs after
deploys.
Foundation In pipeline.py _save_checkpoint(), serialize Pydantic
models to dict (model.model_dump()) rather than
storing the live object, and deserialize with
model_validate() on load. This makes the checkpoint
boundary explicit and ensures the stored
representation is stable across process restarts and
model changes.
Op. Excellence Add a GitHub Actions workflow (e.g.,
.github/workflows/nightly-evals.yml) that runs `python
-m tests.evals.layer2` on a cron schedule and on every
PR targeting main. Fail the workflow if any eval case
scores below its minimum_top_fit_score threshold.
Op. Excellence Provide docs/slo.md as a committed artifact defining:
p50/p95 latency targets, success rate SLO, cost-per-
run budget, AWAF first-pass rate target, error budget
policy, and burn rate alerting thresholds. Reference
it from RUNBOOK.md with a relative link.
Op. Excellence Extend fire_alert() call sites to cover: (a)
agent.all_retries_exhausted events in agents/_base.py,
(b) hallucination rate exceeding a configurable
threshold (e.g., HALLUCINATION_ALERT_RATE_PCT env var)
checked in get_hallucination_stats(), and (c) circuit
breaker state transitions in
CircuitBreaker.record_failure() when state changes to
OPEN.
Op. Excellence Add a Railway deployment config or railway.toml that
pins replicas=1 and documents this constraint inline.
Consider adding a startup assertion in main.py that
logs a Critical warning if it detects it may be
running as a non-primary instance (e.g., via a Redis
lock or environment variable set by Railway).
Op. Excellence Provide docs/postmortem_template.md as a committed
artifact so the postmortem process is self-contained
in the repository and does not depend on an external
document that may not exist.
Security In utils/economic_index.py, add a runtime host
validation inside fetch_economic_index_briefing()
before the httpx.get() call: assert
urllib.parse.urlparse(_HF_API_URL).hostname ==
_HF_ALLOWED_HOST. Also add a response size cap (e.g.,
if len(resp.content) > 500_000: raise ValueError)
before passing the parsed content to _parse_hf_rows(),
preventing oversized injected payloads from reaching
the market agent system prompt.
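
Both checks can be small helpers called inside fetch_economic_index_briefing(). The allowed host value below is an assumption; the real constant lives in utils/economic_index.py:

```python
import urllib.parse

_HF_ALLOWED_HOST = "huggingface.co"  # assumption; actual constant is in economic_index.py
_MAX_RESPONSE_BYTES = 500_000

def validate_hf_url(url: str) -> str:
    """Re-check the host at request time, not only at module import."""
    host = urllib.parse.urlparse(url).hostname
    if host != _HF_ALLOWED_HOST:
        raise ValueError(f"Refusing request to unexpected host: {host!r}")
    return url

def cap_response(content: bytes) -> bytes:
    """Reject oversized payloads before they reach the market agent prompt."""
    if len(content) > _MAX_RESPONSE_BYTES:
        raise ValueError(f"Response too large: {len(content)} bytes")
    return content
```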
Security In main.py, replace the 8-character api_key prefix
used in idem_cache_key with a full HMAC-SHA256 of the
key (e.g.,
hashlib.sha256(x_api_key.encode()).hexdigest()) so the
idempotency cache namespace is cryptographically
isolated per key, eliminating any prefix collision
risk regardless of key length.
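
A sketch of the hardened cache key (the idem_cache_key name follows the finding's wording):

```python
import hashlib

def idem_cache_key(x_api_key: str, idempotency_key: str) -> str:
    """Namespace the idempotency cache by a full digest of the API key.

    The previous x_api_key[:8] prefix carries only 32 bits and can collide
    between legitimate keys; a full SHA-256 digest will not in practice.
    """
    key_ns = hashlib.sha256(x_api_key.encode()).hexdigest()
    return f"{key_ns}:{idempotency_key}"
```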
Security In utils/sanitize.py, change the 422 detail message
returned when sanitization raises ValueError from the
current generic string to a fixed, non-informative
message (e.g., 'Request could not be processed') that
does not vary based on whether the rejection was due
to injection detection, length, or repetition. This
prevents timing or message-based oracle attacks that
could help an attacker distinguish injection detection
from other validation failures.
Reliability Replace _global_daily_cost dict in agents/_base.py
with a Redis INCRBYFLOAT counter keyed by date (e.g.,
'glitched:daily_cost:2026-01-15'). This survives
restarts and is safe across multiple workers.
RUNBOOK.md already identifies this as the intended
fix.
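
A minimal sketch of the counter, written against any client that provides Redis INCRBYFLOAT/EXPIRE semantics (e.g. a redis.Redis instance); the key format and exception name follow the report's wording:

```python
import datetime

class GlobalBudgetExceededError(RuntimeError):
    pass

def daily_cost_key(day: datetime.date) -> str:
    return f"glitched:daily_cost:{day.isoformat()}"

def record_spend(r, amount_usd: float, budget_usd: float) -> float:
    """Record spend atomically; `r` is any client exposing INCRBYFLOAT/EXPIRE
    (e.g. redis.Redis(decode_responses=True))."""
    key = daily_cost_key(datetime.date.today())
    total = float(r.incrbyfloat(key, amount_usd))  # atomic across workers and restarts
    r.expire(key, 2 * 86400)                       # let old day-counters age out
    if total > budget_usd:
        raise GlobalBudgetExceededError(
            f"daily spend ${total:.2f} exceeds cap ${budget_usd:.2f}")
    return total
```

Because the increment happens server-side, a crash/redeploy mid-day resumes from the true running total, and N workers all see the same counter.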
Reliability Persist _checkpoint_store to Redis (or at minimum a
local SQLite file) so that in-flight pipeline state
survives process restarts. Key by trace_id with TTL
matching CHECKPOINT_TTL_S. This prevents double-spend
on retry after crash.
Reliability Persist circuit breaker state (failure_count, state,
last_opened_at) to Redis or a shared store so that a
restarted process inherits the degraded-API signal
rather than starting fresh. Alternatively, implement a
startup health check that probes Claude API before
accepting traffic.
Reliability Implement _parse_hf_rows() in utils/economic_index.py
or remove the live fetch path entirely and document
STATIC_BRIEFING as the intentional data source. The
current state (fetch attempted, always fails, silently
falls back) masks a broken integration and wastes an
HTTP call on every cache miss.
Performance Implement context summarization before passing profile
to downstream agents: in pipeline.py, after
run_market_agent completes, create a compact
ProfileSummary object containing only the fields each
downstream agent actually needs (PathAgent needs ~10
fields, PlanAgent needs ~8). Pass this summary instead
of the full UserProfile to reduce per-call input
tokens by an estimated 30-40% on verbose inputs. Add a
token budget check before each agent call.
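
A sketch of the projection step; the field names below are illustrative guesses at what downstream agents consume, since the real UserProfile schema lives in models.py:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ProfileSummary:
    # Hypothetical subset; the real field list comes from auditing what
    # PathAgent/PlanAgent prompts actually reference.
    sector: str
    seniority: str
    transferable_skills: tuple[str, ...]
    primary_goal: str

def summarize_profile(profile: dict) -> ProfileSummary:
    """Project the full profile down to only what downstream agents consume."""
    return ProfileSummary(
        sector=profile["sector"],
        seniority=profile["seniority"],
        transferable_skills=tuple(profile.get("transferable_skills", ())),
        primary_goal=profile["primary_goal"],
    )
```

Verbose fields such as the free-text profile summary simply never reach later prompts, which is where the estimated 30-40% input-token saving comes from.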
Performance Parallelize IntakeAgent and MarketAgent where
possible: refactor pipeline.py to start MarketAgent as
soon as IntakeAgent returns the base profile fields
(sector, seniority, skills, primary_goal). Since
MarketAgent does not depend on core_anxiety,
profile_summary, or stated_direction, it can begin
immediately. This would reduce the sequential critical
path by approximately one LLM round-trip (~5-8s at
p50).
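
A sketch of the overlap. Splitting IntakeAgent into a fast base-field extraction plus a slower completion step is an assumption about how the refactor could look; the stub coroutines stand in for real LLM calls:

```python
import asyncio

async def extract_base_fields(user_input: dict) -> dict:
    await asyncio.sleep(0.01)  # fast slice: sector/seniority/skills/goal
    return {"sector": "tech", "seniority": "junior"}

async def complete_intake(base: dict, user_input: dict) -> dict:
    await asyncio.sleep(0.02)  # slower slice: anxiety, summary, direction
    return {**base, "core_anxiety": "stagnation"}

async def run_market_agent(base: dict) -> dict:
    await asyncio.sleep(0.02)  # needs only the base fields
    return {"briefing_for": base["sector"]}

async def run_pipeline(user_input: dict) -> dict:
    base = await extract_base_fields(user_input)
    # The remaining intake work and MarketAgent now overlap instead of
    # running back-to-back, removing one round-trip from the critical path.
    profile, market = await asyncio.gather(
        complete_intake(base, user_input),
        run_market_agent(base),
    )
    return {"profile": profile, "market": market}
```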
Performance Add a fuzzy/semantic cache layer above the exact-match
cache: implement a lightweight embedding-based
similarity check (e.g., using a local sentence-
transformers model or a cheap embedding API call)
before the sha256 exact-match lookup in pipeline.py.
Cache hits on semantically equivalent inputs (>0.95
cosine similarity) would avoid full pipeline re-runs
for near-duplicate submissions, reducing both cost and
latency for common user patterns.
Cost Optim. Replace _global_daily_cost dict in agents/_base.py
with a Redis INCRBYFLOAT counter keyed by date (e.g.
'daily_cost:2026-01-15'). This survives process
restarts and is shared across all workers, making the
daily cap reliable. RUNBOOK.md already identifies this
as the intended fix.
Cost Optim. Add a Railway service-level replica cap via
railway.toml or the Railway dashboard API to enforce
single-worker constraint programmatically, rather than
relying on operator discipline documented in
RUNBOOK.md section 6.
Cost Optim. Add a max-size cap to _response_cache and
_idempotency_cache in pipeline.py and main.py (e.g.
evict oldest entry when len > 1000) to prevent
unbounded memory growth that could trigger OOM
restarts and reset the cost counter.
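
A size-capped LRU that the existing TTL-on-read expiry can layer on top of, sketched with collections.OrderedDict:

```python
from collections import OrderedDict

class BoundedCache:
    """Size-capped LRU stand-in for _response_cache / _idempotency_cache."""

    def __init__(self, max_entries: int = 1000):
        self.max_entries = max_entries
        self._entries: OrderedDict = OrderedDict()

    def get(self, key):
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)          # mark as recently used
        return self._entries[key]

    def set(self, key, value) -> None:
        self._entries[key] = value
        self._entries.move_to_end(key)
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)   # evict least recently used
```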
Sustainability Replace _parse_hf_rows in utils/economic_index.py with
a working implementation or remove the HF fetch
entirely and document the static briefing as
intentional. The current state wastes a network round-
trip on every cold start and logs a misleading
'fetch_failed' warning. If live data is not yet
needed, set _HF_API_URL = None and skip the fetch
unconditionally.
Sustainability Add a max-size eviction policy to _response_cache and
_idempotency_cache in pipeline.py and main.py
respectively. A simple LRU with a cap of ~500 entries
(e.g. using collections.OrderedDict or
cachetools.LRUCache) prevents unbounded memory growth
on long-running single-worker deployments.
Sustainability Document the cache reset behaviour on redeploy in
RUNBOOK.md section 6 (already covers the cost counter
gap). Add a note that _response_cache and
_checkpoint_store also reset, so a redeploy during a
high-traffic period will cause a temporary spike in
Claude API calls until the cache warms up. Consider a
brief post-deploy monitoring window for cost
anomalies.
Reasoning Integ. Add a low_confidence_fields: list[str] field to
AWAFMeta or UserProfile in models.py, populated by
intake_agent.py when inference_basis.sector or
inference_basis.seniority is 'inferred'. This surfaces
uncertainty to API consumers without changing the
contract for high-confidence responses.
Reasoning Integ. Implement the resolve_field_path and
evaluate_assert_fields stubs in tests/evals/layer2.py
so that assert_fields on EvalCase actually executes.
Without this, ADV-001 through ADV-003 are
documentation, not tests. Add a CI step that runs
layer2.py against at least the adversarial cases on
every PR.
Reasoning Integ. Persist reasoning traces to a durable backend (e.g.
write to a Railway-mounted volume or POST to
Langfuse/Braintrust) rather than relying solely on in-
process DEBUG logs. The prompt_hash already provides
the correlation key -- add a
trace_store.write(prompt_hash, reasoning_trace) call
in agents/_base.py after the thinking_blocks
extraction.
Controllability Add a confidence threshold gate in pipeline.py: if
path_confidence < 0.7, emit a 'paths.low_confidence'
event and either return the paths to the caller for
explicit selection (add a /select/{trace_id} endpoint)
or include a 'requires_confirmation' flag in
AnalyzeResponse so the frontend can prompt the user
before proceeding to plan generation.
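
A sketch of the gate; the field names (path_confidence, requires_confirmation) and the 0.7 threshold follow this recommendation's wording and are assumptions about the eventual API shape:

```python
PATH_CONFIDENCE_THRESHOLD = 0.7  # assumed cutoff from the recommendation

def gate_path_selection(paths: list[dict], path_confidence: float) -> dict:
    if path_confidence < PATH_CONFIDENCE_THRESHOLD:
        # Hand the choice back to the caller instead of auto-selecting path[0].
        return {
            "requires_confirmation": True,
            "path_confidence": path_confidence,
            "paths": paths,
        }
    return {"requires_confirmation": False, "selected": paths[0]}
```

The low-confidence branch is where the pipeline would emit the 'paths.low_confidence' event and defer plan generation until a /select/{trace_id} call arrives.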
Controllability Bind trace_id ownership to the API key that created
it. Store a mapping of trace_id -> key_prefix in
_checkpoint_store at pipeline start, and in the
/cancel handler verify that the requesting key_prefix
matches the owning key_prefix before calling
cancel_trace(). This prevents cross-user cancellation
with a shared key.
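
A sketch of the ownership check. It uses a full key digest rather than the 8-char prefix, consistent with the Security recommendation above; function names are hypothetical:

```python
import hashlib

_trace_owners: dict[str, str] = {}   # trace_id -> owner namespace

def _key_ns(x_api_key: str) -> str:
    # Full digest avoids the prefix-collision issue noted under Security.
    return hashlib.sha256(x_api_key.encode()).hexdigest()

def register_trace(trace_id: str, x_api_key: str) -> None:
    """Call at pipeline start, alongside the first checkpoint write."""
    _trace_owners[trace_id] = _key_ns(x_api_key)

def authorize_cancel(trace_id: str, x_api_key: str) -> bool:
    """The /cancel handler calls cancel_trace() only if this returns True."""
    owner = _trace_owners.get(trace_id)
    return owner is not None and owner == _key_ns(x_api_key)
```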
Context Integrity In pipeline.py _load_checkpoint(), add a content-hash
or version tag to each checkpoint entry. On resume,
compare the current economic_index briefing hash
against the one stored at checkpoint time; if they
differ, invalidate and re-run the MarketAgent slice
rather than replaying the stale profile. This prevents
stale market data from silently propagating through
resumed sessions.
Context Integrity In agents/_base.py, implement a context bounding
strategy between the 75% warning and 90% abort
thresholds. At 75% utilisation, truncate or summarize
the retry_hint to a fixed maximum (e.g. 200 chars) and
stop accumulating per-attempt logs in the session
state. This converts the hard abort into a graceful
degradation path and reduces the frequency of
ContextWindowExceededError 503s.
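
A sketch of the bounding step, assuming the 200-char retry_hint cap suggested above and a per-attempt log list in session state:

```python
_WARN_UTILISATION = 0.75
_ABORT_UTILISATION = 0.90
_MAX_RETRY_HINT_CHARS = 200   # cap suggested in the recommendation

class ContextWindowExceededError(RuntimeError):
    pass

def bound_context(utilisation: float, retry_hint: str, attempt_log: list[str]):
    """Graceful degradation between the 75% warning and the 90% abort."""
    if utilisation >= _ABORT_UTILISATION:
        raise ContextWindowExceededError(f"{utilisation:.0%} of context window used")
    if utilisation >= _WARN_UTILISATION:
        retry_hint = retry_hint[:_MAX_RETRY_HINT_CHARS]   # truncate, don't abort
        attempt_log = attempt_log[-1:]                    # keep only the latest attempt
    return retry_hint, attempt_log
```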
Context Integrity Replace the in-process _checkpoint_store dict in
pipeline.py with a Redis HSET keyed by trace_id with a
TTL matching _CHECKPOINT_TTL_S. This makes checkpoints
survive process restarts and Railway redeployments,
enabling true resume-from-last-slice semantics. The
RUNBOOK already identifies Redis as the intended
future fix for the analogous daily cost counter gap.
----------------------------------------
TO IMPROVE THIS ASSESSMENT
Add an integration test (e.g. tests/unit/test_agent_isolation.py) that calls
each agent function directly with a mock trace_id and asserts it returns the
correct Pydantic model without invoking any other agent. This would provide
verified evidence for the independence criterion.
Add a SliceRegistry with uniqueness validation at import time -- this would
convert the slice-boundary finding from Medium to resolved and upgrade the
tally to 15/16.
Provide .github/workflows/nightly-evals.yml or equivalent CI config showing
layer2.py is scheduled -- would upgrade eval scheduling from self_reported
to verified.
----------------------------------------
EVIDENCE GAPS
No architecture diagram or ADR documenting the intended slice boundary
contract -- would upgrade confidence for the slice-boundary criterion.
No test asserting that each agent can be called in isolation without the
pipeline orchestrator -- would verify the 'no structural dependency' claim
under test.
docs/slo.md not provided -- SLO targets, error budget policy, and burn rate
alerting rules unverifiable (Op. Excellence)
docs/postmortem_template.md not provided -- postmortem process structure
unverifiable (Op. Excellence)
CI/CD configuration not provided -- eval scheduling (layer2.py nightly runs)
unverifiable (Op. Excellence)
Observability dashboard configuration not provided -- aggregated metrics
visibility unverifiable (Op. Excellence)
railway.toml or deployment config not provided -- single-worker constraint
enforcement unverifiable (Op. Excellence)
No evidence of dependency scanning (Snyk, pip-audit, Dependabot) -- supply
chain risk for httpx, anthropic-sdk, pydantic is unassessed (Security
pillar)
No evidence of secrets scanning in CI (e.g., truffleHog, git-secrets) --
hardcoded credential risk in future commits is unmitigated (Security pillar)
No network egress policy evidence (Railway service networking config,
VPC/firewall rules) -- outbound calls to Anthropic API and HuggingFace are
unrestricted at the infrastructure layer (Security pillar)
No chaos engineering results or fault injection test evidence provided --
cannot verify that circuit breaker, checkpoint resume, and fallback paths
behave correctly under actual failure conditions (Reliability pillar)
No SLO compliance reports or uptime dashboards provided -- cannot verify
that p95 latency and success rate targets in docs/slo.md are actually being
met in production (Reliability pillar)
No evidence of load testing -- cannot assess whether single-worker Railway
constraint holds under concurrent request bursts (Reliability pillar)
docs/slo.md not provided -- SLO-1/SLO-2 targets referenced in RUNBOOK.md but
actual SLO document not available for verification; confidence remains
partial on latency SLO criterion
No latency dashboard or p50/p95 measurement data provided -- duration_ms is
logged but no aggregated performance data shown; cannot verify whether
current implementation meets stated SLO targets in practice
No token usage trend data provided -- cannot assess whether context
utilisation is growing over time or whether the 75% warning threshold is
being triggered in production
No external billing dashboard or cost trend chart provided -- cannot verify
that tracked costs match actual Anthropic invoice amounts (Cost Optimization
pillar: cost accuracy verification)
No evidence of ALERT_WEBHOOK_URL being set in the Railway deployment --
fire_alert() is a no-op if unset, meaning the 80% budget warning may never
fire in production (Cost Optimization pillar: alert delivery verification)
No energy or carbon reporting data provided -- cannot assess environmental
sustainability metrics (Sustainability pillar, environmental sub-criterion).
No cost trend data over time -- cannot verify whether efficiency is
improving across deployments (Sustainability pillar, efficiency trajectory
criterion).
No evidence of token budget tuning per agent (max_tokens is hardcoded at
1500 for all agents in call_claude) -- cannot confirm right-sizing at the
output token level, only at the model selection level.
No evidence of red-team or adversarial prompt injection testing against the
AWAF gate itself (would strengthen Reasoning Integrity confidence to
'verified' with no gaps)
Layer 2 LLM-as-judge rubrics (brand_voice, ha_authenticity, etc.) exist in
layer2.py but no sample run output or pass/fail history is provided --
cannot verify they execute successfully
No evidence of a /select or /confirm endpoint for human path selection --
controllability pillar gap
No runbook section covering how to pause a specific agent mid-execution
(only full pipeline cancel is documented) -- controllability pillar gap
No LangSmith, Langfuse, or equivalent context trace exports provided --
cannot verify runtime context window utilisation patterns in production
No memory architecture doc or vector DB config provided -- confirmed absent
by code review (in-process only)
HF dataset _parse_hf_rows raises NotImplementedError -- cannot assess how
live tool response outputs would be filtered if the live fetch path were
activated
----------------------------------------
Tokens: 931,371 in / 44,387 out
Estimated cost: $1.1343 USD
Generated: 2026-03-29 00:38