Created March 31, 2026 18:19
Latest glitched API report
| _ _ _ _ _ ___ | |
| /_\ | || || | /_\ | __| | |
| / _ \ | \/ \/ | / _ \ | _| | |
| /_/ \_\ \_/\_/ /_/ \_\ |_ Agent Well-Architected Framework | |
| AWAF Assessment: glitched-api | |
| AWAF v1.0 | 2026-03-29 | anthropic / claude-sonnet-4-6 | |
| ======================================== | |
| Overall Score: 89/100 -- Production Ready | |
| Fully ready. Variance within this band is noise. | |
| Scale: Production Ready 85-100 | Near Ready 70-84 | Needs Work 50-69 | |
| High Risk 25-49 | Not Ready 0-24 | |
| Foundation <40 = automatic FAIL. Tier 2 pillars carry 1.5x weight. | |
| ---------------------------------------- | |
| +======================+==========+==============+==============+=========+ | |
| | Pillar | Score | Progress | Confidence | Status | | |
| +======================+==========+==============+==============+=========+ | |
| | TIER 0 -- FOUNDATION | | |
| +----------------------+----------+--------------+--------------+---------+ | |
| | Foundation | 90/100 | [######### ] | verified | PASS | | |
| +======================+==========+==============+==============+=========+ | |
| | TIER 1 -- CLOUD WAF ADAPTED | | |
| +----------------------+----------+--------------+--------------+---------+ | |
| | Op. Excellence | 100/100 | [##########] | partial | | | |
| | Security | 86/100 | [######### ] | verified | | | |
| | Reliability | 96/100 | [##########] | partial | | | |
| | Performance | 83/100 | [######## ] | partial | | | |
| | Cost Optim. | 100/100 | [##########] | verified | | | |
| | Sustainability | 100/100 | [##########] | partial | | | |
| +======================+==========+==============+==============+=========+ | |
| | TIER 2 -- AGENT-NATIVE (1.5x weight) | | |
| +----------------------+----------+--------------+--------------+---------+ | |
| | Reasoning Integ. | 91/100 | [######### ] | verified | 1.5x | | |
| | Controllability | 83/100 | [######## ] | verified | 1.5x | | |
| | Context Integrity | 74/100 | [####### ] | verified | 1.5x | | |
| +======================+==========+==============+==============+=========+ | |
| ---------------------------------------- | |
| FILES ANALYZED: 24 files | |
| ---------------------------------------- | |
| FINDINGS (ordered by severity) | |
| [High ] Op. Excellence Eval scheduling not evidenced: | |
| tests/evals/layer2.py header says 'Run | |
| nightly or before deploys' but no CI/CD | |
| config (GitHub Actions workflow, Railway | |
| cron, or equivalent) is provided showing | |
| layer2.py is actually scheduled. The intent | |
| is documented but execution is unverified. | |
| [High ] Op. Excellence docs/slo.md and docs/postmortem_template.md | |
| are referenced in RUNBOOK.md but not | |
| provided as artifacts. SLO targets (p50 ≤ | |
| 25s, p95 ≤ 60s) appear only inline in | |
| RUNBOOK.md section 5, not in a canonical SLO | |
| document. Actual SLO thresholds, error | |
| budget policy, and burn rate alerting rules | |
| cannot be verified. | |
| [High ] Reliability Global daily cost counter | |
| (_global_daily_cost in agents/_base.py) is | |
| an in-process dict that resets on process | |
| restart and is not shared across workers. | |
| RUNBOOK.md explicitly documents this: 'a | |
| restart mid-day can allow spend beyond the | |
| daily cap for the remainder of that day.' | |
| This means the GlobalBudgetExceededError | |
| guard can be bypassed by a process | |
| crash/redeploy, creating a reliability gap | |
| in the cost-as-circuit-breaker pattern. | |
| [High ] Reliability Checkpoint store (_checkpoint_store in | |
| pipeline.py) is also in-process and lost on | |
| restart. A pipeline that checkpointed | |
| 'profile' and 'paths' but not | |
| 'plan'/'happiness' will restart from scratch | |
| after a process restart, re-spending tokens | |
| for already-completed slices and potentially | |
| re-triggering budget guards. | |
| [High ] Cost Optim. Global daily budget counter | |
| (_global_daily_cost in agents/_base.py) is | |
| in-process only and resets on process | |
| restart. RUNBOOK.md explicitly documents | |
| this: a mid-day restart allows spend up to | |
| GLOBAL_DAILY_BUDGET_USD again for the | |
| remainder of the day. With Railway single- | |
| worker constraint, a crash+restart can | |
| silently double the effective daily cap. | |
| [High ] Cost Optim. Global daily budget counter is not shared | |
| across workers. RUNBOOK.md documents that | |
| running N workers allows N × | |
| GLOBAL_DAILY_BUDGET_USD spend before any | |
| single worker fires the alert. Railway must | |
| be manually constrained to 1 replica -- | |
| there is no infrastructure-level enforcement | |
| of this constraint. | |
| [High ] Context Integrity Checkpoint replay does not validate | |
| staleness of resumed slice data beyond a TTL | |
| wall-clock check. In pipeline.py | |
| _load_checkpoint(), a profile or paths | |
| result stored up to _CHECKPOINT_TTL_S | |
| (3600s) ago is replayed verbatim into | |
| downstream agents. If the economic index | |
| briefing or market conditions changed | |
| between the original run and the resume, the | |
| MarketAgent output embedded in the | |
| checkpointed profile is stale but treated as | |
| current. No re-fetch or re-validation occurs | |
| on resume. | |
| [High ] Context Integrity No context pruning, summarization, or | |
| offloading before window saturation. | |
| _base.py aborts at 90% utilisation | |
| (ContextWindowExceededError) but takes no | |
| graceful bounding action before that | |
| threshold. Between the 75% warning and 90% | |
| abort, the pipeline continues accumulating | |
| tokens with no mitigation. For long retry | |
| chains, this means the abort is the only | |
| defense, resulting in a hard 503 rather than | |
| a graceful degradation. | |
| [Medium ] Foundation Slice boundary enforcement is informal: | |
| slice_num is passed as a plain integer to | |
| run_agent() and logged, but there is no | |
| typed SliceBoundary contract, no enforcement | |
| that slice numbers are unique or sequential, | |
| and no documentation of what each slice | |
| number means. If a new agent is added with a | |
| duplicate or out-of-order slice_num, the | |
| system will not detect it. Evidence: | |
| agents/_base.py run_agent() signature; all | |
| five agent files pass slice_num=1 through | |
| slice_num=4 without a registry. | |
| [Medium ] Foundation MarketAgent and IntakeAgent share | |
| slice_num=1 (both pass slice_num=1 to | |
| run_agent()), which means their per-slice | |
| log entries and slice_scores list entries | |
| are indistinguishable by slice number alone. | |
| This makes post-incident forensics harder | |
| and could cause confusion if slice_scores | |
| are used for SLO attribution. Evidence: | |
| agents/intake_agent.py line 'slice_num=1'; | |
| agents/market_agent.py line 'slice_num=1'. | |
| [Medium ] Foundation The _checkpoint_store in pipeline.py uses | |
| plain string keys ('profile', 'paths') with | |
| no versioning or schema validation on | |
| retrieval. A resumed pipeline could load a | |
| stale checkpoint whose Pydantic model shape | |
| has changed after a deploy, silently passing | |
| an incompatible object to downstream agents. | |
| Evidence: pipeline.py _save_checkpoint / | |
| _load_checkpoint; no model version tag on | |
| stored objects. | |
| [Medium ] Op. Excellence Alert coverage is incomplete: fire_alert() | |
| fires only on budget events (global daily | |
| exhaustion and 80% warning). No alerts are | |
| configured for agent retry exhaustion rate, | |
| hallucination rate spikes | |
| (get_hallucination_stats() is exposed on | |
| /metrics but no threshold-based alerting), | |
| circuit breaker state changes (logged but | |
| not alerted), or p95 latency breaches. | |
| RUNBOOK.md section 5 describes latency SLO | |
| but no corresponding alert fires. | |
| [Medium ] Op. Excellence Observability dashboard not evidenced: | |
| /metrics endpoint exposes | |
| hallucination_rates and | |
| circuit_breaker_state, and /cost exposes | |
| daily spend, but no dashboard (Datadog, | |
| Grafana, Railway metrics, CloudWatch) is | |
| shown aggregating these signals. Operators | |
| must poll endpoints manually rather than | |
| receiving proactive visibility. | |
| [Medium ] Op. Excellence Single-worker constraint is a documented | |
| operational risk (RUNBOOK.md section 6) with | |
| no automated enforcement: Railway replica | |
| count is not enforced via IaC or deployment | |
| config. A misconfigured scale-out would | |
| silently multiply the effective daily budget | |
| cap without any alert firing. | |
| [Medium ] Security HuggingFace API URL host is asserted once at | |
| module import time (utils/economic_index.py, | |
| _hf_parsed_host check) but the | |
| httpx.AsyncClient in | |
| fetch_economic_index_briefing() constructs | |
| the request from the module-level | |
| _HF_API_URL constant without runtime | |
| validation. If _HF_API_URL were mutated at | |
| runtime (e.g., via a monkey-patch in tests | |
| or a future dynamic config path), the host | |
| assertion would not re-fire. Additionally, | |
| no TLS certificate pinning or response size | |
| cap is applied to the HF fetch, meaning a | |
| compromised or spoofed HF response could | |
| inject arbitrarily large content into the | |
| market agent's system prompt. | |
| [Medium ] Security The in-process _response_cache and | |
| _idempotency_cache in pipeline.py store full | |
| AnalyzeResponse objects keyed by | |
| sha256(role+want+fear) and | |
| (api_key_prefix+idempotency_key) | |
| respectively. The api_key_prefix used as the | |
| idempotency cache key is only the first 8 | |
| characters of the raw API key | |
| (x_api_key[:8]). With a 64-character hex key | |
| space, 8 chars = 32 bits of prefix -- | |
| sufficient for collision resistance in low- | |
| volume use, but if the key space is shorter | |
| (e.g., a 16-char secret), prefix collisions | |
| between different legitimate keys could | |
| cause one user's cached response to be | |
| served to another. | |
| [Medium ] Security The _INJECTION_RE pattern in | |
| utils/sanitize.py replaces matched injection | |
| phrases with '[removed]' but does not re- | |
| validate the sanitized string for minimum | |
| length after replacement. A carefully | |
| crafted input that is exactly at | |
| _MIN_FIELD_LEN (10 chars) and consists | |
| entirely of an injection pattern would pass | |
| the pre-injection length check, be reduced | |
| to '[removed]' (9 chars), and then fail the | |
| post-sanitization length check -- raising | |
| ValueError and returning a 422. This is the | |
| correct outcome, but the error path leaks | |
| that the input was detected as an injection | |
| attempt (via the 422 detail message 'Input | |
| could not be processed'), which could aid an | |
| attacker in calibrating bypass attempts. | |
| [Medium ] Reliability The circuit breaker state (_circuit_breaker | |
| in agents/_base.py) is in-process and not | |
| persisted. After a process restart following | |
| a Claude API outage, the breaker resets to | |
| CLOSED with failure_count=0, potentially | |
| sending a burst of requests to a still- | |
| degraded API before the breaker re-opens. | |
| [Medium ] Reliability _parse_hf_rows() in utils/economic_index.py | |
| raises NotImplementedError unconditionally, | |
| meaning the live Anthropic Economic Index | |
| fetch always fails and falls back to | |
| STATIC_BRIEFING (Jan 2026 data). The | |
| fallback works correctly, but the live data | |
| path is permanently broken -- any market | |
| intelligence dependent on current data is | |
| silently stale. | |
| [Medium ] Performance No context pruning between pipeline slices: | |
| session total_tokens accumulates across all | |
| 4 agents (agents/_base.py | |
| _session_state['total_tokens']), but there | |
| is no mechanism to prune or summarize | |
| earlier agent outputs before passing them to | |
| later agents. The profile summary passed to | |
| PathAgent, PlanAgent, and HappinessAgent | |
| includes the full UserProfile and CareerPath | |
| objects verbatim. At 200k context, this is | |
| unlikely to cause truncation under normal | |
| inputs, but the context.window_warning | |
| threshold (75%) could be reached on verbose | |
| inputs with retries, and there is no | |
| proactive pruning to prevent it. | |
| [Medium ] Performance Intake and Market agents run sequentially | |
| despite partial independence: | |
| run_intake_agent and run_market_agent are | |
| called sequentially in pipeline.py (lines | |
| ~90-95). MarketAgent only needs | |
| profile.sector, profile.seniority, | |
| profile.transferable_skills, and | |
| profile.primary_goal -- all of which are | |
| available immediately after IntakeAgent | |
| completes. However, the current design | |
| awaits the full IntakeAgent result before | |
| starting MarketAgent, adding one full LLM | |
| round-trip to the critical path | |
| unnecessarily. | |
| [Medium ] Performance In-process response cache is exact-match | |
| only with no semantic deduplication: | |
| pipeline.py cache key is | |
| sha256(role+want+fear), meaning any | |
| character difference (capitalization, | |
| punctuation, extra space) produces a cache | |
| miss and a full pipeline re-run. Users who | |
| resubmit with minor edits (e.g., fixing a | |
| typo) will incur full cost. No semantic | |
| similarity layer exists. This is noted as a | |
| known limitation but has measurable cost | |
| efficiency impact. | |
| [Medium ] Cost Optim. Idempotency cache (_idempotency_cache in | |
| main.py) and response cache (_response_cache | |
| in pipeline.py) are both in-process with no | |
| eviction beyond TTL expiry. Under sustained | |
| load, these dicts grow unboundedly in | |
| memory, which could cause OOM restarts -- | |
| which in turn reset the daily cost counter | |
| (see finding 1), creating a cost-control | |
| gap. | |
| [Medium ] Sustainability In-process caches (_response_cache, | |
| _idempotency_cache, _checkpoint_store, | |
| _global_daily_cost) are not persisted across | |
| process restarts. RUNBOOK.md explicitly | |
| acknowledges this for the daily cost counter | |
| (section 6), but the same constraint applies | |
| to the response cache and checkpoints. A | |
| Railway redeploy mid-day silently resets all | |
| caches, eliminating deduplication benefits | |
| and potentially allowing re-spend on | |
| identical inputs. | |
| [Medium ] Sustainability The economic index briefing (_cache in | |
| utils/economic_index.py) uses a 24-hour in- | |
| memory TTL but falls back to a static Jan | |
| 2026 snapshot when the HF fetch fails or | |
| _parse_hf_rows raises NotImplementedError | |
| (which it always does -- the function | |
| unconditionally raises). This means every | |
| process start fetches the static fallback, | |
| and the 24h cache only persists within a | |
| single process lifetime. The HF fetch is | |
| effectively dead code, wasting ~5s of | |
| network latency per cold start. | |
| [Medium ] Sustainability No cache eviction policy exists for | |
| _response_cache or _idempotency_cache beyond | |
| TTL expiry on read. Under sustained load, | |
| these dicts grow unboundedly in memory until | |
| process restart. A long-running single- | |
| worker process (as mandated by RUNBOOK.md | |
| section 6) could accumulate significant | |
| memory pressure over days. | |
| [Medium ] Reasoning Integ. Uncertainty is captured internally | |
| (inference_basis logged in intake_agent.py, | |
| contradiction detection in market_agent.py) | |
| but never surfaced to API consumers. | |
| AnalyzeResponse has no field indicating low- | |
| confidence inferences (e.g. seniority | |
| inferred from thin evidence). A client | |
| receiving a 'Junior' seniority for EC-001's | |
| vague input has no signal that this was | |
| inferred, not stated. | |
| [Medium ] Reasoning Integ. Reasoning traces (thinking_blocks) are | |
| logged at DEBUG level only and are in- | |
| process -- they are not persisted to a | |
| durable store or tracing backend (no | |
| Langfuse, Arize, or equivalent). Post- | |
| incident forensics require the Railway log | |
| buffer to still contain the relevant trace, | |
| which is not guaranteed for incidents | |
| discovered hours later. | |
| [Medium ] Reasoning Integ. Layer 2 evals (tests/evals/layer2.py) | |
| contain unfinished stubs -- the | |
| resolve_field_path and | |
| evaluate_assert_fields functions are | |
| described in docstrings but the | |
| implementation bodies are absent (file shows | |
| docstring-only stubs). | |
| ADV-001/ADV-002/ADV-003 assert_fields will | |
| not actually execute in CI, leaving | |
| argument-accuracy regressions undetected. | |
| [Medium ] Controllability No human-in-the-loop checkpoint before the | |
| pipeline auto-selects path[0] and proceeds | |
| to generate a 90-day action plan. The | |
| path_confidence annotation is computed and | |
| logged but never surfaced to a human | |
| approver -- the pipeline proceeds | |
| unconditionally even when confidence is low | |
| (e.g. 0.8 when fit gap < 10 points). | |
| pipeline.py lines 97-110. | |
| [Medium ] Controllability The /cancel endpoint is unauthenticated | |
| beyond the shared API_SECRET_KEY -- any | |
| holder of the key can cancel any trace_id, | |
| including traces belonging to other users. | |
| There is no per-user or per-session | |
| ownership check on cancellation. main.py | |
| lines 148-155. | |
| [Medium ] Context Integrity In-process checkpoint store | |
| (_checkpoint_store in pipeline.py) is not | |
| durably persisted. A process restart (crash, | |
| Railway redeploy) silently discards all in- | |
| flight checkpoints. A user whose pipeline | |
| was mid-execution during a restart receives | |
| no resume benefit and must restart from | |
| slice 1, re-spending tokens. The RUNBOOK | |
| documents this gap but no mitigation is | |
| implemented. | |
| ---------------------------------------- | |
| RECOMMENDATIONS | |
| Foundation Define a SliceRegistry (e.g. an Enum or dict in | |
| agents/_base.py) mapping agent names to unique slice | |
| numbers, and validate at import time that no two | |
| agents share a slice_num. Assign intake=1, market=2, | |
| path=3, plan=4, happiness=5 (or similar) and update | |
| all agent files accordingly. This makes slice | |
| boundaries explicit and detectable. | |
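A minimal sketch of such a registry using `enum.unique`, which fails the import itself if two agents ever share a slice number. The member names and the 1-5 assignment follow the report's suggestion; the real mapping is the maintainers' call:

```python
from enum import IntEnum, unique


@unique  # raises ValueError at import time if two members share a value
class Slice(IntEnum):
    """Hypothetical registry for agents/_base.py; one member per agent."""
    INTAKE = 1
    MARKET = 2
    PATH = 3
    PLAN = 4
    HAPPINESS = 5
```

Agent files would then pass `slice_num=Slice.MARKET` instead of a bare integer, so a duplicate or out-of-order assignment is caught at import rather than discovered during forensics.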
| Foundation In pipeline.py _save_checkpoint(), store a | |
| model_version tag alongside the serialized object | |
| (e.g. the Pydantic model's __name__ and a schema | |
| hash). In _load_checkpoint(), validate the tag matches | |
| the current model before returning the object; if | |
| mismatched, discard the checkpoint and re-run the | |
| slice. This prevents silent schema-mismatch bugs after | |
| deploys. | |
| Foundation In pipeline.py _save_checkpoint(), serialize Pydantic | |
| models to dict (model.model_dump()) rather than | |
| storing the live object, and deserialize with | |
| model_validate() on load. This makes the checkpoint | |
| boundary explicit and ensures the stored | |
| representation is stable across process restarts and | |
| model changes. | |
| Op. Excellence Add a GitHub Actions workflow (e.g., | |
| .github/workflows/nightly-evals.yml) that runs `python | |
| -m tests.evals.layer2` on a cron schedule and on every | |
| PR targeting main. Fail the workflow if any eval case | |
| scores below its minimum_top_fit_score threshold. | |
| Op. Excellence Provide docs/slo.md as a committed artifact defining: | |
| p50/p95 latency targets, success rate SLO, cost-per- | |
| run budget, AWAF first-pass rate target, error budget | |
| policy, and burn rate alerting thresholds. Reference | |
| it from RUNBOOK.md with a relative link. | |
| Op. Excellence Extend fire_alert() call sites to cover: (a) | |
| agent.all_retries_exhausted events in agents/_base.py, | |
| (b) hallucination rate exceeding a configurable | |
| threshold (e.g., HALLUCINATION_ALERT_RATE_PCT env var) | |
| checked in get_hallucination_stats(), and (c) circuit | |
| breaker state transitions in | |
| CircuitBreaker.record_failure() when state changes to | |
| OPEN. | |
| Op. Excellence Add a Railway deployment config or railway.toml that | |
| pins replicas=1 and documents this constraint inline. | |
| Consider adding a startup assertion in main.py that | |
| logs a Critical warning if it detects it may be | |
| running as a non-primary instance (e.g., via a Redis | |
| lock or environment variable set by Railway). | |
| Op. Excellence Provide docs/postmortem_template.md as a committed | |
| artifact so the postmortem process is self-contained | |
| in the repository and does not depend on an external | |
| document that may not exist. | |
| Security In utils/economic_index.py, add a runtime host | |
| validation inside fetch_economic_index_briefing() | |
| before the httpx.get() call: assert | |
| urllib.parse.urlparse(_HF_API_URL).hostname == | |
| _HF_ALLOWED_HOST. Also add a response size cap (e.g., | |
| if len(resp.content) > 500_000: raise ValueError) | |
| before passing the parsed content to _parse_hf_rows(), | |
| preventing oversized injected payloads from reaching | |
| the market agent system prompt. | |
| Security In main.py, replace the 8-character api_key prefix | |
| used in idem_cache_key with a full SHA-256 digest of the | |
| key (e.g., | |
| hashlib.sha256(x_api_key.encode()).hexdigest()) so the | |
| idempotency cache namespace is cryptographically | |
| isolated per key, eliminating any prefix collision | |
| risk regardless of key length. | |
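A sketch of the hardened derivation. The two-argument `idem_cache_key` signature is an assumption; only the `x_api_key[:8]` behaviour it replaces is taken from the report:

```python
import hashlib


def idem_cache_key(x_api_key: str, idempotency_key: str) -> str:
    # Namespace by a digest of the whole key rather than x_api_key[:8], so
    # two distinct keys sharing a prefix can never collide.
    key_ns = hashlib.sha256(x_api_key.encode()).hexdigest()
    return f"{key_ns}:{idempotency_key}"
```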
| Security In utils/sanitize.py, change the 422 detail message | |
| returned when sanitization raises ValueError from the | |
| current generic string to a fixed, non-informative | |
| message (e.g., 'Request could not be processed') that | |
| does not vary based on whether the rejection was due | |
| to injection detection, length, or repetition. This | |
| prevents timing or message-based oracle attacks that | |
| could help an attacker distinguish injection detection | |
| from other validation failures. | |
| Reliability Replace _global_daily_cost dict in agents/_base.py | |
| with a Redis INCRBYFLOAT counter keyed by date (e.g., | |
| 'glitched:daily_cost:2026-01-15'). This survives | |
| restarts and is safe across multiple workers. | |
| RUNBOOK.md already identifies this as the intended | |
| fix. | |
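The counter could look like the sketch below. `DailyCostCounter` is a hypothetical name; `incrbyfloat` and `expire` are real redis-py client methods, and any object exposing them (including a test stub) can back it. The real code would raise GlobalBudgetExceededError where this raises RuntimeError:

```python
import datetime


class DailyCostCounter:
    """Date-keyed spend counter backed by any client with incrbyfloat/expire."""

    def __init__(self, redis_client, budget_usd: float,
                 prefix: str = "glitched:daily_cost"):
        self.r = redis_client
        self.budget_usd = budget_usd
        self.prefix = prefix

    def _key(self) -> str:
        return f"{self.prefix}:{datetime.date.today().isoformat()}"

    def add(self, usd: float) -> float:
        # Atomic increment: survives restarts and is shared across workers.
        total = self.r.incrbyfloat(self._key(), usd)
        self.r.expire(self._key(), 172_800)  # keep at most two days of keys
        if total > self.budget_usd:
            raise RuntimeError(f"Global daily budget exceeded: ${total:.2f}")
        return total
```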
| Reliability Persist _checkpoint_store to Redis (or at minimum a | |
| local SQLite file) so that in-flight pipeline state | |
| survives process restarts. Key by trace_id with TTL | |
| matching CHECKPOINT_TTL_S. This prevents double-spend | |
| on retry after crash. | |
| Reliability Persist circuit breaker state (failure_count, state, | |
| last_opened_at) to Redis or a shared store so that a | |
| restarted process inherits the degraded-API signal | |
| rather than starting fresh. Alternatively, implement a | |
| startup health check that probes Claude API before | |
| accepting traffic. | |
| Reliability Implement _parse_hf_rows() in utils/economic_index.py | |
| or remove the live fetch path entirely and document | |
| STATIC_BRIEFING as the intentional data source. The | |
| current state (fetch attempted, always fails, silently | |
| falls back) masks a broken integration and wastes an | |
| HTTP call on every cache miss. | |
| Performance Implement context summarization before passing profile | |
| to downstream agents: in pipeline.py, after | |
| run_market_agent completes, create a compact | |
| ProfileSummary object containing only the fields each | |
| downstream agent actually needs (PathAgent needs ~10 | |
| fields, PlanAgent needs ~8). Pass this summary instead | |
| of the full UserProfile to reduce per-call input | |
| tokens by an estimated 30-40% on verbose inputs. Add a | |
| token budget check before each agent call. | |
| Performance Parallelize IntakeAgent and MarketAgent where | |
| possible: refactor pipeline.py to start MarketAgent as | |
| soon as IntakeAgent returns the base profile fields | |
| (sector, seniority, skills, primary_goal). Since | |
| MarketAgent does not depend on core_anxiety, | |
| profile_summary, or stated_direction, it can begin | |
| immediately. This would reduce the sequential critical | |
| path by approximately one LLM round-trip (~5-8s at | |
| p50). | |
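One way to overlap the two calls is an `asyncio.Future` that IntakeAgent resolves as soon as the base fields exist; the agent bodies below are stand-ins for the real LLM calls, and the field values are illustrative:

```python
import asyncio


async def run_intake_agent(raw: dict, base_ready: asyncio.Future) -> dict:
    # Resolve the base fields early, then finish the slower inference work
    # (core_anxiety, profile_summary, ...).
    base = {"sector": raw["sector"], "seniority": "mid",
            "transferable_skills": ["writing"], "primary_goal": "growth"}
    base_ready.set_result(base)
    await asyncio.sleep(0.02)            # remaining LLM work
    return {**base, "core_anxiety": "stagnation"}


async def run_market_agent(base_ready: asyncio.Future) -> dict:
    base = await base_ready              # starts as soon as base fields exist
    await asyncio.sleep(0.02)            # market LLM call
    return {"briefing": f"outlook for {base['sector']}"}


async def pipeline(raw: dict) -> tuple[dict, dict]:
    base_ready: asyncio.Future = asyncio.get_running_loop().create_future()
    # Overlap the market call with the tail of the intake call.
    profile, market = await asyncio.gather(
        run_intake_agent(raw, base_ready),
        run_market_agent(base_ready),
    )
    return profile, market
```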
| Performance Add a fuzzy/semantic cache layer above the exact-match | |
| cache: implement a lightweight embedding-based | |
| similarity check (e.g., using a local sentence- | |
| transformers model or a cheap embedding API call) | |
| before the sha256 exact-match lookup in pipeline.py. | |
| Cache hits on semantically equivalent inputs (>0.95 | |
| cosine similarity) would avoid full pipeline re-runs | |
| for near-duplicate submissions, reducing both cost and | |
| latency for common user patterns. | |
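A sketch of such a layer with a pluggable `embed` callable (a sentence-transformers model or embedding API in production; any vector-returning function works for illustration). The class name and linear scan are illustrative, not a production index:

```python
import hashlib
import math


class SemanticCache:
    """Exact-match sha256 lookup with a cosine-similarity fallback."""

    def __init__(self, embed, threshold: float = 0.95):
        self.embed = embed
        self.threshold = threshold
        self.exact: dict[str, object] = {}
        self.fuzzy: list[tuple[list[float], object]] = []

    @staticmethod
    def _cos(a, b) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    def get(self, text: str):
        key = hashlib.sha256(text.encode()).hexdigest()
        if key in self.exact:
            return self.exact[key]       # cheap path: exact sha256 hit
        vec = self.embed(text)
        for cached_vec, value in self.fuzzy:
            if self._cos(vec, cached_vec) >= self.threshold:
                return value             # near-duplicate: skip pipeline re-run
        return None

    def put(self, text: str, value) -> None:
        key = hashlib.sha256(text.encode()).hexdigest()
        self.exact[key] = value
        self.fuzzy.append((self.embed(text), value))
```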
| Cost Optim. Replace _global_daily_cost dict in agents/_base.py | |
| with a Redis INCRBYFLOAT counter keyed by date (e.g. | |
| 'daily_cost:2026-01-15'). This survives process | |
| restarts and is shared across all workers, making the | |
| daily cap reliable. RUNBOOK.md already identifies this | |
| as the intended fix. | |
| Cost Optim. Add a Railway service-level replica cap via | |
| railway.toml or the Railway dashboard API to enforce | |
| single-worker constraint programmatically, rather than | |
| relying on operator discipline documented in | |
| RUNBOOK.md section 6. | |
| Cost Optim. Add a max-size cap to _response_cache and | |
| _idempotency_cache in pipeline.py and main.py (e.g. | |
| evict oldest entry when len > 1000) to prevent | |
| unbounded memory growth that could trigger OOM | |
| restarts and reset the cost counter. | |
| Sustainability Replace _parse_hf_rows in utils/economic_index.py with | |
| a working implementation or remove the HF fetch | |
| entirely and document the static briefing as | |
| intentional. The current state wastes a network round- | |
| trip on every cold start and logs a misleading | |
| 'fetch_failed' warning. If live data is not yet | |
| needed, set _HF_API_URL = None and skip the fetch | |
| unconditionally. | |
| Sustainability Add a max-size eviction policy to _response_cache and | |
| _idempotency_cache in pipeline.py and main.py | |
| respectively. A simple LRU with a cap of ~500 entries | |
| (e.g. using collections.OrderedDict or | |
| cachetools.LRUCache) prevents unbounded memory growth | |
| on long-running single-worker deployments. | |
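A bounded cache along these lines, using `collections.OrderedDict`; the class name and defaults are illustrative:

```python
import collections
import time


class TTLLRUCache:
    """Bounded TTL cache: expiry on read (the existing behaviour) plus
    size-based LRU eviction on write (the new bound)."""

    def __init__(self, maxsize: int = 500, ttl_s: float = 3600.0):
        self.maxsize, self.ttl_s = maxsize, ttl_s
        self._d: collections.OrderedDict = collections.OrderedDict()

    def get(self, key):
        item = self._d.get(key)
        if item is None:
            return None
        value, stored_at = item
        if time.monotonic() - stored_at > self.ttl_s:
            del self._d[key]             # TTL expiry on read, as before
            return None
        self._d.move_to_end(key)         # mark as recently used
        return value

    def put(self, key, value) -> None:
        self._d[key] = (value, time.monotonic())
        self._d.move_to_end(key)
        while len(self._d) > self.maxsize:
            self._d.popitem(last=False)  # evict least recently used
```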
| Sustainability Document the cache reset behaviour on redeploy in | |
| RUNBOOK.md section 6 (already covers the cost counter | |
| gap). Add a note that _response_cache and | |
| _checkpoint_store also reset, so a redeploy during a | |
| high-traffic period will cause a temporary spike in | |
| Claude API calls until the cache warms up. Consider a | |
| brief post-deploy monitoring window for cost | |
| anomalies. | |
| Reasoning Integ. Add a low_confidence_fields: list[str] field to | |
| AWAFMeta or UserProfile in models.py, populated by | |
| intake_agent.py when inference_basis.sector or | |
| inference_basis.seniority is 'inferred'. This surfaces | |
| uncertainty to API consumers without changing the | |
| contract for high-confidence responses. | |
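A sketch of the proposed surface. The report's models are Pydantic; a dataclass stands in here, and the `inference_basis` dict shape ('stated' / 'inferred' values) is an assumption about intake_agent.py internals:

```python
from dataclasses import dataclass, field


@dataclass
class AWAFMeta:
    """Stand-in for the models.py class; only the new field is shown."""
    low_confidence_fields: list[str] = field(default_factory=list)


def annotate_low_confidence(meta: AWAFMeta, inference_basis: dict) -> AWAFMeta:
    # inference_basis maps field name -> 'stated' | 'inferred' (assumed shape).
    meta.low_confidence_fields = sorted(
        name for name, basis in inference_basis.items() if basis == "inferred"
    )
    return meta
```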
| Reasoning Integ. Implement the resolve_field_path and | |
| evaluate_assert_fields stubs in tests/evals/layer2.py | |
| so that assert_fields on EvalCase actually executes. | |
| Without this, ADV-001 through ADV-003 are | |
| documentation, not tests. Add a CI step that runs | |
| layer2.py against at least the adversarial cases on | |
| every PR. | |
| Reasoning Integ. Persist reasoning traces to a durable backend (e.g. | |
| write to a Railway-mounted volume or POST to | |
| Langfuse/Braintrust) rather than relying solely on in- | |
| process DEBUG logs. The prompt_hash already provides | |
| the correlation key -- add a | |
| trace_store.write(prompt_hash, reasoning_trace) call | |
| in agents/_base.py after the thinking_blocks | |
| extraction. | |
| Controllability Add a confidence threshold gate in pipeline.py: if | |
| path_confidence < 0.7, emit a 'paths.low_confidence' | |
| event and either return the paths to the caller for | |
| explicit selection (add a /select/{trace_id} endpoint) | |
| or include a 'requires_confirmation' flag in | |
| AnalyzeResponse so the frontend can prompt the user | |
| before proceeding to plan generation. | |
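The gate might look like the following sketch; the function name and return shape are illustrative, since the report leaves the exact endpoint design open:

```python
def gate_path_selection(paths: list[dict], path_confidence: float,
                        threshold: float = 0.7) -> dict:
    # Below the threshold, stop and hand the choice back to the caller
    # instead of auto-selecting paths[0] and generating a 90-day plan.
    if path_confidence < threshold:
        return {"requires_confirmation": True, "paths": paths}
    return {"requires_confirmation": False, "selected": paths[0]}
```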
| Controllability Bind trace_id ownership to the API key that created | |
| it. Store a mapping of trace_id -> key_prefix in | |
| _checkpoint_store at pipeline start, and in the | |
| /cancel handler verify that the requesting key_prefix | |
| matches the owning key_prefix before calling | |
| cancel_trace(). This prevents cross-user cancellation | |
| with a shared key. | |
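A minimal in-process sketch of the ownership check. The helper names are hypothetical; in the report's design the mapping would live in _checkpoint_store rather than a separate dict:

```python
_trace_owners: dict[str, str] = {}   # trace_id -> owning key_prefix


def register_trace(trace_id: str, key_prefix: str) -> None:
    # Called at pipeline start, before any slice runs.
    _trace_owners[trace_id] = key_prefix


def authorize_cancel(trace_id: str, key_prefix: str) -> bool:
    # The /cancel handler proceeds only when the requesting key owns the trace.
    return _trace_owners.get(trace_id) == key_prefix
```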
| Context Integrity In pipeline.py _load_checkpoint(), add a content-hash | |
| or version tag to each checkpoint entry. On resume, | |
| compare the current economic_index briefing hash | |
| against the one stored at checkpoint time; if they | |
| differ, invalidate and re-run the MarketAgent slice | |
| rather than replaying the stale profile. This prevents | |
| stale market data from silently propagating through | |
| resumed sessions. | |
| Context Integrity In agents/_base.py, implement a context bounding | |
| strategy between the 75% warning and 90% abort | |
| thresholds. At 75% utilisation, truncate or summarize | |
| the retry_hint to a fixed maximum (e.g. 200 chars) and | |
| stop accumulating per-attempt logs in the session | |
| state. This converts the hard abort into a graceful | |
| degradation path and reduces the frequency of | |
| ContextWindowExceededError 503s. | |
| Context Integrity Replace the in-process _checkpoint_store dict in | |
| pipeline.py with a Redis HSET keyed by trace_id with a | |
| TTL matching _CHECKPOINT_TTL_S. This makes checkpoints | |
| survive process restarts and Railway redeployments, | |
| enabling true resume-from-last-slice semantics. The | |
| RUNBOOK already identifies Redis as the intended | |
| future fix for the analogous daily cost counter gap. | |
| ---------------------------------------- | |
| TO IMPROVE THIS ASSESSMENT | |
| Add an integration test (e.g. tests/unit/test_agent_isolation.py) that calls | |
| each agent function directly with a mock trace_id and asserts it returns the | |
| correct Pydantic model without invoking any other agent. This would provide | |
| verified evidence for the independence criterion. | |
| Add a SliceRegistry with uniqueness validation at import time -- this would | |
| convert the slice-boundary finding from Medium to resolved and upgrade the | |
| tally to 15/16. | |
| Provide .github/workflows/nightly-evals.yml or equivalent CI config showing | |
| layer2.py is scheduled -- would upgrade eval scheduling from self_reported | |
| to verified. | |
----------------------------------------
EVIDENCE GAPS
- No architecture diagram or ADR documenting the intended slice boundary
  contract -- would upgrade confidence for the slice-boundary criterion.
- No test asserting that each agent can be called in isolation without
  the pipeline orchestrator -- would verify the 'no structural
  dependency' claim under test.
- docs/slo.md not provided -- SLO targets, error budget policy, and burn
  rate alerting rules unverifiable (Op. Excellence).
- docs/postmortem_template.md not provided -- postmortem process
  structure unverifiable (Op. Excellence).
- CI/CD configuration not provided -- eval scheduling (layer2.py nightly
  runs) unverifiable (Op. Excellence).
- Observability dashboard configuration not provided -- aggregated
  metrics visibility unverifiable (Op. Excellence).
- railway.toml or deployment config not provided -- single-worker
  constraint enforcement unverifiable (Op. Excellence).
- No evidence of dependency scanning (Snyk, pip-audit, Dependabot) --
  supply chain risk for httpx, anthropic-sdk, and pydantic is unassessed
  (Security pillar).
- No evidence of secrets scanning in CI (e.g. truffleHog, git-secrets)
  -- hardcoded credential risk in future commits is unmitigated
  (Security pillar).
- No network egress policy evidence (Railway service networking config,
  VPC/firewall rules) -- outbound calls to the Anthropic API and
  HuggingFace are unrestricted at the infrastructure layer (Security
  pillar).
- No chaos engineering results or fault injection test evidence -- cannot
  verify that circuit breaker, checkpoint resume, and fallback paths
  behave correctly under actual failure conditions (Reliability pillar).
- No SLO compliance reports or uptime dashboards -- cannot verify that
  the p95 latency and success rate targets in docs/slo.md are actually
  being met in production (Reliability pillar).
- No evidence of load testing -- cannot assess whether the single-worker
  Railway constraint holds under concurrent request bursts (Reliability
  pillar).
- docs/slo.md not provided -- SLO-1/SLO-2 targets are referenced in
  RUNBOOK.md, but the SLO document itself is unavailable for
  verification; confidence remains partial on the latency SLO criterion.
- No latency dashboard or p50/p95 measurement data -- duration_ms is
  logged, but no aggregated performance data is shown; cannot verify
  whether the current implementation meets stated SLO targets in
  practice.
- No token usage trend data -- cannot assess whether context utilisation
  is growing over time or whether the 75% warning threshold is being
  triggered in production.
- No external billing dashboard or cost trend chart -- cannot verify
  that tracked costs match actual Anthropic invoice amounts (Cost
  Optimization pillar: cost accuracy verification).
- No evidence of ALERT_WEBHOOK_URL being set in the Railway deployment
  -- fire_alert() is a no-op if unset, meaning the 80% budget warning
  may never fire in production (Cost Optimization pillar: alert delivery
  verification).
- No energy or carbon reporting data -- cannot assess environmental
  sustainability metrics (Sustainability pillar, environmental
  sub-criterion).
- No cost trend data over time -- cannot verify whether efficiency is
  improving across deployments (Sustainability pillar, efficiency
  trajectory criterion).
- No evidence of token budget tuning per agent (max_tokens is hardcoded
  at 1500 for all agents in call_claude) -- cannot confirm right-sizing
  at the output token level, only at the model selection level.
- No evidence of red-team or adversarial prompt injection testing
  against the AWAF gate itself (would strengthen Reasoning Integrity
  confidence to 'verified' with no gaps).
- Layer 2 LLM-as-judge rubrics (brand_voice, ha_authenticity, etc.)
  exist in layer2.py, but no sample run output or pass/fail history is
  provided -- cannot verify they execute successfully.
- No evidence of a /select or /confirm endpoint for human path selection
  -- controllability pillar gap.
- No runbook section covering how to pause a specific agent
  mid-execution (only full pipeline cancel is documented) --
  controllability pillar gap.
- No LangSmith, Langfuse, or equivalent context trace exports -- cannot
  verify runtime context window utilisation patterns in production.
- No memory architecture doc or vector DB config -- confirmed absent by
  code review (in-process only).
- HF dataset _parse_hf_rows raises NotImplementedError -- cannot assess
  how live tool response outputs would be filtered if the live fetch
  path were activated.
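The ALERT_WEBHOOK_URL gap is cheap to close with a startup guard along these lines. A sketch under stated assumptions: fire_alert() and the ALERT_WEBHOOK_URL variable come from the report, while the guard function, its name, and the strict-mode behaviour are illustrative.

```python
import logging
import os

log = logging.getLogger(__name__)


def check_alerting_configured(strict: bool = False) -> bool:
    """Verify at startup that budget alerts can actually be delivered.

    Returns True when ALERT_WEBHOOK_URL is set. When it is unset,
    either logs a loud warning or (in strict mode) refuses to start,
    so fire_alert() cannot silently become a no-op in production.
    """
    if os.environ.get("ALERT_WEBHOOK_URL"):
        return True
    msg = "ALERT_WEBHOOK_URL is unset: fire_alert() will be a silent no-op"
    if strict:
        # fail the deployment rather than lose the 80% budget warning
        raise RuntimeError(msg)
    log.warning(msg)
    return False
```

Calling this once during application boot (strict=True in production, strict=False in local development) turns the "alert delivery verification" gap into a property the deployment itself enforces.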
----------------------------------------
Tokens: 931,371 in / 44,387 out
Estimated cost: $1.1343 USD
Generated: 2026-03-29 00:38