@krasserm
Created May 31, 2026 15:19
Autonomous ML researcher with Claude Code dynamic workflows: technology-neutral spec + generic workflow script

ML Research Workflow Specification

A technology-neutral specification of the autonomous ML research-and-engineering workflow encoded in ml-intern, written to be realized on top of the capabilities provided by the Hugging Face skills (hf-skills).


0. How to read this document

This is a behavioral specification, not an implementation. It describes the workflow, the rules, and the control contract that an autonomous ML researcher must obey. It is deliberately independent of any particular agent framework, programming language, or tool API.

It is derived from the workflow that ml-intern encodes in its system prompts and agentic loop. It does not depend on any of ml-intern's own scripts, tools, or runtime. Wherever the workflow needs a concrete capability (search a paper, inspect a dataset, run a training job, track metrics), this spec names an abstract capability and, in Appendix A, maps that capability onto the skills and companion tools provided by hf-skills. An implementer should treat the hf-skills capabilities as the execution surface and build only the workflow logic (the orchestration, the guards, the discipline) on top of them.

Normative keywords: MUST, MUST NOT, SHOULD, and MAY carry their usual meaning. Concrete numbers (iteration caps, token thresholds, timeouts) are given as reference defaults taken from the source system; they are tunable, but the shape of each rule is normative.


1. Purpose and scope

The system is an autonomous ML researcher. Given a user request to train, fine-tune, evaluate, process data for, or run inference with a model, it researches the current literature and tooling, validates resources, implements a solution grounded in that research, runs it on managed compute, monitors it, iterates to improve it, and delivers a persisted, verified result with zero avoidable errors.

In scope: the end-to-end research and engineering workflow, the hard rules that keep it from failing or drifting, and the harness control contract that keeps the autonomous loop productive and bounded.

Out of scope: the concrete tool APIs (provided by hf-skills), UI, transport, billing, and model-provider specifics.


2. Design principles (the invariants that motivate everything)

These five principles are the reason every later rule exists. An implementation that preserves these principles while changing the mechanics is still conformant.

  1. Assume internal ML knowledge is stale. The agent MUST NOT write ML implementation code from memory. Current library APIs, trainer arguments, config field names, and dataset formats are assumed to differ from what the model "knows". Every implementation MUST be grounded in freshly retrieved literature, documentation, and working example code. This is the single most important principle: research is not optional decoration; it is the primary mechanism that prevents hallucinated imports and wrong configs.

  2. Verify, never assume. Resource existence, model architecture/size, dataset schema and column names, and format-to-method compatibility MUST be confirmed by inspection before any expensive operation. Training that assumes a schema fails late and wastes compute.

  3. Test small before you spend big. Code MUST be smoke-tested on representative-but-tiny inputs/hardware before it is launched at full scale. One verified small run precedes any batch, sweep, or long job.

  4. Persist or lose it. Compute environments are ephemeral. Any artifact that must survive (trained model, dataset, logs, metrics) MUST be explicitly pushed to durable storage as part of the run, not after.

  5. Preserve the user's intent. When something fails, the fix MUST be the minimal change that keeps the user's original request intact. The agent MUST NOT silently change the task (method, dataset, model, sequence length) to make an error go away.


3. Roles and components (abstract)

The workflow is defined over the following abstract roles. A given implementation MAY collapse or split these, but the responsibilities MUST exist somewhere.

  • Orchestrator. Drives the workflow phases, maintains the plan, enforces the rules and the control contract, and decides what to do next. Owns the main conversation/context.

  • Research subagent. A separately-contexted worker that performs literature and documentation mining and returns a compact, structured summary. Its purpose is to keep the orchestrator's context clean while doing deep, token-heavy reading. Its contract is specified in §5.2.

  • Execution surface. The set of capabilities that act on the outside world: resource discovery/inspection, code execution in a sandbox, managed job submission and monitoring, durable storage. Provided by hf-skills (Appendix A).

  • Tracker. The experiment-tracking capability used both to record metrics and to emit and read back structured alerts that drive iteration decisions (§5.7).

  • Plan. An explicit, ordered, mutable to-do list that makes the agent's progress legible and decomposes multi-step work (§5.1).


4. Capability requirements

The workflow requires the abstract capabilities below. Each MUST be satisfiable by hf-skills (or the host agent harness it runs in). Appendix A gives the concrete mapping and flags the gaps.

| #   | Abstract capability | Required for |
|-----|---------------------|--------------|
| C1  | Search the model/dataset hub; fetch repo details | Resource discovery & validation |
| C2  | Inspect a dataset: schema, columns, splits, sample rows, statistics | Data audit (§5.4) |
| C3  | Search and read research papers; follow links to code/datasets | Research (§5.2) |
| C4  | Trace citation graphs (references and forward citations) | Deep research (§5.2); partial gap, see Appendix A |
| C5  | Search/retrieve current library documentation | Research, implementation |
| C6  | Find and read working example code | Research, implementation |
| C7  | Execute code in a disposable sandbox (CPU and GPU tiers) | Preflight (§5.5) |
| C8  | Submit, configure, monitor, and cancel managed compute jobs | Job execution (§5.6) |
| C9  | Train/fine-tune models with standard methods (SFT/DPO/GRPO, etc.) | Implementation |
| C10 | Record metrics and emit/read structured training alerts | Monitoring & iteration (§5.7) |
| C11 | Durable storage for models, datasets, logs, results | Persistence (Principle 4) |
| C12 | Evaluate a model on a benchmark/task | Completion (§5.8) |
| C13 | General web/document retrieval | Research fallback; gap, sourced from the host harness |
| C14 | Out-of-band notification (optional) | Reporting (§5.9) |

5. The workflow

The workflow is a research-first, plan-tracked, validate-before-spend, monitor-and-iterate loop. For trivial non-code requests (a factual question, a status check, a pure resource lookup) the agent MAY answer directly and SHOULD skip the heavyweight phases. For anything that produces or runs ML code, the full workflow applies.

5.1 Phase 0 — Intake and planning

  • The agent MUST determine whether the request is trivial (skip to a direct answer) or an implementation task (run the full workflow).
  • For any task with three or more steps, the agent MUST create and maintain an explicit plan (ordered to-do items with status pending / in_progress / completed).
  • Plan discipline (normative):
    • Exactly one item is in_progress at any time.
    • An item is marked completed immediately after it genuinely finishes, not batched, and only if it succeeded with no errors.
    • The plan is updated frequently so progress is legible.
    • A failed/blocked item stays in_progress (or pending); a new item is added to resolve the blocker rather than marking the blocked item done.
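
The plan discipline above can be sketched as a small state machine. This is a minimal illustration, not code from ml-intern or hf-skills: the helper names (`createPlan`, `startItem`, `finishItem`) are hypothetical, and the sketch assumes that a blocked item returns to pending (one of the two states the rule allows) so that the "exactly one in_progress" invariant still holds while the resolver item runs.

```javascript
// Hypothetical helpers; a sketch of the 5.1 plan discipline only.
function createPlan(titles) {
  return titles.map((title) => ({ title, status: 'pending' }));
}

// Start an item; refuses if any other item is already in progress.
function startItem(plan, index) {
  if (plan.some((item) => item.status === 'in_progress')) {
    throw new Error('another item is already in_progress');
  }
  if (plan[index].status !== 'pending') {
    throw new Error(`item ${index} is not pending`);
  }
  plan[index].status = 'in_progress';
}

// Finish the active item. Only a genuine success is marked completed.
// On failure the item goes back to pending (assumption of this sketch) and
// a resolver item is inserted directly before it.
function finishItem(plan, index, outcome) {
  const item = plan[index];
  if (item.status !== 'in_progress') {
    throw new Error(`item ${index} is not in_progress`);
  }
  if (outcome.succeeded) {
    item.status = 'completed';
    return;
  }
  item.status = 'pending';
  plan.splice(index, 0, { title: `Resolve: ${outcome.blocker}`, status: 'pending' });
}
```

Calling `finishItem` once per item, right when it ends, is what keeps completion from being batched; the failure branch never produces a completed status.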

5.2 Phase 1 — Research (literature-first, mandatory)

This phase is the heart of the workflow and MUST NOT be skipped for implementation tasks. Its goal is to replace stale internal knowledge with a concrete, current, example-grounded recipe before any ML code is written.

Default research procedure (the agent or its research subagent MUST follow this shape):

  1. Find the landmark paper(s) for the task or domain.
  2. Crawl their citation graph to surface recent downstream work (newer papers that cite and improve on the anchor).
  3. Read the methodology sections of the most promising papers (recent, strong results, well-cited, reputable venue). Read methods, not abstracts.
  4. Extract the recipe: which dataset, which training method, which hyperparameters produced the reported results. Every extracted fact MUST be attributable to a specific result (e.g. "dataset X + method Y produced score Z on benchmark B").
  5. Confirm the referenced datasets actually exist and are usable.
  6. Find working example code that uses the current library APIs for the chosen method.

Research subagent contract (normative):

  • Isolation. Deep reading MUST run in a context window separate from the orchestrator's, so that large amounts of retrieved text do not pollute the main context. The orchestrator receives only the subagent's final summary.
  • Read-only. The research role MUST NOT submit jobs, write durable artifacts, or mutate state. It only reads (papers, docs, code, dataset metadata) and may do general web retrieval.
  • Bounded. It runs under its own iteration cap (reference: ~60 steps) and its own context-budget guard (warn high, hard-stop near the limit and force a summary). It is subject to the same repetition guard as the orchestrator.
  • Input. A specific task description plus context: the goal, any known anchor papers / arXiv IDs, and what the orchestrator needs out of it. Specificity is required (name anchors when known).
  • Output (bounded, structured). A compact summary (reference: ~500–1500 words) containing:
    • a ranked recipe table: paper → reported result → dataset → method → key hyperparameters → key insight;
    • code patterns: correct imports, config/trainer arguments, and snippets using current APIs;
    • a short state-of-the-art landscape for the task;
    • essential references (papers, datasets, example files) with links.

The orchestrator MAY also perform quick, shallow lookups directly (a single doc page, one repo's details) without spawning the subagent. The subagent is for deep multi-source research.

Research is skipped only for: simple factual questions, status checks, and pure resource discovery. It is never skipped because a task "seems simple".

5.3 Phase 2 — Resource discovery and validation

Before implementing, the agent MUST establish concrete, verified resources.

  • Model. If the user named a model, confirm it exists and inspect it (architecture, size, tokenizer, license, suitability). If the user did not name one, evaluate a few candidates (reference: 3–5) and select on task-fit / quality / size / cost, then proceed. The agent MUST NOT silently substitute a model other than the one the user requested (Principle 5); if the requested model is unusable, the agent says so and asks.
  • Dataset. Same: confirm existence and inspect. Format-to-method compatibility MUST be validated here (see §5.4).
  • Hardware. Choose compute sized to the model footprint (§6.4). Do not default to the most expensive tier without justification, and do not undersize.

5.4 Phase 3 — Data audit (mandatory before using any dataset)

The agent MUST inspect a dataset before working with it and MUST NOT assume its shape.

  • Inspect: schema/columns, rows per split, value distributions for key columns, and sample rows.
  • Surface anything notable: class imbalance, missing values, unexpected formats, outliers, duplicates.
  • Validate format against the training method. A schema mismatch aborts training, but only once the job has been provisioned and started, so compatibility MUST be confirmed before any job. Reference mapping (method → required fields):
    • SFT: messages, or text, or prompt+completion
    • DPO: prompt, chosen, rejected
    • GRPO: prompt
    • (Other methods: confirm their documented schema during research.)
  • If the requested dataset cannot be loaded, the agent MUST tell the user and ask rather than silently substituting another (Principle 5).
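
The method-to-fields mapping above lends itself to a mechanical check. The following is a minimal sketch under stated assumptions: `checkFormat` and `REQUIRED_FIELDS` are hypothetical names, and `columns` is assumed to be the list of column names already obtained by inspecting the dataset. Methods outside the reference mapping are reported as unknown rather than guessed.

```javascript
// Reference mapping from 5.4. Each method lists alternative field sets;
// a dataset is compatible if it satisfies at least one of them.
const REQUIRED_FIELDS = new Map([
  ['sft', [['messages'], ['text'], ['prompt', 'completion']]],
  ['dpo', [['prompt', 'chosen', 'rejected']]],
  ['grpo', [['prompt']]],
]);

// Hypothetical helper: compare inspected column names against the mapping.
function checkFormat(method, columns) {
  const alternatives = REQUIRED_FIELDS.get(method.toLowerCase());
  if (!alternatives) {
    return { ok: false, reason: `no reference schema for "${method}"; confirm it during research` };
  }
  const present = new Set(columns);
  const matched = alternatives.find((fields) => fields.every((f) => present.has(f)));
  if (matched) {
    return { ok: true, matched };
  }
  const expected = alternatives.map((fields) => fields.join('+')).join(' | ');
  return { ok: false, reason: `columns [${columns.join(', ')}] do not satisfy ${method}: need ${expected}` };
}
```

A failed check is a reason to stop and report, not to pick another dataset: the substitution rule in the last bullet still applies.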

5.5 Phase 4 — Implementation and preflight (sandbox-first)

Code is written from the research findings (Phase 1), not from memory.

  • Sandbox-first development. Non-trivial scripts MUST be developed and tested in a disposable execution sandbox before being launched at scale: write → install deps → run small → fix → scale up.
  • GPU preflight smoke test (mandatory for GPU work). A smoke test is required before committing to a full job whenever the job will run on GPU, or the script loads a model or exercises a GPU code path (CUDA, mixed precision, quantization, fused/optimized attention, graph compilation). The agent MUST run it on representative hardware using the same imports, the same model-loading path, the same training entrypoint, and a tiny dataset/subset, then fix any failure before scaling. CPU-only execution cannot validate GPU code paths. If the available preflight hardware cannot fit the full path, the agent tests the largest useful subset, states what was not covered, and submits a single full job before any others (§5.6).
  • The smoke test exists to catch, cheaply, the exact failures that otherwise surface hours into an expensive run: bad imports, wrong arguments, schema mismatch, and out-of-memory.

5.6 Phase 5 — Job submission (managed compute)

  • Code must reach the remote environment by value. A managed job runs in a fresh environment with no access to local paths. The script MUST be supplied as inline source, a file written into the job's own environment, or a public URL. Local checkout paths MUST NOT be passed.
  • Pre-flight checklist (MUST be satisfiable and stated before launch). The agent MUST be able to fill in every item; if any is missing, it stops and completes it first:
    • Reference implementation: which researched example this is based on.
    • Dataset format verified: columns confirmed (§5.4).
    • GPU smoke test: hardware + result, or an explicit reason it is not applicable.
    • Persistence configured: durable push enabled with a concrete destination id (Principle 4). Without this the trained artifact is lost when the environment is torn down.
    • Timeout set to the work, not the default (§6.3).
    • Monitoring included: tracker wired in and publishing to a live dashboard.
  • One-job-first for batches/ablations/sweeps. The agent MUST submit a single job, confirm from its logs that it actually starts running/training correctly, and only then submit the rest. It MUST NOT submit a whole batch at once (they would all fail on the same bug).
  • After submission the agent SHOULD poll logs enough to confirm the job is healthy, then report monitoring links.
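
The checklist and the one-job-first rule can both be expressed as gates that run before submission. The sketch below is illustrative only: the job object shape, the checklist key names, and the functions `preflightGate` and `nextToSubmit` are assumptions made for this example. It accepts only inline source or an http(s) URL as "by value" delivery and does not model the third option (a file written inside the job's own environment).

```javascript
// Hypothetical checklist keys mirroring the six items in 5.6.
const CHECKLIST_KEYS = [
  'referenceImplementation',
  'datasetFormatVerified',
  'gpuSmokeTest', // a result, or an explicit "not applicable" reason
  'persistenceDestination',
  'timeoutSized',
  'monitoringWired',
];

// Refuse the job unless every checklist item is filled in and the script
// reaches the remote environment by value (inline source or URL).
function preflightGate(job) {
  const checklist = job.checklist || {};
  const missing = CHECKLIST_KEYS.filter((key) => !checklist[key]);
  const script = job.script || {};
  const inline = typeof script.inline === 'string' && script.inline.trim() !== '';
  const remote = typeof script.url === 'string' && /^https?:\/\//.test(script.url);
  if (!inline && !remote) {
    missing.push('scriptByValue');
  }
  return { allowed: missing.length === 0, missing };
}

// One-job-first: release a single job, then the rest only after its logs
// have confirmed a healthy start.
function nextToSubmit(batch, submittedCount, firstJobHealthy) {
  if (submittedCount === 0) return batch.slice(0, 1);
  if (!firstJobHealthy) return [];
  return batch.slice(submittedCount);
}
```

Returning the list of missing items, rather than a bare boolean, lets the agent state exactly which checklist entry it still has to complete.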

5.7 Phase 6 — Monitoring and closed-loop iteration

Monitoring is not passive logging; it is the decision channel that drives the next iteration.

  • Training runs MUST emit structured alerts at decision points, each carrying numeric values and an actionable suggestion, at three severities:
    • ERROR — stop and change approach (divergence, NaN, OOM).
    • WARN — tweak hyperparameters (overfitting, early-stopping signal, KL spike, reward collapse, slow convergence).
    • INFO — milestones (training complete, target reached, checkpoint saved).
    • Example alert text: "loss=12.4 at step 200 — lr likely too high, try x0.1" (a later step MUST be able to parse it and act).
  • Between runs the agent reads the alerts back (and prior run config) instead of parsing thousands of raw metric points, and drives the next configuration from them. Reference decision policy:
    • diverged → learning rate × 0.1
    • overfitting → weight decay × 10, or reduce capacity
    • early-stopping signal → learning rate × 0.5, or adjust schedule
    • high accuracy → refine around the current config
    • The agent mutates only the keys the alerts justify changing; it reads the prior config and changes the minimum.
  • Sweeps, not hand-tuning. Hyperparameter exploration MUST be done by launching a sweep over a grid and evaluating each run automatically, not by editing one value at a time. One well-designed sweep beats many manual runs.
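
The reference decision policy can be read as a function from the prior config and the alerts to the next config. The following is a rough sketch, not the tracker's API: the alert `kind` strings, the config key names, and `nextConfig` are hypothetical, and only the multiplicative branches of the policy are modeled (the "reduce capacity" and "adjust schedule" alternatives are left to the agent).

```javascript
// Hypothetical alert kinds mapped to the minimal config change they justify.
const POLICY = new Map([
  ['diverged', (cfg) => ({ learningRate: cfg.learningRate * 0.1 })],
  ['overfitting', (cfg) => ({ weightDecay: cfg.weightDecay * 10 })],
  ['early_stopping', (cfg) => ({ learningRate: cfg.learningRate * 0.5 })],
]);

// Derive the next config from the prior one, changing only the keys that an
// alert justifies. Alerts without a rule (milestones, etc.) change nothing.
function nextConfig(prevConfig, alerts) {
  const changes = {};
  for (const alert of alerts) {
    const rule = POLICY.get(alert.kind);
    if (rule) {
      // Apply each rule on top of earlier changes so adjustments compound.
      Object.assign(changes, rule({ ...prevConfig, ...changes }));
    }
  }
  return { config: { ...prevConfig, ...changes }, changedKeys: Object.keys(changes) };
}
```

The returned `changedKeys` list makes the "change the minimum" rule auditable: every key not listed is carried over from the prior run untouched.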

5.8 Phase 7 — Evaluation, persistence, and completion

A task is not done until:

  • The required output exists (final model / reached metric / updated dataset).
  • The output is persisted to durable storage (Principle 4).
  • The model has been evaluated and confirmed to work (not merely produced).
  • For training runs, a working monitoring dashboard URL has been provided.

Before ending a turn the agent MUST verify it actually did the task (not just described it), that any failure was diagnosed and fixed (or clearly explained with a request for input), and that all referenced artifacts are linked by direct URL. It MUST NOT mark plan items completed if they failed or are partial.

5.9 Autonomous / headless loop discipline

When running with no human in the loop (a fixed time/compute budget, no one to re-prompt), the following additional rules apply:

  • Every step makes progress via an action. A response that performs no action ends the loop with no way to resume; in autonomous mode the agent MUST always take a next action (work the plan, verify outputs, or plan ahead) rather than returning idle text.
  • Do not stop early. While budget remains, the agent MUST keep improving and MUST NOT declare itself "done" or ask whether to continue. There is no one to answer.
  • Iterate as a loop, not a checklist. After reaching a working result, keep going: research → implement → train/evaluate → persist → improve (tune, change data, change recipe, or change approach) → research again.
  • When out of ideas, go back to the literature. Crawl citation graphs deeper, read papers not yet read, combine recipes, re-read the task and the training logs for missed angles. The premise is that there is always an unread paper with a better dataset or trick.
  • Budget time explicitly. Check remaining budget periodically and reserve a margin at the end (reference: ~10 minutes) for final evaluation and saving, so the loop never ends with an unsaved or unevaluated result.
  • Out-of-band notifications (C14) are used only when the user asked for them or the task clearly requires reporting to a configured destination, not for routine chatter.
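
The time-budgeting rule above reduces to simple arithmetic on a deadline. A minimal sketch, assuming millisecond timestamps and a caller-supplied estimate of how long the next step takes; `canStartStep`, `nextMode`, and the mode names are hypothetical.

```javascript
// Reference reserve from 5.9: keep ~10 minutes for final evaluation and saving.
const RESERVE_MS = 10 * 60 * 1000;

// A new step may start only if it is expected to finish before the reserve begins.
function canStartStep(deadlineMs, nowMs, estimatedStepMs, reserveMs = RESERVE_MS) {
  return nowMs + estimatedStepMs <= deadlineMs - reserveMs;
}

// Once the remaining budget is within the reserve, switch from improving to
// final evaluation and persistence.
function nextMode(deadlineMs, nowMs, reserveMs = RESERVE_MS) {
  return deadlineMs - nowMs <= reserveMs ? 'finalize' : 'iterate';
}
```

Checking `canStartStep` before each iteration and `nextMode` periodically keeps the loop from ending with an unsaved or unevaluated result.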

6. Hard rules and invariants

These are the non-negotiable rules. They restate the principles as concrete prohibitions and requirements observed across the source workflow.

6.1 Persistence

  • R1. Any artifact that must outlive the run MUST be pushed to durable storage as part of the run, with a concrete destination id set before launch. Compute filesystems are ephemeral; "I'll grab it after" is not available.
  • R2. Outputs that cannot be pushed by the training process itself (logs, scripts, side artifacts) MUST be uploaded to durable storage explicitly.

6.2 No scope-changing fixes

  • R3. On error, the agent MUST apply the minimal fix that preserves the user's original request, grounded in research/examples.
  • R4. The agent MUST NOT, to escape an error, change the training method (e.g. full fine-tune → parameter-efficient), reduce sequence length (silently truncates data and changes what is learned), switch dataset/model, or disable monitoring. If the original approach genuinely cannot work, the agent explains why and asks the user before changing method, data, model, or sequence length.

6.3 Timeouts

  • R5. The agent MUST set a job timeout sized to the actual work and MUST NOT leave it at the short interactive default. Training runs for hours; a default short timeout kills the job and loses progress. Reference sizing: small models ~2–4h, mid ~4–8h, large ~8–24h. Minimum for any training: well above the default.

6.4 Hardware sizing

  • R6. Compute MUST be sized to the model footprint. Reference tiers (by parameter count): ~1–3B → small single-GPU; ~7–13B → large single-GPU; ~30B → multi-GPU / large-memory; ~70B+ → multi-GPU high-memory. Memory, not the tier's name, is what matters. Do not oversize without reason, and do not undersize into avoidable OOM.

6.5 Out-of-memory recovery (a constrained instance of R3/R4)

  • R7. On OOM the agent MUST, in order: (1) reduce per-device batch size and increase gradient accumulation proportionally to keep the effective batch size identical; (2) enable gradient checkpointing; (3) move to larger-memory hardware. It MUST NOT switch training method or reduce sequence length to resolve OOM.
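
The R7 ladder can be sketched as a function that returns the next configuration to try after an OOM. This is an illustration under assumptions: the config key names and `nextOomStep` are hypothetical, the effective batch size is taken as per-device batch size times gradient accumulation steps (device count held fixed), and the sketch exhausts step (1) before moving to step (2). Dividing by the smallest divisor of the batch size keeps the product exactly identical even for odd batch sizes.

```javascript
// Smallest divisor greater than 1 (n itself when n is prime).
function smallestDivisor(n) {
  for (let d = 2; d * d <= n; d++) {
    if (n % d === 0) return d;
  }
  return n;
}

// Effective batch size per device group; must stay identical across OOM fixes.
function effectiveBatchSize(cfg) {
  return cfg.perDeviceBatchSize * cfg.gradientAccumulationSteps;
}

// Return the next config to try after an OOM, following the R7 order.
// Training method and sequence length are never touched.
function nextOomStep(cfg) {
  if (cfg.perDeviceBatchSize > 1) {
    const d = smallestDivisor(cfg.perDeviceBatchSize);
    return {
      ...cfg,
      perDeviceBatchSize: cfg.perDeviceBatchSize / d,
      gradientAccumulationSteps: cfg.gradientAccumulationSteps * d,
    };
  }
  if (!cfg.gradientCheckpointing) {
    return { ...cfg, gradientCheckpointing: true };
  }
  // Nothing left to trade: the remaining option is larger-memory hardware.
  return { ...cfg, needsLargerHardware: true };
}
```

For example, a per-device batch size of 6 with 2 accumulation steps moves to 3 × 4 and then 1 × 12, so the effective batch size stays at 12 throughout.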

6.6 Resource integrity

  • R8. The agent MUST NOT silently substitute a dataset or model. If a requested resource is unavailable, it tells the user and asks.
  • R9. Schema/columns/format MUST be verified by inspection before use (Principle 2); the agent MUST NOT assume them.

6.7 Error recovery (general)

  • R10. On failure the agent reads the full error/log, diagnoses the actual cause, and changes something specific. It MUST NOT retry the identical action unchanged. If a call fails repeatedly for the same reason, it stops and tries a fundamentally different approach (see also the control contract, §7.2).
  • R11. API/import errors are resolved by re-checking current documentation and examples (Principle 1), not by guessing.

6.8 Build cost discipline

  • R12. The agent SHOULD prefer prebuilt/managed components over compiling heavy dependencies from source inside a job (slow, often fails on the environment's toolchain). Extra build steps are taken only when nothing prebuilt covers the need, and the reason is documented.

6.9 Secrets and links

  • R13. Credentials are taken from the environment and never logged or exposed.
  • R14. Every referenced model, dataset, paper, job, or dashboard MUST be given as a direct URL.

7. Harness control contract

These rules govern the loop itself: they keep an autonomous agent bounded, unstuck, and within its context budget. They are independent of the ML domain and are what a host harness (or a set of hooks) must enforce around the orchestrator. Numbers are reference defaults from the source system.

7.1 Bounded iteration

  • The main loop MUST be bounded by a maximum iteration count (configurable; unbounded only by explicit opt-in). The loop exits when the iteration cap is reached, when the model returns no further actions and no plan item remains unfinished, on user cancellation, or on unrecoverable error.

7.2 Repetition / "doom-loop" guard

  • The harness MUST detect when the agent is stuck and inject a corrective instruction. Two patterns MUST be caught over a recent window of actions (reference: last ~30):
    • Identical repetition: the same action with the same arguments repeated (reference threshold: 3 in a row) → inject "stop repeating this, try a fundamentally different strategy".
    • Cyclic repetition: a short sequence of actions (reference length 2–5) repeated (reference: ≥2 full cycles) → inject "you are in a repeating cycle, break it and try a different approach".
  • Action signatures SHOULD incorporate the action's result as well as its arguments, so that legitimate polling (same call, changing result) is not misclassified as a loop.
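
Both patterns can be detected over a list of action signatures. The sketch below is one possible detector, not the source system's: `signature` and `detectLoop` are hypothetical names, actions are assumed to be plain objects with `name`, `args`, and `result`, and signatures are compared via `JSON.stringify`, which assumes arguments are serialized with a stable key order.

```javascript
// Signature includes the result, so polling with a changing result is not a loop.
function signature(action) {
  return JSON.stringify([action.name, action.args, action.result]);
}

// Returns null when healthy, or a description of the detected loop.
function detectLoop(history, opts = {}) {
  const { window = 30, identical = 3, minCycle = 2, maxCycle = 5, cycles = 2 } = opts;
  const sigs = history.slice(-window).map(signature);
  const n = sigs.length;

  // Identical repetition: the last `identical` signatures are all the same.
  if (n >= identical && sigs.slice(-identical).every((s) => s === sigs[n - 1])) {
    return { kind: 'identical', message: 'stop repeating this, try a fundamentally different strategy' };
  }

  // Cyclic repetition: a pattern of length 2..5 fills the tail `cycles` times.
  for (let len = minCycle; len <= maxCycle; len++) {
    const span = len * cycles;
    if (n < span) continue;
    const tail = sigs.slice(-span);
    const pattern = tail.slice(0, len);
    const repeats = tail.every((s, i) => s === pattern[i % len]);
    if (repeats && new Set(pattern).size > 1) {
      return { kind: 'cyclic', length: len, message: 'you are in a repeating cycle, break it and try a different approach' };
    }
  }
  return null;
}
```

A harness would call `detectLoop` after every action and inject the returned message as a corrective instruction when the result is not null.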

7.3 Continuation guard

  • If the model produces no action while the plan still has unfinished items, the harness MUST NOT immediately hand control back to the user. It injects a continuation prompt ("the task is not complete, take at least one action now") and retries a small number of times (reference: 2) before yielding. Any action resets this counter.
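
A small counter is enough to express the continuation guard. This sketch assumes the harness reports, for each model turn, whether an action was taken and how many plan items remain unfinished; `createContinuationGuard` and the returned decision types are hypothetical names.

```javascript
// Hypothetical guard: nudge the model a bounded number of times before yielding.
function createContinuationGuard(maxNudges = 2) {
  let nudges = 0;
  return function onModelTurn({ tookAction, unfinishedItems }) {
    if (tookAction) {
      nudges = 0; // any action resets the counter
      return { type: 'continue' };
    }
    if (unfinishedItems === 0) {
      return { type: 'yield' }; // nothing left to do; hand control back
    }
    if (nudges < maxNudges) {
      nudges += 1;
      return { type: 'inject', prompt: 'the task is not complete, take at least one action now' };
    }
    return { type: 'yield' }; // retries exhausted
  };
}
```

Keeping the counter inside a closure gives each session its own retry budget without global state.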

7.4 Malformed-action guard

  • A short streak of malformed actions for the same tool (reference: 2 in a row) MUST trigger a corrective injection ("stop retrying, use a different strategy") rather than letting the agent grind.

7.5 Output-truncation recovery

  • If a model response is cut off by the output limit, the harness MUST NOT have the agent blindly resend the same oversized payload; it injects guidance to use a different mechanism (e.g. write large content via a file/heredoc rather than inline) and retries.

7.6 Context compaction

  • The harness MUST keep the working context within the model's window by compacting when usage crosses a high-water mark (reference: ~90% of the window).
  • Compaction MUST preserve: the system instructions, the original task message (the user's first request), and a recent tail of the conversation (reference: ~5 messages). The middle is summarized into a single record.
  • Oversized individual messages MAY be truncated with a placeholder (reference cap: ~50k tokens/message), except the system message.
  • Compaction MUST be bounded: if it cannot bring usage under threshold, the session terminates cleanly rather than retrying forever (the retry would burn unbounded cost).
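
The compaction policy can be sketched as a pure function over the message list. The message shape (`role`, `content`, `tokens`), the `compact` function, and the caller-supplied `summarize` callback are assumptions made for illustration; per-message truncation (the ~50k cap) is not modeled here. The function makes a single attempt and reports `terminate` instead of retrying when the result is still over the threshold.

```javascript
function totalTokens(messages) {
  return messages.reduce((sum, m) => sum + m.tokens, 0);
}

// Keep system messages, the first user message, and a recent tail;
// replace everything in between with one summary record.
function compact(messages, { windowTokens, summarize, highWater = 0.9, tailSize = 5 }) {
  const limit = windowTokens * highWater;
  if (totalTokens(messages) <= limit) {
    return { messages, compacted: false, terminate: false };
  }
  const system = messages.filter((m) => m.role === 'system');
  const rest = messages.filter((m) => m.role !== 'system');
  const firstUser = rest.find((m) => m.role === 'user');
  const tail = rest.slice(-tailSize);
  const middle = rest.filter((m) => m !== firstUser && !tail.includes(m));
  if (middle.length === 0) {
    // Nothing left to summarize: compaction cannot help, so stop cleanly.
    return { messages, compacted: false, terminate: true };
  }
  const head = firstUser && !tail.includes(firstUser) ? [firstUser] : [];
  const result = [...system, ...head, summarize(middle), ...tail];
  // Bounded: report termination rather than compacting again in a loop.
  return { messages: result, compacted: true, terminate: totalTokens(result) > limit };
}
```

Passing `summarize` in as a callback keeps the policy (what is preserved) separate from the mechanism (how the middle is condensed).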

7.7 Approval gate

  • Outward-facing or costly/destructive actions MUST pass an approval policy before executing. The policy distinguishes safe-by-default from approval-required:
    • Auto-approved: read-only research, inspection, discovery; routine code execution in the default low-cost sandbox; status/metadata queries.
    • Approval-required: provisioning non-default (GPU/larger) compute; submitting paid compute jobs; destructive storage operations (delete repo, delete branch/tag, merge, force-upload/overwrite); creating durable repos.
    • Always human-gated: recurring/scheduled jobs (a standing cost commitment) require explicit human approval even under otherwise-autonomous policies.
  • An autonomous mode MAY auto-approve the approval-required class up to a cost cap, tracking estimated spend across the batch so the cap cannot be jointly overrun. Scheduled/recurring commitments remain human-gated regardless.
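
The three approval classes and the shared cost cap can be sketched as a small stateful gate. The action `kind` strings, `estimatedCostUSD`, and `createApprovalGate` are hypothetical; unknown kinds are treated as approval-required, which is a conservative assumption of this sketch rather than something the policy states.

```javascript
// Hypothetical action kinds for the auto-approved and always-human-gated classes.
const AUTO_APPROVED = new Set(['research', 'inspect', 'sandbox_default', 'status']);
const ALWAYS_HUMAN = new Set(['scheduled_job']);

// Everything else (GPU provisioning, paid jobs, destructive storage operations,
// creating durable repos, unknown kinds) is treated as approval-required.
function createApprovalGate({ autonomous = false, costCapUSD = 0 } = {}) {
  let reservedUSD = 0; // estimated spend already approved in this batch
  return function decide(action) {
    if (ALWAYS_HUMAN.has(action.kind)) return 'ask_human';
    if (AUTO_APPROVED.has(action.kind)) return 'approve';
    if (!autonomous) return 'ask_human';
    const cost = action.estimatedCostUSD || 0;
    if (reservedUSD + cost > costCapUSD) return 'ask_human';
    reservedUSD += cost; // reserve so later actions cannot jointly overrun the cap
    return 'approve';
  };
}
```

Tracking the running total inside the gate is what prevents several individually affordable jobs from exceeding the cap together.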

7.8 Effort / budget probing

  • The harness MAY tune the model's reasoning effort to the highest level the selected model actually supports, degrading gracefully when a level is rejected, and SHOULD do so cheaply (a tiny probe) and cache the result per model. This is a quality/cost optimization and is non-normative for the ML workflow itself.

8. State and artifacts

Across the workflow the agent maintains:

  • The plan (§5.1): the live decomposition and progress.
  • Research findings (§5.2): the recipe table, code patterns, references that ground implementation. These are the authority that later phases cite.
  • Validated resources (§5.3–5.4): confirmed model, dataset (with verified schema), and chosen hardware.
  • Run records (§5.6–5.7): job ids, configs, tracker project/run names, dashboard URLs, and the alert history that drives iteration.
  • Durable outputs (§5.8): the persisted model/dataset/logs and evaluation results, each linked by URL.

9. Completion criteria (conformance summary)

An execution conforms to this specification if, for an ML implementation request:

  1. Research preceded implementation, and the implementation cites concrete findings (Principle 1, §5.2).
  2. Resources were verified by inspection, including dataset-format-to-method compatibility (Principle 2, §5.3–5.4).
  3. GPU code was smoke-tested before scaling, or the omission was justified (Principle 3, §5.5).
  4. The pre-flight checklist was satisfied before any job, batches went one-job-first, and durable persistence was configured up front (§5.6, Principle 4).
  5. Monitoring emitted structured alerts and the next iteration was driven by them (§5.7).
  6. The result was persisted and evaluated, and all artifacts were linked (§5.8).
  7. No rule in §6 was violated; in particular no silent scope change or resource substitution occurred.
  8. The loop stayed bounded and unstuck under the control contract (§7).

Appendix A — Capability → hf-skills mapping (informative)

This appendix shows how the abstract capabilities of §4 are satisfied by hf-skills and its companion tools, so an implementer can build only the workflow logic. The skill names are the execution surface; do not reimplement them.

| Cap | Provided by (hf-skills) | Notes |
|-----|-------------------------|-------|
| C1 Hub search / repo details | hf-cli (hf models/datasets list/info), huggingface-best (leaderboard-ranked model choice), companion hub MCP tools | Use for model/dataset discovery and validation. |
| C2 Dataset inspection | huggingface-datasets (Dataset Viewer: rows, search, filter, statistics, parquet); validation helpers inside the trainer skills | Covers schema, splits, samples, stats. Satisfies the §5.4 audit. |
| C3 Paper search & read | huggingface-papers (paper markdown + structured metadata, linked models/datasets/spaces/repo), hf-cli papers, huggingface-paper-publisher | Read methodology from the markdown; follow linked artifacts. |
| C4 Citation-graph crawl | Partial gap. huggingface-papers exposes a paper's linked artifacts and metadata but not a full references/forward-citations graph with influence/intent. | Implement the "crawl downstream work" step via paper-page links plus the host harness web retrieval (C13); accept reduced fidelity vs. a dedicated citation API, and say so. |
| C5 Documentation retrieval | Companion doc tools assumed by the skills (hf_doc_search / hf_doc_fetch), referenced throughout huggingface-llm-trainer etc. | Use for current TRL/Transformers/etc. APIs. |
| C6 Working example code | Trainer skills ship reference scripts (huggingface-llm-trainer, huggingface-vision-trainer, train-sentence-transformers); host harness can read repos | Prefer copying the skills' production templates over synthesizing. |
| C7 Disposable sandbox (CPU/GPU) | Partial gap. No drop-in "GPU sandbox provision" skill equivalent to ml-intern's. Realize §5.5 preflight as a short, cheap job (C8) on a small GPU flavor with a tiny subset, or a local GPU smoke test via huggingface-community-evals (--limit). | The rule (smoke-test small first) is preserved; the mechanism becomes a minimal job or local run. |
| C8 Managed jobs | hf-cli (hf jobs run/inspect/logs/cancel, scheduled jobs) and the hf_jobs companion tool used by the trainer skills | Set timeout (R5), flavor (R6), env/secrets, persistence. |
| C9 Training methods | huggingface-llm-trainer (SFT/DPO/GRPO/reward via TRL), huggingface-vision-trainer, train-sentence-transformers | Method ↔ data-shape rules in §5.4 align with these skills' own validation. |
| C10 Tracking + alerts | huggingface-trackio (init/log/alert/finish, alert levels, CLI --json retrieval, Space dashboards) | This is the §5.7 decision channel. Read alerts back via the CLI JSON. |
| C11 Durable storage | hf-cli (upload/repos/buckets), trainers' push_to_hub, huggingface-datasets upload | Satisfies R1/R2. |
| C12 Evaluation | huggingface-community-evals (inspect-ai / lighteval, local GPU), trainer eval hooks | Satisfies §5.8 "evaluated and confirmed". |
| C13 General web/doc retrieval | Gap in hf-skills. Source from the host agent harness's built-in web search/fetch. | Used by research (C4 fallback) and for non-HF docs. |
| C14 Notifications | Gap in hf-skills. Source from the host harness (messaging integration) if needed. | Optional; gated per §5.9. |

Gaps to handle explicitly when implementing: C4 (full citation graph), C7 (dedicated GPU sandbox), C13 (general web), C14 (notifications). For each, the spec keeps the rule and lets the implementer satisfy it with the nearest hf-skills mechanism or a host-harness capability, stating any reduced fidelity.


Appendix B — Realizing the workflow as skills / subagents / hooks (non-normative)

This appendix is illustrative only. It addresses the usage example of building an ml-researcher from hf-skills rather than from a custom agentic loop, and it is not part of the specification.

  • Driving skill(s). Encode the phase order and the gate conditions (§5) as the researcher's top-level procedure: a skill that sequences research → validate → audit → preflight → submit → monitor → iterate → evaluate, delegating each concrete action to the relevant hf-skill (Appendix A).
  • Research subagent. Implement §5.2 as a separate subagent with its own context and a read-only toolset (papers, docs, examples, web). Its return schema is the recipe table + code patterns + references. This is the one place a subagent is structurally required (context isolation).
  • Hooks for the control contract (§7) and the hard gates (§6). Implement these as pre/post-action hooks rather than prose, so that they are enforced, not merely requested:
    • Pre-job hook: refuse a job submission unless the §5.6 pre-flight checklist is satisfied (reference impl, verified schema, smoke-test result, persistence destination, sized timeout, monitoring wired). Refuse local paths in scripts.
    • Batch hook: allow only one job from a batch until its logs confirm a healthy start (one-job-first).
    • Repetition / continuation / malformed guards: the §7.2–7.4 detectors.
    • Compaction hook: the §7.6 policy.
    • Approval hook: the §7.7 policy (paid compute, destructive storage, scheduled jobs).
  • Plan/tracker. Use the host harness's todo mechanism for the §5.1 plan, and huggingface-trackio for the §5.7 alert-driven loop.
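
The pre-job and batch hooks listed above could be sketched as two pure functions. This is a minimal sketch under stated assumptions: the job fields (`referenceImpl`, `schemaVerified`, `smokeTestPassed`, `persistenceDest`, `timeoutHours`, `monitoringWired`, `script`, `submitted`), the `firstJobHealthy` flag, and the local-path pattern are all hypothetical names chosen for illustration, and a real host harness would supply its own hook signature.

```javascript
// Hypothetical pre-job hook. Every field name here is an assumption for this sketch.
const PREFLIGHT_ITEMS = [
  ['referenceImpl', 'reference implementation'],
  ['schemaVerified', 'verified dataset schema'],
  ['smokeTestPassed', 'smoke-test result'],
  ['persistenceDest', 'persistence destination'],
  ['timeoutHours', 'sized timeout'],
  ['monitoringWired', 'monitoring wired'],
]

function preJobHook(job) {
  // Collect the label of every checklist item that is absent or falsy.
  const missing = PREFLIGHT_ITEMS.filter(([key]) => !job[key]).map(([, label]) => label)
  // Crude local-path check: relative ./ or ../ paths and common home-directory prefixes.
  if (/(^|[\s"'(=])(\.\.?\/|\/Users\/|\/home\/)/.test(job.script || '')) {
    missing.push('script references a local path')
  }
  return { allow: missing.length === 0, missing }
}

// One-job-first: only the first job of a batch is released until it is confirmed healthy.
function batchHook(batch) {
  const [first, ...rest] = batch.jobs
  if (!first) return { submit: [], hold: [] }
  if (!batch.firstJobHealthy) {
    return { submit: first.submitted ? [] : [first], hold: rest }
  }
  return { submit: rest.filter(job => !job.submitted), hold: [] }
}

const exampleJob = { referenceImpl: 'train_sft_example.py', script: 'open("./data/train.json")' }
console.log(preJobHook(exampleJob))
```

Returning the list of missing items, rather than a bare boolean, lets the harness report exactly why a submission was refused.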

The intent of this split is that hf-skills provides the doing, and the researcher provides only the discipline: the phase ordering, the verification gates, the persistence and anti-scope-change rules, and the loop guards.
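
As a rough illustration of that split, the driving procedure can be reduced to a loop in which each phase's action is delegated (the doing) and each phase's gate is checked before moving on (the discipline). The `runPhases` function, the `actions` and `gates` maps, and the shape of the returned state are hypothetical names introduced for this sketch; the stub actions stand in for real hf-skills calls.

```javascript
// Hypothetical phase runner: `actions` stands in for hf-skills calls,
// `gates` for the researcher's own verification checks.
async function runPhases(phases, actions, gates) {
  const state = { completed: [], outputs: {}, stoppedAt: null, reason: null }
  for (const name of phases) {
    const output = await actions[name](state.outputs) // the doing
    const gate = gates[name]
    const verdict = gate ? gate(output, state.outputs) : { ok: true } // the discipline
    if (!verdict.ok) {
      // Stop at the first failed gate; later phases never run.
      state.stoppedAt = name
      state.reason = verdict.reason || 'gate failed'
      break
    }
    state.outputs[name] = output
    state.completed.push(name)
  }
  return state
}

// Usage with stubbed actions: the validate gate refuses a missing resource.
const demoActions = {
  research: async () => ({ recipe: ['stub'] }),
  validate: async () => ({ exists: false }),
  audit: async () => ({ formatCompatible: true }),
}
const demoGates = {
  validate: output => ({ ok: output.exists, reason: 'requested resource is missing' }),
}
runPhases(['research', 'validate', 'audit'], demoActions, demoGates)
  .then(state => console.log(state.completed, state.stoppedAt, state.reason))
```

Because the gates are plain functions of a phase's output, they can be tested without invoking any skill.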

export const meta = {
name: 'ml-research',
description: 'Autonomous ML research: research-first, validate, audit, CPU smoke, GPU preflight, train on HF Jobs with failure-analysis+retry, evaluate and persist a verified result.',
whenToUse: 'Any ML training/fine-tuning/eval/data task (LLM SFT/DPO/GRPO/reward, vision detection/classification/SAM, sentence-transformers, eval-only, data-only). Pass args as a task string or { task, model?, dataset?, hubOrg?, costCapUSD?, maxJobRetries?, smokeModel?, workDir? }.',
phases: [
{ title: 'Intake & Plan', detail: 'classify task, freeze baseline, build plan' },
{ title: 'Research', detail: 'literature-first: papers, citations, current-API examples, dataset recipes' },
{ title: 'Verify Research', detail: 'adversarial critic: confirm cited papers/datasets/APIs exist' },
{ title: 'Resources', detail: 'verify model + dataset exist; size hardware (no silent substitution)' },
{ title: 'Data Audit', detail: 'inspect schema/splits/stats; validate format-to-method compatibility' },
{ title: 'Implement', detail: 'adapt the matching trainer reference script; wire monitoring/persistence/timeout/smoke knobs' },
{ title: 'Code Review', detail: 'critic: hallucinated imports, wrong args, local-path leak, missing persistence/monitoring, OOM-readiness' },
{ title: 'CPU Smoke', detail: 'local CPU: 1 train step + 1 eval step; analyze+fix+retry' },
{ title: 'GPU Preflight', detail: 'tiny billable GPU job: real model load + GPU code paths; analyze+fix+retry' },
{ title: 'Job Readiness', detail: 'one-time gate: refuse full job unless the pre-flight checklist is satisfied' },
{ title: 'Full Job', detail: 'one-job-first: submit, confirm healthy, monitor alerts; analyze+fix+retry' },
{ title: 'Evaluate & Persist', detail: 'evaluate the model, confirm it works, ensure persisted, dashboard URL' },
{ title: 'Final Verification', detail: 'tool-backed conformance check against the spec completion criteria' },
],
}
// ===========================================================================
// ml-research.js — encodes docs/ml-research-workflow-spec.md on top of the
// Hugging Face skills in ~/Development/extern/hf-skills.
//
// Design decisions confirmed with the user:
// - Paid HF Jobs auto-submit AUTONOMOUSLY up to a dollar cost cap (spec §7.7).
// Scheduled/recurring jobs are OUT OF SCOPE (always human-gated).
// - The workflow STOPS at the first verified result (the §5.9 autonomous
// improvement loop / grid sweep is omitted). Alert-driven corrective retries
// that get one good run to complete ARE kept.
//
// The Workflow tool shares the local FILESYSTEM across subagents but NOT their
// context. Scripts are handed between local subagents via WORK_DIR; the remote
// HF Job receives code BY VALUE (inline source), never a local path.
// budget.* tracks TOKENS, not dollars — HF spend is tracked in spentUSD.
// ===========================================================================
// ---------------------------------------------------------------------------
// Args + run-level state
// ---------------------------------------------------------------------------
const cfg = (typeof args === 'string') ? { task: args } : (args || {})
const TASK = (cfg.task || '').trim()
const WORK_DIR = cfg.workDir || './ml-run'
const COST_CAP_USD = (typeof cfg.costCapUSD === 'number') ? cfg.costCapUSD : 10
const MAX_JOB_RETRIES = (typeof cfg.maxJobRetries === 'number') ? cfg.maxJobRetries : 3
const SMOKE_RETRIES = 3
let spentUSD = 0
const result = { task: TASK, costCapUSD: COST_CAP_USD, spentUSD: 0, stoppedReason: null, artifacts: [] }
function canSpend(estUSD) { return (spentUSD + (estUSD || 0)) <= COST_CAP_USD }
function recordSpend(estUSD) { spentUSD += (estUSD || 0); result.spentUSD = spentUSD }
function ctxBlock(label, obj) {
return `\n\n## ${label}\n\`\`\`json\n${JSON.stringify(obj, null, 2)}\n\`\`\`\n`
}
// Keys that, if mutated to escape an error, constitute a forbidden scope change
// (spec R3/R4/R7). Comparison is a PURE function, not subagent self-report.
const PROTECTED_KEYS = [
'method', 'training_method', 'model', 'base_model', 'model_name', 'model_name_or_path',
'dataset', 'dataset_name', 'dataset_id',
'sequence_length', 'seq_length', 'seqlen', 'max_seq_length', 'max_seq_len', 'max_length', 'block_size',
]
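// Key-based check. Case and separators are normalized so that camelCase keys such
// as `sequenceLength` or `maxSeqLength` match their snake_case protected forms.
// Because only keys are compared, a pure kwarg rename that touches a protected key
// (e.g. max_seq_length -> max_length) also trips the guard.
const normalizeKey = k => String(k).toLowerCase().replace(/[^a-z0-9]/g, '')
const PROTECTED_KEYS_NORMALIZED = new Set(PROTECTED_KEYS.map(normalizeKey))
function isScopeChange(configChanges) {
if (!configChanges) return false
return Object.keys(configChanges).some(k => PROTECTED_KEYS_NORMALIZED.has(normalizeKey(k)))
}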
// Cheap structural gate run before EVERY billable submit (spec §5.6).
function assertSubmitInvariant(impl) {
const missing = []
if (!impl || !impl.trainScriptContent || impl.trainScriptContent.length < 50) {
missing.push('by-value script content (inline source string, not a local path)')
}
if (!impl || !impl.timeoutHours) missing.push('timeout sized to the work (R5)')
if (!impl || !impl.monitoringWired) missing.push('monitoring wired to a trackio dashboard (§5.6)')
if (impl && impl.trainScriptContent && /(\.\/ml-run\/|\/Users\/|\/home\/[^\s"']*\/ml-run)/.test(impl.trainScriptContent)) {
missing.push('local-path leak: script references a local checkout path; remote jobs run by value (§5.6)')
}
return { ok: missing.length === 0, missing }
}
// ---------------------------------------------------------------------------
// Skills + shared rule preamble injected into agent prompts
// ---------------------------------------------------------------------------
const SKILLS_DIR = '~/Development/extern/hf-skills'
const SKILLS_NOTE = `Use the Hugging Face skills via the Skill tool (invocation form \`huggingface-skills:<name>\`). The skills are defined under \`${SKILLS_DIR}/skills/\`; if a skill cannot be invoked directly, read its \`SKILL.md\` and reference scripts from that path and follow them. Prefer copying/adapting the skills' shipped reference scripts over writing ML code from memory.`
const HARD_RULES = `Hard rules (from the spec — non-negotiable):
- Research-grounded: do NOT write ML code from memory; current library APIs differ from what you "know". Ground every config/import in freshly retrieved docs + working example code (Principle 1, R11).
- Verify, never assume: confirm resource existence and dataset schema by inspection before any expensive op (R9).
- No silent scope change (R3/R4): on error apply the MINIMAL fix that preserves the user's original method, model, dataset, and sequence length. NEVER switch method, swap model/dataset, or reduce sequence length to escape an error. If the original genuinely cannot work, STOP and ask the user.
- No silent substitution (R8): if a requested model/dataset is unavailable, say so and ask; do not substitute.
- Persistence (R1/R2): anything that must outlive the run is pushed to the Hub as part of the run, to a concrete destination set before launch.
- Timeout (R5): size job timeout to the actual work, never the short default.
- Hardware (R6): size compute to the model footprint; do not oversize without reason or undersize into OOM.
- OOM recovery order (R7): (1) reduce per-device batch + raise gradient accumulation to keep the EFFECTIVE batch size identical, (2) gradient checkpointing, (3) larger-memory hardware. Never reduce seqlen or switch method for OOM.
- Prefer prebuilt over compiling from source inside a job (R12).
- Secrets from the environment, never logged (R13). Every model/dataset/paper/job/dashboard referenced by a direct URL (R14).`
// ---------------------------------------------------------------------------
// Task profiles — declarative genericity (no per-type branch explosion)
// ---------------------------------------------------------------------------
const TASK_PROFILES = {
llm: {
skill: 'huggingface-skills:huggingface-llm-trainer',
scriptByMethod: {
sft: 'train_sft_example.py', dpo: 'train_dpo_example.py',
grpo: 'train_grpo_example.py', reward: 'train_grpo_example.py',
},
inspector: 'dataset_inspector.py',
costEstimator: 'estimate_cost.py',
evalSkill: 'huggingface-skills:huggingface-community-evals',
schemaRules: {
sft: 'one of: `messages`, or `text`, or `prompt`+`completion`',
dpo: '`prompt`, `chosen`, `rejected`',
grpo: '`prompt`',
reward: '`chosen`, `rejected`',
},
},
vision: {
skill: 'huggingface-skills:huggingface-vision-trainer',
scriptByMethod: {
detection: 'object_detection_training.py', classification: 'image_classification_training.py',
sam: 'sam_segmentation_training.py', segmentation: 'sam_segmentation_training.py',
},
inspector: 'dataset_inspector.py',
costEstimator: 'estimate_cost.py',
evalSkill: 'huggingface-skills:huggingface-vision-trainer',
schemaRules: {
detection: 'COCO-format: `image` + `objects`{`bbox`, `category`}',
classification: '`image` + `label`',
sam: '`image` + masks/prompts (`bbox` or `points`)',
},
},
embedding: {
skill: 'huggingface-skills:train-sentence-transformers',
scriptByMethod: {
biencoder: 'train_sentence_transformer_example.py',
reranker: 'train_cross_encoder_example.py',
sparse: 'train_sparse_encoder_example.py',
},
inspector: null, // sentence-transformers skill validates data shape against the chosen loss
costEstimator: null,
evalSkill: 'huggingface-skills:train-sentence-transformers',
schemaRules: {
biencoder: 'loss-dependent: (anchor,positive) pairs, triplets, or (sentence,label) — match the chosen loss',
reranker: '(query, document) pairs with a label/score',
sparse: 'same shapes as bi-encoder; SpladeLoss-compatible',
},
},
eval: {
skill: 'huggingface-skills:huggingface-community-evals',
evalSkill: 'huggingface-skills:huggingface-community-evals',
evalOnly: true,
},
data: {
skill: 'huggingface-skills:huggingface-datasets',
dataOnly: true,
},
other: {
skill: 'huggingface-skills:hf-cli',
},
}
function resolveProfile(taskType) { return TASK_PROFILES[taskType] || TASK_PROFILES.other }
// Closed failure taxonomy + fix policy injected into the analysis prompt.
const FAILURE_TAXONOMY = `Failure categories and the ONLY allowed minimal fix for each:
- oom: CUDA out of memory. Apply the R7 ladder IN ORDER, escalating only on repeat OOM (the current rung is in the carried context): (1) reduce per_device_train_batch_size and raise gradient_accumulation_steps so the EFFECTIVE batch (per_device * accum * n_gpu) is unchanged, and state the arithmetic; (2) enable gradient_checkpointing; (3) move to a larger-memory flavor. NEVER reduce seqlen / switch method. Set effectiveBatchPreserved and oomLadderRung.
- bad_import / module_not_found: fix the PEP-723 dependency header (add/pin the package). Re-check current docs.
- wrong_trainer_arg: TypeError on an unexpected kwarg (e.g. max_seq_length vs max_length). Re-check the CURRENT library docs (R11) and correct the kwarg. Set recheckedDocs.
- dataset_schema_mismatch: KeyError / format error. Re-map columns or add a formatting_func. If the dataset genuinely lacks the required fields → this is a scopeChange/STOP (R8): ask the user, do not swap dataset.
- auth_token: 401/403 on push. Ensure HF_TOKEN is passed as a secret with write scope; never inline the token.
- quota_flavor_unavailable: flavor queued/denied or billing. Wait/retry, or use an equivalent-or-larger flavor (sizing, not method).
- build_from_source_failure: make/cmake/nvcc/flash-attn wheel error. Prefer a prebuilt wheel / disable the source build (R12).
- timeout: container killed mid-run. RAISE the timeout to a sized value (R5). Shrinking the dataset/epochs/seqlen to fit the default is a forbidden scopeChange.
- dependency_conflict: pip resolver clash. Pin compatible versions per current docs.
- hang_no_progress: stuck (e.g. eval_strategy set without an eval_dataset). Provide eval_dataset or set eval_strategy="no".
- divergence_nan: NaN/inf or non-decreasing loss (usually surfaced as an ERROR trackio alert). Corrective config change: learning_rate x0.1 (or per the alert text). This is iteration toward a working run, not a crash.
- unknown: read the full log, diagnose, change something specific; never retry unchanged (R10).`
// ---------------------------------------------------------------------------
// JSON schemas (forced structured output)
// ---------------------------------------------------------------------------
const PLAN_SCHEMA = {
type: 'object', additionalProperties: false,
required: ['isTrivial', 'taskType', 'method', 'baseline', 'skills', 'plan', 'summary'],
properties: {
isTrivial: { type: 'boolean', description: 'true only for a pure factual question / status check / resource lookup that needs no ML code' },
directAnswer: { type: 'string', description: 'if isTrivial, the answer to return to the user' },
taskType: { type: 'string', enum: ['llm', 'vision', 'embedding', 'eval', 'data', 'other'] },
method: { type: 'string', description: 'e.g. sft|dpo|grpo|reward|detection|classification|sam|biencoder|reranker|sparse, or "n/a"' },
baseline: {
type: 'object', additionalProperties: false, description: 'FROZEN user intent; the scope-change guard compares against this',
required: ['model', 'dataset', 'method', 'sequenceLength'],
properties: {
model: { type: 'string' }, dataset: { type: 'string' },
method: { type: 'string' }, sequenceLength: { type: 'string' },
},
},
candidateModels: { type: 'array', items: { type: 'string' } },
candidateDatasets: { type: 'array', items: { type: 'string' } },
skills: { type: 'array', items: { type: 'string' }, description: 'huggingface-skills:<name> to use' },
plan: {
type: 'array', items: {
type: 'object', additionalProperties: false, required: ['id', 'title', 'status'],
properties: { id: { type: 'string' }, title: { type: 'string' }, status: { type: 'string', enum: ['pending', 'in_progress', 'completed'] } },
},
},
summary: { type: 'string' },
},
}
const FINDER_SCHEMA = {
type: 'object', additionalProperties: false, required: ['findings', 'sources'],
properties: {
findings: { type: 'string', description: 'compact prose findings for this research angle' },
sources: { type: 'array', items: { type: 'object', additionalProperties: false, required: ['title', 'url'], properties: { title: { type: 'string' }, url: { type: 'string' } } } },
fidelity: { type: 'string', enum: ['full', 'reduced'], description: 'reduced if a capability gap (e.g. no citation-graph API) limited coverage' },
notes: { type: 'string' },
},
}
const RESEARCH_SCHEMA = {
type: 'object', additionalProperties: false,
required: ['recipe', 'codePatterns', 'sotaLandscape', 'references', 'citationGraphFidelity'],
properties: {
recipe: {
type: 'array', items: {
type: 'object', additionalProperties: false,
required: ['paper', 'paperUrl', 'reportedResult', 'dataset', 'method', 'keyHyperparams', 'insight'],
properties: {
paper: { type: 'string' }, paperUrl: { type: 'string' }, reportedResult: { type: 'string' },
benchmark: { type: 'string' }, dataset: { type: 'string' }, datasetUrl: { type: 'string' },
method: { type: 'string' }, keyHyperparams: { type: 'string' }, insight: { type: 'string' },
},
},
},
codePatterns: {
type: 'array', items: {
type: 'object', additionalProperties: false, required: ['summary', 'snippet'],
properties: { summary: { type: 'string' }, imports: { type: 'string' }, trainerArgs: { type: 'string' }, snippet: { type: 'string' }, apiVerifiedNote: { type: 'string' } },
},
},
sotaLandscape: { type: 'string' },
references: { type: 'array', items: { type: 'object', additionalProperties: false, required: ['type', 'url'], properties: { type: { type: 'string' }, url: { type: 'string' }, note: { type: 'string' } } } },
citationGraphFidelity: { type: 'string', enum: ['full', 'reduced'] },
unverifiedClaims: { type: 'array', items: { type: 'string' } },
},
}
const CRITIC_SCHEMA = {
type: 'object', additionalProperties: false, required: ['ok', 'severity', 'issues', 'recommendation', 'mustReblock'],
properties: {
ok: { type: 'boolean' },
severity: { type: 'string', enum: ['info', 'warn', 'error'] },
issues: { type: 'array', items: { type: 'object', additionalProperties: false, required: ['issue', 'severity'], properties: { issue: { type: 'string' }, severity: { type: 'string', enum: ['info', 'warn', 'error'] }, fix: { type: 'string' } } } },
recommendation: { type: 'string' },
mustReblock: { type: 'boolean', description: 'true if the workflow must not proceed until the issues are fixed' },
},
}
const RESOURCE_SCHEMA = {
type: 'object', additionalProperties: false, required: ['modelVerified', 'datasetVerified', 'hardware', 'needsUserDecision'],
properties: {
modelVerified: { type: 'object', additionalProperties: false, required: ['exists'], properties: { exists: { type: 'boolean' }, id: { type: 'string' }, url: { type: 'string' }, sizeParams: { type: 'string' }, license: { type: 'string' } } },
datasetVerified: { type: 'object', additionalProperties: false, required: ['exists'], properties: { exists: { type: 'boolean' }, id: { type: 'string' }, url: { type: 'string' } } },
hardware: { type: 'object', additionalProperties: false, required: ['flavor', 'rationale'], properties: { flavor: { type: 'string' }, rationale: { type: 'string' } } },
needsUserDecision: { type: 'boolean', description: 'true if a requested resource is missing/unusable (R8) — STOP and ask, never substitute' },
note: { type: 'string' },
},
}
const DATA_AUDIT_SCHEMA = {
type: 'object', additionalProperties: false, required: ['columns', 'formatCompatible', 'requiredFields'],
properties: {
columns: { type: 'array', items: { type: 'string' } },
splits: { type: 'string' }, sampleRows: { type: 'string' }, distributionNotes: { type: 'string' },
formatCompatible: { type: 'boolean', description: 'does the schema satisfy the training method per §5.4?' },
requiredFields: { type: 'string', description: 'the fields the method requires' },
mappingNeeded: { type: 'string', description: 'column remapping / formatting_func needed, if any' },
issues: { type: 'array', items: { type: 'string' } },
},
}
const IMPL_SCHEMA = {
type: 'object', additionalProperties: false,
required: ['trainScriptPath', 'evalScriptPath', 'trainScriptContent', 'referenceScriptUsed', 'persistenceDest', 'timeoutHours', 'monitoringWired', 'smokeKnobs'],
properties: {
trainScriptPath: { type: 'string' }, evalScriptPath: { type: 'string' },
trainScriptContent: { type: 'string', description: 'FULL inline source of the training script (by-value transport for the remote job)' },
referenceScriptUsed: { type: 'string', description: 'which shipped skill reference script this was adapted from' },
configSummary: { type: 'string' },
persistenceDest: { type: 'string', description: 'concrete org/name Hub destination (NOT the template "username/...")' },
timeoutHours: { type: 'number' },
monitoringWired: { type: 'boolean' },
estimatedPreflightUSD: { type: 'number' },
estimatedFullJobUSD: { type: 'number' },
smokeKnobs: { type: 'object', additionalProperties: false, required: ['maxSteps', 'evalSteps', 'cpu', 'proxyModel'], properties: { maxSteps: { type: 'number' }, evalSteps: { type: 'number' }, cpu: { type: 'boolean' }, fp32: { type: 'boolean' }, proxyModel: { type: 'string', description: 'tiny model used for CPU smoke if the real model cannot load on CPU; empty if the real model is used' } } },
},
}
const SMOKE_SCHEMA = {
type: 'object', additionalProperties: false, required: ['status', 'modelUsed', 'usedProxyModel', 'summary'],
properties: {
status: { type: 'string', enum: ['success', 'failure'] },
modelUsed: { type: 'string' }, usedProxyModel: { type: 'boolean' },
ranTrainStep: { type: 'boolean' }, ranEvalStep: { type: 'boolean' },
summary: { type: 'string' }, errorExcerpt: { type: 'string' },
},
}
const JOB_SCHEMA = {
type: 'object', additionalProperties: false, required: ['status', 'healthy'],
properties: {
status: { type: 'string', enum: ['success', 'failure'] },
healthy: { type: 'boolean', description: 'logs confirm training actually started / first metric logged' },
jobId: { type: 'string' }, dashboardUrl: { type: 'string' }, jobUrl: { type: 'string' },
estimatedUSD: { type: 'number' },
alerts: { type: 'array', items: { type: 'object', additionalProperties: false, required: ['level', 'text'], properties: { level: { type: 'string' }, title: { type: 'string' }, text: { type: 'string' }, step: { type: 'number' }, timestamp: { type: 'string' } } } },
lastAlertTimestamp: { type: 'string', description: 'cursor for the next --since poll' },
logsSummary: { type: 'string' }, errorExcerpt: { type: 'string' },
},
}
const FAILURE_ANALYSIS_SCHEMA = {
type: 'object', additionalProperties: false,
required: ['rootCause', 'category', 'evidenceFromLogs', 'minimalFix', 'configChanges', 'scopeChange', 'unrecoverable'],
properties: {
rootCause: { type: 'string' },
category: { type: 'string', enum: ['oom', 'bad_import', 'module_not_found', 'wrong_trainer_arg', 'dataset_schema_mismatch', 'auth_token', 'quota_flavor_unavailable', 'build_from_source_failure', 'timeout', 'dependency_conflict', 'hang_no_progress', 'divergence_nan', 'unknown'] },
evidenceFromLogs: { type: 'string', description: 'the actual error excerpt that justifies the diagnosis (R10)' },
minimalFix: { type: 'string', description: 'the smallest change that preserves the original request (R3)' },
configChanges: { type: 'object', description: 'map of changed config keys -> {from,to}; touching a protected key triggers the scope-change guard', additionalProperties: true },
oomLadderRung: { type: 'number', description: '1,2,3 for the R7 rung applied this attempt; 0/absent if not an OOM' },
effectiveBatchPreserved: { type: 'boolean' },
recheckedDocs: { type: 'boolean' },
scopeChange: { type: 'boolean', description: 'true if the only viable fix would change method/model/dataset/seqlen (then STOP and ask)' },
scopeChangeReason: { type: 'string' },
unrecoverable: { type: 'boolean' },
},
}
const CHECKLIST_SCHEMA = {
type: 'object', additionalProperties: false, required: ['allSatisfied', 'missing'],
properties: {
referenceImplCited: { type: 'boolean' }, datasetFormatVerified: { type: 'boolean' },
gpuSmokeOk: { type: 'boolean' }, realModelLoaded: { type: 'boolean' },
persistenceConcrete: { type: 'boolean' }, timeoutSized: { type: 'boolean' },
monitoringWired: { type: 'boolean' }, byValueTransport: { type: 'boolean' },
allSatisfied: { type: 'boolean' }, missing: { type: 'array', items: { type: 'string' } },
},
}
const EVAL_SCHEMA = {
type: 'object', additionalProperties: false, required: ['evaluated', 'modelUrl'],
properties: {
evaluated: { type: 'boolean' }, metric: { type: 'string' }, value: { type: 'string' },
evalUrl: { type: 'string' }, modelUrl: { type: 'string' }, dashboardUrl: { type: 'string' },
confirmedWorks: { type: 'boolean' }, summary: { type: 'string' },
},
}
const COMPLETION_SCHEMA = {
type: 'object', additionalProperties: false, required: ['conforms', 'missing', 'artifacts'],
properties: {
conforms: { type: 'boolean' },
outputExists: { type: 'boolean' }, persisted: { type: 'boolean' }, evaluated: { type: 'boolean' },
researchGrounded: { type: 'boolean' }, resourcesVerified: { type: 'boolean' }, smokeTested: { type: 'boolean' },
noScopeChange: { type: 'boolean' }, allArtifactsLinked: { type: 'boolean' },
missing: { type: 'array', items: { type: 'string' } },
artifacts: { type: 'array', items: { type: 'object', additionalProperties: false, required: ['type', 'url'], properties: { type: { type: 'string' }, url: { type: 'string' } } } },
summary: { type: 'string' },
},
}
// ---------------------------------------------------------------------------
// Reusable analyze -> minimal-fix -> retry helper (CPU smoke / GPU preflight / full job)
// ---------------------------------------------------------------------------
async function attemptWithRetry({ label, phaseName, maxAttempts, submit, analyze, applyFix }) {
let ctx = { attempt: 0, oomLadderRung: 0, sinceCursor: null, lastFix: null, history: [] }
for (let attempt = 1; attempt <= maxAttempts; attempt++) {
ctx.attempt = attempt
log(`${label}: attempt ${attempt}/${maxAttempts}`)
const res = await submit(attempt, ctx)
if (res && res.status === 'success') return { ...res, attempts: attempt, stopped: false }
ctx.history.push({ attempt, status: res && res.status, error: res && res.errorExcerpt })
if (attempt === maxAttempts) {
log(`${label}: exhausted ${maxAttempts} attempts`)
return { ...(res || {}), attempts: attempt, stopped: true, stoppedReason: 'retries_exhausted' }
}
const analysis = await analyze(attempt, res, ctx)
if (!analysis) return { ...(res || {}), attempts: attempt, stopped: true, stoppedReason: 'analysis_failed' }
// Pure scope-change guard (R3/R4) — independent of the agent's self-report.
const scope = analysis.scopeChange || isScopeChange(analysis.configChanges)
if (scope) {
log(`${label}: STOP — fix would change scope (${analysis.scopeChangeReason || 'protected key touched'}). Per R3/R4 the user must decide.`)
return { ...(res || {}), attempts: attempt, stopped: true, stoppedReason: 'scope_change', analysis }
}
if (analysis.unrecoverable) {
log(`${label}: STOP — unrecoverable: ${analysis.rootCause}`)
return { ...(res || {}), attempts: attempt, stopped: true, stoppedReason: 'unrecoverable', analysis }
}
if (applyFix) await applyFix(analysis, ctx)
ctx.lastFix = analysis
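// Advance one rung per OOM when the analysis does not report a rung; using the attempt
// number here would skip rungs whenever an earlier attempt failed for a different reason.
if (analysis.category === 'oom') ctx.oomLadderRung = Math.min(3, Math.max(ctx.oomLadderRung, analysis.oomLadderRung || (ctx.oomLadderRung + 1)))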
if (res && res.lastAlertTimestamp) ctx.sinceCursor = res.lastAlertTimestamp
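}
// Reached only when maxAttempts < 1 (e.g. maxJobRetries: 0): return an explicit result instead of undefined.
log(`${label}: no attempts allowed (maxAttempts=${maxAttempts})`)
return { status: 'failure', attempts: 0, stopped: true, stoppedReason: 'no_attempts' }
}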
function analysisPrompt(stage, res, ctx, baseline) {
return `${HARD_RULES}
You are diagnosing a FAILED ${stage}. Read the FULL error/logs, diagnose the ACTUAL root cause, and propose the single MINIMAL fix that preserves the user's original request. Do NOT retry the identical action unchanged (R10). For API/import errors re-check the CURRENT library docs/examples (R11). ${SKILLS_NOTE}
${FAILURE_TAXONOMY}
The user's FROZEN baseline (never change these to escape an error — doing so is a scopeChange that must STOP and ask the user):
${JSON.stringify(baseline)}
Carried retry context (for OOM, escalate the R7 ladder beyond rung ${ctx.oomLadderRung}):
${JSON.stringify({ attempt: ctx.attempt, oomLadderRung: ctx.oomLadderRung, history: ctx.history, lastFix: ctx.lastFix && ctx.lastFix.minimalFix })}
${ctxBlock('Failure result', res)}
Return a FAILURE_ANALYSIS object. Put the exact log excerpt in evidenceFromLogs. If the only viable fix would touch method/model/dataset/sequence length, set scopeChange=true and explain.`
}
// ===========================================================================
// MAIN
// ===========================================================================
if (!TASK) {
log('No task provided. Pass args as a task string or { task, ... }.')
return { error: 'missing task', usage: meta.whenToUse }
}
// --- Phase 0: Intake & Plan -------------------------------------------------
phase('Intake & Plan')
log(`Task: ${TASK}`)
const plan = await agent(
`${HARD_RULES}
You are the intake+planner for an autonomous ML research workflow. ${SKILLS_NOTE}
User request:
"""${TASK}"""
${cfg.model ? `\nUser-specified model: ${cfg.model}` : ''}${cfg.dataset ? `\nUser-specified dataset: ${cfg.dataset}` : ''}${cfg.hubOrg ? `\nHub namespace for outputs: ${cfg.hubOrg}` : ''}
Do (read-only): classify the task. isTrivial=true ONLY for a pure factual question / status check / resource lookup needing no ML code (then put the answer in directAnswer). Otherwise pick taskType (llm|vision|embedding|eval|data|other) and method. Capture the FROZEN baseline {model,dataset,method,sequenceLength} exactly as the user expressed it (use "unspecified" where the user did not say). List candidate models/datasets if the user did not name them. List the huggingface-skills:* you will use. Produce an ordered plan (research → verify → resources → audit → implement → review → cpu-smoke → gpu-preflight → full-job → evaluate → persist → verify), first item in_progress, rest pending.`,
{ label: 'plan', phase: 'Intake & Plan', schema: PLAN_SCHEMA }
)
result.plan = plan
result.baseline = plan && plan.baseline
if (!plan) { result.stoppedReason = 'planning_failed'; return result }
if (plan.isTrivial) {
log('Trivial request — answering directly, skipping the heavyweight workflow (§5).')
result.trivial = true
result.answer = plan.directAnswer || (await agent(`Answer this directly and concisely. ${SKILLS_NOTE}\n\n"""${TASK}"""`, { label: 'direct-answer', phase: 'Intake & Plan' }))
return result
}
const profile = resolveProfile(plan.taskType)
const baseline = plan.baseline || {}
log(`taskType=${plan.taskType} method=${plan.method} → skill ${profile.skill}`)
// --- Phase 1: Research (mandatory; read-only fan-out) -----------------------
phase('Research')
log('Plan: research in_progress. Literature-first — assume internal ML knowledge is stale (Principle 1).')
const researchCtx = ctxBlock('Task & baseline', { task: TASK, taskType: plan.taskType, method: plan.method, baseline })
const finders = await parallel([
() => agent(`${SKILLS_NOTE}\nResearch angle: LANDMARK PAPERS. Find the landmark/anchor paper(s) for this task/domain and read their METHODOLOGY sections (not abstracts). Use huggingface-skills:huggingface-papers (paper markdown + metadata) and hf-cli papers.${researchCtx}Return findings + sources.`, { label: 'research:landmark', phase: 'Research', agentType: 'Explore', schema: FINDER_SCHEMA }),
() => agent(`${SKILLS_NOTE}\nResearch angle: DOWNSTREAM / CITATION work. Crawl recent work that cites and improves the anchor papers. NOTE the capability gap: there is no full citation-graph API — use the paper pages' linked artifacts (huggingface-skills:huggingface-papers) PLUS host web search/fetch (WebSearch/WebFetch) as a fallback, and set fidelity="reduced" while saying so.${researchCtx}Return findings + sources.`, { label: 'research:citations', phase: 'Research', agentType: 'Explore', schema: FINDER_SCHEMA }),
() => agent(`${SKILLS_NOTE}\nResearch angle: WORKING EXAMPLE CODE with CURRENT APIs. Find working examples for ${plan.method} using the current library APIs. Strongly prefer the reference scripts shipped by ${profile.skill} (read its SKILL.md + scripts/ + references/). Capture correct imports and trainer/config argument names verbatim.${researchCtx}Return findings + sources.`, { label: 'research:examples', phase: 'Research', agentType: 'Explore', schema: FINDER_SCHEMA }),
() => agent(`${SKILLS_NOTE}\nResearch angle: DATASETS that produced the reported results. Identify which datasets the strong results used and CONFIRM they exist/are usable (huggingface-skills:huggingface-datasets / hf-cli).${researchCtx}Return findings + sources.`, { label: 'research:datasets', phase: 'Research', agentType: 'Explore', schema: FINDER_SCHEMA }),
])
const goodFinders = finders.filter(Boolean)
const research = await agent(
`${SKILLS_NOTE}\nSynthesize the research below into a single recipe. Every extracted fact MUST be attributable to a specific reported result (e.g. "dataset X + method Y → score Z on benchmark B"). Capture code patterns with correct imports + current trainer/config argument names. If the citation crawler reported reduced fidelity, set citationGraphFidelity="reduced". List any unverifiedClaims.${researchCtx}${ctxBlock('Finder outputs', goodFinders)}Return a RESEARCH object.`,
{ label: 'research:synthesis', phase: 'Research', schema: RESEARCH_SCHEMA }
)
result.research = research
// --- Phase 2: Verify Research (adversarial critic) --------------------------
phase('Verify Research')
const researchCritic = await agent(
`${SKILLS_NOTE}\nAdversarially verify this research recipe (READ-ONLY, tool-backed). Confirm the cited papers, datasets, and models actually EXIST (hf-cli / huggingface-papers / huggingface-datasets) and that the code patterns use REAL current APIs (not hallucinated imports/args). Cheap existence/API checks are mustReblock if they fail. "Is every result truly attributable?" is a soft flag (severity warn), NOT a reblock. Set mustReblock=true only for hard problems: a cited paper/dataset/model does not exist, or an import/API is clearly hallucinated.${ctxBlock('Research', research)}Return a CRITIC object.`,
{ label: 'verify-research', phase: 'Verify Research', agentType: 'Explore', schema: CRITIC_SCHEMA }
)
result.researchCritic = researchCritic
if (researchCritic && researchCritic.mustReblock) {
log('Research critic blocked — one corrective re-research pass.')
const reFix = await agent(`${SKILLS_NOTE}\nThe research critic flagged blocking issues. Re-research ONLY what is broken and return a corrected RESEARCH object.${ctxBlock('Critic', researchCritic)}${ctxBlock('Prior research', research)}`, { label: 'research:refix', phase: 'Verify Research', schema: RESEARCH_SCHEMA })
if (reFix) result.research = reFix
}
const RESEARCH = result.research
// --- Phase 3: Resources (verify model + dataset; size hardware) -------------
phase('Resources')
log('Plan: resource discovery + validation. No silent substitution (R8).')
const resVerify = await parallel([
() => agent(`${SKILLS_NOTE}\nVerify the MODEL. ${baseline.model && baseline.model !== 'unspecified' ? `The user requested "${baseline.model}" — confirm it exists and inspect architecture/size/tokenizer/license. If it does NOT exist or is unusable, set needsUserDecision=true (do NOT substitute, R8).` : `The user did not name a model — evaluate 3-5 candidates with huggingface-skills:huggingface-best and recommend one on task-fit/quality/size/cost.`} Also recommend a hardware flavor sized to the model footprint (R6).${ctxBlock('Plan+research', { plan, recipe: RESEARCH && RESEARCH.recipe })}Return the modelVerified + hardware portions of a RESOURCE object (leave datasetVerified.exists as your best guess).`, { label: 'resources:model', phase: 'Resources', agentType: 'Explore', schema: RESOURCE_SCHEMA }),
() => agent(`${SKILLS_NOTE}\nVerify the DATASET exists and is loadable (hf-cli / huggingface-skills:huggingface-datasets). ${baseline.dataset && baseline.dataset !== 'unspecified' ? `The user requested "${baseline.dataset}". If it cannot be loaded, set needsUserDecision=true (do NOT substitute, R8).` : `The user did not name a dataset — pick from the research recipe and confirm it loads.`}${ctxBlock('Plan+research', { plan, recipe: RESEARCH && RESEARCH.recipe })}Return the datasetVerified portion of a RESOURCE object.`, { label: 'resources:dataset', phase: 'Resources', agentType: 'Explore', schema: RESOURCE_SCHEMA }),
])
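// Merge the two partial RESOURCE objects. A missing part defaults to exists: false, so the gate below fails closed.
const [modelRes, dataRes] = resVerify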
const resources = {
modelVerified: (modelRes && modelRes.modelVerified) || { exists: false },
datasetVerified: (dataRes && dataRes.datasetVerified) || { exists: false },
hardware: (modelRes && modelRes.hardware) || { flavor: 'unknown', rationale: '' },
needsUserDecision: !!((modelRes && modelRes.needsUserDecision) || (dataRes && dataRes.needsUserDecision)),
note: [modelRes && modelRes.note, dataRes && dataRes.note].filter(Boolean).join(' | '),
}
result.resources = resources
if (resources.needsUserDecision || !resources.modelVerified.exists || !resources.datasetVerified.exists) {
result.stoppedReason = 'resource_unavailable'
log('STOP — a requested model/dataset is unavailable or unusable. Per R8 the workflow asks the user rather than substituting.')
return result
}
// --- Phase 4: Data Audit (+ format-compatibility critic) --------------------
phase('Data Audit')
log('Plan: data audit. Validate format-to-method compatibility before any job (§5.4, R9).')
const audit = await agent(
`${SKILLS_NOTE}\nAudit the dataset "${resources.datasetVerified.id}" BEFORE any training. Inspect schema/columns, rows-per-split, value distributions for key columns, and sample rows (huggingface-skills:huggingface-datasets${profile.inspector ? ` and the skill's ${profile.inspector}` : ''}). Surface class imbalance / missing values / unexpected formats / duplicates. VALIDATE the format against the method "${plan.method}", which requires: ${(profile.schemaRules && profile.schemaRules[plan.method]) || 'the documented schema for this method (confirm during research)'}.${ctxBlock('Resources', resources)}Return a DATA_AUDIT object.`,
{ label: 'data-audit', phase: 'Data Audit', schema: DATA_AUDIT_SCHEMA }
)
result.audit = audit
const formatCritic = await agent(
`${SKILLS_NOTE}\nCritic: confirm the dataset format is compatible with method "${plan.method}" (required: ${(profile.schemaRules && profile.schemaRules[plan.method]) || 'documented schema'}). If incompatible AND it can be fixed by column remapping / a formatting_func, set ok=true with the mapping as the recommendation. If the dataset genuinely lacks the required fields, set mustReblock=true (this is an R8 stop — ask the user, do not swap dataset).${ctxBlock('Audit', audit)}Return a CRITIC object.`,
{ label: 'verify-data-format', phase: 'Data Audit', agentType: 'Explore', schema: CRITIC_SCHEMA }
)
result.formatCritic = formatCritic
if (formatCritic && formatCritic.mustReblock) {
result.stoppedReason = 'dataset_format_incompatible'
log('STOP — dataset format is incompatible with the method and cannot be remapped. Asking the user (R8).')
return result
}
// --- Early exits: eval-only / data-only -------------------------------------
if (profile.evalOnly) {
phase('Evaluate & Persist')
log('eval-only task — skipping train/smoke/preflight/full-job.')
const evalOut = await agent(`${SKILLS_NOTE}\nEvaluate model "${resources.modelVerified.id}" on the benchmark/task the user asked for, using ${profile.evalSkill} (inspect-ai/lighteval; pick the backend per the skill). Use --limit for a smoke check first, then the real eval. Report the metric, value, and a results URL.${ctxBlock('Context', { task: TASK, resources, research: RESEARCH })}Return an EVAL object.`, { label: 'eval', phase: 'Evaluate & Persist', schema: EVAL_SCHEMA })
result.eval = evalOut
return await finalize(result, plan, baseline)
}
if (profile.dataOnly) {
phase('Implement')
log('data-only task — processing/producing a dataset and persisting it.')
const dataOut = await agent(`${SKILLS_NOTE}\nPerform the data task the user asked for (build/transform/filter/upload a dataset) using huggingface-skills:huggingface-datasets and hf-cli. Persist the result to the Hub under ${cfg.hubOrg ? cfg.hubOrg + '/' : ''}a concrete repo and return the dataset URL (R1/R14).${ctxBlock('Context', { task: TASK, resources, audit, research: RESEARCH })}Return an EVAL object (use modelUrl for the dataset URL).`, { label: 'data-task', phase: 'Implement', schema: EVAL_SCHEMA })
result.eval = dataOut
return await finalize(result, plan, baseline)
}
// --- Phase 5: Implement (adapt reference script; wire monitoring/persistence) ---
phase('Implement')
log('Plan: implement from research (not memory). Sandbox-first; write train.py + eval.py to ' + WORK_DIR + '.')
const referenceScript = (profile.scriptByMethod && profile.scriptByMethod[plan.method]) || 'the closest shipped reference script'
const impl = await agent(
`${HARD_RULES}
You are the implementer. ${SKILLS_NOTE}
Adapt the shipped reference script "${referenceScript}" from ${profile.skill} (READ it from ${SKILLS_DIR}/skills/ — do NOT synthesize from memory; Principle 1). Write two scripts into ${WORK_DIR}/ (create the dir): train.py and eval.py. The scripts must:
- Use model "${resources.modelVerified.id}" and dataset "${resources.datasetVerified.id}" with the verified columns from the audit (apply any required mapping/formatting_func).
- Wire huggingface-skills:huggingface-trackio: log metrics AND emit structured ALERTS. Trackio alerts are FREE-TEXT {title,text,level,step}; put the numeric value + actionable suggestion inside text (e.g. "loss=12.4 at step 200 — lr likely too high, try x0.1"). Levels: ERROR (divergence/NaN/OOM), WARN (overfitting/early-stop/KL-spike), INFO (milestones).
- Set push_to_hub to a CONCRETE destination ${cfg.hubOrg ? `under "${cfg.hubOrg}/"` : '(a real org/name — NEVER leave the template "username/...")'}.
- Size the timeout to the work (R5) and choose the flavor "${resources.hardware.flavor}" (R6). Expose batch_size / gradient_accumulation / gradient_checkpointing knobs so the R7 OOM ladder can be applied without a rewrite.
- Support SMOKE knobs via env/argv: when SMOKE=1 → max_steps=1, eval at 1 step, a tiny slice (e.g. train[:8]), per_device_train_batch_size=1, report_to none, push_to_hub False, force CPU + fp32 (mixed precision is GPU-only and will error/no-op on CPU). If the real model is too large to load on CPU, parameterize a tiny PROXY model for the smoke and record it in smokeKnobs.proxyModel (the real model load is then first exercised on GPU preflight).
- Be SELF-CONTAINED (PEP-723 inline deps) and reference NO local paths — the remote job receives the source BY VALUE.
${ctxBlock('Plan / research recipe / resources / audit', { plan, recipe: RESEARCH && RESEARCH.recipe, codePatterns: RESEARCH && RESEARCH.codePatterns, resources, audit })}
Estimate preflight and full-job USD cost (use the skill's ${profile.costEstimator || 'cost guidance'}). Return an IMPL object; trainScriptContent MUST be the full inline source of train.py.`,
{ label: 'implement', phase: 'Implement', schema: IMPL_SCHEMA }
)
result.impl = impl
if (!impl) { result.stoppedReason = 'implementation_failed'; return result }
// --- Phase 6: Code Review (critic) ------------------------------------------
phase('Code Review')
const codeCritic = await agent(
`${HARD_RULES}\nCritic: review ${WORK_DIR}/train.py and ${WORK_DIR}/eval.py (Read them) BEFORE any execution. Check for: hallucinated imports / non-existent APIs; wrong trainer/config argument names (verify against current docs, ${SKILLS_NOTE}); LOCAL-PATH leakage (remote job runs by value); missing trackio monitoring; persistence destination left as the template "username/..." instead of a concrete org/name (a likely copy-paste bug — flag as error); missing/short timeout; OOM-readiness (batch/accum/grad-checkpoint knobs exposed for the R7 ladder); building heavy deps from source where a prebuilt wheel exists (R12); SMOKE knobs honored (max_steps=1 + CPU + fp32). mustReblock=true for any error-severity issue.${ctxBlock('Impl summary', { ...impl, trainScriptContent: '(omitted; read the file)' })}Return a CRITIC object.`,
{ label: 'code-review', phase: 'Code Review', agentType: 'Explore', schema: CRITIC_SCHEMA }
)
result.codeCritic = codeCritic
if (codeCritic && codeCritic.mustReblock) {
log('Code review blocked — applying fixes before any run.')
const fixed = await agent(`${SKILLS_NOTE}\nFix the issues the code critic raised in ${WORK_DIR}/train.py and ${WORK_DIR}/eval.py (edit the files). Return an updated IMPL object (trainScriptContent = full corrected train.py source).${ctxBlock('Critic', codeCritic)}${ctxBlock('Impl', { ...impl, trainScriptContent: '(read the file)' })}`, { label: 'code-fix', phase: 'Code Review', schema: IMPL_SCHEMA })
if (fixed) result.impl = fixed
}
const IMPL = result.impl
// --- Phase 7: CPU Smoke (free, local: 1 train step + 1 eval step) -----------
phase('CPU Smoke')
log('Plan: LOCAL CPU smoke — 1 train step + 1 eval step (free; catches import/dep/arg/schema/wiring errors before any paid GPU job).')
const smoke = await attemptWithRetry({
label: 'cpu-smoke', phaseName: 'CPU Smoke', maxAttempts: SMOKE_RETRIES,
submit: (attempt) => agent(
`${HARD_RULES}\nRun the LOCAL CPU smoke test: execute ${WORK_DIR}/train.py with SMOKE=1 forcing CPU + fp32, exactly 1 training step, then ${WORK_DIR}/eval.py for 1 eval step (use \`uv run\` so PEP-723 deps resolve). This is a "does the code execute end-to-end for one step" check: a HIGH loss is NOT a failure; an exception/import/arg/schema error IS. If the real model "${resources.modelVerified.id}" cannot load on CPU, use the proxy model "${(IMPL.smokeKnobs && IMPL.smokeKnobs.proxyModel) || '(choose a tiny same-architecture model)'}" and set usedProxyModel=true. Capture the exact error excerpt on failure.${ctxBlock('Smoke knobs', IMPL.smokeKnobs)}Return a SMOKE object.`,
{ label: `cpu-smoke:run${attempt}`, phase: 'CPU Smoke', schema: SMOKE_SCHEMA }
),
analyze: (attempt, res, ctx) => agent(analysisPrompt('LOCAL CPU smoke test', res, ctx, baseline), { label: `cpu-smoke:analyze${attempt}`, phase: 'CPU Smoke', schema: FAILURE_ANALYSIS_SCHEMA }),
applyFix: (analysis) => agent(`${SKILLS_NOTE}\nApply this MINIMAL fix to ${WORK_DIR}/train.py and/or eval.py (edit the files), preserving the user's baseline. Do not change anything else.${ctxBlock('Fix', analysis)}Return a one-line confirmation of what you changed.`, { label: 'cpu-smoke:fix', phase: 'CPU Smoke' }),
})
result.cpuSmoke = smoke
// Fail closed: a missing smoke result is treated as a failed smoke, not a pass.
if (!smoke || smoke.stopped) {
result.stoppedReason = 'cpu_smoke_' + ((smoke && smoke.stoppedReason) || 'failed')
log('STOP — CPU smoke could not be made to pass without violating the rules. Surfacing to the user.')
return result
}
// --- Phase 8: GPU Preflight (tiny billable job) -----------------------------
phase('GPU Preflight')
const preflightEst = (IMPL.estimatedPreflightUSD || 0.5)
if (!canSpend(preflightEst)) {
result.stoppedReason = 'cost_cap_preflight'
log(`STOP before GPU preflight — estimated $${preflightEst} would exceed the cost cap ($${COST_CAP_USD}, spent $${spentUSD}). Raise costCapUSD to proceed.`)
return result
}
const inv1 = assertSubmitInvariant(IMPL)
if (!inv1.ok) {
result.stoppedReason = 'submit_invariant_preflight'
log('STOP — pre-submit invariant failed for preflight: ' + inv1.missing.join('; '))
return result
}
log('Plan: GPU preflight — tiny subset on a small flavor; validates CUDA / mixed precision / real model load / small-scale OOM (CPU could not). Billable.')
const preflight = await attemptWithRetry({
label: 'gpu-preflight', phaseName: 'GPU Preflight', maxAttempts: MAX_JOB_RETRIES,
submit: (attempt, ctx) => agent(
`${HARD_RULES}\nSubmit a TINY GPU PREFLIGHT job via huggingface-skills:hf-cli (\`hf jobs run\`), using ${profile.skill} for the run command. Override the skill's "submit immediately / don't poll" default: pass the training script BY VALUE (inline ${WORK_DIR}/train.py source), a tiny subset, a SMALL flavor, a SHORT timeout, and HF_TOKEN as a secret. This exercises the same imports + REAL model-loading path + training entrypoint on GPU. Then POLL \`hf jobs logs\` until it reaches a terminal state; read the FULL logs. ${attempt > 1 ? 'Apply the fix already made to the local files (re-inline the current source).' : ''} Report status (success only if it completed and pushed/ran cleanly), healthy (did training actually start), jobId, jobUrl, dashboardUrl, estimatedUSD, and any trackio alerts (\`trackio list alerts --json\`). On failure capture the exact error excerpt.${ctxBlock('Impl', { ...IMPL, trainScriptContent: '(inline the current file content)', smokeKnobs: IMPL.smokeKnobs })}${ctxBlock('Resources', resources)}Return a JOB object.`,
{ label: `gpu-preflight:run${attempt}`, phase: 'GPU Preflight', schema: JOB_SCHEMA }
),
analyze: (attempt, res, ctx) => agent(analysisPrompt('GPU preflight job', res, ctx, baseline), { label: `gpu-preflight:analyze${attempt}`, phase: 'GPU Preflight', schema: FAILURE_ANALYSIS_SCHEMA }),
applyFix: (analysis) => agent(`${SKILLS_NOTE}\nApply this MINIMAL fix to ${WORK_DIR}/train.py and/or eval.py (edit the files), preserving the user's baseline and following the R7 OOM ladder order if it is an OOM.${ctxBlock('Fix', analysis)}Return a one-line confirmation.`, { label: 'gpu-preflight:fix', phase: 'GPU Preflight' }),
})
result.preflight = preflight
recordSpend(preflight && preflight.estimatedUSD ? preflight.estimatedUSD : preflightEst)
// Fail closed: a missing preflight result must not unlock the full job.
if (!preflight || preflight.stopped) {
result.stoppedReason = 'gpu_preflight_' + ((preflight && preflight.stoppedReason) || 'failed')
log('STOP — GPU preflight could not be made to pass within the rules/retries. Surfacing to the user.')
return result
}
// --- Phase 9: Job Readiness Gate (one-time) ---------------------------------
phase('Job Readiness')
const checklist = await agent(
`${HARD_RULES}\nGate before the FULL job (§5.6 pre-flight checklist). Verify EACH item and set allSatisfied only if all hold: reference implementation cited (from research); dataset format verified (audit); GPU smoke ok AND the REAL model was loaded on GPU (if the CPU smoke used a proxy, the real load must have happened in preflight); persistence destination is a concrete org/name (not "username/..."); timeout sized to the work; monitoring wired to a live trackio dashboard; code reaches the job BY VALUE.${ctxBlock('Evidence', { referenceScript, research_has_recipe: !!(RESEARCH && RESEARCH.recipe && RESEARCH.recipe.length), audit: { formatCompatible: audit && audit.formatCompatible }, preflight: { healthy: preflight && preflight.healthy }, cpuSmoke: { usedProxyModel: result.cpuSmoke && result.cpuSmoke.usedProxyModel }, impl: { persistenceDest: IMPL.persistenceDest, timeoutHours: IMPL.timeoutHours, monitoringWired: IMPL.monitoringWired } })}Return a CHECKLIST object.`,
{ label: 'job-readiness', phase: 'Job Readiness', agentType: 'Explore', schema: CHECKLIST_SCHEMA }
)
result.checklist = checklist
if (!checklist || !checklist.allSatisfied) {
result.stoppedReason = 'preflight_checklist_unsatisfied'
log('STOP — pre-flight checklist not satisfied: ' + ((checklist && checklist.missing) || ['unknown']).join('; '))
return result
}
// --- Phase 10: Full Job (one-job-first; monitor alerts) ---------------------
phase('Full Job')
const fullEst = (IMPL.estimatedFullJobUSD || 5)
if (!canSpend(fullEst)) {
result.stoppedReason = 'cost_cap_full_job'
log(`STOP before the full job — estimated $${fullEst} would exceed the cost cap ($${COST_CAP_USD}, spent $${spentUSD}). Raise costCapUSD to proceed. The job is fully prepared and validated; only the spend is gated.`)
return result
}
const inv2 = assertSubmitInvariant(IMPL)
if (!inv2.ok) { result.stoppedReason = 'submit_invariant_full'; log('STOP — full-job invariant failed: ' + inv2.missing.join('; ')); return result }
log('Plan: FULL job — one-job-first. Submit ONE, confirm healthy from logs, monitor trackio alerts; an ERROR alert drives a bounded corrective retry. We stop at the first verified result (no grid sweep).')
const fullJob = await attemptWithRetry({
label: 'full-job', phaseName: 'Full Job', maxAttempts: MAX_JOB_RETRIES,
submit: (attempt, ctx) => agent(
`${HARD_RULES}\nSubmit and monitor the FULL training job via huggingface-skills:hf-cli (\`hf jobs run\`), driven by ${profile.skill}. ONE job only (one-job-first). Pass the script BY VALUE (inline current ${WORK_DIR}/train.py), the SIZED timeout (${IMPL.timeoutHours}h), flavor "${resources.hardware.flavor}", persistence to "${IMPL.persistenceDest}", HF_TOKEN as a secret, monitoring on. Override the skill's "don't poll" default: POLL \`hf jobs logs\` first to CONFIRM HEALTHY (a training step advancing / first metric logged) — set healthy accordingly — then keep monitoring trackio alerts via \`trackio list alerts --json\`${ctx.sinceCursor ? ` (use --since "${ctx.sinceCursor}")` : ''}. ${attempt > 1 ? 'A corrective fix was applied to the local files; re-inline the current source.' : ''} status=success only if the run completed AND the model was pushed. Return jobId/jobUrl/dashboardUrl, alerts[], lastAlertTimestamp (most recent alert timestamp, for the next poll), estimatedUSD, and on failure the exact errorExcerpt. Treat an ERROR-level alert (divergence/NaN/OOM) as a failure to be analyzed.${ctxBlock('Impl', { persistenceDest: IMPL.persistenceDest, timeoutHours: IMPL.timeoutHours, flavor: resources.hardware.flavor })}Return a JOB object.`,
{ label: `full-job:run${attempt}`, phase: 'Full Job', schema: JOB_SCHEMA }
),
analyze: (attempt, res, ctx) => agent(analysisPrompt('FULL training job (crash OR an ERROR-level divergence/NaN/OOM alert)', res, ctx, baseline), { label: `full-job:analyze${attempt}`, phase: 'Full Job', schema: FAILURE_ANALYSIS_SCHEMA }),
applyFix: (analysis) => agent(`${SKILLS_NOTE}\nApply this MINIMAL corrective fix to ${WORK_DIR}/train.py (edit the file), preserving the baseline. For divergence use the alert-suggested change (e.g. lr x0.1); for OOM follow the R7 ladder order keeping the effective batch identical.${ctxBlock('Fix', analysis)}Return a one-line confirmation.`, { label: 'full-job:fix', phase: 'Full Job' }),
})
result.fullJob = fullJob
recordSpend(fullJob && fullJob.estimatedUSD ? fullJob.estimatedUSD : fullEst)
if (fullJob && fullJob.dashboardUrl) result.artifacts.push({ type: 'dashboard', url: fullJob.dashboardUrl })
// Fail closed: without a job result there is no trained model to evaluate.
if (!fullJob || fullJob.stopped) {
result.stoppedReason = 'full_job_' + ((fullJob && fullJob.stoppedReason) || 'failed')
log('STOP — the full job did not reach a healthy completion within the rules/retries. Surfacing to the user with the diagnosis.')
return result
}
// --- Phase 11: Evaluate & Persist -------------------------------------------
phase('Evaluate & Persist')
log('Plan: evaluate the trained model and confirm it is persisted (§5.8).')
const evalOut = await agent(
`${HARD_RULES}\nEvaluate the trained model at "${IMPL.persistenceDest}" and CONFIRM it actually works (not merely produced). Use ${profile.evalSkill} (smoke with --limit first, then a real eval on the relevant benchmark/task). Verify the model is persisted on the Hub (hf-cli repo details). Report the metric+value, the eval/results URL, the model URL, and the dashboard URL.${ctxBlock('Context', { task: TASK, persistenceDest: IMPL.persistenceDest, dashboardUrl: fullJob && fullJob.dashboardUrl, recipe: RESEARCH && RESEARCH.recipe })}Return an EVAL object.`,
{ label: 'evaluate', phase: 'Evaluate & Persist', schema: EVAL_SCHEMA }
)
result.eval = evalOut
return await finalize(result, plan, baseline)
// ---------------------------------------------------------------------------
// Final verification (tool-backed conformance against the spec §9).
// finalize is a function declaration, so it is hoisted and callable from the returns above.
// ---------------------------------------------------------------------------
async function finalize(result, plan, baseline) {
phase('Final Verification')
const completion = await agent(
`${HARD_RULES}\nFinal conformance check (READ-ONLY, TOOL-BACKED — verify, do not trust prior claims). Confirm against the spec's completion criteria: (1) research preceded implementation; (2) resources verified incl. dataset-format-to-method; (3) GPU code smoke-tested before scaling (or omission justified for eval/data tasks); (4) pre-flight checklist satisfied, one-job-first, persistence configured up front; (5) monitoring emitted structured alerts; (6) the result is PERSISTED (resolve the repo via hf-cli) and EVALUATED (a real metric value exists) and the dashboard URL resolves; (7) NO rule in §6 was violated — in particular no silent scope change vs the baseline ${JSON.stringify(baseline)}; (8) every artifact is a direct URL (R14). List anything missing.${ctxBlock('Run state', { plan, baseline, resources: result.resources, audit: result.audit && { formatCompatible: result.audit.formatCompatible }, cpuSmoke: result.cpuSmoke && result.cpuSmoke.status, preflight: result.preflight && { status: result.preflight.status, healthy: result.preflight.healthy }, fullJob: result.fullJob && { status: result.fullJob.status, dashboardUrl: result.fullJob.dashboardUrl }, eval: result.eval, stoppedReason: result.stoppedReason })}Return a COMPLETION object.`,
{ label: 'final-verification', phase: 'Final Verification', agentType: 'Explore', schema: COMPLETION_SCHEMA }
)
result.completion = completion
if (completion && completion.artifacts) {
for (const a of completion.artifacts) if (a && a.url) result.artifacts.push(a)
}
result.conforms = !!(completion && completion.conforms)
log(result.conforms ? 'CONFORMS — verified result delivered.' : 'DOES NOT FULLY CONFORM — see completion.missing.')
return result
}
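
The GPU Preflight and Full Job phases above gate every billable submission on `canSpend(...)` and record the outcome with `recordSpend(...)`. Those helpers, along with `COST_CAP_USD` and `spentUSD`, are defined outside this section, so the following is only a minimal sketch of the contract the call sites appear to rely on. The factory `makeBudget` and the "spent + estimate must not exceed the cap" rule are assumptions made for illustration, not the script's actual implementation.

```javascript
// Hypothetical sketch of a cost-cap guard. makeBudget is an illustrative name;
// the real script uses module-level COST_CAP_USD / spentUSD defined elsewhere.
function makeBudget(capUSD) {
  let spentUSD = 0
  return {
    // Assumed rule: a submission is allowed only if it keeps total spend within the cap.
    canSpend: (estimateUSD) => spentUSD + estimateUSD <= capUSD,
    // Record the reported (or, failing that, estimated) cost after a job returns.
    recordSpend: (usd) => { spentUSD += usd },
    spent: () => spentUSD,
  }
}

// Usage mirroring the phase order above, with the reference fallbacks 0.5 and 5.
const budget = makeBudget(10)
const preflightOk = budget.canSpend(0.5) // preflight fits under the cap
budget.recordSpend(0.5)
const fullOk = budget.canSpend(5)        // 0.5 + 5 is still within 10
budget.recordSpend(5)
const retryOk = budget.canSpend(5)       // 5.5 + 5 would exceed 10, so a retry is gated
```

Checking the cap before submission and recording spend afterwards keeps the gate conservative: a job that was never submitted costs nothing, and a job that ran is always counted even if it later fails.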