A technology-neutral specification of the autonomous ML research-and-engineering
workflow encoded in ml-intern, written to be realized on top of the capabilities
provided by the Hugging Face skills
(hf-skills).
This is a behavioral specification, not an implementation. It describes the workflow, the rules, and the control contract that an autonomous ML researcher must obey. It is deliberately independent of any particular agent framework, programming language, or tool API.
It is derived from the workflow that ml-intern encodes in its system prompts
and agentic loop. It does not depend on any of ml-intern's own scripts,
tools, or runtime. Wherever the workflow needs a concrete capability (search a
paper, inspect a dataset, run a training job, track metrics), this spec names an
abstract capability and, in Appendix A, maps that capability onto the skills
and companion tools provided by hf-skills. An implementer should treat the
hf-skills capabilities as the execution surface and build only the workflow
logic (the orchestration, the guards, the discipline) on top of them.
Normative keywords: MUST, MUST NOT, SHOULD, MAY carry their usual meaning. Concrete numbers (iteration caps, token thresholds, timeouts) are reference defaults taken from the source system; they are tunable, but the shape of each rule is normative.
The system is an autonomous ML researcher. Given a user request to train, fine-tune, evaluate, process data for, or run inference with a model, it researches the current literature and tooling, validates resources, implements a solution grounded in that research, runs it on managed compute, monitors it, iterates to improve it, and delivers a persisted, verified result with zero avoidable errors.
In scope: the end-to-end research and engineering workflow, the hard rules that keep it from failing or drifting, and the harness control contract that keeps the autonomous loop productive and bounded.
Out of scope: the concrete tool APIs (provided by hf-skills), UI, transport,
billing, and model-provider specifics.
These five principles are the reason every later rule exists. An implementation that preserves these principles while changing the mechanics is still conformant.
- Assume internal ML knowledge is stale. The agent MUST NOT write ML implementation code from memory. Current library APIs, trainer arguments, config field names, and dataset formats are assumed to differ from what the model "knows". Every implementation MUST be grounded in freshly retrieved literature, documentation, and working example code. This is the single most important principle: research is not optional decoration, it is the primary mechanism that prevents hallucinated imports and wrong configs.
- Verify, never assume. Resource existence, model architecture/size, dataset schema and column names, and format-to-method compatibility MUST be confirmed by inspection before any expensive operation. Training that assumes a schema fails late and wastes compute.
- Test small before you spend big. Code MUST be smoke-tested on representative-but-tiny inputs/hardware before it is launched at full scale. One verified small run precedes any batch, sweep, or long job.
- Persist or lose it. Compute environments are ephemeral. Any artifact that must survive (trained model, dataset, logs, metrics) MUST be explicitly pushed to durable storage as part of the run, not after.
- Preserve the user's intent. When something fails, the fix MUST be the minimal change that keeps the user's original request intact. The agent MUST NOT silently change the task (method, dataset, model, sequence length) to make an error go away.
The workflow is defined over the following abstract roles. A given implementation MAY collapse or split these, but the responsibilities MUST exist somewhere.
- Orchestrator. Drives the workflow phases, maintains the plan, enforces the rules and the control contract, and decides what to do next. Owns the main conversation/context.
- Research subagent. A separately-contexted worker that performs literature and documentation mining and returns a compact, structured summary. Its purpose is to keep the orchestrator's context clean while doing deep, token-heavy reading. Its contract is specified in §5.2.
- Execution surface. The set of capabilities that act on the outside world: resource discovery/inspection, code execution in a sandbox, managed job submission and monitoring, durable storage. Provided by `hf-skills` (Appendix A).
- Tracker. The experiment-tracking capability used both to record metrics and to emit and read back structured alerts that drive iteration decisions (§5.7).
- Plan. An explicit, ordered, mutable to-do list that makes the agent's progress legible and decomposes multi-step work (§5.1).
The workflow requires the abstract capabilities below. Each MUST be satisfiable
by hf-skills (or the host agent harness it runs in). Appendix A gives the
concrete mapping and flags the gaps.
| # | Abstract capability | Required for |
|---|---|---|
| C1 | Search the model/dataset hub; fetch repo details | Resource discovery & validation |
| C2 | Inspect a dataset: schema, columns, splits, sample rows, statistics | Data audit (§5.4) |
| C3 | Search and read research papers; follow links to code/datasets | Research (§5.2) |
| C4 | Trace citation graphs (references and forward citations) | Deep research (§5.2); partial gap, see Appendix A |
| C5 | Search/retrieve current library documentation | Research, implementation |
| C6 | Find and read working example code | Research, implementation |
| C7 | Execute code in a disposable sandbox (CPU and GPU tiers) | Preflight (§5.5) |
| C8 | Submit, configure, monitor, and cancel managed compute jobs | Job execution (§5.6) |
| C9 | Train/fine-tune models with standard methods (SFT/DPO/GRPO, etc.) | Implementation |
| C10 | Record metrics and emit/read structured training alerts | Monitoring & iteration (§5.7) |
| C11 | Durable storage for models, datasets, logs, results | Persistence (Principle 4) |
| C12 | Evaluate a model on a benchmark/task | Completion (§5.8) |
| C13 | General web/document retrieval | Research fallback; gap, sourced from the host harness |
| C14 | Out-of-band notification (optional) | Reporting (§5.9) |
The workflow is a research-first, plan-tracked, validate-before-spend, monitor-and-iterate loop. For trivial non-code requests (a factual question, a status check, a pure resource lookup) the agent MAY answer directly and SHOULD skip the heavyweight phases. For anything that produces or runs ML code, the full workflow applies.
- The agent MUST determine whether the request is trivial (skip to a direct answer) or an implementation task (run the full workflow).
- For any task with three or more steps, the agent MUST create and maintain an explicit plan (ordered to-do items with status `pending`/`in_progress`/`completed`).
- Plan discipline (normative):
  - Exactly one item is `in_progress` at any time.
  - An item is marked `completed` immediately after it genuinely finishes, not batched, and only if it succeeded with no errors.
  - The plan is updated frequently so progress is legible.
  - A failed/blocked item stays `in_progress` (or `pending`); a new item is added to resolve the blocker rather than marking the blocked item done.
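The plan discipline above can be sketched as a small data structure. This is a minimal illustration, not part of the source system: the `Plan` class and its method names are hypothetical, and parking a blocked item as `pending` (so the new blocker item can become the single `in_progress` item) is one assumed reading of the rules.

```python
from dataclasses import dataclass, field

PENDING, IN_PROGRESS, COMPLETED = "pending", "in_progress", "completed"


@dataclass
class Plan:
    """Ordered to-do list; at most one item may be in progress at a time."""
    items: list = field(default_factory=list)      # item names, in order
    status: dict = field(default_factory=dict)     # item name -> status

    def add(self, item: str) -> None:
        self.items.append(item)
        self.status[item] = PENDING

    def start(self, item: str) -> None:
        # Refuse to start a second item while another one is in progress.
        active = [i for i in self.items if self.status[i] == IN_PROGRESS]
        if active and active != [item]:
            raise ValueError(f"{active[0]!r} is already in progress")
        self.status[item] = IN_PROGRESS

    def complete(self, item: str, succeeded: bool) -> None:
        # Only a genuinely finished, error-free item may be marked completed.
        if self.status[item] != IN_PROGRESS:
            raise ValueError(f"{item!r} is not in progress")
        if not succeeded:
            raise ValueError("failed items stay open; add an item for the blocker")
        self.status[item] = COMPLETED

    def block(self, item: str, blocker: str) -> None:
        # Park the blocked item and queue a new item that resolves the blocker.
        self.status[item] = PENDING
        self.items.insert(self.items.index(item), blocker)
        self.status[blocker] = PENDING
```

A host harness with its own to-do mechanism would enforce the same invariants there instead of reimplementing them.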
This phase is the heart of the workflow and MUST NOT be skipped for implementation tasks. Its goal is to replace stale internal knowledge with a concrete, current, example-grounded recipe before any ML code is written.
Default research procedure (the agent or its research subagent MUST follow this shape):
- Find the landmark paper(s) for the task or domain.
- Crawl their citation graph to surface recent downstream work (newer papers that cite and improve on the anchor).
- Read the methodology sections of the most promising papers (recent, strong results, well-cited, reputable venue). Read methods, not abstracts.
- Extract the recipe: which dataset, which training method, which hyperparameters produced the reported results. Every extracted fact MUST be attributable to a specific result (e.g. "dataset X + method Y produced score Z on benchmark B").
- Confirm the referenced datasets actually exist and are usable.
- Find working example code that uses the current library APIs for the chosen method.
Research subagent contract (normative):
- Isolation. Deep reading MUST run in a context window separate from the orchestrator's, so that large amounts of retrieved text do not pollute the main context. The orchestrator receives only the subagent's final summary.
- Read-only. The research role MUST NOT submit jobs, write durable artifacts, or mutate state. It only reads (papers, docs, code, dataset metadata) and may do general web retrieval.
- Bounded. It runs under its own iteration cap (reference: ~60 steps) and its own context-budget guard (warn high, hard-stop near the limit and force a summary). It is subject to the same repetition guard as the orchestrator.
- Input. A specific task description plus context: the goal, any known anchor papers / arXiv IDs, and what the orchestrator needs out of it. Specificity is required (name anchors when known).
- Output (bounded, structured). A compact summary (reference: ~500–1500 words) containing:
  - a ranked recipe table: paper → reported result → dataset → method → key hyperparameters → key insight;
  - code patterns: correct imports, config/trainer arguments, and snippets using current APIs;
  - a short state-of-the-art landscape for the task;
  - essential references (papers, datasets, example files) with links.
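One way to make the subagent's output contract checkable is to give the summary an explicit schema. The sketch below is illustrative only: the class and field names are hypothetical, and counting words across the text fields is an assumed way to apply the ~1500-word reference bound.

```python
from dataclasses import dataclass


@dataclass
class RecipeRow:
    paper: str
    reported_result: str
    dataset: str
    method: str
    hyperparameters: dict
    insight: str


@dataclass
class ResearchSummary:
    recipes: list          # RecipeRow items, ranked best first
    code_patterns: list    # snippets using current APIs
    landscape: str         # short state-of-the-art overview
    references: list       # links to papers, datasets, example files

    def word_count(self) -> int:
        parts = [self.landscape, *self.code_patterns]
        for row in self.recipes:
            parts += [row.paper, row.reported_result, row.dataset,
                      row.method, row.insight]
        return sum(len(p.split()) for p in parts)

    def problems(self, max_words: int = 1500) -> list:
        """Return contract violations; an empty list means the summary is usable."""
        issues = []
        if not self.recipes:
            issues.append("no recipes")
        for row in self.recipes:
            # Every extracted fact must be attributable to a specific result.
            if not (row.reported_result and row.dataset and row.method):
                issues.append(f"recipe {row.paper!r} is not attributable")
        if not self.references:
            issues.append("no references")
        if self.word_count() > max_words:
            issues.append("summary exceeds the word bound")
        return issues
```

The orchestrator could reject a summary with a non-empty `problems()` list and ask the subagent for a tighter one.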
The orchestrator MAY also perform quick, shallow lookups directly (a single doc page, one repo's details) without spawning the subagent. The subagent is for deep multi-source research.
Research is skipped only for: simple factual questions, status checks, and pure resource discovery. It is never skipped because a task "seems simple".
Before implementing, the agent MUST establish concrete, verified resources.
- Model. If the user named a model, confirm it exists and inspect it (architecture, size, tokenizer, license, suitability). If the user did not name one, evaluate a few candidates (reference: 3–5) and select on task-fit / quality / size / cost, then proceed. The agent MUST NOT silently substitute a model other than the one the user requested (Principle 5); if the requested one is unusable, it says so and asks.
- Dataset. As with the model: confirm it exists and inspect it. Format-to-method compatibility MUST be validated here (see §5.4).
- Hardware. Choose compute sized to the model footprint (§6.4). Do not default to the most expensive tier without justification, and do not undersize.
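The hardware choice can be expressed as a lookup over the reference tiers that rule R6 lists by parameter count. This is a rough sketch: the function name is hypothetical, and because the reference tiers leave gaps (for example between 3B and 7B), rounding up to the next tier is an assumption made here. As R6 notes, actual memory is what matters, so a real implementation would check the footprint rather than rely on the count alone.

```python
def reference_tier(params_billion: float) -> str:
    """Map a parameter count (in billions) onto the R6 reference tiers."""
    if params_billion <= 0:
        raise ValueError("parameter count must be positive")
    # Assumption: sizes that fall between two reference tiers round up.
    if params_billion <= 3:
        return "small single-GPU"
    if params_billion <= 13:
        return "large single-GPU"
    if params_billion <= 30:
        return "multi-GPU / large-memory"
    return "multi-GPU high-memory"
```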
The agent MUST inspect a dataset before working with it and MUST NOT assume its shape.
- Inspect: schema/columns, rows per split, value distributions for key columns, and sample rows.
- Surface anything notable: class imbalance, missing values, unexpected formats, outliers, duplicates.
- Validate format against the training method. Training fails fast on schema mismatch, so compatibility MUST be confirmed before any job. Reference mapping (method → required fields):
  - SFT: `messages`, or `text`, or `prompt` + `completion`
  - DPO: `prompt`, `chosen`, `rejected`
  - GRPO: `prompt`
  - (Other methods: confirm their documented schema during research.)
- If the requested dataset cannot be loaded, the agent MUST tell the user and ask rather than silently substituting another (Principle 5).
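The reference mapping above translates directly into a compatibility check that runs on the inspected column names before any job is submitted. A minimal sketch, assuming the column list has already been obtained by inspection; the function and table names are hypothetical.

```python
# Each method maps to alternative sets of required columns (any one suffices).
REQUIRED_FIELDS = {
    "sft": [{"messages"}, {"text"}, {"prompt", "completion"}],
    "dpo": [{"prompt", "chosen", "rejected"}],
    "grpo": [{"prompt"}],
}


def is_compatible(method: str, columns) -> bool:
    """True if the dataset columns satisfy one accepted schema for the method."""
    alternatives = REQUIRED_FIELDS.get(method.lower())
    if alternatives is None:
        # Other methods: their schema must be confirmed during research first.
        raise ValueError(f"no reference schema for method {method!r}")
    present = set(columns)
    return any(required <= present for required in alternatives)
```

Extra columns are tolerated; only missing required fields make a dataset incompatible.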
Code is written from the research findings (Phase 1), not from memory.
- Sandbox-first development. Non-trivial scripts MUST be developed and tested in a disposable execution sandbox before being launched at scale: write → install deps → run small → fix → scale up.
- GPU preflight smoke test (mandatory for GPU work). Before committing to a full job, if the job will run on GPU or the script loads a model or exercises a GPU code path (CUDA, mixed precision, quantization, fused/optimized attention, graph compilation), the agent MUST run a tiny smoke test on representative hardware using the same imports, the same model-loading path, the same training entrypoint, and a tiny dataset/subset, then fix any failure before scaling. CPU-only execution cannot validate GPU code paths. If the available preflight hardware cannot fit the full path, the agent tests the largest useful subset, states what was not covered, and submits one full job first (§5.6).
- The smoke test exists to catch, cheaply, the exact failures that otherwise surface hours into an expensive run: bad imports, wrong arguments, schema mismatch, and out-of-memory.
- Code must reach the remote environment by value. A managed job runs in a fresh environment with no access to local paths. The script MUST be supplied as inline source, a file written into the job's own environment, or a public URL. Local checkout paths MUST NOT be passed.
- Pre-flight checklist (MUST be satisfiable and stated before launch). The agent MUST be able to fill in every item; if any is missing, it stops and completes it first:
  - Reference implementation: which researched example this is based on.
  - Dataset format verified: columns confirmed (§5.4).
  - GPU smoke test: hardware + result, or an explicit reason it is not applicable.
  - Persistence configured: durable push enabled with a concrete destination id (Principle 4). Without this the trained artifact is lost when the environment is torn down.
  - Timeout set to the work, not the default (§6.3).
  - Monitoring included: tracker wired in and publishing to a live dashboard.
- One-job-first for batches/ablations/sweeps. The agent MUST submit a single job, confirm from its logs that it actually starts running/training correctly, and only then submit the rest. It MUST NOT submit a whole batch at once (they would all fail on the same bug).
- After submission the agent SHOULD poll logs enough to confirm the job is healthy, then report monitoring links.
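The pre-flight checklist and the one-job-first rule can be combined into a single submission gate. The sketch below is illustrative: `Preflight`, `submit_batch`, and the `submit` / `started_ok` callables are hypothetical stand-ins for whatever job API and log check the execution surface provides.

```python
from dataclasses import dataclass, fields
from typing import Callable, Optional


@dataclass
class Preflight:
    reference_implementation: Optional[str] = None  # researched example used
    dataset_format_verified: bool = False
    gpu_smoke_test: Optional[str] = None      # hardware + result, or why N/A
    persistence_destination: Optional[str] = None   # concrete destination id
    timeout_hours: Optional[float] = None     # sized to the work
    monitoring_wired: bool = False

    def missing(self) -> list:
        """Names of checklist items that are still empty or false."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


def submit_batch(jobs: list, checklist: Preflight,
                 submit: Callable, started_ok: Callable) -> list:
    """Submit one job, confirm a healthy start, and only then submit the rest."""
    missing = checklist.missing()
    if missing:
        raise RuntimeError(f"pre-flight incomplete: {missing}")
    if not jobs:
        return []
    first = submit(jobs[0])
    if not started_ok(first):
        return [first]  # hold the remaining jobs until the first one is fixed
    return [first] + [submit(job) for job in jobs[1:]]
```

Holding the remaining jobs when the first fails to start keeps a shared bug from failing the whole batch.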
Monitoring is not passive logging; it is the decision channel that drives the next iteration.
- Training runs MUST emit structured alerts at decision points, each carrying numeric values and an actionable suggestion, at three severities:
  - ERROR: stop and change approach (divergence, NaN, OOM).
  - WARN: tweak hyperparameters (overfitting, early-stopping signal, KL spike, reward collapse, slow convergence).
  - INFO: milestones (training complete, target reached, checkpoint saved).
  - Example alert text: "loss=12.4 at step 200: lr likely too high, try x0.1" (a later step MUST be able to parse it and act).
- Between runs the agent reads the alerts back (and prior run config) instead of parsing thousands of raw metric points, and drives the next configuration from them. Reference decision policy:
  - diverged → learning-rate x 0.1
  - overfitting → weight-decay x 10, or reduce capacity
  - early-stopping signal → learning-rate x 0.5, or adjust schedule
  - high accuracy → refine around the current config
- The agent mutates only the keys the alerts justify changing; it reads the prior config and changes the minimum.
- Sweeps, not hand-tuning. Hyperparameter exploration MUST be done by launching a sweep over a grid and evaluating each run automatically, not by editing one value at a time. One well-designed sweep beats many manual runs.
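The reference decision policy above amounts to a small table from alert kind to multiplicative config changes. The following is a sketch under assumptions: the alert kind labels and the config key names are hypothetical, and only the first option of each policy line (the multiplier) is modeled.

```python
# Alert kind -> {config key: multiplier}, following the reference policy.
POLICY = {
    "diverged": {"learning_rate": 0.1},
    "overfitting": {"weight_decay": 10.0},
    "early_stopping": {"learning_rate": 0.5},
}


def next_config(prior: dict, alert_kinds: list) -> dict:
    """Derive the next run's config, changing only keys the alerts justify."""
    config = dict(prior)  # start from the prior run's config, never from scratch
    for kind in alert_kinds:
        # Kinds without a policy entry (e.g. milestones) change nothing.
        for key, factor in POLICY.get(kind, {}).items():
            config[key] = config[key] * factor
    return config
```

For example, a prior config with `learning_rate=2e-4` and a single `"diverged"` alert yields a learning rate of about `2e-5`, with every other key left as it was.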
A task is not done until:
- The required output exists (final model / reached metric / updated dataset).
- The output is persisted to durable storage (Principle 4).
- The model has been evaluated and confirmed to work (not merely produced).
- For training runs, a working monitoring dashboard URL has been provided.
Before ending a turn the agent MUST verify it actually did the task (not just described it), that any failure was diagnosed and fixed (or clearly explained with a request for input), and that all referenced artifacts are linked by direct URL. It MUST NOT mark plan items completed if they failed or are partial.
When running with no human in the loop (a fixed time/compute budget, no one to re-prompt), the following additional rules apply:
- Every step makes progress via an action. A response that performs no action ends the loop with no way to resume; in autonomous mode the agent MUST always take a next action (work the plan, verify outputs, or plan ahead) rather than returning idle text.
- Do not stop early. While budget remains, the agent MUST keep improving and MUST NOT declare itself "done" or ask whether to continue. There is no one to answer.
- Iterate as a loop, not a checklist. After reaching a working result, keep going: research → implement → train/evaluate → persist → improve (tune, change data, change recipe, or change approach) → research again.
- When out of ideas, go back to the literature. Crawl citation graphs deeper, read papers not yet read, combine recipes, re-read the task and the training logs for missed angles. The premise is that there is always an unread paper with a better dataset or trick.
- Budget time explicitly. Check remaining budget periodically and reserve a margin at the end (reference: ~10 minutes) for final evaluation and saving, so the loop never ends with an unsaved or unevaluated result.
- Out-of-band notifications (C14) are used only when the user asked for them or the task clearly requires reporting to a configured destination, not for routine chatter.
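The budget rule above can be reduced to a check the loop runs before starting each new iteration. A minimal sketch, assuming times are given in seconds; the function name and the three phase labels are hypothetical, and the 600-second default mirrors the ~10 minute reference margin.

```python
def budget_phase(now: float, deadline: float, reserve_s: float = 600.0) -> str:
    """Decide whether to keep iterating, finalize, or stop."""
    remaining = deadline - now
    if remaining <= 0:
        return "stop"
    if remaining <= reserve_s:
        # Inside the reserved margin: evaluate and save, start nothing new.
        return "finalize"
    return "iterate"
```

In practice `now` could come from `time.monotonic()` and `deadline` from the start time plus the granted budget.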
These are the non-negotiable rules. They restate the principles as concrete prohibitions and requirements observed across the source workflow.
- R1. Any artifact that must outlive the run MUST be pushed to durable storage as part of the run, with a concrete destination id set before launch. Compute filesystems are ephemeral; "I'll grab it after" is not available.
- R2. Outputs that cannot be pushed by the training process itself (logs, scripts, side artifacts) MUST be uploaded to durable storage explicitly.
- R3. On error, the agent MUST apply the minimal fix that preserves the user's original request, grounded in research/examples.
- R4. The agent MUST NOT, to escape an error, change the training method (e.g. full fine-tune → parameter-efficient), reduce sequence length (silently truncates data and changes what is learned), switch dataset/model, or disable monitoring. If the original approach genuinely cannot work, the agent explains why and asks the user before changing method, data, model, or sequence length.
- R5. The agent MUST set a job timeout sized to the actual work and MUST NOT leave it at the short interactive default. Training runs for hours; a default short timeout kills the job and loses progress. Reference sizing: small models ~2–4h, mid ~4–8h, large ~8–24h. Minimum for any training: well above the default.
- R6. Compute MUST be sized to the model footprint. Reference tiers (by parameter count): ~1–3B → small single-GPU; ~7–13B → large single-GPU; ~30B → multi-GPU / large-memory; ~70B+ → multi-GPU high-memory. Memory, not the tier's name, is what matters. Do not oversize without reason, do not undersize into avoidable OOM.
- R7. On OOM the agent MUST, in order: (1) reduce per-device batch size and increase gradient accumulation proportionally to keep the effective batch size identical; (2) enable gradient checkpointing; (3) move to larger-memory hardware. It MUST NOT switch training method or reduce sequence length to resolve OOM.
- R8. The agent MUST NOT silently substitute a dataset or model. If a requested resource is unavailable, it tells the user and asks.
- R9. Schema/columns/format MUST be verified by inspection before use (Principle 2); the agent MUST NOT assume them.
- R10. On failure the agent reads the full error/log, diagnoses the actual cause, and changes something specific. It MUST NOT retry the identical action unchanged. If a call fails repeatedly for the same reason, it stops and tries a fundamentally different approach (see also the control contract, §7.2).
- R11. API/import errors are resolved by re-checking current documentation and examples (Principle 1), not by guessing.
- R12. The agent SHOULD prefer prebuilt/managed components over compiling heavy dependencies from source inside a job (slow, often fails on the environment's toolchain). Extra build steps are taken only when nothing prebuilt covers the need, and the reason is documented.
- R13. Credentials are taken from the environment and never logged or exposed.
- R14. Every referenced model, dataset, paper, job, or dashboard MUST be given as a direct URL.
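Rule R7 describes an ordered ladder, which the sketch below walks one step at a time. It is illustrative only: the config keys mirror common trainer argument names but are assumptions here, and dividing the batch size by its smallest factor is one way to keep the effective batch size exactly identical. The function never touches the training method or the sequence length.

```python
def next_oom_step(cfg: dict) -> tuple:
    """Return (action, new_config) for the next step of the R7 OOM ladder."""
    new = dict(cfg)
    batch = cfg["per_device_train_batch_size"]
    if batch > 1:
        # Step 1: shrink the batch by its smallest factor and scale gradient
        # accumulation by the same factor, so batch * accumulation is unchanged.
        factor = next(f for f in range(2, batch + 1) if batch % f == 0)
        new["per_device_train_batch_size"] = batch // factor
        new["gradient_accumulation_steps"] = (
            cfg["gradient_accumulation_steps"] * factor
        )
        return "reduce_batch", new
    if not cfg.get("gradient_checkpointing", False):
        # Step 2: trade compute for memory.
        new["gradient_checkpointing"] = True
        return "enable_checkpointing", new
    # Step 3: nothing left to tune in the config; move to larger hardware.
    return "larger_hardware", new
```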
These rules govern the loop itself: they keep an autonomous agent bounded, unstuck, and within its context budget. They are independent of the ML domain and are what a host harness (or a set of hooks) must enforce around the orchestrator. Numbers are reference defaults from the source system.
- The main loop MUST be bounded by a maximum iteration count (configurable; unbounded only by explicit opt-in). The loop exits when the model returns no further actions and no plan item remains unfinished, on user cancellation, or on unrecoverable error.
- The harness MUST detect when the agent is stuck and inject a corrective instruction. Two patterns MUST be caught over a recent window of actions (reference: last ~30):
  - Identical repetition: the same action with the same arguments repeated (reference threshold: 3 in a row) → inject "stop repeating this, try a fundamentally different strategy".
  - Cyclic repetition: a short sequence of actions (reference length 2–5) repeated (reference: ≥2 full cycles) → inject "you are in a repeating cycle, break it and try a different approach".
- Action signatures SHOULD incorporate the action's result as well as its arguments, so that legitimate polling (same call, changing result) is not misclassified as a loop.
- If the model produces no action while the plan still has unfinished items, the harness MUST NOT immediately hand control back to the user. It injects a continuation prompt ("the task is not complete, take at least one action now") and retries a small number of times (reference: 2) before yielding. Any action resets this counter.
- A short streak of malformed actions for the same tool (reference: 2 in a row) MUST trigger a corrective injection ("stop retrying, use a different strategy") rather than letting the agent grind.
- If a model response is cut off by the output limit, the harness MUST NOT have the agent blindly resend the same oversized payload; it injects guidance to use a different mechanism (e.g. write large content via a file/heredoc rather than inline) and retries.
- The harness MUST keep the working context within the model's window by compacting when usage crosses a high-water mark (reference: ~90% of the window).
- Compaction MUST preserve: the system instructions, the original task message (the user's first request), and a recent tail of the conversation (reference: ~5 messages). The middle is summarized into a single record.
- Oversized individual messages MAY be truncated with a placeholder (reference cap: ~50k tokens/message), except the system message.
- Compaction MUST be bounded: if it cannot bring usage under threshold, the session terminates cleanly rather than retrying forever (the retry would burn unbounded cost).
- Outward-facing or costly/destructive actions MUST pass an approval policy before executing. The policy distinguishes safe-by-default from approval-required:
  - Auto-approved: read-only research, inspection, discovery; routine code execution in the default low-cost sandbox; status/metadata queries.
  - Approval-required: provisioning non-default (GPU/larger) compute; submitting paid compute jobs; destructive storage operations (delete repo, delete branch/tag, merge, force-upload/overwrite); creating durable repos.
  - Always human-gated: recurring/scheduled jobs (a standing cost commitment) require explicit human approval even under otherwise-autonomous policies.
- An autonomous mode MAY auto-approve the approval-required class up to a cost cap, tracking estimated spend across the batch so the cap cannot be jointly overrun. Scheduled/recurring commitments remain human-gated regardless.
- The harness MAY tune the model's reasoning effort to the highest level the selected model actually supports, degrading gracefully when a level is rejected, and SHOULD do so cheaply (a tiny probe) and cache the result per model. This is a quality/cost optimization and is non-normative for the ML workflow itself.
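The repetition guard described in this list (identical and cyclic repetition over a recent window) can be sketched as a pure function over action signatures. This is a minimal illustration using the reference thresholds; the function names are hypothetical, and building the signature from the action name, its JSON-serialized arguments, and its result is one assumed way to keep legitimate polling from being flagged.

```python
import json
from typing import Optional


def signature(action: str, args: dict, result: str) -> str:
    """Fold the result into the signature so changing results differ."""
    return f"{action}|{json.dumps(args, sort_keys=True)}|{result}"


def detect_repetition(signatures: list, window: int = 30,
                      identical_threshold: int = 3,
                      min_cycles: int = 2) -> Optional[str]:
    """Return 'identical', 'cyclic', or None for the recent action window."""
    recent = signatures[-window:]
    tail = recent[-identical_threshold:]
    if len(tail) == identical_threshold and len(set(tail)) == 1:
        return "identical"
    for length in range(2, 6):  # reference cycle lengths 2 to 5
        span = length * min_cycles
        if len(recent) < span:
            break
        chunk = recent[-span:]
        pattern = chunk[:length]
        cycles = [chunk[i:i + length] for i in range(0, span, length)]
        # A cycle needs at least two distinct actions repeated back to back.
        if len(set(pattern)) > 1 and all(c == pattern for c in cycles):
            return "cyclic"
    return None
```

The harness would call this after every action and inject the matching corrective instruction when the result is not `None`.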
Across the workflow the agent maintains:
- The plan (§5.1): the live decomposition and progress.
- Research findings (§5.2): the recipe table, code patterns, references that ground implementation. These are the authority that later phases cite.
- Validated resources (§5.3–5.4): confirmed model, dataset (with verified schema), and chosen hardware.
- Run records (§5.6–5.7): job ids, configs, tracker project/run names, dashboard URLs, and the alert history that drives iteration.
- Durable outputs (§5.8): the persisted model/dataset/logs and evaluation results, each linked by URL.
An execution conforms to this specification if, for an ML implementation request:
- Research preceded implementation, and the implementation cites concrete findings (Principle 1, §5.2).
- Resources were verified by inspection, including dataset-format-to-method compatibility (Principle 2, §5.3–5.4).
- GPU code was smoke-tested before scaling, or the omission was justified (Principle 3, §5.5).
- The pre-flight checklist was satisfied before any job, batches went one-job-first, and durable persistence was configured up front (§5.6, Principle 4).
- Monitoring emitted structured alerts and the next iteration was driven by them (§5.7).
- The result was persisted and evaluated, and all artifacts were linked (§5.8).
- No rule in §6 was violated; in particular no silent scope change or resource substitution occurred.
- The loop stayed bounded and unstuck under the control contract (§7).
This appendix shows how the abstract capabilities of §4 are satisfied by
hf-skills and its companion tools, so an implementer can build only the workflow
logic. The skill names are the execution surface; do not reimplement them.
| Cap | Provided by (hf-skills) | Notes |
|---|---|---|
| C1 Hub search / repo details | hf-cli (hf models/datasets list/info), huggingface-best (leaderboard-ranked model choice), companion hub MCP tools | Use for model/dataset discovery and validation. |
| C2 Dataset inspection | huggingface-datasets (Dataset Viewer: rows, search, filter, statistics, parquet); validation helpers inside the trainer skills | Covers schema, splits, samples, stats. Satisfies the §5.4 audit. |
| C3 Paper search & read | huggingface-papers (paper markdown + structured metadata, linked models/datasets/spaces/repo), hf-cli papers, huggingface-paper-publisher | Read methodology from the markdown; follow linked artifacts. |
| C4 Citation-graph crawl | Partial gap. huggingface-papers exposes a paper's linked artifacts and metadata but not a full references/forward-citations graph with influence/intent. | Implement the "crawl downstream work" step via paper-page links plus the host harness web retrieval (C13); accept reduced fidelity vs. a dedicated citation API, and say so. |
| C5 Documentation retrieval | Companion doc tools assumed by the skills (hf_doc_search / hf_doc_fetch), referenced throughout huggingface-llm-trainer etc. | Use for current TRL/Transformers/etc. APIs. |
| C6 Working example code | Trainer skills ship reference scripts (huggingface-llm-trainer, huggingface-vision-trainer, train-sentence-transformers); host harness can read repos | Prefer copying the skills' production templates over synthesizing. |
| C7 Disposable sandbox (CPU/GPU) | Partial gap. No drop-in "GPU sandbox provision" skill equivalent to ml-intern's. Realize §5.5 preflight as a short, cheap job (C8) on a small GPU flavor with a tiny subset, or a local GPU smoke test via huggingface-community-evals (--limit). | The rule (smoke-test small first) is preserved; the mechanism becomes a minimal job or local run. |
| C8 Managed jobs | hf-cli (hf jobs run/inspect/logs/cancel, scheduled jobs) and the hf_jobs companion tool used by the trainer skills | Set timeout (R5), flavor (R6), env/secrets, persistence. |
| C9 Training methods | huggingface-llm-trainer (SFT/DPO/GRPO/reward via TRL), huggingface-vision-trainer, train-sentence-transformers | Method ↔ data-shape rules in §5.4 align with these skills' own validation. |
| C10 Tracking + alerts | huggingface-trackio (init/log/alert/finish, alert levels, CLI --json retrieval, Space dashboards) | This is the §5.7 decision channel. Read alerts back via the CLI JSON. |
| C11 Durable storage | hf-cli (upload/repos/buckets), trainers' push_to_hub, huggingface-datasets upload | Satisfies R1/R2. |
| C12 Evaluation | huggingface-community-evals (inspect-ai / lighteval, local GPU), trainer eval hooks | Satisfies §5.8 "evaluated and confirmed". |
| C13 General web/doc retrieval | Gap in hf-skills. Source from the host agent harness's built-in web search/fetch. | Used by research (C4 fallback) and for non-HF docs. |
| C14 Notifications | Gap in hf-skills. Source from the host harness (messaging integration) if needed. | Optional; gated per §5.9. |
Gaps to handle explicitly when implementing: C4 (full citation graph),
C7 (dedicated GPU sandbox), C13 (general web), C14 (notifications). For each, the
spec keeps the rule and lets the implementer satisfy it with the nearest
hf-skills mechanism or a host-harness capability, stating any reduced fidelity.
This appendix is illustrative only. It addresses the use case of building an ML researcher from hf-skills rather than from a custom agentic loop. It is not part of the specification.
- Driving skill(s). Encode the phase order and the gate conditions (§5) as the researcher's top-level procedure: a skill that sequences research → validate → audit → preflight → submit → monitor → iterate → evaluate, delegating each concrete action to the relevant `hf-skill` (Appendix A).
- Research subagent. Implement §5.2 as a separate subagent with its own context and a read-only toolset (papers, docs, examples, web). Its return schema is the recipe table + code patterns + references. This is the one place a subagent is structurally required (context isolation).
- Hooks for the control contract (§7) and the hard gates (§6). Implement as pre/post-action hooks rather than prose, so they are enforced not merely requested:
  - Pre-job hook: refuse a job submission unless the §5.6 pre-flight checklist is satisfied (reference impl, verified schema, smoke-test result, persistence destination, sized timeout, monitoring wired). Refuse local paths in scripts.
  - Batch hook: allow only one job from a batch until its logs confirm a healthy start (one-job-first).
  - Repetition / continuation / malformed guards: the §7.2–7.4 detectors.
  - Compaction hook: the §7.6 policy.
  - Approval hook: the §7.7 policy (paid compute, destructive storage, scheduled jobs).
- Plan/tracker. Use the host harness's todo mechanism for the §5.1 plan, and `huggingface-trackio` for the §5.7 alert-driven loop.
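As an illustration of the approval hook listed above, the sketch below classifies an action and applies a cumulative cost cap in autonomous mode. Everything here is hypothetical: the action kind labels, the class name, and the choice to treat unknown action kinds as human-gated are assumptions, not part of the source system.

```python
AUTO, APPROVAL, HUMAN = "auto", "approval_required", "human_gated"

# Hypothetical action kinds mapped onto the three policy classes.
ACTION_CLASS = {
    "read": AUTO, "inspect": AUTO, "sandbox_run": AUTO, "status": AUTO,
    "gpu_provision": APPROVAL, "paid_job": APPROVAL,
    "destructive_storage": APPROVAL, "create_repo": APPROVAL,
    "scheduled_job": HUMAN,
}


class ApprovalHook:
    """Pre-action hook: True means proceed, False means escalate to a human."""

    def __init__(self, autonomous: bool, cost_cap: float):
        self.autonomous = autonomous
        self.cost_cap = cost_cap
        self.spent = 0.0  # estimated spend approved so far

    def allow(self, action_kind: str, estimated_cost: float = 0.0) -> bool:
        # Assumption: unknown action kinds get the most restrictive class.
        policy = ACTION_CLASS.get(action_kind, HUMAN)
        if policy == AUTO:
            return True
        if policy == HUMAN or not self.autonomous:
            return False
        # Track spend across calls so several jobs cannot jointly exceed the cap.
        if self.spent + estimated_cost > self.cost_cap:
            return False
        self.spent += estimated_cost
        return True
```

Because spend accumulates inside the hook, a batch of individually affordable jobs is still stopped once the total would pass the cap.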
The intent of this split is that hf-skills provides the doing, and the
researcher provides only the discipline: the phase ordering, the verification
gates, the persistence and anti-scope-change rules, and the loop guards.