2026-02-27T20:58:00Z by Showboat 0.6.0
This analysis explores Terminal Bench 2.0 results: 10,947 trials across 27 agent/model submissions on 89 benchmark tasks. The data was loaded from JSON result files into a SQLite database with a proper relational schema.
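The loading pipeline itself isn't reproduced here, but sqlite-utils can handle both steps; a minimal sketch, assuming a hypothetical results.json export of trial records and that submission_stats is simply a view over trials (the real schema may differ):
uv run --with sqlite-utils sqlite-utils insert tb2.db trials results.json --pk id
uv run --with sqlite-utils sqlite-utils create-view tb2.db submission_stats "
select submission,
       count(*) as n_trials,
       sum(status = 'passed') as n_passed,
       sum(status = 'failed') as n_failed,
       sum(status = 'errored') as n_errored,
       round(avg(reward), 4) as avg_reward
from trials
group by submission"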
Which agent/model combination scores the highest across all 89 tasks?
uv run --with sqlite-utils sqlite-utils query tb2.db "select submission, n_trials, n_passed, n_failed, n_errored, avg_reward from submission_stats order by avg_reward desc" --table
submission                              n_trials    n_passed    n_failed    n_errored    avg_reward
------------------------------------- ---------- ---------- ---------- ----------- ------------
Droid__GPT-5.3-Codex 445 344 73 28 0.773
Simple-Codex__GPT-5.3-Codex 445 332 46 67 0.7563
Terminus-KIRA__Gemini-3.1-Pro-Preview 445 333 76 36 0.7517
Terminus-KIRA__Claude-Opus-4.6 445 331 54 60 0.7489
Judy__Claude-Opus-4.6 445 320 65 60 0.7256
Droid__Claude-Opus-4.6 445 311 108 26 0.7052
CodeBrain-1__GPT-5.3-Codex 445 313 98 34 0.705
Mux__GPT-5.3-Codex 445 305 114 26 0.6854
Deep-Agents__GPT-5.2-Codex 445 293 77 75 0.6798
Crux__Claude-Opus-4.6 445 272 99 74 0.67
Mux__Claude-Opus-4.6 445 296 113 36 0.6652
OpenSage__Gemini-3-Pro-Preview 445 290 131 24 0.6576
Terminus2__GPT-5.3-Codex 445 288 70 87 0.6545
Ante__Gemini-3-Pro-Preview 445 288 129 28 0.6501
Terminus2__Claude-Opus-4.6 445 280 92 73 0.6349
CodeBrain-1__Gemini-3-Pro-Preview 445 277 136 32 0.6225
Mux__GPT-5.2 89 54 25 10 0.6207
Mux__Claude-Opus-4.5 89 52 29 8 0.5843
Terminus2__GLM-5 445 231 113 101 0.5397
OpenCode__Claude-Opus-4.5 89 46 38 5 0.5227
MAYA__Claude-4.5-sonnet 445 190 241 14 0.4408
Terminus2__Kimi-k2.5 445 189 161 95 0.4385
Terminus2__Minimax-m2.5 445 188 92 165 0.4292
Terminus2__DeepSeek-V3.2 445 176 183 86 0.3982
Terminus2__GLM-4.7 445 147 139 159 0.3475
ClaudeCode__GLM-4.7 445 148 250 47 0.3348
dakou__qwen3-coder-480b 445 121 232 92 0.275
Droid with GPT-5.3-Codex leads with an average reward of 0.773 (344 of 445 trials passed), followed closely by Simple-Codex (also on GPT-5.3-Codex) and Terminus-KIRA (Gemini 3.1 Pro Preview). The top four are separated by less than 2.5 points of average reward.
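Note that the ranking metric is average reward, not pass rate; some tasks grant partial credit, so the two can diverge (Simple-Codex passes 74.6% of trials but averages 0.756 reward). A quick side-by-side check over the same table (a sketch; output omitted):
uv run --with sqlite-utils sqlite-utils query tb2.db "
select submission,
       round(1.0 * n_passed / n_trials, 4) as pass_rate,
       avg_reward
from submission_stats
order by avg_reward desc
limit 5
" --table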
Multiple agents use the same underlying model. Which agent gets the most out of each model?
uv run --with sqlite-utils sqlite-utils query tb2.db "
select s.model_name, s.slug as best_submission, ss.avg_reward, ss.n_passed, ss.n_trials
from submissions s
join submission_stats ss on ss.submission = s.slug
where ss.avg_reward = (
select max(ss2.avg_reward)
from submissions s2
join submission_stats ss2 on ss2.submission = s2.slug
where s2.model_name = s.model_name
)
order by ss.avg_reward desc
" --tablemodel_name best_submission avg_reward n_passed n_trials
---------------------------------- ------------------------------------- ------------ ---------- ----------
gpt-5.3-codex Droid__GPT-5.3-Codex 0.773 344 445
openai/gpt-5.3-codex Simple-Codex__GPT-5.3-Codex 0.7563 332 445
vertex_ai/gemini-3.1-pro-preview Terminus-KIRA__Gemini-3.1-Pro-Preview 0.7517 333 445
vertex_ai/claude-opus-4-6 Terminus-KIRA__Claude-Opus-4.6 0.7489 331 445
aurora-01-21 Droid__Claude-Opus-4.6 0.7052 311 445
openai:gpt-5.2-codex Deep-Agents__GPT-5.2-Codex 0.6798 293 445
claude-sonnet-4-20250514 Crux__Claude-Opus-4.6 0.67 272 445
anthropic/claude-opus-4-6 Mux__Claude-Opus-4.6 0.6652 296 445
litellm_proxy/gemini-3-pro-preview OpenSage__Gemini-3-Pro-Preview 0.6576 290 445
gemini-3-pro-preview Ante__Gemini-3-Pro-Preview 0.6501 288 445
vertex_ai/gemini-3-pro-preview CodeBrain-1__Gemini-3-Pro-Preview 0.6225 277 445
openai/gpt-5.2 Mux__GPT-5.2 0.6207 54 89
anthropic/claude-opus-4-5 Mux__Claude-Opus-4.5 0.5843 52 89
openai/glm-5 Terminus2__GLM-5 0.5397 231 445
openai/kimi-k2.5:cloud Terminus2__Kimi-k2.5 0.4385 189 445
openai/minimax-m2.5:cloud Terminus2__Minimax-m2.5 0.4292 188 445
deepseek/deepseek-chat Terminus2__DeepSeek-V3.2 0.3982 176 445
openai/glm-4.7:cloud Terminus2__GLM-4.7 0.3475 147 445
GLM-4.7 ClaudeCode__GLM-4.7 0.3348 148 445
qwen3-coder-modelscope dakou__qwen3-coder-480b 0.275 121 445
Interesting — some models appear under different provider prefixes (e.g. Claude Opus 4.6 appears as aurora-01-21, vertex_ai/claude-opus-4-6, and anthropic/claude-opus-4-6). The agent scaffold matters: Terminus-KIRA gets 74.9% from Claude Opus 4.6 via Vertex, while Mux gets 66.5% from the same model via Anthropic's API directly.
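Grouping truly by underlying model would require normalizing those prefixes; a rough sketch that strips the provider prefixes seen above (aliases like aurora-01-21 and the :cloud suffixes would still need a manual mapping; output omitted):
uv run --with sqlite-utils sqlite-utils query tb2.db "
select slug, model_name,
       replace(replace(replace(replace(replace(model_name,
         'vertex_ai/', ''), 'anthropic/', ''), 'openai/', ''),
         'litellm_proxy/', ''), 'openai:', '') as normalized_model
from submissions
order by normalized_model
" --table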
Which of the 89 tasks have the highest failure rates across all submissions?
uv run --with sqlite-utils sqlite-utils query tb2.db "
select task_name, n_trials, n_passed, n_failed, n_errored, failure_rate
from task_stats order by failure_rate desc limit 15
" --tabletask_name n_trials n_passed n_failed n_errored failure_rate
-------------------------- ---------- ---------- ---------- ----------- --------------
make-doom-for-mips 123 0 12 111 1
sam-cell-seg 123 3 117 3 0.975
install-windows-3.11 123 6 109 8 0.9508
caffe-cifar-10 123 6 34 83 0.9474
filter-js-from-html 123 6 108 9 0.9474
gpt2-codegolf 123 10 25 88 0.9187
extract-moves-from-video 123 10 52 61 0.9174
raman-fitting 123 10 78 35 0.9174
train-fasttext 123 12 25 86 0.8919
mteb-retrieve 123 13 105 5 0.8898
video-processing 123 14 107 2 0.8862
torch-tensor-parallelism 123 24 88 11 0.8049
dna-assembly 123 25 80 18 0.7934
db-wal-recovery 123 27 48 48 0.7805
torch-pipeline-parallelism 123 27 69 27 0.7805
make-doom-for-mips is completely unsolved — 0 passes out of 123 attempts, with 111 of those being errors (mostly timeouts). sam-cell-seg and install-windows-3.11 are nearly as brutal.
Notice that some tasks have high error counts rather than clean failures — caffe-cifar-10 has 83 errors out of 123 attempts, suggesting agents crash or timeout rather than producing wrong answers.
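To separate the two failure modes explicitly, the error rate and the clean-failure rate can be split out per task from the same task_stats table (a sketch; output omitted):
uv run --with sqlite-utils sqlite-utils query tb2.db "
select task_name, n_trials,
       round(1.0 * n_errored / n_trials, 3) as error_rate,
       round(1.0 * n_failed / n_trials, 3) as clean_fail_rate
from task_stats
order by error_rate desc
limit 10
" --table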
uv run --with sqlite-utils sqlite-utils query tb2.db "select task_name, n_trials, n_failed, n_errored from task_stats where n_passed = 0" --table
task_name           n_trials    n_failed    n_errored
------------------ ---------- ---------- -----------
make-doom-for-mips 123 12 111
Only one task is completely unsolved: compiling DOOM for MIPS. Roughly 90% of attempts don't even finish, and nearly all of those are agent timeouts.
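The exception breakdown for that one task confirms this (a sketch over trials; the per-task timeout table further down shows 109 of the 111 errors are AgentTimeoutError):
uv run --with sqlite-utils sqlite-utils query tb2.db "
select exception_type, count(*) as n
from trials
where task_name = 'make-doom-for-mips' and exception_type is not null
group by exception_type
order by n desc
" --table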
uv run --with sqlite-utils sqlite-utils query tb2.db "
select task_name, n_trials, n_passed, n_failed, n_errored, avg_reward
from task_stats order by failure_rate asc limit 15
" --tabletask_name n_trials n_passed n_failed n_errored avg_reward
-------------------------- ---------- ---------- ---------- ----------- ------------
git-leak-recovery 123 121 2 0 0.9837
cobol-modernization 123 119 4 0 0.9675
constraints-scheduling 123 118 5 0 0.9593
fix-git 123 118 5 0 0.9593
nginx-request-logging 123 118 5 0 0.9593
vulnerable-secret 123 116 6 1 0.9508
portfolio-optimization 123 114 6 3 0.95
custom-memory-heap-crash 123 116 2 5 0.9431
multi-source-data-merger 123 116 7 0 0.9431
prove-plus-comm 123 116 7 0 0.9431
modernize-scientific-stack 123 113 6 4 0.9417
log-summary-date-ranges 123 115 8 0 0.935
code-from-image 123 113 3 7 0.9339
distribution-search 123 114 5 4 0.9268
git-multibranch 123 113 9 1 0.9187
git-leak-recovery is nearly universally solved (98.4%) — only 2 failures across all 123 attempts. Tasks involving git operations, COBOL modernization, and constraint scheduling are consistently easy for all models.
What kinds of errors do agents hit?
uv run --with sqlite-utils sqlite-utils query tb2.db "
select exception_type, count(*) as n,
round(100.0 * count(*) / (select count(*) from trials where exception_type is not null), 1) as pct
from trials
where exception_type is not null
group by exception_type
order by n desc
" --tableexception_type n pct
---------------------------- ---- -----
AgentTimeoutError 1596 92.9
DaytonaError 31 1.8
RuntimeError 20 1.2
VerifierTimeoutError 18 1
BadRequestError 9 0.5
NameError 8 0.5
OSError 8 0.5
RewardFileNotFoundError 8 0.5
EnvironmentStartTimeoutError 6 0.3
AddTestsDirError 4 0.2
AttributeError 4 0.2
DownloadVerifierDirError 4 0.2
AgentSetupTimeoutError 1 0.1
KeyError 1 0.1
AgentTimeoutError dominates at 92.9% of all errors. The remaining errors are a mix of infrastructure issues (DaytonaError, EnvironmentStartTimeoutError) and agent bugs (NameError, AttributeError).
uv run --with sqlite-utils sqlite-utils query tb2.db "
select submission,
count(*) as n_trials,
sum(case when exception_type is not null then 1 else 0 end) as n_errors,
round(100.0 * sum(case when exception_type is not null then 1 else 0 end) / count(*), 1) as error_pct,
sum(case when exception_type = 'AgentTimeoutError' then 1 else 0 end) as n_timeouts
from trials
group by submission
order by error_pct desc
" --tablesubmission n_trials n_errors error_pct n_timeouts
------------------------------------- ---------- ---------- ----------- ------------
Terminus2__Minimax-m2.5 445 181 40.7 179
Terminus2__GLM-4.7 445 161 36.2 143
Terminus2__GLM-5 445 111 24.9 108
Terminus2__Kimi-k2.5 445 101 22.7 91
dakou__qwen3-coder-480b 445 95 21.3 92
Simple-Codex__GPT-5.3-Codex 445 92 20.7 90
Terminus2__GPT-5.3-Codex 445 92 20.7 92
Terminus2__DeepSeek-V3.2 445 91 20.4 90
Judy__Claude-Opus-4.6 445 86 19.3 83
Deep-Agents__GPT-5.2-Codex 445 82 18.4 70
Terminus-KIRA__Claude-Opus-4.6 445 81 18.2 78
Terminus2__Claude-Opus-4.6 445 81 18.2 81
Crux__Claude-Opus-4.6 445 78 17.5 40
Mux__GPT-5.2 89 12 13.5 11
ClaudeCode__GLM-4.7 445 48 10.8 46
Mux__Claude-Opus-4.6 445 46 10.3 46
Terminus-KIRA__Gemini-3.1-Pro-Preview 445 43 9.7 43
Mux__Claude-Opus-4.5 89 8 9 8
CodeBrain-1__GPT-5.3-Codex 445 36 8.1 35
CodeBrain-1__Gemini-3-Pro-Preview 445 35 7.9 35
Ante__Gemini-3-Pro-Preview 445 30 6.7 29
Droid__GPT-5.3-Codex 445 29 6.5 29
Mux__GPT-5.3-Codex 445 29 6.5 29
Droid__Claude-Opus-4.6 445 26 5.8 22
OpenCode__Claude-Opus-4.5 89 5 5.6 5
OpenSage__Gemini-3-Pro-Preview 445 25 5.6 21
MAYA__Claude-4.5-sonnet 445 14 3.1 0
Terminus2 with Minimax-m2.5 errors 40.7% of the time — almost all timeouts. MAYA with Claude 4.5 Sonnet has the lowest error rate (3.1%) and zero timeouts, though it has a lower overall pass rate since it cleanly fails rather than crashing.
uv run --with sqlite-utils sqlite-utils query tb2.db "
select task_name, count(*) as n_timeouts,
round(100.0 * count(*) / (select count(*) from trials t2 where t2.task_name = t.task_name), 1) as timeout_pct
from trials t
where exception_type = 'AgentTimeoutError'
group by task_name
order by timeout_pct desc
limit 15
" --tabletask_name n_timeouts timeout_pct
-------------------------- ------------ -------------
make-doom-for-mips 109 88.6
gpt2-codegolf 88 71.5
caffe-cifar-10 84 68.3
train-fasttext 83 67.5
extract-moves-from-video 60 48.8
make-mips-interpreter 57 46.3
qemu-alpine-ssh 56 45.5
tune-mjcf 52 42.3
write-compressor 50 40.7
path-tracing 48 39
db-wal-recovery 48 39
adaptive-rejection-sampler 42 34.1
torch-pipeline-parallelism 39 31.7
polyglot-rust-c 39 31.7
path-tracing-reverse 35 28.5
The timeout-prone tasks are the computationally heavy ones: compiling DOOM for MIPS (88.6% timeout), GPT-2 code golf (71.5%), training Caffe on CIFAR-10 (68.3%). These tasks likely need more time than the default timeout allows, regardless of the agent's skill.
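One way to probe that hypothesis is to check how long the successful attempts on these tasks run, using the agent_exec_* timestamps examined in the next section (a sketch; output omitted):
uv run --with sqlite-utils sqlite-utils query tb2.db "
select task_name,
       count(*) as n_passed,
       round(avg((julianday(agent_exec_finished_at) - julianday(agent_exec_started_at)) * 86400), 1) as avg_exec_sec,
       round(max((julianday(agent_exec_finished_at) - julianday(agent_exec_started_at)) * 86400), 1) as max_exec_sec
from trials
where status = 'passed'
  and task_name in ('gpt2-codegolf', 'caffe-cifar-10', 'train-fasttext', 'extract-moves-from-video')
group by task_name
" --table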
How long do agents actually spend on tasks, and does spending more time correlate with success?
uv run --with sqlite-utils sqlite-utils query tb2.db "
select status, count(*) as n,
round(avg((julianday(agent_exec_finished_at) - julianday(agent_exec_started_at)) * 86400), 1) as avg_agent_sec,
round(min((julianday(agent_exec_finished_at) - julianday(agent_exec_started_at)) * 86400), 1) as min_sec,
round(max((julianday(agent_exec_finished_at) - julianday(agent_exec_started_at)) * 86400), 1) as max_sec
from trials
where agent_exec_started_at is not null and agent_exec_finished_at is not null
group by status
" --tablestatus n avg_agent_sec min_sec max_sec
-------- ---- --------------- --------- ---------
errored 1509 1504.1 0 12000.1
failed 2984 541 0.4 9338.2
passed 6415 423.3 1.7 6250
Errored trials average 1,504 seconds (25 minutes) — almost all timeouts running to the limit. Passed trials average 423 seconds (7 minutes) and failed trials 541 seconds (9 minutes). Agents that succeed tend to do so relatively quickly. Agents that fail spend more time trying before giving up.
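The "almost all timeouts" reading can be verified by splitting errored trials into timeouts versus everything else (a sketch; output omitted):
uv run --with sqlite-utils sqlite-utils query tb2.db "
select case when exception_type = 'AgentTimeoutError' then 'timeout' else 'other error' end as error_kind,
       count(*) as n,
       round(avg((julianday(agent_exec_finished_at) - julianday(agent_exec_started_at)) * 86400), 1) as avg_agent_sec
from trials
where status = 'errored'
  and agent_exec_started_at is not null and agent_exec_finished_at is not null
group by error_kind
" --table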
uv run --with sqlite-utils sqlite-utils query tb2.db "
select submission,
round(avg((julianday(env_setup_finished_at) - julianday(env_setup_started_at)) * 86400), 1) as env_setup_sec,
round(avg((julianday(agent_setup_finished_at) - julianday(agent_setup_started_at)) * 86400), 1) as agent_setup_sec,
round(avg((julianday(agent_exec_finished_at) - julianday(agent_exec_started_at)) * 86400), 1) as exec_sec,
round(avg((julianday(verifier_finished_at) - julianday(verifier_started_at)) * 86400), 1) as verify_sec
from trials
where env_setup_started_at is not null
group by submission
order by exec_sec desc
limit 15
" --tablesubmission env_setup_sec agent_setup_sec exec_sec verify_sec
------------------------------ --------------- ----------------- ---------- ------------
Terminus2__Minimax-m2.5 17.5 18.3 965.5 93.5
Terminus2__GLM-4.7 14 20.6 886.2 90.8
Terminus2__GLM-5 10.5 17.2 822.9 63.2
Judy__Claude-Opus-4.6 11.2 47.1 760.5 72
Terminus-KIRA__Claude-Opus-4.6 39.2 17.8 753.8 46.2
Simple-Codex__GPT-5.3-Codex 2.7 13.2 742.4 66.1
dakou__qwen3-coder-480b 52.1 109.7 737.2 122.6
Terminus2__GPT-5.3-Codex 2.9 12.7 734.7 48
Terminus2__Kimi-k2.5 15.1 12.4 711.6 74.5
Deep-Agents__GPT-5.2-Codex 3.4 0 704.5 45.8
Terminus2__DeepSeek-V3.2 9.5 8.7 703.4 31.1
Terminus2__Claude-Opus-4.6 3.6 12.9 656.2 49.6
Mux__Claude-Opus-4.6 9.2 78.7 574.4 40.2
Mux__GPT-5.2 6.2 79.2 570.5 95.1
OpenCode__Claude-Opus-4.5 26.8 40.3 557.1 66.7
Agent execution dominates the time budget, typically 10-20x longer than setup or verification. The dakou agent has notably long agent setup (110 seconds on average) — perhaps it's downloading a large model. Verification takes 30-120 seconds depending on the task.
Only some submissions report token usage and costs.
uv run --with sqlite-utils sqlite-utils query tb2.db "
select submission,
sum(case when status = 'passed' then 1 else 0 end) as n_passed,
round(sum(cost_usd), 2) as total_cost,
round(avg(cost_usd), 2) as avg_cost_per_trial,
round(sum(cost_usd) / nullif(sum(case when status = 'passed' then 1 else 0 end), 0), 2) as cost_per_pass
from trials
where cost_usd is not null and cost_usd > 0
group by submission
having n_passed > 0
order by cost_per_pass asc
" --tablesubmission n_passed total_cost avg_cost_per_trial cost_per_pass
------------------------------------- ---------- ------------ -------------------- ---------------
Mux__GPT-5.3-Codex 236 0 0 0
Terminus2__DeepSeek-V3.2 176 13.67 0.03 0.08
Mux__Claude-Opus-4.6 101 37.22 0.32 0.37
Terminus2__Claude-Opus-4.6 280 293.61 0.66 1.05
Terminus-KIRA__Gemini-3.1-Pro-Preview 333 406.63 0.91 1.22
Terminus-KIRA__Claude-Opus-4.6 331 713.67 1.61 2.16
Huge cost range: DeepSeek V3.2 costs just $0.08 per correct answer while Terminus-KIRA with Claude Opus 4.6 costs $2.16 per correct answer, 27x more expensive. The GPT-5.3-Codex run via Mux reports zero cost, which is likely a reporting issue.
Terminus-KIRA with Claude Opus 4.6 spent $713.67 total across 445 trials — an expensive benchmark run.
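For context, the total reported spend across all cost-reporting submissions can be summed the same way (a sketch; output omitted):
uv run --with sqlite-utils sqlite-utils query tb2.db "
select count(distinct submission) as submissions_reporting,
       round(sum(cost_usd), 2) as total_reported_cost_usd
from trials
where cost_usd is not null and cost_usd > 0
" --table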
Where does Claude Opus 4.6 (via Judy) beat GPT-5.3-Codex (via Droid), and vice versa?
uv run --with sqlite-utils sqlite-utils query tb2.db "
select m1.task_name,
round(m1.avg_reward, 2) as claude_opus,
round(m2.avg_reward, 2) as gpt_codex,
round(m1.avg_reward - m2.avg_reward, 2) as claude_advantage
from submission_task_matrix m1
join submission_task_matrix m2 using (task_name)
where m1.submission = 'Judy__Claude-Opus-4.6'
and m2.submission = 'Droid__GPT-5.3-Codex'
and abs(m1.avg_reward - m2.avg_reward) > 0.3
order by claude_advantage desc
" --tabletask_name claude_opus gpt_codex claude_advantage
---------------------------- ------------- ----------- ------------------
qemu-alpine-ssh 1 0 1
mcmc-sampling-stan 0.8 0 0.8
torch-pipeline-parallelism 0.8 0 0.8
db-wal-recovery 1 0.4 0.6
overfull-hbox 1 0.4 0.6
query-optimize 0.75 0.4 0.35
extract-moves-from-video 0 0.4 -0.4
git-multibranch 0.6 1 -0.4
mteb-leaderboard 0.2 0.6 -0.4
path-tracing 0.6 1 -0.4
schemelike-metacircular-eval 0.6 1 -0.4
tune-mjcf 0.4 0.8 -0.4
write-compressor 0.6 1 -0.4
financial-document-processor 0.4 1 -0.6
adaptive-rejection-sampler 0.2 1 -0.8
build-pmars 0.2 1 -0.8
make-mips-interpreter 0 0.8 -0.8
openssl-selfsigned-cert 0.2 1 -0.8
regex-chess 0 1 -1
Claude Opus 4.6 (Judy) uniquely solves qemu-alpine-ssh (100% vs 0%), and does much better on mcmc-sampling-stan and torch-pipeline-parallelism. But GPT-5.3-Codex (Droid) dominates on regex-chess (100% vs 0%), adaptive-rejection-sampler, build-pmars, and openssl-selfsigned-cert. These are genuinely different capability profiles, not just noise.
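A full head-to-head win count across all 89 tasks, not just the large gaps, is a small variation on the same join (a sketch; output omitted):
uv run --with sqlite-utils sqlite-utils query tb2.db "
select sum(m1.avg_reward > m2.avg_reward) as claude_better,
       sum(m2.avg_reward > m1.avg_reward) as codex_better,
       sum(m1.avg_reward = m2.avg_reward) as tied
from submission_task_matrix m1
join submission_task_matrix m2 using (task_name)
where m1.submission = 'Judy__Claude-Opus-4.6'
  and m2.submission = 'Droid__GPT-5.3-Codex'
" --table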
Some tasks are too easy (everyone passes) or too hard (everyone fails) to be useful discriminators. Which tasks best separate the top models from the bottom?
uv run --with sqlite-utils sqlite-utils query tb2.db "
select m.task_name,
round(avg(case when ss.avg_reward > 0.65 then m.avg_reward end), 3) as top_half,
round(avg(case when ss.avg_reward <= 0.65 then m.avg_reward end), 3) as bottom_half,
round(
avg(case when ss.avg_reward > 0.65 then m.avg_reward end) -
avg(case when ss.avg_reward <= 0.65 then m.avg_reward end), 3
) as discrimination
from submission_task_matrix m
join submission_stats ss on ss.submission = m.submission
where ss.n_trials > 100
group by m.task_name
having discrimination is not null
order by discrimination desc
limit 15
" --tabletask_name top_half bottom_half discrimination
------------------------- ---------- ------------- ----------------
sanitize-git-repo 0.886 0.1 0.786
path-tracing 0.7 0.02 0.68
break-filter-js-from-html 0.9 0.22 0.68
protein-assembly 0.757 0.1 0.657
feal-linear-cryptanalysis 0.893 0.24 0.653
write-compressor 0.775 0.14 0.635
path-tracing-reverse 0.714 0.08 0.634
bn-fit-modify 0.971 0.36 0.611
chess-best-move 0.729 0.12 0.609
overfull-hbox 0.786 0.2 0.586
circuit-fibsqrt 0.886 0.3 0.586
sqlite-db-truncate 1 0.435 0.565
large-scale-text-editing 0.957 0.4 0.557
winning-avg-corewars 0.764 0.24 0.524
build-cython-ext 0.971 0.46 0.511
sanitize-git-repo is the strongest discriminator — top-half models solve it 88.6% of the time while bottom-half models solve it only 10%. Tasks like path-tracing, break-filter-js-from-html, and feal-linear-cryptanalysis also strongly separate the pack. These are the tasks that really test whether an agent framework is well-built.
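The flip side, tasks that barely separate the two groups at all, falls out of the same query by sorting on the absolute gap instead (a sketch; output omitted):
uv run --with sqlite-utils sqlite-utils query tb2.db "
select m.task_name,
       round(avg(case when ss.avg_reward > 0.65 then m.avg_reward end), 3) as top_half,
       round(avg(case when ss.avg_reward <= 0.65 then m.avg_reward end), 3) as bottom_half,
       round(abs(
         avg(case when ss.avg_reward > 0.65 then m.avg_reward end) -
         avg(case when ss.avg_reward <= 0.65 then m.avg_reward end)), 3
       ) as abs_discrimination
from submission_task_matrix m
join submission_stats ss on ss.submission = m.submission
where ss.n_trials > 100
group by m.task_name
having abs_discrimination is not null
order by abs_discrimination asc
limit 10
" --table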
Which tasks have the highest variance in performance across submissions? These are tasks where the agent scaffold matters more than the model.
uv run --with sqlite-utils sqlite-utils query tb2.db "
select task_name,
round(avg(avg_reward), 3) as mean_reward,
round(min(avg_reward), 3) as worst,
round(max(avg_reward), 3) as best,
round(avg(avg_reward * avg_reward) - avg(avg_reward) * avg(avg_reward), 4) as variance,
count(*) as n_submissions
from submission_task_matrix
group by task_name
having n_submissions > 10
order by variance desc
limit 15
" --tabletask_name mean_reward worst best variance n_submissions
---------------------------- ------------- ------- ------ ---------- ---------------
break-filter-js-from-html 0.608 0 1 0.2046 27
feal-linear-cryptanalysis 0.589 0 1 0.2002 27
path-tracing-reverse 0.474 0 1 0.196 27
chess-best-move 0.459 0 1 0.1891 27
write-compressor 0.454 0 1 0.1876 27
financial-document-processor 0.57 0 1 0.1828 27
schemelike-metacircular-eval 0.633 0 1 0.1815 27
overfull-hbox 0.556 0 1 0.1788 27
polyglot-rust-c 0.333 0 1 0.1778 27
sanitize-git-repo 0.57 0 1 0.1769 27
protein-assembly 0.43 0 1 0.1769 27
qemu-alpine-ssh 0.444 0 1 0.1699 27
circuit-fibsqrt 0.681 0 1 0.1667 27
path-tracing 0.37 0 1 0.1613 27
configure-git-webserver 0.511 0 1 0.161 27
Every high-variance task has submissions scoring 0 and others scoring 1 — these are all-or-nothing tasks where the agent's approach either works completely or fails completely. break-filter-js-from-html has the highest variance: mean reward of 0.61 but some submissions never solve it while others always do.
Some submissions ran the benchmark multiple times. Did they get better?
uv run --with sqlite-utils sqlite-utils query tb2.db "
select r.submission, r.run_date,
count(*) as n_trials,
sum(case when t.status = 'passed' then 1 else 0 end) as n_passed,
round(avg(t.reward), 4) as avg_reward
from trials t
join runs r on t.run_id = r.id
group by r.submission, r.run_date
having n_trials > 40
order by r.submission, r.run_date
" --tablesubmission run_date n_trials n_passed avg_reward
------------------------------------- --------------------------------------- ---------- ---------- ------------
Ante__Gemini-3-Pro-Preview 2025-12-31__22-36-36 445 288 0.6501
ClaudeCode__GLM-4.7 2026-02-06__13-08-08 445 148 0.3348
CodeBrain-1__GPT-5.3-Codex 2026-02-08__11-01-20 89 60 0.6742
CodeBrain-1__GPT-5.3-Codex 2026-02-08__20-43-33 89 66 0.7416
CodeBrain-1__GPT-5.3-Codex 2026-02-09__01-09-07 89 65 0.7303
CodeBrain-1__GPT-5.3-Codex 2026-02-09__10-03-38 89 60 0.6818
CodeBrain-1__GPT-5.3-Codex 2026-02-09__14-06-10 89 62 0.6966
CodeBrain-1__Gemini-3-Pro-Preview 2026-01-31__22-50-53 89 56 0.6292
CodeBrain-1__Gemini-3-Pro-Preview 2026-02-01__11-30-31 89 56 0.6292
CodeBrain-1__Gemini-3-Pro-Preview 2026-02-01__16-43-15 89 54 0.6067
CodeBrain-1__Gemini-3-Pro-Preview 2026-02-05__00-52-21 89 56 0.6292
CodeBrain-1__Gemini-3-Pro-Preview 2026-02-05__11-19-55 89 55 0.618
Crux__Claude-Opus-4.6 submission-run 445 272 0.67
Deep-Agents__GPT-5.2-Codex 2026-02-10__12-43-13 445 293 0.6798
Droid__Claude-Opus-4.6 fudge_2026-01-29__03-21-39 89 61 0.6932
Droid__Claude-Opus-4.6 fudge_2026-01-29__06-23-59 89 64 0.7273
Droid__Claude-Opus-4.6 fudge_2026-01-29__07-55-11 89 61 0.6932
Droid__Claude-Opus-4.6 fudge_2026-01-29__23-21-40 89 63 0.7079
Droid__Claude-Opus-4.6 fudge_2026-01-30__02-21-25 89 62 0.7045
Droid__GPT-5.3-Codex 53codex_2026-02-16__17-43-12 89 69 0.7753
Droid__GPT-5.3-Codex 53codex_2026-02-21__21-58-56 89 68 0.764
Droid__GPT-5.3-Codex 53codex_2026-02-22__06-09-07 89 68 0.764
Droid__GPT-5.3-Codex 53codex_2026-02-22__09-59-36 89 69 0.7753
Droid__GPT-5.3-Codex 53codex_2026-02-22__10-19-01 89 70 0.7865
Judy__Claude-Opus-4.6 2026-02-10__18-07-36 89 64 0.7273
Judy__Claude-Opus-4.6 2026-02-11__11-58-19 89 63 0.7079
Judy__Claude-Opus-4.6 2026-02-12__11-31-57 89 62 0.7045
Judy__Claude-Opus-4.6 2026-02-12__18-19-55 89 64 0.7273
Judy__Claude-Opus-4.6 2026-02-13__00-57-16 89 67 0.7614
MAYA__Claude-4.5-sonnet 2025-11-13__09-25-32 89 38 0.4419
MAYA__Claude-4.5-sonnet 2025-12-05__06-20-00 89 38 0.4419
MAYA__Claude-4.5-sonnet 2025-12-05__16-42-55 89 38 0.4368
MAYA__Claude-4.5-sonnet 2025-12-06__16-31-45 89 38 0.4524
MAYA__Claude-4.5-sonnet 2025-12-09__13-26-21 89 38 0.4318
Mux__Claude-Opus-4.5 2026-01-16__00-15-05 89 52 0.5843
Mux__Claude-Opus-4.6 2026-02-09__01-47-04 89 59 0.6629
Mux__Claude-Opus-4.6 2026-02-09__01-47-09 89 62 0.6966
Mux__Claude-Opus-4.6 2026-02-10__00-43-32 89 57 0.6404
Mux__Claude-Opus-4.6 2026-02-10__02-03-21 89 60 0.6742
Mux__Claude-Opus-4.6 2026-02-10__02-43-25 89 58 0.6517
Mux__GPT-5.2 2026-01-16__00-55-00 89 54 0.6207
Mux__GPT-5.3-Codex 2026-02-08__19-57-27 445 305 0.6854
OpenCode__Claude-Opus-4.5 2026-01-11__01-42-13 89 46 0.5227
OpenSage__Gemini-3-Pro-Preview 2026-01-23__08-25-43 89 57 0.6477
OpenSage__Gemini-3-Pro-Preview 2026-01-23__22-00-30 89 59 0.6705
OpenSage__Gemini-3-Pro-Preview 2026-01-25__05-43-22 89 60 0.6742
OpenSage__Gemini-3-Pro-Preview 2026-01-25__20-11-02 89 57 0.6477
OpenSage__Gemini-3-Pro-Preview 2026-02-12__20-47-36 89 57 0.6477
Simple-Codex__GPT-5.3-Codex simple-codex-gpt-5.3-codex-5x-duplicate 445 332 0.7563
Terminus-KIRA__Claude-Opus-4.6 2026-02-13__11-48-54 445 331 0.7489
Terminus-KIRA__Gemini-3.1-Pro-Preview 2026-02-20__11-30-30 445 333 0.7517
Terminus2__Claude-Opus-4.6 2026-02-05__16-08-28 89 61 0.6854
Terminus2__Claude-Opus-4.6 2026-02-05__17-41-47 356 219 0.6222
Terminus2__DeepSeek-V3.2 2026-02-07__07-47-43 445 176 0.3982
Terminus2__GLM-4.7 2026-01-27__12-34-00 445 147 0.3475
Terminus2__GLM-5 2026-02-14__13-57-51 445 231 0.5397
Terminus2__GPT-5.3-Codex terminus-gpt-5.3-codex-5x 445 288 0.6545
Terminus2__Kimi-k2.5 2026-01-26__22-34-00 445 189 0.4385
Terminus2__Minimax-m2.5 2026-02-18__13-31-00 445 188 0.4292
dakou__qwen3-coder-480b 2025-12-25__23-49-10 445 121 0.275
Several submissions ran the full 89-task suite five times (445 trials total). Performance is fairly stable across runs: CodeBrain-1 with GPT-5.3-Codex ranges from 67.4% to 74.2%, and Judy with Claude Opus 4.6 ranges from 70.5% to 76.1%. The variance between runs is much smaller than the variance between different agent/model combinations.
MAYA with Claude 4.5 Sonnet is remarkably consistent: 43-45% across all 5 runs.
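The run-to-run spread can be summarized per submission with a CTE over the same trials/runs join (a sketch; output omitted):
uv run --with sqlite-utils sqlite-utils query tb2.db "
with per_run as (
  select r.submission, r.run_date, avg(t.reward) as run_reward
  from trials t
  join runs r on t.run_id = r.id
  group by r.submission, r.run_date
)
select submission,
       count(*) as n_runs,
       round(min(run_reward), 4) as worst_run,
       round(max(run_reward), 4) as best_run,
       round(max(run_reward) - min(run_reward), 4) as spread
from per_run
group by submission
having n_runs > 1
order by spread desc
" --table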
Some errors aren't the agent's fault — they're infrastructure issues. Let's isolate those.
uv run --with sqlite-utils sqlite-utils query tb2.db "
select submission, exception_type, count(*) as n, exception_message
from trials
where exception_type in (
'DaytonaError', 'EnvironmentStartTimeoutError',
'DownloadVerifierDirError', 'AddTestsDirError',
'RewardFileNotFoundError', 'VerifierTimeoutError'
)
group by submission, exception_type
order by n desc
" --tablesubmission exception_type n exception_message
------------------------------ ---------------------------- --- ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Terminus2__GLM-4.7 DaytonaError 17 Failed to create session: <html>
<head><title>502 Bad Gateway</title></head>
<body>
<center><h1>502 Bad Gateway</h1></center>
</body>
</html>
Terminus2__Kimi-k2.5 DaytonaError 9 Failed to create session: unauthorized: failed to get auth URL: failed to parse request host: invalid host format: port and sandbox ID not found
MAYA__Claude-4.5-sonnet VerifierTimeoutError 6 Verifier execution timed out after 360.0 seconds
Droid__Claude-Opus-4.6 RewardFileNotFoundError 4 No reward file found at jobs/fudge_2026-01-29__07-55-11/break-filter-js-from-html__3G5YVbb/verifier/reward.txt or jobs/fudge_2026-01-29__07-55-11/break-filter-js-from-html__3G5YVbb/verifier/reward.json
OpenSage__Gemini-3-Pro-Preview VerifierTimeoutError 4 Verifier execution timed out after 900.0 seconds
Deep-Agents__GPT-5.2-Codex DaytonaError 3 Failed to create session: unauthorized: failed to get auth URL: failed to parse request host: invalid host format: port and sandbox ID not found
Judy__Claude-Opus-4.6 VerifierTimeoutError 3 Verifier execution timed out after 900.0 seconds
MAYA__Claude-4.5-sonnet EnvironmentStartTimeoutError 3 Environment start timed out after 600.0 seconds
Terminus-KIRA__Claude-Opus-4.6 EnvironmentStartTimeoutError 3 Environment start timed out after 600.0 seconds
Terminus2__GLM-5 AddTestsDirError 3 Failed to add tests directory to environment.
dakou__qwen3-coder-480b DownloadVerifierDirError 3 Failed to download verifier directory from environment
ClaudeCode__GLM-4.7 VerifierTimeoutError 2 Verifier execution timed out after 1800.0 seconds
Crux__Claude-Opus-4.6 RewardFileNotFoundError 2 No reward file found at jobs/fa20096c/break-filter-js-from-html__aie7YwZ/verifier/reward.txt or jobs/fa20096c/break-filter-js-from-html__aie7YwZ/verifier/reward.json
Simple-Codex__GPT-5.3-Codex DaytonaError 2 Failed to get session command: bad request: no IP address found. Is the Sandbox started?
CodeBrain-1__GPT-5.3-Codex VerifierTimeoutError 1 Verifier execution timed out after 900.0 seconds
Mux__GPT-5.2 VerifierTimeoutError 1 Verifier execution timed out after 900.0 seconds
Terminus2__DeepSeek-V3.2 RewardFileNotFoundError 1 No reward file found at jobs/2026-02-07__07-47-43/break-filter-js-from-html__h4vtei9/verifier/reward.txt or jobs/2026-02-07__07-47-43/break-filter-js-from-html__h4vtei9/verifier/reward.json
Terminus2__GLM-4.7 DownloadVerifierDirError 1 Failed to download verifier directory from environment
Terminus2__Kimi-k2.5 AddTestsDirError 1 Failed to add tests directory to environment.
Terminus2__Minimax-m2.5 RewardFileNotFoundError 1 No reward file found at jobs/Terminus2__Minimax-m2.5/break-filter-js-from-html__QQ3KKz4/verifier/reward.txt or jobs/Terminus2__Minimax-m2.5/break-filter-js-from-html__QQ3KKz4/verifier/reward.json
Terminus2__Minimax-m2.5 VerifierTimeoutError 1 Verifier execution timed out after 1800.0 seconds
Terminus2 with GLM-4.7 hit 17 DaytonaErrors (502 Bad Gateway), pure infrastructure failures. The break-filter-js-from-html task causes RewardFileNotFoundError across multiple submissions, suggesting a verifier bug specific to that task. VerifierTimeoutError hits several submissions, meaning the agent finished but the grading step itself timed out.
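To gauge how much these infrastructure failures move the headline numbers, avg_reward can be recomputed with infra-errored trials dropped from the denominator (a sketch; it assumes reward is recorded as 0 on errored trials, which is how the earlier averages behave; output omitted):
uv run --with sqlite-utils sqlite-utils query tb2.db "
with per_sub as (
  select submission,
         avg(reward) as avg_reward_all,
         avg(case when exception_type in (
           'DaytonaError', 'EnvironmentStartTimeoutError',
           'DownloadVerifierDirError', 'AddTestsDirError',
           'RewardFileNotFoundError', 'VerifierTimeoutError'
         ) then null else reward end) as avg_reward_excl_infra
  from trials
  group by submission
)
select submission,
       round(avg_reward_all, 4) as avg_reward_all,
       round(avg_reward_excl_infra, 4) as avg_reward_excl_infra,
       round(avg_reward_excl_infra - avg_reward_all, 4) as uplift
from per_sub
order by uplift desc
limit 10
" --table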
For reference, here is the failure rate for every task in the benchmark, sorted from hardest to easiest:
- make-doom-for-mips — 100.0% failure rate
- sam-cell-seg — 97.5% failure rate
- install-windows-3.11 — 95.1% failure rate
- caffe-cifar-10 — 94.7% failure rate
- filter-js-from-html — 94.7% failure rate
- gpt2-codegolf — 91.9% failure rate
- extract-moves-from-video — 91.7% failure rate
- raman-fitting — 91.7% failure rate
- train-fasttext — 89.2% failure rate
- mteb-retrieve — 89.0% failure rate
- video-processing — 88.6% failure rate
- torch-tensor-parallelism — 80.5% failure rate
- dna-assembly — 79.3% failure rate
- db-wal-recovery — 78.0% failure rate
- torch-pipeline-parallelism — 78.0% failure rate
- dna-insert — 77.9% failure rate
- mteb-leaderboard — 76.3% failure rate
- model-extraction-relu-logits — 75.8% failure rate
- make-mips-interpreter — 75.4% failure rate
- gcode-to-text — 74.6% failure rate
- regex-chess — 70.1% failure rate
- polyglot-c-py — 65.0% failure rate
- polyglot-rust-c — 63.4% failure rate
- query-optimize — 61.3% failure rate
- path-tracing — 59.3% failure rate
- adaptive-rejection-sampler — 59.0% failure rate
- qemu-alpine-ssh — 57.4% failure rate
- path-tracing-reverse — 54.5% failure rate
- protein-assembly — 52.9% failure rate
- chess-best-move — 52.8% failure rate
- write-compressor — 49.6% failure rate
- configure-git-webserver — 47.1% failure rate
- tune-mjcf — 46.3% failure rate
- winning-avg-corewars — 45.9% failure rate
- cancel-async-tasks — 44.7% failure rate
- financial-document-processor — 43.9% failure rate
- overfull-hbox — 43.7% failure rate
- sanitize-git-repo — 43.4% failure rate
- extract-elf — 43.0% failure rate
- schemelike-metacircular-eval — 39.5% failure rate
- compile-compcert — 37.0% failure rate
- feal-linear-cryptanalysis — 36.4% failure rate
- circuit-fibsqrt — 35.9% failure rate
- break-filter-js-from-html — 33.7% failure rate
- sparql-university — 30.9% failure rate
- largest-eigenval — 30.1% failure rate
- build-pmars — 29.3% failure rate
- mailman — 29.2% failure rate
- large-scale-text-editing — 27.7% failure rate
- bn-fit-modify — 27.6% failure rate
- qemu-startup — 27.6% failure rate
- rstan-to-pystan — 26.9% failure rate
- build-cython-ext — 23.6% failure rate
- password-recovery — 23.6% failure rate
- pytorch-model-cli — 23.6% failure rate
- feal-differential-cryptanalysis — 23.3% failure rate
- count-dataset-tokens — 23.1% failure rate
- sqlite-db-truncate — 22.9% failure rate
- llm-inference-batching-scheduler — 21.9% failure rate
- reshard-c4-data — 20.8% failure rate
- mcmc-sampling-stan — 20.7% failure rate
- fix-ocaml-gc — 20.0% failure rate
- openssl-selfsigned-cert — 19.5% failure rate
- sqlite-with-gcov — 18.7% failure rate
- pytorch-model-recovery — 18.5% failure rate
- build-pov-ray — 17.4% failure rate
- crack-7z-hash — 17.1% failure rate
- kv-store-grpc — 15.4% failure rate
- hf-model-inference — 14.9% failure rate
- headless-terminal — 14.8% failure rate
- merge-diff-arc-agi-task — 12.3% failure rate
- pypi-server — 11.4% failure rate
- regex-log — 11.4% failure rate
- fix-code-vulnerability — 9.0% failure rate
- git-multibranch — 8.1% failure rate
- distribution-search — 7.3% failure rate
- code-from-image — 6.6% failure rate
- log-summary-date-ranges — 6.5% failure rate
- modernize-scientific-stack — 5.8% failure rate
- custom-memory-heap-crash — 5.7% failure rate
- multi-source-data-merger — 5.7% failure rate
- prove-plus-comm — 5.7% failure rate
- portfolio-optimization — 5.0% failure rate
- vulnerable-secret — 4.9% failure rate
- constraints-scheduling — 4.1% failure rate
- fix-git — 4.1% failure rate
- nginx-request-logging — 4.1% failure rate
- cobol-modernization — 3.2% failure rate
- git-leak-recovery — 1.6% failure rate