Terminal Bench 2.0 — Analysis

2026-02-27T20:58:00Z by Showboat 0.6.0

This analysis explores the Terminal Bench 2.0 results: 10,947 trials across 27 agent/model submissions on 89 benchmark tasks. The data was loaded from the per-trial JSON result files into a SQLite database with a proper relational schema.
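
Before diving into the queries, sqlite-utils can give a quick overview of what ended up in tb2.db (the load script is included at the end of this document). This is just an orientation step, assuming the same uv setup used throughout:

uv run --with sqlite-utils sqlite-utils tables tb2.db --counts --table
uv run --with sqlite-utils sqlite-utils schema tb2.db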

Overall Leaderboard

Which agent/model combination scores the highest across all 89 tasks?

uv run --with sqlite-utils sqlite-utils query tb2.db "select submission, n_trials, n_passed, n_failed, n_errored, avg_reward from submission_stats order by avg_reward desc" --table
submission                               n_trials    n_passed    n_failed    n_errored    avg_reward
-------------------------------------  ----------  ----------  ----------  -----------  ------------
Droid__GPT-5.3-Codex                          445         344          73           28        0.773
Simple-Codex__GPT-5.3-Codex                   445         332          46           67        0.7563
Terminus-KIRA__Gemini-3.1-Pro-Preview         445         333          76           36        0.7517
Terminus-KIRA__Claude-Opus-4.6                445         331          54           60        0.7489
Judy__Claude-Opus-4.6                         445         320          65           60        0.7256
Droid__Claude-Opus-4.6                        445         311         108           26        0.7052
CodeBrain-1__GPT-5.3-Codex                    445         313          98           34        0.705
Mux__GPT-5.3-Codex                            445         305         114           26        0.6854
Deep-Agents__GPT-5.2-Codex                    445         293          77           75        0.6798
Crux__Claude-Opus-4.6                         445         272          99           74        0.67
Mux__Claude-Opus-4.6                          445         296         113           36        0.6652
OpenSage__Gemini-3-Pro-Preview                445         290         131           24        0.6576
Terminus2__GPT-5.3-Codex                      445         288          70           87        0.6545
Ante__Gemini-3-Pro-Preview                    445         288         129           28        0.6501
Terminus2__Claude-Opus-4.6                    445         280          92           73        0.6349
CodeBrain-1__Gemini-3-Pro-Preview             445         277         136           32        0.6225
Mux__GPT-5.2                                   89          54          25           10        0.6207
Mux__Claude-Opus-4.5                           89          52          29            8        0.5843
Terminus2__GLM-5                              445         231         113          101        0.5397
OpenCode__Claude-Opus-4.5                      89          46          38            5        0.5227
MAYA__Claude-4.5-sonnet                       445         190         241           14        0.4408
Terminus2__Kimi-k2.5                          445         189         161           95        0.4385
Terminus2__Minimax-m2.5                       445         188          92          165        0.4292
Terminus2__DeepSeek-V3.2                      445         176         183           86        0.3982
Terminus2__GLM-4.7                            445         147         139          159        0.3475
ClaudeCode__GLM-4.7                           445         148         250           47        0.3348
dakou__qwen3-coder-480b                       445         121         232           92        0.275

Droid with GPT-5.3-Codex leads at 77.3% pass rate, followed closely by Simple-Codex (also GPT-5.3-Codex) and Terminus-KIRA (Gemini 3.1 Pro Preview). The top 4 are within 2.5 percentage points of each other.
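
Note that three of these submissions (Mux__GPT-5.2, Mux__Claude-Opus-4.5 and OpenCode__Claude-Opus-4.5) ran the 89 tasks only once rather than five times, so their numbers are noisier. A simple variant of the query above restricts the leaderboard to the full 5x submissions (same view, one extra filter):

uv run --with sqlite-utils sqlite-utils query tb2.db "select submission, n_trials, n_passed, avg_reward from submission_stats where n_trials = 445 order by avg_reward desc limit 10" --table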

Best Model per Backbone

Multiple agents use the same underlying model. Which agent gets the most out of each model?

uv run --with sqlite-utils sqlite-utils query tb2.db "
select s.model_name, s.slug as best_submission, ss.avg_reward, ss.n_passed, ss.n_trials
from submissions s
join submission_stats ss on ss.submission = s.slug
where ss.avg_reward = (
  select max(ss2.avg_reward)
  from submissions s2
  join submission_stats ss2 on ss2.submission = s2.slug
  where s2.model_name = s.model_name
)
order by ss.avg_reward desc
" --table
model_name                          best_submission                          avg_reward    n_passed    n_trials
----------------------------------  -------------------------------------  ------------  ----------  ----------
gpt-5.3-codex                       Droid__GPT-5.3-Codex                         0.773          344         445
openai/gpt-5.3-codex                Simple-Codex__GPT-5.3-Codex                  0.7563         332         445
vertex_ai/gemini-3.1-pro-preview    Terminus-KIRA__Gemini-3.1-Pro-Preview        0.7517         333         445
vertex_ai/claude-opus-4-6           Terminus-KIRA__Claude-Opus-4.6               0.7489         331         445
aurora-01-21                        Droid__Claude-Opus-4.6                       0.7052         311         445
openai:gpt-5.2-codex                Deep-Agents__GPT-5.2-Codex                   0.6798         293         445
claude-sonnet-4-20250514            Crux__Claude-Opus-4.6                        0.67           272         445
anthropic/claude-opus-4-6           Mux__Claude-Opus-4.6                         0.6652         296         445
litellm_proxy/gemini-3-pro-preview  OpenSage__Gemini-3-Pro-Preview               0.6576         290         445
gemini-3-pro-preview                Ante__Gemini-3-Pro-Preview                   0.6501         288         445
vertex_ai/gemini-3-pro-preview      CodeBrain-1__Gemini-3-Pro-Preview            0.6225         277         445
openai/gpt-5.2                      Mux__GPT-5.2                                 0.6207          54          89
anthropic/claude-opus-4-5           Mux__Claude-Opus-4.5                         0.5843          52          89
openai/glm-5                        Terminus2__GLM-5                             0.5397         231         445
openai/kimi-k2.5:cloud              Terminus2__Kimi-k2.5                         0.4385         189         445
openai/minimax-m2.5:cloud           Terminus2__Minimax-m2.5                      0.4292         188         445
deepseek/deepseek-chat              Terminus2__DeepSeek-V3.2                     0.3982         176         445
openai/glm-4.7:cloud                Terminus2__GLM-4.7                           0.3475         147         445
GLM-4.7                             ClaudeCode__GLM-4.7                          0.3348         148         445
qwen3-coder-modelscope              dakou__qwen3-coder-480b                      0.275          121         445

Interesting — some models appear under different provider prefixes (e.g. Claude Opus 4.6 appears as aurora-01-21, vertex_ai/claude-opus-4-6, and anthropic/claude-opus-4-6). The agent scaffold matters: Terminus-KIRA gets 74.9% from Claude Opus 4.6 via Vertex, while Mux gets 66.5% from the same model via Anthropic's API directly.
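
The provider prefixes make it awkward to compare scaffolds on the same underlying model directly. One rough way to normalize the names is to strip everything before a '/' or ':' separator; this is only a sketch, and it will not catch aliases such as aurora-01-21:

uv run --with sqlite-utils sqlite-utils query tb2.db "
select
  case
    when instr(model_name, '/') > 0 then substr(model_name, instr(model_name, '/') + 1)
    when instr(model_name, ':') > 0 then substr(model_name, instr(model_name, ':') + 1)
    else model_name
  end as base_model,
  count(*) as n_submissions
from submissions
group by base_model
order by n_submissions desc
" --table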

The Hardest Tasks

Which of the 89 tasks have the highest failure rates across all submissions?

uv run --with sqlite-utils sqlite-utils query tb2.db "
select task_name, n_trials, n_passed, n_failed, n_errored, failure_rate
from task_stats order by failure_rate desc limit 15
" --table
task_name                     n_trials    n_passed    n_failed    n_errored    failure_rate
--------------------------  ----------  ----------  ----------  -----------  --------------
make-doom-for-mips                 123           0          12          111          1
sam-cell-seg                       123           3         117            3          0.975
install-windows-3.11               123           6         109            8          0.9508
caffe-cifar-10                     123           6          34           83          0.9474
filter-js-from-html                123           6         108            9          0.9474
gpt2-codegolf                      123          10          25           88          0.9187
extract-moves-from-video           123          10          52           61          0.9174
raman-fitting                      123          10          78           35          0.9174
train-fasttext                     123          12          25           86          0.8919
mteb-retrieve                      123          13         105            5          0.8898
video-processing                   123          14         107            2          0.8862
torch-tensor-parallelism           123          24          88           11          0.8049
dna-assembly                       123          25          80           18          0.7934
db-wal-recovery                    123          27          48           48          0.7805
torch-pipeline-parallelism         123          27          69           27          0.7805

make-doom-for-mips is completely unsolved — 0 passes out of 123 attempts, with 111 of those being errors (mostly timeouts). sam-cell-seg and install-windows-3.11 are nearly as brutal.

Notice that some tasks have high error counts rather than clean failures — caffe-cifar-10 has 83 errors out of 123 attempts, suggesting agents crash or timeout rather than producing wrong answers.
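
To confirm what is going wrong there, the same trials table can be filtered to a single task and grouped by exception type (caffe-cifar-10 shown here; any task_name works):

uv run --with sqlite-utils sqlite-utils query tb2.db "
select exception_type, count(*) as n
from trials
where task_name = 'caffe-cifar-10' and exception_type is not null
group by exception_type
order by n desc
" --table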

Tasks No One Has Solved

uv run --with sqlite-utils sqlite-utils query tb2.db "select task_name, n_trials, n_failed, n_errored from task_stats where n_passed = 0" --table
task_name             n_trials    n_failed    n_errored
------------------  ----------  ----------  -----------
make-doom-for-mips         123          12          111

Only one task is completely unsolved: compiling DOOM for MIPS. 90% of attempts don't even finish; they time out.

The Easiest Tasks

uv run --with sqlite-utils sqlite-utils query tb2.db "
select task_name, n_trials, n_passed, n_failed, n_errored, avg_reward
from task_stats order by failure_rate asc limit 15
" --table
task_name                     n_trials    n_passed    n_failed    n_errored    avg_reward
--------------------------  ----------  ----------  ----------  -----------  ------------
git-leak-recovery                  123         121           2            0        0.9837
cobol-modernization                123         119           4            0        0.9675
constraints-scheduling             123         118           5            0        0.9593
fix-git                            123         118           5            0        0.9593
nginx-request-logging              123         118           5            0        0.9593
vulnerable-secret                  123         116           6            1        0.9508
portfolio-optimization             123         114           6            3        0.95
custom-memory-heap-crash           123         116           2            5        0.9431
multi-source-data-merger           123         116           7            0        0.9431
prove-plus-comm                    123         116           7            0        0.9431
modernize-scientific-stack         123         113           6            4        0.9417
log-summary-date-ranges            123         115           8            0        0.935
code-from-image                    123         113           3            7        0.9339
distribution-search                123         114           5            4        0.9268
git-multibranch                    123         113           9            1        0.9187

git-leak-recovery is nearly universally solved (98.4%) — only 2 failures across all 123 attempts. Tasks involving git operations, COBOL modernization, and constraint scheduling are consistently easy for all models.

Error Analysis

What kinds of errors do agents hit?

uv run --with sqlite-utils sqlite-utils query tb2.db "
select exception_type, count(*) as n,
  round(100.0 * count(*) / (select count(*) from trials where exception_type is not null), 1) as pct
from trials
where exception_type is not null
group by exception_type
order by n desc
" --table
exception_type                   n    pct
----------------------------  ----  -----
AgentTimeoutError             1596   92.9
DaytonaError                    31    1.8
RuntimeError                    20    1.2
VerifierTimeoutError            18    1
BadRequestError                  9    0.5
NameError                        8    0.5
OSError                          8    0.5
RewardFileNotFoundError          8    0.5
EnvironmentStartTimeoutError     6    0.3
AddTestsDirError                 4    0.2
AttributeError                   4    0.2
DownloadVerifierDirError         4    0.2
AgentSetupTimeoutError           1    0.1
KeyError                         1    0.1

AgentTimeoutError dominates at 92.9% of all errors. The remaining errors are a mix of infrastructure issues (DaytonaError, EnvironmentStartTimeoutError) and agent bugs (NameError, AttributeError).
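
The exception_message column records the configured limits, so a small follow-up query shows which timeout values agents are actually hitting (message wording may vary between harness versions, so treat this as a sketch):

uv run --with sqlite-utils sqlite-utils query tb2.db "
select exception_message, count(*) as n
from trials
where exception_type = 'AgentTimeoutError'
group by exception_message
order by n desc
limit 10
" --table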

Which submissions error the most?

uv run --with sqlite-utils sqlite-utils query tb2.db "
select submission,
  count(*) as n_trials,
  sum(case when exception_type is not null then 1 else 0 end) as n_errors,
  round(100.0 * sum(case when exception_type is not null then 1 else 0 end) / count(*), 1) as error_pct,
  sum(case when exception_type = 'AgentTimeoutError' then 1 else 0 end) as n_timeouts
from trials
group by submission
order by error_pct desc
" --table
submission                               n_trials    n_errors    error_pct    n_timeouts
-------------------------------------  ----------  ----------  -----------  ------------
Terminus2__Minimax-m2.5                       445         181         40.7           179
Terminus2__GLM-4.7                            445         161         36.2           143
Terminus2__GLM-5                              445         111         24.9           108
Terminus2__Kimi-k2.5                          445         101         22.7            91
dakou__qwen3-coder-480b                       445          95         21.3            92
Simple-Codex__GPT-5.3-Codex                   445          92         20.7            90
Terminus2__GPT-5.3-Codex                      445          92         20.7            92
Terminus2__DeepSeek-V3.2                      445          91         20.4            90
Judy__Claude-Opus-4.6                         445          86         19.3            83
Deep-Agents__GPT-5.2-Codex                    445          82         18.4            70
Terminus-KIRA__Claude-Opus-4.6                445          81         18.2            78
Terminus2__Claude-Opus-4.6                    445          81         18.2            81
Crux__Claude-Opus-4.6                         445          78         17.5            40
Mux__GPT-5.2                                   89          12         13.5            11
ClaudeCode__GLM-4.7                           445          48         10.8            46
Mux__Claude-Opus-4.6                          445          46         10.3            46
Terminus-KIRA__Gemini-3.1-Pro-Preview         445          43          9.7            43
Mux__Claude-Opus-4.5                           89           8          9               8
CodeBrain-1__GPT-5.3-Codex                    445          36          8.1            35
CodeBrain-1__Gemini-3-Pro-Preview             445          35          7.9            35
Ante__Gemini-3-Pro-Preview                    445          30          6.7            29
Droid__GPT-5.3-Codex                          445          29          6.5            29
Mux__GPT-5.3-Codex                            445          29          6.5            29
Droid__Claude-Opus-4.6                        445          26          5.8            22
OpenCode__Claude-Opus-4.5                      89           5          5.6             5
OpenSage__Gemini-3-Pro-Preview                445          25          5.6            21
MAYA__Claude-4.5-sonnet                       445          14          3.1             0

Terminus2 with Minimax-m2.5 errors 40.7% of the time — almost all timeouts. MAYA with Claude 4.5 Sonnet has the lowest error rate (3.1%) and zero timeouts, though it has a lower overall pass rate since it cleanly fails rather than crashing.

Tasks that cause the most timeouts

uv run --with sqlite-utils sqlite-utils query tb2.db "
select task_name, count(*) as n_timeouts,
  round(100.0 * count(*) / (select count(*) from trials t2 where t2.task_name = t.task_name), 1) as timeout_pct
from trials t
where exception_type = 'AgentTimeoutError'
group by task_name
order by timeout_pct desc
limit 15
" --table
task_name                     n_timeouts    timeout_pct
--------------------------  ------------  -------------
make-doom-for-mips                   109           88.6
gpt2-codegolf                         88           71.5
caffe-cifar-10                        84           68.3
train-fasttext                        83           67.5
extract-moves-from-video              60           48.8
make-mips-interpreter                 57           46.3
qemu-alpine-ssh                       56           45.5
tune-mjcf                             52           42.3
write-compressor                      50           40.7
path-tracing                          48           39
db-wal-recovery                       48           39
adaptive-rejection-sampler            42           34.1
torch-pipeline-parallelism            39           31.7
polyglot-rust-c                       39           31.7
path-tracing-reverse                  35           28.5

The timeout-prone tasks are the computationally heavy ones: compiling DOOM for MIPS (88.6% timeout), GPT-2 code golf (71.5%), training Caffe on CIFAR-10 (68.3%). These tasks likely need more time than the default timeout allows, regardless of the agent's skill.
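
A useful follow-up is to check whether any scaffold copes with one of these long-running tasks at all, using the per-submission/per-task view. gpt2-codegolf is used here as an example; any of the task names above can be substituted:

uv run --with sqlite-utils sqlite-utils query tb2.db "
select submission, n_trials, avg_reward, n_errored
from submission_task_matrix
where task_name = 'gpt2-codegolf' and avg_reward > 0
order by avg_reward desc
" --table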

Timing Analysis

How long do agents actually spend on tasks, and does spending more time correlate with success?

uv run --with sqlite-utils sqlite-utils query tb2.db "
select status, count(*) as n,
  round(avg((julianday(agent_exec_finished_at) - julianday(agent_exec_started_at)) * 86400), 1) as avg_agent_sec,
  round(min((julianday(agent_exec_finished_at) - julianday(agent_exec_started_at)) * 86400), 1) as min_sec,
  round(max((julianday(agent_exec_finished_at) - julianday(agent_exec_started_at)) * 86400), 1) as max_sec
from trials
where agent_exec_started_at is not null and agent_exec_finished_at is not null
group by status
" --table
status       n    avg_agent_sec    min_sec    max_sec
--------  ----  ---------------  ---------  ---------
errored   1509           1504.1        0      12000.1
failed    2984            541          0.4     9338.2
passed    6415            423.3        1.7     6250

Errored trials average 1,504 seconds (25 minutes) — almost all timeouts running to the limit. Passed trials average 423 seconds (7 minutes) and failed trials 541 seconds (9 minutes). Agents that succeed tend to do so relatively quickly. Agents that fail spend more time trying before giving up.

Time breakdown: setup vs execution vs verification

uv run --with sqlite-utils sqlite-utils query tb2.db "
select submission,
  round(avg((julianday(env_setup_finished_at) - julianday(env_setup_started_at)) * 86400), 1) as env_setup_sec,
  round(avg((julianday(agent_setup_finished_at) - julianday(agent_setup_started_at)) * 86400), 1) as agent_setup_sec,
  round(avg((julianday(agent_exec_finished_at) - julianday(agent_exec_started_at)) * 86400), 1) as exec_sec,
  round(avg((julianday(verifier_finished_at) - julianday(verifier_started_at)) * 86400), 1) as verify_sec
from trials
where env_setup_started_at is not null
group by submission
order by exec_sec desc
limit 15
" --table
submission                        env_setup_sec    agent_setup_sec    exec_sec    verify_sec
------------------------------  ---------------  -----------------  ----------  ------------
Terminus2__Minimax-m2.5                    17.5               18.3       965.5          93.5
Terminus2__GLM-4.7                         14                 20.6       886.2          90.8
Terminus2__GLM-5                           10.5               17.2       822.9          63.2
Judy__Claude-Opus-4.6                      11.2               47.1       760.5          72
Terminus-KIRA__Claude-Opus-4.6             39.2               17.8       753.8          46.2
Simple-Codex__GPT-5.3-Codex                 2.7               13.2       742.4          66.1
dakou__qwen3-coder-480b                    52.1              109.7       737.2         122.6
Terminus2__GPT-5.3-Codex                    2.9               12.7       734.7          48
Terminus2__Kimi-k2.5                       15.1               12.4       711.6          74.5
Deep-Agents__GPT-5.2-Codex                  3.4                0         704.5          45.8
Terminus2__DeepSeek-V3.2                    9.5                8.7       703.4          31.1
Terminus2__Claude-Opus-4.6                  3.6               12.9       656.2          49.6
Mux__Claude-Opus-4.6                        9.2               78.7       574.4          40.2
Mux__GPT-5.2                                6.2               79.2       570.5          95.1
OpenCode__Claude-Opus-4.5                  26.8               40.3       557.1          66.7

Agent execution dominates the time budget, typically 10-20x longer than setup or verification. The dakou agent has notably long agent setup (110 seconds on average), perhaps because it is downloading a large model. Average verification time ranges from roughly 30 to 120 seconds depending on the submission.

Cost Analysis

Only some submissions report token usage and costs.

uv run --with sqlite-utils sqlite-utils query tb2.db "
select submission,
  sum(case when status = 'passed' then 1 else 0 end) as n_passed,
  round(sum(cost_usd), 2) as total_cost,
  round(avg(cost_usd), 2) as avg_cost_per_trial,
  round(sum(cost_usd) / nullif(sum(case when status = 'passed' then 1 else 0 end), 0), 2) as cost_per_pass
from trials
where cost_usd is not null and cost_usd > 0
group by submission
having n_passed > 0
order by cost_per_pass asc
" --table
submission                               n_passed    total_cost    avg_cost_per_trial    cost_per_pass
-------------------------------------  ----------  ------------  --------------------  ---------------
Mux__GPT-5.3-Codex                            236          0                     0                0
Terminus2__DeepSeek-V3.2                      176         13.67                  0.03             0.08
Mux__Claude-Opus-4.6                          101         37.22                  0.32             0.37
Terminus2__Claude-Opus-4.6                    280        293.61                  0.66             1.05
Terminus-KIRA__Gemini-3.1-Pro-Preview         333        406.63                  0.91             1.22
Terminus-KIRA__Claude-Opus-4.6                331        713.67                  1.61             2.16

Huge cost range: DeepSeek V3.2 costs just $0.08 per correct answer while Terminus-KIRA with Claude Opus 4.6 costs $2.16 per correct answer, 27x more expensive. The GPT-5.3-Codex run via Mux reports zero cost, which is likely a reporting issue.

Terminus-KIRA with Claude Opus 4.6 spent $713.67 total across 445 trials — an expensive benchmark run.
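
For the submissions that do report usage, token counts are also recorded, so cost can be expressed per million output tokens. This is only indicative, since input and cache token pricing differs by provider:

uv run --with sqlite-utils sqlite-utils query tb2.db "
select submission,
  sum(n_input_tokens) as input_tokens,
  sum(n_output_tokens) as output_tokens,
  round(1000000.0 * sum(cost_usd) / nullif(sum(n_output_tokens), 0), 2) as usd_per_m_output_tokens
from trials
where cost_usd is not null and cost_usd > 0
group by submission
order by usd_per_m_output_tokens desc
" --table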

Head-to-Head: Claude vs GPT on Individual Tasks

Where does Claude Opus 4.6 (via Judy) beat GPT-5.3-Codex (via Droid), and vice versa?

uv run --with sqlite-utils sqlite-utils query tb2.db "
select m1.task_name,
  round(m1.avg_reward, 2) as claude_opus,
  round(m2.avg_reward, 2) as gpt_codex,
  round(m1.avg_reward - m2.avg_reward, 2) as claude_advantage
from submission_task_matrix m1
join submission_task_matrix m2 using (task_name)
where m1.submission = 'Judy__Claude-Opus-4.6'
  and m2.submission = 'Droid__GPT-5.3-Codex'
  and abs(m1.avg_reward - m2.avg_reward) > 0.3
order by claude_advantage desc
" --table
task_name                       claude_opus    gpt_codex    claude_advantage
----------------------------  -------------  -----------  ------------------
qemu-alpine-ssh                        1             0                  1
mcmc-sampling-stan                     0.8           0                  0.8
torch-pipeline-parallelism             0.8           0                  0.8
db-wal-recovery                        1             0.4                0.6
overfull-hbox                          1             0.4                0.6
query-optimize                         0.75          0.4                0.35
extract-moves-from-video               0             0.4               -0.4
git-multibranch                        0.6           1                 -0.4
mteb-leaderboard                       0.2           0.6               -0.4
path-tracing                           0.6           1                 -0.4
schemelike-metacircular-eval           0.6           1                 -0.4
tune-mjcf                              0.4           0.8               -0.4
write-compressor                       0.6           1                 -0.4
financial-document-processor           0.4           1                 -0.6
adaptive-rejection-sampler             0.2           1                 -0.8
build-pmars                            0.2           1                 -0.8
make-mips-interpreter                  0             0.8               -0.8
openssl-selfsigned-cert                0.2           1                 -0.8
regex-chess                            0             1                 -1

Claude Opus 4.6 (Judy) uniquely solves qemu-alpine-ssh (100% vs 0%), and does much better on mcmc-sampling-stan and torch-pipeline-parallelism. But GPT-5.3-Codex (Droid) dominates on regex-chess (100% vs 0%), adaptive-rejection-sampler, build-pmars, and openssl-selfsigned-cert. These are genuinely different capability profiles, not just noise.

Consistency: Which Tasks Discriminate Strong from Weak Models?

Some tasks are too easy (everyone passes) or too hard (everyone fails) to be useful discriminators. Which tasks best separate the top models from the bottom?

uv run --with sqlite-utils sqlite-utils query tb2.db "
select m.task_name,
  round(avg(case when ss.avg_reward > 0.65 then m.avg_reward end), 3) as top_half,
  round(avg(case when ss.avg_reward <= 0.65 then m.avg_reward end), 3) as bottom_half,
  round(
    avg(case when ss.avg_reward > 0.65 then m.avg_reward end) -
    avg(case when ss.avg_reward <= 0.65 then m.avg_reward end), 3
  ) as discrimination
from submission_task_matrix m
join submission_stats ss on ss.submission = m.submission
where ss.n_trials > 100
group by m.task_name
having discrimination is not null
order by discrimination desc
limit 15
" --table
task_name                    top_half    bottom_half    discrimination
-------------------------  ----------  -------------  ----------------
sanitize-git-repo               0.886          0.1               0.786
path-tracing                    0.7            0.02              0.68
break-filter-js-from-html       0.9            0.22              0.68
protein-assembly                0.757          0.1               0.657
feal-linear-cryptanalysis       0.893          0.24              0.653
write-compressor                0.775          0.14              0.635
path-tracing-reverse            0.714          0.08              0.634
bn-fit-modify                   0.971          0.36              0.611
chess-best-move                 0.729          0.12              0.609
overfull-hbox                   0.786          0.2               0.586
circuit-fibsqrt                 0.886          0.3               0.586
sqlite-db-truncate              1              0.435             0.565
large-scale-text-editing        0.957          0.4               0.557
winning-avg-corewars            0.764          0.24              0.524
build-cython-ext                0.971          0.46              0.511

sanitize-git-repo is the strongest discriminator — top-half models solve it 88.6% of the time while bottom-half models solve it only 10%. Tasks like path-tracing, break-filter-js-from-html, and feal-linear-cryptanalysis also strongly separate the pack. These are the tasks that really test whether an agent framework is well-built.

The Most Inconsistent Tasks

Which tasks have the highest variance in performance across submissions? These are tasks where the agent scaffold matters more than the model.

uv run --with sqlite-utils sqlite-utils query tb2.db "
select task_name,
  round(avg(avg_reward), 3) as mean_reward,
  round(min(avg_reward), 3) as worst,
  round(max(avg_reward), 3) as best,
  round(avg(avg_reward * avg_reward) - avg(avg_reward) * avg(avg_reward), 4) as variance,
  count(*) as n_submissions
from submission_task_matrix
group by task_name
having n_submissions > 10
order by variance desc
limit 15
" --table
task_name                       mean_reward    worst    best    variance    n_submissions
----------------------------  -------------  -------  ------  ----------  ---------------
break-filter-js-from-html             0.608        0       1      0.2046               27
feal-linear-cryptanalysis             0.589        0       1      0.2002               27
path-tracing-reverse                  0.474        0       1      0.196                27
chess-best-move                       0.459        0       1      0.1891               27
write-compressor                      0.454        0       1      0.1876               27
financial-document-processor          0.57         0       1      0.1828               27
schemelike-metacircular-eval          0.633        0       1      0.1815               27
overfull-hbox                         0.556        0       1      0.1788               27
polyglot-rust-c                       0.333        0       1      0.1778               27
sanitize-git-repo                     0.57         0       1      0.1769               27
protein-assembly                      0.43         0       1      0.1769               27
qemu-alpine-ssh                       0.444        0       1      0.1699               27
circuit-fibsqrt                       0.681        0       1      0.1667               27
path-tracing                          0.37         0       1      0.1613               27
configure-git-webserver               0.511        0       1      0.161                27

Every high-variance task has submissions scoring 0 and others scoring 1 — these are all-or-nothing tasks where the agent's approach either works completely or fails completely. break-filter-js-from-html has the highest variance: mean reward of 0.61 but some submissions never solve it while others always do.
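
The variance expression in that query uses the standard single-pass identity for population variance:

Var(X) = E[X²] - (E[X])²

which is why it is written as avg(avg_reward * avg_reward) - avg(avg_reward) * avg(avg_reward); stock SQLite has no built-in stddev or variance aggregate.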

Submissions That Improved Across Runs

Some submissions ran the benchmark multiple times. Did they get better?

uv run --with sqlite-utils sqlite-utils query tb2.db "
select r.submission, r.run_date,
  count(*) as n_trials,
  sum(case when t.status = 'passed' then 1 else 0 end) as n_passed,
  round(avg(t.reward), 4) as avg_reward
from trials t
join runs r on t.run_id = r.id
group by r.submission, r.run_date
having n_trials > 40
order by r.submission, r.run_date
" --table
submission                             run_date                                   n_trials    n_passed    avg_reward
-------------------------------------  ---------------------------------------  ----------  ----------  ------------
Ante__Gemini-3-Pro-Preview             2025-12-31__22-36-36                            445         288        0.6501
ClaudeCode__GLM-4.7                    2026-02-06__13-08-08                            445         148        0.3348
CodeBrain-1__GPT-5.3-Codex             2026-02-08__11-01-20                             89          60        0.6742
CodeBrain-1__GPT-5.3-Codex             2026-02-08__20-43-33                             89          66        0.7416
CodeBrain-1__GPT-5.3-Codex             2026-02-09__01-09-07                             89          65        0.7303
CodeBrain-1__GPT-5.3-Codex             2026-02-09__10-03-38                             89          60        0.6818
CodeBrain-1__GPT-5.3-Codex             2026-02-09__14-06-10                             89          62        0.6966
CodeBrain-1__Gemini-3-Pro-Preview      2026-01-31__22-50-53                             89          56        0.6292
CodeBrain-1__Gemini-3-Pro-Preview      2026-02-01__11-30-31                             89          56        0.6292
CodeBrain-1__Gemini-3-Pro-Preview      2026-02-01__16-43-15                             89          54        0.6067
CodeBrain-1__Gemini-3-Pro-Preview      2026-02-05__00-52-21                             89          56        0.6292
CodeBrain-1__Gemini-3-Pro-Preview      2026-02-05__11-19-55                             89          55        0.618
Crux__Claude-Opus-4.6                  submission-run                                  445         272        0.67
Deep-Agents__GPT-5.2-Codex             2026-02-10__12-43-13                            445         293        0.6798
Droid__Claude-Opus-4.6                 fudge_2026-01-29__03-21-39                       89          61        0.6932
Droid__Claude-Opus-4.6                 fudge_2026-01-29__06-23-59                       89          64        0.7273
Droid__Claude-Opus-4.6                 fudge_2026-01-29__07-55-11                       89          61        0.6932
Droid__Claude-Opus-4.6                 fudge_2026-01-29__23-21-40                       89          63        0.7079
Droid__Claude-Opus-4.6                 fudge_2026-01-30__02-21-25                       89          62        0.7045
Droid__GPT-5.3-Codex                   53codex_2026-02-16__17-43-12                     89          69        0.7753
Droid__GPT-5.3-Codex                   53codex_2026-02-21__21-58-56                     89          68        0.764
Droid__GPT-5.3-Codex                   53codex_2026-02-22__06-09-07                     89          68        0.764
Droid__GPT-5.3-Codex                   53codex_2026-02-22__09-59-36                     89          69        0.7753
Droid__GPT-5.3-Codex                   53codex_2026-02-22__10-19-01                     89          70        0.7865
Judy__Claude-Opus-4.6                  2026-02-10__18-07-36                             89          64        0.7273
Judy__Claude-Opus-4.6                  2026-02-11__11-58-19                             89          63        0.7079
Judy__Claude-Opus-4.6                  2026-02-12__11-31-57                             89          62        0.7045
Judy__Claude-Opus-4.6                  2026-02-12__18-19-55                             89          64        0.7273
Judy__Claude-Opus-4.6                  2026-02-13__00-57-16                             89          67        0.7614
MAYA__Claude-4.5-sonnet                2025-11-13__09-25-32                             89          38        0.4419
MAYA__Claude-4.5-sonnet                2025-12-05__06-20-00                             89          38        0.4419
MAYA__Claude-4.5-sonnet                2025-12-05__16-42-55                             89          38        0.4368
MAYA__Claude-4.5-sonnet                2025-12-06__16-31-45                             89          38        0.4524
MAYA__Claude-4.5-sonnet                2025-12-09__13-26-21                             89          38        0.4318
Mux__Claude-Opus-4.5                   2026-01-16__00-15-05                             89          52        0.5843
Mux__Claude-Opus-4.6                   2026-02-09__01-47-04                             89          59        0.6629
Mux__Claude-Opus-4.6                   2026-02-09__01-47-09                             89          62        0.6966
Mux__Claude-Opus-4.6                   2026-02-10__00-43-32                             89          57        0.6404
Mux__Claude-Opus-4.6                   2026-02-10__02-03-21                             89          60        0.6742
Mux__Claude-Opus-4.6                   2026-02-10__02-43-25                             89          58        0.6517
Mux__GPT-5.2                           2026-01-16__00-55-00                             89          54        0.6207
Mux__GPT-5.3-Codex                     2026-02-08__19-57-27                            445         305        0.6854
OpenCode__Claude-Opus-4.5              2026-01-11__01-42-13                             89          46        0.5227
OpenSage__Gemini-3-Pro-Preview         2026-01-23__08-25-43                             89          57        0.6477
OpenSage__Gemini-3-Pro-Preview         2026-01-23__22-00-30                             89          59        0.6705
OpenSage__Gemini-3-Pro-Preview         2026-01-25__05-43-22                             89          60        0.6742
OpenSage__Gemini-3-Pro-Preview         2026-01-25__20-11-02                             89          57        0.6477
OpenSage__Gemini-3-Pro-Preview         2026-02-12__20-47-36                             89          57        0.6477
Simple-Codex__GPT-5.3-Codex            simple-codex-gpt-5.3-codex-5x-duplicate         445         332        0.7563
Terminus-KIRA__Claude-Opus-4.6         2026-02-13__11-48-54                            445         331        0.7489
Terminus-KIRA__Gemini-3.1-Pro-Preview  2026-02-20__11-30-30                            445         333        0.7517
Terminus2__Claude-Opus-4.6             2026-02-05__16-08-28                             89          61        0.6854
Terminus2__Claude-Opus-4.6             2026-02-05__17-41-47                            356         219        0.6222
Terminus2__DeepSeek-V3.2               2026-02-07__07-47-43                            445         176        0.3982
Terminus2__GLM-4.7                     2026-01-27__12-34-00                            445         147        0.3475
Terminus2__GLM-5                       2026-02-14__13-57-51                            445         231        0.5397
Terminus2__GPT-5.3-Codex               terminus-gpt-5.3-codex-5x                       445         288        0.6545
Terminus2__Kimi-k2.5                   2026-01-26__22-34-00                            445         189        0.4385
Terminus2__Minimax-m2.5                2026-02-18__13-31-00                            445         188        0.4292
dakou__qwen3-coder-480b                2025-12-25__23-49-10                            445         121        0.275

Several submissions ran 5 times (89 tasks each, totaling 445). Performance is fairly stable across runs — CodeBrain-1 with GPT-5.3-Codex ranges from 67.4% to 74.2%, and Judy with Claude Opus 4.6 ranges from 70.5% to 76.1%. The variance between runs is much smaller than the variance between different agent/model combinations.

MAYA with Claude 4.5 Sonnet is remarkably consistent: 43-45% across all 5 runs.
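
The per-run spread can also be summarized directly instead of eyeballed. A sketch that computes each submission's best and worst run, reusing the trials/runs join from above:

uv run --with sqlite-utils sqlite-utils query tb2.db "
select submission,
  count(*) as n_runs,
  round(min(run_reward), 4) as worst_run,
  round(max(run_reward), 4) as best_run,
  round(max(run_reward) - min(run_reward), 4) as spread
from (
  select r.submission as submission, r.id as run_id, avg(t.reward) as run_reward
  from trials t join runs r on t.run_id = r.id
  group by r.id
)
group by submission
having count(*) > 1
order by spread desc
" --table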

Infrastructure Errors

Some errors aren't the agent's fault — they're infrastructure issues. Let's isolate those.

uv run --with sqlite-utils sqlite-utils query tb2.db "
select submission, exception_type, count(*) as n, exception_message
from trials
where exception_type in (
  'DaytonaError', 'EnvironmentStartTimeoutError',
  'DownloadVerifierDirError', 'AddTestsDirError',
  'RewardFileNotFoundError', 'VerifierTimeoutError'
)
group by submission, exception_type
order by n desc
" --table
submission                      exception_type                  n  exception_message
------------------------------  ----------------------------  ---  ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Terminus2__GLM-4.7              DaytonaError                   17  Failed to create session: <html>
                                                                   <head><title>502 Bad Gateway</title></head>
                                                                   <body>
                                                                   <center><h1>502 Bad Gateway</h1></center>
                                                                   </body>
                                                                   </html>
Terminus2__Kimi-k2.5            DaytonaError                    9  Failed to create session: unauthorized: failed to get auth URL: failed to parse request host: invalid host format: port and sandbox ID not found
MAYA__Claude-4.5-sonnet         VerifierTimeoutError            6  Verifier execution timed out after 360.0 seconds
Droid__Claude-Opus-4.6          RewardFileNotFoundError         4  No reward file found at jobs/fudge_2026-01-29__07-55-11/break-filter-js-from-html__3G5YVbb/verifier/reward.txt or jobs/fudge_2026-01-29__07-55-11/break-filter-js-from-html__3G5YVbb/verifier/reward.json
OpenSage__Gemini-3-Pro-Preview  VerifierTimeoutError            4  Verifier execution timed out after 900.0 seconds
Deep-Agents__GPT-5.2-Codex      DaytonaError                    3  Failed to create session: unauthorized: failed to get auth URL: failed to parse request host: invalid host format: port and sandbox ID not found
Judy__Claude-Opus-4.6           VerifierTimeoutError            3  Verifier execution timed out after 900.0 seconds
MAYA__Claude-4.5-sonnet         EnvironmentStartTimeoutError    3  Environment start timed out after 600.0 seconds
Terminus-KIRA__Claude-Opus-4.6  EnvironmentStartTimeoutError    3  Environment start timed out after 600.0 seconds
Terminus2__GLM-5                AddTestsDirError                3  Failed to add tests directory to environment.
dakou__qwen3-coder-480b         DownloadVerifierDirError        3  Failed to download verifier directory from environment
ClaudeCode__GLM-4.7             VerifierTimeoutError            2  Verifier execution timed out after 1800.0 seconds
Crux__Claude-Opus-4.6           RewardFileNotFoundError         2  No reward file found at jobs/fa20096c/break-filter-js-from-html__aie7YwZ/verifier/reward.txt or jobs/fa20096c/break-filter-js-from-html__aie7YwZ/verifier/reward.json
Simple-Codex__GPT-5.3-Codex     DaytonaError                    2  Failed to get session command: bad request: no IP address found. Is the Sandbox started?
CodeBrain-1__GPT-5.3-Codex      VerifierTimeoutError            1  Verifier execution timed out after 900.0 seconds
Mux__GPT-5.2                    VerifierTimeoutError            1  Verifier execution timed out after 900.0 seconds
Terminus2__DeepSeek-V3.2        RewardFileNotFoundError         1  No reward file found at jobs/2026-02-07__07-47-43/break-filter-js-from-html__h4vtei9/verifier/reward.txt or jobs/2026-02-07__07-47-43/break-filter-js-from-html__h4vtei9/verifier/reward.json
Terminus2__GLM-4.7              DownloadVerifierDirError        1  Failed to download verifier directory from environment
Terminus2__Kimi-k2.5            AddTestsDirError                1  Failed to add tests directory to environment.
Terminus2__Minimax-m2.5         RewardFileNotFoundError         1  No reward file found at jobs/Terminus2__Minimax-m2.5/break-filter-js-from-html__QQ3KKz4/verifier/reward.txt or jobs/Terminus2__Minimax-m2.5/break-filter-js-from-html__QQ3KKz4/verifier/reward.json
Terminus2__Minimax-m2.5         VerifierTimeoutError            1  Verifier execution timed out after 1800.0 seconds

Terminus2 with GLM-4.7 hit 17 DaytonaError (502 Bad Gateway) — pure infrastructure failures. The break-filter-js-from-html task causes RewardFileNotFoundError across multiple submissions, suggesting a verifier bug for that specific task. VerifierTimeoutError hits several submissions, meaning the task completed but the grading process itself timed out.
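
Since these failures are not the agent's fault, one optional adjustment is to recompute each submission's average reward with those trials excluded. A sketch using the same exception_type list as the query above:

uv run --with sqlite-utils sqlite-utils query tb2.db "
select submission,
  count(*) as n_counted_trials,
  round(avg(reward), 4) as adj_avg_reward
from trials
where exception_type is null or exception_type not in (
  'DaytonaError', 'EnvironmentStartTimeoutError',
  'DownloadVerifierDirError', 'AddTestsDirError',
  'RewardFileNotFoundError', 'VerifierTimeoutError'
)
group by submission
order by adj_avg_reward desc
limit 10
" --table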

All 89 Tasks by Failure Rate

The database itself was built by downloading the per-trial result.json files from the harborframework/terminal-bench-2-leaderboard dataset on Hugging Face, then loading them into SQLite with the script below.

from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="harborframework/terminal-bench-2-leaderboard",
    repo_type="dataset",
    allow_patterns=["**/result.json"],
    local_dir="./tb2-results",
)

import json
import pathlib

import sqlite_utils

ROOT = pathlib.Path("tb2-results/submissions")
DB_PATH = "tb2.db"

db = sqlite_utils.Database(DB_PATH, recreate=True)

submissions = {}  # submission slug -> row
runs = {}  # (submission, run_id) -> row
tasks = {}  # task_name -> row
trials = []

for path in ROOT.rglob("result.json"):
    data = json.loads(path.read_text())
    if "task_name" not in data:
        continue

    # Extract submission and run_id from path:
    # .../submissions/terminal-bench/2.0/{submission}/{run_id}/{trial}/result.json
    parts = path.relative_to(ROOT).parts
    submission_slug = parts[2]  # e.g. "Ante__Gemini-3-Pro-Preview"
    run_date = parts[3]  # e.g. "2025-12-31__22-36-36"

    # --- submissions ---
    if submission_slug not in submissions:
        config_agent = (data.get("config") or {}).get("agent") or {}
        agent_info = data.get("agent_info") or {}
        model_info = agent_info.get("model_info") or {}
        submissions[submission_slug] = {
            "slug": submission_slug,
            "agent_name": agent_info.get("name"),
            "agent_version": agent_info.get("version"),
            "agent_import_path": config_agent.get("import_path"),
            "model_name": config_agent.get("model_name"),
            "model_provider": model_info.get("provider") if model_info else None,
        }

    # --- runs ---
    run_id = f"{submission_slug}-{run_date}"
    if run_id not in runs:
        config = data.get("config") or {}
        runs[run_id] = {
            "id": run_id,
            "submission": submission_slug,
            "run_date": run_date,
            "job_id": config.get("job_id"),
            "timeout_multiplier": config.get("timeout_multiplier"),
            "environment_type": (config.get("environment") or {}).get("type"),
        }

    # --- tasks ---
    task_name = data["task_name"]
    if task_name not in tasks:
        task_id = data.get("task_id") or {}
        tasks[task_name] = {
            "task_name": task_name,
            "source": data.get("source"),
            "git_url": task_id.get("git_url"),
            "git_commit_id": task_id.get("git_commit_id"),
            "task_checksum": data.get("task_checksum"),
        }

    # --- trial ---
    agent_result = data.get("agent_result") or {}
    exception_info = data.get("exception_info") or {}
    verifier_result = data.get("verifier_result") or {}
    rewards = verifier_result.get("rewards") or {}
    env_setup = data.get("environment_setup") or {}
    agent_setup = data.get("agent_setup") or {}
    agent_exec = data.get("agent_execution") or {}
    verifier = data.get("verifier") or {}

    # Determine status: passed / failed / errored
    reward = rewards.get("reward")
    has_exception = bool(exception_info.get("exception_type"))
    if reward == 1.0:
        status = "passed"
    elif has_exception:
        status = "errored"
    else:
        status = "failed"

    trials.append({
        "id": data["id"],
        "run_id": run_id,
        "submission": submission_slug,
        "task_name": task_name,
        "trial_name": data["trial_name"],
        # outcome
        "status": status,
        "reward": reward,
        # agent usage
        "n_input_tokens": agent_result.get("n_input_tokens"),
        "n_output_tokens": agent_result.get("n_output_tokens"),
        "n_cache_tokens": agent_result.get("n_cache_tokens"),
        "cost_usd": agent_result.get("cost_usd"),
        # exception
        "exception_type": exception_info.get("exception_type"),
        "exception_message": exception_info.get("exception_message"),
        "exception_occurred_at": exception_info.get("occurred_at"),
        # timestamps
        "started_at": data.get("started_at"),
        "finished_at": data.get("finished_at"),
        # phase timestamps
        "env_setup_started_at": env_setup.get("started_at"),
        "env_setup_finished_at": env_setup.get("finished_at"),
        "agent_setup_started_at": agent_setup.get("started_at"),
        "agent_setup_finished_at": agent_setup.get("finished_at"),
        "agent_exec_started_at": agent_exec.get("started_at"),
        "agent_exec_finished_at": agent_exec.get("finished_at"),
        "verifier_started_at": verifier.get("started_at"),
        "verifier_finished_at": verifier.get("finished_at"),
        # source file
        "file_path": str(path),
    })

# --- Write tables ---
db["submissions"].insert_all(submissions.values(), pk="slug")
print(f" submissions: {len(submissions)}")

db["runs"].insert_all(runs.values(), pk="id")
db["runs"].add_foreign_key("submission", "submissions", "slug", ignore=True)
print(f" runs: {len(runs)}")

db["tasks"].insert_all(tasks.values(), pk="task_name")
print(f" tasks: {len(tasks)}")

db["trials"].insert_all(trials, pk="id")
db["trials"].add_foreign_key("run_id", "runs", "id", ignore=True)
db["trials"].add_foreign_key("submission", "submissions", "slug", ignore=True)
db["trials"].add_foreign_key("task_name", "tasks", "task_name", ignore=True)
print(f" trials: {len(trials)}")

# --- Indexes ---
for col in ["submission", "task_name", "status", "reward", "run_id", "exception_type"]:
    db["trials"].create_index([col], if_not_exists=True)
db["trials"].create_index(["submission", "task_name"], if_not_exists=True)

# --- Views ---
db.execute("DROP VIEW IF EXISTS task_stats")
db.execute("""
CREATE VIEW task_stats AS
SELECT
    task_name,
    count(*) AS n_trials,
    sum(CASE WHEN status = 'passed' THEN 1 ELSE 0 END) AS n_passed,
    sum(CASE WHEN status = 'failed' THEN 1 ELSE 0 END) AS n_failed,
    sum(CASE WHEN status = 'errored' THEN 1 ELSE 0 END) AS n_errored,
    round(avg(reward), 4) AS avg_reward,
    round(1.0 - avg(reward), 4) AS failure_rate
FROM trials
GROUP BY task_name
""")

db.execute("DROP VIEW IF EXISTS submission_stats")
db.execute("""
CREATE VIEW submission_stats AS
SELECT
    submission,
    count(*) AS n_trials,
    sum(CASE WHEN status = 'passed' THEN 1 ELSE 0 END) AS n_passed,
    sum(CASE WHEN status = 'failed' THEN 1 ELSE 0 END) AS n_failed,
    sum(CASE WHEN status = 'errored' THEN 1 ELSE 0 END) AS n_errored,
    round(avg(reward), 4) AS avg_reward,
    round(sum(cost_usd), 2) AS total_cost_usd,
    round(avg(cost_usd), 4) AS avg_cost_usd,
    sum(n_input_tokens) AS total_input_tokens,
    sum(n_output_tokens) AS total_output_tokens
FROM trials
GROUP BY submission
""")

db.execute("DROP VIEW IF EXISTS submission_task_matrix")
db.execute("""
CREATE VIEW submission_task_matrix AS
SELECT
    submission,
    task_name,
    count(*) AS n_trials,
    round(avg(reward), 4) AS avg_reward,
    sum(CASE WHEN status = 'errored' THEN 1 ELSE 0 END) AS n_errored,
    group_concat(DISTINCT exception_type) AS exception_types
FROM trials
GROUP BY submission, task_name
""")

print(f"\nLoaded into {DB_PATH} with tables: submissions, runs, tasks, trials")
print("Views: task_stats, submission_stats, submission_task_matrix")