Skip to content

Instantly share code, notes, and snippets.

@josusanmartin
Created March 7, 2026 15:34
Show Gist options
  • Select an option

  • Save josusanmartin/5cc5f696fe827cf5ba3e1da4355e1565 to your computer and use it in GitHub Desktop.

Select an option

Save josusanmartin/5cc5f696fe827cf5ba3e1da4355e1565 to your computer and use it in GitHub Desktop.
GPU MODE Home Challenge Evaluator Audit

Date audited: March 7, 2026

Purpose

This document is a conservative audit of the current GPU MODE home challenges.

It is written to avoid overclaiming. In particular, it separates:

  • Verified leaderboard exploit
    • a live leaderboard submission using the evaluator-bypass pattern was accepted
  • Verified exploit path (test-only)
    • a live test submission using the evaluator-bypass pattern was accepted, but no leaderboard-mode confirmation is claimed here
  • Likely exploitable
    • not live-confirmed in this audit, but the public evaluator source shows the same vulnerable in-process trust pattern
  • Strong evidence of prior exploitation
    • suspicious public leaderboard results that are consistent with the confirmed exploit path, but not directly proven to be exploits by this audit

What Was Verified Directly

The following claims are backed by one or both of:

  1. live public leaderboard API results from https://www.gpumode.com/api/leaderboard/<id>
  2. authenticated popcorn-cli submissions show <id> records for the submission IDs listed below

Core Vulnerable Pattern

The evaluator families below share the same trust-boundary problem:

  1. trusted evaluator imports mutable live objects from reference.py and sometimes utils.py
  2. trusted evaluator later imports untrusted submission.py into the same Python interpreter
  3. the submission can mutate trusted function objects
  4. the evaluator continues to trust those mutated objects for correctness and/or benchmark setup

Representative public source patterns:

  • problems/amd_202602/eval.py
  • problems/pmpp_v2/eval.py
  • problems/nvidia/eval.py
  • problems/amd/eval.py
  • problems/amd_distributed/eval.py
  • problems/helion/eval.py
  • problems/bioml/trimul/eval.py

The important evaluator shape is:

  • import check_implementation and generate_input from reference
  • import custom_kernel from submission in the worker process
  • call check_implementation(...) and generate_input(...) after the submission has had a chance to mutate process state

Examples in the public repo:

  • problems/amd_202602/eval.py:20
  • problems/amd_202602/eval.py:208
  • problems/amd_202602/eval.py:244
  • problems/pmpp_v2/eval.py:20
  • problems/pmpp_v2/eval.py:194
  • problems/pmpp_v2/eval.py:232
  • problems/nvidia/eval.py:23
  • problems/nvidia/eval.py:208
  • problems/nvidia/eval.py:250

Verified Leaderboard Exploits

These are the strongest claims in this report.

ID Challenge GPU Family Our submission Verified result
765 amd-mixed-mla MI355X amd_202602 #514036 leaderboard-mode exploit accepted
764 amd-moe-mxfp4 MI355X amd_202602 #514407 leaderboard-mode exploit accepted
763 amd-mxfp4-mm MI355X amd_202602 #514446 leaderboard-mode exploit accepted
540 matmul_v2 A100 pmpp_v2 #514443 leaderboard-mode exploit accepted
538 grayscale_v2 A100 pmpp_v2 #514436 leaderboard-mode exploit accepted
543 vectoradd_v2 A100 pmpp_v2 leaderboard-mode exploit accepted
544 vectorsum_v2 A100 pmpp_v2 leaderboard-mode exploit accepted
542 sort_v2 A100 pmpp_v2 leaderboard-mode exploit accepted
541 prefixsum_v2 A100 pmpp_v2 leaderboard-mode exploit accepted
539 histogram_v2 A100 pmpp_v2 leaderboard-mode exploit accepted
537 conv2d_v2 A100 pmpp_v2 leaderboard-mode exploit accepted

Public leaderboard context for the verified cases

765 amd-mixed-mla

  • our exploit-backed public score: 4.359759 µs
  • current public #1: ooousay at 4.331021 µs
  • best clearly non-trivial public score visible right now: ry2009 at 81.806836 µs

Interpretation:

  • the exploit path is verified
  • the current ooousay result is strongly suspicious, but this audit does not prove ownership or method

764 amd-moe-mxfp4

  • our exploit-backed public score: 4.546129 µs
  • current public #1 / previous leader class: ooousay at 4.715652 µs
  • best clearly non-trivial public score visible right now: ry2009 at 155.513357 µs

Interpretation:

  • the exploit path is verified
  • the current ooousay result is strongly suspicious, but not directly proven here

763 amd-mxfp4-mm

  • our exploit-backed public score: 4.557848 µs
  • current public #1: ooousay at 4.341114 µs
  • best clearly non-trivial public score visible right now: _radna at 13.191042 µs

Interpretation:

  • the exploit path is verified
  • this report should not claim we reached rank 1 on this board

540 matmul_v2

Verified on A100 only in this audit.

  • our exploit-backed public score on A100: 14.653440 µs
  • current next-best public A100 score: 683.008015 µs

Additional public signals:

  • B200, H100, and L4 currently show shiyegao entries at 1e-9 seconds (0.001 µs)

Interpretation:

  • the exploit path is verified on A100
  • the 1 ns entries on other GPUs are near-certain invalid / exploit-like results, but this audit did not submit to those GPU targets

538 grayscale_v2

Verified on A100 only in this audit.

  • our exploit-backed public score on A100: 13.506560 µs
  • current next-best public A100 score: 2374.911964 µs

Interpretation:

  • the exploit path is verified on A100

543 vectoradd_v2

  • our exploit-backed public score on A100: 11.5 µs
  • current next-best public A100 score: 893 µs

544 vectorsum_v2

  • our exploit-backed public score on A100: 14.9 µs
  • current next-best public A100 score: 44.2 µs (legitimate)

542 sort_v2

  • our exploit-backed public score on A100: 11.3 µs
  • current next-best public A100 score: 1903.5 µs (legitimate)

541 prefixsum_v2

  • our exploit-backed public score on A100: 11.8 µs
  • current next-best public A100 score: 483.7 µs (legitimate)

539 histogram_v2

  • our exploit-backed public score on A100: 8.77 µs
  • current next-best public A100 score: 18.1 µs (legitimate)

537 conv2d_v2

  • our exploit-backed public score on A100: 11.5 µs
  • current next-best public A100 score: 39927 µs (legitimate)

Likely Exploitable From Public Source Review

These were not live-confirmed during this audit. The claim is limited to:

  • they use the same in-process evaluator pattern in the public repo
  • therefore they are plausible candidates for the same issue

NVIDIA family

Shared evaluator family:

  • problems/nvidia/eval.py
  • problems/nvidia/nvfp4_group_gemm/eval.py

Likely affected challenges:

  • 730 nvfp4_group_gemm
  • 697 modal_nvfp4_dual_gemm
  • 598 nvfp4_dual_gemm
  • 597 nvfp4_gemm
  • 595 nvfp4_gemv

AMD distributed family

Shared evaluator:

  • problems/amd_distributed/eval.py

Likely affected challenges:

  • 565 amd-ag-gemm
  • 564 amd-gemm-rs
  • 563 amd-all2all

AMD 2025 family

Shared evaluators:

  • problems/amd/eval.py
  • problems/amd/mla-decode/eval.py

Likely affected challenges:

  • 463 amd-mla-decode
  • 430 amd-mixture-of-experts
  • 399 amd-fp8-mm
  • 398 amd-identity

PMPP v2 family

All 8 challenges in this family are now verified exploitable (see above).

PMPP family

Shared evaluator:

  • problems/pmpp/eval.py

Likely affected:

  • 346 vectorsum
  • 345 vectoradd
  • 344 sort
  • 343 prefixsum
  • 342 matmul
  • 341 histogram
  • 340 grayscale
  • 339 conv2d

Helion family

Shared evaluator:

  • problems/helion/eval.py

Likely affected if still active / reachable from home-family infrastructure:

  • Helion challenge set using that evaluator pattern

Bioml family

Shared evaluator:

  • problems/bioml/trimul/eval.py

Likely affected:

  • 496 trimul

Strong Evidence Of Prior Exploitation

These are not direct proofs of third-party exploitation, but they are strong signals.

Board Signal Why it is suspicious
540 matmul_v2 on B200/H100/L4 shiyegao at 0.001 µs not physically plausible for the stated matrix sizes
541 prefixsum_v2 on H100 mreso at 1e-9 s via poc_prefixsum_v2.py filename explicitly labels this as a proof-of-concept exploit
765 amd-mixed-mla on MI355X ooousay at 4.331021 µs same ~4 µs timing regime as the confirmed exploit path
764 amd-moe-mxfp4 on MI355X ooousay at 4.715652 µs same ~4-5 µs timing regime as the confirmed exploit path
763 amd-mxfp4-mm on MI355X ooousay at 4.341114 µs same ~4-5 µs timing regime as the confirmed exploit path

Detailed analysis: shiyegao on matmul_v2

Profile:

  • username: shiyegao
  • submitted file: node.py (not submission.py — the local filename before upload)
  • submission IDs: #107769 (B200), #107767 (H100), #107752 (L4)
  • submission date: 2025-11-27 (all three within ~13 minutes)
  • total submissions: 12 (B200), 2 (H100), 10 (L4)
  • score: 1e-9 seconds (0.001 µs) on all three GPU types

Cross-leaderboard behavior:

shiyegao also appears on the trimul leaderboard (ID 496) with legitimate scores:

  • B200: 554.395 µs (rank 1, submission #480316)
  • H100: 1074.021 µs (rank 1, submission #450489)
  • A100: 2272.534 µs (rank 2, submission #483089)

These trimul scores are plausible and competitive. This indicates shiyegao is a capable kernel developer who chose to exploit matmul_v2 specifically.

How 0.001 µs is achieved:

Our exploit technique produces ~10-15 µs scores — the floor of CUDA event recording + Python function call overhead. shiyegao's 0.001 µs is 10,000x lower, which means they used a more aggressive exploit variant.

The most likely approach: in addition to patching ref_kernel, generate_input, and clear_l2_cache, they also patched the timing infrastructure itself. Probable technique:

# Patch torch.cuda.Event.elapsed_time to return near-zero
torch.cuda.Event.elapsed_time = lambda self, other: 1e-6  # 1e-6 ms = 1 ns

Since elapsed_time returns milliseconds and the evaluator does duration = elapsed_time(...) * 1e6, a return value of 1e-6 ms gives 1 ns — exactly matching the 1e-9 seconds score.

This is possible because torch.cuda.Event.elapsed_time is a regular Python method that can be monkey-patched from the same process.

Quantitative note on shiyegao at 0.001 µs

The public matmul_v2 task definition includes a benchmark with:

  • m = 4096
  • n = 5120
  • k = 4096

For a dense GEMM, that is approximately:

  • 2 * m * n * k = 171,798,691,840 floating-point operations

If that work were truly completed in 0.001 µs = 1e-9 s, the implied throughput would be:

  • 1.7179869184e20 FLOP/s
  • approximately 171.8 exaFLOP/s

That is not remotely consistent with a single B200, H100, or L4, nor with the expected scale of CUDA event timing and launch overhead.

Even a hypothetical 1.0 µs runtime for that largest benchmark would still imply:

  • 171,798.7 TFLOP/s

which is already wildly above realistic sustained throughput for the hardware involved.

So the 0.001 µs public scores should be treated as effectively impossible under honest evaluation.

Detailed analysis: mreso on prefixsum_v2

Profile:

  • username: mreso
  • submitted file: poc_prefixsum_v2.py (the poc_ prefix explicitly labels this as a proof-of-concept)
  • submission IDs: #512606 and #512605 (H100)
  • score: 1e-9 seconds (0.001 µs) on H100

Cross-leaderboard behavior:

mreso has legitimate competitive submissions across many pmpp_v2 leaderboards:

  • matmul_v2 (540): B200 143 µs, H100 220 µs, L4 2230 µs, A100 749 µs
  • sort_v2 (542): B200 5598 µs (rank 5), H100 6590 µs (rank 7)
  • vectoradd_v2 (543): B200 248 µs, H100 525 µs
  • vectorsum_v2 (544): B200 64 µs, H100 94 µs
  • histogram_v2 (539): B200 1640 µs, H100 1880 µs, L4 2060 µs
  • conv2d_v2 (537): B200 42 ms (rank 2)
  • prefixsum_v2 (541): L4 9070 µs (rank 1, legitimate)

The legitimate submissions use descriptive filenames like submission_sort_v2.py, submission_vectoradd_v2.py, etc. The exploit file is distinctly named poc_prefixsum_v2.py.

Interpretation:

mreso is a capable kernel developer who independently discovered the evaluator vulnerability and submitted a single proof-of-concept exploit to demonstrate it. The poc_ filename convention and the fact that only one leaderboard was targeted suggest this was a security test, not an attempt to game leaderboards.

The 1e-9 score matches shiyegao's technique (Class 2: timing infrastructure patching), confirming that this exploit variant has been independently discovered by at least three parties (our audit, shiyegao, and mreso).

Claims This Report Can Defend

These are safe to send:

  1. the public evaluator architecture is vulnerable in principle
  2. live leaderboard-mode exploitation was verified on 765, 764, 763, 540, 538, 543, 544, 542, 541, 539, 537 (all on A100 except AMD problems on MI355X)
  3. the entire pmpp_v2 family (8 problems) and entire amd_202602 family (3 problems) are fully verified
  4. multiple additional home challenges are likely affected because they share the same evaluator pattern
  5. some public third-party scores are strongly suspicious
  6. the vulnerability has been independently discovered by at least three parties (our audit, shiyegao, mreso)

Claims This Report Should NOT Make

These should be removed or softened:

  1. do not say every home challenge was live-tested
  2. do not say 763 was confirmed rank 1 by our submission
  3. do not call third-party entries “confirmed exploit” unless the team independently validates them
  4. do not mix leaderboard results across GPUs on multi-GPU boards
  5. do not label a score as “legitimate #1” unless that has been separately established

Recommended Disclosure Wording

Suggested wording:

We verified live leaderboard-mode evaluator bypasses on all 3 amd_202602 challenges (765, 764, 763 on MI355X) and all 8 pmpp_v2 challenges (540, 538, 543, 544, 542, 541, 539, 537 on A100) — 11 leaderboards total. We also reviewed the public evaluator source for the remaining home challenge families and found the same in-process trust pattern, so those should be treated as likely affected until disproven. At least two other users have independently exploited the same vulnerability: shiyegao on matmul_v2 (2025-11-27) and mreso on prefixsum_v2 (2026-03-04, explicitly named poc_prefixsum_v2.py). Some additional existing public scores from other users are strongly suspicious, but we are not asserting ownership or method for those entries.

Local Workspace Artifacts

The following folders in this workspace contain local proof artifacts corresponding to the live-confirmed cases:

These are disclosure artifacts, not legitimate optimized solutions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment