@greynewell
Created February 17, 2026 05:14
SWE-bench-fast: ARM vs x86 eval benchmark data (Feb 2026)
{
"benchmark": "end-to-end-eval",
"date": "2026-02-17",
"instances": 11,
"repos": 11,
"custom_arm64": {
"total_ms": 87281,
"resolved": 9,
"resolved_pct": 81.81818181818183
},
"epoch_x86_64": {
"total_ms": 551651,
"resolved": 10,
"resolved_pct": 90.9090909090909
},
"overall_speedup": 6.32,
"per_instance": [
{
"instance_id": "astropy__astropy-12907",
"custom_ms": 2653,
"epoch_ms": 9725,
"speedup": 3.67,
"custom_resolved": "RESOLVED_FULL",
"epoch_resolved": "RESOLVED_FULL",
"match": true
},
{
"instance_id": "django__django-13346",
"custom_ms": 2690,
"epoch_ms": 18886,
"speedup": 7.02,
"custom_resolved": "RESOLVED_FULL",
"epoch_resolved": "RESOLVED_FULL",
"match": true
},
{
"instance_id": "matplotlib__matplotlib-14623",
"custom_ms": 38018,
"epoch_ms": 265704,
"speedup": 6.99,
"custom_resolved": "RESOLVED_FULL",
"epoch_resolved": "RESOLVED_FULL",
"match": true
},
{
"instance_id": "mwaskom__seaborn-3069",
"custom_ms": 15372,
"epoch_ms": 101033,
"speedup": 6.57,
"custom_resolved": "RESOLVED_FULL",
"epoch_resolved": "RESOLVED_FULL",
"match": true
},
{
"instance_id": "pallets__flask-5014",
"custom_ms": 988,
"epoch_ms": 3875,
"speedup": 3.92,
"custom_resolved": "RESOLVED_FULL",
"epoch_resolved": "RESOLVED_FULL",
"match": true
},
{
"instance_id": "psf__requests-1142",
"custom_ms": 1120,
"epoch_ms": 4820,
"speedup": 4.3,
"custom_resolved": "RESOLVED_FULL",
"epoch_resolved": "RESOLVED_FULL",
"match": true
},
{
"instance_id": "pylint-dev__pylint-7277",
"custom_ms": 14022,
"epoch_ms": 76000,
"speedup": 5.42,
"custom_resolved": "RESOLVED_NO",
"epoch_resolved": "RESOLVED_NO",
"match": true
},
{
"instance_id": "pytest-dev__pytest-6197",
"custom_ms": 4655,
"epoch_ms": 28219,
"speedup": 6.06,
"custom_resolved": "RESOLVED_FULL",
"epoch_resolved": "RESOLVED_FULL",
"match": true
},
{
"instance_id": "scikit-learn__scikit-learn-25102",
"custom_ms": 2748,
"epoch_ms": 18209,
"speedup": 6.63,
"custom_resolved": "RESOLVED_FULL",
"epoch_resolved": "RESOLVED_FULL",
"match": true
},
{
"instance_id": "sphinx-doc__sphinx-10323",
"custom_ms": 3091,
"epoch_ms": 17154,
"speedup": 5.55,
"custom_resolved": "RESOLVED_NO",
"epoch_resolved": "RESOLVED_FULL",
"match": false
},
{
"instance_id": "sympy__sympy-11618",
"custom_ms": 1924,
"epoch_ms": 8026,
"speedup": 4.17,
"custom_resolved": "RESOLVED_FULL",
"epoch_resolved": "RESOLVED_FULL",
"match": true
}
]
}
{
"date": "2026-02-16T23:35:00-05:00",
"host": {
"model": "MacBook Pro (Mac15,7)",
"model_number": "MRW23LL/A",
"chip": "Apple M3 Pro",
"cores_total": 12,
"cores_performance": 6,
"cores_efficiency": 6,
"memory_gb": 36,
"disk_total_gb": 460,
"disk_filesystem": "APFS",
"os": "macOS 14.5 (23F79)",
"kernel": "Darwin 23.5.0 xnu-10063.121.3~5/RELEASE_ARM64_T6030",
"arch": "arm64"
},
"docker": {
"engine": "Colima",
"virtualization": "macOS Virtualization.Framework",
"mount_type": "virtiofs",
"vm_cpus": 10,
"vm_memory_gb": 28,
"vm_disk_gb": 150,
"vm_arch": "aarch64",
"docker_version": "29.2.1",
"docker_api": "1.53",
"server_version": "29.2.0",
"server_os_arch": "linux/arm64",
"buildx_version": "v0.31.1",
"containerd_version": "v2.2.1",
"runc_version": "v1.3.4",
"storage_driver": "overlayfs",
"cgroup_version": 2
}
}

SWE-bench-fast: Early Benchmark Notes

Date: February 16, 2026
Author: Grey Newell

The Problem

Running SWE-bench evaluations takes days on a single machine. The original goal was simple: cut that to hours.

What We Found

The biggest bottleneck wasn't the test harness — it was architecture emulation. Epoch AI's pre-built Docker images are x86_64-only. On an Apple Silicon Mac (or any ARM host), every test runs through QEMU user-space emulation. That's slow. Really slow.

So we tried building native arm64 containers instead.

The Approach

SWE-bench-fast is a from-scratch Go reimplementation of the SWE-bench eval harness. It builds and runs Docker containers, applies patches, executes test suites, and grades results — same as the upstream Python harness, but as a single static binary with native ARM image support.

Some repos can't go fully native — pinned dependencies or C libraries force x86_64 emulation regardless. Filtering the dataset to arm64-compatible repos would push the speedup even higher.

The upstream swebench harness (pip install swebench) can't build images on ARM at all. Its pipeline hardcodes x86_64 Miniconda URLs, so build_instance_images() fails immediately. The Epoch images at ghcr.io/epoch-research/ are the pre-built output of that pipeline — they pull fine on ARM, they just run under emulation.

Results (11 instances, one per repo)

Hardware: M3 Pro (12 cores, 36 GB) · Colima VM: 10 CPUs, 28 GB RAM

Metric             Custom arm64     Epoch x86_64 (emulated)
Total eval time    87.3s            551.7s
Resolved           9/11 (81.8%)     10/11 (90.9%)
Speedup            6.3x             baseline

  • The biggest gains were about 7x (django 7.02x, matplotlib 6.99x), with the compute-heavy scikit-learn and seaborn suites close behind at about 6.6x
  • Even lightweight repos (flask, requests) ran about 4x faster (3.92x and 4.30x)
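  • Container images were initially estimated at ~20% smaller with native arm64 builds; the per-image comparison posted in the comments below found mixed on-disk sizes and roughly 4% smaller compressed content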
  • Resolution rates were similar; the one discrepancy (sphinx) was a package version issue, not a harness bug
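The speedup figures above can be re-derived from the raw timings. The following is a minimal sketch, not part of the harness: the millisecond values are copied from the per_instance data in this gist, while the names `TIMINGS_MS`, `speedup`, and `overall_speedup` are made up for illustration.

```python
# (custom_ms, epoch_ms) per instance, copied from the per_instance data.
TIMINGS_MS = {
    "astropy__astropy-12907": (2653, 9725),
    "django__django-13346": (2690, 18886),
    "matplotlib__matplotlib-14623": (38018, 265704),
    "mwaskom__seaborn-3069": (15372, 101033),
    "pallets__flask-5014": (988, 3875),
    "psf__requests-1142": (1120, 4820),
    "pylint-dev__pylint-7277": (14022, 76000),
    "pytest-dev__pytest-6197": (4655, 28219),
    "scikit-learn__scikit-learn-25102": (2748, 18209),
    "sphinx-doc__sphinx-10323": (3091, 17154),
    "sympy__sympy-11618": (1924, 8026),
}


def speedup(custom_ms, epoch_ms):
    """Per-instance speedup: emulated time divided by native time."""
    return epoch_ms / custom_ms


def overall_speedup(timings):
    """Ratio of total emulated time to total native time (time-weighted)."""
    total_custom = sum(custom for custom, _ in timings.values())
    total_epoch = sum(epoch for _, epoch in timings.values())
    return total_epoch / total_custom


# Print instances from largest to smallest speedup, then the overall ratio.
ranked = sorted(TIMINGS_MS, key=lambda iid: speedup(*TIMINGS_MS[iid]), reverse=True)
for iid in ranked:
    print(f"{iid:36s} {speedup(*TIMINGS_MS[iid]):5.2f}x")
print(f"overall: {overall_speedup(TIMINGS_MS):.2f}x")
```

Note that the overall figure is a ratio of totals, so the long matplotlib run (265.7s of the 551.7s emulated total) weighs heavily on it. The unweighted mean of the eleven per-instance speedups is lower, at about 5.5x.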

What This Means

On ARM hardware (Apple Silicon, AWS Graviton, Mac mini fleets): native images eliminate the emulation tax for every instance that builds on arm64. If the 6.3x speedup measured here holds across the full set, a 500-instance SWE-bench Verified run that takes ~14 hours under emulation could finish in ~2-3 hours.
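The projection is simple division, sketched here under the assumption that the speedup from the 11-instance sample carries over to the full run. That assumption is untested until the 500-instance run is done. The function name `projected_hours` is hypothetical.

```python
def projected_hours(emulated_hours, speedup):
    """Estimate native wall-clock hours from an emulated baseline."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return emulated_hours / speedup


# Lowest, overall, and highest speedups observed in the 11-instance sample.
for label, factor in (("lowest", 3.67), ("overall", 6.32), ("highest", 7.02)):
    print(f"{label:8s} {factor:.2f}x -> {projected_hours(14, factor):.1f} h")
```

At the overall 6.32x this gives roughly 2.2 hours; at the lowest per-instance speedup (3.67x) it is closer to 3.8 hours, so the real figure depends on the mix of instances and on how many still run under emulation.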

On x86_64 hardware: the speedup would be modest — both engines run natively. The value is operational: smaller images, a 3-layer build cache, a push-and-delete workflow for disk management, and no Python environment to wrangle.

The many-multiples speedup is an architecture story, not a code optimization story. These pre-built ARM images are most useful for anyone running SWE-bench on ARM infrastructure.

Pre-built Images

An overnight job is building all SWE-bench Verified containers for arm64 and pushing to DockerHub: https://hub.docker.com/repository/docker/greynewell/swe-bench-fast/general

What's Next

  • Full 500-instance run with timing data
  • x86_64 host comparison to isolate architecture effects
  • Filter dataset to arm64-native-only subset for maximum speedup characterization

Raw benchmark data attached below.

{
"reports": [
{
"instance_id": "astropy__astropy-12907",
"resolved": "RESOLVED_FULL",
"f2p_total": 2,
"f2p_passed": 2,
"p2p_total": 13,
"p2p_passed": 13,
"patch_applied": true,
"duration_ms": 9725
},
{
"instance_id": "django__django-13346",
"resolved": "RESOLVED_FULL",
"f2p_total": 2,
"f2p_passed": 2,
"p2p_total": 65,
"p2p_passed": 65,
"patch_applied": true,
"duration_ms": 18886
},
{
"instance_id": "matplotlib__matplotlib-14623",
"resolved": "RESOLVED_FULL",
"f2p_total": 1,
"f2p_passed": 1,
"p2p_total": 400,
"p2p_passed": 400,
"patch_applied": true,
"duration_ms": 265704
},
{
"instance_id": "mwaskom__seaborn-3069",
"resolved": "RESOLVED_FULL",
"f2p_total": 2,
"f2p_passed": 2,
"p2p_total": 94,
"p2p_passed": 94,
"patch_applied": true,
"duration_ms": 101033
},
{
"instance_id": "pallets__flask-5014",
"resolved": "RESOLVED_FULL",
"f2p_total": 1,
"f2p_passed": 1,
"p2p_total": 59,
"p2p_passed": 59,
"patch_applied": true,
"duration_ms": 3875
},
{
"instance_id": "psf__requests-1142",
"resolved": "RESOLVED_FULL",
"f2p_total": 1,
"f2p_passed": 1,
"p2p_total": 5,
"p2p_passed": 5,
"patch_applied": true,
"duration_ms": 4820
},
{
"instance_id": "pylint-dev__pylint-7277",
"resolved": "RESOLVED_NO",
"f2p_total": 1,
"f2p_passed": 1,
"p2p_total": 122,
"p2p_passed": 121,
"patch_applied": true,
"duration_ms": 76000
},
{
"instance_id": "pytest-dev__pytest-6197",
"resolved": "RESOLVED_FULL",
"f2p_total": 2,
"f2p_passed": 2,
"p2p_total": 145,
"p2p_passed": 145,
"patch_applied": true,
"duration_ms": 28219
},
{
"instance_id": "scikit-learn__scikit-learn-25102",
"resolved": "RESOLVED_FULL",
"f2p_total": 2,
"f2p_passed": 2,
"p2p_total": 59,
"p2p_passed": 59,
"patch_applied": true,
"duration_ms": 18209
},
{
"instance_id": "sphinx-doc__sphinx-10323",
"resolved": "RESOLVED_FULL",
"f2p_total": 1,
"f2p_passed": 1,
"p2p_total": 40,
"p2p_passed": 40,
"patch_applied": true,
"duration_ms": 17154
},
{
"instance_id": "sympy__sympy-11618",
"resolved": "RESOLVED_FULL",
"f2p_total": 1,
"f2p_passed": 1,
"p2p_total": 4,
"p2p_passed": 4,
"patch_applied": true,
"duration_ms": 8026
}
],
"summary": {
"total": 11,
"resolved": 10,
"partial": 0,
"unresolved": 1,
"errors": 0,
"resolved_pct": 90.9090909090909,
"total_time_ms": 551651
}
}
@greynewell
Author

greynewell commented Mar 5, 2026

Image size comparison (ARM64 native vs Epoch x86_64)

Built all 11 benchmarked instances as native ARM64 images and compared against the Epoch x86_64 images.

Machine: MacBook Pro M3 Pro, Colima VM (10 CPUs, 28 GB RAM, linux/arm64), Docker 29.2.1, overlayfs.

On-disk size (docker images)

Instance                            ARM64 native   x86 Epoch   Difference
astropy__astropy-12907              3.41 GB        3.20 GB     +6.6%
django__django-13346                3.34 GB        3.44 GB     -2.9%
matplotlib__matplotlib-14623        5.95 GB        6.03 GB     -1.3%
mwaskom__seaborn-3069               3.98 GB        3.30 GB     +20.6%
pallets__flask-5014                 3.30 GB        2.97 GB     +11.1%
psf__requests-1142                  3.11 GB        2.67 GB     +16.5%
pylint-dev__pylint-7277             3.28 GB        2.89 GB     +13.5%
pytest-dev__pytest-6197             3.11 GB        2.71 GB     +14.8%
scikit-learn__scikit-learn-25102    4.20 GB        5.96 GB     -29.5%
sphinx-doc__sphinx-10323            3.36 GB        3.00 GB     +12.0%
sympy__sympy-11618                  3.20 GB        3.10 GB     +3.2%

Summary

On-disk sizes are mixed. Three images are smaller on ARM64: scikit-learn (29.5%), django (2.9%), and matplotlib (1.3%). The other eight are 3-21% larger due to differences in base image layers. By compressed content size, ARM64 images average about 4% smaller.
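The Difference column can be reproduced from the two size columns. This is a small sketch using the sizes from the table above; the names `SIZES_GB` and `pct_diff` are illustrative only.

```python
# (arm64_gb, x86_gb) on-disk sizes as reported by `docker images` above.
SIZES_GB = {
    "astropy__astropy-12907": (3.41, 3.20),
    "django__django-13346": (3.34, 3.44),
    "matplotlib__matplotlib-14623": (5.95, 6.03),
    "mwaskom__seaborn-3069": (3.98, 3.30),
    "pallets__flask-5014": (3.30, 2.97),
    "psf__requests-1142": (3.11, 2.67),
    "pylint-dev__pylint-7277": (3.28, 2.89),
    "pytest-dev__pytest-6197": (3.11, 2.71),
    "scikit-learn__scikit-learn-25102": (4.20, 5.96),
    "sphinx-doc__sphinx-10323": (3.36, 3.00),
    "sympy__sympy-11618": (3.20, 3.10),
}


def pct_diff(arm64_gb, x86_gb):
    """Percent change of the ARM64 image relative to the x86 image."""
    return (arm64_gb - x86_gb) / x86_gb * 100


# Split instances by whether the ARM64 image is smaller or larger on disk.
smaller = [iid for iid, (arm, x86) in SIZES_GB.items() if arm < x86]
larger = [iid for iid, (arm, x86) in SIZES_GB.items() if arm > x86]

for iid, (arm, x86) in SIZES_GB.items():
    print(f"{iid:36s} {pct_diff(arm, x86):+6.1f}%")
print(f"smaller on ARM64: {len(smaller)}, larger on ARM64: {len(larger)}")
```

These are per-image on-disk figures; images that share base layers take less combined disk space than the individual sizes suggest, so summing the columns would overstate real usage.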

@greynewell
Author

greynewell commented Mar 5, 2026

Pre-built ARM64 images are being pushed to Docker Hub: https://hub.docker.com/repository/docker/greynewell/swe-bench-fast/general

x86-only instances (496 of 2,294) use the Epoch x86 images from ghcr.io/epoch-research.
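One way to picture the split is a lookup that picks an image source per instance. This is only a sketch of the idea, not how swe-bench-fast actually does it: the function `image_source`, its return shape, and the example instance IDs are hypothetical, and image tag naming is deliberately left out because it is not described here. The two repository locations are the ones named in this thread.

```python
ARM64_REPO = "greynewell/swe-bench-fast"   # Docker Hub repo linked above
EPOCH_REGISTRY = "ghcr.io/epoch-research"  # Epoch registry named above


def image_source(instance_id, x86_only_ids):
    """Choose where to pull an instance image from and which platform it targets.

    x86_only_ids is assumed to be a set of instance IDs that cannot be
    built natively for arm64.
    """
    if instance_id in x86_only_ids:
        # Falls back to the Epoch image, which runs under emulation on ARM hosts.
        return {"source": EPOCH_REGISTRY, "platform": "linux/amd64", "emulated_on_arm": True}
    return {"source": ARM64_REPO, "platform": "linux/arm64", "emulated_on_arm": False}


# Hypothetical IDs, used only to exercise both branches.
x86_only = {"example__repo-1"}
print(image_source("example__repo-1", x86_only))
print(image_source("example__repo-2", x86_only))

# Share of the dataset that stays on x86 images, from the counts above.
print(f"{496 / 2294:.1%} of instances remain x86-only")
```

With 496 of 2,294 instances on x86 images, about 21.6% of the full dataset would still pay the emulation cost on ARM hosts.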

@greynewell
Author

greynewell commented Mar 5, 2026

Sphinx result mismatch resolved (11/11 now match)

The original benchmark showed sphinx-doc__sphinx-10323 as RESOLVED_NO on the custom ARM64 images while the Epoch x86 images got RESOLVED_FULL. Root cause: Pygments==2.19.2 was being pulled in on ARM64 (unpinned transitive dependency), while the Epoch x86 images had Pygments==2.18.0. Pygments 2.19 changed the HTML output for syntax-highlighted line number spans, breaking the test_literal_include_linenos and test_linenothreshold pass-to-pass assertions.

Fix: pin Pygments==2.18.0 in all sphinx version specs to match the Epoch images. The fail-to-pass test (test_LiteralIncludeReader_dedent_and_append_and_prepend) passed on both Pygments versions. Only the p2p tests broke.
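Why a passing fail-to-pass test still produced RESOLVED_NO comes down to the grading rule. The sketch below reflects my understanding of the upstream SWE-bench rule (full only when every f2p and every p2p test passes); it is not taken from the swe-bench-fast source, and the function name and the `RESOLVED_PARTIAL` label are assumptions.

```python
def resolution_status(f2p_passed, f2p_total, p2p_passed, p2p_total):
    """Grade an instance from fail-to-pass (f2p) and pass-to-pass (p2p) counts."""
    # An empty test group is treated as fully passing (an assumption).
    f2p_rate = f2p_passed / f2p_total if f2p_total else 1.0
    p2p_rate = p2p_passed / p2p_total if p2p_total else 1.0

    if f2p_rate == 1 and p2p_rate == 1:
        return "RESOLVED_FULL"
    if 0 < f2p_rate < 1 and p2p_rate == 1:
        return "RESOLVED_PARTIAL"
    # Any p2p regression, or no f2p progress at all, counts as unresolved.
    return "RESOLVED_NO"


# sphinx on the Epoch image: 1/1 f2p and 40/40 p2p.
print(resolution_status(1, 1, 40, 40))
# sphinx on ARM64 before the pin: f2p passes, two p2p tests break.
print(resolution_status(1, 1, 38, 40))
# pylint in the Epoch report: 1/1 f2p but 121/122 p2p.
print(resolution_status(1, 1, 121, 122))
```

Under this rule a single broken pass-to-pass assertion is enough to flip an instance to RESOLVED_NO, which matches both the sphinx mismatch and the pylint result in the report data.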

Updated result: 11/11 instances now match between the custom ARM64 and Epoch x86 harnesses.

Commit: greynewell/swe-bench-fast@615239f
