Date: February 16, 2026 · Author: Grey Newell
Running SWE-bench evaluations takes days on a single machine. The original goal was simple: cut that to hours.
The biggest bottleneck wasn't the test harness — it was architecture emulation. Epoch AI's pre-built Docker images are x86_64-only. On an Apple Silicon Mac (or any ARM host), every test runs through QEMU user-space emulation. That's slow. Really slow.
So we tried building native arm64 containers instead.
SWE-bench-fast is a from-scratch Go reimplementation of the SWE-bench eval harness. It builds and runs Docker containers, applies patches, executes test suites, and grades results — same as the upstream Python harness, but as a single static binary with native ARM image support.
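The per-instance flow can be sketched in Go. The struct fields and function names below are illustrative, not swe-bench-fast's actual API, but the grading rule is the standard SWE-bench one: an instance is resolved when every FAIL_TO_PASS test flips to passing and no PASS_TO_PASS test regresses.

```go
package main

import "fmt"

// Instance holds the two test lists SWE-bench grades against.
// (Field names here are illustrative, not the upstream schema.)
type Instance struct {
	ID         string
	FailToPass []string // tests that must pass after the patch is applied
	PassToPass []string // tests that must keep passing
}

// resolved applies the SWE-bench grading rule to a map of
// test name → passed.
func resolved(inst Instance, results map[string]bool) bool {
	for _, t := range inst.FailToPass {
		if !results[t] {
			return false // the patch didn't fix the target behavior
		}
	}
	for _, t := range inst.PassToPass {
		if !results[t] {
			return false // the patch broke something that used to work
		}
	}
	return true
}

func main() {
	inst := Instance{
		ID:         "django__django-11099",
		FailToPass: []string{"test_username_validator"},
		PassToPass: []string{"test_existing_behavior"},
	}
	fmt.Println(resolved(inst, map[string]bool{
		"test_username_validator": true,
		"test_existing_behavior":  true,
	})) // true
}
```

Everything around this core — building the container, applying the patch, running the suite — is plumbing; the grading itself is a pure function, which is what makes a reimplementation tractable.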
Some repos can't go fully native — pinned dependencies or C libraries force x86_64 emulation regardless. Filtering the dataset to arm64-compatible repos would push the speedup even higher.
The upstream swebench harness (`pip install swebench`) can't build images on ARM at all. Its pipeline hardcodes x86_64 Miniconda URLs, so `build_instance_images()` fails immediately. The Epoch images at `ghcr.io/epoch-research/` are the pre-built output of that pipeline — they pull fine on ARM, they just run under emulation.
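The fix on the image side is simply to build or request images for the host's own platform. A minimal sketch of how a Go harness could derive the platform string to pass to Docker, using the standard `runtime` constants (the helper name is hypothetical, not part of swe-bench-fast's API):

```go
package main

import (
	"fmt"
	"runtime"
)

// dockerPlatform maps the Go architecture name of the host to the
// --platform value Docker expects, so builds and pulls target native
// images instead of falling back to emulated x86_64.
func dockerPlatform() string {
	switch runtime.GOARCH {
	case "arm64":
		return "linux/arm64" // native on Apple Silicon / Graviton
	case "amd64":
		return "linux/amd64"
	default:
		return "linux/" + runtime.GOARCH
	}
}

func main() {
	fmt.Println(dockerPlatform())
}
```

On an Apple Silicon host this yields `linux/arm64`; the emulation tax only appears when the image's platform and this value disagree.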
Hardware: M3 Pro (12 cores, 36 GB) · Colima VM: 10 CPUs, 28 GB RAM
| | Custom arm64 | Epoch x86_64 (emulated) |
|---|---|---|
| Total eval time | 87.3s | 551.7s |
| Resolved | 9/11 (81.8%) | 10/11 (90.9%) |
| Speedup | 6.3x | baseline |
- Compute-heavy repos (scikit-learn, matplotlib) saw the biggest gains — up to 7x
- Even lightweight repos (flask, requests) ran 4x faster
- Container images were ~20% smaller with native arm64 builds
- Resolution rates were similar; the one discrepancy (sphinx) was a package version issue, not a harness bug
On ARM hardware (Apple Silicon, AWS Graviton, Mac mini fleets): native images eliminate the emulation tax entirely. A 500-instance SWE-bench Verified run that takes ~14 hours under emulation could finish in ~2-3 hours.
On x86_64 hardware: the speedup would be modest — both engines run natively. The value is operational: smaller images, a 3-layer build cache, a push-and-delete workflow for disk management, and no Python environment to wrangle.
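The push-and-delete workflow is just a publish followed by a local removal, repeated per image so the full Verified set never has to fit on disk at once. A sketch that assembles the two Docker commands (the repository and tag are illustrative):

```go
package main

import "fmt"

// pushAndDelete returns the shell commands for one round of the
// push-and-delete workflow: publish an image to the registry, then
// remove the local copy to reclaim disk before building the next one.
func pushAndDelete(image string) []string {
	return []string{
		"docker push " + image,
		"docker rmi " + image,
	}
}

func main() {
	// Tag name is hypothetical, shown only to illustrate the shape.
	for _, cmd := range pushAndDelete("greynewell/swe-bench-fast:django__django-11099") {
		fmt.Println(cmd)
	}
}
```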
The ~6x speedup is an architecture story, not a code-optimization story. These pre-built ARM images are most useful for anyone running SWE-bench on ARM infrastructure.
An overnight job is building all SWE-bench Verified containers for arm64 and pushing them to Docker Hub: https://hub.docker.com/repository/docker/greynewell/swe-bench-fast/general

Next up:
- Full 500-instance run with timing data
- x86_64 host comparison to isolate architecture effects
- Filter dataset to arm64-native-only subset for maximum speedup characterization
Raw benchmark data attached below.
Image size comparison (ARM64 native vs Epoch x86_64)
Built all 11 benchmarked instances as native ARM64 images and compared against the Epoch x86_64 images.
Machine: MacBook Pro M3 Pro, Colima VM (10 CPUs, 28 GB RAM, linux/arm64), Docker 29.2.1, overlayfs.
On-disk size (docker images)
Summary
On-disk sizes are mixed. scikit-learn is 29.5% smaller on ARM64, django 2.9% smaller. Most others are 3-20% larger due to differences in base image layers. By compressed content size, ARM64 images average about 4% smaller.
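Compressed content size comes from the image manifest, which records each layer's compressed size; summing the layer sizes reproduces the number used in the comparison. A sketch against a truncated, illustrative manifest (not a real SWE-bench image's):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// A Docker/OCI image manifest lists the compressed size of each layer.
// This sample is illustrative and heavily truncated.
const manifest = `{
  "config": {"size": 7023},
  "layers": [
    {"size": 32654},
    {"size": 16724},
    {"size": 73109}
  ]
}`

type descriptor struct {
	Size int64 `json:"size"`
}

type imageManifest struct {
	Config descriptor   `json:"config"`
	Layers []descriptor `json:"layers"`
}

// compressedSize sums the compressed layer sizes in a manifest,
// giving the registry-side size of the image.
func compressedSize(raw []byte) (int64, error) {
	var m imageManifest
	if err := json.Unmarshal(raw, &m); err != nil {
		return 0, err
	}
	var total int64
	for _, l := range m.Layers {
		total += l.Size
	}
	return total, nil
}

func main() {
	n, err := compressedSize([]byte(manifest))
	if err != nil {
		panic(err)
	}
	fmt.Println(n) // 122487
}
```

On-disk size (`docker images`) measures the uncompressed, extracted layers instead, which is why the two metrics can disagree on which architecture's images are smaller.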