Ali Naeimi alint77

alint77 / gh200_investigation.md
Last active May 22, 2026 10:43
A weekend down the JUPITER GH200 rabbit hole — GEMM, power, and the mysterious D2H ceiling

A weekend down the JUPITER GH200 rabbit hole

We started with a simple question while training a small ModernBERT-class model on the JSC JUPITER supercomputer:

The pure-torch training pipeline is faster than the Transformer Engine one. Why?

By the time we surfaced, we had measured the GPU power policy, the LPDDR5X read/write asymmetry, the C2C interconnect's behavior, and a per-SKU copy-engine ceiling that NVIDIA does not document. This is the writeup of how we got there.

System under test: NVIDIA GH200 Grace Hopper Superchip on the JUPITER Booster, compared against the login nodes (a different GH200 SKU). Unless noted otherwise, all measurements use BF16 and run as a single process on a single GH200.