We started with a simple question while training a small ModernBERT-class model on the JSC JUPITER supercomputer:
The pure-torch training pipeline is faster than the Transformer Engine one. Why?
By the time we surfaced, we had measured the GPU power policy, the LPDDR5X read/write asymmetry, the C2C interconnect's behavior, and a per-SKU copy-engine ceiling that NVIDIA does not document. This is the writeup of how we got there.
System under test: NVIDIA GH200 Grace Hopper Superchip on the JUPITER Booster, compared against the login nodes (a different GH200 SKU). Unless noted otherwise, all measurements are BF16 and run on a single GH200 in a single-process configuration.