Ali Naeimi alint77

## gh200_investigation.md

      
        
          
            
              
            
            
Last active May 22, 2026 10:43

A weekend down the JUPITER GH200 rabbit hole: GEMM, power, and the mysterious D2H ceiling
              
          
        
      
        

      
      
# A weekend down the JUPITER GH200 rabbit hole

We started with a simple question while training a small ModernBERT-class model on the JSC JUPITER supercomputer:

> The pure-torch training pipeline is faster than the Transformer Engine one. Why?

By the time we surfaced, we had measured the GPU power policy, the LPDDR5X read/write asymmetry, the C2C interconnect's behavior, and a per-SKU copy-engine ceiling that NVIDIA does not document. This is the writeup of how we got there.

System under test: an NVIDIA GH200 Grace Hopper Superchip on the JUPITER Booster, compared against the login nodes (a different GH200 SKU). Unless noted otherwise, all measurements use BF16 and run as a single process on a single GH200.