Relevant PRs
- cuDNN backward
- Flash Attention backward (this works)
Summary
We are implementing variable-length attention with the cuDNN backend, and the outputs of our API and of SDPA with packing do not match after the backward pass.
In the provided repro, we include the definition of `_varlen_attn()`, our private custom op that calls into `_cudnn_attention_forward()`. We also define `_backward()`, the backward pass registered with autograd, which calls `_cudnn_attention_backward()`.
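For context, here is a minimal sketch of that structure, assuming the `torch.library.custom_op` registration API. This is not the actual repro: the `mylib::_varlen_attn` name is illustrative, the varlen arguments (packed layout, cu_seqlens) are omitted, and the cuDNN calls are stubbed with plain SDPA so the sketch is self-contained.

```python
import torch
import torch.nn.functional as F


@torch.library.custom_op("mylib::_varlen_attn", mutates_args=())
def _varlen_attn(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # In the real repro this dispatches to _cudnn_attention_forward();
    # plain SDPA stands in here so the sketch runs without the cuDNN backend.
    return F.scaled_dot_product_attention(q, k, v)


@_varlen_attn.register_fake
def _(q, k, v):
    # Fake (meta) kernel: the output shape matches q for standard attention.
    return torch.empty_like(q)


def _setup_context(ctx, inputs, output):
    q, k, v = inputs
    ctx.save_for_backward(q, k, v)


def _backward(ctx, grad_out):
    # In the real repro this calls _cudnn_attention_backward(); here the
    # gradients are recovered by re-running SDPA under autograd instead.
    q, k, v = (t.detach().requires_grad_() for t in ctx.saved_tensors)
    with torch.enable_grad():
        out = F.scaled_dot_product_attention(q, k, v)
    return torch.autograd.grad(out, (q, k, v), grad_out)


_varlen_attn.register_autograd(_backward, setup_context=_setup_context)
```

Calling `_varlen_attn(q, k, v)` on tensors with `requires_grad=True` and backpropagating then exercises the registered `_backward()`, which is the path where the repro's gradients diverge from SDPA with packing.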