- Sparsity in transformers: A systematic literature review - ScienceDirect
- Accelerating Inference with Sparsity Using the NVIDIA Ampere Architecture and NVIDIA TensorRT | NVIDIA Technical Blog (Updated: June 12, 2023)
- Large Transformer Model Inference Optimization | Lil'Log (Updated: January 10, 2023)
- The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs
- [2504.17768] The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs
- Sparse Transformer Explained | Papers With Code
- Boost Vision Transformer With GPU-Friendly Sparsity and ...
- Beyond 2:4: Exploring V:N:M Sparsity for Efficient Transformer Inference on GPUs | OpenReview
- Circuits Updates - January 2025
- [2109.12188] Predicting Attention Sparsity in Transformers
- [2503.16672] Accelerating Transformer Inference and Training with 2:4 Activation Sparsity
- [2502.06766] Exploiting Sparsity for Long Context Inference: Million Token Contexts on Commodity GPUs
- Accelerating Transformer Pre-Training with 2:4 Sparsity
- [2302.14705] AccelTran: A Sparsity-Aware Accelerator for Dynamic Inference with Transformers
- [2305.18382] Adaptive Sparsity Level during Training for Efficient Time Series Forecasting with Transformers
- [2111.12763] Sparse is Enough in Scaling Transformers
I'll search for information about the 2025 Transformer sparsity benchmark paper to help with your project.

Based on my search results, the most relevant paper is "The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs" (arXiv:2504.17768, April 2025). Here is the key information for your AI coding agent:
Authors: Piotr Nawrot et al. (research conducted at Cohere, piotr.nawrot@ed.ac.uk)
Publication Date: April 24, 2025
- Comprehensive Benchmark: a careful comparison of training-free sparse attention methods across model scales, sequence lengths, and sparsity levels on diverse long-sequence tasks
- Model Used: the Qwen 2.5 family, which supports 128k context length and provides multiple model sizes trained with a consistent methodology
- Implementation Details:
  - Uses the vLLM inference engine with full bf16 precision (a minimal setup sketch follows this list)
  - Focuses exclusively on content-aware methods, since fixed patterns consistently underperform
- isoFLOPS Analysis: for very long sequences, larger, highly sparse models are preferable to smaller, dense ones
- Sparsity Patterns: the best-performing patterns are Vertical-Slash for prefilling and Quest for decoding (illustrative sketches follow this list)
- Phase-Dependent Performance: the sparsity level attainable while guaranteeing accuracy preservation is higher during decoding than during prefilling
- Trade-offs: even moderate sparsity levels often cause significant performance degradation on at least one task
- Performance measured using downstream accuracy on long-context benchmarks (QA, RULER, Story)
- Novel scaling laws with a log-linear formulation for modeling inference performance (a toy one-variable fit follows this list)
- Modifies only the attention mechanism while preserving original architectures
- Includes RAG performance evaluation following Yue et al. (2025)
- No single strategy performs best across all tasks and phases
- Sparse attention is key for processing longer sequences but requires careful evaluation of trade-offs
- Controls attention sparsity via a compression ratio, which directly determines inference compute and memory requirements (see the Quest-style selection sketch after this list)
- Hardware context from the NVIDIA Ampere/TensorRT sources above: the NVIDIA A100 GPU supports the 2:4 sparsity pattern via Sparse Tensor Cores (see the PyTorch 2:4 sketch after this list)
- NVIDIA reports over 30% performance-per-watt gains for 2:4-sparse networks compared to dense ones
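To make the setup concrete, here is a minimal sketch of serving a Qwen 2.5 model with vLLM in bf16, matching the paper's reported configuration. The checkpoint name, context-length setting, and sampling parameters are my assumptions, not details taken from the paper:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",  # assumed checkpoint; the paper uses several Qwen 2.5 sizes
    dtype="bfloat16",                  # full bf16 precision, as reported
    max_model_len=131072,              # 128k context; may require Qwen's YaRN long-context config
)

params = SamplingParams(temperature=0.0, max_tokens=256)
outputs = llm.generate(["<long document here> Question: ..."], params)
print(outputs[0].outputs[0].text)
```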
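For the prefilling pattern, here is a hedged PyTorch sketch of a Vertical-Slash style mask: "vertical" columns are key positions every query attends to, and "slashes" are fixed diagonals. Production implementations (e.g., MInference) use custom sparse kernels and estimate the indices per head; the dense mask below only illustrates the pattern, with arbitrary example indices:

```python
import torch

def vertical_slash_mask(seq_len: int,
                        vertical_idx: torch.Tensor,
                        slash_offsets: torch.Tensor) -> torch.Tensor:
    """Dense boolean [seq_len, seq_len] causal mask keeping only 'vertical'
    columns (key positions every query attends to) and 'slash' diagonals
    (fixed query-minus-key offsets)."""
    q = torch.arange(seq_len).unsqueeze(1)        # query positions, [S, 1]
    k = torch.arange(seq_len).unsqueeze(0)        # key positions,   [1, S]
    causal = k <= q                               # [S, S]
    vertical = torch.zeros(seq_len, dtype=torch.bool)
    vertical[vertical_idx] = True                 # globally attended columns
    slash = ((q - k).unsqueeze(-1) == slash_offsets).any(dim=-1)
    return causal & (vertical.unsqueeze(0) | slash)

# Example: keep the first 4 keys as verticals plus a 64-wide diagonal band.
mask = vertical_slash_mask(1024, torch.arange(4), torch.arange(64))
scores = torch.randn(1024, 1024).masked_fill(~mask, float("-inf"))
attn = scores.softmax(dim=-1)                     # sparse attention weights
```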
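For decoding, here is a sketch of Quest-style page selection under a compression-ratio budget. The page score is the usual Quest upper bound (per channel, take whichever of the page-wise min/max key maximizes the product with the query, then sum); the dimensions and page size are illustrative assumptions:

```python
import torch

def quest_page_scores(q: torch.Tensor, keys: torch.Tensor, page: int) -> torch.Tensor:
    """Upper bound on per-page attention relevance: [n, d] keys -> [n_pages] scores."""
    n, d = keys.shape
    n_pages = n // page
    pages = keys[: n_pages * page].view(n_pages, page, d)
    kmin = pages.min(dim=1).values                # [P, d] per-channel page minima
    kmax = pages.max(dim=1).values                # [P, d] per-channel page maxima
    return torch.maximum(q * kmin, q * kmax).sum(dim=-1)

def select_kv(q, keys, values, compression_ratio=8.0, page=16):
    """Keep the pages implied by the compression ratio, e.g. ratio 8
    keeps roughly 1/8 of the KV cache for this decoding step."""
    scores = quest_page_scores(q, keys, page)
    keep = max(1, int(scores.numel() / compression_ratio))
    top = scores.topk(keep).indices
    idx = (top.unsqueeze(1) * page + torch.arange(page)).flatten()
    return keys[idx], values[idx]

d = 64
q = torch.randn(d)
keys, values = torch.randn(4096, d), torch.randn(4096, d)
k_sel, v_sel = select_kv(q, keys, values, compression_ratio=8)
```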
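The scaling-law bullet can be illustrated with a toy one-variable fit. The numbers below are made up; the paper's actual law jointly models model size, sequence length, and compression:

```python
import numpy as np

# Hypothetical accuracy measurements at several compression ratios.
ratios = np.array([1.0, 2.0, 4.0, 8.0, 16.0, 32.0])
acc = np.array([0.82, 0.81, 0.79, 0.76, 0.70, 0.61])   # made-up numbers

b, a = np.polyfit(np.log(ratios), acc, deg=1)           # acc ≈ a + b·ln(ratio)
print(f"fit: acc ≈ {a:.3f} + ({b:.3f})·ln(compression_ratio)")
```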
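Finally, the 2:4 hardware pattern from the NVIDIA sources can be exercised directly in PyTorch (2.1+, Ampere-or-newer GPU; kernel availability varies by setup). This sketch magnitude-prunes a weight matrix to the 2:4 pattern and packs it for Sparse Tensor Cores:

```python
import torch
from torch.sparse import to_sparse_semi_structured

# Zero the 2 smallest-magnitude entries of every contiguous group of 4.
w = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
groups = w.view(-1, 4)
drop = groups.abs().argsort(dim=1)[:, :2]                # 2 smallest per group
mask = torch.ones_like(groups, dtype=torch.bool).scatter_(1, drop, False)
w_24 = (groups * mask).view_as(w)

w_sparse = to_sparse_semi_structured(w_24)               # packed 2:4 representation
x = torch.randn(128, 4096, device="cuda", dtype=torch.float16)
y = torch.nn.functional.linear(x, w_sparse)              # dispatches to sparse kernels
```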
This benchmark provides a comprehensive evaluation framework for implementing sparse attention in Transformers, which would be crucial for your AI coding agent project. The paper's focus on practical trade-offs and scaling laws makes it particularly relevant for real-world implementations.