These papers basically all suggest that apparent improvements to the state of the art in ML and related fields are often not real, or are at least the result of factors other than what the authors claim.
The State of Sparsity in Deep Neural Networks
What is the State of Neural Network Pruning?
On the State of the Art of Evaluation in Neural Language Models
Do Transformer Modifications Transfer Across Implementations and Applications?
Are we really making much progress? A worrying analysis of recent neural recommendation approaches
Improvements that don't add up: ad-hoc retrieval results since 1998
On the need for time series data mining benchmarks: a survey and empirical demonstration
On the Importance of Application-Grounded Experimental Design for Evaluating Explainable ML Methods
Stop Oversampling for Class Imbalance Learning: A Critical Review
No True State-of-the-Art? OOD Detection Methods are Inconsistent across Datasets
Compressed Communication for Distributed Deep Learning: Survey and Quantitative Evaluation
Optimizer Benchmarking Needs to Account for Hyperparameter Tuning
On Empirical Comparisons of Optimizers for Deep Learning
Scientific Credibility of Machine Translation Research: A Meta-Evaluation of 769 Papers
Bag of Tricks for Training Deeper Graph Neural Networks: A Comprehensive Benchmark Study
Tune It or Don't Use It: Benchmarking Data-Efficient Image Classification
Initialization for Nonnegative Matrix Factorization: A Comprehensive Review. This one actually has a consistent finding: nonnegative ICA works best as measured by loss after a fixed number of training iterations, followed by SVD (a minimal sketch of this comparison protocol appears at the end of this list).
What's Wrong with Social Science and How to Fix It: Reflections After Reading 2578 Papers. This one is not about ML, but I'm including it because it's relevant, especially since it shows that even economists, who probably understand statistical testing better than most deep learning researchers, run into basically the same issues as everyone else.
On Efficient Real-Time Semantic Segmentation: A Survey. Semantic segmentation is actually making progress, as measured in a standardized experimental setup on fixed hardware.
Leakage and the Reproducibility Crisis in ML-based Science. See also their website.
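The NMF comparison mentioned above is easy to reproduce in miniature. Here's a minimal sketch (mine, not the paper's) of that evaluation protocol using scikit-learn: fix the iteration budget, vary the initialization, and compare final reconstruction loss. Note that scikit-learn ships SVD-based initializations (the 'nndsvd' variants) but not nonnegative ICA, so this only covers part of the paper's comparison.

```python
# Sketch: compare NMF initializations by loss after a fixed iteration budget.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.RandomState(0)
X = rng.rand(200, 50)  # any nonnegative data matrix works here

for init in ["random", "nndsvd", "nndsvda"]:
    model = NMF(n_components=10, init=init, max_iter=100, random_state=0)
    model.fit(X)
    # reconstruction_err_ is the Frobenius-norm loss ||X - WH|| after fitting
    print(f"{init:>7}: loss after 100 iterations = {model.reconstruction_err_:.4f}")
```

Holding the iteration budget fixed, rather than running each method to convergence, is what makes this a comparison of initializations rather than of stopping criteria.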