They basically all suggest that apparent improvements to the state of the art in ML and related fields are often not real, or are at least the result of factors other than those the authors claim.
The state of sparsity in deep neural networks
What is the state of neural network pruning?
On the State of the Art of Evaluation in Neural Language Models
Do Transformer Modifications Transfer Across Implementations and Applications?
Are we really making much progress? A worrying analysis of recent neural recommendation approaches
Improvements that don't add up: ad-hoc retrieval results since 1998
On the need for time series data mining benchmarks: a survey and empirical demonstration
On the Importance of Application-Grounded Experimental Design for Evaluating Explainable ML Methods
Stop Oversampling for Class Imbalance Learning: A Critical Review
No True State-of-the-Art? OOD Detection Methods are Inconsistent across Datasets
Compressed Communication for Distributed Deep Learning: Survey and Quantitative Evaluation
Optimizer Benchmarking Needs to Account for Hyperparameter Tuning
On Empirical Comparisons of Optimizers for Deep Learning
Scientific Credibility of Machine Translation Research: A Meta-Evaluation of 769 Papers
Bag of Tricks for Training Deeper Graph Neural Networks: A Comprehensive Benchmark Study
Tune It or Don't Use It: Benchmarking Data-Efficient Image Classification
Initialization for Nonnegative Matrix Factorization: a Comprehensive Review. This one actually has a consistent finding: nonnegative ICA works best, as measured by loss after a fixed number of training iterations, followed by SVD.
What's Wrong with Social Science and How to Fix It: Reflections After Reading 2578 Papers. This one is not about ML, but I'm including it for relevance, especially since it shows that even economists, who probably understand statistical testing better than most deep learning researchers, hit basically the same issues as everyone else.
On Efficient Real-Time Semantic Segmentation: A Survey. Semantic segmentation actually is making progress, as measured in a standardized experimental setup on fixed hardware.
Leakage and the Reproducibility Crisis in ML-based Science. See also their website
Nobel and Novice: Author Prominence Affects Peer Review. This one studies economists, but I'm dumping it here for relevance. Good summary here.
Fair Comparison between Efficient Attentions (summary). They benchmarked some “efficient” attention variants on ImageNet in a standardized setup. The efficient attention mechanisms have fewer FLOPs, but are also less accurate.
Do Current Multi-Task Optimization Methods in Deep Learning Even Help? Compared to just using a tuned sum of task-specific losses, the answer is no.
Beware of the Simulated DAG! Causal Discovery Benchmarks May Be Easy to Game
Benchmarking Interpretability Tools for Deep Neural Networks
A Quantitative Review on Language Model Efficiency Research
Unprocessing Seven Years of Algorithmic Fairness
This isn't a meta-analysis and I can't get access to the paper, but this was too discouraging to omit: "Authors should: (1) not pick an important problem, (2) not challenge existing beliefs, (3) not obtain surprising results, (4) not use simple methods, (5) not provide full disclosure, and (6) not write clearly."
No Train No Gain: Revisiting Efficient Training Algorithms For Transformer-based Language Models
In the same vein, we recently found that recent SOTA results in causal structure learning appear to be due to patterns in the benchmarks which can be exploited in far simpler ways (Beware of the Simulated DAG! Causal Discovery Benchmarks May Be Easy to Game).
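For the multi-task entry above, here is a rough sketch of what a "tuned sum of task-specific losses" baseline could look like. It is not the paper's actual procedure: the function names, the grid search, and the `evaluate` callback are all assumptions made for illustration.

```python
from itertools import product


def weighted_sum_loss(task_losses, weights):
    """Combine per-task losses into one scalar: sum_i w_i * L_i."""
    if len(task_losses) != len(weights):
        raise ValueError("need exactly one weight per task")
    return sum(w * l for w, l in zip(weights, task_losses))


def tune_weights(evaluate, candidate_values, num_tasks):
    """Grid-search the task weights.

    `evaluate` is a hypothetical callback: it is assumed to train a model
    with the given weights and return a validation score (higher is better).
    """
    best_weights, best_score = None, float("-inf")
    for weights in product(candidate_values, repeat=num_tasks):
        score = evaluate(weights)
        if score > best_score:
            best_weights, best_score = weights, score
    return best_weights, best_score
```

The point of the baseline is that the only thing being tuned is a handful of scalar weights, so any fancier multi-task optimizer has to beat this after the weights have had a comparable tuning budget.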