@dblalock
Last active November 28, 2023 14:50
List of meta-analyses / independent benchmarking of machine learning and data mining papers

They basically all suggest that apparent improvements to the state of the art in ML and related fields are often not real, or are at least the result of factors other than what the authors claim.

The state of sparsity in deep neural networks

What is the state of neural network pruning?

On the State of the Art of Evaluation in Neural Language Models

Do Transformer Modifications Transfer Across Implementations and Applications?

Are we really making much progress? A worrying analysis of recent neural recommendation approaches

Improvements that don't add up: ad-hoc retrieval results since 1998

On the need for time series data mining benchmarks: a survey and empirical demonstration

On the Importance of Application-Grounded Experimental Design for Evaluating Explainable ML Methods

Stop Oversampling for Class Imbalance Learning: A Critical Review

Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling? An Extensive Empirical Study on Language Tasks

No True State-of-the-Art? OOD Detection Methods are Inconsistent across Datasets

Querying and mining of time series data: experimental comparison of representations and distance measures

Current Time Series Anomaly Detection Benchmarks are Flawed and are Creating the Illusion of Progress

When Do Curricula Work?

Compressed Communication for Distributed Deep Learning: Survey and Quantitative Evaluation

Optimizer Benchmarking Needs to Account for Hyperparameter Tuning

On Empirical Comparisons of Optimizers for Deep Learning

Scientific Credibility of Machine Translation Research: A Meta-Evaluation of 769 Papers

Bag of Tricks for Training Deeper Graph Neural Networks: A Comprehensive Benchmark Study

Tune It or Don't Use It: Benchmarking Data-Efficient Image Classification

Initialization for Nonnegative Matrix Factorization: a Comprehensive Review. This one actually has a consistent finding: nonnegative ICA works best as measured by loss after a fixed number of training iterations, followed by SVD.
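A minimal sketch of that evaluation protocol, assuming scikit-learn and a random nonnegative matrix (scikit-learn has no nonnegative-ICA initializer, so this only contrasts the SVD-based `nndsvd` init against random init after a fixed iteration budget):

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X = np.abs(rng.standard_normal((200, 50)))  # any nonnegative data matrix works here

# Fixed iteration budget, so we compare initializations rather than convergence behavior.
for init in ("nndsvd", "random"):
    model = NMF(n_components=10, init=init, max_iter=50, random_state=0)
    model.fit(X)
    print(init, model.reconstruction_err_)
```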

What's Wrong with Social Science and How to Fix It: Reflections After Reading 2578 Papers. This one is not about ML, but I'm including it for relevance, especially since it shows that even economists, who probably understand statistical testing better than most deep learning researchers, hit basically the same issues as everyone else.

On Efficient Real-Time Semantic Segmentation: A Survey. Semantic segmentation actually is making progress, as measured in a standardized experimental setup on fixed hardware.

Leakage and the Reproducibility Crisis in ML-based Science. See also their website

Nobel and Novice: Author Prominence Affects Peer Review. This one studies economists, but I'm dumping it here for relevance. Good summary here.

Fair Comparison between Efficient Attentions (summary). They benchmarked some “efficient” attention variants on ImageNet in a standardized setup. The efficient attention mechanisms have fewer FLOPs, but are also less accurate.
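Back-of-the-envelope arithmetic (mine, not numbers from the paper) on where the FLOP savings come from: softmax attention's two big matrix products cost roughly 2·n²·d for sequence length n and head dimension d, while linearized variants cost roughly 2·n·d², so the savings only kick in once n exceeds d:

```python
# Counts only the two big matrix products; projections, softmax, and
# normalization are ignored. All numbers are illustrative.
def softmax_attention_flops(n, d):
    return 2 * n * n * d          # QK^T, then (softmax(QK^T)) V

def linear_attention_flops(n, d):
    return 2 * n * d * d          # K^T V (a d x d matrix), then Q (K^T V)

d = 64                            # assumed per-head dimension
for n in (196, 1024, 4096):       # e.g. ViT-style patch counts up to longer sequences
    print(n, softmax_attention_flops(n, d), linear_attention_flops(n, d))
```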

Do Current Multi-Task Optimization Methods in Deep Learning Even Help? Compared to just using a tuned sum of task-specific losses, the answer is no.
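For reference, the baseline they find hard to beat is plain linear scalarization: a weighted sum of per-task losses with the weights tuned like any other hyperparameter. A minimal sketch (names and weights are illustrative, not from the paper):

```python
# Linear-scalarization baseline: just a tuned weighted sum of task losses.
def scalarized_loss(task_losses, task_weights):
    return sum(w * loss for w, loss in zip(task_weights, task_losses))

# Inside a PyTorch training step, with one criterion and prediction per task:
#   losses = [criteria[t](preds[t], targets[t]) for t in range(num_tasks)]
#   scalarized_loss(losses, task_weights=[1.0, 0.5, 2.0]).backward()
```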

Beware of the Simulated DAG! Causal Discovery Benchmarks May Be Easy to Game

Are we really making much progress in unsupervised graph outlier detection? Revisiting the problem with new insight and superior method

Benchmarking Interpretability Tools for Deep Neural Networks

A Quantitative Review on Language Model Efficiency Research

Unprocessing Seven Years of Algorithmic Fairness

This isn't a meta-analysis and I can't get access to the paper, but this was too discouraging to omit: "Authors should: (1) not pick an important problem, (2) not challenge existing beliefs, (3) not obtain surprising results, (4) not use simple methods, (5) not provide full disclosure, and (6) not write clearly."

No Train No Gain: Revisiting Efficient Training Algorithms For Transformer-based Language Models

Scriddie commented Oct 10, 2022

In the same vein, we recently found that recent SOTA results in causal structure learning appear to be due to patterns in the benchmarks which can be exploited in far simpler ways (Beware of the Simulated DAG! Causal Discovery Benchmarks May Be Easy to Game).
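A rough sketch of the kind of simple exploit that paper describes (my paraphrase of its "sortnregress" idea, not the authors' code): in the usual simulated linear benchmarks, marginal variance tends to grow along the causal order, so ordering variables by variance and sparsely regressing each on its predecessors already recovers much of the graph.

```python
import numpy as np
from sklearn.linear_model import LassoLarsIC

def sort_n_regress(X):
    """X: (n_samples, n_vars). Returns a weighted adjacency matrix W[i, j] = effect of i on j."""
    order = np.argsort(X.var(axis=0))        # guess the causal order from marginal variance
    n_vars = X.shape[1]
    W = np.zeros((n_vars, n_vars))
    for pos in range(1, n_vars):
        j = order[pos]
        parents = order[:pos]
        reg = LassoLarsIC(criterion="bic").fit(X[:, parents], X[:, j])
        W[parents, j] = reg.coef_            # sparse regression keeps only a few "parents"
    return W
```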

@dblalock (Author)

Awesome. Added it to the list!
