@benmarwick
Last active May 29, 2019 15:50
---
title: "What are the most frequently used statistical tests?"
author: "Ben Marwick"
date: "March 31, 2016"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,
warning = FALSE,
message = FALSE)
```
I was thinking recently about the history of statistics, and about why some methods are popular today while others are not. This led me to ask: what are the most popular basic statistical methods?
I have a pretty good sense of what's popular in my own field, but that's a pretty small group. I wanted to look at scientists across many disciplines, and I wanted to do it from my kitchen table using R. Two methods seemed suitable: using rOpenSci's `fulltext` package to search the full text of journal articles, and using the `gtrendsR` package to see what people search for on Google.
Here's the list of statistical methods that I wanted to know about:
```{r tests}
the_tests <- c("t-test", "chi-square", "chi square", "chi-squared", "ANOVA",
               "Wilcox", "Fisher's exact", "Pearson", "z-test", "f-test",
               "Bayesian", "confidence interval", "Kruskal Wallis",
               "Kruskal-Wallis", "Wilcoxon", "correlation",
               "multiple correlation", "MANOVA", "factor analysis",
               "logistic regression", "multiple regression",
               "Principal component analysis", "bootstrap", "resampling",
               "Mann Whitney", "Mann-Whitney", "cluster analysis", "ANCOVA",
               "linear regression", "Kolmogorov-Smirnov")
```
Here is how we can search the full text of a bunch of journals. The number of articles that mention each method gives a rough indicator of what researchers are actually using in their scientific publications.
```{r fulltext}
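library("purrr")
library("fulltext")
library("dplyr")
library("ggplot2")

# literature sources to query
sources <- c('plos', 'crossref', 'arxiv', 'europmc', 'bmc')

# run one search per method across all the sources
results <- the_tests %>%
  map(~ ft_search(query = ., from = sources))

# pull out the second element of each source's result (taken here as the
# number of articles found) and add the counts up across sources
results_df <- results %>%
  map_depth(2, 2) %>%      # this was at_depth(2, 2) in older versions of purrr
  do.call(rbind, .) %>%
  data.frame %>%
  apply(., 1, unlist) %>%
  data.frame %>%
  colSums %>%
  setNames(., nm = the_tests) %>%
  { data.frame(test = names(.), freq = unname(.)) }

# bar chart of article counts, most frequent method first
ggplot(results_df, aes(reorder(test, -freq), freq)) +
  geom_bar(stat = "identity") +
  xlab("method") +
  ylab("number of articles") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.2))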
```
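The reshaping step in that chunk is easier to follow on a small offline example. The sketch below uses an invented nested list, `toy_results`, with made-up counts and a hypothetical `found` field per source. It is not the actual structure returned by `ft_search()`, only an illustration of summing per-source counts into one row per method.

```r
# Toy stand-in for the search results: one entry per test, each holding a
# hit count per source. These numbers are invented for illustration only.
toy_results <- list(
  "t-test" = list(plos = list(found = 120), arxiv = list(found = 30)),
  "ANOVA"  = list(plos = list(found = 200), arxiv = list(found = 5)),
  "MANOVA" = list(plos = list(found = 15),  arxiv = list(found = NULL))
)

# For one test, add up `found` across sources, treating a missing count as 0
sum_found <- function(per_source) {
  counts <- vapply(per_source,
                   function(src) if (is.null(src$found)) 0 else src$found,
                   numeric(1))
  sum(counts)
}

# One total per test, as a named numeric vector
totals <- vapply(toy_results, sum_found, numeric(1))

# Same two-column shape that the plotting code expects
toy_df <- data.frame(test = names(totals),
                     freq = unname(totals),
                     stringsAsFactors = FALSE)
toy_df
```

Treating a missing count as zero is a choice made for this sketch. It keeps a source that returned nothing from dropping the whole method out of the totals.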
Here is how we can see how much Google search interest each of the tests has received recently. Google Trends reports relative search interest rather than raw search counts, so the values are best read as a ranking. This is an indication of what people are searching for, and would include students and people in industry whose research might not end up in publications that we could access with the previous method.
```{r gtrends_hide, echo=FALSE}
library(gtrendsR)
usr <- "benmarwick@gmail.com"
psw <- ""
gconnect(usr, psw)
```
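The next chunk sends the search terms to Google Trends in batches of five. As a minimal sketch of that batching idiom on its own, the example below splits a made-up vector of twelve terms (`terms` and `batch_size` are names chosen here for illustration) into groups of at most five using only base R.

```r
# Hypothetical stand-in for the real search terms
terms <- c("t-test", "ANOVA", "Pearson", "z-test", "f-test",
           "Bayesian", "bootstrap", "resampling", "MANOVA", "ANCOVA",
           "correlation", "Wilcoxon")

batch_size <- 5

# seq_along(terms) / batch_size gives 0.2, 0.4, ..., and ceiling() turns that
# into a batch id: 1 for the first five terms, 2 for the next five, and so on
batch_id <- ceiling(seq_along(terms) / batch_size)

# split() groups the terms by batch id into a named list
batches <- split(terms, batch_id)

# number of terms in each batch
lengths(batches)
```

The last batch simply holds whatever is left over, so the idiom works for any number of terms.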
```{r gtrends}
# we can only search for five terms at a time
the_tests_pieces <- split(the_tests, ceiling(seq_along(the_tests) / 5))
text_trends <- vector("list", length(the_tests_pieces))

# make a safe version of the function, so one failed query doesn't stop the loop
gtrends_safe <- safely(gtrends)

# loop to search all the terms in batches of five terms at a time
for (i in seq_along(the_tests_pieces)) {
  # get the data from google, we'll just save the 'trend' bits for plotting
  res <- gtrends_safe(the_tests_pieces[[i]])$result
  # only store successful queries (assigning NULL would drop the list element)
  if (!is.null(res)) text_trends[[i]] <- res$trend
}

# get the 'trends' and combine for all the stat methods we're interested in
# drop any batches that failed
text_trends <- text_trends[!sapply(text_trends, is.null)]
date_time <- text_trends[[1]]$start
# keep the five term columns (columns 3 to 7) from each batch
text_trends_1 <- lapply(text_trends, "[", seq_along(date_time), 3:7, drop = FALSE)
# the batches share the same dates, so bind their columns side by side
text_trends_df <- do.call(cbind, text_trends_1)
text_trends_df$date_time <- date_time
# total search interest per method, summed over the whole period
# (every column except date_time)
trend_cols <- setdiff(names(text_trends_df), "date_time")
gtrend_total <- colSums(text_trends_df[, trend_cols, drop = FALSE])
gtrend_total_df <- data.frame(test = names(gtrend_total),
                              value = unname(gtrend_total))
library(ggplot2)
ggplot(gtrend_total_df, aes(reorder(test, -value), value)) +
  geom_bar(stat = "identity") +
  xlab("method") +
  ylab("total Google\nsearch interest") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
# over time
library(tidyr)
library(plotly)
library(ggrepel)

# reshape to long format: one row per date and method
text_trends_df_long <- gather(text_trends_df, "test", "value", -date_time)

p <- ggplot(text_trends_df_long, aes(date_time, value, colour = test, label = test)) +
  stat_smooth() +
  # extend the x axis to leave room on the right for the labels
  coord_cartesian(xlim = c(min(text_trends_df_long$date_time),
                           max(text_trends_df_long$date_time) + 100000000)) +
  scale_x_datetime(date_breaks = "1 year", date_labels = "%Y") +
  scale_y_log10() +
  # label each line at its most recent date instead of using a legend
  geom_text_repel(
    data = subset(text_trends_df_long, date_time == max(date_time)),
    aes(label = test),
    size = 3,
    nudge_x = 5,
    segment.color = NA
  ) +
  guides(colour = "none") +
  theme_bw()

# interactive version of the plot
ggplotly(p)
```