benmarwick/most_common_stat_tests.r

## most_common_stat_tests.r
---
title: "What are the most frequently used statistical tests?"
author: "Ben Marwick"
date: "March 31, 2016"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,
                      warning = FALSE,
                      message = FALSE)
```

I was thinking recently about the history of statistics, and why some methods are popular today while others are not. This led me to ask the question "what are the most popular basic statistical methods?"

I have a pretty good sense of what's popular in my own field, but that's a pretty small group. I wanted to have a look at scientists generally in lots of disciplines, and I wanted to do it from my kitchen table using R. Two methods seemed suitable: using rOpenSci's `fulltext` package to search the full text of journal articles, and using the `gtrendsR` package to query Google Trends.

Here's the list of statistical methods that I wanted to know about:

```{r tests}
the_tests <- c("t-test", "chi-square", "chi square", "chi-squared", "ANOVA", "Wilcox", "Fisher's exact", "Pearson", "z-test", "f-test", "Bayesian", "confidence interval", "Kruskal Wallis", "Kruskal-Wallis", "Wilcoxon", "correlation", "multiple correlation", "MANOVA", "factor analysis", "logistic regression", "multiple regression", "Principal component analysis", "bootstrap", "resampling", "Mann Whitney", "Mann-Whitney", "cluster analysis", "ANCOVA", "linear regression", "Kolmogorov-Smirnov")
```
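Several methods appear in this list under more than one spelling (for example "chi-square", "chi square" and "chi-squared"), so their counts end up in separate bars. As a minimal sketch, and not something the analysis here does, the variants could be collapsed to one label before plotting. The `canonical` lookup, the `canonicalise()` helper and the toy counts are all hypothetical.

```r
# Hypothetical lookup: variant spelling -> one canonical label
canonical <- c("chi-square"     = "chi-squared",
               "chi square"     = "chi-squared",
               "chi-squared"    = "chi-squared",
               "Kruskal Wallis" = "Kruskal-Wallis",
               "Kruskal-Wallis" = "Kruskal-Wallis",
               "Mann Whitney"   = "Mann-Whitney",
               "Mann-Whitney"   = "Mann-Whitney")

# Replace known variants, leave every other term untouched
canonicalise <- function(x) {
  hit <- x %in% names(canonical)
  x[hit] <- unname(canonical[x[hit]])
  x
}

# Made-up counts, purely for illustration
toy <- data.frame(test = c("chi-square", "chi square", "chi-squared", "t-test"),
                  freq = c(10, 5, 20, 40),
                  stringsAsFactors = FALSE)

toy$test <- canonicalise(toy$test)

# Sum the counts that now share a label
collapsed <- aggregate(freq ~ test, data = toy, FUN = sum)
collapsed
```

Summing like this would double count any article that uses two spellings of the same method, so it is only a rough correction.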

Here is how we can search the full text of a bunch of journals. We might take this as an indicator of what researchers are actually using in their scientific publications.

```{r fulltext}
sources <- c('plos','crossref','arxiv', 'europmc', 'bmc')

library("purrr")
library("fulltext")
library("dplyr")

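# run one search per term across all the sources
results <-  the_tests %>%
  map(~ ft_search(query = ., from = sources))

# pull out the second element of each source's result (treated here as the
# number of articles found), then sum across sources for each term.
# note: at_depth() is from the purrr release current in 2016; later
# releases renamed it modify_depth()
results_df <-   results %>%
  at_depth(2, 2) %>%
  invoke(rbind, .)  %>%
  data.frame %>%
  apply(., 1, unlist) %>%
  data.frame %>%
  colSums %>%
  setNames(., nm = the_tests) %>%
  data.frame(test = names(.),
            freq = unname(.))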

library(ggplot2)
ggplot(results_df, aes(reorder(test, -freq), freq)) +
  geom_bar(stat = "identity") +
  xlab("method") +
  ylab("number of articles") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.2))
```
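The reshaping step above is fairly dense, so here is a minimal base R sketch of the same idea: take a count per source for each term, treat a missing count as zero, and sum to one number per term. The `toy_results` list and its numbers are made up, and its shape is an assumption for illustration rather than the actual structure returned by `ft_search()`.

```r
# Hypothetical stand-in for the search results: one element per term,
# each holding the number of hits reported by each source
toy_results <- list(
  "t-test" = list(plos = 120, crossref = 300, arxiv = 15),
  "ANOVA"  = list(plos = 200, crossref = 450, arxiv = NULL)  # no count returned
)

# Sum the hits for one term, counting a missing source as zero
count_hits <- function(per_source) {
  hits <- vapply(per_source,
                 function(n) if (is.null(n)) 0 else as.numeric(n),
                 numeric(1))
  sum(hits)
}

toy_df <- data.frame(test = names(toy_results),
                     freq = vapply(toy_results, count_hits, numeric(1)),
                     row.names = NULL,
                     stringsAsFactors = FALSE)

# most frequent first, as in the bar plot
toy_df <- toy_df[order(-toy_df$freq), ]
toy_df
```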

Here is how we can see how much Google search interest each of the tests has received recently. Google Trends reports relative search interest rather than raw search counts, so the values are best read as a rough index of popularity. This is an indication of what people are searching for, and would include students and people in industry whose research might not end up in publications that we could access with the previous method.

```{r gtrends_hide, echo=FALSE}
library(gtrendsR)
usr <- "benmarwick@gmail.com"
psw <- ""
gconnect(usr, psw)
```
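Because only five terms can be searched at a time, the terms are split into batches and each batch is queried separately. The following is a small sketch of that pattern using placeholder terms and a hypothetical `fetch()` function that fails on one batch. It uses base R's `tryCatch()` rather than `purrr::safely()`, and stores each result with `results[i] <- list(...)` so that a failed batch leaves a `NULL` placeholder instead of shortening the list.

```r
# Placeholder terms standing in for the real search terms
terms <- letters[1:12]

# ceiling(index / 5) gives 1 for the first five terms, 2 for the next five, ...
batch_id <- ceiling(seq_along(terms) / 5)
batches  <- split(terms, batch_id)
lengths(batches)

# Hypothetical stand-in for a remote query that sometimes fails
fetch <- function(batch) {
  if ("f" %in% batch) stop("simulated failure")
  toupper(batch)
}

results <- vector("list", length(batches))
for (i in seq_along(batches)) {
  res <- tryCatch(fetch(batches[[i]]), error = function(e) NULL)
  # assigning with [i] <- list(...) keeps a NULL in slot i;
  # results[[i]] <- NULL would delete the slot instead
  results[i] <- list(res)
}

# which batches came back with data
ok <- !vapply(results, is.null, logical(1))
ok
```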


```{r gtrends}
# we can only search for five terms at a time
the_tests_pieces <- split(the_tests, ceiling(seq_along(the_tests)/5))
text_trends <- vector("list", length(the_tests_pieces))
all_the_trends <- data.frame(matrix(ncol = length(the_tests),
                                    nrow = 500))

# make a safe version of the function once, so that one failed
# batch doesn't stop the whole loop
gtrends_safe <- safely(gtrends)

# loop to search all the terms in batches of five terms at a time
for(i in seq_along(the_tests_pieces)){
  # get the data from google, we'll just save the 'trend' bits for plotting.
  # $result is NULL when the query failed; assigning with [i] <- list(...)
  # keeps that NULL in place instead of deleting the list element
  text_trends[i] <- list(gtrends_safe(the_tests_pieces[[i]])$result$trend)
}

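# take the dates from the first batch (assumes every batch covers the same dates)
date_time <- text_trends[[1]]$start
# keep only the five term columns (columns 3 to 7) from each batch
text_trends_1 <- lapply(text_trends, "[", seq_along(date_time), 3:7, drop = FALSE)
# drop the batches where the query failed
text_trends_2 <- text_trends_1[ ! sapply(text_trends_1, is.null) ]
# bind the batches side by side, one column per term. The original
# Reduce(dplyr::inner_join, list(...)) call never actually joined anything,
# because Reduce() over a one-element list returns that element unchanged
text_trends_df <- data.frame(text_trends_2)
text_trends_df$date_time <-  date_time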

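# total search interest per term, summed over time (every column except date_time).
# seq_len() avoids the precedence trap in 1:ncol(x)-1, which is (1:ncol(x)) - 1
gtrend_total <- colSums(text_trends_df[, seq_len(ncol(text_trends_df) - 1)])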
gtrend_total_df <- data.frame(test = names(gtrend_total),
                              value = unname(gtrend_total))

library(ggplot2)
ggplot(gtrend_total_df, aes(reorder(test, -value), value)) +
  geom_bar(stat = "identity")  +
  xlab("method") +
  ylab("relative Google\nsearch interest") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

# over time
library(tidyr)
# reshape every column except date_time, rather than hard-coding 1:24
text_trends_df_long <- gather(text_trends_df, 'test', 'value', -date_time)

library(plotly)
library(ggrepel)

p <- ggplot(text_trends_df_long, aes(date_time, value, colour = test, label = test)) +
  stat_smooth() +
  coord_cartesian(xlim =c(min(text_trends_df_long$date_time), max(text_trends_df_long$date_time) + 100000000)) +
  scale_x_datetime(date_breaks = "1 year", date_labels = "%Y")   +
  scale_y_log10() +
  geom_text_repel(
    data = subset(text_trends_df_long, date_time == max(date_time)),
    aes(label = test),
    size = 3,
    nudge_x = 5,
    segment.color = NA
  ) +
  guides(colour = "none")  +
  theme_bw()
ggplotly(p)
```