@ledell
Last active August 29, 2015 14:11
Quick estimate of gender distribution of CRAN package maintainers
library(miniCRAN)
library(gender)
library(stringr)
# Get package description data
# This took about an hour to run, so you can load the data directly below
# pkgs <- available.packages("http://cran.rstudio.com/src/contrib")
# desc <- getCranDescription(pkgs, repos = c(CRAN="http://cran.rstudio.com"))
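# stringsAsFactors = FALSE keeps the Maintainer column as character strings
desc <- read.csv("http://www.stat.berkeley.edu/~ledell/data/RStudioCRAN_pkgDesc_20141216.csv",
                 stringsAsFactors = FALSE)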
# Grab first name of maintainers
# Strip whitespace, split on a single space to get first name and remove punctuation
firstname <- function(maintainer) {
  gsub("[[:punct:]]", "", str_split(str_trim(maintainer), " ", n = 2)[[1]][1])
}
maintainers <- sapply(desc$Maintainer, firstname)
name_freq <- table(maintainers)
# Predicted gender of unique maintainer names (1970 seemed like a good median birth year)
# The default lookup method uses U.S. Social Security Administration data,
# so this will be more accurate for U.S. names
# This takes a while to run (approx 35 mins)
pred_genders <- sapply(names(name_freq), function(x) gender(x, years = c(1970))$gender)
# Fix one important miscoding of gender!
pred_genders[which(names(pred_genders) == "Hadley")] <- "male"
# Estimated gender distribution of CRAN package maintainers
# Repeat each predicted gender once per maintainer sharing that first name
gender_table <- table(unlist(sapply(seq_along(pred_genders),
  function(i) rep(pred_genders[i], name_freq[i]))), useNA = "always")
gender_table
# female   male   <NA>
#    539   3731   1835
# Proportions
round(gender_table/length(maintainers), 3)
# female   male   <NA>
#  0.088  0.611  0.301
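
The counting step above repeats each predicted gender once per maintainer sharing that first name. A minimal sketch of the same idea in base R follows, indexing a named lookup vector by first name instead. The maintainer strings and the `lookup` vector are hypothetical stand-ins, not CRAN data or real `gender()` output, and this version splits on any whitespace rather than a single space.

```r
# Toy maintainer strings (hypothetical, not taken from the CRAN data)
maintainer <- c("Ann Example <ann@example.org>",
                "  Bob Sample <bob@example.org>",
                "Ann Other <ann2@example.org>",
                "Q. Unknown <q@example.org>")

# First whitespace-delimited token with punctuation removed, in base R
first_names <- gsub("[[:punct:]]", "", sub("\\s.*$", "", trimws(maintainer)))

# Hypothetical lookup standing in for the predictions returned by gender()
lookup <- c(Ann = "female", Bob = "male")

# Indexing by name gives one prediction per maintainer; unmatched names become NA
per_maintainer <- unname(lookup[first_names])

# Counts and proportions, keeping the NA category as in the script above
gender_counts <- table(per_maintainer, useNA = "always")
gender_props <- round(gender_counts / length(first_names), 3)
print(gender_counts)
print(gender_props)
```

Because the lookup is indexed once per maintainer, names with no prediction are counted as NA directly, without a separate expansion over name frequencies.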