Inferring gender from column of first names in R
# Let's do some advanced stuff.
# First we have to isolate the first name from the NAME field
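# Drop everything up to the comma (and any spaces right after it), leaving "First Middle"
payroll$first_name <- gsub(".*,\\s*", "", payroll$NAME)
# Drop everything from the first remaining space onward, leaving just the first name
payroll$first_name <- gsub(" .*", "", payroll$first_name)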
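# A quick sanity check of the first-name extraction, sketched on a made-up vector.
# sample_names and sample_first are hypothetical and not part of the payroll data.
# This assumes NAME looks like "Last,First Middle"; allowing optional whitespace
# after the comma is an added assumption so that "Last, First" is handled too.
sample_names <- c("SMITH,JOHN A", "DOE, JANE", "O'NEIL,MARY ANN")
sample_first <- gsub(".*,\\s*", "", sample_names)  # strip the last name and comma
sample_first <- gsub(" .*", "", sample_first)      # strip middle names and initials
print(sample_first)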
# Bring in a library to help normalize the cases
library(stringr)
payroll$first_name <- str_to_title(payroll$first_name)
# Inferring gender from first name based on historical data from the Census and Social Security Administration
library(gender)
# Running the gender function on the payroll column that has the first names
# This will probably take several minutes to process. Have patience.
payroll_gender <- gender(payroll$first_name)
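# You'll now have a list of about 9 MB. Convert it to a data frame to merge with the original payroll set
# (If your version of the gender package already returns a data frame, this step may be unnecessary)
payroll_gender <- do.call(rbind, lapply(payroll_gender, data.frame, stringsAsFactors=FALSE))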
# Isolating the new gender dataframe to just name and gender
payroll_gender <- payroll_gender[c("name", "gender")]
# Cleaning it up to match with the original data frame to make joining easier
colnames(payroll_gender) <- c("first_name", "gender")
# Deleting duplicates. Makes it easier to join.
payroll_gender <- subset(payroll_gender, !duplicated(payroll_gender$first_name))
# Bringing in a package that helps with advanced data work
library(dplyr)
# Bringing together the payroll data set with the inferred gender data set
payroll <- left_join(payroll, payroll_gender, by = "first_name")
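# Why de-duplicating the lookup table matters before a join, sketched on made-up data.
# toy_pay and toy_lookup are hypothetical. Base R merge() with all.x = TRUE stands in
# for left_join() here only so the sketch has no package dependencies.
toy_pay <- data.frame(first_name = c("Ann", "Bob", "Ann"), pay = c(10, 20, 30))
toy_lookup <- data.frame(first_name = c("Ann", "Ann", "Bob"),
                         gender = c("female", "female", "male"))
# With duplicate keys in the lookup, every "Ann" payroll row is matched twice
rows_before <- nrow(merge(toy_pay, toy_lookup, all.x = TRUE))
# After dropping duplicate names, each payroll row is matched at most once
toy_lookup <- subset(toy_lookup, !duplicated(toy_lookup$first_name))
rows_after <- nrow(merge(toy_pay, toy_lookup, all.x = TRUE))
print(c(before = rows_before, after = rows_after))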
# If a name was not matched with a gender, set it to "unknown"
payroll$gender[is.na(payroll$gender)] <- "unknown"
# Let's take a look at the breakdown
table(payroll$gender)