Stephanie Hicks
23 Feb 2015
This Rmd uses the UF Department of Statistics Job Postings website to determine the frequency of faculty, postdoc, lecturer and statistician job postings over the academic year.
One caveat: the website only has postings from Aug 2014 onward, so I cannot include last summer's postings. I am interested in seeing how these plots change once the spring and summer of 2015 are included.
library(rvest)
library(stringr)
library(lubridate)
library(ggplot2)
First, we scrape the tables from the UF Statistics Jobs website. I'm using the rvest package to parse the html pages. The data is contained in tables in the html pages, so I'm using the html() function to parse each page and the html_table() function to extract the tables from it.
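# Loop over the 17 listing pages and stack the tables found on each page.
pgs = vector("list", 17)
for(i in 1:17){
  jobs <- html(paste0("http://www.stat.ufl.edu/jobs/?page=", i))
  pgs[[i]] = do.call(rbind, html_table(jobs))
}
dat = do.call(rbind, pgs)   # combine all pages into one data frame
colnames(dat) = c("Location", "Description", "Date")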
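To try the parsing step without depending on the live site, here is a minimal sketch that runs on a small, made-up HTML snippet. The snippet and its two rows are invented for illustration and are not taken from the UF page. It also assumes a later rvest release, where read_html() replaces html().

```r
library(rvest)

# A tiny, made-up page holding one table (hypothetical, for illustration only).
snippet <- "<html><body><table>
  <tr><th>Location</th><th>Description</th><th>Date</th></tr>
  <tr><td>Example University</td><td>Assistant Professor</td><td>02/17/2015</td></tr>
  <tr><td>Example College</td><td>Lecturer</td><td>02/12/2015</td></tr>
</table></body></html>"

page   <- read_html(snippet)     # parse the HTML text
tables <- html_table(page)       # one data frame per <table> element
toy    <- do.call(rbind, tables) # stack them, as in the loop above

nrow(toy)
names(toy)
```

Because html_table() returns a list with one element per table, do.call(rbind, ...) works the same way whether a page holds one table or several with matching columns.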
These are the top 10 most frequent job description titles.
head(sort(table(dat$Description), decreasing = TRUE), 10)
##
##                Assistant Professor                Postdoctoral Fellow
##                                 26                                 18
##                    Biostatistician Assistant/Associate/Full Professor
##                                 17                                 11
##                       Statistician      Assistant/Associate Professor
##                                 10                                  8
##   Tenure Track Assistant Professor            Postdoctoral Fellowship
##                                  8                                  7
##   Assistant or Associate Professor  Assistant Professor of Statistics
##                                  6                                  6
Using the str_detect() function in the stringr R package, we can use regular expressions to subset the data frame for any jobs that match the pattern "Lecture".
head(dat[str_detect(dat$Description, "Lecture"),])
##                                    Location
## 9   INDIANA UNIVERSITY / BLOOMINGTON CAMPUS
## 15                    Mount Holyoke College
## 19                    University of Glasgow
## 100                Department of Statistics
## 118                      Harvard Statistics
## 119                      Harvard Statistics
##                                           Description       Date
## 9                                            Lecturer 02/17/2015
## 15                    Visiting Lecturer in Statistics 02/12/2015
## 19  Lecturer / Senior Lecturer / Reader in Statistics 02/10/2015
## 100                       Full Time Lecturer Position 12/23/2014
## 118                                          Lecturer 12/15/2014
## 119                                   Senior Lecturer 12/15/2014
Because the str_detect() function matches each string against a single regular expression, we can use the paste() function to join several alternatives with "|" and subset the rows matching either "Lecture" or "Instructor".
head(dat[str_detect(dat$Description, paste(c("Lecture", "Instructor"), collapse='|')),])
##                                    Location
## 9   INDIANA UNIVERSITY / BLOOMINGTON CAMPUS
## 15                    Mount Holyoke College
## 19                    University of Glasgow
## 100                Department of Statistics
## 118                      Harvard Statistics
## 119                      Harvard Statistics
##                                           Description       Date
## 9                                            Lecturer 02/17/2015
## 15                    Visiting Lecturer in Statistics 02/12/2015
## 19  Lecturer / Senior Lecturer / Reader in Statistics 02/10/2015
## 100                       Full Time Lecturer Position 12/23/2014
## 118                                          Lecturer 12/15/2014
## 119                                   Senior Lecturer 12/15/2014
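The alternation trick is easy to check on a handful of made-up titles. The sketch below also shows that matching is case-sensitive by default, and one way to relax that with the regex() modifier, assuming stringr 1.0.0 or later. The titles are invented for illustration.

```r
library(stringr)

# Made-up job titles, for illustration only.
titles <- c("Senior Lecturer", "Instructor of Statistics",
            "Assistant Professor", "lecturer (part time)")

# Join the alternatives with "|" so the regex matches either one.
pattern <- paste(c("Lecture", "Instructor"), collapse = "|")
pattern                                   # "Lecture|Instructor"

# Matching is case-sensitive by default, so the lowercase title is missed.
hits <- str_detect(titles, pattern)
titles[hits]

# regex(..., ignore_case = TRUE) relaxes that for the whole pattern.
hits_any_case <- str_detect(titles, regex(pattern, ignore_case = TRUE))
titles[hits_any_case]
```

The first subset keeps the two capitalized matches; the second also picks up "lecturer (part time)".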
For simplicity, I grouped the data into four categories:
- faculty = tenured or non-tenured faculty positions, including chairs, deans and department heads
- postdoc = postdoctoral fellows
- lecturer = lecturers or instructors
- statistician = statisticians whose primary role is data analysis or managing other data analysts
I_faculty = str_detect(dat$Description, paste(c("Professor", "Tenure", "tenure", "Faculty",
                                                "Assistant", "Chair", "Dean", "Department",
                                                "Head"), collapse='|'))
I_postdoc = str_detect(dat$Description, paste(c("Post", "Fellow"), collapse='|'))
I_lecturer = str_detect(dat$Description, paste(c("Lecture", "Instructor"), collapse='|'))
# All patterns are matched case-sensitively, so "Researcher" does not match "researchers".
I_statistician = str_detect(dat$Description, paste(c("Biostatistic",
                                                     "Statistician", "Scientist",
                                                     "Staff", "Professional", "Analyst",
                                                     "Researcher", "Programmer",
                                                     "Research Associate", "Master",
                                                     "Manager", "Director", "Investigator",
                                                     "Specialist", "Consultant", "VP",
                                                     "Bioinformatician", "Biometrician",
                                                     "Computational"), collapse='|'))
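Now, let's create a new column called "Position" with the job categories. The checks run in order, so a description that matches more than one pattern gets the first category that matches, and anything unmatched is labeled "Other".
dat$Position <- ifelse(I_postdoc, "Postdoc",
                ifelse(I_faculty, "Faculty",
                ifelse(I_lecturer, "Lecturer",
                ifelse(I_statistician, "Statistician", "Other"))))
dat[which(dat$Position == "Other"),]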
##                             Location
## 56   IDEAS European training network
## 74                Aerojet Rocketdyne
## 143      Odyssey Reinsurance Company
## 184              NC State University
## 328               Indiana University
## 340  Univeristy of California, Davis
## 420 Applied Research Solutions, Inc.
## 448            Computational Biology
##                                   Description       Date Position
## 56                 14 Early stage researchers 01/26/2015    Other
## 74                          Summer Internship 01/16/2015    Other
## 143                    Underwriting Associate 12/02/2014    Other
## 184             Grants Proposal Administrator 11/11/2014    Other
## 328                        Bloomington Campus 10/03/2014    Other
## 340                                Statistics 09/30/2014    Other
## 420 Test and Evaluation Subject Matter Expert 09/04/2014    Other
## 448                  University of Pittsburgh 08/27/2014    Other
We see there are a few descriptions that could not be categorized using the regex patterns above. We'll use some google-fu next to determine where they belong.
It turns out the "University of Pittsburgh" advertisement is for a postdoc. The "Bloomington Campus" and "Statistics" advertisements are for faculty positions. The "14 Early stage researchers" advertisement is for statistician positions. I removed the remaining four ("Summer Internship", "Underwriting Associate", "Grants Proposal Administrator", "Test and Evaluation Subject Matter Expert") as I don't think they are relevant to the analysis here.
dat$Position[dat$Description == "University of Pittsburgh"] <- "Postdoc"
dat$Position[dat$Description %in% c("Bloomington Campus", "Statistics")] <- "Faculty"
dat$Position[dat$Description == "14 Early stage researchers"] <- "Statistician"
dat = dat[!(dat$Description %in% c("Summer Internship", "Underwriting Associate",
                                   "Grants Proposal Administrator",
                                   "Test and Evaluation Subject Matter Expert")),]
dat[which(dat$Position == "Other"),]
## [1] Location Description Date Position
## <0 rows> (or 0-length row.names)
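Because the nested ifelse() calls test the indicators in a fixed order, a title that matches several patterns ends up in the first category tested (postdoc, then faculty, then lecturer, then statistician). The sketch below shows the same precedence rule written as a loop over named patterns. The patterns are simplified, the titles are made up, and the names `patterns` and `classify` are hypothetical helpers, not part of the analysis above.

```r
# Simplified, hypothetical patterns listed from highest to lowest priority.
patterns <- c(Postdoc      = "Post|Fellow",
              Faculty      = "Professor|Faculty",
              Lecturer     = "Lecture|Instructor",
              Statistician = "Statistician|Analyst")

classify <- function(titles, patterns) {
  position <- rep("Other", length(titles))
  # Apply the lowest priority first so higher-priority matches overwrite it.
  for (label in rev(names(patterns))) {
    position[grepl(patterns[[label]], titles)] <- label
  }
  position
}

# Made-up titles, for illustration only.
titles <- c("Postdoctoral Research Statistician",
            "Assistant Professor",
            "Data Analyst",
            "Summer Internship")
result <- classify(titles, patterns)
result
```

The first title matches both the postdoc and the statistician patterns and is labeled "Postdoc", which mirrors what the nested ifelse() would do. Keeping the patterns in one named vector makes the priority order explicit and easy to change.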
OK, so now we have dealt with grouping all the positions. Let's use the lubridate R package to make the Date column more R friendly. I'm using the mdy() function to tell R this column contains dates in the form of "month/day/year". The month() function extracts the month from each of the rows.
table(month(mdy(dat$Date)))
##
##   1   2   8   9  10  11  12
##  53  44  53  96 111  78  48
Let's add a few other columns to our data frame.
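dat$Position = factor(dat$Position)
dat$Date = mdy(dat$Date) # parse the month/day/year strings into dates
# order the months by academic year (Aug to Feb) rather than by calendar year
dat$month = factor(month(dat$Date, label=TRUE),
                   levels = c("Aug", "Sep", "Oct", "Nov", "Dec", "Jan", "Feb"))
dat$dayOfWeek = wday(dat$Date, label = TRUE) # day of week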
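The date handling can be tried on a few made-up strings. This sketch parses them with mdy(), extracts the month, and builds a factor in academic-year order. As a variant on the code above, it takes the labels from base R's month.abb so they do not depend on the session locale, which month(..., label = TRUE) can. The dates and the names `raw`, `acad_levels` and `mon` are invented for illustration.

```r
library(lubridate)

# Made-up posting dates in month/day/year form, for illustration only.
raw <- c("09/30/2014", "12/15/2014", "02/17/2015")

dates <- mdy(raw)     # parse to Date objects
month(dates)          # numeric months

# month.abb holds base R's English month abbreviations.
acad_levels <- c("Aug", "Sep", "Oct", "Nov", "Dec", "Jan", "Feb")
mon <- factor(month.abb[month(dates)], levels = acad_levels)
levels(mon)
as.integer(mon)       # position of each date within the Aug-to-Feb ordering
```

Setting the levels explicitly is what makes a plot run from Aug to Feb instead of from Jan to Dec.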
The frequency of job postings by position, day of the week and month:
ggplot(dat, aes(x = Position)) + geom_bar() # job postings by type of position
ggplot(dat, aes(x = dayOfWeek)) + geom_bar() # job postings by day of the week
ggplot(dat, aes(x = month)) + geom_bar() # job postings by month
Job postings by date, day of the week and month (colors represent the type of position).
ggplot(dat, aes(x = Date, fill = Position)) + geom_bar(position="dodge")
ggplot(dat, aes(x = dayOfWeek, fill = Position)) + geom_bar(position="dodge")
ggplot(dat, aes(x = month, fill = Position)) + geom_bar(position="dodge")
In this data set, most academic faculty positions are posted from September through November, and most postdoc positions are posted after that period.
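One way to check a statement like this numerically is a cross-tabulation of month by position. The sketch below uses a small made-up data frame, so the counts are invented and say nothing about the real postings. With the scraped data, the equivalent call would be table(dat$month, dat$Position).

```r
# Made-up rows, for illustration only; not the scraped data.
toy <- data.frame(
  month    = factor(c("Sep", "Sep", "Oct", "Jan", "Jan", "Feb"),
                    levels = c("Aug", "Sep", "Oct", "Nov", "Dec", "Jan", "Feb")),
  Position = c("Faculty", "Faculty", "Faculty", "Postdoc", "Postdoc", "Faculty")
)

counts <- table(toy$month, toy$Position)   # rows: month, columns: position
counts

# Share of each position's postings that falls in each month.
round(prop.table(counts, margin = 2), 2)
```

Reading down a column of the second table shows where in the year the postings for that position are concentrated.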