@stephaniehicks
Last active August 29, 2015 14:15

UF Department of Statistics Job Postings

Stephanie Hicks
23 Feb 2015

Purpose

This Rmd uses the UF Department of Statistics Job Postings website to determine the frequency of faculty, postdoc, lecturer and statistician jobs over the academic year.

One caveat: the website only has data from Aug 2014 up until now, so I cannot include the postings over the summer. I am interested in seeing how these plots differ once the spring and summer of 2015 are included.

Load libraries

library(rvest)
library(stringr)
library(lubridate)
library(ggplot2)

Scrape data

First, we scrape the tables from the UF Statistics Jobs website.

I'm using the rvest package to parse the HTML pages. The data is contained in tables, so I'm using read_html() to parse each page and html_table() to extract its tables.

pgs = vector("list", 17)
for(i in 1:17){
    # read_html() replaces html(), which newer versions of rvest deprecate
    jobs <- read_html(paste0("http://www.stat.ufl.edu/jobs/?page=", i))
    # stack all tables found on this page into one data frame
    pgs[[i]] = do.call(rbind, html_table(jobs))
}
dat = do.call(rbind, pgs)
colnames(dat) = c("Location", "Description", "Date")

These are the top 10 most frequent job description titles.

head(sort(table(dat$Description), decreasing = TRUE), 10)
## 
##                Assistant Professor                Postdoctoral Fellow 
##                                 26                                 18 
##                    Biostatistician Assistant/Associate/Full Professor 
##                                 17                                 11 
##                       Statistician      Assistant/Associate Professor 
##                                 10                                  8 
##   Tenure Track Assistant Professor            Postdoctoral Fellowship 
##                                  8                                  7 
##   Assistant or Associate Professor  Assistant Professor of Statistics 
##                                  6                                  6

Data Cleaning

Using the str_detect() function in the stringr R package, we can use regular expressions to subset the data frame for any jobs that match the pattern "Lecture".

head(dat[str_detect(dat$Description, "Lecture"),])
##                                    Location
## 9   INDIANA UNIVERSITY / BLOOMINGTON CAMPUS
## 15                    Mount Holyoke College
## 19                    University of Glasgow
## 100                Department of Statistics
## 118                      Harvard Statistics
## 119                      Harvard Statistics
##                                           Description       Date
## 9                                            Lecturer 02/17/2015
## 15                    Visiting Lecturer in Statistics 02/12/2015
## 19  Lecturer / Senior Lecturer / Reader in Statistics 02/10/2015
## 100                       Full Time Lecturer Position 12/23/2014
## 118                                          Lecturer 12/15/2014
## 119                                   Senior Lecturer 12/15/2014

Because str_detect() takes a single pattern, we can use paste() with collapse='|' to join several alternatives into one regular expression and subset the rows matching either "Lecture" or "Instructor".

head(dat[str_detect(dat$Description, paste(c("Lecture", "Instructor"), collapse='|')),])
##                                    Location
## 9   INDIANA UNIVERSITY / BLOOMINGTON CAMPUS
## 15                    Mount Holyoke College
## 19                    University of Glasgow
## 100                Department of Statistics
## 118                      Harvard Statistics
## 119                      Harvard Statistics
##                                           Description       Date
## 9                                            Lecturer 02/17/2015
## 15                    Visiting Lecturer in Statistics 02/12/2015
## 19  Lecturer / Senior Lecturer / Reader in Statistics 02/10/2015
## 100                       Full Time Lecturer Position 12/23/2014
## 118                                          Lecturer 12/15/2014
## 119                                   Senior Lecturer 12/15/2014

For simplicity, I grouped the data into four categories:

  1. faculty = tenured or non-tenured faculty position, including chairs, deans and department heads.
  2. postdoc = postdoctoral fellows
  3. lecturer = lecturer or instructor
  4. statistician = a statistician whose primary role is data analysis or managing other data analysts.

I_faculty = str_detect(dat$Description, paste(c("Professor", "Tenure", "tenure", "Faculty", 
                                             "Assistant", "Chair", "Dean", "Department", 
                                             "Head"), collapse='|'))
I_postdoc = str_detect(dat$Description, paste(c("Post", "Fellow"), collapse='|'))
I_lecturer = str_detect(dat$Description, paste(c("Lecture", "Instructor"), collapse='|'))
# All patterns are matched case-sensitively: wrapping a single element in
# ignore.case() has no effect, because c() and paste() drop the attribute.
I_statistician = str_detect(dat$Description, paste(c("Biostatistic", 
                                                  "Statistician", "Scientist", 
                                                  "Staff", "Professional", "Analyst", 
                                                  "Researcher", "Programmer", 
                                                  "Research Associate", "Master",
                                                  "Manager", "Director", "Investigator",
                                                  "Specialist", "Consultant", "VP", 
                                                  "Bioinformatician", "Biometrician", 
                                                  "Computational"), collapse='|'))

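Now, let's create a new column called "Position" holding the job category. The nested ifelse() calls are checked in order (postdoc, faculty, lecturer, statistician), so a description that matches several patterns gets the first matching category.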

dat$Position <- ifelse(I_postdoc, "Postdoc", ifelse(I_faculty, "Faculty", 
                                         ifelse(I_lecturer, "Lecturer", 
                                         ifelse(I_statistician, "Statistician", "Other"))))
dat[which(dat$Position == "Other"),]
##                             Location
## 56   IDEAS European training network
## 74                Aerojet Rocketdyne
## 143      Odyssey Reinsurance Company
## 184              NC State University
## 328               Indiana University
## 340  Univeristy of California, Davis
## 420 Applied Research Solutions, Inc.
## 448            Computational Biology
##                                   Description       Date Position
## 56                 14 Early stage researchers 01/26/2015    Other
## 74                          Summer Internship 01/16/2015    Other
## 143                    Underwriting Associate 12/02/2014    Other
## 184             Grants Proposal Administrator 11/11/2014    Other
## 328                        Bloomington Campus 10/03/2014    Other
## 340                                Statistics 09/30/2014    Other
## 420 Test and Evaluation Subject Matter Expert 09/04/2014    Other
## 448                  University of Pittsburgh 08/27/2014    Other

We see there are a few descriptions that could not be categorized using the regex patterns above. We'll use some google-fu next to determine where they belong.

It turns out the "University of Pittsburgh" advertisement is for a postdoc. The "Bloomington Campus" and "Statistics" advertisements are for faculty positions. The "14 Early stage researchers" advertisement is for statistician positions. I removed the remaining four ("Summer Internship", "Underwriting Associate", "Grants Proposal Administrator", "Test and Evaluation Subject Matter Expert") as I don't think they are relevant to the analysis here.

dat[which(dat$Description == "University of Pittsburgh"),]$Position <- "Postdoc" 
dat[which(dat$Description %in% c("Bloomington Campus", "Statistics")),]$Position <- "Faculty"
dat[which(dat$Description %in% c("14 Early stage researchers")),]$Position <- "Statistician"
dat = dat[!(dat$Description %in% c("Summer Internship", "Underwriting Associate",
                                      "Grants Proposal Administrator", 
                                      "Test and Evaluation Subject Matter Expert")),]
dat[which(dat$Position == "Other"),]
## [1] Location    Description Date        Position   
## <0 rows> (or 0-length row.names)

OK, so now we have dealt with grouping all the positions. Let's use the lubridate R package to make the Date column more R friendly. I'm using the mdy() function to tell R this column contains dates in the form of "month/day/year". The month() function extracts the month from each of the rows.

table(month(mdy(dat$Date)))
## 
##   1   2   8   9  10  11  12 
##  53  44  53  96 111  78  48

Let's add a few other columns to our data frame.

dat$Position = factor(dat$Position)
dat$Date = mdy(dat$Date)
dat$month = factor(month(dat$Date, label=TRUE), 
                   levels = c("Aug", "Sep", "Oct", "Nov", "Dec", "Jan", "Feb"))
dat$dayOfWeek = wday(dat$Date, label = TRUE) # day of week

Data visualization

The frequency of job postings by position, day of the week and month:

ggplot(dat, aes(x = Position)) + geom_bar() # number of postings by position type

ggplot(dat, aes(x = dayOfWeek)) + geom_bar(position="dodge") # number of postings by day of week

ggplot(dat, aes(x = month)) + geom_bar() # number of postings by month

Job postings by date, day of the week and month (colors represent the type of position).

ggplot(dat, aes(x = Date, fill = Position)) + geom_bar(position="dodge")

ggplot(dat, aes(x = dayOfWeek, fill = Position)) + geom_bar(position="dodge")

ggplot(dat, aes(x = month, fill = Position)) + geom_bar(position="dodge")

Most academic faculty positions are posted Sept-Nov and most postdoc positions are posted after that time period.

---
title: "UF Department of Statistics Job Postings"
author: "Stephanie Hicks"
date: "23 Feb 2015"
output:
  html_document:
    keep_md: true
---
## Purpose
This Rmd uses the UF Department of Statistics Job Postings website to determine
the frequency of faculty, postdoc, lecturer and statistician jobs over the
academic year.

One caveat: the website only has data from Aug 2014 up until now,
so I cannot include the postings over the summer. I am interested in seeing
how these plots differ once the spring and summer of 2015 are included.
#### Load libraries
```{r, message=FALSE}
library(rvest)
library(stringr)
library(lubridate)
library(ggplot2)
```
#### Scrape data
First, we scrape the tables from the UF Statistics Jobs website.
I'm using the `rvest` package to parse the HTML pages. The data is contained in
tables, so I'm using `read_html()` to parse each page and `html_table()` to
extract its tables.
```{r}
pgs = vector("list", 17)
for(i in 1:17){
    # read_html() replaces html(), which newer versions of rvest deprecate
    jobs <- read_html(paste0("http://www.stat.ufl.edu/jobs/?page=", i))
    # stack all tables found on this page into one data frame
    pgs[[i]] = do.call(rbind, html_table(jobs))
}
dat = do.call(rbind, pgs)
colnames(dat) = c("Location", "Description", "Date")
```
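
The loop above collects one data frame per page and stacks them with `do.call(rbind, ...)`. Here is a minimal offline sketch of that stacking step. The two "pages" below are made-up toy values, not scraped data, and the `X1`..`X3` column names are only an assumption about what unnamed table columns might look like.

```r
# Toy stand-ins for two scraped pages (values invented for illustration)
page1 <- data.frame(X1 = c("Univ A", "Univ B"),
                    X2 = c("Lecturer", "Statistician"),
                    X3 = c("02/17/2015", "02/12/2015"),
                    stringsAsFactors = FALSE)
page2 <- data.frame(X1 = "Univ C",
                    X2 = "Postdoctoral Fellow",
                    X3 = "12/15/2014",
                    stringsAsFactors = FALSE)

pages <- list(page1, page2)

# do.call(rbind, pages) is equivalent to rbind(page1, page2)
toy <- do.call(rbind, pages)
colnames(toy) <- c("Location", "Description", "Date")
nrow(toy)
```

Pre-allocating the list and binding once at the end avoids growing a data frame inside the loop.
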
These are the top 10 most frequent job description titles.
```{r}
head(sort(table(dat$Description), decreasing = TRUE), 10)
```
#### Data Cleaning
Using the `str_detect()` function in the `stringr` R package, we can
use regular expressions to subset the data frame for any jobs that match the
pattern "Lecture".
```{r}
head(dat[str_detect(dat$Description, "Lecture"),])
```
Because `str_detect()` takes a single pattern, we can use `paste()` with
`collapse='|'` to join several alternatives into one regular expression and
subset the rows matching either "Lecture" or "Instructor".
```{r}
head(dat[str_detect(dat$Description, paste(c("Lecture", "Instructor"), collapse='|')),])
```
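
To see what the joined pattern looks like, here is a small sketch on made-up job titles (the `titles` vector is a toy example, not part of the scraped data):

```r
library(stringr)

# Toy job titles, invented for illustration
titles <- c("Senior Lecturer", "Instructor of Statistics", "Assistant Professor")

# Join the alternatives with "|" so they form a single regular expression
pattern <- paste(c("Lecture", "Instructor"), collapse = "|")
pattern

# TRUE for every title that contains at least one of the alternatives
matches <- str_detect(titles, pattern)
titles[matches]
```
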
For simplicity, I grouped the data into four categories:
1. faculty = tenured or non-tenured faculty position, including chairs, deans
and department heads.
2. postdoc = postdoctoral fellows
3. lecturer = lecturer or instructor
4. statistician = a statistician whose primary role is data analysis or
managing other data analysts.
```{r}
I_faculty = str_detect(dat$Description, paste(c("Professor", "Tenure", "tenure", "Faculty",
"Assistant", "Chair", "Dean", "Department",
"Head"), collapse='|'))
I_postdoc = str_detect(dat$Description, paste(c("Post", "Fellow"), collapse='|'))
I_lecturer = str_detect(dat$Description, paste(c("Lecture", "Instructor"), collapse='|'))
# All patterns are matched case-sensitively: wrapping a single element in
# ignore.case() has no effect, because c() and paste() drop the attribute.
I_statistician = str_detect(dat$Description, paste(c("Biostatistic",
                                                  "Statistician", "Scientist",
                                                  "Staff", "Professional", "Analyst",
                                                  "Researcher", "Programmer",
                                                  "Research Associate", "Master",
                                                  "Manager", "Director", "Investigator",
                                                  "Specialist", "Consultant", "VP",
                                                  "Bioinformatician", "Biometrician",
                                                  "Computational"), collapse='|'))
```
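
If case-insensitive matching is wanted, current versions of `stringr` provide `regex(..., ignore_case = TRUE)`. The sketch below uses made-up titles to show the difference. Note that the modifier applies to the whole pattern passed to `regex()`, so this is an alternative to the case-sensitive matching used in this analysis, not a description of it.

```r
library(stringr)

# Toy job titles, invented for illustration
titles <- c("14 Early stage researchers", "Senior Researcher", "Lecturer")

# Case-sensitive: only the capitalised "Researcher" matches
cs <- str_detect(titles, "Researcher")

# Case-insensitive: regex() with ignore_case = TRUE covers the whole pattern
ci <- str_detect(titles, regex("Researcher", ignore_case = TRUE))

data.frame(titles, cs, ci)
```
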
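Now, let's create a new column called "Position" holding the job category. The
nested `ifelse()` calls are checked in order (postdoc, faculty, lecturer,
statistician), so a description that matches several patterns gets the first
matching category.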
```{r}
dat$Position <- ifelse(I_postdoc, "Postdoc", ifelse(I_faculty, "Faculty",
ifelse(I_lecturer, "Lecturer",
ifelse(I_statistician, "Statistician", "Other"))))
dat[which(dat$Position == "Other"),]
```
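
The order of the nested `ifelse()` calls matters when a title matches more than one group. A minimal sketch with made-up titles and shortened pattern lists (both are assumptions for illustration only):

```r
library(stringr)

# Toy job titles, invented for illustration
titles <- c("Postdoctoral Research Associate", "Assistant Professor",
            "Lecturer", "Biostatistician", "Summer Internship")

# Shortened versions of the pattern lists used above
is_postdoc  <- str_detect(titles, "Post|Fellow")
is_faculty  <- str_detect(titles, "Professor|Assistant")
is_lecturer <- str_detect(titles, "Lecture|Instructor")
is_stat     <- str_detect(titles, "Biostatistic|Research Associate")

# The first test that is TRUE wins; anything unmatched falls through to "Other"
position <- ifelse(is_postdoc, "Postdoc",
            ifelse(is_faculty, "Faculty",
            ifelse(is_lecturer, "Lecturer",
            ifelse(is_stat, "Statistician", "Other"))))

data.frame(titles, position)
```

The first title matches both the postdoc and the statistician patterns, and is labelled "Postdoc" because that test comes first.
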
We see there are a few descriptions that could not be categorized using
the regex patterns above. We'll use some google-fu next to determine
where they belong.

It turns out the "University of Pittsburgh" advertisement is for a postdoc. The
"Bloomington Campus" and "Statistics" advertisements are for faculty positions.
The "14 Early stage researchers" advertisement is for statistician positions.
I removed the remaining four ("Summer Internship", "Underwriting Associate",
"Grants Proposal Administrator", "Test and Evaluation Subject Matter Expert")
as I don't think they are relevant to the analysis here.
```{r}
dat[which(dat$Description == "University of Pittsburgh"),]$Position <- "Postdoc"
dat[which(dat$Description %in% c("Bloomington Campus", "Statistics")),]$Position <- "Faculty"
dat[which(dat$Description %in% c("14 Early stage researchers")),]$Position <- "Statistician"
dat = dat[!(dat$Description %in% c("Summer Internship", "Underwriting Associate",
"Grants Proposal Administrator",
"Test and Evaluation Subject Matter Expert")),]
dat[which(dat$Position == "Other"),]
```
OK, so now we have dealt with grouping all the positions. Let's use the
`lubridate` R package to make the Date column more R friendly. I'm using the
`mdy()` function to tell R this column contains dates in the form of
"month/day/year". The `month()` function extracts the month from each of the
rows.
```{r}
table(month(mdy(dat$Date)))
```
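
As a quick check of what `mdy()` and `month()` return, here is a sketch on a single date string in the same "month/day/year" format as the Date column:

```r
library(lubridate)

# Parse one "month/day/year" string into a Date
d <- mdy("02/17/2015")

format(d)   # ISO representation of the parsed date
month(d)    # numeric month
```
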
Let's add a few other columns to our data frame.
```{r}
dat$Position = factor(dat$Position)
dat$Date = mdy(dat$Date)
dat$month = factor(month(dat$Date, label=TRUE),
levels = c("Aug", "Sep", "Oct", "Nov", "Dec", "Jan", "Feb"))
dat$dayOfWeek = wday(dat$Date, label = TRUE) # day of week
```
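
Setting the factor levels explicitly is what puts the months in academic-year order (Aug to Feb) rather than calendar order. A small base R sketch with made-up month labels:

```r
# Toy month labels, invented for illustration
m <- c("Jan", "Aug", "Oct", "Feb", "Aug")

# Levels listed in academic-year order; plots and tables follow this order
acad <- factor(m, levels = c("Aug", "Sep", "Oct", "Nov", "Dec", "Jan", "Feb"))

# Counts come out in level order, including months with zero postings
counts <- table(acad)
counts
```
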
#### Data visualization
The frequency of job postings by position, day of the week and month:
```{r}
ggplot(dat, aes(x = Position)) + geom_bar() # number of postings by position type
ggplot(dat, aes(x = dayOfWeek)) + geom_bar(position="dodge") # number of postings by day of week
ggplot(dat, aes(x = month)) + geom_bar() # number of postings by month
```
Job postings by date, day of the week and month (colors represent the type
of position).
```{r, message=FALSE}
ggplot(dat, aes(x = Date, fill = Position)) + geom_bar(position="dodge")
ggplot(dat, aes(x = dayOfWeek, fill = Position)) + geom_bar(position="dodge")
ggplot(dat, aes(x = month, fill = Position)) + geom_bar(position="dodge")
```
Most academic faculty positions are posted Sept-Nov and
most postdoc positions are posted after that time period.
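
A cross-tabulation is one way to read the same pattern as numbers rather than bars. The sketch below uses a tiny made-up data frame (`toy` is hypothetical and its counts are not results from the scraped data); on the real data the equivalent call would be `table(dat$month, dat$Position)`.

```r
# Toy data, invented for illustration only
toy <- data.frame(month    = c("Sep", "Sep", "Oct", "Jan", "Jan", "Feb"),
                  Position = c("Faculty", "Faculty", "Faculty",
                               "Postdoc", "Postdoc", "Faculty"),
                  stringsAsFactors = FALSE)

# Rows are months, columns are position types
tab <- table(toy$month, toy$Position)
tab

# Share of each position type's postings that falls in each month
shares <- prop.table(tab, margin = 2)
shares
```
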