stephaniehicks/2015-02-23_statsJobs.Rmd

## 2015-02-23_statsJobs.md

      
# UF Department of Statistics Job Postings

Stephanie Hicks

23 Feb 2015

## Purpose

This Rmd uses the UF Department of Statistics Job Postings website to determine
the frequency of faculty, postdoc, lecturer and statistician jobs over the
academic year.

One caveat: The website only has data starting from Aug 2014 up until now,
so I cannot include the postings over the summer, but I am interested in seeing
how these plots differ after including spring and summer of 2015.

#### Load libraries

```r
library(rvest)
library(stringr)
library(lubridate)
library(ggplot2)
```

#### Scrape data

First, we scrape the tables from the UF Statistics Jobs website.

I'm using the `rvest` package to parse the HTML pages. The data is contained in
tables in the HTML pages, so I'm using the `read_html()` and `html_table()`
functions to parse the HTML and extract the tables, respectively.

```r
pgs = vector("list", 17)
for(i in 1:17){
    # read_html() supersedes html(), which later rvest releases deprecated
    jobs <- read_html(paste0("http://www.stat.ufl.edu/jobs/?page=", i))
    pgs[[i]] = do.call(rbind, html_table(jobs))
}
dat = do.call(rbind, pgs)
colnames(dat) = c("Location", "Description", "Date")
```

These are the top 10 most frequent job description titles.

```r
head(sort(table(dat$Description), decreasing = TRUE), 10)
```

```
## 
##                Assistant Professor                Postdoctoral Fellow 
##                                 26                                 18 
##                    Biostatistician Assistant/Associate/Full Professor 
##                                 17                                 11 
##                       Statistician      Assistant/Associate Professor 
##                                 10                                  8 
##   Tenure Track Assistant Professor            Postdoctoral Fellowship 
##                                  8                                  7 
##   Assistant or Associate Professor  Assistant Professor of Statistics 
##                                  6                                  6
```

#### Data Cleaning

Using the `str_detect()` function in the `stringr` R package, we can
use regular expressions to subset the data frame for any jobs that match the
pattern "Lecture".

```r
head(dat[str_detect(dat$Description, "Lecture"),])
```

```
##                                    Location
## 9   INDIANA UNIVERSITY / BLOOMINGTON CAMPUS
## 15                    Mount Holyoke College
## 19                    University of Glasgow
## 100                Department of Statistics
## 118                      Harvard Statistics
## 119                      Harvard Statistics
##                                           Description       Date
## 9                                            Lecturer 02/17/2015
## 15                    Visiting Lecturer in Statistics 02/12/2015
## 19  Lecturer / Senior Lecturer / Reader in Statistics 02/10/2015
## 100                       Full Time Lecturer Position 12/23/2014
## 118                                          Lecturer 12/15/2014
## 119                                   Senior Lecturer 12/15/2014
```

Because `str_detect()` matches each string against a single pattern, we can
use the `paste()` function with `collapse='|'` to join several alternatives into
one regular expression and subset the rows matching either "Lecture" or
"Instructor".

```r
head(dat[str_detect(dat$Description, paste(c("Lecture", "Instructor"), collapse='|')),])
```

```
##                                    Location
## 9   INDIANA UNIVERSITY / BLOOMINGTON CAMPUS
## 15                    Mount Holyoke College
## 19                    University of Glasgow
## 100                Department of Statistics
## 118                      Harvard Statistics
## 119                      Harvard Statistics
##                                           Description       Date
## 9                                            Lecturer 02/17/2015
## 15                    Visiting Lecturer in Statistics 02/12/2015
## 19  Lecturer / Senior Lecturer / Reader in Statistics 02/10/2015
## 100                       Full Time Lecturer Position 12/23/2014
## 118                                          Lecturer 12/15/2014
## 119                                   Senior Lecturer 12/15/2014
```

For simplicity, I grouped the data into four categories:

1. faculty = tenured or non-tenured faculty position including chairs, deans
and department heads.
2. postdoc = postdoctoral fellows
3. lecturer = lecturer or instructor
4. statistician = a statistician whose primary role is data analysis or
managing other data analysts.

```r
I_faculty = str_detect(dat$Description, paste(c("Professor", "Tenure", "tenure", "Faculty", 
                                             "Assistant", "Chair", "Dean", "Department", 
                                             "Head"), collapse='|'))
I_postdoc = str_detect(dat$Description, paste(c("Post", "Fellow"), collapse='|'))
I_lecturer = str_detect(dat$Description, paste(c("Lecture", "Instructor"), collapse='|'))
# All keywords are matched case-sensitively. The ignore.case() wrappers were
# dropped: c() and paste() discard them, so they never affected the match.
I_statistician = str_detect(dat$Description, paste(c("Biostatistic", 
                                                  "Statistician", "Scientist", 
                                                  "Staff", "Professional", "Analyst", 
                                                  "Researcher", "Programmer", 
                                                  "Research Associate", "Master",
                                                  "Manager", "Director", "Investigator",
                                                  "Specialist", "Consultant", "VP", 
                                                  "Bioinformatician", "Biometrician", 
                                                  "Computational"), collapse='|'))
```
Now, let's create a new column variable called "Position" with the job titles.

```r
dat$Position <- ifelse(I_postdoc, "Postdoc", ifelse(I_faculty, "Faculty", 
                                         ifelse(I_lecturer, "Lecturer", 
                                         ifelse(I_statistician, "Statistician", "Other"))))
dat[which(dat$Position == "Other"),]
```

```
##                             Location
## 56   IDEAS European training network
## 74                Aerojet Rocketdyne
## 143      Odyssey Reinsurance Company
## 184              NC State University
## 328               Indiana University
## 340  Univeristy of California, Davis
## 420 Applied Research Solutions, Inc.
## 448            Computational Biology
##                                   Description       Date Position
## 56                 14 Early stage researchers 01/26/2015    Other
## 74                          Summer Internship 01/16/2015    Other
## 143                    Underwriting Associate 12/02/2014    Other
## 184             Grants Proposal Administrator 11/11/2014    Other
## 328                        Bloomington Campus 10/03/2014    Other
## 340                                Statistics 09/30/2014    Other
## 420 Test and Evaluation Subject Matter Expert 09/04/2014    Other
## 448                  University of Pittsburgh 08/27/2014    Other
```

We see there are a few descriptions that could not be categorized using
the regex patterns provided above.  We'll use some google-fu next to determine
where they belong.

Turns out the "University of Pittsburgh" advertisement is for a postdoc. The
"Bloomington Campus" and "Statistics" advertisements are for faculty positions.
The "14 Early stage researchers" are for statistician positions. I removed
the remaining four ("Summer Internship", "Underwriting Associate",
"Grants Proposal Administrator", "Test and Evaluation Subject Matter Expert")
as I don't think they are relevant to the analysis here.

```r
dat[which(dat$Description == "University of Pittsburgh"),]$Position <- "Postdoc" 
dat[which(dat$Description %in% c("Bloomington Campus", "Statistics")),]$Position <- "Faculty"
dat[which(dat$Description %in% c("14 Early stage researchers")),]$Position <- "Statistician"
dat = dat[!(dat$Description %in% c("Summer Internship", "Underwriting Associate",
                                      "Grants Proposal Administrator", 
                                      "Test and Evaluation Subject Matter Expert")),]
dat[which(dat$Position == "Other"),]
```

```
## [1] Location    Description Date        Position   
## <0 rows> (or 0-length row.names)
```

OK, so now we have dealt with grouping all the positions. Let's use the
`lubridate` R package to make the Date column more R friendly.  I'm using the
`mdy()` function to tell R this column contains dates in the form of
"month/day/year". The `month()` function extracts the month from each of the
rows.

```r
table(month(mdy(dat$Date)))
```

```
## 
##   1   2   8   9  10  11  12 
##  53  44  53  96 111  78  48
```

Let's add a few other columns to our data frame.

```r
dat$Position = factor(dat$Position)
dat$Date = mdy(dat$Date)
dat$month = factor(month(dat$Date, label=TRUE), 
                   levels = c("Aug", "Sep", "Oct", "Nov", "Dec", "Jan", "Feb"))
dat$dayOfWeek = wday(dat$Date, label = TRUE) # day of week
```

#### Data visualization

The frequency of job postings by position, day of the week and month:

```r
ggplot(dat, aes(x = Position)) + geom_bar() # frequency of jobs by type
ggplot(dat, aes(x = dayOfWeek)) + geom_bar(position="dodge")
ggplot(dat, aes(x = month)) + geom_bar() # frequency of jobs by month
```

Job postings by date, day of the week and month (colors represent the type
of position).

```r
ggplot(dat, aes(x = Date, fill = Position)) + geom_bar(position="dodge")
ggplot(dat, aes(x = dayOfWeek, fill = Position)) + geom_bar(position="dodge")
ggplot(dat, aes(x = month, fill = Position)) + geom_bar(position="dodge")
```

Most academic faculty positions are posted Sept-Nov and
most postdoc positions are posted after that time period.

  
## 2015-02-23_statsJobs.Rmd
---
title: "UF Department of Statistics Job Postings"
author: "Stephanie Hicks"
date: "23 Feb 2015"
output:
  html_document:
    keep_md: true
---

## Purpose

This Rmd uses the UF Department of Statistics Job Postings website to determine
the frequency of faculty, postdoc, lecturer and statistician jobs over the
academic year.

One caveat: The website only has data starting from Aug 2014 up until now,
so I cannot include the postings over the summer, but I am interested in seeing
how these plots differ after including spring and summer of 2015.


#### Load libraries

```{r, message=FALSE}
library(rvest)
library(stringr)
library(lubridate)
library(ggplot2)
```

#### Scrape data

First, we scrape the tables from the UF Statistics Jobs website.

I'm using the `rvest` package to parse the HTML pages.  The data is contained in
tables in the HTML pages, so I'm using the `read_html()` and `html_table()`
functions to parse the HTML and extract the tables, respectively.

```{r}
pgs = vector("list", 17)
for(i in 1:17){
    # read_html() supersedes html(), which later rvest releases deprecated
    jobs <- read_html(paste0("http://www.stat.ufl.edu/jobs/?page=", i))
    pgs[[i]] = do.call(rbind, html_table(jobs))
}
dat = do.call(rbind, pgs)
colnames(dat) = c("Location", "Description", "Date")
```
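
To see what the loop is doing without hitting the network, here is a minimal sketch that applies the same `html_table()` plus `do.call(rbind, ...)` pattern to two inline HTML snippets. The snippets, their rows and the `parse_page()` helper are made up for illustration, and the sketch assumes a version of `rvest` that provides `read_html()`.

```r
library(rvest)

# Two hypothetical "pages", each holding one small table (made-up rows).
page1 <- paste0(
  "<table>",
  "<tr><th>Where</th><th>What</th><th>When</th></tr>",
  "<tr><td>University A</td><td>Lecturer</td><td>02/17/2015</td></tr>",
  "</table>"
)
page2 <- paste0(
  "<table>",
  "<tr><th>Where</th><th>What</th><th>When</th></tr>",
  "<tr><td>University B</td><td>Postdoctoral Fellow</td><td>12/23/2014</td></tr>",
  "<tr><td>Company C</td><td>Biostatistician</td><td>08/27/2014</td></tr>",
  "</table>"
)

# Hypothetical helper: parse one page and stack every table found on it.
parse_page <- function(html_text) {
  do.call(rbind, html_table(read_html(html_text)))
}

# Stack the per-page tables, then rename the columns as in the analysis.
toy <- do.call(rbind, lapply(list(page1, page2), parse_page))
colnames(toy) <- c("Location", "Description", "Date")
toy
```

Each page contributes its rows to one combined table, which is why the column names only need to be set once at the end.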

These are the top 10 most frequent job description titles.

```{r}
head(sort(table(dat$Description), decreasing = TRUE), 10)
```

#### Data Cleaning

Using the `str_detect()` function in the `stringr` R package, we can
use regular expressions to subset the data frame for any jobs that match the
pattern "Lecture".

```{r}
head(dat[str_detect(dat$Description, "Lecture"),])
```

Because `str_detect()` matches each string against a single pattern, we can
use the `paste()` function with `collapse='|'` to join several alternatives into
one regular expression and subset the rows matching either "Lecture" or
"Instructor".

```{r}
head(dat[str_detect(dat$Description, paste(c("Lecture", "Instructor"), collapse='|')),])
```
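
As a small illustration of how the collapsed pattern behaves, here is a sketch on a few made-up job titles (the `titles` vector is hypothetical, not taken from the scraped data).

```r
library(stringr)

# Made-up job titles for illustration.
titles <- c("Senior Lecturer", "Instructor of Statistics", "Assistant Professor")

# paste() with collapse = "|" builds one regular expression with alternation.
pattern <- paste(c("Lecture", "Instructor"), collapse = "|")
pattern
#> [1] "Lecture|Instructor"

# One TRUE/FALSE per title: does it contain either keyword?
is_teaching <- str_detect(titles, pattern)
is_teaching
#> [1]  TRUE  TRUE FALSE

# The logical vector can be used directly for subsetting.
titles[is_teaching]
```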

For simplicity, I grouped the data into four categories:

1. faculty = tenured or non-tenured faculty position including chairs, deans
and department heads.
2. postdoc = postdoctoral fellows
3. lecturer = lecturer or instructor
4. statistician = a statistician whose primary role is data analysis or
managing other data analysts.

```{r}
I_faculty = str_detect(dat$Description, paste(c("Professor", "Tenure", "tenure", "Faculty",
                                             "Assistant", "Chair", "Dean", "Department",
                                             "Head"), collapse='|'))
I_postdoc = str_detect(dat$Description, paste(c("Post", "Fellow"), collapse='|'))
I_lecturer = str_detect(dat$Description, paste(c("Lecture", "Instructor"), collapse='|'))
# All keywords are matched case-sensitively. The ignore.case() wrappers were
# dropped: c() and paste() discard them, so they never affected the match.
I_statistician = str_detect(dat$Description, paste(c("Biostatistic",
                                                  "Statistician", "Scientist",
                                                  "Staff", "Professional", "Analyst",
                                                  "Researcher", "Programmer",
                                                  "Research Associate", "Master",
                                                  "Manager", "Director", "Investigator",
                                                  "Specialist", "Consultant", "VP",
                                                  "Bioinformatician", "Biometrician",
                                                  "Computational"), collapse='|'))
```
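
If some keywords should match regardless of case, two options are sketched below, assuming stringr 1.0 or later (which provides `regex()` and uses ICU regular expressions). Neither is what the analysis above does. The example title comes from the scraped data. Note that applying either option to the real patterns would change which rows fall through to "Other".

```r
library(stringr)

title <- "14 Early stage researchers"

# Case-sensitive by default: "Researcher" does not match "researchers".
sensitive <- str_detect(title, "Statistician|Researcher")

# Option 1: make the whole collapsed pattern case-insensitive.
whole <- str_detect(title, regex("Statistician|Researcher", ignore_case = TRUE))

# Option 2: make a single alternative case-insensitive with an inline flag group.
partial <- str_detect(title, "Statistician|(?i:Researcher)")

c(sensitive = sensitive, whole = whole, partial = partial)
```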

Now, let's create a new column variable called "Position" with the job titles.

```{r}
dat$Position <- ifelse(I_postdoc, "Postdoc", ifelse(I_faculty, "Faculty",
                                         ifelse(I_lecturer, "Lecturer",
                                         ifelse(I_statistician, "Statistician", "Other"))))
dat[which(dat$Position == "Other"),]
```
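
The order of the nested `ifelse()` calls matters when a title matches more than one group: the first test that is TRUE wins. Here is a minimal sketch with made-up titles and shortened keyword lists (both are assumptions for illustration, not the full patterns used above).

```r
library(stringr)

# Made-up titles; the first one matches both the postdoc and statistician keywords.
titles <- c("Postdoctoral Research Associate", "Assistant Professor",
            "Senior Lecturer", "Data Analyst", "Summer Internship")

# Shortened keyword lists, for illustration only.
is_postdoc      <- str_detect(titles, "Post|Fellow")
is_faculty      <- str_detect(titles, "Professor|Faculty")
is_lecturer     <- str_detect(titles, "Lecture|Instructor")
is_statistician <- str_detect(titles, "Statistician|Analyst|Research Associate")

# Tests are evaluated from the outside in, so "Postdoc" takes precedence.
position <- ifelse(is_postdoc, "Postdoc",
            ifelse(is_faculty, "Faculty",
            ifelse(is_lecturer, "Lecturer",
            ifelse(is_statistician, "Statistician", "Other"))))

data.frame(titles, position)
```

The first title ends up as "Postdoc" even though it also contains "Research Associate", and anything that matches none of the lists falls through to "Other".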

We see there are a few descriptions that could not be categorized using
the regex patterns provided above.  We'll use some google-fu next to determine
where they belong.

Turns out the "University of Pittsburgh" advertisement is for a postdoc. The
"Bloomington Campus" and "Statistics" advertisements are for faculty positions.
The "14 Early stage researchers" are for statistician positions. I removed
the last four ("Summer Internship", "Underwriting Associate",
"Grants Proposal Administrator", "Test and Evaluation Subject Matter Expert")
as I don't think they are relevant to the analysis here.

```{r}
dat[which(dat$Description == "University of Pittsburgh"),]$Position <- "Postdoc"
dat[which(dat$Description %in% c("Bloomington Campus", "Statistics")),]$Position <- "Faculty"
dat[which(dat$Description %in% c("14 Early stage researchers")),]$Position <- "Statistician"
dat = dat[!(dat$Description %in% c("Summer Internship", "Underwriting Associate",
                                      "Grants Proposal Administrator",
                                      "Test and Evaluation Subject Matter Expert")),]
dat[which(dat$Position == "Other"),]
```

OK, so now we have dealt with grouping all the positions. Let's use the
`lubridate` R package to make the Date column more R friendly.  I'm using the
`mdy()` function to tell R this column contains dates in the form of
"month/day/year". The `month()` function extracts the month from each of the
rows.

```{r}
table(month(mdy(dat$Date)))
```

Let's add a few other columns to our data frame.

```{r}
dat$Position = factor(dat$Position)
dat$Date = mdy(dat$Date)
dat$month = factor(month(dat$Date, label=TRUE),
                   levels = c("Aug", "Sep", "Oct", "Nov", "Dec", "Jan", "Feb"))
dat$dayOfWeek = wday(dat$Date, label = TRUE) # day of week
```
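
Here is a small sketch of the date handling on three made-up date strings. To keep it independent of the system locale, it builds the month labels from base R's `month.abb` instead of `month(..., label = TRUE)`; that swap is my own choice for this example, not what the chunk above does.

```r
library(lubridate)

# Made-up posting dates in the same "month/day/year" text format.
raw <- c("02/17/2015", "12/23/2014", "08/27/2014")
dates <- mdy(raw)

month(dates)  # numeric month: 2 12 8
wday(dates)   # numeric weekday, Sunday = 1 by default: 3 3 4

# Order months by academic year rather than calendar year.
# month.abb (base R) supplies English abbreviations regardless of locale.
academic_levels <- c("Aug", "Sep", "Oct", "Nov", "Dec", "Jan", "Feb")
month_f <- factor(month.abb[month(dates)], levels = academic_levels)

# Sorting (and plotting) now follows the level order: Aug, Dec, Feb.
sort(month_f)
```

Setting the levels explicitly is what makes the month axis in the plots run from August to February instead of January to December.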

#### Data visualization

The frequency of job postings by position, day of the week and month:

```{r}
ggplot(dat, aes(x = Position)) + geom_bar() # frequency job by type
ggplot(dat, aes(x = dayOfWeek)) + geom_bar(position="dodge")
ggplot(dat, aes(x = month)) + geom_bar() # frequency job by month
```

Job postings by date, day of the week and month (colors represent the type
of position).

```{r, message=FALSE}
ggplot(dat, aes(x = Date, fill = Position)) + geom_bar(position="dodge")
ggplot(dat, aes(x = dayOfWeek, fill = Position)) + geom_bar(position="dodge")
ggplot(dat, aes(x = month, fill = Position)) + geom_bar(position="dodge")
```
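
With its default count statistic, `geom_bar()` draws one bar per combination of x value and fill group, so the numbers behind the last plot are a month-by-position cross-tabulation. A minimal base R sketch on made-up postings (the `toy` data frame is hypothetical):

```r
# Made-up postings for illustration.
toy <- data.frame(
  month = factor(c("Sep", "Sep", "Oct", "Jan", "Jan", "Jan"),
                 levels = c("Aug", "Sep", "Oct", "Nov", "Dec", "Jan", "Feb")),
  Position = factor(c("Faculty", "Faculty", "Faculty", "Postdoc", "Postdoc", "Faculty"))
)

# Rows are months (in academic-year order), columns are position types.
counts <- table(toy$month, toy$Position)
counts

# Share of each position type posted in each month (each column sums to 1).
shares <- prop.table(counts, margin = 2)
round(shares, 2)
```

Looking at column shares rather than raw counts makes it easier to compare the timing of position types that differ a lot in total volume.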

Most academic faculty positions are posted Sept-Nov and
most postdoc positions are posted after that time period.