# Predictive Analytics in Enrollment Management
*Abstract:* A simple example of predictive analytics for Enrollment Managers using **FREE** tools.
*TL;DR:* Using `R` we can fit all sorts of complex models in Enrollment Management, quickly, and for no cost. In truth, data modeling can help uncover complex relationships at your school that are not easily visible in our usual tables and charts.
However, predictive analytics is not the golden ticket to enrollment success. You will need to understand not only what the model is telling you, but also the risks associated with being incorrect.
Lastly, once you have your "model", how do you actually use it? You will need to think about how you incorporate your new model into your current decision making processes.
## Why Write this Post?
Over the last few weeks, I feel like the discussion around "Predictive Analytics" within Enrollment Management has really picked up steam. There are a ton of great vendors out there, but the aim of this post is to show you how simple it can be to build predictive models internally, for the price of "on the house".
I don't mean to imply that machine learning is easy by any stretch, but I do intend to highlight how quickly models **can** be built. If you know your data, and understand various techniques, model building isn't the hard part.
More than likely though, you will want to take some time to think about your data and the output you see. Not to mention, how you would actually operationalize your model so that it runs quietly behind the scenes.
## Quick Overview
My goal for this post is to show how we can do `predictive analytics` in Enrollment Management using `R`. In this example, we will fit a model to predict whether an applicant is admitted.
In full disclosure, I am going to avoid the technical details as much as possible, although understanding **how** these models work is critically important.
## Previous Work and Discussion
Let the debate around predictive analytics begin! I am just kidding, but there has been quite a bit of press recently on the usage of predictive analytics within higher ed and Enrollment Management.
Here are a few (self-edited, sometimes snarky) headlines.
- [Colleges are using Big Data to predict which students will do well](http://www.fastcoexist.com/3019859/futurist-forum/colleges-are-using-big-data-to-predict-which-students-will-do-well-before-the?partner=rss)
- [The Future of Predictive Analytics in Higher Ed](http://www.centerdigitaled.com/news/The-Future-of-Predictive-Analytics-Higher-Ed.html)
- [An article 5 years too late](http://www.insidehighered.com/news/2013/10/24/political-campaign-style-targeting-comes-student-search)
- [FAFSA and the 3 schools in the country use it](http://www.insidehighered.com/news/2013/10/28/colleges-use-fafsa-information-reject-students-and-potentially-lower-financial-aid#.Um5zCzvMmdI.twitter)
I do think it's worth noting that `predictive analytics` is actually not a new concept. Technology is making it much easier to do, although the underlying methodologies have been applied to higher ed for some time now. Below are just a few journal articles.
- [Enrollment Models Using Data Mining](http://nandeshwar.info/wp-content/uploads/2008/11/DMWVU_Project.pdf)
- [Data Mining: A Magic Technology for College Recruitment](http://www.ocair.org/files/presentations/paper2008_09/tongshan_chang_2009.pdf)
- [Differential Pricing in Undergraduate Education](http://www.nber.org/papers/w19183.pdf?new_window=1)
I included the last link above because "pricing" is a pretty hot topic at the moment as well.
On one hand, you have schools blocking [College Abacus](https://collegeabacus.com/), which is basically [Kayak](http://www.kayak.com) for college pricing. On the other, institutions are required to report all sorts of data to the government through [IPEDS](http://nces.ed.gov/ipeds/), where it is displayed on a number of sites including the [College Affordability and Transparency Center](http://collegecost.ed.gov).
My point? There is an academic argument for each side of the debate, whether it's predictive analytics or transparency. Outside of the financial reporting of public companies, what other industry has to openly report its *performance* at this level of detail to the public?
As such, the trends of our industry are forcing us to think differently about how we do things. Now that it's here, we need to start to get comfortable with what *it* can do. More importantly though, we need to understand the risks associated with modeling our enrollment data.
## The Process
As mentioned a few times above, I am going to use the open-source statistical programming language, `R`, to download and model our data.
Here is our workflow:
1. Grab a dataset from the web
2. Fit a predictive model (logistic regression)
3. Assess the accuracy of the model
#### 1) Let's grab the data
If you are reading this post and are a regular `SPSS` user, this next step is pretty cool. `R` allows us to grab data from the web. If you were just using `SPSS`, it would require that you scrape (or download) the data, and then fire up the software to read in the external dataset.
That's way too much effort!
The code below grabs a very small admissions dataset. If you are an analyst, you should [check out](http://www.ats.ucla.edu/) UCLA's website. It's a great resource for analytical methods and code examples.
Below, we will define the URL for the dataset, and then use this value to read in the CSV file from the web into a `data.frame` object called `df`.
```{r getdata, comment=NA, cache=TRUE}
URL = "http://www.ats.ucla.edu/stat/data/binary.csv"
df = read.csv(URL)
```
Let's confirm that the data are in our `R` session.
```{r comment=NA}
dim(df)
summary(df)
```
The command `dim(df)` simply asks `R` to print out the dimensions of our dataset. In this case, we have 400 rows and 4 columns.
The `head` command prints the first few rows of the data, so we can see what we have.
```{r comment=NA}
head(df)
```
I probably should have done this earlier, so let's talk about the dataset.
The first column, `admit`, is the variable we want to predict. In this case, our variable represents a Yes/No decision. Yes is coded as a `1`, No is coded as a `0`.
This type of variable is prevalent in Enrollment Management. To name a few ...
- Does a suspect respond to our search campaign?
- Does a recruit apply?
- Do we retain a student?
- Will the student graduate in 4 years?
- Does the student pay a deposit?
- Does the student melt (between May and September)?
- Does the recruit open up the next email we send them?
Even if the variable doesn't exist in a natural Yes/No state, we can usually force our data into this format.
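For example (a minimal, hypothetical sketch; these columns are not part of the dataset we use below), `ifelse` makes it easy to collapse a date or continuous field into a 0/1 flag:

```{r eval=FALSE}
## hypothetical columns -- not part of the UCLA dataset used in this post
df$melted <- ifelse(is.na(df$deposit_paid_date), 1, 0)  # 1 = melted, 0 = showed up
df$honors <- ifelse(df$hs_gpa >= 3.5, 1, 0)             # collapse a GPA into a 0/1 flag
```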
The other 3 variables are our `features`, or `predictor` variables. We will be using `gre`, `gpa`, and `rank` to predict the applicant's admission to graduate school.
The variable `gre` is numeric on an 800-point scale, `gpa` is also numeric on a 4.0 scale, and `rank` appears to be categorical, with values ranging from 1 to 4 based on the admissions counselor's read of the student.
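One aside: because `rank` is really a category rather than a true number, you may want to tell `R` to treat it as a factor before modeling. I am leaving it as-is below to keep the example simple, but the conversion would be a one-liner:

```{r eval=FALSE}
## optional: treat rank as a categorical variable instead of a number
df$rank <- factor(df$rank)
```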
#### 2) Fit a Model
Now let's fit our predictive model.
`R` is really flexible. All I have to do below is tell `R` to fit a model where I am trying to predict `admit` given every other variable in the dataset.
Below, I indicate this concept using the syntax `admit ~ .`
```{r comment=NA}
yield_model = glm(admit ~ .,
                  data=df,
                  family=binomial())
summary(yield_model)
```
When we use the summary command above, we print out the "fit" of the model. In the section called `Coefficients:`, we get the estimated weights, or effects, of each variable on the admission status.
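If you want those weights on a slightly more intuitive scale, exponentiating the coefficients of a logistic regression converts them into odds ratios (values above 1 push the odds of admission up, values below 1 push them down):

```{r comment=NA}
## convert the log-odds coefficients into odds ratios
exp(coef(yield_model))
```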
#### 3) Assess the Model
Now that we have fit a model, let's "score" our data. Imagine that you were using last year's applicant pool to predict the admission status of this year's class.
In the code below, we are going to append the probability of being admitted. We can then use this score to assess how "accurate" our predicted value truly is.
```{r comment=NA}
df = transform(df,
               score = predict(yield_model,
                               newdata=df,
                               type="response"))
```
When we used the `summary` command earlier, we printed out some basic stats on the variables in our dataset. Because `admit` is coded as `0/1`, the average of this variable is equivalent to the proportion of `admit = Yes` in the dataset. In this case, 32% of the applicants were admitted.
This is important because our model will calibrate the scores relative to this proportion. If our new data are wildly different, the model will not perform that well.
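A quick way to confirm that base rate is to take the mean of the `0/1` variable directly:

```{r comment=NA}
## proportion of applicants admitted -- the base rate the model is calibrated to
mean(df$admit)
```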
Let's print out the distribution of predicted scores.
```{r echo=FALSE}
hist(df$score,
     xlab="Our Predicted Probability",
     ylab="# of Students",
     main="Distribution of Predicted Admit Probabilities",
     col="red",
     xlim=c(0,1),
     breaks=25)
```
Now let's look at the distribution of the scores based on the *actual* admission status.
If you do not already have the library `ggplot2` installed, simply use the command `install.packages("ggplot2")` before executing the code below.
```{r}
library(ggplot2)
ggplot(df, aes(x=score, fill=factor(admit) )) + geom_density(alpha=.3)
```
It's nice to see that the peak of the predicted scores for admitted students sits higher than the peak for rejected students, but I am not thrilled by this plot. Toward the lower end of the scores, it looks like the model was not able to accurately differentiate between admits and rejects.
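One simple way to put a number on that overlap is to pick a cutoff and cross-tabulate predicted versus actual decisions. The 0.5 cutoff below is an arbitrary choice on my part, but it gives us a quick confusion matrix:

```{r comment=NA}
## cross-tab of actual vs. predicted admits at an (arbitrary) 0.5 cutoff
table(actual = df$admit, predicted = as.numeric(df$score > 0.5))
```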
Below, we are going to use another package, `ROCR`, for some other "goodness-of-fit" metrics. For help on this package, [go here](http://rocr.bioinf.mpi-sb.mpg.de/). I highly recommend reviewing the `PowerPoint` file that is included on the site.
```{r comment=NA, warning=FALSE, message=FALSE}
library(ROCR)
pred <- prediction(df$score, df$admit)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf, colorize=T, main="Lift Chart")
```
Reading this plot from left to right, ideally the line would spike "early" in the chart. In general, you can think of the 45-degree line as the baseline, and the more "lift" above that line, the better.
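If you want to see that 45-degree reference line on the chart, you can add it right after the `plot` call above:

```{r eval=FALSE}
## add the 45-degree "no better than random" reference line to the plot above
abline(a = 0, b = 1, lty = 2)
```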
Finally, I am going to compute a metric, `AUC`. The higher the number, the better. To learn more about `AUC`, [check out this page](http://en.wikipedia.org/wiki/Receiver_operating_characteristic). For a rule-of-thumb interpretation of the score, [look here](http://gim.unmc.edu/dxtests/roc3.htm).
```{r comment=NA}
auc = performance(pred, 'auc')
auc@y.values[[1]]
```
Real quick. You may have noticed that I usually access our key values using the `$` operator, but needed to use `@` above. This is because the object returned from `performance` is an `S4` class in `R`. The more you play around, the more you will see this object class appear from time to time, but you can usually access your data using `$`.
From above, we see that the `AUC` for our model is `r auc@y.values[[1]]`.
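If you are curious what else is tucked inside that `S4` object, you can list its slots:

```{r comment=NA}
## list the slots available on the S4 object returned by performance()
slotNames(auc)
```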
In truth, the model doesn't fit that well. Intuitively, we can confirm this by binning our scores into deciles and looking at the actual admit rate within each band.
```{r comment=NA}
library(plyr)
## add a new variable, band, which puts the score into 10 groups
df = transform(df,
               band = cut(score,
                          breaks = seq(0, 1, .1),
                          right = FALSE))
## create a summary table, by group, that looks at some summary stats for each band
ddply(df, .(band), summarise,
      applicants = length(admit),
      admits = sum(admit),
      admit_rate = mean(admit))
```
For example, there were 2 applicants with a predicted probability of admission in the 70-80% band. Of these 2 applicants, only 1 was actually admitted.
In a perfect world, the higher the score, the larger the "true" admit rate we would have seen.
## Summary
Hopefully this was a fairly gentle introduction to how quickly you can fit a predictive model for your EM team. Conceptually, it doesn't have to be hard, although interpreting the results can be tricky. Regardless, you can explore what is possible for free with open-source statistical software.
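As a parting sketch, operationalizing a model like this can be as simple as scoring a new file of applicants and saving the results. The file name below is hypothetical; the new data would need the same predictors (`gre`, `gpa`, `rank`) we modeled on:

```{r eval=FALSE}
## hypothetical: score this year's applicants with last year's model
new_apps <- read.csv("new_applicants.csv")
new_apps$score <- predict(yield_model, newdata = new_apps, type = "response")
write.csv(new_apps, "scored_applicants.csv", row.names = FALSE)
```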
Hey, you might even have some fun writing code!