// Additional tools for machine learning and predictive analytics in Stata
/*
Author: Jared Knowles
Date: 09/12/2018
Purpose: Survey of some additional code helpful in conducting and explaining
or demonstrating predictive analytics to stakeholders.
You do not need to run all of this code - this is a survey of commands that
tackle different techniques. Pick and choose what might be most useful to you.
*/
// Discriminant Analysis in Stata
// You need to load this up with more variables for it to be informative.
// Discriminant analysis finds combinations of predictor variables that best
// differentiate members of different groups, in this case, ontime graduates
// and non-graduates
discrim lda scale_score_7_math sch_g7_lep_per sch_g7_gifted_per pct_days_absent_7 ell_7 iep_7 frpl_7 male, group(ontime_grad)
discrim qda scale_score_7_math sch_g7_lep_per sch_g7_gifted_per pct_days_absent_7 ell_7 iep_7 frpl_7 male, group(ontime_grad)
/*
Variable importance can be calculated in Stata using standardized regression
coefficients in logistic regression. This allows for more direct comparison of
the magnitude and power of "effect sizes".
*/
// ssc install center if you need it
preserve // we are going to modify the data in place
center scale_score_7_math sch_g7_lep_per sch_g7_gifted_per pct_days_absent_7 ell_7 iep_7 frpl_7 male, inplace standardize
// fit your model
logit ontime_grad scale_score_7_math sch_g7_lep_per sch_g7_gifted_per pct_days_absent_7 ell_7 iep_7 frpl_7 male, noconstant
// can restore to the original data now
restore
// Plot the model coefficients with coefplot
// ssc install coefplot if you need it
coefplot, xline(0) xtitle(Standardized Coefficients)
/*
Principal component analysis (PCA) can be useful to explore how variables
relate to one another. PCA is closely related to factor analysis, and you can
use it to explore and graph how variables correlate with each other.
*/
pca scale_score_7_math male sch_g7_lep_per sch_g7_gifted_per
loadingplot
// look at individual observations' scores
scoreplot
/*
Setting the probability cutoff for identifying students for additional attention
or "flagging" them in the EWS is a consequential decision. It can be helpful
to create a graph that shows the relationship between the probability threshold
and the percent of students who graduate. This is code you've already used to
explore relationships like this with predictors, adapted now to use the predicted
probabilities.
*/
// Fit an example model, you can substitute your own
logit ontime_grad scale_score_7_math male i.frpl_7 sch_g7_lep_per sch_g7_gifted_per
// predict
predict yhat
// Cut the predicted probability into categorical bins
egen prob_cat = cut(yhat), at(0(0.1)1)
tab prob_cat
// Compute the probability of graduating within each bin
egen prob_ontime_grad = mean(ontime_grad), by(prob_cat)
// plot
scatter prob_ontime_grad prob_cat
/*
This do file provides guidance and Stata syntax examples for the hands-on predictive analytics
session during the Fall 2018 Cohort 9 Strategic Data
Project Workshop in Denver.
During the workshop, we'll ask you to work with stakeholders to develop a dropout
early warning system for 7th grade students for a simulated state education agency
(Montucky) using student data. This guide will provide you with a baseline model to get you
started and introduce you to the main concepts of predictive analytics. From
there, you will work together with stakeholders to incorporate their values
into the model, secure their support and adoption of the model, and practice
an inclusive and iterative design process to building analytics.
As a data analyst, your role in this process is to measure the accuracy of the
model and communicate the tradeoffs of different models in terms of accuracy
and usability. Together with stakeholders, you might find that the most accurate
model does not reflect the values of the end users. Perhaps the most accurate
model requires variables not available for all students, or predicts all students
in certain categories to be at-risk. These are issues you will have to work
together with stakeholders to address.
Logistic regression is one tool you can use, and we'll demonstrate it here.
There are many other techniques of increasing complexity. (Many of the best
predictive analytics packages are written in the R programming language.) But
for a binary outcome variable, most data scientists start with logistic
regression, which is very straightforward to do in Stata.
Here are the steps:
1. explore the data, especially graduation predictors and outcomes
2. examine the relationship between predictors and outcomes
3. evaluate the predictive power of different variables and select predictors for your model
4. make predictions using logistic regression
5. convert the predicted probabilities into a 0/1 indicator
6. look at the effect of different probability cutoffs on prediction accuracy (develop a "confusion matrix")
This guide will take you through these steps using a baseline model for predicting
graduation using student grade 7 data. During the workshop you will iterate on
this baseline model using the process above and find an improved model that
meets stakeholder needs.
The commands in this script won't tell you everything you need to do to develop
your model, but they will give you command
syntax that you should be able to adjust and adapt to get the project done.
You should also consider non-model approaches like the "checklist" approaches
described in the pre-reading created by the Consortium on Chicago School
Research (CCSR). With that "checklist" approach, you experiment with different
thresholds for your predictor variables, and combine them to directly predict
the outcome. This approach has the advantage of being easy to explain and
implement, but it might not yield the most accurate predictions. We won't
demonstrate that approach in full, but here is a small sketch you can adapt
if you want to try it.
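The two flags and their thresholds below are illustrative placeholders only,
not recommendations:

    gen checklist_flag = 0
    replace checklist_flag = 1 if pct_days_absent_7 > 10 & !missing(pct_days_absent_7)
    replace checklist_flag = 1 if scale_score_7_math < 30 & !missing(scale_score_7_math)
    tab ontime_grad checklist_flag, row mi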
Before you get started, you need to think about variables, time, and datasets.
The sooner in a student's academic trajectory you can make a prediction, the
sooner you can intervene--but the less accurate your predictions, and hence
your intervention targeting, are likely to be. What data, and specifically which
variables, do you have available to make predictions? What outcome are you
trying to predict?
In the case of the workshop, we are limiting our focus to using data available
at the end of grade 7 to predict outcomes at the end of high school. A critical
step in developing a predictive model is to identify the time points that
different measures are collected during the process you are predicting -- you
can't use data from the future to make predictions. If you're planning to use
your model to make predictions for students at the end of 11th grade, for
instance, and if most students take AP classes as seniors, you can't use data
about AP coursetaking collected during senior year to predict the likelihood of
college enrollment, even if you have that data available for past groups of
students.
In terms of datasets, you can develop your model and then test its accuracy on
the dataset you used to develop the model, but that is bad practice--in the real
world, your model is only as good as its predictions on different, out of sample
datasets. It's good practice to split your data into three parts: one part for
developing your model, one for repeatedly testing different versions of your
model, and a third to use for a final out of sample test.
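A minimal sketch of such a split (the 50/25/25 proportions and the seed are
placeholder choices, not recommendations):

    set seed 1234
    gen u = runiform()
    gen part = cond(u < .50, 1, cond(u < .75, 2, 3)) // 1=train, 2=validate, 3=test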
We're using multiple cohorts of middle-school students for the predictive analytics
task--students who were seventh graders between 2007 and 2011. These are the
cohorts for which we have access to reliable information on their high school
graduation status (late graduate, on time graduate, dropout, transferout,
disappeared). For this workshop you are being given the entire set of data so
that you can explore different ways of organizing the data across cohorts.
For example, you may choose to explore your models and data using the earlier
cohorts, and evaluate their performance on more recent cohorts.
One last point -- even though the data is synthetic, we have simulated missing data
for you. In the real world, you'll need to make predictions for every
student, even if you're missing data for that student which your model needs in
order to run. Just making predictions using a logistic regression won't be
enough. You'll need to use decision rules based on good data exploration and
your best judgment to predict and fill in outcomes for students where you have
insufficient data.
// Getting Started
If you're using the do file version of these materials, start by saving a new
version of the do file with your initials in the title, so you can edit it
without worrying about overwriting the original. Then work through the do file
in Stata by highlighting one or a few command lines at a time, clicking the
"execute" icon in the toolbar above (or pressing control-D), and then looking at
the results in Stata's results window. Edit or add commands as you wish. If
you're using a paper or PDF version of these materials, just read on--the Stata
output appears below each section of commands.
This guide is built using the data you will be working with at the workshop.
This dataset includes simulated data for multiple cohorts of 7th graders along
with their corresponding high school outcomes. Each observation (row) is a student,
and includes associated information about the student's demographics, academic
performance in 7th grade, associated school and district (in grade 7), last
high school attended, and high school completion.
To work through the do file, you need to put the montucky.dta data file in a
data subfolder of your working directory, and then edit the file path in the
`use` command below if needed to tell Stata where to look for the data. If
you have trouble doing this, ask for help from other members of your group.
*/
// Set up
set more off
set type double
capture log close
// Open a log file. This stores a record of the commands and their output in a
// text file you can review later.
log using "montucky_ews.log", replace
// Load the data.
use data/montucky.dta, clear
// Validate the Data
// Verify that there is exactly one observation per student, and check the total
// number of observations.
isid sid
// Wait, what is the issue here?
duplicates report sid
// Why might our IDs be repeated 45 times? Let's look at how many LEAs we have in
// our SEA dataset:
levelsof sch_g7_lea_id
// We see that our student IDs are only unique within an LEA, not statewide.
// That's an easy enough fix.
isid sid sch_g7_lea_id
count
// Explore the Data
/*
A key initial task in building an EWS is to identify the cohort membership of
students. When we build a predictive model, we need to identify two time points -
when we will be making the prediction, and when the outcome we are predicting
will be observed. In this case, we will be making the prediction upon receiving
the 7th grade data on students (so the completion of 7th grade), and we will
be predicting their completion of high school.
Let's focus first on identifying the 7th grade year for each student. We have
three year variables. What is their relationship?
*/
tab year
tab cohort_year
tab cohort_grad_year
/*
From the data dictionary we know that the first year variable is the year
that corresponds with the student entering 7th grade (`year`). The `cohort_year`
variable defines the 9th grade cohort a student belongs to for measuring ontime
graduation. Finally, the `cohort_grad_year` is the year that the cohort a
student is a member of should graduate to be considered "on-time".
If a student graduates, their year of graduation is recorded as `year_of_graduation`.
The definition of graduation types is an important design decision for our
predictive analytic system. We have to set a time by which students are graduated
so we know whether to count them as graduating or not completing. The state of
Montucky uses a 4-year cohort graduation rate for most reporting, defining
ontime graduation as graduation within 4 years of entering high school. This
variable is defined as:
*/
gen year_test = 0
replace year_test = 1 if year_of_graduation == cohort_grad_year
tab year_test
tab ontime_grad
tab ontime_grad year_test
drop year_test
/*
This is an example of a business rule - it's a restriction on the definition
of the data we make so that we can consistently use the data. In this case, it
is necessary so we can definitively group students for the purposes of predicting
their outcomes. You could consider alternative graduation timelines, for example:
*/
gen year_test = 0
replace year_test = 1 if year_of_graduation <= cohort_grad_year + 1
drop year_test
/*
What does this rule say? How is it different than the definition of on-time
above?
Now that we have a sense of the time structure of the data, let's look at
geography. How many high schools and how many districts are there? What are those
regional education services coops?
We are going to be building a model for an entire state, but stakeholders
may have questions about how the model works for particular schools, districts,
or regions. Let's practice exploring the data by these different geographies.
*/
tab coop_name_g7, mi nolabel
tab first_hs_name
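// How many distinct schools and districts are there? codebook's compact
// option reports the number of unique values for each variable.
codebook sch_g7_lea_id first_hs_name, compact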
/*
For this exercise, districts in Montucky are organized into cooperative regions.
Cooperative regions are just groups of LEAs. It may be helpful to compare how
our model performs in a given school, LEA, or coop region to the rest of the
data in the state. As an example of this kind of analysis, select a specific
coop region and explore its data, drawing comparisons to the full dataset. Which
districts are part of this coop region and how many students do they have?
Substitute different coop names for different coops in the `my_coop` local
macro defined below.
*/
// Define the coop region to examine; replace the placeholder below with an
// actual coop name from your data before running these lines
local my_coop "COOP NAME HERE"
tab sch_g7_lea_id if coop_name_g7 == "`my_coop'"
/*
What student subgroups are we interested in? Let's start by looking at student
subgroups. Here's whether a student is male.
*/
tab male, mi
// Here's a shortcut command to look at one-way tabs of a lot of variables at once.
tab1 male race_ethnicity frpl_7 iep_7 gifted_7 ell_7, mi
/*
Let's examine the distribution of student subgroups by geography. For this
command, we'll use Stata's looping syntax, which lets you avoid repetition by
applying commands to multiple variables at once. You can't use loops when you
are entering commands directly into the command window, but they are very
powerful in do files. You can type "help foreach" into the Stata command window
if you want to learn more about how to use loops in Stata.
*/
foreach var of varlist male race_ethnicity frpl_7 iep_7 gifted_7 ell_7 {
tab coop_name_g7 `var', row mi
}
/*
Now, let's look at outcomes. We won't examine them all, but you should. Here's
a high school graduation outcome variable:
*/
tab ontime_grad, mi
/*
Wait! What if the data includes students who transferred out of state? That
might bias the graduation rate and make it too low, because those seventh
graders might show up as having dropped out.
*/
tab transferout, mi
tab transferout ontime_grad, mi
/*
This is another case where we may want to consider a business rule. How should
students who transfer out be treated? We don't know whether they graduated
or not. Should they be excluded from the analysis? Coded as not completing?
The decision is yours, but it is important to consider all the possible high
school outcomes when building a model and how the model will treat them.
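One possible rule, shown here only as a sketch, would be to censor the outcome
for transfers rather than counting them as non-completers:

    gen ontime_grad_censored = ontime_grad
    replace ontime_grad_censored = . if transferout == 1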
Let's look at the distribution of another outcome variable, `any_grad`, which
includes late graduation and ontime graduation by both geography and by
subgroup.
*/
tab coop_name_g7 any_grad, row mi
foreach var of varlist male race_ethnicity frpl_7 iep_7 gifted_7 ell_7 {
tab any_grad `var', row mi
}
// Let's look at the distribution of this outcome variable by geography and then by subgroup.
tab coop_name_g7 ontime_grad, mi row
foreach var of varlist male race_ethnicity frpl_7 iep_7 ell_7 gifted_7 {
tab `var' ontime_grad, row mi
}
// Review existing indicator
// Now that we are oriented to the data, let's turn our attention to the model
// predictions provided by the vendor. First, let's check the format:
codebook vendor_ews_score
/*
Instead of classifying students, each student receives a predicted probability
for graduating. We need a way to convert this into a classification measure.
One way to do this is to pick a threshold and declare all values greater than
that threshold as being of the positive (graduation) class, and all values
below as being members of the negative class.
*/
gen vendor_grad_class = 0
// Note: missing values are larger than any number in Stata, so exclude them
replace vendor_grad_class = 1 if vendor_ews_score > 0.5 & !missing(vendor_ews_score)
tab ontime_grad vendor_grad_class, cell mi
/*
This matrix tells us, for the threshold we have selected, how many graduates
we identified successfully, how many non-graduates we identified successfully,
and how many mistakes we made in each direction (this is known as a confusion
matrix and will be discussed at length at the workshop!).
How does this compare with the vendor's marketing material and stated accuracy?
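To check the overall agreement rate directly, you could compute something like
this sketch, using the classification variable created above:

    gen vendor_correct = ontime_grad == vendor_grad_class if !missing(ontime_grad)
    summ vendor_correct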
Now, let's turn to identifying the predictors available for building an alternative
model. Let's examine the performance and behavioral variables that you can
use as predictors. These are mostly numerical variables, so you should use the
summary, histogram, and table commands to explore them. Here's some syntax for
examining 7th grade math scores. You can replicate and edit it to examine other
potential predictors and their distributions by different subgroups.
*/
summ scale_score_7_math, detail
hist scale_score_7_math, width(1)
table coop_name_g7, c(mean scale_score_7_math)
table frpl_7, c(mean scale_score_7_math)
// Finally, here's some sample code you can use to look at missingness patterns
// in the data. The "gen" command is used to generate a new variable.
gen math7_miss = missing(scale_score_7_math)
tab math7_miss
foreach var of varlist coop_name_g7 male race_ethnicity frpl_7 iep_7 gifted_7 ell_7 {
tab `var' math7_miss, mi row
}
/*
Handling missing values is another case where business rules will come into play.
Did you see any outlier or impossible values while you were exploring the data?
If so, you might want to truncate them or change them to missing. Here's how you
can replace a numeric variable with a missing value if it is larger than a
certain number (in this case, 100 percent).
*/
hist pct_days_absent_7
replace pct_days_absent_7 = . if pct_days_absent_7 > 100
hist pct_days_absent_7
/*
Trimming the data in this way is another example of a business rule. You
may wish to trim the absences even further in this data. You may also wish to
assign a different value other than missing for unusual values - such as the
mean or median value.
Now that you've explored the data, you can start to examine the relationship
between predictor and outcome variables. Here we'll continue to look at the high
school graduation outcome, and we'll restrict the predictors to just two: 7th
grade math scores and percent of enrolled days absent through 7th grade. For
your model, you can of course use more and different predictor
variables. First, check the correlation between outcome and predictors.
*/
corr ontime_grad scale_score_7_math pct_days_absent_7
/*
These correlations do not look very promising! But remember, a correlation
may not tell the whole story if other factors are involved, and correlations
between binary variables and continuous variables are noisy indicators.
It would be nice to have a better idea of
the overall relationship between outcomes and predictors.
But you can't make a meaningful scatterplot when the dependent variable, or
y value, is a binary outcome variable (try it!). Let's look at a technique to identify
the relationship between a continuous variable and a binary outcome.
The idea behind this code is to show the mean of the outcome variable for each
value of the predictor, or for categories of the predictor variable if it has
too many values. First, define categories (in this case, round to the nearest
percentage) of the percent absent variable, and then truncate the variable so that
low-frequency values are grouped together.
*/
egen pct_absent_cat = cut(pct_days_absent_7), at(0(1)100)
tab pct_absent_cat
replace pct_absent_cat = 30 if pct_absent_cat >= 30 & !missing(pct_absent_cat)
/*
Next, define a variable which is the average ontime graduation rate for each
absence category, and then make a scatter plot of average graduation rates by
absence percent.
*/
egen abs_ontime_grad = mean(ontime_grad), by(pct_absent_cat)
scatter abs_ontime_grad pct_absent_cat
/*
You can do the same thing for 7th grade test scores, without having to group
them with the egen cut command.
*/
egen math_7_cut = cut(scale_score_7_math), group(100)
egen math_7_ontime_grad = mean(ontime_grad), by(math_7_cut)
scatter math_7_ontime_grad scale_score_7_math
/*
You can see there are some 7th grade math score outliers--if you haven't
already, you might want to set them to missing.
*/
replace scale_score_7_math = . if scale_score_7_math < 0
drop math_7_cut
drop math_7_ontime_grad
egen math_7_cut = cut(scale_score_7_math), group(100)
egen math_7_ontime_grad = mean(ontime_grad), by(math_7_cut)
scatter math_7_ontime_grad scale_score_7_math
/*
Looking at the plot, if you think the relationship between seventh grade math
scores and ontime graduation is more of a curve than a line, you can define
variables for the square and cube of the math scores so that Stata will be able
to fit a polynomial equation to the data instead of a straight line when you
build your model.
*/
gen math_7_squared = scale_score_7_math^2
gen math_7_cubed = scale_score_7_math^3
/*
Now we're ready to call on the logit command to examine the relationship between
our binary outcome variable and our predictor variables. When you run a logistic
regression with the logit command, Stata calculates the parameters of an
equation that fits the relationship between the predictor variables and the
outcome. A regression model typically won't be able to explain all of the
variation in an outcome variable--any variation that is left over is treated as
unexplained noise in the data, or error, even if there are additional variables
not in the model which could explain more of the variation. Once you've run a
logit regression, you can have Stata generate a variable with new, predicted
outcomes for each observation in your data with the predict command. The
predictions are calculated using the model equation and ignore the unexplained
noise in the data. For logit regressions, the predicted outcomes take the form
of a probability number between 0 and 1. To start with, let's do a regression of
ontime graduation on seventh grade math scores.
*/
logit ontime_grad scale_score_7_math
/*
Even before you use the predict command, you can use the logit output to learn
something about the relationship between the predictor and the outcome variable.
The Pseudo R2 (read R-squared) is a proxy for the share of variation in the
outcome variable that is explained by the predictor. Statisticians don't like it
when you take the pseudo R2 too seriously, but it can be useful in predictive
exercises to quickly get a sense of the explanatory power of variables in a
logit model. Does adding polynomial terms increase the pseudo R2? Not by very
much. Any time you add predictors to a model, the R2 will increase, even if the
variables are fairly meaningless, so it's best to focus on including predictors
that add meaningful explanatory power.
*/
logit ontime_grad scale_score_7_math math_7_squared math_7_cubed
// Now take a look at the R2 for the absence variable. Absence rates have
// almost no explanatory power in 7th grade.
logit ontime_grad pct_days_absent_7
// Let's combine our two predictors. This model has barely any more
// explanatory power than the test-score alone.
logit ontime_grad pct_days_absent_7 scale_score_7_math
// Now, let's use the predict command. Stata applies the predict command to
// the most recent regression model.
predict model1
/*
This generates a new variable `model1`, which is the predicted probability of
ontime high school
graduation, according to the model. But if you look at the number of
observations with predictions, you'll see that it is smaller than the total
number of students. This is because Stata doesn't use observations that have
missing data for any of the variables in the model.
*/
summ model1, detail
count
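// How many students actually received a prediction? Compare this count with
// the total count above.
count if !missing(model1)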
/*
Let's convert this probability to a 0/1 indicator for whether or not a student
is likely to graduate ontime. A good rule of thumb when starting out is to set
the probability cutoff at the mean of the outcome variable. In this example,
we can grab that value from the summarize results and store it in a local macro:
*/
summ ontime_grad // the mean (about .655) will be our threshold
local cutoff = r(mean)
gen grad_indicator = 0 if model1 < `cutoff' & !missing(model1)
replace grad_indicator = 1 if model1 >= `cutoff' & !missing(model1)
tab grad_indicator, mi
// You can also plot the relationship between the probability and the outcome.
// Ideally, you should see the proportion of graduates steadily increase for each
// percentile of the probabilities. What does this relationship tell you?
egen model_prob_rank = cut(model1), group(50)
egen model_prob_ontime_grad = mean(ontime_grad), by(model_prob_rank)
scatter model_prob_ontime_grad model1
/*
Lets evaluate the accuracy of the model by comparing the predictions to the
actual graduation outcomes for the students for whom we have predictions. This
type of crosstab is called a "confusion matrix." The observations in the upper
left corner, where the indicator and the actual outcome are both 0, are true
negatives. The observations in the lower right corner, where the indicator and
the outcome are both 1, are true positives. The upper right corner contains
false positives, and the lower left corner contains false negatives. Overall, if
you add up the cell percentages for true positives and true negatives, the model
got 58 percent of the predictions right.
*/
tab ontime_grad grad_indicator, cell
/*
However, almost all of the wrong predictions are false negatives--these are
students who have been flagged as dropout risks even though they did graduate
ontime. If you want your indicator system to have fewer false negatives, you
can lower the probability cutoff. The cutoff below has a lower share of false
negatives and a higher share of false positives, with a somewhat lower share of
correct predictions.
*/
replace grad_indicator = 0 if model1 < .59445 & model1 ~= .
replace grad_indicator = 1 if model1 >= .59445 & model1 ~= .
tab ontime_grad grad_indicator, cell
// Missing Data
/*
Another key business rule is how we will handle students with missing data. A
predictive analytics system is more useful if it makes an actionable prediction
for every student. It is good to check, if it is available, the graduation rates
for students with missing data:
*/
tab ontime_grad if math7_miss == 1
// There are a number of options. One is to run a model with fewer variables for
// only those students, and then use that model to fill in the missing
// indicators.
logit ontime_grad pct_days_absent_7 if math7_miss == 1
predict model2 if math7_miss == 1
summ model2, detail
replace grad_indicator = 0 if model2 < .59445 & model2 ~= . & model1 == .
replace grad_indicator = 1 if model2 >= .59445 & model2 ~= . & model1 == .
/*
We now have predictions for all but a very small share of students, and those
students are split between graduates and non-graduates. We have to apply a rule
or a model to make predictions for them--we can't use information from the
future, except to develop the prediction system. We'll arbitrarily decide to
flag them as potential non-graduates, since students with lots of missing data
might merit some extra attention.
*/
tab grad_indicator, mi
replace grad_indicator = 0 if grad_indicator == .
// Evaluate Fit
// Now we have a complete set of predictions from our simple models. How well
// does the prediction system work? Can we do better?
tab ontime_grad grad_indicator, cell
/*
A confusion matrix is one way to evaluate the success of a model and evaluate
tradeoffs as you are developing prediction systems, but there are others. We
will cover these more in the workshop, but in cases where we have an uneven
proportion of cases in each class (e.g. we have many more graduates than
non-graduates), it can be helpful to look at a metric like the AUC, which stands
for "area under the curve." You'll learn more about ways to evaluate a
prediction system, including the AUC metric, during Day 2 of the workshop, but
here's a sneak peek. First, look at row percentages instead of cell percentages
in the confusion matrix.
*/
tab ontime_grad grad_indicator, row
/*
Next, use the "roctab" command to plot the true positive rate (sensitivity in
the graph) against the false positive rate (1-specificity in the graph). You can
see these percentages match the row percentages in the last table. The AUC is
the "area under ROC curve" in this graph, and it is a useful single-number
summary of predictive accuracy.
*/
roctab ontime_grad grad_indicator, graph
/*
Model comparison in Stata for different logistic regressions is straightforward
as well. Fit two models and store their predictions. Let's compare a model that
includes two student demographic categories (FRPL and sex) with a model that
only includes math scores and attendance.
*/
logit ontime_grad scale_score_7_math male i.frpl_7
predict yhat_1
// plot the roc curve
lroc
logit ontime_grad scale_score_7_math pct_days_absent_7
predict yhat_2
lroc
// This takes a while :-) For speed, you can subset your data down to a test
// set or a specific year/lea/group
roccomp ontime_grad yhat_1 yhat_2, graph summary
// more details here:
// https://stats.idre.ucla.edu/stata/faq/how-can-i-test-the-difference-in-area-under-roc-curve-for-two-logistic-regression-models/
/*
A couple of last thoughts and notes. First, note that so far we haven't done any
out-of-sample testing. We know from the pre-reading that we should never trust
our model fit measures on data the model was fit to -- statistical models are
overly confident. To combat this, you should subdivide your dataset. There are
many strategies you can choose from depending on how much data you have and the
nature of your problem - for the EWS case, we can build our models on one
part of the data and evaluate that fit on the rest. Here is some code that
splits the dataset by observation number:
*/
local split = 15000
local train = "1/`=`split'-1'"
local test = "`split'/`=_N'"
logit ontime_grad pct_days_absent_7 scale_score_7_read in `train'
predict grad3
summ grad3
gen grad_indicator3 = 0
replace grad_indicator3 = 1 if grad3 >= 0.5994 & !missing(grad3)
tab ontime_grad grad_indicator3 in `test', cell mi
// You can use the classtabi routine (ssc install classtabi) to get more details
// You type in the cell counts from left to right (top left, top right, bottom
// left ..) Due to dataset variation, your exact counts may vary, modify as
// needed
classtabi 4314 32699 5873 61351
/*
You can also split your data by time cohorts, which is a good idea - fit your
model on older data and see how it performs on more recent data:
*/
// Adjust the cutoff year below to match the cohort years in your data
logit ontime_grad pct_days_absent_7 scale_score_7_read if year < 2006
predict grad4
summ grad4
gen grad_indicator4 = 0
replace grad_indicator4 = 1 if grad4 >= 0.5994 & !missing(grad4)
tab ontime_grad grad_indicator4 if year > 2005, cell mi
/*
Second, should we use subgroup membership variables (such as demographics or
school of enrollment) to make predictions, if they improve the accuracy of
predictions? This is more a policy question than a technical question, and you
should consider it when you are developing your models. You'll also want to
check to see how accurate your model is for different subgroups.
*/
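// A quick sketch of checking accuracy by subgroup, using the grad_indicator
// predictions from above (adapt this to your final model):
gen correct = ontime_grad == grad_indicator
foreach var of varlist male race_ethnicity frpl_7 {
    tab `var', summarize(correct)
}
drop correct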
// Bonus
// Warning: Here be dragons
// Some less tested code to try some more exotic modeling strategies using Stata
// Set up Stata to be memory efficient
set matsize 11000
set emptycells drop
// Install svmachines
// Uncomment the next two lines to install support vector machines in Stata
// net sj 16-4
// net install st0461
// Data preparation
// svmachines is more finicky about data than logit: it needs data types clearly
// declared in advance and cannot handle any missing values at all
drop if pct_days_absent_7 == .
encode male, gen(male_fac)
encode race_ethnicity, gen(race_fac)
encode iep_7, gen(iep_fac)
encode gifted_7, gen(gifted_fac)
drop if male_fac == .
drop if race_fac == .
set seed 9876
// shuffle the data
generate u = runiform()
sort u
// before the actual train and test split:
local split = 15000 // restrict the sample to something Stata can compute quickly
local train = "1/`=`split'-1'"
local test = "`split'/`=_N'"
// recode your outcome variable to be the right type for svmachines
tempvar B
generate byte `B' = ontime_grad
drop ontime_grad
rename `B' ontime_grad
// Get the svmachine model fit using multiple predictors
svmachines ontime_grad scale_score_7_math scale_score_7_read i.frpl_7 male_fac race_fac iep_fac gifted_fac sch_g7_frpl_per in `train', verbose probability
// generate predictions on the test, not training, data
predict P in `test'
// Calculate the prediction error
generate err = ontime_grad != P in `test'
// summarize the prediction error
summarize err in `test'
tab P ontime_grad in `test'
// Generate predicted probabilities
predict P_prob in `test', prob
// calculate an ROC graph for the predicted probabilities
roctab ontime_grad P_prob in `test', graph
// Calculate all accuracy metrics from the confusion matrix using
// classtabi
// Note due to random sampling your numbers may need to be modified
tab P ontime_grad in `test'
classtabi 11790 7654 25223 59570