## fancy_logit.do
*------------------------------------------------
* Logistic regression done well
*
* Andrew Heiss (andrew.heiss@duke.edu)
* October 21, 2014
*------------------------------------------------

* Load data
use "http://www.ats.ucla.edu/stat/data/hsbdemo", clear

* Create basic model
logit honors read i.female i.prog
predict phat  // Save predicted values

* Create more complex model
logit honors read i.female i.prog i.ses
predict phat1  // Save predicted values


*------------------
* Check model fit
*------------------
* R2
*---
* Pseudo R2 values are pretty meaningless, so don't try to use them


* Contingency tables
*-------------------
* Pretend that anything with a predicted probability of > 50% should happen
* Use a table to compare whether the predicted outcomes line up with the actual outcomes
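* Stata treats missing values as larger than any number, so exclude them explicitly
gen likely = (phat > 0.5) if !missing(phat)
tab honors likely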
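
* A possible follow-up, not part of the original workflow: condense the table
* into the share of cases classified correctly at the 0.5 cutoff.
* "correct" is a hypothetical variable name introduced here for illustration.
gen correct = (likely == honors) if !missing(phat, honors)
summarize correct  // The mean is the proportion classified correctly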


* Receiver Operating Characteristic (ROC) Curves
*-----------------------------------------------
* x-axis = false positive rate, or 1 - specificity; # of false positives / (false positives + true negatives), or all the actual negatives incorrectly flagged as positive / all actual negatives
* y-axis = true positive rate, or sensitivity; # of true positives / (true positives + false negatives), or all the correctly identified positives / all actual positives
* Diagonal line = 50% coin toss line
* AUC = between 0 and 1; 0.5 = coin toss; higher = better
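
* The axis definitions above can be checked by hand for a single cutoff.
* A minimal sketch, assuming the "likely" variable from the contingency table
* section (cutoff of 0.5 on phat) is still in memory. The local macro names
* tp, fn, fp, and tn are introduced here for illustration only.
count if likely == 1 & honors == 1
local tp = r(N)  // True positives
count if likely == 0 & honors == 1
local fn = r(N)  // False negatives
count if likely == 1 & honors == 0
local fp = r(N)  // False positives
count if likely == 0 & honors == 0
local tn = r(N)  // True negatives
display "Sensitivity (y-axis) at 0.5 cutoff:         " %5.3f `tp' / (`tp' + `fn')
display "False positive rate (x-axis) at 0.5 cutoff: " %5.3f `fp' / (`fp' + `tn')
* This is one point on the ROC curve; the full curve repeats the calculation
* for every possible cutoff.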

* Run lroc after a logit command to see an individual plot
* or run roccomp on saved predicted values
logit honors read i.female i.prog
lroc

roccomp honors phat phat1, graph summary

* When comparing models, the AUC is the number to focus on.
* The shape of the ROC curve mostly shows the tradeoff between sensitivity and false positives at different cutoffs, not overall model fit.


* Separation plots
*-----------------
* These are only available as an R package for now, but they're really intuitive
* See http://mdwardlab.com/biblio/separation-plot-new-visual-method-evaluating-fit-binary-models
* and http://cran.r-project.org/web/packages/separationplot/separationplot.pdf

* After generating all the predicted values you want, export your data as a csv
* MAKE SURE that you check "Output numeric values (not labels) of labeled variables" so that you get 0s and 1s instead of text. The current version of separationplot cannot handle text.
* Open R and type the following commands
*
* install.packages("separationplot")  # If it's not already installed
* library(separationplot)  # Load the library
* df <- read.csv("~/Desktop/test.csv")  # Use the full path to the csv file
* separationplot(pred=df$phat, actual=df$honors, type="rect", show.expected=TRUE)
* # Note: expected = sum(phat)
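
* The export step described above can also be scripted instead of using the
* dialog. A minimal sketch, assuming Stata 13 or later (for export delimited)
* and the same ~/Desktop/test.csv path used in the R commands. The nolabel
* option writes numeric values rather than value labels.
export delimited honors phat phat1 using "~/Desktop/test.csv", nolabel replace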


*-----------------------------------
* Check effects of model variables
*-----------------------------------
* Run the second model again
logit honors read i.female i.prog i.ses

* Log odds are hard to interpret; odds ratios are a little more intuitive
* Calculate the odds ratio manually by using e^beta, or just add the "or" option to the logit command
logit , or  // Replays the previous model, reporting odds ratios instead of log odds
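
* The manual e^beta calculation mentioned above, as a minimal sketch. It
* assumes the logit model above is still the active estimation result, so
* _b[read] holds the log-odds coefficient on read.
display "Odds ratio for read: " exp(_b[read])
display "Percent change in odds per one-point increase in read: " 100 * (exp(_b[read]) - 1)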


* Play with different variables
*------------------------------
* Check predicted probabilities for factors/categories
margins prog, atmeans
marginsplot, recast(scatter)

* Check predicted probabilities for numeric variables
margins , at(read=(28(2)76)) vsquish
marginsplot, recast(line) recastci(rarea)

* Play with multiple variables at the same time
margins female, at(read=(28(2)76)) vsquish
marginsplot, recast(line) recastci(rarea)