Notes from Statistical Learning Course
via Stanford University Online
Textbook 1: An Introduction to Statistical Learning
Textbook 2: The Elements of Statistical Learning
- outcome (y, dependent, response, target)
- predictors (x's, independent, regressors, covariates, features)
- regression: continuous outcome
Objective: based on the training data we would like to:
- accurately predict unseen test cases
- understand which inputs affect the outcome, and how
- assess the quality of our predictions
Philosophy: understand the ideas behind the techniques and how/when to use them; assess the performance of a method.
- no outcome, just predictors (features) measured on a set of samples
- the objective is fuzzier: find groups of samples that behave similarly, and find linear combinations of features with the most variance (the directions along which samples differ most)
- difficult to know how we are doing
Statistical Learning vs. Machine Learning
- machine learning puts more emphasis on large-scale problems and pure prediction accuracy
- arose as a subfield of AI
- statistical learning emphasizes models that are interpretable by scientists, along with precision, uncertainty, and model performance
- arose as a subfield of Statistics
- the two are very similar and both cover supervised and unsupervised learning; the difference is mainly one of emphasis
Intro to Regression models
- Our model notation: Y = f(X) + e where e captures the measurement errors
- minimize average prediction error
- nearest-neighbor averaging estimates f well when the number of predictors is small; it can perform poorly when the number of predictors is large
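Nearest-neighbor averaging can be sketched in a few lines. This is a minimal illustration with a single predictor, not the course's implementation; the function name knn_average and the toy data are assumptions made up for the example.

```python
def knn_average(x_train, y_train, x0, k):
    """Estimate f(x0) by averaging the responses of the k training points nearest to x0."""
    # With one predictor, the distance from x0 is simply |x - x0|.
    order = sorted(range(len(x_train)), key=lambda i: abs(x_train[i] - x0))
    nearest = order[:k]
    return sum(y_train[i] for i in nearest) / k


# Toy data (hypothetical): y = x^2 with no noise.
x_train = [1.0, 2.0, 3.0, 4.0, 5.0]
y_train = [1.0, 4.0, 9.0, 16.0, 25.0]

# The 3 nearest neighbors of x0 = 3 are x = 2, 3, 4, so this averages 4, 9 and 16.
estimate = knn_average(x_train, y_train, x0=3.0, k=3)
print(estimate)
```

A small k follows the data closely, while a large k smooths more heavily.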
dimensionality and structured models
- The curse of dimensionality
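- the linear model is an important example of a parametric model: f(X) = Beta0 + Beta1 X1 + ... + Betap Xp
- does not rely on nearest-neighbor averaging, so it avoids the curse of dimensionality; it is almost never exactly correct, but it is often a good and interpretable approximation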
- some trade offs:
- prediction accuracy vs interpretation
- good fit vs. over-fitting: splines
- parsimony vs. black box: simple models with few variables versus complex models
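One way to see the curse of dimensionality: if the data are spread uniformly over a unit cube in p dimensions, a sub-cube that captures a fraction r of the points needs edge length r^(1/p). The sketch below just evaluates that formula; the uniform-data assumption and the function name edge_length are illustrative choices, not part of the notes.

```python
def edge_length(fraction, p):
    """Edge length of a sub-cube of the unit cube in p dimensions that
    contains the given fraction of uniformly distributed points."""
    return fraction ** (1.0 / p)


# How wide must a "local" neighborhood be to hold 10% of the data?
for p in (1, 2, 5, 10, 100):
    print(p, round(edge_length(0.10, p), 3))
```

With p = 10 the neighborhood must span roughly 80% of the range of each predictor, so the "nearest" neighbors are no longer local.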
model selection and bias-variance trade-off
- assessing model accuracy:
- compute the average squared prediction error (mean squared error, MSE); computed on the training data, it may be biased toward more over-fit models
- computing it on a separate test dataset gives a better indicator of model performance
- as the flexibility of the fitted function increases, its variance increases and its bias decreases, so choosing the flexibility based on average test error amounts to a bias-variance trade-off
- While we typically expect a model with more predictors to have lower Training Set Error, it is not necessarily the case. An extreme counter example would be a case where you have a model with one predictor that is always equal to the response, compared to a model with many predictors that are random.
- Introducing the quadratic term will make your model more complicated. More complicated models typically have lower bias at the cost of higher variance. This has an unclear effect on Reducible Error (could go up or down) and no effect on Irreducible Error.
- A flexible model will allow us to take full advantage of our large sample size.
- The flexible model will cause over fitting due to our small sample size.
- A flexible model will be necessary to find the nonlinear effect.
- A flexible model will cause us to fit too much of the noise in the problem.
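The gap between training and test error can be illustrated with a small simulation. This is a sketch under assumptions that are not in the notes: the true function is sin(x), the noise standard deviation is 0.5, and flexibility is controlled by k in nearest-neighbor averaging (k = 1 is the most flexible fit, k = n the least).

```python
import math
import random


def knn_predict(x_train, y_train, x0, k):
    """Average the responses of the k training points nearest to x0."""
    order = sorted(range(len(x_train)), key=lambda i: abs(x_train[i] - x0))
    return sum(y_train[i] for i in order[:k]) / k


def mse(x_train, y_train, x_eval, y_eval, k):
    """Mean squared prediction error on (x_eval, y_eval) for a k-NN fit."""
    errors = [(y - knn_predict(x_train, y_train, x, k)) ** 2
              for x, y in zip(x_eval, y_eval)]
    return sum(errors) / len(errors)


def simulate(n, rng):
    """Draw n points from the assumed model Y = sin(X) + noise."""
    xs = [rng.uniform(0.0, 6.0) for _ in range(n)]
    ys = [math.sin(x) + rng.gauss(0.0, 0.5) for x in xs]
    return xs, ys


rng = random.Random(0)
x_tr, y_tr = simulate(100, rng)
x_te, y_te = simulate(100, rng)

results = {}
for k in (1, 5, 20, 100):
    train_mse = mse(x_tr, y_tr, x_tr, y_tr, k)
    test_mse = mse(x_tr, y_tr, x_te, y_te, k)
    results[k] = (train_mse, test_mse)
    print(k, train_mse, test_mse)
```

With k = 1 each training point is its own nearest neighbor, so the training MSE is zero even though the test MSE is not; this is why training error alone favors over-fit models.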
- response is qualitative: build a classifier C(X) that assigns a class label from C to a future unlabeled observation X.
- assess the uncertainty in each classification
- understand the roles of the different predictors among X = (X1, X2,..., Xp)
- conditional class probability
- Bayes optimal classifier: assign a point to the class with the largest conditional class probability at that X
- measure performance using misclassification error rate
- in about 1/3 of classification problems, nearest neighbors will be the best tool
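The Bayes classifier and the misclassification error rate can be made concrete with a small worked example. All probabilities and labels below are hypothetical numbers chosen for illustration; in practice the conditional class probabilities are unknown and have to be estimated.

```python
def bayes_classify(class_probs):
    """Return the class with the largest conditional probability P(Y = k | X = x)."""
    return max(class_probs, key=class_probs.get)


def misclassification_rate(y_true, y_pred):
    """Fraction of observations whose predicted class differs from the true class."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)


# Hypothetical conditional class probabilities for three values of X.
cond_probs = {
    "low": {"A": 0.9, "B": 0.1},
    "mid": {"A": 0.4, "B": 0.6},
    "high": {"A": 0.2, "B": 0.8},
}
# Hypothetical probabilities of observing each value of X.
p_x = {"low": 0.5, "mid": 0.3, "high": 0.2}

# The Bayes rule picks the most probable class at each X.
rule = {x: bayes_classify(probs) for x, probs in cond_probs.items()}

# Its expected error is the average of 1 - max_k P(Y = k | X = x) over X.
bayes_error = sum(p_x[x] * (1.0 - max(cond_probs[x].values())) for x in cond_probs)

print(rule)
print(bayes_error)
print(misclassification_rate(["A", "B", "B", "A"], ["A", "B", "A", "A"]))
```

No classifier built from X alone can have a lower expected error rate than bayes_error under these assumed probabilities.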
simple linear regression
- simple is very good! and important for other topics in supervised Learning
- questions to ask: is there a relationship, how strong is the relationship, which variables contribute most, how accurately can we predict future values, is the relationship linear, etc.
- formula: Y = Beta0 + Beta1 X + e; the outcome is a linear function of the predictor, with coefficients (parameters) Beta0 and Beta1 representing the intercept and slope, plus noise. Hat symbols denote estimated parameters.
- residual = actual - estimated; the residual sum of squares (RSS) is the sum of the squared residuals
- least squares chooses the intercept and slope that minimize the RSS
- confidence interval: a range that, with about 95% probability, contains the true unknown value of the parameter (e.g., the slope); roughly hat-Beta1 +/- 2 * SE(hat-Beta1)
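The least squares estimates and an approximate confidence interval for the slope can be computed directly from their formulas. This is a minimal sketch; the function name fit_simple_ols and the five data points are made up for illustration, and the factor 2 in the interval is the usual large-sample approximation (with so few points the exact t-quantile would be larger).

```python
import math


def fit_simple_ols(x, y):
    """Least squares fit of y = b0 + b1 * x. Returns (b0, b1, rss, sxx)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx              # estimated slope
    b0 = y_bar - b1 * x_bar     # estimated intercept
    rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    return b0, b1, rss, sxx


# Hypothetical data, roughly y = 2x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.2, 4.1, 6.2, 7.9, 10.1]

b0, b1, rss, sxx = fit_simple_ols(x, y)

# Standard error of the slope: sqrt(RSS / (n - 2)) / sqrt(sum((xi - x_bar)^2)).
se_b1 = math.sqrt(rss / (len(x) - 2)) / math.sqrt(sxx)

# Approximate 95% confidence interval for the slope.
ci = (b1 - 2 * se_b1, b1 + 2 * se_b1)
print(b0, b1, rss, ci)
```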
hypothesis testing and confidence intervals
- test of a certain value of a parameter (is the slope zero?)
- H0 : There is no relationship between X and Y (Beta1 = 0), versus the alternative hypothesis
- HA : There is some relationship between X and Y (Beta1 != 0)
- to test the null hypothesis we compute a t-statistic given by: t = (hat-Beta1 - 0) / SE(hat-Beta1)
- under the null hypothesis this has a t-distribution with n - 2 degrees of freedom (just look this value up)
- the p-value is based on this statistic: it is the probability, assuming the null hypothesis is true, of getting a value of t at least as large in absolute value as the one you observed
- in order to have a p-value below 0.05 you need a t-statistic of about 2 in absolute value
- to interpret: the p-value is the probability of seeing data at least this extreme under the assumption that the null hypothesis is true (i.e., X has no effect on Y); if the p-value is small, data like this would be very unlikely under the null, which we take as evidence that X does have an effect on Y
- assessing the overall accuracy of the model
- compute the Residual Standard Error: RSE = sqrt(RSS / (n - 2)) = sqrt(sum((yi - y-hati)^2) / (n - 2))
- R-squared, or fraction of variance explained, is 1 - RSS / TSS, where TSS = sum((yi - y-bar)^2) is the total sum of squares
- in simple linear regression, R-squared = r^2, where r is the correlation between X and Y
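The t-statistic, RSE and R-squared follow from the same sums of squares, and the identity R-squared = r^2 can be checked numerically. The data below are hypothetical and the variable names are illustrative; this is a sketch of the formulas above, not a substitute for a statistics package.

```python
import math

# Hypothetical data, roughly y = 2x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.2, 4.1, 6.2, 7.9, 10.1]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
tss = sum((yi - y_bar) ** 2 for yi in y)      # total sum of squares

b1 = sxy / sxx                                 # least squares slope
b0 = y_bar - b1 * x_bar                        # least squares intercept
rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

rse = math.sqrt(rss / (n - 2))                 # residual standard error
t_stat = (b1 - 0.0) / (rse / math.sqrt(sxx))   # tests H0: Beta1 = 0
r_squared = 1.0 - rss / tss                    # fraction of variance explained
r = sxy / math.sqrt(sxx * tss)                 # sample correlation of X and Y

print(t_stat, rse, r_squared, r ** 2)
```

Here |t| is far above 2, so the null hypothesis of a zero slope would be rejected for this toy data.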
multiple linear regression
- the term regression comes from "regression toward the mean"; see further reading on the historical context of the term
- extends the simple model to have more than one predictor.
- Adding lots of extra predictors to the model can just as easily muddy the interpretation of hat-Beta1 as it can clarify it. One often reads in media reports of academic studies that "the investigators controlled for confounding variables," but be skeptical!
- Causal inference is a difficult and slippery topic; causal questions cannot be answered with observational data alone without additional assumptions.
regression in real problems - questions to think about
- Is at least one of the predictors (X1, X2,...Xp) useful in predicting the response?
- F-statistic = (total drop in training error / # parameters fitted) / (RSS / (n - p - 1)), i.e., F = ((TSS - RSS) / p) / (RSS / (n - p - 1))
- Residual Standard Error
- Do all of the predictors help explain Y or is only a subset of the predictors useful?
- all subsets / best subsets regression: compute the least squares fit for all possible subsets and choose between them based on some criterion that balances training error with model size.
- forward selection: start with the null model (no predictors, just the intercept, which is the mean of y), then add variables one at a time; at each step pick the variable that improves the fit most and keep it in the model.
- backward selection: start with all variables and repeatedly remove the variable that is least significant (look at the t-statistic), refitting after each removal
- criteria: Mallows' Cp, AIC, BIC, adjusted R-squared, cross-validation
- qualitative predictors: also called categorical predictors or factors - dummy variables can be created to represent the categorical feature.
- how well does the model fit the data?
- given a set of predictor values, what response should we predict, and how accurate is that prediction?
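The F-statistic and forward selection described above can be sketched with a small least squares routine. This is an illustration under stated assumptions, not the course's code: the helper names (solve, rss_for, forward_selection) are made up, the data are simulated so that only the first two of four predictors matter, and the fit solves the normal equations directly, which is fine for a toy example but not how production software does it.

```python
import random


def solve(a, b):
    """Solve the linear system a * coef = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    coef = [0.0] * n
    for r in range(n - 1, -1, -1):                   # back substitution
        s = sum(m[r][c] * coef[c] for c in range(r + 1, n))
        coef[r] = (m[r][n] - s) / m[r][r]
    return coef


def rss_for(columns, y, chosen):
    """RSS of the least squares fit of y on an intercept plus the chosen predictor columns."""
    rows = [[1.0] + [columns[j][i] for j in chosen] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(k)]
    coef = solve(xtx, xty)
    return sum((yi - sum(c * v for c, v in zip(coef, r))) ** 2 for r, yi in zip(rows, y))


def forward_selection(columns, y):
    """Greedily add the predictor that lowers RSS most. Returns [(chosen, rss), ...]."""
    chosen, remaining = [], list(range(len(columns)))
    path = [([], rss_for(columns, y, []))]           # null model: intercept only
    while remaining:
        best = min(remaining, key=lambda j: rss_for(columns, y, chosen + [j]))
        chosen.append(best)
        remaining.remove(best)
        path.append((list(chosen), rss_for(columns, y, chosen)))
    return path


# Simulated data (hypothetical): only predictors 0 and 1 affect the response.
rng = random.Random(1)
n, p = 200, 4
columns = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(p)]
y = [3.0 * columns[0][i] + 1.5 * columns[1][i] + rng.gauss(0.0, 0.5) for i in range(n)]

path = forward_selection(columns, y)
for chosen, rss in path:
    print(chosen, round(rss, 2))

# F-statistic for the full model: the null-model RSS is the TSS.
tss, rss_full = path[0][1], path[-1][1]
f_stat = ((tss - rss_full) / p) / (rss_full / (n - p - 1))
print(f_stat)
```

Training RSS never increases as variables are added, which is why the final choice of model size needs a criterion such as Cp, AIC, BIC, adjusted R-squared, or cross-validation rather than RSS alone.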
extension of the linear model
- relax the additive (no interactions) and linearity assumptions
- interactions: synergy effect in marketing, interaction effect in statistics
- put in a product term (multiply the variables together)
- other topics: outliers, non-constant variance of error terms, high leverage points and collinearity.
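The effect of a product term can be seen in a noiseless toy model Y = Beta0 + Beta1 X1 + Beta2 X2 + Beta3 X1 X2: the slope of Y with respect to X1 becomes Beta1 + Beta3 X2, so it depends on the value of X2. The coefficient values and helper names below are hypothetical, chosen only to illustrate this.

```python
def slope(x, y):
    """Least squares slope of y on x."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy / sxx


# Hypothetical coefficients for intercept, X1, X2 and the product term X1 * X2.
beta0, beta1, beta2, beta3 = 1.0, 2.0, 0.5, 3.0


def response(x1_value, x2_value):
    """Noiseless response from the model with an interaction (product) term."""
    return beta0 + beta1 * x1_value + beta2 * x2_value + beta3 * x1_value * x2_value


x1 = [0.0, 1.0, 2.0, 3.0, 4.0]

# Slope of Y on X1 when X2 is held at 0 and at 1.
slope_when_x2_is_0 = slope(x1, [response(v, 0.0) for v in x1])
slope_when_x2_is_1 = slope(x1, [response(v, 1.0) for v in x1])

print(slope_when_x2_is_0)  # Beta1
print(slope_when_x2_is_1)  # Beta1 + Beta3
```

In an additive model without the product term the two slopes would be forced to be equal.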