jasdumas/Notes from Statistical Learning online course.md

## Notes from Statistical Learning online course.md

      
title: Notes from Statistical Learning Course
subtitle: via Stanford University Online
layout: default

Course Link
Textbook 1: Introduction to Statistical Learning
Textbook 2: The Elements of Statistical Learning
Supervised Learning

To start:

outcome (y, dependent, response, target)
predictors (x's, independent, regressors, covariates, features)
regression: continuous outcome

Objective: based on the training data we would like to:

accurately predict unseen test cases
understand which inputs affect the outcome, and how
assess the quality of our predictions

Philosophy: understand the ideas behind the techniques and how/when to use them; assess the performance of a method.
Unsupervised Learning

To start:

no outcome, just predictors (features) measured on a set of samples
objective is more fuzzy - find groups of samples that behave similarly, and find linear combinations of features with the most variance (most dissimilar)
difficult to know how we are doing

Statistical Learning vs. Machine Learning


machine learning focuses on bigger, large-scale problems and on pure prediction accuracy

arose as a subfield from AI


statistical learning tries to create models that scientists can interpret, with an emphasis on precision, uncertainty, and model performance

arose as a subfield from Statistics


both focus on supervised and unsupervised learning, and the two fields are very similar.

Intro to Regression models


Our model notation: Y = f(X) + e  where e captures the measurement errors
minimize average prediction error

reducible
irreducible


nearest-neighbor averaging gives a good estimate of the function f when the number of predictors is small - it can perform poorly when the number of predictors is large
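
A minimal sketch of nearest-neighbor averaging with a single predictor, using only the Python standard library. The true function `sin(x)`, the noise level, and the helper name `knn_average` are assumptions made for illustration and do not come from the course.

```python
import math
import random

def knn_average(x0, xs, ys, k):
    """Estimate f(x0) as the mean response of the k training points closest to x0."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))[:k]
    return sum(ys[i] for i in nearest) / k

random.seed(0)
# Hypothetical truth f(x) = sin(x), observed with noise e ~ N(0, 0.1^2).
xs = [random.uniform(0, 6) for _ in range(500)]
ys = [math.sin(x) + random.gauss(0, 0.1) for x in xs]

estimate = knn_average(1.5, xs, ys, k=20)
print(f"f-hat(1.5) = {estimate:.3f}, true f(1.5) = {math.sin(1.5):.3f}")
```

With one predictor and 500 points the 20 nearest neighbors all sit close to the target point, which is why the local average works well here.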

dimensionality and structured models


The curse of dimensionality
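linear model is an important example of a parametric model:   f(X) = Beta0 + Beta1 X1 + ... + Betap Xp

does not rely on nearest-neighbor averaging, so it is less affected by the curse of dimensionality; it is almost never exactly correct, but it is often a good, interpretable approximation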


some trade offs:

prediction accuracy vs interpretation
good fit vs. over-fitting: splines
parsimony vs black box: simple models with few variables versus complex models
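
One way to make the curse of dimensionality listed above concrete, as a sketch under a simplifying assumption: if the data are spread uniformly over a unit hypercube in p dimensions, a sub-cube that captures a given fraction of the points needs an edge length of fraction^(1/p). The function name `edge_length` is hypothetical.

```python
def edge_length(fraction, p):
    """Edge length of a sub-cube holding `fraction` of uniform data in p dimensions."""
    return fraction ** (1.0 / p)

# How "local" is a neighborhood that contains 10% of the data?
for p in (1, 2, 10, 100):
    print(f"p = {p:3d}: edge length = {edge_length(0.10, p):.3f}")
```

As p grows the edge length approaches 1, so a neighborhood holding 10% of the data spans almost the full range of every predictor and is no longer local.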


model selection and bias-variance trade-off


assessing model accuracy:

compute the average squared prediction error (Mean Squared Error, MSE): computed on the training data it may be biased toward more over-fit models
compute it on a separate test dataset for a better indicator of model performance


as the flexibility of the function increases, its variance increases and its bias decreases. so choosing the flexibility based on average test error amounts to a bias-variance trade-off
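
The gap between training and test error can be sketched with nearest-neighbor averaging, where a smaller k means a more flexible fit. Everything in the simulation (the true function `sin(x)`, the noise level, the sample sizes, the helper names) is an assumption chosen for illustration.

```python
import math
import random

def knn_predict(x0, xs, ys, k):
    """Mean response of the k training points closest to x0."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))[:k]
    return sum(ys[i] for i in nearest) / k

def mse(x_eval, y_eval, x_train, y_train, k):
    """Mean squared error of the k-NN fit, evaluated on (x_eval, y_eval)."""
    errors = [(y - knn_predict(x, x_train, y_train, k)) ** 2
              for x, y in zip(x_eval, y_eval)]
    return sum(errors) / len(errors)

def simulate(n, rng):
    # Hypothetical data: y = sin(x) + noise with standard deviation 0.3.
    xs = [rng.uniform(0, 6) for _ in range(n)]
    ys = [math.sin(x) + rng.gauss(0, 0.3) for x in xs]
    return xs, ys

rng = random.Random(1)
x_train, y_train = simulate(100, rng)
x_test, y_test = simulate(100, rng)

results = {}
for k in (1, 5, 25, 100):  # k = 1 is the most flexible fit, k = 100 the least
    train_mse = mse(x_train, y_train, x_train, y_train, k)
    test_mse = mse(x_test, y_test, x_train, y_train, k)
    results[k] = (train_mse, test_mse)
    print(f"k = {k:3d}: training MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")
```

With k = 1 every training point is its own nearest neighbor, so the training MSE is zero even though the test MSE is not; with k = 100 the fit is just the overall mean and both errors are large.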

Quiz responses:
While we typically expect a model with more predictors to have lower Training Set Error, it is not necessarily the case. An extreme counter example would be a case where you have a model with one predictor that is always equal to the response, compared to a model with many predictors that are random.
Introducing the quadratic term will make your model more complicated. More complicated models typically have lower bias at the cost of higher variance. This has an unclear effect on Reducible Error (could go up or down) and no effect on Irreducible Error.
A flexible model will allow us to take full advantage of our large sample size.
The flexible model will cause over fitting due to our small sample size.
A flexible model will be necessary to find the nonlinear effect.
A flexible model will cause us to fit too much of the noise in the problem.

classification


response is qualitative: build a classifier C(X) that assigns a class label from C to a future unlabeled observation X.

assess the uncertainty in each classification
understand the roles of the different predictors among X = (X1, X2,..., Xp)


conditional class probability
Bayes optimal classifier: assign a point to the class with the largest conditional probability at that point (the majority class there)
measure performance using the misclassification error rate
in about 1/3 of classification problems, nearest neighbors will be the best tool
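
A small sketch comparing the Bayes classifier with a k-nearest-neighbor classifier on simulated data, measured by the misclassification error rate. The setup is an assumption made for illustration: one predictor X uniform on [0, 1] and P(Y = 1 | X = x) = x, so the Bayes classifier predicts class 1 whenever x > 0.5. All helper names are hypothetical.

```python
import random

def knn_classify(x0, xs, labels, k):
    """Majority vote among the k training points closest to x0 (labels are 0/1)."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))[:k]
    votes = sum(labels[i] for i in nearest)
    return 1 if 2 * votes > k else 0

def bayes_classify(x0):
    """Bayes classifier for the assumed model P(Y = 1 | X = x) = x."""
    return 1 if x0 > 0.5 else 0

def simulate(n, rng):
    xs = [rng.random() for _ in range(n)]
    labels = [1 if rng.random() < x else 0 for x in xs]
    return xs, labels

def error_rate(predictions, labels):
    """Fraction of predictions that differ from the true labels."""
    return sum(p != y for p, y in zip(predictions, labels)) / len(labels)

rng = random.Random(2)
x_train, y_train = simulate(500, rng)
x_test, y_test = simulate(2000, rng)

knn_error = error_rate([knn_classify(x, x_train, y_train, k=15) for x in x_test], y_test)
bayes_error = error_rate([bayes_classify(x) for x in x_test], y_test)
print(f"k-NN test error:  {knn_error:.3f}")
print(f"Bayes test error: {bayes_error:.3f}")
```

Under this assumed model the expected Bayes error rate is 0.25; no classifier can do better on average, because the classes genuinely overlap.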

simple linear regression


simple is very good! and important for other topics in supervised Learning
questions to ask: is there a relationship, how strong is the relationship, which variables contribute most, how accurately can we predict future values, is the relationship linear, etc.
formula: the outcome is a linear function of the predictor, with coefficients (parameters) for the intercept and slope, plus noise: Y = Beta0 + Beta1 X + e. Hat symbols denote estimated parameters.
residual = actual - estimated; the Residual Sum of Squares (RSS) is the sum of the squared residuals
least squares: choose the coefficient estimates that minimize the RSS
confidence interval - a range that, with about 95% probability over repeated samples, contains the true unknown value of the parameter (e.g. the slope)
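
The least squares estimates and an approximate 95% confidence interval for the slope can be sketched with the closed-form formulas for one predictor. The simulated data (true intercept 1, true slope 2) and the function name `fit_simple` are assumptions for illustration; the interval uses the rough rule of the estimate plus or minus 2 standard errors.

```python
import math
import random

def fit_simple(xs, ys):
    """Least squares fit of y = beta0 + beta1 * x.

    Returns (beta0, beta1, rss, se_beta1).
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    beta0 = y_bar - beta1 * x_bar
    residuals = [y - (beta0 + beta1 * x) for x, y in zip(xs, ys)]
    rss = sum(r ** 2 for r in residuals)
    # Standard error of the slope: sqrt(sigma-hat^2 / Sxx), sigma-hat^2 = RSS / (n - 2).
    se_beta1 = math.sqrt(rss / (n - 2) / sxx)
    return beta0, beta1, rss, se_beta1

rng = random.Random(4)
xs = [rng.uniform(0, 10) for _ in range(100)]
ys = [1.0 + 2.0 * x + rng.gauss(0, 1.5) for x in xs]  # hypothetical truth

beta0, beta1, rss, se_beta1 = fit_simple(xs, ys)
low, high = beta1 - 2 * se_beta1, beta1 + 2 * se_beta1
print(f"intercept = {beta0:.3f}, slope = {beta1:.3f}")
print(f"approximate 95% CI for the slope: [{low:.3f}, {high:.3f}]")
```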

hypothesis testing and confidence intervals


test of a certain value of a parameter (is the slope zero?)

H0 : There is no relationship between X and Y (Beta1 = 0), versus the alternative hypothesis
HA : There is some relationship between X and Y (Beta1 != 0)


to test the null hypothesis we compute a t-statistic given by: t = (hat-Beta1 - 0) / SE(hat-Beta1)

under the null hypothesis this has a t-distribution with n - 2 degrees of freedom (just look this value up)
p-value is based on this statistic: the probability, assuming the null hypothesis is true, of getting a value of t at least as large in absolute value as the one you observed.
in order to have a p-value below 0.05 you need a t-statistic of about 2 in absolute value
to interpret: the p-value is the probability of seeing data at least this extreme under the assumption that the null hypothesis is true (i.e. X has no effect on Y). If the p-value is small, data like this would be very unlikely under the null, which is evidence that X does have an effect on Y.
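
A sketch of the slope test using only the standard library. One loud assumption: the p-value below uses the standard normal distribution as a large-sample stand-in for the t-distribution with n - 2 degrees of freedom, which is reasonable for n = 200 but is not the exact test. The simulated data and the function names are hypothetical.

```python
import math
import random
from statistics import NormalDist

def slope_t_statistic(xs, ys):
    """t = (hat-Beta1 - 0) / SE(hat-Beta1) for simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    beta0 = y_bar - beta1 * x_bar
    rss = sum((y - beta0 - beta1 * x) ** 2 for x, y in zip(xs, ys))
    se_beta1 = math.sqrt(rss / (n - 2) / sxx)
    return (beta1 - 0) / se_beta1

def approx_two_sided_p(t):
    """Two-sided p-value, ASSUMING a normal approximation to the t-distribution."""
    return 2 * (1 - NormalDist().cdf(abs(t)))

rng = random.Random(5)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [3.0 + 0.5 * x + rng.gauss(0, 1.0) for x in xs]  # hypothetical true slope 0.5

t = slope_t_statistic(xs, ys)
p = approx_two_sided_p(t)
print(f"t = {t:.2f}, approximate p-value = {p:.3g}")
```

Under the same approximation, a t-statistic of 1.96 in absolute value gives a p-value of about 0.05, which matches the rule of thumb of "about 2".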


assessing the overall accuracy of the model

compute the Residual Standard Error: RSE = sqrt(RSS / (n - 2)) = sqrt( sum(yi - y-hati)^2 / (n - 2) )
R-squared or fraction of variance explained is 1 - RSS / TSS where TSS = sum(yi - y-bar)^2 is the total sum of squares.
in simple linear regression, R-squared = r^2 (r is the correlation between X and Y)
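
The three quantities above can be checked numerically. This is a sketch on simulated data (the true line and noise level are assumptions, and `fit_summary` is a hypothetical helper); it computes the RSE and R-squared from RSS and TSS, then confirms that R-squared equals the squared correlation in the one-predictor case.

```python
import math
import random

def fit_summary(xs, ys):
    """Return (rse, r_squared, r) for the least squares line through (xs, ys)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    tss = sum((y - y_bar) ** 2 for y in ys)       # total sum of squares
    beta1 = sxy / sxx
    beta0 = y_bar - beta1 * x_bar
    rss = sum((y - beta0 - beta1 * x) ** 2 for x, y in zip(xs, ys))
    rse = math.sqrt(rss / (n - 2))                # residual standard error
    r_squared = 1 - rss / tss                     # fraction of variance explained
    r = sxy / math.sqrt(sxx * tss)                # sample correlation of X and Y
    return rse, r_squared, r

rng = random.Random(6)
xs = [rng.uniform(0, 10) for _ in range(100)]
ys = [1.0 + 2.0 * x + rng.gauss(0, 2.0) for x in xs]  # hypothetical truth

rse, r_squared, r = fit_summary(xs, ys)
print(f"RSE = {rse:.3f}, R-squared = {r_squared:.4f}, r^2 = {r ** 2:.4f}")
```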


multiple linear regression


regression = regress towards the mean. Further reading on the historical context of the term
extends the simple model to have more than one predictor.
Adding lots of extra predictors to the model can just as easily muddy the interpretation of hat-Beta1 as it can clarify it. One often reads in media reports of academic studies that "the investigators controlled for confounding variables," but be skeptical!
Causal inference is a difficult and slippery topic, which cannot be answered with observational data alone without additional assumptions.

regression in real problems - questions to think about


Is at least one of the predictors (X1, X2,...Xp) useful in predicting the response?

F-statistic = ((TSS - RSS) / p) / (RSS / (n - p - 1)), i.e. the total drop in training error per parameter fitted, divided by the mean squared residual
R-squared
Residual Standard Error


Do all of the predictors help explain Y or is only a subset of the predictors useful?

all subsets / best subsets regression: compute the least squares fit for all possible subsets and choose between them based on some criterion that balances training error with model size.
forward selection: start with the null model (no predictors, just the intercept, which is the mean of y), then add variables one at a time - at each step pick the variable that improves the fit the most and keep it in the model.
backward selection: start with all variables and repeatedly remove the variable that has the least significance (look at the t-statistics)
criteria: Mallows' Cp, AIC, BIC, adjusted R-squared, cross-validation
qualitative predictors: also called categorical or factors - dummy variable can be created to represent the categorical feature.


how well does the model fit the data?
given a set of predictor values, what response should we predict, and how accurate is that prediction?
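
The F-statistic and forward selection described above can be sketched in pure Python. This is an illustration under stated assumptions, not a production implementation: the least squares fits solve the normal equations with a small hand-written Gaussian elimination, the data are simulated so that only `x1` and `x2` matter, and all function names are hypothetical. The sketch records the full selection path; where to stop would still be decided by one of the criteria listed above (Cp, AIC, BIC, adjusted R-squared, cross-validation).

```python
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def rss(columns, y):
    """RSS of the least squares fit of y on an intercept plus the given columns."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(design[0])
    xtx = [[sum(row[j] * row[k] for row in design) for k in range(p)] for j in range(p)]
    xty = [sum(row[j] * y[i] for i, row in enumerate(design)) for j in range(p)]
    beta = solve(xtx, xty)
    return sum((y[i] - sum(b * v for b, v in zip(beta, design[i]))) ** 2 for i in range(n))

def f_statistic(columns, y):
    """F = ((TSS - RSS) / p) / (RSS / (n - p - 1)); TSS is the intercept-only RSS."""
    n, p = len(y), len(columns)
    tss, fit_rss = rss([], y), rss(columns, y)
    return ((tss - fit_rss) / p) / (fit_rss / (n - p - 1))

def forward_selection(predictors, y):
    """Greedy path: at each step add the predictor that lowers the RSS the most."""
    chosen, path = [], []
    remaining = list(predictors)
    while remaining:
        best = min(remaining,
                   key=lambda name: rss([predictors[c] for c in chosen + [name]], y))
        chosen.append(best)
        remaining.remove(best)
        path.append((best, rss([predictors[c] for c in chosen], y)))
    return path

rng = random.Random(3)
n = 200
predictors = {name: [rng.gauss(0, 1) for _ in range(n)] for name in ("x1", "x2", "x3")}
# Hypothetical truth: x3 is pure noise and has no effect on y.
y = [1 + 3 * predictors["x1"][i] + predictors["x2"][i] + rng.gauss(0, 0.5) for i in range(n)]

f_stat = f_statistic(list(predictors.values()), y)
path = forward_selection(predictors, y)
print(f"F-statistic for the full model: {f_stat:.1f}")
for name, step_rss in path:
    print(f"added {name}: RSS = {step_rss:.1f}")
```

The RSS drops sharply when the useful predictors enter and barely moves when the noise variable is added, which is the pattern a stopping criterion is meant to detect.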

extension of the linear model


relax the additive and linear assumptions: allow interactions and non-linearity
interactions: synergy effect in marketing, interaction effect in statistics

put in a product term (multiply the variables together)


other topics: outliers, non-constant variance of error terms, high leverage points and collinearity.
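
For the interaction (product) term above, a worked form with two predictors may help. The symbols below are the usual textbook notation rather than anything specific to these notes, with \epsilon playing the role of the error term e:

```latex
\begin{aligned}
Y &= \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \epsilon \\
  &= \beta_0 + (\beta_1 + \beta_3 X_2)\, X_1 + \beta_2 X_2 + \epsilon
\end{aligned}
```

Grouping the terms this way shows that the effect of X_1 on Y is no longer a constant: it is \beta_1 + \beta_3 X_2, so it changes with the level of X_2. That is the synergy effect.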