Linear Regression
1. Simple linear regression
1. Assumes the dependence of response Y on predictors X1,…Xp is linear
2. Simple is good
3. residual: ei = yi − yi_hat
4. residual sum of squares: RSS = e1^2 + e2^2 + … + en^2
5. fitting is an optimisation problem: minimise the RSS; it has a closed-form solution (see the R sketch after this block)
6. A measure of precision: the standard error says how far the estimate is likely to be from the true value, and whether it can be distinguished from 0 (no relationship)
1. Standard error of the slope
2. SE(beta1_hat)^2 = var(e) / sum((xi − x_bar)^2): noise variance divided by the spread of X around its mean
3. SE of the intercept
4. The SEs can be used to compute confidence intervals
5. a 95% CI is a range that, with 95% probability, will contain the true unknown value of the parameter
1. under repeated sampling
2. in 95% of repeated samples the interval contains the true value
3. Confidence interval is a frequentist concept: the interval, and not the true parameter, is considered random
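A minimal R sketch of the closed-form fit and the slope's standard error, assuming simulated data (the variable names here are illustrative, not from the original notes):

    set.seed(1)
    x <- rnorm(100)
    y <- 2 + 3 * x + rnorm(100)                  # simulated data: true intercept 2, slope 3
    n <- length(x)
    # closed-form least-squares estimates
    beta1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
    beta0 <- mean(y) - beta1 * mean(x)
    res <- y - (beta0 + beta1 * x)               # residuals ei = yi - yi_hat
    rss <- sum(res^2)                            # residual sum of squares
    s2  <- rss / (n - 2)                         # estimate of var(e)
    se1 <- sqrt(s2 / sum((x - mean(x))^2))       # standard error of the slope
    c(beta1 - 1.96 * se1, beta1 + 1.96 * se1)    # approximate 95% CI for the slope
    confint(lm(y ~ x))                           # lm() gives essentially the same interval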
2. Hypothesis testing and confidence intervals
1. H0: beta1=0, H1: beta1!=0
2. To test the null hypothesis, compute t-statistic, get p-value
3. the p-value is the probability of observing data at least as extreme as what was actually observed, assuming the null hypothesis is true
4. Residual standard error: RSE = sqrt(RSS / (n − 2))
5. R^2 = (TSS − RSS) / TSS, where TSS = total sum of squares; it compares the no-model error with the model error (assesses the overall accuracy of the model)
6. In simple linear regression, R^2 equals the squared correlation r^2 between X and Y
7. approximate 95% CI: [beta1_hat − 1.96·SE, beta1_hat + 1.96·SE] (an R sketch of these quantities follows this block)
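The same quantities can be read off summary(); a small sketch continuing the simulated x and y above:

    fit <- lm(y ~ x)
    n <- length(x)
    summary(fit)                              # t-statistics, p-values, RSE, R-squared
    sqrt(sum(residuals(fit)^2) / (n - 2))     # RSE by hand, matches summary(fit)$sigma
    cor(x, y)^2                               # equals R-squared in simple linear regression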
3. Multiple Linear Regression
1. regression towards the mean (the historical origin of the term "regression")
2. multiple predictors, form a hyperplane
3. Interpreting regression coefficients
1. If the predictors are uncorrelated (a balanced design)
1. each coefficient can be estimated and tested separately
2. each coefficient measures the change in Y per unit change in Xj with all other predictors held fixed
2. correlated predictors cause problems
1. the variances of the coefficient estimates tend to increase
2. interpretation becomes hazardous
3. Claims should be about correlation, not causation
4. Essentially all models are wrong, but some are useful
5. The only way to find out what happens when a complex system is disturbed is to disturb the system, not merely observe it passively (a causal statement)
6. Multiple least square estimates
1. minimise the RSS
2. a |t|-statistic greater than about 2 corresponds roughly to p < 0.05 (for moderate-to-large n)
3. an effect may be insignificant in the presence of other predictors yet significant on its own (see the simulated example after this block)
4. correlation between predictors can make one of them look insignificant
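A small simulated illustration of the last two points (the predictors x1 and x2 are made up for this sketch): only x1 truly drives the response, but x2 is strongly correlated with x1, so x2 looks significant on its own and loses significance once x1 enters the model:

    set.seed(2)
    x1 <- rnorm(200)
    x2 <- x1 + rnorm(200, sd = 0.3)    # x2 strongly correlated with x1
    y2 <- 1 + 2 * x1 + rnorm(200)      # y depends on x1 only
    summary(lm(y2 ~ x2))               # x2 looks highly significant by itself
    summary(lm(y2 ~ x1 + x2))          # with x1 included, x2's coefficient shrinks and its SE inflates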
4. Model evaluation
1. Use the F-statistic to test whether at least one predictor is useful in predicting the response
1. F = ((TSS − RSS) / p) / (RSS / (n − p − 1))
2. Decide on the important variables
1. all-subsets (best-subsets) regression; with p predictors there are 2^p possible models, so we usually cannot examine them all
3. Forward selection
1. Tractable and gives a good sequence of models
2. Begin with the null model with only intercept
3. add variables one at a time, each time adding the one that gives the lowest RSS
4. Backward selection
1. Start with all variables
2. remove the least significant variable one at a time (using the t-statistics)
5. Model Selection
1. criteria such as AIC, BIC, and cross-validation (a step() sketch follows this block)
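A hedged sketch of the F-statistic and of stepwise selection with R's step(); note that step() ranks models by AIC rather than raw RSS or t-statistics, and the data here are simulated:

    set.seed(3)
    d <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
    d$y <- 1 + 2 * d$x1 - d$x2 + rnorm(100)
    fit <- lm(y ~ x1 + x2 + x3, data = d)
    tss <- sum((d$y - mean(d$y))^2)
    rss <- sum(residuals(fit)^2)
    p <- 3; n <- nrow(d)
    ((tss - rss) / p) / (rss / (n - p - 1))   # F-statistic, matches summary(fit)$fstatistic
    # forward selection: start from the intercept-only model and add variables
    step(lm(y ~ 1, data = d), scope = ~ x1 + x2 + x3, direction = "forward")
    # backward selection: start from the full model and drop variables
    step(fit, direction = "backward")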
6. Qualitative predictors
1. categorical predictors or factor variables
2. Dummy variables to represent categorical predictors
1. a factor with k levels is coded with k − 1 dummy (0/1) variables
2. the category with no dummy variable becomes the baseline
3. the choice of baseline changes only the contrasts, not the fit (see the sketch after this block)
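A short sketch of how R turns a factor into dummy variables (the category names are invented for illustration):

    region <- factor(c("east", "west", "south", "east", "west"))
    yq <- c(10, 12, 9, 11, 13)
    model.matrix(~ region)                   # 3 levels -> 2 dummy columns; "east" is the baseline
    lm(yq ~ region)                          # coefficients are contrasts against the baseline
    lm(yq ~ relevel(region, ref = "west"))   # new baseline changes the contrasts, not the fitted values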
5. Extension of the Linear Model
1. Interaction and nonlinearity
2. interaction effect
1. include products of variables to model interactions
2. interaction terms can explain part of the variance left unexplained by the main effects alone
3. Hierarchy principle
1. If we include an interaction in a model, we should also include the main effects, even if the p-values associated with their coefficients are not significant.
2. interactions are hard to interpret in a model without main effects; their meaning changes
3. nonlinear effects, e.g. polynomial terms (a sketch follows this block)
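A sketch of interaction and polynomial terms in R, assuming simulated data (the names tv, radio, sales echo the usual advertising example, but the numbers are invented):

    set.seed(4)
    tv <- runif(100, 0, 10); radio <- runif(100, 0, 10)
    sales <- 2 + 0.5 * tv + 0.3 * radio + 0.2 * tv * radio + rnorm(100)
    lm(sales ~ tv * radio)              # expands to tv + radio + tv:radio (hierarchy principle)
    lm(sales ~ tv + radio + tv:radio)   # the same model, interaction written explicitly
    z <- rnorm(100); w <- 1 + z - 2 * z^2 + rnorm(100)
    lm(w ~ z + I(z^2))                  # nonlinear (quadratic) effect via I()
    lm(w ~ poly(z, 2))                  # orthogonal polynomials, same fitted values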
4. Generalization
1. classification: logistic regression, SVM
2. Nonlinearity: kernel smoothing, splines and generalized additive models, nearest neighbours
3. Interactions: Tree-based methods, bagging, random forests, boosting
4. Regularized fitting: ridge regression and lasso
6. Linear Regression in R
7. A positive correlation only means that the univariate regression has a positive coefficient; with several predictors in the model the coefficient can change, and even flip sign (see the sketch below).
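A simulated illustration of this last point (the variable names are illustrative): y is positively correlated with x2 on its own, yet x2's coefficient turns negative once the correlated predictor x1 is in the model:

    set.seed(5)
    x1 <- rnorm(500)
    x2 <- x1 + rnorm(500, sd = 0.5)
    y <- 3 * x1 - x2 + rnorm(500)
    cor(y, x2)              # positive marginal correlation
    coef(lm(y ~ x2))        # positive coefficient in the univariate regression
    coef(lm(y ~ x1 + x2))   # the coefficient on x2 is negative once x1 is included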