1. Introduction
    1. the response is a qualitative variable
    2. estimate the probability that X belongs to each category C
    3. logistic regression -> binary response
    4. multiclass logistic regression / discriminant analysis -> multi-class response
2. Logistic regression
    1. p(X) = e^(beta0 + beta1 X) / (1 + e^(beta0 + beta1 X))
    2. transforms the linear model to the range [0, 1]
    3. log(p(X) / (1 - p(X))) = beta0 + beta1 X, the log-odds/logit transformation of p(X)
    4. parameters fitted by maximum likelihood (Fisher scoring); a sketch follows
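A minimal sketch of a maximum-likelihood fit via Fisher scoring (IRLS) on made-up 1-d data; the data, sample size, and iteration count are illustrative assumptions, not from the notes:

```python
import numpy as np

# Toy 1-d data (assumed for illustration): X and binary labels y.
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
y = (rng.random(100) < 1 / (1 + np.exp(-(0.5 + 2.0 * x1)))).astype(float)

X = np.column_stack([np.ones_like(x1), x1])   # design matrix with intercept
beta = np.zeros(2)

for _ in range(25):                            # Fisher scoring / IRLS
    p = 1 / (1 + np.exp(-X @ beta))            # p(X) = e^eta / (1 + e^eta)
    W = p * (1 - p)                            # variance weights
    grad = X.T @ (y - p)                       # score (gradient of log-likelihood)
    H = X.T @ (X * W[:, None])                 # Fisher information
    beta += np.linalg.solve(H, grad)

print(beta)  # should land roughly near the true (0.5, 2.0)
```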
1. Simple linear regression
    1. assumes the dependence of the response Y on the predictors X1, …, Xp is linear
    2. simple models are a good starting point: easy to fit and to interpret
    3. residual: e_i = y_i - yhat_i
    4. residual sum of squares: RSS = e_1^2 + … + e_n^2
    5. optimisation problem of minimising the RSS; has a closed-form solution
    6. measures of precision -> how confidently we can say the estimated coefficient differs from 0 (no relationship); see the sketch after this list
        1. standard error of the slope
        2. SE(beta1)^2 = var(e) / sum_i (x_i - xbar)^2, i.e. noise variance over the spread of X around its mean
        3. SE of the intercept
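A minimal sketch of the closed-form solution and the slope's standard error on toy data (the data and noise level are assumptions for illustration):

```python
import numpy as np

# Toy data: y = 1 + 3x + noise.
rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = 1.0 + 3.0 * x + rng.normal(scale=0.5, size=50)

xbar, ybar = x.mean(), y.mean()
beta1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
beta0 = ybar - beta1 * xbar                    # closed-form OLS solution

e = y - (beta0 + beta1 * x)                    # residuals e_i = y_i - yhat_i
rss = np.sum(e ** 2)                           # residual sum of squares
sigma2 = rss / (len(x) - 2)                    # unbiased estimate of var(e)
se_beta1 = np.sqrt(sigma2 / np.sum((x - xbar) ** 2))

print(beta0, beta1, se_beta1)                  # near (1.0, 3.0) with small SE
```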
1. Regression model
    1. target/response to predict: Y
    2. features/inputs/predictors: X = (X1, X2, X3)
    3. Y = f(X) + e, where e captures measurement error and other discrepancies
    4. good for
        1. making predictions
        2. understanding which components of X are important
        3. possibly understanding how each component affects Y (depending on the complexity of f)
    5. at a particular point, e.g. f(4) = E(Y | X = 4), where E denotes the expected value
    6. regression function: f(x) = E(Y | X = x), the conditional expectation (sketch below)
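A minimal sketch of estimating E(Y | X = x) by local averaging on toy data; the window width h and the data are illustrative assumptions:

```python
import numpy as np

# Approximate the regression function f(x) = E(Y | X = x) by averaging
# the y-values of training points whose x falls near the query point.
def local_average(x0, x, y, h=0.3):
    mask = np.abs(x - x0) < h              # neighbourhood of x0
    return y[mask].mean()                  # sample version of E(Y | X ~ x0)

rng = np.random.default_rng(2)
x = rng.uniform(-3, 3, size=500)
y = np.sin(x) + rng.normal(scale=0.2, size=500)
print(local_average(0.0, x, y))            # close to sin(0) = 0
```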
1. Look at the data first before jumping into analysis
2. Supervised learning problem
    1. tasks
        1. accurately predict unseen test cases
        2. understand which inputs affect the outcome, and how
        3. assess the quality of our predictions and inferences
    2. know when and how to use each method
    3. evaluate the model
3. Unsupervised learning
    1. the data is unlabeled
1. Parametric vs nonparametric methods
    1. parametric methods
        1. fit a fixed number of parameters
    2. nonparametric methods
        1. the number of parameters grows with the dataset size
        2. k-nearest neighbours
        3. Gaussian/uniform kernels (a sketch of both follows this list)
    3. comparison
        1. parametric
            1. limited complexity
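A minimal sketch comparing the two nonparametric regressors named above on toy data; k, the bandwidth h, and the data are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-3, 3, size=300)
y = np.cos(x) + rng.normal(scale=0.2, size=300)

def knn_predict(x0, k=10):
    idx = np.argsort(np.abs(x - x0))[:k]     # k closest training points
    return y[idx].mean()                     # average their responses

def kernel_predict(x0, h=0.3):
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)   # Gaussian kernel weights
    return np.sum(w * y) / np.sum(w)         # weighted average of y

# Both effectively use the whole training set at prediction time:
# the "parameters" are the data points themselves.
print(knn_predict(1.0), kernel_predict(1.0))  # both near cos(1) ~ 0.54
```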
1. Scaling variational inference & unbiased estimates
    a. scale to big datasets
        i. traditionally too slow for big data
        ii. and not very beneficial there
    b. mixture model (Bayesian + deep learning); a minibatch sketch follows
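A minimal sketch of the subsampling idea that makes variational inference scale: a rescaled minibatch sum is an unbiased estimate of the full-data sum that appears in the objective (the toy numbers are assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
N = 100_000
loglik = rng.normal(size=N)                 # stand-in per-point log-likelihoods

full_sum = loglik.sum()                     # too expensive to recompute per step
m = 100                                     # minibatch size
# N/m * (minibatch sum) has expectation equal to the full-data sum.
estimates = [N / m * loglik[rng.choice(N, m)].sum() for _ in range(1000)]

print(full_sum, np.mean(estimates))         # estimator mean ~ full sum (unbiased)
```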
1. Monte Carlo estimation
    1. approximation by simulation
    2. easy to program, easy to parallelise, can be slow for some problems
    3. quick and dirty
    4. unbiased
    5. like an infinitely large ensemble of neural networks
    6. full Bayesian modelling
    7. approximates intractable expectations
    8. used in the M-step of the EM algorithm
2. Sampling from 1-d distributions (see the sketch below)
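A minimal sketch of both ideas: inverse-CDF sampling from a 1-d distribution (Exponential(1), chosen as an example) and an unbiased Monte Carlo estimate of an expectation:

```python
import numpy as np

rng = np.random.default_rng(5)

u = rng.random(100_000)             # uniforms on [0, 1)
x = -np.log(1 - u)                  # inverse CDF of Exponential(1)

# Unbiased Monte Carlo estimate of E[x^2]; the exact value is 2.
print(np.mean(x ** 2))
```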
1. Topic modelling
    1. decompose books into distributions over topics
    2. assign topics to texts
    3. compute similarity/distance between the topic vectors of texts (sketch after this list)
        1. Euclidean distance
        2. cosine similarity
2. Dirichlet distribution
    1. support: the unit simplex
    2. for K = 3, a distribution over a triangle
3. Latent Dirichlet Allocation
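A minimal sketch: topic-proportion vectors drawn from a Dirichlet live on the unit simplex, and documents can be compared with the two measures above (the concentration parameter alpha is an assumption):

```python
import numpy as np

rng = np.random.default_rng(6)
doc_a, doc_b = rng.dirichlet(alpha=[0.5, 0.5, 0.5], size=2)

print(doc_a.sum())                                    # 1.0: on the unit simplex
print(np.linalg.norm(doc_a - doc_b))                  # Euclidean distance
cos = doc_a @ doc_b / (np.linalg.norm(doc_a) * np.linalg.norm(doc_b))
print(cos)                                            # cosine similarity
```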
1. Goal: compute an approximate posterior distribution
2. Steps:
    1. select a family of distributions Q as the variational family, e.g. a product of factors q_i(z_i)
    2. find the best approximation q(z) of p*(z) by minimising the KL divergence KL(q || p*)
3. Mean-field approximation
    1. coordinate descent to minimise the KL divergence (see the sketch after this list)
    2. Ising model
4. Variational EM
    1. use variational inference in the E-step: instead of computing the full posterior, find its best approximation within a family of distributions Q
    2. called variational EM
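A minimal sketch of mean-field coordinate updates on a 2-d Gaussian target, the classic textbook example (Bishop, PRML 10.1.2); mu and Sigma are assumptions chosen for illustration:

```python
import numpy as np

# Target p*(z) = N(mu, Sigma). Under the mean-field factorisation
# q(z) = q1(z1) q2(z2), each factor is Gaussian with variance 1/Lam[i, i];
# only the variational means m need iterating.
mu = np.array([1.0, -1.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])
Lam = np.linalg.inv(Sigma)      # precision matrix of the target

m = np.zeros(2)                 # variational means, arbitrary init
for _ in range(50):             # coordinate descent on KL(q || p*)
    m[0] = mu[0] - Lam[0, 1] / Lam[0, 0] * (m[1] - mu[1])
    m[1] = mu[1] - Lam[1, 0] / Lam[1, 1] * (m[0] - mu[0])

print(m)  # converges to mu; the factorised q underestimates the variances
```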
1. General form of EM
    1. concave functions
    2. satisfy Jensen's inequality: f(E[t]) >= E[f(t)]
    3. Kullback-Leibler divergence: measures the difference between two probability distributions (numeric sketch after this list)
        1. KL divergence: KL(q || p) = E_q[log(q(x) / p(x))]
        2. compare the two densities at each point x on a log scale, then take the expectation under q
        3. not symmetric: KL(q || p) != KL(p || q)
        4. equals 0 when a distribution is compared with itself
        5. always non-negative
    4. EM
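A minimal numeric sketch of the KL properties listed above, using two made-up discrete distributions:

```python
import numpy as np

def kl(q, p):
    return np.sum(q * np.log(q / p))   # E_q[log q(x)/p(x)]

q = np.array([0.5, 0.3, 0.2])
p = np.array([0.2, 0.4, 0.4])

print(kl(q, p), kl(p, q))   # different values: not symmetric
print(kl(q, q))             # 0.0 when comparing a distribution to itself
print(kl(q, p) >= 0)        # True: always non-negative
```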