title: Machine Learning Study Notes
- Supervised Learning
  - Regression problems: e.g., predicting a house's sale price from existing home-sales data
  - Classification problems: e.g., classifying breast-cancer tumors as benign or malignant based on features such as tumor size and patient age
- Unsupervised Learning: given some data, automatically discover its underlying structure or groupings
  - Clustering Google News stories
  - Clustering genetic data by population or country
  - Separating each speaker's voice from a single party recording (the "cocktail party" problem)
The process of taking the partial derivative with respect to $\theta_j$:
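A sketch of the standard derivation, assuming the squared-error cost from the regression setting above:

$$\frac{\partial}{\partial \theta_j} J(\theta) = \frac{\partial}{\partial \theta_j}\, \frac{1}{2m} \sum_{i=1}^{m} \big(h_\theta(x^{(i)}) - y^{(i)}\big)^2 = \frac{1}{m} \sum_{i=1}^{m} \big(h_\theta(x^{(i)}) - y^{(i)}\big)\, x_j^{(i)}$$

which yields the gradient descent update $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \big(h_\theta(x^{(i)}) - y^{(i)}\big)\, x_j^{(i)}$. The same gradient form also holds for logistic regression with the cross-entropy cost.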
"Conjugate gradient", "BFGS", and "L-BFGS" are more sophisticated, faster ways to optimize θ that can be used instead of gradient descent. We suggest that you should not write these more sophisticated algorithms yourself (unless you are an expert in numerical computing) but use the libraries instead, as they're already tested and highly optimized.
We first need to provide a function that evaluates the following two values for a given input θ: the cost $J(\theta)$ and the gradient $\frac{\partial}{\partial \theta_j} J(\theta)$.
We can write a single function that returns both of these:
```matlab
function [jVal, gradient] = costFunction(theta)
  jVal = [...code to compute J(theta)...];
  gradient = [...code to compute derivative of J(theta)...];
end
```
Then we can use Octave's "fminunc()" optimization algorithm along with the "optimset()" function, which creates an object containing the options we want to send to "fminunc()".
```matlab
options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2,1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
```
We give to the function "fminunc()" our cost function, our initial vector of theta values, and the "options" object that we created beforehand.
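As a concrete sketch, here is a toy cost with its minimum at $\theta = (5, 5)$ (this particular function is assumed here purely for illustration). Save it as costFunction.m:

```matlab
% Toy cost J(theta) = (theta1 - 5)^2 + (theta2 - 5)^2, minimized at theta = [5; 5]
function [jVal, gradient] = costFunction(theta)
  jVal = (theta(1) - 5)^2 + (theta(2) - 5)^2;
  gradient = zeros(2, 1);
  gradient(1) = 2 * (theta(1) - 5);   % dJ/d(theta1)
  gradient(2) = 2 * (theta(2) - 5);   % dJ/d(theta2)
end
```

Running the three lines above against this costFunction should return optTheta ≈ [5; 5], functionVal ≈ 0, and exitFlag = 1 (converged).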
Overfitting, or high variance, is caused by a hypothesis function that fits the available data well but does not generalize to predict new data. It is usually caused by a complicated function that creates a lot of unnecessary curves and angles unrelated to the data.
There are two main options to address the issue of overfitting:
- Reduce the number of features:
  - Manually select which features to keep.
  - Use a model selection algorithm (studied later in the course).
- Regularization:
  - Keep all the features, but reduce the magnitude of the parameters $\theta_j$.
  - Regularization works well when we have a lot of slightly useful features.
If we have overfitting from our hypothesis function, we can reduce the weight that some of the terms in our function carry by increasing their cost.
We could regularize all of our theta parameters in a single summation:

$$\min_\theta\ \frac{1}{2m}\left[\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)^2 + \lambda\sum_{j=1}^{n}\theta_j^2\right]$$
The λ, or lambda, is the regularization parameter. It determines how much the costs of our theta parameters are inflated.
Using the above cost function with the extra summation, we can smooth the output of our hypothesis function to reduce overfitting. If lambda is chosen to be too large, it may smooth out the function too much and cause underfitting.
Conversely, what would happen if λ = 0 or is chosen too small? The regularization term would have little effect, and the hypothesis would remain prone to overfitting.
Gradient Descent
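This heading presumably refers to gradient descent for the regularized cost above; the standard update, which leaves $\theta_0$ unregularized, is:

$$\theta_0 := \theta_0 - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)\, x_0^{(i)}$$

$$\theta_j := \theta_j - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)\, x_j^{(i)} + \frac{\lambda}{m}\,\theta_j\right] \qquad j \in \{1, 2, \dots, n\}$$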
Training a Neural Network
- Randomly initialize the weights
- Implement forward propagation to get $h_\Theta(x^{(i)})$ for any $x^{(i)}$
- Implement the cost function
- Implement backpropagation to compute partial derivatives
- Use gradient checking to confirm that your backpropagation works. Then disable gradient checking.
- Use gradient descent or a built-in optimization function to minimize the cost function with the weights in Theta. When we perform forward and back propagation, we loop over every training example:
```matlab
for i = 1:m,
  % Perform forward propagation and backpropagation using example (x(i), y(i)),
  % getting activations a(l) and delta terms d(l) for l = 2,...,L.
end;
```
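A sketch of what the body of that loop typically looks like for a network with a single hidden layer. The shapes and variable names here are assumptions for illustration, not from the notes above: X is m×n, Y is K×m one-hot labels, Theta1 is h×(n+1), Theta2 is K×(h+1).

```matlab
Delta1 = zeros(size(Theta1));          % gradient accumulator for layer 1 weights
Delta2 = zeros(size(Theta2));          % gradient accumulator for layer 2 weights
for i = 1:m,
  % Forward propagation for example i
  a1 = [1; X(i, :)'];                  % input activations plus bias unit
  z2 = Theta1 * a1;
  a2 = [1; 1 ./ (1 + exp(-z2))];       % hidden activations (sigmoid) plus bias
  a3 = 1 ./ (1 + exp(-(Theta2 * a2))); % output activations h_Theta(x)
  % Backpropagation for example i
  d3 = a3 - Y(:, i);                   % output-layer error
  d2 = (Theta2(:, 2:end)' * d3) .* a2(2:end) .* (1 - a2(2:end)); % hidden-layer error
  % Accumulate gradient contributions
  Delta2 = Delta2 + d3 * a2';
  Delta1 = Delta1 + d2 * a1';
end
Theta1_grad = Delta1 / m;   % add (lambda/m) * Theta(:, 2:end) terms to regularize
Theta2_grad = Delta2 / m;
```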
- Getting more training examples: Fixes high variance
- Trying smaller sets of features: Fixes high variance
- Adding features: Fixes high bias
- Adding polynomial features: Fixes high bias
- Decreasing λ: Fixes high bias
- Increasing λ: Fixes high variance
A neural network with fewer parameters is prone to underfitting. It is also computationally cheaper.
A large neural network with more parameters is prone to overfitting. It is also computationally expensive. In this case you can use regularization (increase λ) to address the overfitting.
Using a single hidden layer is a good starting default. You can train networks with different numbers of hidden layers, evaluate them on your cross validation set, and then select the one that performs best.
Model Complexity Effects:
Lower-order polynomials (low model complexity) have high bias and low variance; such models consistently fit the data poorly.
Higher-order polynomials (high model complexity) fit the training data extremely well and the test data extremely poorly. These have low bias on the training data, but very high variance.
In reality, we would want to choose a model somewhere in between, that can generalize well but also fits the data reasonably well.
Precision
The fraction of predicted positives that are actual positives, i.e., TP / (TP + FP). (Of all patients where we predicted y=1, what fraction actually has cancer?)
Recall
The fraction of actual positives that we correctly predicted as positive, i.e., TP / (TP + FN). (Of all patients that actually have cancer, what fraction did we correctly detect as having cancer?)
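A minimal sketch of computing these in Octave; the 0/1 vectors y (actual labels) and pred (predictions) are hypothetical here:

```matlab
tp = sum((pred == 1) & (y == 1));   % true positives
fp = sum((pred == 1) & (y == 0));   % false positives
fn = sum((pred == 0) & (y == 1));   % false negatives

precision = tp / (tp + fp);
recall    = tp / (tp + fn);
F1        = 2 * precision * recall / (precision + recall);  % harmonic mean of both
```

The F1 score combines precision and recall into a single number, which is useful when comparing models that trade one off against the other.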
PCA function
```matlab
function [U, S] = pca(X)
%PCA Run principal component analysis on the dataset X
%   [U, S] = pca(X) computes eigenvectors of the covariance matrix of X.
%   Returns the eigenvectors in U and the eigenvalues (on the diagonal) in S.

% Useful values
[m, n] = size(X);

% You need to return the following variables correctly.
U = zeros(n);
S = zeros(n);

% ====================== YOUR CODE HERE ======================
% Instructions: You should first compute the covariance matrix. Then, you
%               should use the "svd" function to compute the eigenvectors
%               and eigenvalues of the covariance matrix.
%
% Note: When computing the covariance matrix, remember to divide by m (the
%       number of examples).
%
Sigma = (1 / m) * (X' * X);   % covariance matrix (X assumed mean-normalized)
[U, S, V] = svd(Sigma);       % columns of U are the principal components
% =========================================================================

end
```
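A hypothetical usage sketch, assuming X has already been mean-normalized (e.g., by subtracting each column's mean):

```matlab
[U, S] = pca(X);    % principal components of the normalized data

% Fraction of variance retained by the first K components:
K = 2;              % example value
s = diag(S);
retained = sum(s(1:K)) / sum(s);
```

A common rule of thumb is to choose the smallest K that retains about 99% of the variance.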
projectData
```matlab
function Z = projectData(X, U, K)
%PROJECTDATA Computes the reduced data representation when projecting only
%on to the top K eigenvectors
%   Z = projectData(X, U, K) computes the projection of the normalized
%   inputs X into the reduced dimensional space spanned by the first K
%   columns of U. It returns the projected examples in Z.

% You need to return the following variables correctly.
Z = zeros(size(X, 1), K);

% ====================== YOUR CODE HERE ======================
% Instructions: Compute the projection of the data using only the top K
%               eigenvectors in U (first K columns).
%               For the i-th example X(i,:), the projection onto the k-th
%               eigenvector is given as follows:
%                   x = X(i, :)';
%                   projection_k = x' * U(:, k);
%
for i = 1:size(X, 1),
  for j = 1:K,
    x = X(i, :)';                  % i-th example as a column vector
    projection_k = x' * U(:, j);   % scalar projection onto the j-th component
    Z(i, j) = projection_k;
  end
end
% =============================================================

end
```
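Since row i of Z is just X(i,:) multiplied by the first K columns of U, the double loop above is equivalent to a single matrix product; a vectorized form under the same conventions:

```matlab
Z = X * U(:, 1:K);   % project all examples at once
```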
recoverData
```matlab
function X_rec = recoverData(Z, U, K)
%RECOVERDATA Recovers an approximation of the original data when using the
%projected data
%   X_rec = recoverData(Z, U, K) recovers an approximation of the original
%   data that has been reduced to K dimensions. It returns the approximate
%   reconstruction in X_rec.

% You need to return the following variables correctly.
X_rec = zeros(size(Z, 1), size(U, 1));

% ====================== YOUR CODE HERE ======================
% Instructions: Compute the approximation of the data by projecting back
%               onto the original space using the top K eigenvectors in U.
%
%               For the i-th example Z(i,:), the (approximate) recovered
%               data for dimension j is given as follows:
%                   v = Z(i, :)';
%                   recovered_j = v' * U(j, 1:K)';
%
%               Notice that U(j, 1:K) is a row vector.
%
for i = 1:size(Z, 1),
  for j = 1:size(U, 1),
    v = Z(i, :)';                    % i-th projected example as a column vector
    recovered_j = v' * U(j, 1:K)';   % approximate value of original dimension j
    X_rec(i, j) = recovered_j;
  end
end
% =============================================================

end
```
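Likewise, the reconstruction loop is equivalent to one matrix product:

```matlab
X_rec = Z * U(:, 1:K)';   % map all projected examples back to n dimensions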
Anomaly Detection vs. Supervised Learning
Original model vs. Multivariate Gaussian
Problem motivation
Optimization algorithm
Gradient Descent with large datasets
The learning rate α is typically held constant. For stochastic gradient descent, if we want θ to actually converge (rather than wander around the minimum), we can slowly decrease α over time.
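One common schedule is to decay α with the iteration number, where const1 and const2 are tuning constants:

$$\alpha = \frac{\text{const1}}{\text{iterationNumber} + \text{const2}}$$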