R^2

  • ESS (explained sum of squares)

    is a quantity used in describing how well a model, often a regression model, represents the data being modelled. In particular, the explained sum of squares measures how much variation there is in the modelled values, and this is compared to the total sum of squares, which measures how much variation there is in the observed data, and to the residual sum of squares, which measures the variation in the modelling errors.

If $\hat{a}$ and $\hat{b_i}$ are the estimated coefficients, then

$$\hat{y_i} = \hat{a}+\hat{b_1}x_{1i}+\hat{b_2}x_{2i}+\cdots$$

is the $i^{th}$ predicted value of the response variable. The ESS is the sum of the squares of the differences between the predicted values and the mean value of the response variable:

$$ESS = \sum_{i=1}^n(\hat{y_i}-\bar{y})^2$$
  • RSS (residual sum of squares)

    is the sum of the squares of the residuals (the deviations of the predicted values from the actual empirical values of the data). It is a measure of the discrepancy between the data and an estimation model; a small RSS indicates a tight fit of the model to the data. It is used as an optimality criterion in parameter selection and model selection. In a model with a single explanatory variable, RSS is given by:
$$RSS=\sum_{i=1}^n(y_i-\hat{y_i})^2$$

where $y_i$ is the $i^{th}$ value of the variable to be predicted, $x_i$ is the $i^{th}$ value of the explanatory variable, and $\hat{y_i}$ is the predicted value of $y_i$.

  • TSS (total sum of squares)

    is a quantity that appears as part of a standard way of presenting the results of such analyses (for example, regression or analysis of variance). It is defined as the sum, over all observations, of the squared differences of each observation from the overall mean.
$$TSS = \sum_{i=1}^n(y_i-\bar{y})^2$$

In linear regression:

$$\sum_{i=1}^n(y_i-\bar{y})^2 = \sum_{i=1}^n(y_i-\hat{y_i})^2+\sum_{i=1}^n(\hat{y_i}-\bar{y})^2$$

$$TSS = RSS + ESS$$
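A quick numerical check of this identity (a minimal sketch with synthetic data; the variable names are mine):

```python
# Fit an ordinary least-squares line and verify TSS = RSS + ESS numerically.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 + 1.5 * x + rng.normal(scale=0.5, size=100)

X = np.column_stack([np.ones_like(x), x])        # design matrix with intercept
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta

tss = np.sum((y - y.mean()) ** 2)
rss = np.sum((y - y_hat) ** 2)
ess = np.sum((y_hat - y.mean()) ** 2)

print(tss, rss + ess)        # the two numbers agree up to rounding
print("R^2 =", ess / tss)    # equivalently 1 - rss / tss
```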

Derivation
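A brief sketch, valid for least squares with an intercept. Write $y_i - \bar{y} = (y_i - \hat{y_i}) + (\hat{y_i} - \bar{y})$ and expand the square:

$$\sum_{i=1}^n(y_i-\bar{y})^2 = \sum_{i=1}^n(y_i-\hat{y_i})^2 + \sum_{i=1}^n(\hat{y_i}-\bar{y})^2 + 2\sum_{i=1}^n(y_i-\hat{y_i})(\hat{y_i}-\bar{y})$$

The residuals $e_i = y_i - \hat{y_i}$ sum to zero and are orthogonal to the fitted values, so the cross term $2\sum_i e_i\hat{y_i} - 2\bar{y}\sum_i e_i = 0$, which gives $TSS = RSS + ESS$.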


t-test


Softmax function

(1) Definition

WIKI: The softmax function is a generalization of the logistic function that 'squashes' a K-dimensional vector $z$ of arbitrary real values to a K-dimensional vector $\sigma(z)$ of real values in the range [0,1] that add up to 1. The function is given by

$$\sigma(z)_j =\frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}} \quad \text{for } j=1,\ldots,K.$$

In probability theory, the output of the softmax function can be used to represent a probability distribution over K different possible outcomes.
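A minimal numpy sketch of this definition (the max is subtracted before exponentiating purely for numerical stability; it cancels in the ratio):

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())          # subtracting the max does not change the result
    return e / e.sum()

print(softmax([1.0, 2.0, 3.0]))      # [0.09003057 0.24472847 0.66524096], sums to 1
```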

In other words:

Softmax regression is used for multi-class classification (as opposed to the binary classification handled by logistic regression), i.e. the class label $\textstyle y$ can take $\textstyle k$ different values (rather than 2). Therefore, for a training set $\{ (x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)}) \}$, we have $y^{(i)} \in \{1, 2, \ldots, k\}$.

For a given test input $\textstyle x$, we want the hypothesis to estimate the probability $\textstyle p(y=j | x)$ for every class $\textstyle j$; that is, we want to estimate the probability of each possible classification of $\textstyle x$. The hypothesis therefore outputs a $\textstyle k$-dimensional vector (whose elements sum to 1) holding these $\textstyle k$ estimated probabilities. Concretely, the hypothesis $\textstyle h_{\theta}(x)$ takes the form:

$$h_\theta(x^{(i)}) = \begin{bmatrix} p(y^{(i)} = 1 | x^{(i)}; \theta) \\\ p(y^{(i)} = 2 | x^{(i)}; \theta) \\\ \vdots \\\ p(y^{(i)} = k | x^{(i)}; \theta) \end{bmatrix} = \frac{1}{ \sum_{j=1}^{k}{e^{ \theta_j^T x^{(i)} }} } \begin{bmatrix} e^{ \theta_1^T x^{(i)} } \\\ e^{ \theta_2^T x^{(i)} } \\\ \vdots \\\ e^{ \theta_k^T x^{(i)} } \\\ \end{bmatrix}$$

Note: each class $\textstyle j$ has its own parameter vector $\theta_j$.

(2) Cost function

The cost function for softmax regression is:

$$J(\theta) = - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k} 1\left\{y^{(i)} = j\right\} \log \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)} }}\right]$$

Minimizing this cost function achieves the objective because it penalizes the model whenever it assigns a low probability to the target class. Since there is no closed-form solution obtained by setting all the partial derivatives to zero at once, we use gradient descent to find the parameter matrix that minimizes the cost function. The partial derivative is:

$$\nabla_{\theta_j} J(\theta) = - \frac{1}{m} \sum_{i=1}^{m}{ \left[ x^{(i)} \left( 1\{ y^{(i)} = j\} - p(y^{(i)} = j | x^{(i)}; \theta) \right) \right] }$$
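A vectorized numpy sketch of one gradient-descent step using this gradient (the names X, Y_onehot, Theta, lr are mine; X is m x n, Theta is n x K, and column j of the result is $\nabla_{\theta_j} J$):

```python
import numpy as np

def softmax_rows(Z):
    Z = Z - Z.max(axis=1, keepdims=True)     # stabilize the exponentials
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def grad_J(Theta, X, Y_onehot):
    m = X.shape[0]
    P = softmax_rows(X @ Theta)              # P[i, j] = p(y^(i) = j | x^(i); theta)
    return -(X.T @ (Y_onehot - P)) / m       # column j is nabla_{theta_j} J(theta)

# one batch gradient-descent step:
# Theta -= lr * grad_J(Theta, X, Y_onehot)
```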

(3) Overfitting

Softmax regression has an unusual property: it has a "redundant" parameter set. More formally, the softmax model is over-parameterized: for any hypothesis that fits the data, there are multiple parameter settings that yield exactly the same hypothesis $\textstyle h_\theta$.

Furthermore, if the parameters $\textstyle (\theta_1, \theta_2,\ldots, \theta_k)$ minimize the cost function $\textstyle J(\theta)$, then so do $\textstyle (\theta_1 - \psi, \theta_2 - \psi,\ldots, \theta_k - \psi)$ for any vector $\textstyle \psi$. The minimizer of $\textstyle J(\theta)$ is therefore not unique.

We therefore modify the cost function by adding a weight-decay term $\textstyle \frac{\lambda}{2} \sum_{i=1}^k \sum_{j=0}^{n} \theta_{ij}^2$, which penalizes large parameter values. The cost function becomes:

$$J(\theta) = - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k} 1\left\{y^{(i)} = j\right\} \log \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)} }} \right] + \frac{\lambda}{2} \sum_{i=1}^k \sum_{j=0}^n \theta_{ij}^2$$

With this weight-decay term ($\textstyle \lambda > 0$), the cost function becomes strictly convex, which guarantees a unique solution.

To use an optimization algorithm we need the gradient of this new $\textstyle J(\theta)$, which is:

$$\nabla_{\theta_j} J(\theta) = - \frac{1}{m} \sum_{i=1}^{m}{ \left[ x^{(i)} ( 1\{ y^{(i)} = j\} - p(y^{(i)} = j | x^{(i)}; \theta) ) \right] } + \lambda \theta_j$$
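Compared with the unregularized gradient, only the $\lambda \theta_j$ term is new; a small sketch that reuses grad_J from the earlier block (lam stands for $\lambda$; whether the bias column should also be decayed is a modelling choice):

```python
def grad_J_weight_decay(Theta, X, Y_onehot, lam):
    # grad_J is the unregularized gradient from the previous sketch
    return grad_J(Theta, X, Y_onehot) + lam * Theta
```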
  • Cross entropy

Wiki: In information theory, the cross entropy between two probability distributions p and q over the same underlying set of events measures the average number of bits needed to identify an event drawn from the set, if a coding scheme is used that is optimized for an "unnatural" probability distribution q, rather than the "true" distribution p.

$$H(p, q) = E_p[-\log q] = H(p) + D_{KL}(p||q)$$

where H(p) is the entropy of p, and $D_{KL}(p||q)$ is the Kullback-Leibler divergence of q from p (also known as the relative entropy of p with respect to q -- note the reversal of emphasis).

For discrete p and q this means

$$H(p, q) = -\sum p(x)\log q(x)$$

Alternative view: let y be the predicted probability distribution and y' the true probability distribution (i.e. the one-hot encoding of the label); cross entropy is commonly used to judge how accurately the model estimates the true distribution.

$$H_{y'}(y) = - \sum y_i'\log y_i$$
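A minimal numpy sketch of this discrete cross entropy, with y_true a one-hot label vector and y_pred the predicted distribution (the clipping only guards against log(0)):

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    y_pred = np.clip(y_pred, eps, 1.0)              # avoid log(0)
    return -np.sum(np.asarray(y_true) * np.log(y_pred))

print(cross_entropy([0, 1, 0], [0.1, 0.7, 0.2]))    # -log(0.7) ~ 0.357
```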

Xavier (parameter initialization method)

Key property: it automatically picks the most suitable weight distribution for a layer based on its number of input and output nodes.

Problem: if the model's weights are initialized too small, the signal shrinks as it passes through each layer until it is too small to be useful; if they are initialized too large, the signal grows as it passes through each layer until it diverges and training fails. Xavier initialization sets the weights neither too large nor too small, but just right. -- 《Tensorflow 实战》 p. 60.
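A minimal sketch of the Xavier (Glorot) uniform initialization rule, which draws weights from a uniform distribution whose scale is set by the layer's fan-in and fan-out (function name and shapes are mine):

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng=np.random.default_rng()):
    limit = np.sqrt(6.0 / (fan_in + fan_out))   # keeps layer-to-layer variance roughly constant
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

W = xavier_uniform(784, 256)    # e.g. weights for a 784 -> 256 fully connected layer
```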


ReLU function (activation function)


rectifier:

$$f(x) = x^+ = \max (0, x)$$

softplus (a smooth analytic approximation to ReLU):

$$f(x) = \ln[1 + \exp(x)]$$

softplus vs sigmoid: the derivative of softplus is the logistic (sigmoid) function:

$$f'(x) = \frac {e^x} {1+e^x} = \frac {1} {1+e^{-x}}$$
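A small numpy sketch of the functions above, checking numerically that the derivative of softplus is the sigmoid:

```python
import numpy as np

relu     = lambda x: np.maximum(0.0, x)
softplus = lambda x: np.log1p(np.exp(x))
sigmoid  = lambda x: 1.0 / (1.0 + np.exp(-x))

print(relu(np.array([-1.0, 0.5])))                   # [0.  0.5]

x, h = 0.3, 1e-6
numeric_grad = (softplus(x + h) - softplus(x - h)) / (2 * h)
print(numeric_grad, sigmoid(x))                      # both ~0.5744
```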

Correlation coefficient

A statistical measure of how closely two variables are related. The correlation coefficient is computed by the product-moment method: it is based on the deviations of the two variables from their respective means, and the product of these deviations reflects the degree of correlation between the two variables.

Simple correlation coefficient: also called the correlation coefficient or linear correlation coefficient, usually denoted by the letter r, it measures the linear relationship between two variables.

$$r(X, Y) = \frac {Cov(X, Y)} {\sqrt {Var[X]Var[Y]}}$$

where Cov(X, Y) is the covariance of X and Y, Var[X] is the variance of X, and Var[Y] is the variance of Y.
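A minimal numpy sketch of this formula (using population covariance and variances, matching the definition above):

```python
import numpy as np

def pearson_r(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    cov = np.mean((x - x.mean()) * (y - y.mean()))   # Cov(X, Y)
    return cov / np.sqrt(x.var() * y.var())

x = np.arange(10)
y = 2 * x + 1
print(pearson_r(x, y))            # 1.0 for a perfectly linear relationship
print(np.corrcoef(x, y)[0, 1])    # numpy's built-in agrees
```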


Neural networks

activation function

Activation functions are used to add a non-linear factor, because a purely linear model lacks expressive power; one way to fix this is to introduce a non-linear function. Consider the XOR problem: its truth table is not linearly separable, so a linear model cannot be used.

We can design a neural network whose activation function makes this data linearly separable. As the activation function we choose a threshold function: it outputs 1 (activated) when its input exceeds some value and 0 (not activated) otherwise. This function is non-linear.

The numbers on the edges are weights and the numbers inside the circles are thresholds. In the second layer, a unit outputs 1 if its input is greater than 1.5 and 0 otherwise; in the third layer, a unit outputs 1 if its input is greater than 0.5 and 0 otherwise.

First layer to second layer (threshold 1.5)

Second layer to third layer (threshold 0.5)

As can be seen, the output of the third layer is exactly the XOR answer we want.

After this transformation the data is linearly separable (in n dimensions; in this example a separating plane exists).
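A tiny sketch of one standard threshold-unit construction of XOR consistent with the thresholds described above (the exact weights from the missing figure are not recoverable, so the weights here are my assumption):

```python
step = lambda z, t: int(z > t)          # outputs 1 if the input exceeds the threshold, else 0

def xor(x1, x2):
    h = step(x1 + x2, 1.5)              # hidden unit (threshold 1.5): fires only for (1, 1)
    return step(x1 + x2 - 2 * h, 0.5)   # output unit (threshold 0.5)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor(a, b))          # prints the XOR truth table: 0, 1, 1, 0
```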


Moving average method

The moving average method (also called the sliding average or moving average model, MA) is a common method that uses a set of the most recent actual data values to forecast quantities such as product demand or company capacity over one or more future periods.
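A minimal sketch of a simple k-period moving-average forecast (the demand numbers are made up for illustration):

```python
import numpy as np

def moving_average_forecast(history, k):
    history = np.asarray(history, dtype=float)
    return history[-k:].mean()              # mean of the last k observations

demand = [112, 118, 115, 120, 123, 119]
print(moving_average_forecast(demand, 3))   # (120 + 123 + 119) / 3 ~ 120.67
```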


Convolutional neural networks

Output matrix size with all-zero (SAME) padding:

$$out_{length} = \left\lceil in_{length}/stride_{length} \right\rceil$$

$$out_{width} = \left\lceil in_{width}/stride_{width} \right\rceil$$

Output matrix size without zero padding (VALID):

$$out_{length} = \left\lceil (in_{length} - filter_{length} + 1)/stride_{length} \right\rceil$$

$$out_{width} = \left\lceil (in_{width} - filter_{width} + 1)/stride_{width} \right\rceil$$
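A small helper implementing the two formulas for one spatial axis (the same rule applies independently to length and width; the SAME/VALID naming follows common convolution terminology):

```python
import math

def conv_output_size(in_size, filter_size, stride, zero_padded):
    if zero_padded:                                          # all-zero ("SAME") padding
        return math.ceil(in_size / stride)
    return math.ceil((in_size - filter_size + 1) / stride)   # no padding ("VALID")

print(conv_output_size(32, 5, 1, zero_padded=False))   # 28
print(conv_output_size(32, 5, 2, zero_padded=True))    # 16
```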