secsilm/standardization-vs-normalization.ipynb

## standardization-vs-normalization.ipynb
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn import preprocessing\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Standardization\n",
    "### Using `scale`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [],
   "source": [
    "X = np.array([[ 1., -1.,  2., 0.],\n",
    "              [ 2.,  0.,  0., 1.],\n",
    "              [ 0.,  1., -1., 2.]])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Scale column-wise: rows are samples and columns are features, so each feature is standardized\n",
    "X_scaled = preprocessing.scale(X)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 0.        , -1.22474487,  1.33630621, -1.22474487],\n",
       "       [ 1.22474487,  0.        , -0.26726124,  0.        ],\n",
       "       [-1.22474487,  1.22474487, -1.06904497,  1.22474487]])"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X_scaled"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([0., 0., 0., 0.])"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X_scaled.mean(axis=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([1., 1., 1., 1.])"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X_scaled.var(axis=0)"
   ]
  },
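  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough sketch of what `scale` computes under its default settings (centering and scaling both enabled), each entry is shifted by its column mean and divided by its column standard deviation. The symbols $\\mu_j$, $\\sigma_j$ and $z_{ij}$ are notation introduced here for illustration, not scikit-learn names, and $\\sigma_j$ is taken to be the population standard deviation (dividing by $n$), which is consistent with `X_scaled.var(axis=0)` returning exactly 1:\n",
    "\n",
    "$$z_{ij} = \\frac{x_{ij} - \\mu_j}{\\sigma_j}, \\qquad \\mu_j = \\frac{1}{n}\\sum_{i=1}^{n} x_{ij}, \\qquad \\sigma_j^2 = \\frac{1}{n}\\sum_{i=1}^{n} (x_{ij} - \\mu_j)^2$$\n",
    "\n",
    "For the first column of `X`, $(1, 2, 0)$, this gives $\\mu = 1$ and $\\sigma = \\sqrt{2/3} \\approx 0.8165$, so the entry $2$ maps to $(2 - 1) / 0.8165 \\approx 1.2247$, matching `X_scaled[1, 0]`."
   ]
  },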
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Using `StandardScaler`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "StandardScaler(copy=True, with_mean=True, with_std=True)"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "scaler = preprocessing.StandardScaler().fit(X)\n",
    "scaler"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 0.        , -1.22474487,  1.33630621, -1.22474487],\n",
       "       [ 1.22474487,  0.        , -0.26726124,  0.        ],\n",
       "       [-1.22474487,  1.22474487, -1.06904497,  1.22474487]])"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Same result as `scale` above\n",
    "scaler.transform(X)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([1.        , 0.        , 0.33333333, 1.        ])"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Per-feature (column) mean of X\n",
    "scaler.mean_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([0.66666667, 0.66666667, 1.55555556, 0.66666667])"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Per-feature (column) variance of X\n",
    "scaler.var_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[-2.44948974,  1.22474487, -0.26726124,  0.        ]])"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Standardize a new sample the same way, i.e. with the mean and variance learned from X\n",
    "X_test = [[-1., 1., 0., 1.]]\n",
    "scaler.transform(X_test)"
   ]
  },
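  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The transformed test sample can be checked by hand, assuming `transform` computes `(x - scaler.mean_) / sqrt(scaler.var_)` per feature. Using the values of `scaler.mean_` and `scaler.var_` shown above, the first and third features work out as:\n",
    "\n",
    "$$\\frac{-1 - 1}{\\sqrt{0.66666667}} \\approx -2.4495, \\qquad \\frac{0 - 0.33333333}{\\sqrt{1.55555556}} \\approx -0.2673$$\n",
    "\n",
    "The new sample does not influence the statistics: they come entirely from the data passed to `fit`."
   ]
  },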
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Other scaling methods are available as well, such as [`MinMaxScaler`](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler). For details, see [Standardization, or mean removal and variance scaling](http://scikit-learn.org/stable/modules/preprocessing.html#standardization-or-mean-removal-and-variance-scaling)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Normalization\n",
    "### Using `normalize`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 1., -1.,  2.,  0.],\n",
       "       [ 2.,  0.,  0.,  1.],\n",
       "       [ 0.,  1., -1.,  2.]])"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_normalized, norms = preprocessing.normalize(X, axis=0, return_norm=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 0.4472136 , -0.70710678,  0.89442719,  0.        ],\n",
       "       [ 0.89442719,  0.        ,  0.        ,  0.4472136 ],\n",
       "       [ 0.        ,  0.70710678, -0.4472136 ,  0.89442719]])"
      ]
     },
     "execution_count": 52,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X_normalized"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([2.23606798, 1.41421356, 2.23606798, 2.23606798])"
      ]
     },
     "execution_count": 53,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# L2 norms of the column vectors of X\n",
    "norms"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([1., 1., 1., 1.])"
      ]
     },
     "execution_count": 48,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.linalg.norm(X_normalized, axis=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.44721359499991586"
      ]
     },
     "execution_count": 50,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# X[0, 0] / norms[0]\n",
    "1 / 2.23606798"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As the results show, `normalize` with `axis=0` divides each value by the L2 norm of the column vector it belongs to."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Using `Normalizer`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Normalizer(copy=True, norm='l2')"
      ]
     },
     "execution_count": 54,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "normalizer = preprocessing.Normalizer().fit(X)  # fit does nothing\n",
    "normalizer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 0.40824829, -0.40824829,  0.81649658,  0.        ],\n",
       "       [ 0.89442719,  0.        ,  0.        ,  0.4472136 ],\n",
       "       [ 0.        ,  0.40824829, -0.40824829,  0.81649658]])"
      ]
     },
     "execution_count": 55,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "normalizer.transform(X)"
   ]
  },
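  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This output differs from the `normalize(X, axis=0)` result because `Normalizer` rescales each row (sample) to unit norm rather than each column. A sketch of the row-wise L2 case, where $\\tilde{x}_{ij}$ is notation introduced here for the normalized entry:\n",
    "\n",
    "$$\\tilde{x}_{ij} = \\frac{x_{ij}}{\\lVert x_{i,:} \\rVert_2} = \\frac{x_{ij}}{\\sqrt{\\sum_{k} x_{ik}^2}}$$\n",
    "\n",
    "For the first row of `X`, $(1, -1, 2, 0)$, the norm is $\\sqrt{1 + 1 + 4} = \\sqrt{6} \\approx 2.4495$, so the first entry becomes $1 / 2.4495 \\approx 0.4082$, as in the output above."
   ]
  },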
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [default]",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}