@secsilm
Created April 25, 2018 03:32
Standardization and Normalization in sklearn
{
"cells": [
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"from sklearn import preprocessing\n",
"import numpy as np"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Standardization\n",
"### Using `scale`"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [],
"source": [
"X = np.array([[ 1., -1., 2., 0.],\n",
" [ 2., 0., 0., 1.],\n",
" [ 0., 1., -1., 2.]])"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"# scale operates along columns: rows are samples, columns are features, and each feature is scaled\n",
"X_scaled = preprocessing.scale(X)"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 0. , -1.22474487, 1.33630621, -1.22474487],\n",
" [ 1.22474487, 0. , -0.26726124, 0. ],\n",
" [-1.22474487, 1.22474487, -1.06904497, 1.22474487]])"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_scaled"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([0., 0., 0., 0.])"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_scaled.mean(axis=0)"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([1., 1., 1., 1.])"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_scaled.var(axis=0)"
]
},
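{
"cell_type": "markdown",
"metadata": {},
"source": [
"Equivalently (a sketch added for illustration, not part of the original notebook), `scale` simply centers each column by its mean and divides by its standard deviation:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Manual standardization; should match X_scaled above\n",
"(X - X.mean(axis=0)) / X.std(axis=0)"
]
},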
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Using `StandardScaler`"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"StandardScaler(copy=True, with_mean=True, with_std=True)"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"scaler = preprocessing.StandardScaler().fit(X)\n",
"scaler"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 0. , -1.22474487, 1.33630621, -1.22474487],\n",
" [ 1.22474487, 0. , -0.26726124, 0. ],\n",
" [-1.22474487, 1.22474487, -1.06904497, 1.22474487]])"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Same result as scale above\n",
"scaler.transform(X)"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([1. , 0. , 0.33333333, 1. ])"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# The per-feature mean of X\n",
"scaler.mean_"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([0.66666667, 0.66666667, 1.55555556, 0.66666667])"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# The per-feature variance of X\n",
"scaler.var_"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[-2.44948974, 1.22474487, -0.26726124, 0. ]])"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Standardize a new sample in the same way, i.e. using the mean and var computed from X\n",
"X_test = [[-1., 1., 0., 1.]]\n",
"scaler.transform(X_test)"
]
},
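{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick sanity check (a sketch, not part of the original notebook): the transform above should equal `(X_test - scaler.mean_) / scaler.scale_`, where `scale_` is the per-feature standard deviation."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Recompute the standardized test sample by hand; should match scaler.transform(X_test)\n",
"(np.asarray(X_test) - scaler.mean_) / scaler.scale_"
]
},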
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Other scaling methods are also available, e.g. [`MinMaxScaler`](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler); see [Standardization, or mean removal and variance scaling](http://scikit-learn.org/stable/modules/preprocessing.html#standardization-or-mean-removal-and-variance-scaling) for details."
]
},
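{
"cell_type": "markdown",
"metadata": {},
"source": [
"For illustration (a minimal sketch added here, not in the original notebook): `MinMaxScaler` rescales each feature to a given range, [0, 1] by default."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Rescale each column of X to the [0, 1] range\n",
"min_max_scaler = preprocessing.MinMaxScaler()\n",
"min_max_scaler.fit_transform(X)"
]
},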
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Normalization\n",
"### Using `normalize`"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 1., -1., 2., 0.],\n",
" [ 2., 0., 0., 1.],\n",
" [ 0., 1., -1., 2.]])"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [],
"source": [
"X_normalized, norms = preprocessing.normalize(X, axis=0, return_norm=True)"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 0.4472136 , -0.70710678, 0.89442719, 0. ],\n",
" [ 0.89442719, 0. , 0. , 0.4472136 ],\n",
" [ 0. , 0.70710678, -0.4472136 , 0.89442719]])"
]
},
"execution_count": 52,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_normalized"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([2.23606798, 1.41421356, 2.23606798, 2.23606798])"
]
},
"execution_count": 53,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# The L2 norms of the column vectors of X\n",
"norms"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([1., 1., 1., 1.])"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.linalg.norm(X_normalized, axis=0)"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.44721359499991586"
]
},
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# X[0, 0] / norms[0]\n",
"1 / 2.23606798"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As the check shows, normalization simply divides each value by the L2 norm of its column vector."
]
},
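{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same check for the whole matrix (a sketch added for illustration): dividing X by its column norms reproduces X_normalized."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Divide every column of X by its L2 norm; should match X_normalized\n",
"X / np.linalg.norm(X, axis=0)"
]
},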
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Using `Normalizer`"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Normalizer(copy=True, norm='l2')"
]
},
"execution_count": 54,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"normalizer = preprocessing.Normalizer().fit(X)  # fit does nothing: Normalizer is stateless\n",
"normalizer"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 0.40824829, -0.40824829, 0.81649658, 0. ],\n",
" [ 0.89442719, 0. , 0. , 0.4472136 ],\n",
" [ 0. , 0.40824829, -0.40824829, 0.81649658]])"
]
},
"execution_count": 55,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"normalizer.transform(X)"
]
},
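{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that `Normalizer` always works row-wise (per sample), unlike the `normalize` call above, which used `axis=0`. A quick check (a sketch, not in the original notebook): every row of the result should have unit L2 norm."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Each row (sample) should now have unit L2 norm\n",
"np.linalg.norm(normalizer.transform(X), axis=1)"
]
},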
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python [default]",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}