Created
June 28, 2016 15:27
-
-
Save tvorogme/83e90a4f8c13d6de8231848e61067c59 to your computer and use it in GitHub Desktop.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Уменьшение размерности и Embedding (manifold)\n", | |
"* Today we are going to learn how to visualize and explore the data\n", | |
"![pictcha](http://sarahannelawless.com/wp-content/uploads/2015/03/tw-1-600x449.jpg)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"lasagne.layers.EmbeddingLayer" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"from time import time\n", | |
"import numpy as np\n", | |
"\n", | |
"import matplotlib.pyplot as plt\n", | |
"%matplotlib inline" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Загружаем MNIST\n", | |
" * Если кто не знает, это выборка рукописных цифер" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"from sklearn.datasets import load_digits\n", | |
"digits = load_digits(n_class=6)\n", | |
"X = digits.data\n", | |
"y = digits.target\n", | |
"n_samples, n_features = X.shape\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"print X.shape\n", | |
"print y.shape" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"# a few testimonials\n", | |
"plt.subplot(1,2,1)\n", | |
"plt.imshow(X[0].reshape(8,8))\n", | |
"plt.subplot(1,2,2)\n", | |
"plt.imshow(X[1].reshape(8,8))" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Визуализация данных\n", | |
"\n", | |
"Для начала возьмём случайную пару проекций" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"from matplotlib import offsetbox\n", | |
"def plot_embedding(X,y=None,ax=None,show_images=True,min_dist=5e-3,figsize=[12,10]):\n", | |
" \n", | |
" #нормализуем данные\n", | |
" x_min, x_max = np.min(X, 0), np.max(X, 0)\n", | |
" X = (X - x_min) / (x_max - x_min)\n", | |
"\n", | |
" if ax is None:\n", | |
" plt.figure(figsize=figsize)\n", | |
" ax = plt.subplot(1,1,1)\n", | |
"\n", | |
" # рисуем scatter\n", | |
" if y is None:\n", | |
" plt.scatter(*X.T)\n", | |
" else:\n", | |
" assert y is not None\n", | |
" #рисуем циферки a-la scatter\n", | |
" for i in range(X.shape[0]):\n", | |
" ax.text(X[i, 0], X[i, 1], str(y[i]),\n", | |
" color= plt.cm.Set1(y[i] / 10.),\n", | |
" fontdict={'weight': 'bold', 'size': 9})\n", | |
"\n", | |
" if not show_images:\n", | |
" return\n", | |
" \n", | |
" shown_images = np.array([[1., 1.]]) # just something big\n", | |
" for i in range(X.shape[0]):\n", | |
" dist = np.sum((X[i] - shown_images) ** 2, 1)\n", | |
" if np.min(dist) < min_dist: continue\n", | |
" shown_images = np.r_[shown_images, [X[i]]]\n", | |
" imagebox = offsetbox.AnnotationBbox(\n", | |
" offsetbox.OffsetImage(digits.images[i], cmap=plt.cm.gray_r),\n", | |
" X[i])\n", | |
" ax.add_artist(imagebox)\n", | |
" plt.xticks([]), plt.yticks([])\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### GaussianRandomProjection\n", | |
" * Выбирает несколько (2) случайных осей в многомерном пространстве данных\n", | |
" * проецирует выборку на эти оси\n", | |
" * Сам по себе довольно бесполезен" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"from sklearn.random_projection import GaussianRandomProjection\n", | |
"\n", | |
"Xrp = GaussianRandomProjection(n_components=2).fit_transform(X)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"plot_embedding(Xrp)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Wut?!\n", | |
"Нищева нипанятна нащайника!" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"plot_embedding(Xrp,y)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"После перезапуска Xrp=... результат может поменяться" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Singular Value Decomposition\n", | |
"\n", | |
"* Идея: мы пытаемся отобразить данные на такие оси, с которых их потом можно восстановить с минимальными потерями\n", | |
"* Количество этих \"новых осей\" обычно меньше, чем у изначальных данных" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"from sklearn.decomposition import TruncatedSVD" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"svd = TruncatedSVD(n_components=2)\n", | |
"Xsvd = svd.fit_transform(X)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"plot_embedding(Xsvd[:,:2],y)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"n = svd.n_components\n", | |
"plt.figure(figsize=[16,8])\n", | |
"for i in range(2):\n", | |
" plt.subplot(1,2,i+1)\n", | |
" plt.imshow(svd.components_[i].reshape(8,8),\n", | |
" cmap='gray',interpolation='none')\n", | |
" plt.colorbar()\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### PCA ака анализ главных компонент\n", | |
"* Идея: мы пытаемся найти оси, вдоль которых расположены данные\n", | |
"* Опять же, количество таких осей меньше, чем в оригинальных данных\n", | |
"\n", | |
"__По коду крайне похоже на SVD__ , засим предлагаю применить его самостоятельно" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"from sklearn.decomposition import PCA\n", | |
"pca = <Создай PCA с n_components=2>\n", | |
"Xsvd = <Сделай PCA-преобразование входных данных>" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"plot_embedding(Xpca,y)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"```\n", | |
"\n", | |
"```\n", | |
"\n", | |
"```\n", | |
"\n", | |
"```\n", | |
"\n", | |
"```\n", | |
"\n", | |
"```\n", | |
"\n", | |
"```\n", | |
"\n", | |
"```\n", | |
"\n", | |
"```\n", | |
"\n", | |
"```\n", | |
"\n", | |
"```\n", | |
"\n", | |
"```\n", | |
"\n", | |
"```\n", | |
"\n", | |
"```\n", | |
"\n", | |
"```\n", | |
"\n", | |
"```\n", | |
"\n", | |
"```\n", | |
"\n", | |
"```\n", | |
"\n", | |
"```\n", | |
"\n", | |
"```\n", | |
"\n", | |
"```\n", | |
"\n", | |
"```\n", | |
"\n", | |
"\n", | |
"#### В этот момент хорошей идеей будет сравнить 3 полученные картинки" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"plt.figure(figsize=[12,4])\n", | |
"plot_embedding(Xrp,y,ax=plt.subplot(1,3,1),min_dist=3e-2)\n", | |
"plot_embedding(Xsvd,y,ax=plt.subplot(1,3,2),min_dist=3e-2)\n", | |
"plot_embedding(Xpca,y,ax=plt.subplot(1,3,3),min_dist=3e-2)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# LDA aka Linear Discriminant Analysis\n", | |
"\n", | |
"* Идея: давайте найдём такие оси, на которых классы (циферки) будут лучше всего разделены\n", | |
"* На самом деле, это самый обычный классификатор, уменьшение размерности - просто побочный эффект" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"from sklearn.discriminant_analysis import LinearDiscriminantAnalysis\n", | |
"lda = LinearDiscriminantAnalysis(n_components=2)\n", | |
"Xlda = lda.fit_transform(X,y)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"plot_embedding(Xlda,y)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"#### Кажется, мы победили!\n", | |
"\n", | |
"##### Ан нет, Hard mode enabled\n", | |
"\n", | |
"__Сходите, пожалуйста, в начало, и поменяйте у функции load_digits n_class на 10__\n", | |
"\n", | |
"После чего, понятное дело, перепрогоните тесты\n", | |
"\n", | |
"```\n", | |
"\n", | |
"```\n", | |
"\n", | |
"```\n", | |
"\n", | |
"```\n", | |
"\n", | |
"```\n", | |
"\n", | |
"```\n", | |
"\n", | |
"```\n", | |
"\n", | |
"```\n", | |
"\n", | |
"```\n", | |
"\n", | |
"```\n", | |
"\n", | |
"```\n", | |
"\n", | |
"```\n", | |
"\n", | |
"```\n", | |
"\n", | |
"```\n", | |
"\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# Embedding ака Manifold learning\n", | |
"\n", | |
"* В отличие от предыдущих не имеют ограничений вида \"только линейное\"/\"квадратичное\"/\"синусоидальное\" преобразование\n", | |
"\n", | |
"* Общая идея: давайте мы попытаемся сопоставить каждой точке в данных новые координаты такие, чтобы \"было хорошо\"\n", | |
"\n", | |
"* Что такое хорошо и как сопоставлять - зависит от конкретного метода" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Multidimensional Scaling\n", | |
"\n", | |
"* Идея - давайте расположим новые точки так, чтобы близкие в многомерном пространстве точки получили близки друг к друку координаты, а удалённые - соответственно, далёкие друг от друга\n", | |
"\n", | |
"##### Чуть сложнее\n", | |
"* Давайте зададим точкам такие новые координаты, чтобы попарные расстояния сохранялись как можно точнее.\n", | |
" * Есть точки в 64-мерном пространстве. Между ними есть расстояние, например\n", | |
"$$ r(a,b) = \\sqrt { (a_0 - b_0)^2 + (a_1 - b_1)^2 + ... + (a_63 - b_63)^2}$$\n", | |
" * Есть наши новые точки в двумерном пространстве, между ними есть расстояние\n", | |
"$$ r_{new}(a',b') = \\sqrt { (a'_x - b'_x)^2 + (a'_y - b'_y)^2 } $$\n", | |
"\n", | |
" * Хотим, чтобы $r_{new}$ было максимально близко к $r$\n", | |
" * Двигаем новые точки так, чтобы в среднем по всем a,b\n", | |
"$$ r_{new}(a',b') - r(a,b) \\to min $$\n", | |
"\n", | |
" * Далее по тексту разладка между старым и новым расстоянием называется __stress__" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"from sklearn.manifold import MDS\n", | |
"mds = MDS(n_components=2,verbose=2,n_init=1)\n", | |
"Xmds = mds.fit_transform(X)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"plot_embedding(Xmds,y)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# t-SNE\n", | |
"t-distributed Stochasitc Neiborhood Embedding бла-бла-бла\n", | |
"\n", | |
"* Идейно похож на MDS, но чуть более пофигист - ему плевать на далёкие точки.\n", | |
"\n", | |
"* Хотим, чтобы сохранялось расстояние до K ближайших соседей, остальные - как повезёт\n", | |
"\n", | |
"Иными словами, хотим как можно лучше сохранить отношение соседства, забивая на глобальную структуру." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"from sklearn.manifold import TSNE\n", | |
"tsne = TSNE(n_components=2,verbose=2,perplexity=50)\n", | |
"Xtsne = tsne.fit_transform(X)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"plot_embedding(Xtsne,y)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# Так кто же лучше?\n", | |
" * t-SNE зависит от параметра perplexity - на сколько ближайших соседей смотреть (доля)\n", | |
" * Посмотрите, как результат меняется при изменении perplexity от 1 до 100\n", | |
" * Все точки брать не обязательно, опорных значений вида [1,5,...50,100] должно хватить \n", | |
" \n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"<compare tSNEs>" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
" * Попробуйте сперва выделить несколько главных компонент с PCA, и только потом применить TSNE\n", | |
" * Для начала, 64D изначально -> 16D после PCA -> 2D после TSNE\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"collapsed": true | |
}, | |
"source": [ | |
" * Попробуйте схожие методы - Isomap, LocallyLinearEmbedding или SpectralEmbedding\n", | |
" * Все лежат в sklearn.manifold" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"<Isomap, LLE and Spectral Embedding>" | |
] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 2", | |
"language": "python", | |
"name": "python2" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 2 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython2", | |
"version": "2.7.6" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 0 | |
} |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment