Zamony/gist:66f8ca977d2cbba18d106ef8b6408271

## gistfile1.txt
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# NumPy, continued"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import random"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Important!!!**\n",
    "\n",
    "The previous lecture showed how to create a matrix of random numbers (for example, to initialize weights), but not how to make that randomness reproducible.\n",
    "\n",
    "That is, how to create a matrix with exactly the same random elements again."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "np.random.seed(0)\n",
    "random.seed(0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Other libraries have their own seeds, which must be set separately if reproducibility is needed:\n",
    "\n",
    "* TensorFlow: `import tensorflow as tf`, then `tf.random.set_seed(0)` (in TensorFlow 1.x: `tf.set_random_seed(0)`)\n",
    "\n",
    "* PyTorch: `import torch`, then `torch.manual_seed(0)`"
   ]
  },
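  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of what seeding gives you: resetting the seed replays the same draws. The variable names are illustrative and not part of the lecture code, and `np.random.default_rng` assumes NumPy 1.17 or newer.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "# Legacy global state: reseeding replays the same sequence.\n",
    "np.random.seed(0)\n",
    "first = np.random.rand(2, 3)\n",
    "np.random.seed(0)\n",
    "second = np.random.rand(2, 3)\n",
    "same_global = np.array_equal(first, second)\n",
    "\n",
    "# Local generators (assumes NumPy >= 1.17) avoid shared global state.\n",
    "rng_a = np.random.default_rng(0)\n",
    "rng_b = np.random.default_rng(0)\n",
    "same_local = np.array_equal(rng_a.normal(size=4), rng_b.normal(size=4))\n",
    "\n",
    "print(same_global, same_local)\n",
    "```\n",
    "\n",
    "Both comparisons should report identical arrays. A local generator can be handy when several parts of a program need independent, reproducible streams."
   ]
  },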
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Broadcast operations"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[1 1 1 1 1]\n",
      " [1 1 1 1 1]\n",
      " [1 1 1 1 1]]\n",
      "\n",
      "[[0 1 2 3 4]]\n",
      "\n",
      "[[1 2 3 4 5]\n",
      " [1 2 3 4 5]\n",
      " [1 2 3 4 5]]\n",
      "(3, 5) (1, 5) (3, 5)\n"
     ]
    }
   ],
   "source": [
    "x = np.ones((3,5), dtype=int)\n",
    "y = np.arange(5).reshape(1,5)\n",
    "print(x, y, x+y, sep=\"\\n\\n\")\n",
    "print(x.shape, y.shape, (x+y).shape, sep=\" \")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "ename": "ValueError",
     "evalue": "operands could not be broadcast together with shapes (5,3) (1,5) ",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mValueError\u001b[0m                                Traceback (most recent call last)",
      "\u001b[0;32m<ipython-input-4-e53abc1f71a5>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[0mxt\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mx\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mT\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mxt\u001b[0m\u001b[0;34m+\u001b[0m\u001b[0my\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0msep\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m\"\\n\\n\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
      "\u001b[0;31mValueError\u001b[0m: operands could not be broadcast together with shapes (5,3) (1,5) "
     ]
    }
   ],
   "source": [
    "xt = x.T\n",
    "print(xt+y, sep=\"\\n\\n\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[1 1 1 1 1]\n",
      " [1 1 1 1 1]\n",
      " [1 1 1 1 1]]\n",
      "\n",
      "[0 1 2 3 4]\n",
      "\n",
      "[[1 2 3 4 5]\n",
      " [1 2 3 4 5]\n",
      " [1 2 3 4 5]]\n",
      "(3, 5) (5,) (3, 5)\n"
     ]
    }
   ],
   "source": [
    "v = np.arange(5)\n",
    "print(x, v, x+v, sep=\"\\n\\n\")\n",
    "print(x.shape, v.shape, (x+v).shape, sep=\" \")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### How does broadcasting work (for element-wise operations; for matmul/dot see the documentation)?\n",
    "a. The shapes of the arguments (tuples) are aligned on the right\n",
    "\n",
    "|  |  |\n",
    "|------|------|\n",
    "| 3  | 5|\n",
    "|   | 5|\n",
    "\n",
    "b. The shorter shape is padded with ones on the left\n",
    "\n",
    "|  |  |\n",
    "|------|------|\n",
    "| 3  | 5|\n",
    "| 1  | 5|\n",
    "\n",
    "c. The arguments are compatible if, on every axis, the sizes are equal or one of them is 1. In the latter case the argument with size 1 (the second one in this example) is repeated along that axis (axis=0) the required number of times (3).\n",
    "\n",
    "|  |  |\n",
    "|------|------|\n",
    "| 3  | 5|\n",
    "| 3  | 5|\n"
   ]
  },
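  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Steps a to c can be sketched as a small pure-Python function. `broadcast_shape` is a hypothetical helper written for this illustration, not a NumPy function; `np.broadcast` is used only to cross-check the result.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "def broadcast_shape(shape_a, shape_b):\n",
    "    # Steps a and b: right-align by padding the shorter shape with 1s on the left.\n",
    "    ndim = max(len(shape_a), len(shape_b))\n",
    "    a = (1,) * (ndim - len(shape_a)) + tuple(shape_a)\n",
    "    b = (1,) * (ndim - len(shape_b)) + tuple(shape_b)\n",
    "    result = []\n",
    "    for da, db in zip(a, b):\n",
    "        # Step c: sizes must be equal, or one of them must be 1.\n",
    "        if da != db and da != 1 and db != 1:\n",
    "            raise ValueError('incompatible shapes %s and %s' % (shape_a, shape_b))\n",
    "        result.append(db if da == 1 else da)\n",
    "    return tuple(result)\n",
    "\n",
    "hand_shape = broadcast_shape((3, 5), (5,))\n",
    "numpy_shape = np.broadcast(np.ones((3, 5)), np.arange(5)).shape\n",
    "print(hand_shape, numpy_shape)\n",
    "```\n",
    "\n",
    "Calling `broadcast_shape((5, 3), (1, 5))` raises a `ValueError`, mirroring the failing `xt + y` cell above."
   ]
  },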
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This repetition uses no extra memory: nothing is physically copied. \n",
    "This is unlike np.repeat(), which does make a copy."
   ]
  },
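  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One way to observe this, as a sketch: `np.broadcast_to` exposes the broadcast result as a read-only view, whose stride along the repeated axis is zero, while `np.repeat` allocates new memory. The names `vec`, `view` and `copied` are illustrative.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "vec = np.arange(5)\n",
    "\n",
    "# Read-only broadcast view: every row points at the same 5 numbers.\n",
    "view = np.broadcast_to(vec, (3, 5))\n",
    "\n",
    "# np.repeat materializes three real copies of the row.\n",
    "copied = np.repeat(vec[np.newaxis, :], 3, axis=0)\n",
    "\n",
    "print(view.strides, copied.strides)\n",
    "print(np.shares_memory(view, vec), np.shares_memory(copied, vec))\n",
    "```\n",
    "\n",
    "The zero stride on axis 0 is what lets the view behave like a (3, 5) array while storing only five elements."
   ]
  },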
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[[172  47 117]\n",
      "  [192  67 251]\n",
      "  [195 103   9]\n",
      "  [211  21 242]\n",
      "  [ 36  87  70]]\n",
      "\n",
      " [[216  88 140]\n",
      "  [ 58 193 230]\n",
      "  [ 39  87 174]\n",
      "  [ 88  81 165]\n",
      "  [ 25  77  72]]\n",
      "\n",
      " [[  9 148 115]\n",
      "  [208 243 197]\n",
      "  [254  79 175]\n",
      "  [192  82  99]\n",
      "  [216 177 243]]\n",
      "\n",
      " [[ 29 147 147]\n",
      "  [142 167  32]\n",
      "  [193   9 185]\n",
      "  [127  32  31]\n",
      "  [202 244 151]]\n",
      "\n",
      " [[163 254 203]\n",
      "  [114 183  28]\n",
      "  [ 34 128 128]\n",
      "  [164  53 133]\n",
      "  [ 38 232 244]]]\n",
      "\n",
      "[132.68 121.16 143.24]\n",
      "\n",
      "[[[  39.32  -74.16  -26.24]\n",
      "  [  59.32  -54.16  107.76]\n",
      "  [  62.32  -18.16 -134.24]\n",
      "  [  78.32 -100.16   98.76]\n",
      "  [ -96.68  -34.16  -73.24]]\n",
      "\n",
      " [[  83.32  -33.16   -3.24]\n",
      "  [ -74.68   71.84   86.76]\n",
      "  [ -93.68  -34.16   30.76]\n",
      "  [ -44.68  -40.16   21.76]\n",
      "  [-107.68  -44.16  -71.24]]\n",
      "\n",
      " [[-123.68   26.84  -28.24]\n",
      "  [  75.32  121.84   53.76]\n",
      "  [ 121.32  -42.16   31.76]\n",
      "  [  59.32  -39.16  -44.24]\n",
      "  [  83.32   55.84   99.76]]\n",
      "\n",
      " [[-103.68   25.84    3.76]\n",
      "  [   9.32   45.84 -111.24]\n",
      "  [  60.32 -112.16   41.76]\n",
      "  [  -5.68  -89.16 -112.24]\n",
      "  [  69.32  122.84    7.76]]\n",
      "\n",
      " [[  30.32  132.84   59.76]\n",
      "  [ -18.68   61.84 -115.24]\n",
      "  [ -98.68    6.84  -15.24]\n",
      "  [  31.32  -68.16  -10.24]\n",
      "  [ -94.68  110.84  100.76]]]\n"
     ]
    }
   ],
   "source": [
    "x = np.random.randint(0,255,size=(5, 5, 3))\n",
    "m = x.mean(axis=0).mean(axis=0)\n",
    "print(x, m, x-m, sep=\"\\n\\n\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Divide each row of the matrix by its norm (normalize the example vectors)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[2 0 0 1 1 0 1 0 3 1]\n",
      " [1 1 2 1 1 1 0 2 2 0]\n",
      " [3 0 1 0 2 0 1 2 1 2]\n",
      " [2 1 0 0 1 1 1 1 1 1]\n",
      " [2 1 2 1 0 1 0 2 1 0]]\n"
     ]
    }
   ],
   "source": [
    "x = np.random.binomial(n=5, p=0.2, size=(5,10)) # 5 examples, 10 features\n",
    "print(x)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[4.12310563 4.12310563 4.89897949 3.31662479 4.        ]\n",
      "(5, 10) (5,)\n"
     ]
    }
   ],
   "source": [
    "x_norms = np.sqrt((x**2).sum(axis=-1))\n",
    "print(x_norms)\n",
    "print(x.shape, x_norms.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "ename": "ValueError",
     "evalue": "operands could not be broadcast together with shapes (5,10) (5,) ",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mValueError\u001b[0m                                Traceback (most recent call last)",
      "\u001b[0;32m<ipython-input-23-d2d6f73d8620>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mx\u001b[0m\u001b[0;34m/\u001b[0m\u001b[0mx_norms\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
      "\u001b[0;31mValueError\u001b[0m: operands could not be broadcast together with shapes (5,10) (5,) "
     ]
    }
   ],
   "source": [
    "x/x_norms"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**np.newaxis** adds a dimension to a tensor. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[0.48507125 0.         0.         0.24253563 0.24253563 0.\n",
      "  0.24253563 0.         0.72760688 0.24253563]\n",
      " [0.24253563 0.24253563 0.48507125 0.24253563 0.24253563 0.24253563\n",
      "  0.         0.48507125 0.48507125 0.        ]\n",
      " [0.61237244 0.         0.20412415 0.         0.40824829 0.\n",
      "  0.20412415 0.40824829 0.20412415 0.40824829]\n",
      " [0.60302269 0.30151134 0.         0.         0.30151134 0.30151134\n",
      "  0.30151134 0.30151134 0.30151134 0.30151134]\n",
      " [0.5        0.25       0.5        0.25       0.         0.25\n",
      "  0.         0.5        0.25       0.        ]]\n"
     ]
    }
   ],
   "source": [
    "print(x/x_norms[:,np.newaxis])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(5, 10) (5, 1)\n"
     ]
    }
   ],
   "source": [
    "x_norms = np.sqrt((x**2).sum(axis=-1, keepdims=True))\n",
    "print(x.shape, x_norms.shape)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[0.48507125 0.         0.         0.24253563 0.24253563 0.\n",
      "  0.24253563 0.         0.72760688 0.24253563]\n",
      " [0.24253563 0.24253563 0.48507125 0.24253563 0.24253563 0.24253563\n",
      "  0.         0.48507125 0.48507125 0.        ]\n",
      " [0.61237244 0.         0.20412415 0.         0.40824829 0.\n",
      "  0.20412415 0.40824829 0.20412415 0.40824829]\n",
      " [0.60302269 0.30151134 0.         0.         0.30151134 0.30151134\n",
      "  0.30151134 0.30151134 0.30151134 0.30151134]\n",
      " [0.5        0.25       0.5        0.25       0.         0.25\n",
      "  0.         0.5        0.25       0.        ]]\n"
     ]
    }
   ],
   "source": [
    "print(x/x_norms)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Think about the axes and what they mean! Use keepdims=True!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Subtract per-column maximum: \n",
      "\n",
      "[[  0 100]\n",
      " [  2 102]]\n",
      "\n",
      "[  2 102]\n",
      "\n",
      "[[-2 -2]\n",
      " [ 0  0]]\n"
     ]
    }
   ],
   "source": [
    "# Subtract per-column maximum (axis=0 takes the maximum over rows)\n",
    "x = np.array([[0,100],[2,102]])\n",
    "m = x.max(axis=0)\n",
    "print(x, m, x-m, sep='\\n\\n')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[  0 100]\n",
      " [  2 102]]\n",
      "\n",
      "[100 102]\n",
      "\n",
      "[[-100   -2]\n",
      " [ -98    0]]\n"
     ]
    }
   ],
   "source": [
    "# Subtract per-row maximum, INCORRECT! m has shape (2,), so it is broadcast along the wrong axis\n",
    "m = x.max(axis=1)\n",
    "print(x, m, x-m, sep='\\n\\n')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[  0 100]\n",
      " [  2 102]]\n",
      "\n",
      "[[100]\n",
      " [102]]\n",
      "\n",
      "[[-100    0]\n",
      " [-100    0]]\n"
     ]
    }
   ],
   "source": [
    "# Subtract per-row maximum, correct! keepdims=True gives m the shape (2, 1)\n",
    "m = x.max(axis=1, keepdims=True)\n",
    "print(x, m, x-m, sep='\\n\\n')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# SciPy\n",
    "\n",
    "* A Python library for scientific computing\n",
    "\n",
    "* One of the main reasons to use it is its support for sparse matrices\n",
    "\n",
    "* A detailed description of its functionality, with examples:\n",
    "https://www.tutorialspoint.com/scipy/index.htm"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 1. Installing SciPy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "WARNING: The conda.compat module is deprecated and will be removed in a future release.\n",
      "Collecting package metadata: done\n",
      "Solving environment: done\n",
      "\n",
      "\n",
      "==> WARNING: A newer version of conda exists. <==\n",
      "  current version: 4.6.11\n",
      "  latest version: 4.7.12\n",
      "\n",
      "Please update conda by running\n",
      "\n",
      "    $ conda update -n base -c defaults conda\n",
      "\n",
      "\n",
      "\n",
      "# All requested packages already installed.\n",
      "\n"
     ]
    }
   ],
   "source": [
    "! conda install --yes scipy"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 2. Importing the library"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'1.3.1'"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import scipy as sp\n",
    "sp.__version__"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 3. Creating sparse matrices"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [],
   "source": [
    "from scipy.sparse import dok_matrix, \\\n",
    "                         lil_matrix, \\\n",
    "                         coo_matrix, \\\n",
    "                         csr_matrix"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Sparse matrix storage formats"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### dok - dictionary of keys\n",
    "A dictionary mapping (i, j) to the element  \n",
    "Fast insertion and fast element lookup by index"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### lil - list of lists \n",
    "Stored row by row as two lists $[l_1, \\cdots, l_s]$, $[v_1, \\cdots, v_s]$  \n",
    "$l_i$ is the list of column indices of the nonzero values in row $i$, and $v_i$ is the list of those values  \n",
    "In the worst case, inserting an element takes linear time.  \n",
    "Fast access to rows  \n",
    "Slow access to columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### coo - coordinate list\n",
    "Stores triples (row, column, value), or three arrays rows, columns, values.  \n",
    "Fast conversion to other formats, especially CSR/CSC  \n",
    "Fast insertion.  \n",
    "No access by index  \n",
    "Duplicates are allowed. They are summed when converting to another format "
   ]
  },
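  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The duplicate-summing behaviour can be checked on a tiny example. This is a sketch with made-up values; `rows`, `cols` and `vals` are illustrative names.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "from scipy.sparse import coo_matrix\n",
    "\n",
    "# Position (0, 1) appears twice on purpose.\n",
    "rows = np.array([0, 0, 1])\n",
    "cols = np.array([1, 1, 2])\n",
    "vals = np.array([10, 5, 7])\n",
    "\n",
    "coo = coo_matrix((vals, (rows, cols)), shape=(2, 3))\n",
    "csr = coo.tocsr()  # conversion sums the two entries at (0, 1)\n",
    "\n",
    "print(coo.nnz, csr.nnz)\n",
    "print(csr.toarray())\n",
    "```\n",
    "\n",
    "The COO matrix still stores three entries, while the CSR result stores two, with 10 and 5 merged into 15."
   ]
  },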
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### CSR(CSC) Compressed sparse row(column)  <- default choice!\n",
    "Stored as three arrays:  \n",
    "1. data -- all nonzero elements\n",
    "2. indices -- column indices of the corresponding nonzero elements in data\n",
    "3. indptr[i] -- index at which the data of row i starts in data/indices\n",
    "\n",
    "Efficient operations: +, *, multiplication by a vector  \n",
    "Fast access to rows for CSR, to columns for CSC    \n",
    "Slow insertion of elements  \n",
    "Slow access to columns for CSR, to rows for CSC"
   ]
  },
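  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the three arrays concrete, here is a sketch that builds them by hand for a small dense matrix and compares them with what `csr_matrix` stores. The loop is for illustration only and is not how SciPy builds the arrays internally; `hand_data`, `hand_indices` and `hand_indptr` are illustrative names.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "from scipy.sparse import csr_matrix\n",
    "\n",
    "dense = np.array([[1, 0, 2],\n",
    "                  [0, 0, 3],\n",
    "                  [4, 5, 6]])\n",
    "\n",
    "hand_data, hand_indices, hand_indptr = [], [], [0]\n",
    "for row in dense:\n",
    "    nonzero_cols = np.nonzero(row)[0]\n",
    "    hand_data.extend(row[nonzero_cols].tolist())  # nonzero values of this row\n",
    "    hand_indices.extend(nonzero_cols.tolist())    # their column indices\n",
    "    hand_indptr.append(len(hand_data))            # where the next row starts\n",
    "\n",
    "reference = csr_matrix(dense)\n",
    "print(hand_data, hand_indices, hand_indptr)\n",
    "print(reference.data, reference.indices, reference.indptr)\n",
    "```\n",
    "\n",
    "Row i occupies the slice `indptr[i]:indptr[i+1]` of both `data` and `indices`, which is why row access is cheap and column access is not."
   ]
  },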
  {
   "cell_type": "code",
   "execution_count": 115,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]\n",
      " [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]\n",
      " [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]\n",
      " [0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0]\n",
      " [0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]\n",
      " [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0]\n",
      " [0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]\n",
      " [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]\n",
      " [0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]\n",
      " [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0]]\n",
      "  0 1 2 3 4 5 6 7 8 91011121314151617181920212223242526272829\n",
      "int64 Size: 0.745G\n"
     ]
    }
   ],
   "source": [
    "# Create matrix with random values sampled from Binomial distribution\n",
    "# n - number of experiments\n",
    "# p - probability of success\n",
    "matrix = np.random.binomial(n=5, p=0.01, size=(10**3, 10**5))\n",
    "print(matrix[:10,:30])\n",
    "print('',''.join(('%2s'%i for i in range(30))))\n",
    "print( matrix.dtype, 'Size: %.3fG' % (matrix.nbytes/2**30))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**A sparse matrix can be created from a dense matrix**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 119,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "int64 Size: 0.055G\n"
     ]
    }
   ],
   "source": [
    "\n",
    "smatrix = csr_matrix(matrix)\n",
    "\n",
    "print( smatrix.dtype, 'Size: %.3fG' % ((smatrix.data.nbytes + smatrix.indices.nbytes + smatrix.indptr.nbytes) / 2**30))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 125,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[      0    4960    9840 ... 4890806 4895719 4900712]\n",
      "Row 0\n",
      "data:  [1 1 1 ... 1 1 1]\n",
      "column indices:  [    3     5    40 ... 99908 99910 99945]\n",
      "Row 1\n",
      "data:  [1 1 1 ... 1 1 1]\n",
      "column indices:  [   31    45   123 ... 99966 99968 99972]\n",
      "Row 5\n",
      "data:  [2 1 1 ... 1 1 1]\n",
      "column indices:  [   17    36    43 ... 99915 99929 99991]\n"
     ]
    }
   ],
   "source": [
    "print(smatrix.indptr)\n",
    "for i in (0,1,5):\n",
    "    print('Row', i)\n",
    "    print('data: ', smatrix.data[smatrix.indptr[i]:smatrix.indptr[i+1]])\n",
    "    print('column indices: ', smatrix.indices[smatrix.indptr[i]:smatrix.indptr[i+1]])\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 136,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "-3779.614441053174\n",
      "-3779.6144410531656\n"
     ]
    }
   ],
   "source": [
    "w = np.random.randn(10**5)\n",
    "print(matrix.dot(w).sum())\n",
    "print(smatrix.dot(w).sum())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 141,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1.1 s ± 515 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
     ]
    }
   ],
   "source": [
    "%%timeit\n",
    "matrix.dot(w).sum()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 142,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "30.7 ms ± 392 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
     ]
    }
   ],
   "source": [
    "%%timeit # 30x faster, doesn't process 0s!\n",
    "smatrix.dot(w).sum() "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**It is better not to create a CSR(CSC) matrix from a dense matrix**\n",
    "(the dense matrix may not fit in memory, and time is spent on creating and converting it)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 154,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<3x3 sparse matrix of type '<class 'numpy.int64'>'\n",
       "\twith 6 stored elements in Compressed Sparse Row format>"
      ]
     },
     "execution_count": 154,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data = np.array([1, 2, 3, 4, 5, 6]) # Non-zero elements\n",
    "row = np.array([0, 0, 1, 2, 2, 2]) # their row indices\n",
    "col = np.array([0, 2, 2, 0, 1, 2]) # their column indices\n",
    "csr_mat = csr_matrix((data, (row, col)), shape=(3, 3))\n",
    "csr_mat"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**A sparse matrix can be converted to a dense one with the todense() method**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 150,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[1 0 2]\n",
      " [0 0 3]\n",
      " [4 5 6]]\n"
     ]
    }
   ],
   "source": [
    "print(csr_mat.todense())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**There is one more way to define a CSR matrix**\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 155,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[1 0 2]\n",
      " [0 0 3]\n",
      " [4 5 6]]\n"
     ]
    }
   ],
   "source": [
    "data = [1,2,3,4,5,6] # non-zero elements\n",
    "indices = [0,2,2,0,1,2] # their column indices\n",
    "indptr = [0, 2, 3, 6] # i-th row is stored in data[indptr[i]:indptr[i+1]], indices[indptr[i]:indptr[i+1]]\n",
    "csr_mat = csr_matrix((data, indices, indptr), shape=(3, 3))\n",
    "print(csr_mat.todense())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 4. Operations in SciPy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[1 2 0]\n",
      " [4 0 3]\n",
      " [0 5 6]]\n",
      "\n"
     ]
    }
   ],
   "source": [
    "row = np.array([0, 0, 1, 1, 2, 2])\n",
    "col = np.array([0, 1, 2, 0, 1, 2])\n",
    "data = np.array([1, 2, 3, 4, 5, 6])\n",
    "\n",
    "sparse_matrix1 = csr_matrix((data, (row, col)), shape=(3, 3))\n",
    "print(sparse_matrix1.todense(), end=\"\\n\\n\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[0 2 1]\n",
      " [4 3 0]\n",
      " [5 0 6]]\n",
      "\n"
     ]
    }
   ],
   "source": [
    "row = np.array([0, 0, 1, 1, 2, 2])\n",
    "col = np.array([2, 1, 1, 0, 0, 2])\n",
    "data = np.array([1, 2, 3, 4, 5, 6])\n",
    "sparse_matrix2 = csr_matrix((data, (row, col)), shape=(3, 3))\n",
    "print(sparse_matrix2.todense(), end=\"\\n\\n\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Matrix multiplication:\n",
      "[[ 8  8  1]\n",
      " [15  8 22]\n",
      " [50 15 36]]\n"
     ]
    }
   ],
   "source": [
    "print(\"Matrix multiplication:\")\n",
    "res = sparse_matrix1.dot(sparse_matrix2).todense()\n",
    "print(res)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Matrix element-wise addition:\n",
      "[[ 1  4  1]\n",
      " [ 8  3  3]\n",
      " [ 5  5 12]]\n"
     ]
    }
   ],
   "source": [
    "print(\"Matrix element-wise addition:\")\n",
    "res = (sparse_matrix1 + sparse_matrix2).todense()\n",
    "print(res)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Matrix element-wise multiplication:\n",
      "[[ 0  4  0]\n",
      " [16  0  0]\n",
      " [ 0  0 36]]\n"
     ]
    }
   ],
   "source": [
    "print(\"Matrix element-wise multiplication:\")\n",
    "res = sparse_matrix1.multiply(sparse_matrix2).todense()\n",
    "print(res)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Attention!!!\n",
    "np.vstack and np.hstack do not work if one of the arguments is a sparse matrix"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[66 35 95]\n",
      " [ 0 22  1]\n",
      " [68 48 84]]\n",
      "(3, 3) (3, 3)\n"
     ]
    }
   ],
   "source": [
    "x = np.random.randint(0, 100, (3, 3))\n",
    "print(x)\n",
    "print(x.shape, csr_mat.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "ename": "ValueError",
     "evalue": "all the input array dimensions for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 3 and the array at index 1 has size 1",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mValueError\u001b[0m                                Traceback (most recent call last)",
      "\u001b[0;32m<ipython-input-32-0908583e775d>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mvstack\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mx\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcsr_mat\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
      "\u001b[0;32m<__array_function__ internals>\u001b[0m in \u001b[0;36mvstack\u001b[0;34m(*args, **kwargs)\u001b[0m\n",
      "\u001b[0;32m~/anaconda3/envs/tutor/lib/python3.7/site-packages/numpy/core/shape_base.py\u001b[0m in \u001b[0;36mvstack\u001b[0;34m(tup)\u001b[0m\n\u001b[1;32m    280\u001b[0m     \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0misinstance\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0marrs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mlist\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    281\u001b[0m         \u001b[0marrs\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0marrs\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 282\u001b[0;31m     \u001b[0;32mreturn\u001b[0m \u001b[0m_nx\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mconcatenate\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0marrs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m0\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    283\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    284\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;32m<__array_function__ internals>\u001b[0m in \u001b[0;36mconcatenate\u001b[0;34m(*args, **kwargs)\u001b[0m\n",
      "\u001b[0;31mValueError\u001b[0m: all the input array dimensions for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 3 and the array at index 1 has size 1"
     ]
    }
   ],
   "source": [
    "np.vstack([x, csr_mat])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Use the functions from SciPy instead"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[66 35 95  1  0  2]\n",
      " [ 0 22  1  0  0  3]\n",
      " [68 48 84  4  5  6]]\n",
      "\n",
      "[[66 35 95]\n",
      " [ 0 22  1]\n",
      " [68 48 84]\n",
      " [ 1  0  2]\n",
      " [ 0  0  3]\n",
      " [ 4  5  6]]\n"
     ]
    }
   ],
   "source": [
    "from scipy.sparse import vstack, hstack\n",
    "\n",
    "print(hstack([x, csr_mat]).todense(), end=\"\\n\\n\")\n",
    "print(vstack([x, csr_mat]).todense())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Attention!!!\n",
    "\n",
    "Be careful when multiplying a sparse matrix by a dense one.\n",
    "\n",
    "The multiplication can accidentally convert the sparse matrix to a dense one, with two possible outcomes:\n",
    "\n",
    "* The dense copy does not fit in memory and a MemoryError is raised, but before that everything hangs while the matrix tries to fit into RAM\n",
    "* The multiplication takes a long time, because the zeros are no longer skipped"
   ]
  },
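  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One way to keep SciPy in charge of a product with the dense vector on the left, sketched under the assumption that the result is a plain vector: rewrite `vec.dot(mat)` as `mat.T.dot(vec)`. The sizes below are deliberately tiny so that a dense cross-check is affordable; `mat`, `vec`, `left_product` and `reference` are illustrative names.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "from scipy.sparse import random as sparse_random\n",
    "\n",
    "# Small random sparse matrix in CSR format (about 10% nonzero entries).\n",
    "mat = sparse_random(50, 40, density=0.1, format='csr', random_state=0)\n",
    "vec = np.random.randn(50)\n",
    "\n",
    "# vec @ mat rewritten as mat.T @ vec, so the sparse method does the work.\n",
    "left_product = mat.T.dot(vec)\n",
    "\n",
    "# Dense reference, acceptable only because the matrix is tiny.\n",
    "reference = vec.dot(mat.toarray())\n",
    "print(left_product.shape, np.allclose(left_product, reference))\n",
    "```\n",
    "\n",
    "The general habit is to call the method of the sparse operand, so that NumPy never has to interpret the sparse matrix itself."
   ]
  },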
  {
   "cell_type": "code",
   "execution_count": 179,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(1000000, 1000000) float64\n",
      "Size: 0.015G\n",
      "Dense size: 7450.581G\n"
     ]
    }
   ],
   "source": [
    "import scipy\n",
    "csr_mat = csr_matrix(scipy.sparse.eye(10**6))\n",
    "shape = csr_mat.shape\n",
    "print(shape, csr_mat.dtype)\n",
    "print( 'Size: %.3fG' % ((csr_mat.data.nbytes + csr_mat.indices.nbytes + csr_mat.indptr.nbytes)/2**30))\n",
    "print('Dense size: %.3fG' % (shape[0]*shape[1]*8/2**30))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 180,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(1000000,)\n"
     ]
    }
   ],
   "source": [
    "w = np.random.randn(10**6)\n",
    "print(w.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 182,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "7.4 ms ± 1.07 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)\n"
     ]
    }
   ],
   "source": [
    "%%timeit -n 1 -r 10\n",
    "csr_mat.dot(w)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Jupyter dies here (kernel dead)\n",
    "w.dot(csr_mat)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}