{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# KNN Based Collaborative Filtering In Python Using Surprise "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data Exploration "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### About the dataset\n",
"A dataset for collaborative filtering usually contains three columns: user ID, item ID, and rating. Details about users and items are often kept in separate, related datasets so that they can be analysed on their own. Items can be movies, books, products, or anything else a user can rate.\n",
"In our case we use a movies dataset, kept deliberately small and easy to follow for illustration. In practice you'll find much more complex datasets.\n",
"\n",
"We have 5 users rating 10 movies. Of the 50 possible ratings, 30 are available, since some users have not rated some movies. The movies are the top 10 from the IMDb Top 250 list. User names and ratings are fictitious."
]
},
{
"cell_type": "code",
"execution_count": 440,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"ratings_ds = pd.read_csv('data/sample/movies_ratings.csv')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's explore the full dataset. Since the number of records is small, we can look at all of them; otherwise you'd use `head()`, `describe()`, or other pandas functions to get a sense of the data."
]
},
{
"cell_type": "code",
"execution_count": 441,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>userName</th>\n",
" <th>userId</th>\n",
" <th>movieName</th>\n",
" <th>movieId</th>\n",
" <th>rating</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Alice</td>\n",
" <td>1</td>\n",
" <td>The Shawshank Redemption</td>\n",
" <td>1</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Alice</td>\n",
" <td>1</td>\n",
" <td>The Godfather</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Alice</td>\n",
" <td>1</td>\n",
" <td>The Godfather: Part II</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Alice</td>\n",
" <td>1</td>\n",
" <td>The Dark Knight</td>\n",
" <td>4</td>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Alice</td>\n",
" <td>1</td>\n",
" <td>12 Angry Men</td>\n",
" <td>5</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>Alice</td>\n",
" <td>1</td>\n",
" <td>Schindler's List</td>\n",
" <td>6</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>Alice</td>\n",
" <td>1</td>\n",
" <td>The Lord of the Rings: The Return of the King</td>\n",
" <td>7</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>Alice</td>\n",
" <td>1</td>\n",
" <td>Pulp Fiction</td>\n",
" <td>8</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>Alice</td>\n",
" <td>1</td>\n",
" <td>The Good, the Bad and the Ugly</td>\n",
" <td>9</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>Alice</td>\n",
" <td>1</td>\n",
" <td>Fight Club</td>\n",
" <td>10</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>Bob</td>\n",
" <td>2</td>\n",
" <td>The Godfather: Part II</td>\n",
" <td>3</td>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>Bob</td>\n",
" <td>2</td>\n",
" <td>The Dark Knight</td>\n",
" <td>4</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>Bob</td>\n",
" <td>2</td>\n",
" <td>12 Angry Men</td>\n",
" <td>5</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>Bob</td>\n",
" <td>2</td>\n",
" <td>Fight Club</td>\n",
" <td>10</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>Bob</td>\n",
" <td>2</td>\n",
" <td>The Good, the Bad and the Ugly</td>\n",
" <td>9</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td>Carl</td>\n",
" <td>3</td>\n",
" <td>The Shawshank Redemption</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>Carl</td>\n",
" <td>3</td>\n",
" <td>The Godfather</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17</th>\n",
" <td>Carl</td>\n",
" <td>3</td>\n",
" <td>The Godfather: Part II</td>\n",
" <td>3</td>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>18</th>\n",
" <td>Carl</td>\n",
" <td>3</td>\n",
" <td>Pulp Fiction</td>\n",
" <td>8</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19</th>\n",
" <td>Carl</td>\n",
" <td>3</td>\n",
" <td>The Good, the Bad and the Ugly</td>\n",
" <td>9</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>20</th>\n",
" <td>Deb</td>\n",
" <td>4</td>\n",
" <td>The Godfather</td>\n",
" <td>2</td>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>21</th>\n",
" <td>Deb</td>\n",
" <td>4</td>\n",
" <td>The Godfather: Part II</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>22</th>\n",
" <td>Deb</td>\n",
" <td>4</td>\n",
" <td>The Dark Knight</td>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>23</th>\n",
" <td>Deb</td>\n",
" <td>4</td>\n",
" <td>12 Angry Men</td>\n",
" <td>5</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>24</th>\n",
" <td>Earl</td>\n",
" <td>5</td>\n",
" <td>Schindler's List</td>\n",
" <td>6</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25</th>\n",
" <td>Earl</td>\n",
" <td>5</td>\n",
" <td>Pulp Fiction</td>\n",
" <td>8</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>26</th>\n",
" <td>Earl</td>\n",
" <td>5</td>\n",
" <td>The Good, the Bad and the Ugly</td>\n",
" <td>9</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>27</th>\n",
" <td>Earl</td>\n",
" <td>5</td>\n",
" <td>The Shawshank Redemption</td>\n",
" <td>1</td>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>28</th>\n",
" <td>Earl</td>\n",
" <td>5</td>\n",
" <td>The Dark Knight</td>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>29</th>\n",
" <td>Earl</td>\n",
" <td>5</td>\n",
" <td>Fight Club</td>\n",
" <td>10</td>\n",
" <td>3</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" userName userId movieName movieId \\\n",
"0 Alice 1 The Shawshank Redemption   1 \n",
"1 Alice 1 The Godfather 2 \n",
"2 Alice 1 The Godfather: Part II 3 \n",
"3 Alice 1 The Dark Knight  4 \n",
"4 Alice 1 12 Angry Men 5 \n",
"5 Alice 1 Schindler's List 6 \n",
"6 Alice 1 The Lord of the Rings: The Return of the King 7 \n",
"7 Alice 1 Pulp Fiction 8 \n",
"8 Alice 1 The Good, the Bad and the Ugly  9 \n",
"9 Alice 1 Fight Club 10 \n",
"10 Bob 2 The Godfather: Part II 3 \n",
"11 Bob 2 The Dark Knight  4 \n",
"12 Bob 2 12 Angry Men 5 \n",
"13 Bob 2 Fight Club 10 \n",
"14 Bob 2 The Good, the Bad and the Ugly  9 \n",
"15 Carl 3 The Shawshank Redemption   1 \n",
"16 Carl 3 The Godfather 2 \n",
"17 Carl 3 The Godfather: Part II 3 \n",
"18 Carl 3 Pulp Fiction 8 \n",
"19 Carl 3 The Good, the Bad and the Ugly  9 \n",
"20 Deb 4 The Godfather 2 \n",
"21 Deb 4 The Godfather: Part II 3 \n",
"22 Deb 4 The Dark Knight  4 \n",
"23 Deb 4 12 Angry Men 5 \n",
"24 Earl 5 Schindler's List 6 \n",
"25 Earl 5 Pulp Fiction 8 \n",
"26 Earl 5 The Good, the Bad and the Ugly  9 \n",
"27 Earl 5 The Shawshank Redemption   1 \n",
"28 Earl 5 The Dark Knight  4 \n",
"29 Earl 5 Fight Club 10 \n",
"\n",
" rating \n",
"0 4 \n",
"1 3 \n",
"2 1 \n",
"3 5 \n",
"4 3 \n",
"5 3 \n",
"6 2 \n",
"7 2 \n",
"8 3 \n",
"9 1 \n",
"10 5 \n",
"11 4 \n",
"12 4 \n",
"13 4 \n",
"14 2 \n",
"15 1 \n",
"16 1 \n",
"17 5 \n",
"18 2 \n",
"19 3 \n",
"20 5 \n",
"21 1 \n",
"22 1 \n",
"23 4 \n",
"24 2 \n",
"25 3 \n",
"26 4 \n",
"27 5 \n",
"28 1 \n",
"29 3 "
]
},
"execution_count": 441,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ratings_ds"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following chart shows how many times each rating value was given. As we can see, rating 1 was given 7 times and rating 5 only 5 times."
]
},
{
"cell_type": "code",
"execution_count": 442,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.axes._subplots.AxesSubplot at 0x1a2598f690>"
]
},
"execution_count": 442,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAWoAAAEDCAYAAAAcI05xAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjAsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+17YcXAAANAklEQVR4nO3dfYxld13H8c+H3Vb6BDXutaLLMJiQNUXrFqfFpgZti3XpkhojmjYRhKATE4FtRM0SYwx/mDQxUWuCiUNpEYUSqTRgKw9N6EpQujD7QO122wTqVtdSOo2F7QNps9uPf9w7ndnZu8yZ7T33fHfu+5VM9j6ce/PdszvvnPzmnLlOIgBAXS/regAAwA9GqAGgOEINAMURagAojlADQHGEGgCK29jGm27atCnT09NtvDUArEt79ux5Iklv2HOthHp6elrz8/NtvDUArEu2HznZcyx9AEBxhBoAiiPUAFAcoQaA4gg1ABS3aqhtb7G9f9nXEds3jGM4AECD0/OSPCRpqyTZ3iDpfyXd0fJcAICBtS59XCXpW0lOer4fAGC01nrBy3WSbhv2hO1ZSbOSNDU19ZKGmt5510t6/agcunF71yNgGf5fYFI1PqK2faakayV9atjzSeaSzCSZ6fWGXgUJADgFa1n6eIukvUm+09YwAIATrSXU1+skyx4AgPY0CrXtsyX9sqRPtzsOAGClRj9MTPKspB9peRYAwBBcmQgAxRFqACiOUANAcYQaAIoj1ABQHKEGgOIINQAUR6gBoDhCDQDFEWoAKI5QA0BxhBoAiiPUAFAcoQaA4gg1ABRHqAGgOEINAMURagAojlADQHGEGgCKa/op5Ofbvt32g7YP2r6s7cEAAH2NPoVc0k2SPp/kbbbPlHR2izMBAJZZNdS2XyHpTZLeKUlJnpf0fLtjAQAWNVn6+ElJC5Jutb3P9s22z2l5LgDAQJOlj42S3iDpvUl2275J0k5Jf7p8I9uzkmYlaWpqatRzAlhmeuddXY8gSTp04/auR5gITY6oD0s6nGT34P7t6of7OEnmkswkmen1eqOcEQAm2qqhTvKYpP+xvWXw0FWSHmh1KgDAi5qe9fFeSR8fnPHxsKR3tTcSAGC5RqFOsl/STMuzAACG4MpEACiOUANAcYQaAIoj1ABQHKEGgOIINQAUR6gBoDhCDQDFEWoAKI5QA0BxhBoAiiPUAFAcoQaA4gg1ABRHqAGgOEINAMURagAojlADQHGEGgCKI9QAUByhBoDiGn0Kue1Dkp6SdEzS0SR8IjkAjEmjUA9ckeSJ1iYBAAzF0gcAFNc01JH0Rdt7bM+2ORAA4HhNlz4uT/Ko7R+VdLftB5N8efkGg4DPStLU1NSIx5xc0zvv6noESdKhG7d3PQIw1CR8jzQ6ok7y6ODPxyXdIenSIdvMJZlJMtPr9UY7JQBMsFVDbfsc2+ct3pZ0taT72x4MANDXZOnjAkl32F7c/hNJPt/qVACAF60a6iQPS/rZMcwCABiC0/MAoDhCDQDFEWoAKI5QA0BxhBoAiiPUAFAcoQaA4gg1ABRHqAGgOEINAMURagAojlADQHGEGgCKI9QAUByhBoDiCDUAFEeoAaA4Qg0AxRFqACiOUANAcY1DbXuD7X2272xzIADA8dZyRL1D0sG2BgEADNco1LY3S9ou6eZ2xwEArNT0iPqvJf2xpBdanAUAMMSqobb9VkmPJ9mzynaztudtzy8sLIxsQACYdE2OqC+XdK3tQ5I+KelK2/+4cqMkc0lmksz0er0RjwkAk2vVUCf5QJLNSaYlXSfpS0l+q/XJAACSOI8aAMrbuJaNk+yStKuVSQAAQ3FEDQDFEWoAKI5QA0BxhBoAiiPUAFAcoQaA4gg1ABRHqAGgOEINAMURagAojlADQHGEGgCKI9QAUByhBoDiCDUAFEeoAaA4Qg0AxRFqACiOUANAcYQaAIoj1ABQ3Kqhtv1y21
+z/Q3bB2x/cByDAQD6NjbY5jlJVyZ52vYZkr5i+3NJ7m15NgCAGoQ6SSQ9Pbh7xuArbQ4FAFjSaI3a9gbb+yU9LunuJLvbHQsAsKhRqJMcS7JV0mZJl9r+6ZXb2J61PW97fmFhYdRzAsDEWtNZH0m+K2mXpG1DnptLMpNkptfrjWg8AECTsz56ts8f3D5L0pslPdj2YACAviZnfbxK0t/b3qB+2P8pyZ3tjgUAWNTkrI/7JF08hlkAAENwZSIAFEeoAaA4Qg0AxRFqACiOUANAcYQaAIoj1ABQHKEGgOIINQAUR6gBoDhCDQDFEWoAKI5QA0BxhBoAiiPUAFAcoQaA4gg1ABRHqAGgOEINAMURagAobtVQ23617XtsH7R9wPaOcQwGAOhb9VPIJR2V9P4ke22fJ2mP7buTPNDybAAANTiiTvLtJHsHt5+SdFDST7Q9GACgb01r1LanJV0saXcbwwAATtQ41LbPlfTPkm5IcmTI87O2523PLywsjHJGAJhojUJt+wz1I/3xJJ8etk2SuSQzSWZ6vd4oZwSAidbkrA9L+oikg0n+sv2RAADLNTmivlzS2yVdaXv/4OualucCAAysenpekq9I8hhmAQAMwZWJAFAcoQaA4gg1ABRHqAGgOEINAMURagAojlADQHGEGgCKI9QAUByhBoDiCDUAFEeoAaA4Qg0AxRFqACiOUANAcYQaAIoj1ABQHKEGgOIINQAUR6gBoDhCDQDFrRpq27fYftz2/eMYCABwvCZH1B+VtK3lOQAAJ7FqqJN8WdL/jWEWAMAQrFEDQHEjC7XtWdvztucXFhZG9bYAMPFGFuokc0lmksz0er1RvS0ATDyWPgCguCan590m6auSttg+bPvd7Y8FAFi0cbUNklw/jkEAAMOx9AEAxRFqACiOUANAcYQaAIoj1ABQHKEGgOIINQAUR6gBoDhCDQDFEWoAKI5QA0BxhBoAiiPUAFAcoQaA4gg1ABRHqAGgOEINAMURagAojlADQHGEGgCKI9QAUFyjUNveZvsh29+0vbPtoQAAS1YNte0Nkj4k6S2SLpR0ve0L2x4MANDX5Ij6UknfTPJwkuclfVLSr7Y7FgBgkZP84A3st0naluR3BvffLumNSd6zYrtZSbODu1skPTT6cddkk6QnOp6hCvbFEvbFEvbFkgr74jVJesOe2NjgxR7y2Al1TzInaW6Ng7XG9nySma7nqIB9sYR9sYR9saT6vmiy9HFY0quX3d8s6dF2xgEArNQk1F+X9Drbr7V9pqTrJH223bEAAItWXfpIctT2eyR9QdIGSbckOdD6ZC9dmWWYAtgXS9gXS9gXS0rvi1V/mAgA6BZXJgJAcYQaAIoj1ABQHKFeh2z/lO2rbJ+74vFtXc3UFduX2r5kcPtC239g+5qu5+qa7Y91PUMVtn9h8P/i6q5nOZl1/8NE2+9KcmvXc4yL7fdJ+n1JByVtlbQjyWcGz+1N8oYu5xsn23+m/u+o2SjpbklvlLRL0pslfSHJn3c33fjYXnk6rSVdIelLkpTk2rEP1SHbX0ty6eD276r//XKHpKsl/UuSG7ucb5hJCPV/J5nqeo5xsf2fki5L8rTtaUm3S/qHJDfZ3pfk4k4HHKPBvtgq6YckPSZpc5Ijts+StDvJRZ0OOCa290p6QNLN6l9VbEm3qX9NhJL8W3fTjd/y7wPbX5d0TZIF2+dIujfJz3Q74YmaXEJenu37TvaUpAvGOUsBG5I8LUlJDtn+JUm3236Nhv86gPXsaJJjkp61/a0kRyQpyfdtv9DxbOM0I2mHpD+R9EdJ9tv+/qQFepmX2f5h9Zd+nWRBkpI8Y/tot6MNty5CrX6Mf0XSkyset6T/GP84nXrM9tYk+yVpcGT9Vkm3SCp3pNCy522fneRZST+3+KDtV0qamFAneUHSX9n+1ODP72j9fO+fildK2qN+H2L7x5I8NviZTsmDmfXyj3WnpHMX47Sc7V3jH6dT75B03FFBkqOS3m
H777oZqTNvSvKc9GKsFp0h6be7Gak7SQ5L+g3b2yUd6XqeriSZPslTL0j6tTGO0ti6X6MGgNMdp+cBQHGEGgCKI9RY12zfYPvsZff/1fb5Xc4ErBVr1Djt2bb6/5dPOJPD9iFJM0m6/pgl4JRxRI3Tku1p2wdt/62kvZI+Ynve9gHbHxxs8z5JPy7pHtv3DB47ZHvTstd/ePCaLw4uhJHtS2zfZ/urtv/C9v1d/T0BiVDj9LZF0scGV5m9f/CZdxdJ+kXbFyX5G/U/Nu6KJFcMef3rJH0oyeslfVfSrw8ev1XS7yW5TNKx1v8WwCoINU5njyS5d3D7NweXSu+T9HpJFzZ4/X8tO/d+j6Tpwfr1eUkWL5T6xEgnBk7BerngBZPpGUmy/VpJfyjpkiRP2v6opJc3eP1zy24fk3SWil6ZhsnGETXWg1eoH+3v2b5A/d+Yt+gpSec1faMkT0p6yvbPDx66bmRTAqeII2qc9pJ8w/Y+SQckPSzp35c9PSfpc7a/fZJ16mHeLenDtp9R/9eifm+U8wJrxel5wAq2z138DYS2d0p6VZIdHY+FCcYRNXCi7bY/oP73xyOS3tntOJh0HFEDQHH8MBEAiiPUAFAcoQaA4gg1ABRHqAGgOEINAMX9PxR3EjOxzPDrAAAAAElFTkSuQmCC\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"ratings_ds.groupby('rating').count()['userId'].plot.bar()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following graph shows how many users have rated each movie. As we can see, The Lord of the Rings: The Return of the King has been rated the fewest times."
]
},
{
"cell_type": "code",
"execution_count": 443,
"metadata": {
"scrolled": false
},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.axes._subplots.AxesSubplot at 0x1a25cca110>"
]
},
"execution_count": 443,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"ratings_ds.groupby('movieName').count()['userId'].plot.bar()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The graph below shows that the mean rating is highest for 12 Angry Men and lowest for The Lord of the Rings: The Return of the King."
]
},
{
"cell_type": "code",
"execution_count": 444,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.axes._subplots.AxesSubplot at 0x1a25c669d0>"
]
},
"execution_count": 444,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"ratings_ds.groupby('movieName')['rating'].mean().plot.bar()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's use a heatmap to see how often each user gave each rating value."
]
},
{
"cell_type": "code",
"execution_count": 445,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.axes._subplots.AxesSubplot at 0x1a25f44c90>"
]
},
"execution_count": 445,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 2 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"sns.heatmap(pd.crosstab(ratings_ds['userName'], ratings_ds['rating']), cmap=\"YlGnBu\", cbar=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As we can see, Alice most often gives a rating of 3 and Bob a rating of 4. This is a common property of user rating data: some users rate leniently and others strictly. We account for this by subtracting each user's mean rating from all of that user's ratings."
]
},
{
"cell_type": "code",
"execution_count": 446,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.axes._subplots.AxesSubplot at 0x1a2604c190>"
]
},
"execution_count": 446,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 2 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"sns.heatmap(pd.crosstab(ratings_ds['movieName'], ratings_ds['rating']), cmap=\"YlGnBu\", cbar=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"From the plot above we can see that The Godfather: Part II has the most 5 ratings as well as one of the highest counts of 1 ratings. That means users have either liked it a lot or disliked it a lot. \"12 Angry Men\" received only ratings of 3 or 4. It will be interesting to see whether these facts are reflected in our modelling."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Modeling "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's create a recommendation model using the Surprise package. I am going to use the KNNWithMeans model, alongside KNNBasic and KNNBaseline for comparison, and tweak some of the parameters."
]
},
{
"cell_type": "code",
"execution_count": 447,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"from surprise import SVD\n",
"from surprise import Dataset\n",
"from surprise import accuracy\n",
"from surprise.model_selection import train_test_split\n",
"from surprise import KNNBasic, KNNWithMeans, KNNBaseline\n",
"from surprise.model_selection import KFold\n",
"from surprise import Reader\n",
"from surprise import NormalPredictor\n",
"from surprise.model_selection import cross_validate\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"from surprise.model_selection import GridSearchCV"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's create the dataset on which we'll train and test our model. We'll also create an antiset that will be used for recommendations."
]
},
{
"cell_type": "code",
"execution_count": 448,
"metadata": {},
"outputs": [],
"source": [
"reader = Reader(rating_scale=(1, 5))\n",
"# The columns must correspond to user id, item id and ratings (in that order).\n",
"data = Dataset.load_from_df(ratings_ds[['userId', 'movieId', 'rating']], reader)\n",
"anti_set = data.build_full_trainset().build_anti_testset()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"An antiset (anti-test set) is the set of user and item pairs for which no rating exists in the original dataset. These are the pairs whose ratings we want to predict. For example, in the output below, userId 2 (Bob) has not rated movieId 1 (The Shawshank Redemption).\n",
"Surprise builds this set and fills in the global mean of all known ratings (2.9 here) as a placeholder rating. We'll use our model to compute an estimated rating for each pair."
]
},
{
"cell_type": "code",
"execution_count": 449,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"[(2, 1, 2.9),\n",
" (2, 2, 2.9),\n",
" (2, 6, 2.9),\n",
" (2, 7, 2.9),\n",
" (2, 8, 2.9),\n",
" (3, 4, 2.9),\n",
" (3, 5, 2.9),\n",
" (3, 6, 2.9),\n",
" (3, 7, 2.9),\n",
" (3, 10, 2.9),\n",
" (4, 1, 2.9),\n",
" (4, 6, 2.9),\n",
" (4, 7, 2.9),\n",
" (4, 8, 2.9),\n",
" (4, 9, 2.9),\n",
" (4, 10, 2.9),\n",
" (5, 2, 2.9),\n",
" (5, 3, 2.9),\n",
" (5, 5, 2.9),\n",
" (5, 7, 2.9)]"
]
},
"execution_count": 449,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"anti_set"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll create separate movies and users dataframes so that we can join names onto the dataframes that contain only user and movie IDs."
]
},
{
"cell_type": "code",
"execution_count": 450,
"metadata": {},
"outputs": [],
"source": [
"movies = ratings_ds[['movieId' , 'movieName']].drop_duplicates(['movieId' , 'movieName'])\n",
"users = ratings_ds[['userId' , 'userName']].drop_duplicates(['userId' , 'userName'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"KNNBasic with default parameters"
]
},
{
"cell_type": "code",
"execution_count": 451,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Computing the msd similarity matrix...\n",
"Done computing similarity matrix.\n",
"RMSE: 1.7945\n",
"Computing the msd similarity matrix...\n",
"Done computing similarity matrix.\n",
"RMSE: 2.5418\n",
"Computing the msd similarity matrix...\n",
"Done computing similarity matrix.\n",
"RMSE: 1.3928\n"
]
}
],
"source": [
"kf = KFold(n_splits=3)\n",
"best_algo = None\n",
"best_rmse = 1000.0\n",
"best_pred = None\n",
"for trainset, testset in kf.split(data):\n",
"    # Use a fresh model per fold so best_algo keeps the best fold's fit.\n",
"    algo = KNNBasic()\n",
"    algo.fit(trainset)\n",
"    predictions = algo.test(testset)\n",
"    # Compute and print the root mean squared error for this fold.\n",
"    rmse = accuracy.rmse(predictions, verbose=True)\n",
"    if rmse < best_rmse:\n",
"        best_rmse = rmse\n",
"        best_algo = algo\n",
"        best_pred = predictions"
]
},
{
"cell_type": "code",
"execution_count": 452,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Prediction(uid=1, iid=2, r_ui=3.0, est=5, details={'actual_k': 1, 'was_impossible': False}),\n",
" Prediction(uid=4, iid=3, r_ui=1.0, est=1, details={'actual_k': 1, 'was_impossible': False}),\n",
" Prediction(uid=5, iid=9, r_ui=4.0, est=2.9999999999999996, details={'actual_k': 2, 'was_impossible': False}),\n",
" Prediction(uid=2, iid=5, r_ui=4.0, est=3.05, details={'was_impossible': True, 'reason': 'Not enough neighbors.'}),\n",
" Prediction(uid=2, iid=9, r_ui=2.0, est=3.0000000000000004, details={'actual_k': 2, 'was_impossible': False}),\n",
" Prediction(uid=5, iid=6, r_ui=2.0, est=3.0, details={'actual_k': 1, 'was_impossible': False}),\n",
" Prediction(uid=3, iid=2, r_ui=1.0, est=3.05, details={'was_impossible': True, 'reason': 'Not enough neighbors.'}),\n",
" Prediction(uid=1, iid=5, r_ui=3.0, est=4.0, details={'actual_k': 1, 'was_impossible': False}),\n",
" Prediction(uid=2, iid=4, r_ui=4.0, est=1.5161290322580645, details={'actual_k': 2, 'was_impossible': False}),\n",
" Prediction(uid=3, iid=8, r_ui=2.0, est=2.354430379746835, details={'actual_k': 2, 'was_impossible': False})]"
]
},
"execution_count": 452,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"best_pred"
]
},
{
"cell_type": "code",
"execution_count": 453,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Computing the cosine similarity matrix...\n",
"Done computing similarity matrix.\n",
"RMSE: 2.1756\n",
"Computing the cosine similarity matrix...\n",
"Done computing similarity matrix.\n",
"RMSE: 2.0055\n",
"Computing the cosine similarity matrix...\n",
"Done computing similarity matrix.\n",
"RMSE: 2.2148\n",
"Computing the cosine similarity matrix...\n",
"Done computing similarity matrix.\n",
"RMSE: 2.3694\n",
"Computing the cosine similarity matrix...\n",
"Done computing similarity matrix.\n",
"RMSE: 2.2428\n",
"2.0055133068899793\n"
]
}
],
"source": [
"kf = KFold(n_splits=5)\n",
"sim_options = {'name': 'cosine'}\n",
"best_algo = None\n",
"best_rmse = 1000.0\n",
"best_pred = None\n",
"for trainset, testset in kf.split(data):\n",
"    # Use a fresh model per fold so best_algo keeps the best fold's fit.\n",
"    algo = KNNWithMeans(sim_options=sim_options)\n",
"    algo.fit(trainset)\n",
"    predictions = algo.test(testset)\n",
"    # Compute and print the root mean squared error for this fold.\n",
"    rmse = accuracy.rmse(predictions, verbose=True)\n",
"    if rmse < best_rmse:\n",
"        best_rmse = rmse\n",
"        best_algo = algo\n",
"        best_pred = predictions\n",
"print(best_rmse)"
]
},
{
"cell_type": "code",
"execution_count": 454,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"[Prediction(uid=1, iid=3, r_ui=1.0, est=3.3846312031704846, details={'actual_k': 3, 'was_impossible': False}),\n",
" Prediction(uid=2, iid=9, r_ui=2.0, est=4.72213595499958, details={'actual_k': 2, 'was_impossible': False}),\n",
" Prediction(uid=1, iid=9, r_ui=3.0, est=3.3075798466104294, details={'actual_k': 2, 'was_impossible': False}),\n",
" Prediction(uid=3, iid=2, r_ui=1.0, est=4.1875, details={'actual_k': 2, 'was_impossible': False}),\n",
" Prediction(uid=5, iid=8, r_ui=3.0, est=2.125, details={'actual_k': 1, 'was_impossible': False}),\n",
" Prediction(uid=3, iid=8, r_ui=2.0, est=2.125, details={'actual_k': 1, 'was_impossible': False})]"
]
},
"execution_count": 454,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"best_pred"
]
},
{
"cell_type": "code",
"execution_count": 455,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Estimating biases using als...\n",
"Computing the msd similarity matrix...\n",
"Done computing similarity matrix.\n",
"RMSE: 2.1314\n",
"Estimating biases using als...\n",
"Computing the msd similarity matrix...\n",
"Done computing similarity matrix.\n",
"RMSE: 2.2056\n",
"Estimating biases using als...\n",
"Computing the msd similarity matrix...\n",
"Done computing similarity matrix.\n",
"RMSE: 2.3116\n"
]
}
],
"source": [
"kf = KFold(n_splits=3)\n",
"best_algo = None\n",
"best_rmse = 1000.0\n",
"best_pred = None\n",
"for trainset, testset in kf.split(data):\n",
"    # Use a fresh model per fold so best_algo keeps the best fold's fit.\n",
"    algo = KNNBaseline(k=3)\n",
"    algo.fit(trainset)\n",
"    predictions = algo.test(testset)\n",
"    # Compute and print the root mean squared error for this fold.\n",
"    rmse = accuracy.rmse(predictions, verbose=True)\n",
"    if rmse < best_rmse:\n",
"        best_rmse = rmse\n",
"        best_algo = algo\n",
"        best_pred = predictions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Changing Similarity options for Algorithm\n",
"\n",
"The output of the default models above shows that the similarity measure used was MSD (mean squared difference). To use cosine similarity instead, pass a `sim_options` dictionary as shown in the following cell.\n",
"Surprise computes similarities between users by default (`user_based` is True). To use item-based collaborative filtering instead of user-based collaborative filtering, set the `user_based` option to False."
]
},
{
"cell_type": "code",
"execution_count": 464,
"metadata": {
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Computing the cosine similarity matrix...\n",
"Done computing similarity matrix.\n",
"RMSE: 2.2057\n",
"Computing the cosine similarity matrix...\n",
"Done computing similarity matrix.\n",
"RMSE: 0.9548\n",
"Computing the cosine similarity matrix...\n",
"Done computing similarity matrix.\n",
"RMSE: 2.1218\n",
"Computing the cosine similarity matrix...\n",
"Done computing similarity matrix.\n",
"RMSE: 2.4499\n",
"Computing the cosine similarity matrix...\n",
"Done computing similarity matrix.\n",
"RMSE: 1.7273\n"
]
}
],
"source": [
"sim_options = {'name': 'cosine', 'user_based': False}\n",
"kf = KFold(n_splits=5)\n",
"best_algo = None\n",
"best_rmse = 1000.0\n",
"best_pred = None\n",
"for trainset, testset in kf.split(data):\n",
"    # Use a fresh model per fold so best_algo keeps the best fold's fit.\n",
"    algo = KNNWithMeans(k=3, sim_options=sim_options)\n",
"    algo.fit(trainset)\n",
"    predictions = algo.test(testset)\n",
"    # Compute and print the root mean squared error for this fold.\n",
"    rmse = accuracy.rmse(predictions, verbose=True)\n",
"    if rmse < best_rmse:\n",
"        best_rmse = rmse\n",
"        best_algo = algo\n",
"        best_pred = predictions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Analysis of Predictions "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's see how our best model has fared on the existing ratings in the test data."
]
},
{
"cell_type": "code",
"execution_count": 465,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>uid</th>\n",
" <th>iid</th>\n",
" <th>userName</th>\n",
" <th>userId</th>\n",
" <th>movieName</th>\n",
" <th>movieId</th>\n",
" <th>est</th>\n",
" <th>rating</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>5</td>\n",
" <td>Alice</td>\n",
" <td>1</td>\n",
" <td>12 Angry Men</td>\n",
" <td>5</td>\n",
" <td>3.000000</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>5</td>\n",
" <td>6</td>\n",
" <td>Earl</td>\n",
" <td>5</td>\n",
" <td>Schindler's List</td>\n",
" <td>6</td>\n",
" <td>2.861111</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" <td>Bob</td>\n",
" <td>2</td>\n",
" <td>The Godfather: Part II</td>\n",
" <td>3</td>\n",
" <td>3.171382</td>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>5</td>\n",
" <td>8</td>\n",
" <td>Earl</td>\n",
" <td>5</td>\n",
" <td>Pulp Fiction</td>\n",
" <td>8</td>\n",
" <td>2.416667</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Alice</td>\n",
" <td>1</td>\n",
" <td>The Shawshank Redemption</td>\n",
" <td>1</td>\n",
" <td>3.305556</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>Alice</td>\n",
" <td>1</td>\n",
" <td>The Godfather</td>\n",
" <td>2</td>\n",
" <td>3.750000</td>\n",
" <td>3</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" uid iid userName userId movieName movieId est \\\n",
"0 1 5 Alice 1 12 Angry Men 5 3.000000 \n",
"1 5 6 Earl 5 Schindler's List 6 2.861111 \n",
"2 2 3 Bob 2 The Godfather: Part II 3 3.171382 \n",
"3 5 8 Earl 5 Pulp Fiction 8 2.416667 \n",
"4 1 1 Alice 1 The Shawshank Redemption   1 3.305556 \n",
"5 1 2 Alice 1 The Godfather 2 3.750000 \n",
"\n",
" rating \n",
"0 3 \n",
"1 2 \n",
"2 5 \n",
"3 3 \n",
"4 4 \n",
"5 3 "
]
},
"execution_count": 465,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pred_df = pd.DataFrame(best_pred).merge(ratings_ds , left_on = ['uid', 'iid'], right_on = ['userId', 'movieId'])\n",
"pred_df[['uid', 'iid', 'userName', 'userId', 'movieName', 'movieId', 'est','rating']]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The estimate for The Godfather: Part II for Bob is well off (about 3.17 against an actual rating of 5), but for the other movies the estimates are fairly close.\n",
"Now let's try to predict ratings for movies that users have not yet rated.\n",
"Remember that we created an antiset for such pairs. Let's apply our best model to that set."
]
},
{
"cell_type": "code",
"execution_count": 466,
"metadata": {},
"outputs": [],
"source": [
"anti_pre = best_algo.test(anti_set)\n",
"pred_df = pd.DataFrame(anti_pre).merge(movies , left_on = ['iid'], right_on = ['movieId'])\n",
"pred_df = pd.DataFrame(pred_df).merge(users , left_on = ['uid'], right_on = ['userId'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following are the predicted ratings for each user."
]
},
{
"cell_type": "code",
"execution_count": 467,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>uid</th>\n",
" <th>iid</th>\n",
" <th>r_ui</th>\n",
" <th>est</th>\n",
" <th>details</th>\n",
" <th>movieId</th>\n",
" <th>movieName</th>\n",
" <th>userId</th>\n",
" <th>userName</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>2.9</td>\n",
" <td>3.311469</td>\n",
" <td>{'actual_k': 3, 'was_impossible': False}</td>\n",
" <td>1</td>\n",
" <td>The Shawshank Redemption</td>\n",
" <td>2</td>\n",
" <td>Bob</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>2.9</td>\n",
" <td>4.991123</td>\n",
" <td>{'actual_k': 3, 'was_impossible': False}</td>\n",
" <td>2</td>\n",
" <td>The Godfather</td>\n",
" <td>2</td>\n",
" <td>Bob</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2</td>\n",
" <td>6</td>\n",
" <td>2.9</td>\n",
" <td>2.944444</td>\n",
" <td>{'actual_k': 3, 'was_impossible': False}</td>\n",
" <td>6</td>\n",
" <td>Schindler's List</td>\n",
" <td>2</td>\n",
" <td>Bob</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2</td>\n",
" <td>7</td>\n",
" <td>2.9</td>\n",
" <td>3.000000</td>\n",
" <td>{'actual_k': 3, 'was_impossible': False}</td>\n",
" <td>7</td>\n",
" <td>The Lord of the Rings: The Return of the King</td>\n",
" <td>2</td>\n",
" <td>Bob</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2</td>\n",
" <td>8</td>\n",
" <td>2.9</td>\n",
" <td>2.686264</td>\n",
" <td>{'actual_k': 3, 'was_impossible': False}</td>\n",
" <td>8</td>\n",
" <td>Pulp Fiction</td>\n",
" <td>2</td>\n",
" <td>Bob</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>2.9</td>\n",
" <td>3.529850</td>\n",
" <td>{'actual_k': 3, 'was_impossible': False}</td>\n",
" <td>1</td>\n",
" <td>The Shawshank Redemption</td>\n",
" <td>4</td>\n",
" <td>Deb</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>4</td>\n",
" <td>6</td>\n",
" <td>2.9</td>\n",
" <td>2.277778</td>\n",
" <td>{'actual_k': 3, 'was_impossible': False}</td>\n",
" <td>6</td>\n",
" <td>Schindler's List</td>\n",
" <td>4</td>\n",
" <td>Deb</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>4</td>\n",
" <td>7</td>\n",
" <td>2.9</td>\n",
" <td>1.777778</td>\n",
" <td>{'actual_k': 3, 'was_impossible': False}</td>\n",
" <td>7</td>\n",
" <td>The Lord of the Rings: The Return of the King</td>\n",
" <td>4</td>\n",
" <td>Deb</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>4</td>\n",
" <td>8</td>\n",
" <td>2.9</td>\n",
" <td>2.216539</td>\n",
" <td>{'actual_k': 3, 'was_impossible': False}</td>\n",
" <td>8</td>\n",
" <td>Pulp Fiction</td>\n",
" <td>4</td>\n",
" <td>Deb</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>4</td>\n",
" <td>10</td>\n",
" <td>2.9</td>\n",
" <td>3.083333</td>\n",
" <td>{'was_impossible': True, 'reason': 'User and/o...</td>\n",
" <td>10</td>\n",
" <td>Fight Club</td>\n",
" <td>4</td>\n",
" <td>Deb</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>4</td>\n",
" <td>9</td>\n",
" <td>2.9</td>\n",
" <td>2.178106</td>\n",
" <td>{'actual_k': 2, 'was_impossible': False}</td>\n",
" <td>9</td>\n",
" <td>The Good, the Bad and the Ugly</td>\n",
" <td>4</td>\n",
" <td>Deb</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>5</td>\n",
" <td>2</td>\n",
" <td>2.9</td>\n",
" <td>3.611111</td>\n",
" <td>{'actual_k': 3, 'was_impossible': False}</td>\n",
" <td>2</td>\n",
" <td>The Godfather</td>\n",
" <td>5</td>\n",
" <td>Earl</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>5</td>\n",
" <td>7</td>\n",
" <td>2.9</td>\n",
" <td>1.611111</td>\n",
" <td>{'actual_k': 3, 'was_impossible': False}</td>\n",
" <td>7</td>\n",
" <td>The Lord of the Rings: The Return of the King</td>\n",
" <td>5</td>\n",
" <td>Earl</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>5</td>\n",
" <td>5</td>\n",
" <td>2.9</td>\n",
" <td>4.277778</td>\n",
" <td>{'actual_k': 3, 'was_impossible': False}</td>\n",
" <td>5</td>\n",
" <td>12 Angry Men</td>\n",
" <td>5</td>\n",
" <td>Earl</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>5</td>\n",
" <td>3</td>\n",
" <td>2.9</td>\n",
" <td>3.368083</td>\n",
" <td>{'actual_k': 3, 'was_impossible': False}</td>\n",
" <td>3</td>\n",
" <td>The Godfather: Part II</td>\n",
" <td>5</td>\n",
" <td>Earl</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td>3</td>\n",
" <td>6</td>\n",
" <td>2.9</td>\n",
" <td>2.424315</td>\n",
" <td>{'actual_k': 3, 'was_impossible': False}</td>\n",
" <td>6</td>\n",
" <td>Schindler's List</td>\n",
" <td>3</td>\n",
" <td>Carl</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>3</td>\n",
" <td>7</td>\n",
" <td>2.9</td>\n",
" <td>1.777778</td>\n",
" <td>{'actual_k': 3, 'was_impossible': False}</td>\n",
" <td>7</td>\n",
" <td>The Lord of the Rings: The Return of the King</td>\n",
" <td>3</td>\n",
" <td>Carl</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17</th>\n",
" <td>3</td>\n",
" <td>4</td>\n",
" <td>2.9</td>\n",
" <td>3.114020</td>\n",
" <td>{'actual_k': 3, 'was_impossible': False}</td>\n",
" <td>4</td>\n",
" <td>The Dark Knight</td>\n",
" <td>3</td>\n",
" <td>Carl</td>\n",
" </tr>\n",
" <tr>\n",
" <th>18</th>\n",
" <td>3</td>\n",
" <td>5</td>\n",
" <td>2.9</td>\n",
" <td>2.777778</td>\n",
" <td>{'actual_k': 3, 'was_impossible': False}</td>\n",
" <td>5</td>\n",
" <td>12 Angry Men</td>\n",
" <td>3</td>\n",
" <td>Carl</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19</th>\n",
" <td>3</td>\n",
" <td>10</td>\n",
" <td>2.9</td>\n",
" <td>3.083333</td>\n",
" <td>{'was_impossible': True, 'reason': 'User and/o...</td>\n",
" <td>10</td>\n",
" <td>Fight Club</td>\n",
" <td>3</td>\n",
" <td>Carl</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" uid iid r_ui est \\\n",
"0 2 1 2.9 3.311469 \n",
"1 2 2 2.9 4.991123 \n",
"2 2 6 2.9 2.944444 \n",
"3 2 7 2.9 3.000000 \n",
"4 2 8 2.9 2.686264 \n",
"5 4 1 2.9 3.529850 \n",
"6 4 6 2.9 2.277778 \n",
"7 4 7 2.9 1.777778 \n",
"8 4 8 2.9 2.216539 \n",
"9 4 10 2.9 3.083333 \n",
"10 4 9 2.9 2.178106 \n",
"11 5 2 2.9 3.611111 \n",
"12 5 7 2.9 1.611111 \n",
"13 5 5 2.9 4.277778 \n",
"14 5 3 2.9 3.368083 \n",
"15 3 6 2.9 2.424315 \n",
"16 3 7 2.9 1.777778 \n",
"17 3 4 2.9 3.114020 \n",
"18 3 5 2.9 2.777778 \n",
"19 3 10 2.9 3.083333 \n",
"\n",
" details movieId \\\n",
"0 {'actual_k': 3, 'was_impossible': False} 1 \n",
"1 {'actual_k': 3, 'was_impossible': False} 2 \n",
"2 {'actual_k': 3, 'was_impossible': False} 6 \n",
"3 {'actual_k': 3, 'was_impossible': False} 7 \n",
"4 {'actual_k': 3, 'was_impossible': False} 8 \n",
"5 {'actual_k': 3, 'was_impossible': False} 1 \n",
"6 {'actual_k': 3, 'was_impossible': False} 6 \n",
"7 {'actual_k': 3, 'was_impossible': False} 7 \n",
"8 {'actual_k': 3, 'was_impossible': False} 8 \n",
"9 {'was_impossible': True, 'reason': 'User and/o... 10 \n",
"10 {'actual_k': 2, 'was_impossible': False} 9 \n",
"11 {'actual_k': 3, 'was_impossible': False} 2 \n",
"12 {'actual_k': 3, 'was_impossible': False} 7 \n",
"13 {'actual_k': 3, 'was_impossible': False} 5 \n",
"14 {'actual_k': 3, 'was_impossible': False} 3 \n",
"15 {'actual_k': 3, 'was_impossible': False} 6 \n",
"16 {'actual_k': 3, 'was_impossible': False} 7 \n",
"17 {'actual_k': 3, 'was_impossible': False} 4 \n",
"18 {'actual_k': 3, 'was_impossible': False} 5 \n",
"19 {'was_impossible': True, 'reason': 'User and/o... 10 \n",
"\n",
" movieName userId userName \n",
"0 The Shawshank Redemption   2 Bob \n",
"1 The Godfather 2 Bob \n",
"2 Schindler's List 2 Bob \n",
"3 The Lord of the Rings: The Return of the King 2 Bob \n",
"4 Pulp Fiction 2 Bob \n",
"5 The Shawshank Redemption   4 Deb \n",
"6 Schindler's List 4 Deb \n",
"7 The Lord of the Rings: The Return of the King 4 Deb \n",
"8 Pulp Fiction 4 Deb \n",
"9 Fight Club 4 Deb \n",
"10 The Good, the Bad and the Ugly  4 Deb \n",
"11 The Godfather 5 Earl \n",
"12 The Lord of the Rings: The Return of the King 5 Earl \n",
"13 12 Angry Men 5 Earl \n",
"14 The Godfather: Part II 5 Earl \n",
"15 Schindler's List 3 Carl \n",
"16 The Lord of the Rings: The Return of the King 3 Carl \n",
"17 The Dark Knight  3 Carl \n",
"18 12 Angry Men 3 Carl \n",
"19 Fight Club 3 Carl "
]
},
"execution_count": 467,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pred_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Finding recommendations for a user\n",
"\n",
"Based on the predictions above, we can decide to recommend a movie to a user if its estimated rating is more than 3. With that rule, the following are the recommendations for user 2 (Bob).\n"
]
},
{
"cell_type": "code",
"execution_count": 468,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>uid</th>\n",
" <th>iid</th>\n",
" <th>r_ui</th>\n",
" <th>est</th>\n",
" <th>details</th>\n",
" <th>movieId</th>\n",
" <th>movieName</th>\n",
" <th>userId</th>\n",
" <th>userName</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>2.9</td>\n",
" <td>3.311469</td>\n",
" <td>{'actual_k': 3, 'was_impossible': False}</td>\n",
" <td>1</td>\n",
" <td>The Shawshank Redemption</td>\n",
" <td>2</td>\n",
" <td>Bob</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>2.9</td>\n",
" <td>4.991123</td>\n",
" <td>{'actual_k': 3, 'was_impossible': False}</td>\n",
" <td>2</td>\n",
" <td>The Godfather</td>\n",
" <td>2</td>\n",
" <td>Bob</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" uid iid r_ui est details \\\n",
"0 2 1 2.9 3.311469 {'actual_k': 3, 'was_impossible': False} \n",
"1 2 2 2.9 4.991123 {'actual_k': 3, 'was_impossible': False} \n",
"\n",
" movieId movieName userId userName \n",
"0 1 The Shawshank Redemption   2 Bob \n",
"1 2 The Godfather 2 Bob "
]
},
"execution_count": 468,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pred_df[(pred_df['est']>3.0)&(pred_df['userId']==2)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Recommendations for Deb"
]
},
{
"cell_type": "code",
"execution_count": 469,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>uid</th>\n",
" <th>iid</th>\n",
" <th>r_ui</th>\n",
" <th>est</th>\n",
" <th>details</th>\n",
" <th>movieId</th>\n",
" <th>movieName</th>\n",
" <th>userId</th>\n",
" <th>userName</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>2.9</td>\n",
" <td>3.529850</td>\n",
" <td>{'actual_k': 3, 'was_impossible': False}</td>\n",
" <td>1</td>\n",
" <td>The Shawshank Redemption</td>\n",
" <td>4</td>\n",
" <td>Deb</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>4</td>\n",
" <td>10</td>\n",
" <td>2.9</td>\n",
" <td>3.083333</td>\n",
" <td>{'was_impossible': True, 'reason': 'User and/o...</td>\n",
" <td>10</td>\n",
" <td>Fight Club</td>\n",
" <td>4</td>\n",
" <td>Deb</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" uid iid r_ui est \\\n",
"5 4 1 2.9 3.529850 \n",
"9 4 10 2.9 3.083333 \n",
"\n",
" details movieId \\\n",
"5 {'actual_k': 3, 'was_impossible': False} 1 \n",
"9 {'was_impossible': True, 'reason': 'User and/o... 10 \n",
"\n",
" movieName userId userName \n",
"5 The Shawshank Redemption   4 Deb \n",
"9 Fight Club 4 Deb "
]
},
"execution_count": 469,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pred_df[(pred_df['est']>3.0)&(pred_df['userId']==4)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Recommendations for Earl"
]
},
{
"cell_type": "code",
"execution_count": 470,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>uid</th>\n",
" <th>iid</th>\n",
" <th>r_ui</th>\n",
" <th>est</th>\n",
" <th>details</th>\n",
" <th>movieId</th>\n",
" <th>movieName</th>\n",
" <th>userId</th>\n",
" <th>userName</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>5</td>\n",
" <td>2</td>\n",
" <td>2.9</td>\n",
" <td>3.611111</td>\n",
" <td>{'actual_k': 3, 'was_impossible': False}</td>\n",
" <td>2</td>\n",
" <td>The Godfather</td>\n",
" <td>5</td>\n",
" <td>Earl</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>5</td>\n",
" <td>5</td>\n",
" <td>2.9</td>\n",
" <td>4.277778</td>\n",
" <td>{'actual_k': 3, 'was_impossible': False}</td>\n",
" <td>5</td>\n",
" <td>12 Angry Men</td>\n",
" <td>5</td>\n",
" <td>Earl</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>5</td>\n",
" <td>3</td>\n",
" <td>2.9</td>\n",
" <td>3.368083</td>\n",
" <td>{'actual_k': 3, 'was_impossible': False}</td>\n",
" <td>3</td>\n",
" <td>The Godfather: Part II</td>\n",
" <td>5</td>\n",
" <td>Earl</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" uid iid r_ui est details \\\n",
"11 5 2 2.9 3.611111 {'actual_k': 3, 'was_impossible': False} \n",
"13 5 5 2.9 4.277778 {'actual_k': 3, 'was_impossible': False} \n",
"14 5 3 2.9 3.368083 {'actual_k': 3, 'was_impossible': False} \n",
"\n",
" movieId movieName userId userName \n",
"11 2 The Godfather 5 Earl \n",
"13 5 12 Angry Men 5 Earl \n",
"14 3 The Godfather: Part II 5 Earl "
]
},
"execution_count": 470,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pred_df[(pred_df['est']>3.0)&(pred_df['userId']==5)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Finding nearest neighbours of an item\n",
"\n",
"One last thing I want to check is the neighbourhood of a particular item, which shows which movies the model considers similar to it. In the following example we'll find the movies that are closest to movieId 1 (The Shawshank Redemption), based on the training set of our best model. Note that in Surprise, get_neighbors returns item neighbours only when the model is item-based (user_based set to False in sim_options); otherwise the inner id is interpreted as a user."
]
},
{
"cell_type": "code",
"execution_count": 471,
"metadata": {
"scrolled": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>movieId</th>\n",
" <th>movieName</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>The Godfather</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>12 Angry Men</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>6</td>\n",
" <td>Schindler's List</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>7</td>\n",
" <td>The Lord of the Rings: The Return of the King</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" movieId movieName\n",
"1 2 The Godfather\n",
"4 5 12 Angry Men\n",
"5 6 Schindler's List\n",
"6 7 The Lord of the Rings: The Return of the King"
]
},
"execution_count": 471,
"metadata": {},
"output_type": "execute_result"
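}
],
"source": [
"# Map the raw movieId 1 to the inner id used by the trained model\n",
"tsr_inner_id = best_algo.trainset.to_inner_iid(1)\n",
"# Get the inner ids of the k nearest neighbours\n",
"tsr_neighbors = best_algo.get_neighbors(tsr_inner_id, k=2)\n",
"# Convert the inner ids back to raw movieIds (using the same trainset) and look up the titles\n",
"movies[movies.movieId.isin([best_algo.trainset.to_raw_iid(inner_id)\n",
" for inner_id in tsr_neighbors])]"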
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Above are the nearest neighbours of movieId 1 (The Shawshank Redemption) as per our model."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## References \n",
"https://www.youtube.com/watch?v=6BTLobS7AU8\n",
"\n",
"\n",
"https://surprise.readthedocs.io/en/stable/"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}