Created on Cognitive Class Labs
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a href=\"https://www.bigdatauniversity.com\"><img src=\"https://ibm.box.com/shared/static/qo20b88v1hbjztubt06609ovs85q8fau.png\" width=\"400px\" align=\"center\"></a>\n",
"\n",
"<h1 align=\"center\"><font size=\"5\">RECOMMENDATION SYSTEM WITH A RESTRICTED BOLTZMANN MACHINE</font></h1>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Welcome to the <b>Recommendation System with a Restricted Boltzmann Machine</b> notebook. In this notebook, we walk through using a Restricted Boltzmann Machine (RBM) in a collaborative-filtering recommendation system. Collaborative filtering recommends items by finding users who are similar to each other based on their item ratings. By the end of this notebook, you should have a deeper understanding of how Restricted Boltzmann Machines are applied, and how to build one using TensorFlow."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2>Table of Contents</h2>\n",
"\n",
"<ol>\n",
" <li><a href=\"#ref1\">Acquiring the Data</a></li>\n",
" <li><a href=\"#ref2\">Loading in the Data</a></li>\n",
" <li><a href=\"#ref3\">The Restricted Boltzmann Machine model</a></li>\n",
" <li><a href=\"#ref4\">Setting the Model's Parameters</a></li>\n",
" <li><a href=\"#ref5\">Recommendation</a></li>\n",
"</ol>\n",
"<br>\n",
"<br>\n",
"<hr>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"ref1\"></a>\n",
"<h2>Acquiring the Data</h2>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To start, we need to download the data we are going to use for our system. The datasets we are going to use were acquired by <a href=\"http://grouplens.org/datasets/movielens/\">GroupLens</a> and contain movies, users and movie ratings by these users.\n",
"\n",
"After downloading the data, we will extract the datasets to a directory that is easily accessible."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"--2019-12-29 15:17:53-- http://files.grouplens.org/datasets/movielens/ml-1m.zip\n",
"Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152\n",
"Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:80... connected.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 5917549 (5.6M) [application/zip]\n",
"Saving to: ‘./data/moviedataset.zip’\n",
"\n",
"./data/moviedataset 100%[===================>] 5.64M 17.9MB/s in 0.3s \n",
"\n",
"2019-12-29 15:17:53 (17.9 MB/s) - ‘./data/moviedataset.zip’ saved [5917549/5917549]\n",
"\n",
"Archive: ./data/moviedataset.zip\n",
" creating: ./data/ml-1m/\n",
" inflating: ./data/ml-1m/movies.dat \n",
" inflating: ./data/ml-1m/ratings.dat \n",
" inflating: ./data/ml-1m/README \n",
" inflating: ./data/ml-1m/users.dat \n"
]
}
],
"source": [
"!mkdir -p data\n",
"!wget -O ./data/moviedataset.zip http://files.grouplens.org/datasets/movielens/ml-1m.zip\n",
"!unzip -o ./data/moviedataset.zip -d ./data"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"FuelConsumption.csv\t\t\t\t\t MNIST_data\n",
"ML0120EN-1.1-Review-TensorFlow-Hello-World.ipynb\t __pycache__\n",
"ML0120EN-1.2-Review-LinearRegressionwithTensorFlow.ipynb bird.jpg\n",
"ML0120EN-1.4-Review-LogisticRegressionwithTensorFlow.ipynb data\n",
"ML0120EN-2.1-Review-Understanding_Convolutions.ipynb\t destructed3.jpg\n",
"ML0120EN-2.2-Review-CNN-MNIST-Dataset.ipynb\t\t num3.jpg\n",
"ML0120EN-3.1-Reveiw-LSTM-basics.ipynb\t\t\t summary_logs\n",
"ML0120EN-3.2-Review-LSTM-LanguageModelling.ipynb\t utils.py\n",
"ML0120EN-4.1-Review-RBMMNIST.ipynb\t\t\t utils1.py\n",
"ML0120EN-4.2-Review-CollaborativeFilteringwithRBM.ipynb\n"
]
}
],
"source": [
"!ls"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With the datasets in place, let's now import the necessary libraries. We will be using <a href=\"https://www.tensorflow.org/\">TensorFlow</a> and <a href=\"http://www.numpy.org/\">NumPy</a> together to model and initialize our Restricted Boltzmann Machine, and <a href=\"http://pandas.pydata.org/pandas-docs/stable/\">Pandas</a> to manipulate our datasets. The code below uses the TensorFlow 1.x graph API (<code>tf.placeholder</code>, <code>tf.Session</code>), so it expects a TensorFlow 1.x installation. To import these libraries, run the code cell below."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"#Tensorflow library. Used to implement machine learning models\n",
"import tensorflow as tf\n",
"#Numpy contains helpful functions for efficient mathematical calculations\n",
"import numpy as np\n",
"#Dataframe manipulation library\n",
"import pandas as pd\n",
"#Graph plotting library\n",
"import matplotlib.pyplot as plt\n",
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<hr>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"ref2\"></a>\n",
"<h2>Loading in the Data</h2>\n",
"\n",
"Let's begin by loading in our data with Pandas. The .dat files containing our data are similar to CSV files, but instead of separating entries with a ',' (comma), they use '::' (two colons). To let Pandas know that it should split fields at every '::', we specify the <code>sep='::'</code> parameter when calling <code>pd.read_csv</code>. Because the separator is longer than one character, we also pass <code>engine='python'</code>, since the default C parser does not handle it.\n",
"\n",
"We also pass <code>header=None</code> because our files don't contain a header row.\n",
"\n",
"Let's start with the movies.dat file and take a look at its structure:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>Toy Story (1995)</td>\n",
" <td>Animation|Children's|Comedy</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>Jumanji (1995)</td>\n",
" <td>Adventure|Children's|Fantasy</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>Grumpier Old Men (1995)</td>\n",
" <td>Comedy|Romance</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>Waiting to Exhale (1995)</td>\n",
" <td>Comedy|Drama</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>Father of the Bride Part II (1995)</td>\n",
" <td>Comedy</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 2\n",
"0 1 Toy Story (1995) Animation|Children's|Comedy\n",
"1 2 Jumanji (1995) Adventure|Children's|Fantasy\n",
"2 3 Grumpier Old Men (1995) Comedy|Romance\n",
"3 4 Waiting to Exhale (1995) Comedy|Drama\n",
"4 5 Father of the Bride Part II (1995) Comedy"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Loading in the movies dataset\n",
"movies_df = pd.read_csv('./data/ml-1m/movies.dat', sep='::', header=None, engine='python')\n",
"movies_df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can do the same for the ratings.dat file:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" <th>3</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>1193</td>\n",
" <td>5</td>\n",
" <td>978300760</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>661</td>\n",
" <td>3</td>\n",
" <td>978302109</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1</td>\n",
" <td>914</td>\n",
" <td>3</td>\n",
" <td>978301968</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1</td>\n",
" <td>3408</td>\n",
" <td>4</td>\n",
" <td>978300275</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1</td>\n",
" <td>2355</td>\n",
" <td>5</td>\n",
" <td>978824291</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 2 3\n",
"0 1 1193 5 978300760\n",
"1 1 661 3 978302109\n",
"2 1 914 3 978301968\n",
"3 1 3408 4 978300275\n",
"4 1 2355 5 978824291"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Loading in the ratings dataset\n",
"ratings_df = pd.read_csv('./data/ml-1m/ratings.dat', sep='::', header=None, engine='python')\n",
"ratings_df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So our <b>movies_df</b> variable contains a dataframe that stores each movie's unique ID number, title and genres, while our <b>ratings_df</b> variable stores a user's ID number, the ID of a movie that the user has rated, the rating the user gave that movie, and a timestamp of when the rating was made.\n",
"\n",
"Let's now rename the columns in these dataframes so that they describe their data more clearly:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>MovieID</th>\n",
" <th>Title</th>\n",
" <th>Genres</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>Toy Story (1995)</td>\n",
" <td>Animation|Children's|Comedy</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>Jumanji (1995)</td>\n",
" <td>Adventure|Children's|Fantasy</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>Grumpier Old Men (1995)</td>\n",
" <td>Comedy|Romance</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>Waiting to Exhale (1995)</td>\n",
" <td>Comedy|Drama</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>Father of the Bride Part II (1995)</td>\n",
" <td>Comedy</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" MovieID Title Genres\n",
"0 1 Toy Story (1995) Animation|Children's|Comedy\n",
"1 2 Jumanji (1995) Adventure|Children's|Fantasy\n",
"2 3 Grumpier Old Men (1995) Comedy|Romance\n",
"3 4 Waiting to Exhale (1995) Comedy|Drama\n",
"4 5 Father of the Bride Part II (1995) Comedy"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"movies_df.columns = ['MovieID', 'Title', 'Genres']\n",
"movies_df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And our final ratings_df:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>UserID</th>\n",
" <th>MovieID</th>\n",
" <th>Rating</th>\n",
" <th>Timestamp</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>1193</td>\n",
" <td>5</td>\n",
" <td>978300760</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>661</td>\n",
" <td>3</td>\n",
" <td>978302109</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1</td>\n",
" <td>914</td>\n",
" <td>3</td>\n",
" <td>978301968</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1</td>\n",
" <td>3408</td>\n",
" <td>4</td>\n",
" <td>978300275</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1</td>\n",
" <td>2355</td>\n",
" <td>5</td>\n",
" <td>978824291</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" UserID MovieID Rating Timestamp\n",
"0 1 1193 5 978300760\n",
"1 1 661 3 978302109\n",
"2 1 914 3 978301968\n",
"3 1 3408 4 978300275\n",
"4 1 2355 5 978824291"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ratings_df.columns = ['UserID', 'MovieID', 'Rating', 'Timestamp']\n",
"ratings_df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<hr>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"ref3\"></a>\n",
"<h2>The Restricted Boltzmann Machine model</h2>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"https://ibm.box.com/shared/static/o049tx0dsllpbj3b546vuba25qqlzelq.png\" width=\"300\">\n",
"<br>\n",
"The Restricted Boltzmann Machine model has two layers of neurons: a visible (input) layer and a hidden layer. The hidden layer is used to learn features from the information fed through the input layer. For our model, the input is going to contain X neurons, where X is the number of movies in our dataset. Each of these neurons holds a normalized rating value between 0 and 1, where 0 means that the user has not rated that movie, and the closer the value is to 1, the more the user likes the movie that the neuron represents. These normalized values are computed from the ratings dataset.\n",
"\n",
"After passing in the input, we train the RBM on it and have the hidden layer learn its features. These features are what we use to reconstruct the input, which in our case predicts ratings for movies that the user hasn't watched. Those predicted ratings are exactly what we can use to recommend movies!\n",
"\n",
"We will now begin to format our dataset to follow the model's expected input."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3>Formatting the Data</h3>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, let's see how many movies we have and check whether the movie IDs correspond to that value:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3883"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(movies_df)"
]
},
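{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can start formatting the data into input for the RBM. We first pivot the ratings into a user-by-movie matrix, with one row per user and one column per rated movie. Note that the result below has 3706 columns rather than 3883, because only movies with at least one rating appear, and the column labels run up to 3952, so movie IDs are not a contiguous range. We will then normalize the values and store them in a matrix called trX."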
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th>MovieID</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" <th>3</th>\n",
" <th>4</th>\n",
" <th>5</th>\n",
" <th>6</th>\n",
" <th>7</th>\n",
" <th>8</th>\n",
" <th>9</th>\n",
" <th>10</th>\n",
" <th>...</th>\n",
" <th>3943</th>\n",
" <th>3944</th>\n",
" <th>3945</th>\n",
" <th>3946</th>\n",
" <th>3947</th>\n",
" <th>3948</th>\n",
" <th>3949</th>\n",
" <th>3950</th>\n",
" <th>3951</th>\n",
" <th>3952</th>\n",
" </tr>\n",
" <tr>\n",
" <th>UserID</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>5.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>2.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 3706 columns</p>\n",
"</div>"
],
"text/plain": [
"MovieID 1 2 3 4 5 6 7 8 9 10 ... \\\n",
"UserID ... \n",
"1 5.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... \n",
"2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... \n",
"3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... \n",
"4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... \n",
"5 NaN NaN NaN NaN NaN 2.0 NaN NaN NaN NaN ... \n",
"\n",
"MovieID 3943 3944 3945 3946 3947 3948 3949 3950 3951 3952 \n",
"UserID \n",
"1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN \n",
"2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN \n",
"3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN \n",
"4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN \n",
"5 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN \n",
"\n",
"[5 rows x 3706 columns]"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"user_rating_df = ratings_df.pivot(index='UserID', columns='MovieID', values='Rating')\n",
"user_rating_df.head()"
]
},
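{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's normalize it now. Movies a user has not rated are filled with 0, and ratings are divided by 5 so that they fall between 0 and 1:"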
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[1., 0., 0., ..., 0., 0., 0.],\n",
" [0., 0., 0., ..., 0., 0., 0.],\n",
" [0., 0., 0., ..., 0., 0., 0.],\n",
" [0., 0., 0., ..., 0., 0., 0.],\n",
" [0., 0., 0., ..., 0., 0., 0.]])"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"norm_user_rating_df = user_rating_df.fillna(0) / 5.0\n",
"trX = norm_user_rating_df.values\n",
"trX[0:5]"
]
},
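{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the pivot-and-normalize step concrete, here is a minimal sketch on a tiny, made-up ratings table. The values in <code>toy_ratings</code> and the names <code>toy_matrix</code> and <code>toy_trX</code> are illustrative assumptions, not part of the MovieLens data. It shows how unrated movies become 0 and how a rating of 5 becomes 1.0."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical toy ratings: two users, three distinct movies\n",
"toy_ratings = pd.DataFrame({\n",
"    'UserID': [1, 1, 2],\n",
"    'MovieID': [10, 30, 20],\n",
"    'Rating': [5, 3, 4],\n",
"})\n",
"\n",
"# One row per user, one column per rated movie; missing pairs become NaN\n",
"toy_matrix = toy_ratings.pivot(index='UserID', columns='MovieID', values='Rating')\n",
"\n",
"# Fill unrated entries with 0 and scale the 1-5 ratings so they fall between 0 and 1\n",
"toy_trX = (toy_matrix.fillna(0) / 5.0).values\n",
"print(toy_trX)"
]
},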
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<hr>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"ref4\"></a>\n",
"<h2>Setting the Model's Parameters</h2>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, let's start building our RBM with TensorFlow. We'll begin by choosing the number of neurons in the hidden layer and then creating placeholders for the visible layer biases, the hidden layer biases and the weights that connect the hidden layer with the visible layer. We arbitrarily set the number of hidden neurons to 20. You can freely change this value; each neuron in the hidden layer will end up learning a feature."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"hiddenUnits = 20\n",
"visibleUnits = len(user_rating_df.columns)\n",
"vb = tf.placeholder(\"float\", [visibleUnits]) #Number of unique movies\n",
"hb = tf.placeholder(\"float\", [hiddenUnits]) #Number of features we're going to learn\n",
"W = tf.placeholder(\"float\", [visibleUnits, hiddenUnits])"
]
},
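{
"cell_type": "markdown",
"metadata": {},
"source": [
"We then move on to creating the visible and hidden layer units and setting their activation functions. We use <code>tf.nn.sigmoid</code> as the nonlinear activation, since it is commonly used in RBMs; it gives the probability that each unit is on. The <code>tf.nn.relu(tf.sign(...))</code> expression then draws a binary sample from those probabilities: a unit is set to 1 when its probability exceeds a uniform random number, and to 0 otherwise."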
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"#Phase 1: Input Processing\n",
"v0 = tf.placeholder(\"float\", [None, visibleUnits])\n",
"_h0 = tf.nn.sigmoid(tf.matmul(v0, W) + hb)\n",
"h0 = tf.nn.relu(tf.sign(_h0 - tf.random_uniform(tf.shape(_h0))))\n",
"#Phase 2: Reconstruction\n",
"_v1 = tf.nn.sigmoid(tf.matmul(h0, tf.transpose(W)) + vb) \n",
"v1 = tf.nn.relu(tf.sign(_v1 - tf.random_uniform(tf.shape(_v1))))\n",
"h1 = tf.nn.sigmoid(tf.matmul(v1, W) + hb)"
]
},
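{
"cell_type": "markdown",
"metadata": {},
"source": [
"The sampling expression used above, <code>relu(sign(p - u))</code> with <code>u</code> drawn uniformly from [0, 1), can be checked in isolation. The following is a small NumPy sketch, not part of the original model; the probabilities in <code>probs</code> and the sample count are assumptions chosen for illustration. It suggests that the fraction of ones for each unit approaches that unit's probability."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"rng = np.random.default_rng(0)\n",
"\n",
"# Hypothetical activation probabilities for 4 hidden units\n",
"probs = np.array([0.0, 0.2, 0.8, 1.0])\n",
"\n",
"# sign(p - u) is +1 when u < p and -1 when u > p;\n",
"# clipping negatives to 0 (what relu does) turns that into a 0/1 sample\n",
"uniform = rng.random((10000, probs.size))\n",
"samples = np.maximum(np.sign(probs - uniform), 0.0)\n",
"\n",
"# The fraction of ones per unit should be close to its probability\n",
"sample_means = samples.mean(axis=0)\n",
"print(sample_means)"
]
},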
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we set the RBM training parameters and functions."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"#Learning rate\n",
"alpha = 1.0\n",
"#Create the gradients\n",
"w_pos_grad = tf.matmul(tf.transpose(v0), h0)\n",
"w_neg_grad = tf.matmul(tf.transpose(v1), h1)\n",
"#Calculate the Contrastive Divergence to maximize\n",
"CD = (w_pos_grad - w_neg_grad) / tf.to_float(tf.shape(v0)[0])\n",
"#Create methods to update the weights and biases\n",
"update_w = W + alpha * CD\n",
"update_vb = vb + alpha * tf.reduce_mean(v0 - v1, 0)\n",
"update_hb = hb + alpha * tf.reduce_mean(h0 - h1, 0)"
]
},
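{
"cell_type": "markdown",
"metadata": {},
"source": [
"For readers who prefer to see the update rule without the TensorFlow graph, here is a rough NumPy sketch of a single contrastive divergence step that mirrors the expressions in the cell above. The helper names <code>sigmoid</code> and <code>cd1_step</code>, the random batch and its shape are all assumptions made for illustration; this is not the code the notebook trains with."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def sigmoid(x):\n",
"    return 1.0 / (1.0 + np.exp(-x))\n",
"\n",
"def cd1_step(v0, W, vb, hb, alpha, rng):\n",
"    # Positive phase: sample hidden units from the data\n",
"    h0_prob = sigmoid(v0 @ W + hb)\n",
"    h0 = (rng.random(h0_prob.shape) < h0_prob).astype(np.float64)\n",
"    # Negative phase: sample a reconstruction, then recompute hidden probabilities\n",
"    v1_prob = sigmoid(h0 @ W.T + vb)\n",
"    v1 = (rng.random(v1_prob.shape) < v1_prob).astype(np.float64)\n",
"    h1 = sigmoid(v1 @ W + hb)\n",
"    # Same update expressions as the TensorFlow cell, averaged over the batch\n",
"    n = v0.shape[0]\n",
"    W_new = W + alpha * (v0.T @ h0 - v1.T @ h1) / n\n",
"    vb_new = vb + alpha * (v0 - v1).mean(axis=0)\n",
"    hb_new = hb + alpha * (h0 - h1).mean(axis=0)\n",
"    return W_new, vb_new, hb_new\n",
"\n",
"rng = np.random.default_rng(0)\n",
"toy_batch = rng.random((8, 5))  # hypothetical batch: 8 users, 5 movies\n",
"toy_W, toy_vb, toy_hb = np.zeros((5, 3)), np.zeros(5), np.zeros(3)\n",
"toy_W, toy_vb, toy_hb = cd1_step(toy_batch, toy_W, toy_vb, toy_hb, alpha=1.0, rng=rng)\n",
"print(toy_W.shape, toy_vb.shape, toy_hb.shape)"
]
},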
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And set the error function, which in this case will be the Mean Squared Error between the input and its reconstruction."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"err = v0 - v1\n",
"err_sum = tf.reduce_mean(err * err)"
]
},
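{
"cell_type": "markdown",
"metadata": {},
"source": [
"The expression <code>tf.reduce_mean(err * err)</code> averages the squared differences between the input and its reconstruction. As a small sketch on made-up arrays (the values and the names <code>v0_example</code> and <code>v1_example</code> are assumptions), the NumPy equivalent looks like this, shown next to the mean absolute error for comparison:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Hypothetical input and reconstruction for one user and four movies\n",
"v0_example = np.array([1.0, 0.0, 0.6, 0.0])\n",
"v1_example = np.array([1.0, 0.0, 0.0, 1.0])\n",
"\n",
"diff = v0_example - v1_example\n",
"mse = np.mean(diff * diff)   # mean of squared differences, as in the cell above\n",
"mae = np.mean(np.abs(diff))  # mean absolute error, for comparison\n",
"print(mse, mae)"
]
},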
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We also have to initialize our variables. Thankfully, NumPy has a handy <code>zeros</code> function for this. We use it like so:"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
"#Current weight\n",
"cur_w = np.zeros([visibleUnits, hiddenUnits], np.float32)\n",
"#Current visible unit biases\n",
"cur_vb = np.zeros([visibleUnits], np.float32)\n",
"#Current hidden unit biases\n",
"cur_hb = np.zeros([hiddenUnits], np.float32)\n",
"#Previous weight\n",
"prv_w = np.zeros([visibleUnits, hiddenUnits], np.float32)\n",
"#Previous visible unit biases\n",
"prv_vb = np.zeros([visibleUnits], np.float32)\n",
"#Previous hidden unit biases\n",
"prv_hb = np.zeros([hiddenUnits], np.float32)\n",
"sess = tf.Session()\n",
"sess.run(tf.global_variables_initializer())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we train the RBM for 15 epochs, stepping through the users in batches of 100 in each epoch. After training, we plot the error by epoch."
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.059453025\n",
"0.05109139\n",
"0.04913294\n",
"0.047578882\n",
"0.04714242\n",
"0.046697494\n",
"0.046139736\n",
"0.045851342\n",
"0.045538355\n",
"0.045423053\n",
"0.045310635\n",
"0.045233835\n",
"0.04525504\n",
"0.04519263\n",
"0.0451991\n"
]
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"epochs = 15\n",
"batchsize = 100\n",
"errors = []\n",
"for i in range(epochs):\n",
"    for start, end in zip(range(0, len(trX), batchsize), range(batchsize, len(trX), batchsize)):\n",
"        batch = trX[start:end]\n",
"        cur_w = sess.run(update_w, feed_dict={v0: batch, W: prv_w, vb: prv_vb, hb: prv_hb})\n",
"        cur_vb = sess.run(update_vb, feed_dict={v0: batch, W: prv_w, vb: prv_vb, hb: prv_hb})\n",
"        cur_hb = sess.run(update_hb, feed_dict={v0: batch, W: prv_w, vb: prv_vb, hb: prv_hb})\n",
"        prv_w = cur_w\n",
"        prv_vb = cur_vb\n",
"        prv_hb = cur_hb\n",
"    #Record the reconstruction error over all users at the end of each epoch\n",
"    errors.append(sess.run(err_sum, feed_dict={v0: trX, W: cur_w, vb: cur_vb, hb: cur_hb}))\n",
"    print(errors[-1])\n",
"plt.plot(errors)\n",
"plt.ylabel('Error')\n",
"plt.xlabel('Epoch')\n",
"plt.show()"
]
},
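{
"cell_type": "markdown",
"metadata": {},
"source": [
"One detail of the training loop is worth seeing on its own: pairing <code>range(0, n, batchsize)</code> with <code>range(batchsize, n, batchsize)</code> stops one batch early, so the final chunk of rows is never visited. The sketch below uses hypothetical sizes (250 rows, batches of 100) and a hypothetical helper <code>batch_bounds</code> to show this, along with one possible variant that keeps the last, shorter batch."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def batch_bounds(n_rows, batchsize):\n",
"    # Same pairing of start and end indices as the training loop above\n",
"    return list(zip(range(0, n_rows, batchsize), range(batchsize, n_rows, batchsize)))\n",
"\n",
"# Hypothetical sizes: 250 rows in batches of 100\n",
"bounds = batch_bounds(250, 100)\n",
"print(bounds)\n",
"\n",
"# Variant that also keeps the final, shorter batch\n",
"full_bounds = [(s, min(s + 100, 250)) for s in range(0, 250, 100)]\n",
"print(full_bounds)"
]
},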
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<hr>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"ref5\"></a>\n",
"<h2>Recommendation</h2>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can now predict movies that an arbitrarily selected user might like. This can be accomplished by feeding the user's ratings for watched movies into the RBM and then reconstructing the input. The reconstructed values estimate the user's preferences for movies that they haven't watched, based on the preferences of the users that the RBM was trained on."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's first select the <b>User ID</b> of our mock user:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"mock_user_id = 215"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"jupyter": {
"outputs_hidden": true
}
},
"outputs": [],
"source": [
"#Selecting the input user\n",
"inputUser = trX[mock_user_id-1].reshape(1, -1)\n",
"inputUser[0:5]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"jupyter": {
"outputs_hidden": true
}
},
"outputs": [],
"source": [
"#Feeding in the user and reconstructing the input\n",
"hh0 = tf.nn.sigmoid(tf.matmul(v0, W) + hb)\n",
"vv1 = tf.nn.sigmoid(tf.matmul(hh0, tf.transpose(W)) + vb)\n",
"feed = sess.run(hh0, feed_dict={ v0: inputUser, W: prv_w, hb: prv_hb})\n",
"rec = sess.run(vv1, feed_dict={ hh0: feed, W: prv_w, vb: prv_vb})\n",
"print(rec)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can then list the 20 most recommended movies for our mock user by sorting the movies by the scores our model gives them."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"scored_movies_df_mock = movies_df[movies_df['MovieID'].isin(user_rating_df.columns)]\n",
"scored_movies_df_mock = scored_movies_df_mock.assign(RecommendationScore = rec[0])\n",
"scored_movies_df_mock.sort_values([\"RecommendationScore\"], ascending=False).head(20)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So, how do we recommend only the movies that the user has not watched yet?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, we can find all the movies that our mock user has watched before:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"movies_df_mock = ratings_df[ratings_df['UserID'] == mock_user_id]\n",
"movies_df_mock.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the next cell, we merge the movies that our mock user has watched with the predicted scores based on their historical data:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"jupyter": {
"outputs_hidden": true
}
},
"outputs": [],
"source": [
"#Merging movies_df with ratings_df by MovieID\n",
"merged_df_mock = scored_movies_df_mock.merge(movies_df_mock, on='MovieID', how='outer')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's sort it and take a look at the first 20 rows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"merged_df_mock.sort_values([\"RecommendationScore\"], ascending=False).head(20)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As you can see, there are some movies that the user has not watched yet (their <b>Rating</b> is NaN) but that have a high score from our model. So, we can recommend them to the user."
]
},
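{
"cell_type": "markdown",
"metadata": {},
"source": [
"To turn that observation into a list, one option is to keep only the rows without a rating before sorting. The sketch below runs on a small stand-in table rather than the real merge result; the frame <code>toy_merged</code> and its values are assumptions for illustration, with the same <code>RecommendationScore</code> and <code>Rating</code> column names as above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical stand-in for the merged table: Rating is NaN where the user has no rating\n",
"toy_merged = pd.DataFrame({\n",
"    'MovieID': [1, 2, 3, 4],\n",
"    'RecommendationScore': [0.9, 0.7, 0.4, 0.8],\n",
"    'Rating': [5.0, None, None, None],\n",
"})\n",
"\n",
"# Keep only unrated movies, then rank them by predicted score\n",
"unwatched = toy_merged[toy_merged['Rating'].isna()]\n",
"top_unwatched = unwatched.sort_values('RecommendationScore', ascending=False).head(2)\n",
"print(top_unwatched['MovieID'].tolist())"
]
},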
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is the end of the module. If you want, you can try to change the parameters in the code -- adding more units to the hidden layer, changing the loss functions or maybe something else to see if it changes anything. Does the model perform better? Does it take longer to compute?\n",
"\n",
"Thank you for reading this notebook. Hopefully, you now have a little more understanding of the RBM model, its applications and how it works with TensorFlow."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<hr>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Want to learn more?\n",
"\n",
"Running deep learning programs usually needs a high performance platform. __PowerAI__ speeds up deep learning and AI. Built on IBM’s Power Systems, __PowerAI__ is a scalable software platform that accelerates deep learning and AI with blazing performance for individual users or enterprises. The __PowerAI__ platform supports popular machine learning libraries and dependencies including TensorFlow, Caffe, Torch, and Theano. You can use [PowerAI on IBM Cloud](https://cocl.us/ML0120EN_PAI).\n",
"\n",
"Also, you can use __Watson Studio__ to run these notebooks faster with bigger datasets. __Watson Studio__ is IBM’s leading cloud solution for data scientists, built by data scientists. With Jupyter notebooks, RStudio, Apache Spark and popular libraries pre-packaged in the cloud, __Watson Studio__ enables data scientists to collaborate on their projects without having to install anything. Join the fast-growing community of __Watson Studio__ users today with a free account at [Watson Studio](https://cocl.us/ML0120EN_DSX). This is the end of this lesson. Thank you for reading this notebook, and good luck with your studies."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Thank you for completing this exercise!\n",
"\n",
"Notebook created by: <a href = \"https://ca.linkedin.com/in/saeedaghabozorgi\">Saeed Aghabozorgi</a>, Gabriel Garcez Barros Sousa"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<hr>\n",
"\n",
"Copyright &copy; 2018 [Cognitive Class](https://cocl.us/DX0108EN_CC). This notebook and its source code are released under the terms of the [MIT License](https://bigdatauniversity.com/mit-license/)."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python",
"language": "python",
"name": "conda-env-python-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.7"
},
"widgets": {
"state": {},
"version": "1.1.2"
}
},
"nbformat": 4,
"nbformat_minor": 4
}