@charanpald
Created January 26, 2016 13:34
Generate MovieLens recommendations using the SVD
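The script below assumes the MovieLens 100K files ua.base and ua.test live under data/ml-100k/; each is a tab-separated file of (user id, item id, rating, timestamp) rows, for example:

1	1	5	874965758
1	2	3	876893171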
# Run some recommendation experiments using MovieLens 100K
import pandas
import numpy
import scipy.sparse
import scipy.sparse.linalg
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error
data_dir = "data/ml-100k/"
data_shape = (943, 1682)
df = pandas.read_csv(data_dir + "ua.base", sep="\t", header=-1)
values = df.values
values[:, 0:2] -= 1
X_train = scipy.sparse.csr_matrix((values[:, 2], (values[:, 0], values[:, 1])), dtype=numpy.float64, shape=data_shape)
df = pandas.read_csv(data_dir + "ua.test", sep="\t", header=-1)
values = df.values
values[:, 0:2] -= 1
X_test = scipy.sparse.csr_matrix((values[:, 2], (values[:, 0], values[:, 1])), dtype=numpy.float64, shape=data_shape)
# Compute means of nonzero elements
X_row_mean = numpy.zeros(data_shape[0])
X_row_sum = numpy.zeros(data_shape[0])
train_rows, train_cols = X_train.nonzero()
# Iterate through the nonzero elements to compute the sum and count of each row's elements
for i in range(train_rows.shape[0]):
    X_row_mean[train_rows[i]] += X_train[train_rows[i], train_cols[i]]
    X_row_sum[train_rows[i]] += 1
# Note that (X_row_sum == 0) is required to prevent divide by zero
X_row_mean /= X_row_sum + (X_row_sum == 0)
# Subtract mean rating for each user
for i in range(train_rows.shape[0]):
    X_train[train_rows[i], train_cols[i]] -= X_row_mean[train_rows[i]]
test_rows, test_cols = X_test.nonzero()
for i in range(test_rows.shape[0]):
    X_test[test_rows[i], test_cols[i]] -= X_row_mean[test_rows[i]]
X_train = X_train.toarray()
X_test = X_test.toarray()
ks = numpy.arange(2, 50)
train_mae = numpy.zeros(ks.shape[0])
test_mae = numpy.zeros(ks.shape[0])
train_scores = X_train[(train_rows, train_cols)]
test_scores = X_test[(test_rows, test_cols)]
# Now take SVD of X_train
U, s, Vt = numpy.linalg.svd(X_train, full_matrices=False)
for j, k in enumerate(ks):
    X_pred = U[:, 0:k].dot(numpy.diag(s[0:k])).dot(Vt[0:k, :])
    pred_train_scores = X_pred[(train_rows, train_cols)]
    pred_test_scores = X_pred[(test_rows, test_cols)]
    train_mae[j] = mean_absolute_error(train_scores, pred_train_scores)
    test_mae[j] = mean_absolute_error(test_scores, pred_test_scores)
    print(k, train_mae[j], test_mae[j])
plt.plot(ks, train_mae, 'k', label="Train")
plt.plot(ks, test_mae, 'r', label="Test")
plt.xlabel("k")
plt.ylabel("MAE")
plt.legend()
plt.show()
@eggie5 commented Feb 7, 2017

Looking at your learning curve, it's clear that your model is overfitting the noise of the training set. The test curve remains flat, which suggests the model isn't learning anything.

The SVD routine is simply reconstructing the original matrix R using only k singular vectors. This model isn't filling in the blanks in R or learning anything. In other words, you could have made the same predictions on the original R without jumping through the hoops of doing an SVD.

@charanpald (Author) commented Feb 9, 2017

Thanks for your comments. Agreed that the model is overfitting. I wouldn't say the test curve is flat, though: it dips rapidly until k=7, then much more slowly until k=17, and rises again afterwards.

I'm not sure I understand your point that the "model isn't filling in the blanks in R or learning anything". The rank-k SVD is different from R, unless you are assuming R is low rank, which is clearly not the case. Also, I don't see how that means I could have made the same predictions using the original R.

@eggie5 commented Feb 9, 2017

@charanpald The learning curve I see is flat for the test set; see your live notebook here: https://github.com/DSE-capstone-sharknado/main/blob/master/SVD%20RecSys.ipynb

The SVD routine is simply giving you U, s, V, which are the components of R, the original sparse matrix with missing values. All you are doing is reconstructing an approximation of the original R that improves as k increases. The empty values in R will still be empty in the reconstruction. You are not learning anything.

In summary:

The r^k_ij element of the matrix R_k is an approximation of the r_ij element of the original matrix R. In particular, if you were to use the U, s, and V matrices of the expansion as they are (without the dimension reduction) you would have r^k_ij = r_ij. As a corollary, it's impossible to use R_k as a prediction matrix, because you already have this information in matrix R. For example, if the p'th user did not rate the q'th product, r_pq will be 0. At the same time r^k_pq will be 0 too. Not a very insightful prediction, is it?

@charanpald (Author) commented
Your statement that "the empty values in R will still be empty in the reconstruction" is unfortunately wrong. Are you able to prove it for k < rank(R)?
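A quick numpy check (not part of the original gist; the toy matrix below is made up) makes the point concrete: for k < rank(R), the rank-k reconstruction generally takes nonzero values exactly where R was zero.

import numpy

# Hypothetical 4x4 ratings matrix, with missing entries stored as 0
R = numpy.array([[5., 3., 0., 1.],
                 [4., 0., 0., 1.],
                 [1., 1., 0., 5.],
                 [0., 1., 5., 4.]])

U, s, Vt = numpy.linalg.svd(R, full_matrices=False)
k = 2  # any k < rank(R)
R_k = U[:, 0:k].dot(numpy.diag(s[0:k])).dot(Vt[0:k, :])

# Entries that were 0 in R are generally nonzero in the rank-k reconstruction
print(R_k[numpy.isclose(R, 0)])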

@denis-bz commented
NumPy's SVD is quite different from the "Funk SVD" that you need in recommender systems.
Why? See
https://github.com/aaw/IncrementalSVD.jl#great-but-julia-already-has-an-svd-function-ill-just-use-that
(admirably clear)

cheers
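For readers who want to see the difference, here is a minimal sketch of a Funk-style factorization (not from the gist): plain SGD over the observed ratings only, so missing entries never enter the loss, unlike a full SVD of the zero-filled matrix. The learning rate, regularization, and epoch count below are illustrative, not tuned.

import numpy

def funk_svd(rows, cols, vals, shape, k=20, lr=0.005, reg=0.02, epochs=20):
    # Factor R ~ P.dot(Q.T) with SGD over the observed ratings only
    rng = numpy.random.default_rng(0)
    P = rng.normal(scale=0.1, size=(shape[0], k))
    Q = rng.normal(scale=0.1, size=(shape[1], k))
    for _ in range(epochs):
        for u, i, r in zip(rows, cols, vals):
            pu, qi = P[u].copy(), Q[i].copy()
            err = r - pu.dot(qi)  # error on this observed rating only
            P[u] += lr * (err * qi - reg * pu)
            Q[i] += lr * (err * pu - reg * qi)
    return P, Q

# With the arrays built in the gist above, e.g.:
# P, Q = funk_svd(train_rows, train_cols, train_scores, data_shape)
# pred_test_scores = (P[test_rows] * Q[test_cols]).sum(axis=1)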
