@jfkirk
Last active October 22, 2019 13:20
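# Imports this snippet relies on
from scipy import sparse
from sklearn.preprocessing import MultiLabelBinarizer

# Assumes raw_movie_metadata, raw_movie_metadata_header,
# movielens_to_internal_item_ids, and n_items are defined by earlier steps.

# Map the MovieLens IDs to our internal IDs and keep track of the genres and titles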
movie_genres_by_internal_id = {}
movie_titles_by_internal_id = {}
for row in raw_movie_metadata:
    row[0] = movielens_to_internal_item_ids[int(row[0])]  # Map to IDs
    row[2] = row[2].split('|')  # Split up the genres
    movie_genres_by_internal_id[row[0]] = row[2]
    movie_titles_by_internal_id[row[0]] = row[1]
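# Look at an example movie metadata row. The loop above modified the rows in
# place, so the ID is already the internal ID and the genres are already a list.
print("Raw metadata example:\n{}\n{}".format(raw_movie_metadata_header,
                                             raw_movie_metadata[0]))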
# Build a list of genres where the index is the internal movie ID and
# the value is a list of [Genre, Genre, ...]
movie_genres = [movie_genres_by_internal_id[internal_id]
                for internal_id in range(n_items)]
# Transform the genres into binarized labels using scikit-learn's MultiLabelBinarizer
movie_genre_features = MultiLabelBinarizer().fit_transform(movie_genres)
n_genres = movie_genre_features.shape[1]
print("Binarized genres example for movie {}:\n{}".format(movie_titles_by_internal_id[0],
                                                          movie_genre_features[0]))
# Coerce the movie genre features to a sparse matrix, which TensorRec expects
movie_genre_features = sparse.coo_matrix(movie_genre_features)
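# Illustration only, not part of the original pipeline: a minimal pure-Python
# sketch of the kind of 0/1 indicator matrix the binarization step above
# produces. The helper name and the toy data are hypothetical. The sketch
# assumes one column per distinct genre, with columns in sorted order.
def binarize_genres_sketch(genre_lists):
    """Return (sorted genre names, list of 0/1 rows), one row per movie."""
    genres = sorted({genre for row in genre_lists for genre in row})
    column_of = {genre: col for col, genre in enumerate(genres)}
    rows = []
    for row in genre_lists:
        indicator = [0] * len(genres)
        for genre in row:
            indicator[column_of[genre]] = 1  # Mark each genre the movie has
        rows.append(indicator)
    return genres, rows

# Hypothetical toy input, not taken from MovieLens
toy_genres = [["Comedy", "Romance"], ["Action"], ["Action", "Comedy"]]
toy_classes, toy_rows = binarize_genres_sketch(toy_genres)
# toy_classes is ['Action', 'Comedy', 'Romance']
# toy_rows is [[0, 1, 1], [1, 0, 0], [1, 1, 0]]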