@krsnewwave
Created June 15, 2022 16:22
import nvtabular as nvt

# movies: movieId, genres
# train and valid hold ratings with the schema:
#   userId, movieId, rating (1-5)
# all three are stored in parquet format
# `movies` is assumed to be loaded already (e.g. as a DataFrame)
# join the genres from movies onto the userId and movieId columns
# (the "train" and "valid" datasets are passed to the workflow later)
joined = ["userId", "movieId"] >> nvt.ops.JoinExternal(movies, on=["movieId"])
# convert users and movies to categoricals
cat_features = joined >> nvt.ops.Categorify()
# convert explicit ratings to implicit feedback: 4 and 5 become 1, everything else 0
ratings = nvt.ColumnGroup(["rating"]) >> nvt.ops.LambdaOp(lambda col: (col > 3).astype("int8"))
output = cat_features + ratings
# workflow is like a pipeline in sklearn
workflow = nvt.Workflow(output)
output.graph
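
To make the join and the rating threshold concrete, here is a rough pandas sketch of what those two steps compute on a tiny table. The toy data is made up for illustration, and pandas only stands in for NVTabular here; it is not how the workflow itself runs.

```python
import pandas as pd

# Hypothetical toy data standing in for the parquet files described above.
movies = pd.DataFrame({"movieId": [1, 2], "genres": ["Comedy", "Drama"]})
ratings = pd.DataFrame(
    {
        "userId": [10, 10, 11],
        "movieId": [1, 2, 1],
        "rating": [5, 3, 4],
    }
)

# Left join on movieId, roughly what JoinExternal(movies, on=["movieId"]) does.
joined = ratings.merge(movies, on="movieId", how="left")

# Same expression as the lambda passed to LambdaOp above:
# ratings above 3 become 1, the rest become 0.
joined["rating"] = (joined["rating"] > 3).astype("int8")

print(joined)
```

The Categorify step is left out on purpose, since its encoding rules are specific to NVTabular.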