@bgweber
Last active June 2, 2019 03:57
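import pandas as pd

# load the CSV into pandas, then convert it to a Spark data frame
# (assumes an active SparkSession named `spark`, as in a Databricks notebook)
pandas_df = pd.read_csv(
    "https://github.com/bgweber/Twitch/raw/master/Recommendations/games-expand.csv")
spark_df = spark.createDataFrame(pandas_df)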
# assign each row a random user ID (1..N) and a partition ID (0-9) using Spark SQL
spark_df.createOrReplaceTempView("spark_df")
spark_df = spark.sql("""
  select *, user_id % 10 as partition_id
  from (
    select *, row_number() over (order by rand()) as user_id
    from spark_df
  )
""")
# preview the results (display() is a Databricks notebook helper;
# outside Databricks, use spark_df.show() instead)
display(spark_df)