@aialenti
Last active December 7, 2019 15:41
// The following line disables automatic broadcast joins: dimension_table2
// is very small, so my configuration would otherwise broadcast it
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
// I'm using caching to simplify the DAG; the count is an action
// that materializes the cache before the join runs
dimension_table2.cache()
dimension_table2.count()
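// Repartition both sides into 400 partitions, then left-join on the key.
// Note: reassigning fact_table requires it to be declared as a `var`.
fact_table = fact_table.repartition(400)
fact_table = fact_table.join(dimension_table2.repartition(400),
  fact_table.col("dimension_2_key") === dimension_table2.col("id"), "left")
// Trigger the job with an action
fact_table.count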
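// Optional check, not part of the original gist: a minimal sketch that
// prints the physical plan of the joined DataFrame. It assumes the
// `fact_table` defined above is in scope. With the broadcast threshold
// set to -1, the plan should typically show a sort-merge join rather
// than a broadcast hash join for this equi-join.
fact_table.explain()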
