
@aialenti
Last active December 7, 2019 15:59
// The following line disables automatic broadcast joins: even though
// dimension_table2 is very small, we want to force a shuffle-based join here
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
// Cache dimension_table2 and materialize it with an action to simplify the DAG
dimension_table2.cache
dimension_table2.count
// One way to make both sides use the same partitioner is to repartition them
// on a column with the same name, so let's rename the join columns accordingly
fact_table = fact_table.withColumnRenamed("dimension_2_key", "repartition_id")
dimension_table2 = dimension_table2.withColumnRenamed("id", "repartition_id")
// Repartition both sides on the join key into 400 partitions, then join
fact_table = fact_table.repartition(400, fact_table.col("repartition_id"))
fact_table = fact_table.join(
  dimension_table2.repartition(400, dimension_table2.col("repartition_id")),
  fact_table.col("repartition_id") === dimension_table2.col("repartition_id"),
  "left")
// Trigger execution with an action
fact_table.count
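The reason co-partitioning helps is that when both tables are hash-partitioned on the join key into the same number of partitions, matching rows always land in the same partition, so each partition can be joined locally without moving data again. A minimal sketch of that idea in plain Python (the names `hash_partition` and `N_PARTITIONS` are illustrative, not Spark APIs):

```python
N_PARTITIONS = 4

def hash_partition(rows, key_index, n=N_PARTITIONS):
    """Distribute tuple rows into n buckets by hashing the join key."""
    buckets = [[] for _ in range(n)]
    for row in rows:
        buckets[hash(row[key_index]) % n].append(row)
    return buckets

# fact rows: (dimension_2_key, value); dimension rows: (id, attribute)
fact = [(1, "a"), (2, "b"), (3, "c"), (2, "d")]
dim = [(1, "x"), (2, "y"), (3, "z")]

# Same partitioner (same hash, same partition count) on both sides
fact_parts = hash_partition(fact, key_index=0)
dim_parts = hash_partition(dim, key_index=0)

# Local per-partition left join: no partition ever needs another's data
joined = []
for fp, dp in zip(fact_parts, dim_parts):
    lookup = {k: v for k, v in dp}
    for k, v in fp:
        joined.append((k, v, lookup.get(k)))

print(sorted(joined))
# → [(1, 'a', 'x'), (2, 'b', 'y'), (2, 'd', 'y'), (3, 'c', 'z')]
```

This is what the two `repartition(400, ...)` calls above arrange: once both sides agree on the partitioner, the join degenerates into independent per-partition work.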