Last active
December 7, 2019 15:59
-
-
Save aialenti/39e1a1a8901701529e06774466da11cc to your computer and use it in GitHub Desktop.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
// The following row avoids the broadcasting, the dimension_table2 is very small | |
spark.conf.set("spark.sql.autoBroadcastJoinThreshold",-1) | |
// I'm using caching to simplify the DAG | |
dimension_table2.cache | |
dimension_table2.count | |
// One way to use the same partitioner is to partition on a column with the same name, | |
// let's rename the columns that we want to join | |
fact_table = fact_table.withColumnRenamed("dimension_2_key", "repartition_id") | |
dimension_table2 = dimension_table2.withColumnRenamed("id", "repartition_id") | |
fact_table = fact_table.repartition(400, fact_table.col("repartition_id")) | |
fact_table = fact_table.join(dimension_table2.repartition(400, dimension_table2.col("repartition_id")), | |
fact_table.col("repartition_id") === dimension_table2.col("repartition_id"), "left") | |
fact_table.count |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment