@jomoespe
Created February 14, 2019 07:28
Snippet of a Spark job to merge Parquet files, also removing duplicates
val partitions = 5 // this value depends on data and volumes; it will be different in every case
val df = spark.read.parquet("URI://path/to/parquet/files/")
df.createOrReplaceTempView("df")
val df_output = spark
  .sql("SELECT DISTINCT * FROM df") // this removes duplicates. If it's not needed, simply remove this line
  .coalesce(partitions)
df_output.write.parquet("URI://path/to/destination")
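
The same merge can also be expressed with the DataFrame API alone, without registering a temporary view: dropDuplicates() with no arguments deduplicates on all columns, just like SELECT DISTINCT *. A minimal, self-contained sketch, assuming Spark 2.x or later; the application name and paths are placeholders, and the overwrite mode is an assumption about how the destination should be handled:

import org.apache.spark.sql.{SaveMode, SparkSession}

object MergeParquet {
  def main(args: Array[String]): Unit = {
    // Placeholder app name; adjust for the actual job.
    val spark = SparkSession.builder().appName("merge-parquet").getOrCreate()

    val partitions = 5 // tune to the data volume, as in the snippet above

    spark.read.parquet("URI://path/to/parquet/files/")
      .dropDuplicates()          // same effect as SELECT DISTINCT *; remove if not needed
      .coalesce(partitions)      // merge into fewer, larger output files
      .write
      .mode(SaveMode.Overwrite)  // assumption: replace the destination if it already exists
      .parquet("URI://path/to/destination")

    spark.stop()
  }
}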