@tilakpatidar
Created March 25, 2018 13:28
Finding records unique to an ORC file against existing MySQL rows, using Apache Spark
import spark.implicits._
import org.apache.spark.sql.SaveMode

// Existing product rows from MySQL.
val products = spark.read.format("jdbc").option("driver", "com.mysql.jdbc.Driver").option("url", "jdbc:mysql://localhost/mopar_demo").option("dbtable", "products").option("user", "gobblin").option("password", "gobblin").load()

// Newly ingested rows from the ORC file.
val newProducts = spark.read.format("orc").load("/Users/tilak/gobblin/mopar-demo/output/org/apache/gobblin/copy/user/tilak/pricing.products_1521799535.csv/20180325023900_append/part.task_PullCsvFromS3_1521945534992_0_0.orc")

val repartitionedProducts = products.repartition(10)

// Left outer join on sha: ORC rows with no matching MySQL row get a null op.sha.
val joined = newProducts.as("np").join(repartitionedProducts.as("op"), newProducts("sha") === repartitionedProducts("sha"), "left_outer")

// Keep only the records unique to the ORC file, then drop the MySQL-side columns.
val newNewProducts = joined.where($"op.sha".isNull).select("np.*")

newNewProducts.write.mode(SaveMode.Overwrite).format("orc").save("/tmp/myapp.orc")
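The join-then-filter above is a distributed set difference keyed on sha. A minimal sketch of the same logic with plain Scala collections (the sha values here are hypothetical, standing in for the two DataFrames):

```scala
// Hypothetical sha values from the ORC file and from MySQL.
val orcShas   = Set("a1", "b2", "c3")
val mysqlShas = Set("b2", "d4")

// Records unique to the ORC file: in orcShas but not in mysqlShas.
val uniqueShas = orcShas -- mysqlShas

println(uniqueShas) // the shas with no MySQL counterpart, e.g. Set(a1, c3)
```

In Spark the same result can also be had in one step with a `"left_anti"` join type, which keeps left-side rows that have no match on the right and needs no null filter afterwards.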