Aravind Yarram yaravind
yaravind / spark-duplicates.scala
Created May 31, 2017 14:39 — forked from crocker/spark-duplicates.scala
Find duplicates in a Spark DataFrame
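// Needed for the $"col" syntax outside spark-shell, where it is pre-imported.
import spark.implicits._
// Load the transactions. "header" and "inferSchema" are CSV reader options
// and have no effect on the JSON source, which infers the schema by default.
val transactions = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .json("s3n://bucket-name/transaction.json")
// Count rows per (id, organization) and list the most repeated keys first.
transactions.groupBy("id", "organization").count().sort($"count".desc).show()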
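// A minimal sketch, not part of the original gist: the grouped counts above
// list every (id, organization) pair, including pairs that occur only once.
// Assuming a duplicate means a pair that occurs more than once, keep only
// those groups. `duplicateKeys` is a name introduced here for illustration.
import org.apache.spark.sql.functions.col
val duplicateKeys = transactions
  .groupBy("id", "organization")
  .count()
  .filter(col("count") > 1)
duplicateKeys.sort(col("count").desc).show()
// To inspect the full duplicated rows, join the keys back onto the source data.
transactions.join(duplicateKeys, Seq("id", "organization")).show()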
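import java.time.LocalDate
// LocalDate.toString yields ISO-8601 (yyyy-MM-dd), used here as the path prefix.
val today = LocalDate.now
val todayTransactions = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .json(s"s3n://bucket-name/${today}/transaction.json")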
val yesterdayTransactions = spark.read
  .option("header", "true")
  .option("inferSchema", "true")