@yaravind
Forked from crocker/spark-duplicates.scala
Created May 31, 2017 14:39
Find duplicates in a Spark DataFrame
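// In spark-shell, spark.implicits._ is already imported, which enables the $"count" syntax.
// The "header" and "inferSchema" options apply only to the CSV reader; the JSON reader infers the schema itself.
val transactions = spark.read
  .json("s3n://bucket-name/transaction.json")
// Count rows per (id, organization) pair, most frequent first; any count above 1 marks a duplicate.
transactions.groupBy("id", "organization").count.sort($"count".desc).show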
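The query above lists every (id, organization) pair, including those that occur only once. A minimal sketch of one way to keep only the duplicated keys, and to pull back the full rows behind them, might look like the following. It is an illustration rather than part of the original gist: the object name `DuplicateKeys`, the helpers `duplicateKeys` and `duplicateRows`, the `amount` column, and the in-memory sample rows are all hypothetical, and a local SparkSession stands in for the cluster and the S3 file.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object DuplicateKeys {
  // One row per (id, organization) key that appears more than once, with its count.
  def duplicateKeys(df: DataFrame): DataFrame =
    df.groupBy("id", "organization")
      .count()
      .filter(col("count") > 1)
      .orderBy(col("count").desc)

  // The full original rows whose (id, organization) key is duplicated.
  def duplicateRows(df: DataFrame): DataFrame =
    df.join(duplicateKeys(df).drop("count"), Seq("id", "organization"))

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; in spark-shell, `spark` already exists.
    val spark = SparkSession.builder()
      .appName("duplicate-keys-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data standing in for transaction.json.
    val transactions = Seq(
      (1, "acme", 10.0),
      (1, "acme", 12.5),
      (2, "acme", 7.0),
      (3, "globex", 3.0)
    ).toDF("id", "organization", "amount")

    duplicateKeys(transactions).show()
    duplicateRows(transactions).show()

    spark.stop()
  }
}
```

Joining on `Seq("id", "organization")` keeps a single copy of each key column in the result, which avoids ambiguous column references when a DataFrame is joined with an aggregate of itself.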