@crocker
Last active July 2, 2020 12:15
Find duplicates in a Spark DataFrame
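import spark.implicits._ // enables the $"col" syntax (spark-shell imports this automatically)
// The JSON reader always infers the schema; "header" and "inferSchema" are CSV options and have no effect here.
// Note: the s3n:// connector was removed in Hadoop 3; newer clusters typically use s3a:// instead.
val transactions = spark.read
  .json("s3n://bucket-name/transaction.json")
// Count rows per (id, organization) pair, largest groups first; any count above 1 is a duplicate.
transactions.groupBy("id", "organization").count.sort($"count".desc).show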
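
The query above lists every (id, organization) group, so unique rows appear alongside the duplicates. A minimal sketch of narrowing the output to duplicated keys only, assuming a local SparkSession and a small in-memory dataset in place of the S3 file. The `FindDuplicates` object, the `duplicateKeys` helper, and the sample rows are hypothetical names and data introduced for illustration, not part of the original gist.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object FindDuplicates {
  // Hypothetical helper: returns one row per key combination that occurs more than once,
  // with its occurrence count, sorted so the most duplicated keys come first.
  def duplicateKeys(df: DataFrame, keys: String*): DataFrame =
    df.groupBy(keys.map(col): _*)
      .count()
      .filter(col("count") > 1)
      .sort(col("count").desc)

  def main(args: Array[String]): Unit = {
    // Local session for experimentation; on a cluster or in spark-shell, reuse the existing `spark`.
    val spark = SparkSession.builder()
      .appName("find-duplicates")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample rows standing in for transaction.json.
    val transactions = Seq(
      ("t1", "acme", 10.0),
      ("t1", "acme", 10.0),
      ("t2", "acme", 5.0),
      ("t3", "globex", 7.5),
      ("t3", "globex", 7.5),
      ("t3", "globex", 7.5)
    ).toDF("id", "organization", "amount")

    // Only the keys that appear more than once.
    val dupes = duplicateKeys(transactions, "id", "organization")
    dupes.show()

    // Join back on the key columns to inspect the full duplicated rows.
    val duplicateRows = transactions.join(dupes.drop("count"), Seq("id", "organization"))
    duplicateRows.show()

    spark.stop()
  }
}
```

If the goal is to remove the duplicates rather than report them, `transactions.dropDuplicates("id", "organization")` keeps one row per key, though which row survives is not guaranteed.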