Skip to content

Instantly share code, notes, and snippets.

@justinnaldzin
Created February 13, 2019 21:35
Show Gist options
  • Save justinnaldzin/31896893a1a31d1e5b8e47e8d458d53a to your computer and use it in GitHub Desktop.
Save justinnaldzin/31896893a1a31d1e5b8e47e8d458d53a to your computer and use it in GitHub Desktop.

Unique ID column

Generate a unique identifier that consistently produces the same result each time based on the values in the row. The ID column will be the first column positioned in the DataFrame.

from pyspark.sql.functions import sha2, concat_ws

columns = df.columns
df = df.withColumn(id_col, sha2(concat_ws("||", *df.columns), 256))
df = df.select([id_col] + columns)
df.printSchema()

Assert the ID column generated is consistent each time

df.select(id_col).orderBy(id_col).limit(5).show(truncate=False)
df2 = df.drop(id_col)
df2 = df2.withColumn(id_col, sha2(concat_ws("||", *df2.columns), 256))
df2 = df2.select([id_col] + columns)
df2.select(id_col).orderBy(id_col).limit(5).show(truncate=False)
diff_rows = df.subtract(df2).count()
print("Number of rows that are different: {}".format(diff_rows))
assert diff_rows == 0
df2.unpersist()

Monotonically increasing ID column

Add a monotonically increasing identifier column. The ID column will be the first column positioned in the DataFrame.

from pyspark.sql.functions import monotonically_increasing_id

columns = df.columns
df = df.withColumn(id_col, monotonically_increasing_id())
df = df.select([id_col] + columns)
df.show()
df.printSchema()
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment