@mkaranasou
Last active October 4, 2019 09:39
Adding indexes to a dataframe with row_num if your data is sortable
>>> from pyspark.sql import Window
>>> from pyspark.sql import functions as F
# row_number is a window function, so it needs a window specification to run over.
# Here the window is not partitioned and is ordered by column1, so it spans the
# whole dataframe (2 rows in this example) and numbers the rows from 1 in column1 order.
# Note: with no partitionBy, Spark moves all rows to a single partition to number them.
>>> window = Window.orderBy(F.col('column1'))
>>> df_final = df_final.withColumn('row_number', F.row_number().over(window))
>>> df_final.select('index', 'row_number', 'column1', 'column2').show()
+-----+----------+-------+-------+
|index|row_number|column1|column2|
+-----+----------+-------+-------+
|    0|         1|      1|      2|
|    1|         2|     15|     21|
+-----+----------+-------+-------+
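
To make the numbering rule concrete, here is a minimal plain-Python sketch (not Spark code) of what row_number over a window ordered by column1 computes: sort the rows by the ordering column, then number them starting at 1. The helper name add_row_number is hypothetical and exists only for this illustration; the two rows are the ones shown in the output above, listed out of order on purpose.

```python
# Plain-Python illustration only; this does not use or require pyspark.
# The two rows from the output above, deliberately listed out of order.
rows = [
    {"index": 1, "column1": 15, "column2": 21},
    {"index": 0, "column1": 1, "column2": 2},
]


def add_row_number(rows, order_key):
    """Hypothetical helper: sort rows by order_key and attach a 1-based row_number."""
    ordered = sorted(rows, key=lambda r: r[order_key])
    return [dict(r, row_number=i) for i, r in enumerate(ordered, start=1)]


numbered = add_row_number(rows, "column1")
for r in numbered:
    print(r["index"], r["row_number"], r["column1"], r["column2"])
# prints:
# 0 1 1 2
# 1 2 15 21
```

The numbering depends only on the order of column1, which is why this approach assumes the data is sortable by that column.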