Last active
October 4, 2019 09:39
Adding an index to a dataframe with row_number when your data is sortable
>>> from pyspark.sql import Window
>>> from pyspark.sql import functions as F
# row_number is a window function, so it needs a window specification to run over.
# A window defines which rows the function sees and in what order.
# Here the window has no partitioning and is ordered by column1, so it spans the
# whole dataframe (2 rows in this example) and numbers the rows one by one.
>>> window = Window.orderBy(F.col('column1'))
>>> df_final = df_final.withColumn('row_number', F.row_number().over(window))
>>> df_final.select('index', 'row_number', 'column1', 'column2').show()
+-----+----------+-------+-------+
|index|row_number|column1|column2|
+-----+----------+-------+-------+
|    0|         1|      1|      2|
|    1|         2|     15|     21|
+-----+----------+-------+-------+
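To make the numbering rule concrete, here is a minimal pure-Python sketch of what row_number over a window ordered by column1 amounts to: sort the rows by that column, then count from 1. It is not PySpark and not how Spark runs the query. The helper name `add_row_number` and the list-of-dicts layout are assumptions made only for illustration.

```python
from operator import itemgetter


def add_row_number(rows, order_by, name="row_number"):
    """Return copies of `rows` sorted by `order_by`, each with a 1-based position added.

    Hypothetical helper: it mimics row_number() over an ordered, unpartitioned window.
    """
    ordered = sorted(rows, key=itemgetter(order_by))
    return [{**row, name: position} for position, row in enumerate(ordered, start=1)]


# The same two rows as in the output above, deliberately given out of order.
rows = [
    {"index": 1, "column1": 15, "column2": 21},
    {"index": 0, "column1": 1, "column2": 2},
]

numbered = add_row_number(rows, order_by="column1")
for row in numbered:
    print(row)
```

The numbering follows the sort order, not the order in which the rows arrived, and it starts at 1. Subtracting 1 from row_number gives a zero-based index like the index column shown above.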