@mkaranasou
Created May 13, 2021 07:48
Testing explode as a way to get sequential ids in a Spark DataFrame
from pyspark import SparkConf
from pyspark.sql import SparkSession, functions as F

if __name__ == '__main__':
    conf = SparkConf()
    spark = SparkSession.builder \
        .config(conf=conf) \
        .appName('Dataframe with Indexes') \
        .getOrCreate()

    # create a simple dataframe with two columns
    data = [{'column1': 1, 'column2': 2}, {'column1': 15, 'column2': 21}]
    df = spark.createDataFrame(data)
    df.show()

    # Try to get sequential ids by exploding the literal array [1, ..., n].
    # The array is identical for every row, so explode emits n rows per
    # input row (see the second table below) instead of one id per row.
    df = df.withColumn("row_id",
                       F.explode(F.array([F.lit(i) for i in range(1, df.count() + 1)])))
    df.show()
    # +-------+-------+
    # |column1|column2|
    # +-------+-------+
    # |      1|      2|
    # |     15|     21|
    # +-------+-------+
    #
    # +-------+-------+------+
    # |column1|column2|row_id|
    # +-------+-------+------+
    # |      1|      2|     1|
    # |      1|      2|     2|
    # |     15|     21|     1|
    # |     15|     21|     2|
    # +-------+-------+------+
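
The second table has four rows rather than two because the exploded array is the same constant for every row, so each input row is repeated once per array element. The sketch below models that behaviour in plain Python, with no Spark involved, and contrasts it with one id per row. The helper names `explode_constant_array` and `sequential_ids` are hypothetical and only illustrate the idea; in Spark itself, `rdd.zipWithIndex()` or `F.row_number()` over a window are the usual ways to get one index per row.

```python
# Plain-Python model of the behaviour above; this is not Spark code.
rows = [{'column1': 1, 'column2': 2}, {'column1': 15, 'column2': 21}]
ids = list(range(1, len(rows) + 1))  # the same literal array the gist builds: [1, 2]


def explode_constant_array(rows, ids):
    """Model of exploding a constant array: each row is emitted once per element."""
    return [{**row, 'row_id': i} for row in rows for i in ids]


def sequential_ids(rows):
    """What was actually wanted: exactly one id per row, starting at 1."""
    return [{**row, 'row_id': i} for i, row in enumerate(rows, start=1)]


exploded = explode_constant_array(rows, ids)
indexed = sequential_ids(rows)

print(len(exploded))                     # 4 rows: 2 input rows x 2 array elements
print([r['row_id'] for r in exploded])   # [1, 2, 1, 2]
print([r['row_id'] for r in indexed])    # [1, 2]
```

The row count grows as n squared under the explode approach, so beyond giving the wrong ids it also becomes expensive quickly on larger DataFrames.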