Skip to content

Instantly share code, notes, and snippets.

@AbdealiLoKo
Created January 2, 2023 06:01
Show Gist options
  • Save AbdealiLoKo/49f39f8704bda2b3714b6d6f63668863 to your computer and use it in GitHub Desktop.
Save AbdealiLoKo/49f39f8704bda2b3714b6d6f63668863 to your computer and use it in GitHub Desktop.
Spark - .withColumn() vs .select()
"""
Simple benchmark to check if withColumn() is faster or select() is faster
Conflusion: select() is faster than withColumn() in a for loop as lesser dataframes are created
"""
import datetime
import findspark; findspark.init(); import pyspark
spark = pyspark.sql.SparkSession.builder.getOrCreate()
for ncol in [10, 100, 1000, 2000, 5000]:
print(f'Running for ncol={ncol}')
# Approach 1: Use withColumn multiple times
df1 = spark.createDataFrame([[1]], ['a'])
start_t = datetime.datetime.now()
for i in range(ncol):
df1 = df1.withColumn(f'col_{i}', df1['a'])
secs1 = (datetime.datetime.now() - start_t).total_seconds()
print(f' - .withColumn() time for {len(df1.columns) - 1} columns: {secs1}')
# Approach 2: Use .select() one time
df2 = spark.createDataFrame([[1]], ['a'])
start_t = datetime.datetime.now()
df2 = df2.select(df2['a'], *[df2['a'].alias(f'col_{i}') for i in range(ncol)])
secs2 = (datetime.datetime.now() - start_t).total_seconds()
print(f' - .select() time for {len(df2.columns) - 1} columns: {secs2}')
print(f' >>> Ratio of withColumn:select = {secs1 / secs2:.2f}')
assert df1.dtypes == df2.dtypes
# Results:
# Running for ncol=10
# - .withColumn() time for 10 columns: 0.113788
# - .select() time for 10 columns: 0.01681
# >>> Ratio of withColumn:select = 6.77
# Running for ncol=100
# - .withColumn() time for 100 columns: 0.733191
# - .select() time for 100 columns: 0.061565
# >>> Ratio of withColumn:select = 11.91
# Running for ncol=1000
# - .withColumn() time for 1000 columns: 23.12309
# - .select() time for 1000 columns: 0.862976
# >>> Ratio of withColumn:select = 26.79
# Running for ncol=2000
# - .withColumn() time for 2000 columns: 140.796384
# - .select() time for 2000 columns: 1.564098
# >>> Ratio of withColumn:select = 90.02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment