@icexelloss
Created January 24, 2018 22:30
Groupby apply group key benchmark
from pyspark.sql.functions import col, count, pandas_udf, PandasUDFType
from pyspark.sql.types import IntegerType

# 10M rows, ~1000 rows per group key
df = spark.range(0, 10 * 1000 * 1000).withColumn('id', (col('id') / 1000).cast(IntegerType()))
df.cache()
df.count()

# Input and output are both a pandas.DataFrame
@pandas_udf(df.schema, PandasUDFType.GROUPED_MAP)
def foo_udf(pdf):
    return pdf
# 8.61 s ± 333 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df.groupby('id').apply(foo_udf).agg(count(col('id'))).show()
# 8.51 s ± 477 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df.groupby(df.id + 1).apply(foo_udf).agg(count(col('id'))).show()
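The grouped map UDF above is the identity function, so the benchmark measures only the group-key machinery. A pandas-only sketch of the per-group semantics (the names `pdf` and `foo` here are illustrative, not Spark API): each group's rows arrive as a DataFrame and the returned DataFrames are concatenated back together.

```python
import pandas as pd

# Toy frame standing in for one Spark partition's data.
pdf = pd.DataFrame({'id': [0, 0, 1, 1, 2],
                    'v': [1.0, 2.0, 3.0, 4.0, 5.0]})

# Identity function, mirroring foo_udf: DataFrame in, DataFrame out.
def foo(g):
    return g

# groupby().apply() calls foo once per 'id' group; identity leaves
# the data unchanged, so row count and values are preserved.
out = pdf.groupby('id', group_keys=False).apply(foo)
```

Grouping by an expression such as `df.id + 1` in Spark corresponds to `pdf.groupby(pdf['id'] + 1)` in pandas: the groups are the same size, which is consistent with the two timings above being nearly identical.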