Apache Spark: issues with the optimizer when using a PySpark UDF and complex types
from pyspark.sql import SparkSession, functions as f
from pyspark.sql.types import IntegerType
import sys

if __name__ == '__main__':
    spark = SparkSession.builder.getOrCreate()
    print('Spark version:', spark.version)
    print('Python version:', sys.version)

    # Constant UDF: always returns 1, regardless of its input.
    foo_udf = f.udf(lambda x: 1, IntegerType())

    # Call the UDF once, then reference its result column twice more.
    # When the optimizer collapses the stacked projections, the Python UDF
    # ends up being evaluated three times per row instead of once
    # (see the BatchEvalPython node in the plan below).
    df = spark.createDataFrame([['bar']]) \
        .withColumn('result', foo_udf(f.col('_1'))) \
        .withColumn('a', f.col('result')) \
        .withColumn('b', f.col('result'))

    df.explain()
    df.show()

Spark version: 2.3.1
Python version: 3.6.5 (v3.6.5:f59c0932b4, Mar 28 2018, 17:00:18) [MSC v.1900 64 bit (AMD64)]
== Physical Plan ==
*(1) Project [_1#0, pythonUDF2#16 AS result#2, pythonUDF2#16 AS a#5, pythonUDF2#16 AS b#9]
+- BatchEvalPython [<lambda>(_1#0), <lambda>(_1#0), <lambda>(_1#0)], [_1#0, pythonUDF0#14, pythonUDF1#15, pythonUDF2#16]
+- Scan ExistingRDD[_1#0]
+---+------+---+---+
| _1|result| a| b|
+---+------+---+---+
|bar| 1| 1| 1|
+---+------+---+---+
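
The plan shows the problem: BatchEvalPython evaluates the same lambda three times per row (pythonUDF0#14, pythonUDF1#15, pythonUDF2#16), even though only pythonUDF2#16 feeds the final Project. A commonly suggested workaround, sketched below and not part of the original gist, is to mark the UDF as nondeterministic; the optimizer then avoids inlining it into the later projections, so it is typically evaluated once per row (behavior may vary across Spark versions):

    # Sketch of a workaround (assumption: asNondeterministic(), available
    # since Spark 2.3, prevents the optimizer from duplicating the UDF call
    # when collapsing the stacked withColumn projections).
    foo_udf_once = f.udf(lambda x: 1, IntegerType()).asNondeterministic()

    df2 = spark.createDataFrame([['bar']]) \
        .withColumn('result', foo_udf_once(f.col('_1'))) \
        .withColumn('a', f.col('result')) \
        .withColumn('b', f.col('result'))

    df2.explain()  # BatchEvalPython should now list the lambda only once

Note that nondeterminism is a semantic hint: it also blocks other optimizations (e.g. predicate pushdown through the UDF), so it trades optimizer freedom for a single evaluation.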