
@christinebuckler
Created November 12, 2018 03:50
PySpark deep copy dataframe
import copy
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # the "spark" session object available by default in a shell
X = spark.createDataFrame([[1, 2], [3, 4]], ['a', 'b'])
_schema = copy.deepcopy(X.schema)  # independent copy of the schema object
# zipWithIndex breaks the lineage; drop the index again so each row matches the original schema
_X = X.rdd.zipWithIndex().map(lambda pair: pair[0]).toDF(_schema)
@dfsklar commented Nov 2, 2021

This tiny code fragment totally saved me -- I was running up against Spark 2's infamous "self join" defects and Stack Overflow kept leading me in the wrong direction. Within 2 minutes of finding this nifty fragment I was unblocked. Much gratitude!

@christinebuckler (Author)

@dfsklar Awesome! So glad that it helped!
