
@christinebuckler
Created November 12, 2018 03:50
PySpark deep copy dataframe
import copy
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # the "spark" session object available by default in a shell
X = spark.createDataFrame([[1, 2], [3, 4]], ['a', 'b'])
_schema = copy.deepcopy(X.schema)  # independent copy of the schema object
# zipWithIndex breaks the lineage; drop the index again so each row matches the original schema
_X = X.rdd.zipWithIndex().map(lambda pair: pair[0]).toDF(_schema)
@dfsklar commented Nov 2, 2021

This tiny code fragment totally saved me -- I was running up against Spark 2's infamous "self join" defects and Stack Overflow kept leading me in the wrong direction. Within 2 minutes of finding this nifty fragment I was unblocked. Much gratitude!

@christinebuckler (Author)

@dfsklar Awesome! So glad that it helped!
