@kittipatkampa
kittipatkampa / convert_to_rdd_then_df.py
Created July 31, 2019 21:29
Converting a PySpark DataFrame to an RDD and back to a DataFrame can resolve a java.lang.StackOverflowError.
train_df = spark.createDataFrame(train_df.rdd, schema=train_df.schema)
model_pred = pipeline_pred.fit(train_df)
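The round trip above can be wrapped in a small helper. This is a minimal sketch and not part of the original gist: `truncate_lineage` is a hypothetical name, and `spark` is assumed to be an active SparkSession. Rebuilding the DataFrame from its RDD and schema gives Spark a fresh plan to work from, which is presumably why the workaround avoids the deep recursion.

```python
def truncate_lineage(spark, df):
    """Rebuild `df` from its underlying RDD and schema.

    Hypothetical helper (not from the gist). `spark` is assumed to be an
    active SparkSession and `df` a PySpark DataFrame.
    """
    # Same rows and schema, but constructed from the RDD rather than
    # from the accumulated chain of DataFrame transformations.
    return spark.createDataFrame(df.rdd, schema=df.schema)
```

Usage would look like `train_df = truncate_lineage(spark, train_df)` right before `pipeline_pred.fit(train_df)`.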
### This may cause Py4JJavaError: An error occurred while calling o1019.fit.: java.lang.StackOverflowError
train_df = train_df.select(cols)
train_df.cache()
train_df.checkpoint()  # note: checkpoint() returns a new DataFrame; the result is discarded here
train_df.show(n=3, truncate=False, vertical=True)
# ... many cache() and checkpoint() calls in between, none of them involving train_df
model_pred = pipeline_pred.fit(train_df)
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Confidence Model") \
    .enableHiveSupport() \
    .getOrCreate()
# Tell Spark to store checkpoints in a directory
# named `checkpoint`.
sc = spark.sparkContext
sc.setCheckpointDir('checkpoint')
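With a checkpoint directory configured, `DataFrame.checkpoint()` returns a new, checkpointed DataFrame rather than modifying the one it is called on. The sketch below keeps that return value; `checkpointed` is a hypothetical helper name, not something from the gist, and it assumes `sc.setCheckpointDir(...)` has already been called.

```python
def checkpointed(df, eager=True):
    """Return the checkpointed version of `df`.

    Hypothetical helper (not from the gist). Assumes a checkpoint
    directory was set on the SparkContext beforehand.
    """
    # checkpoint() does not change `df` in place; the caller has to use
    # the DataFrame that is returned here.
    return df.checkpoint(eager=eager)
```

Usage would look like `train_df = checkpointed(train_df)`, so that later steps work from the checkpointed DataFrame.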