from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Confidence Model") \
    .enableHiveSupport() \
    .getOrCreate()
# Tell Spark to use a directory called `checkpoint` to
# store checkpoints.
sc = spark.sparkContext
sc.setCheckpointDir('checkpoint')
### This may cause Py4JJavaError: An error occurred while calling o1019.fit.: java.lang.StackOverflowError
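train_df = train_df.select(cols)
train_df.cache()
# Note: checkpoint() returns a new DataFrame; the result is discarded here,
# so train_df itself keeps its full lineage.
train_df.checkpoint()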
train_df.show(n=3, truncate=False, vertical=True)
# ... many more cache() and checkpoint() calls in between, none of them involving train_df
model_pred = pipeline_pred.fit(train_df)
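One detail of the snippet above: `DataFrame.checkpoint()` does not modify `train_df` in place. It returns a new, checkpointed DataFrame, and here that return value is thrown away. A minimal sketch of keeping it, assuming the long lineage of `train_df` is what triggers the error (the helper name `checkpointed` is hypothetical and not part of the original gist):

```python
def checkpointed(df, eager=True):
    """Return the checkpointed copy of `df` (hypothetical helper).

    DataFrame.checkpoint() returns a new DataFrame with a truncated
    logical plan; the caller has to keep the returned object.
    """
    return df.checkpoint(eager=eager)


# Usage (assumes sc.setCheckpointDir(...) was already called):
# train_df = checkpointed(train_df.select(cols))
# model_pred = pipeline_pred.fit(train_df)
```

Whether this alone avoids the StackOverflowError is not confirmed by the original gist, which reports the RDD round trip as the fix.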
kittipatkampa / convert_to_rdd_then_df.py
Created July 31, 2019 21:29
Converting a PySpark DataFrame to an RDD and back to a DataFrame can resolve the StackOverflowError.
train_df = spark.createDataFrame(train_df.rdd, schema=train_df.schema)
model_pred = pipeline_pred.fit(train_df)
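The round trip above can be wrapped in a small reusable helper. This is only a sketch: `rebuild_from_rdd` is a hypothetical name, and the assumption is that rebuilding the DataFrame from its RDD hands Spark a fresh, short logical plan in place of the long accumulated one.

```python
def rebuild_from_rdd(spark, df):
    """Rebuild `df` from its underlying RDD, keeping the same schema.

    `spark` is the active SparkSession and `df` is a PySpark DataFrame.
    """
    return spark.createDataFrame(df.rdd, schema=df.schema)


# Usage, mirroring the snippet above:
# train_df = rebuild_from_rdd(spark, train_df)
# model_pred = pipeline_pred.fit(train_df)
```

Passing `schema=df.schema` explicitly keeps the column names and types unchanged and avoids schema inference on the RDD.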