@justinnaldzin
Created July 18, 2018 19:29
Estimate size of Spark DataFrame in bytes
from pyspark.serializers import AutoBatchedSerializer, PickleSerializer

# Function to convert Python objects to Java objects
def _to_java_object_rdd(rdd):
    """Return a JavaRDD of Object by unpickling.

    It will convert each Python object into a Java object via Pyrolite,
    whether or not the RDD is serialized in batch.
    """
    rdd = rdd._reserialize(AutoBatchedSerializer(PickleSerializer()))
    return rdd.ctx._jvm.org.apache.spark.mllib.api.python.SerDe.pythonToJava(rdd._jrdd, True)

# Convert the DataFrame to an RDD of Java objects
java_obj = _to_java_object_rdd(df.rdd)

# Estimate the in-memory size in bytes (renamed to avoid shadowing the built-in `bytes`)
size_bytes = sc._jvm.org.apache.spark.util.SizeEstimator.estimate(java_obj)
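A minimal usage sketch, not from the original gist: it assumes a live SparkSession (named `spark` here for illustration) and a small sample DataFrame, and reuses the function above. The snippet itself only assumes an existing DataFrame `df` and SparkContext `sc`.

# Illustrative usage; `spark` and the sample DataFrame are assumptions
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("size-estimate").getOrCreate()
sc = spark.sparkContext
df = spark.createDataFrame([(i, str(i)) for i in range(1000)], ["id", "value"])

java_obj = _to_java_object_rdd(df.rdd)
size_bytes = sc._jvm.org.apache.spark.util.SizeEstimator.estimate(java_obj)
print("Estimated size: %d bytes" % size_bytes)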
@Ezraorich

I am getting 43 MB with your code, but my storage stats show that this DataFrame is 82 MB. Any suggestions?
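A plausible explanation: SizeEstimator.estimate reports the deserialized, on-heap size of the Java object graph, while the Storage tab reports the size of the persisted blocks, which may be serialized and compressed, so the two figures measure different representations and routinely disagree. A sketch for reading the cached size directly, assuming `df` and `sc` from the gist above:

# Sketch: read the size Spark actually stores for a cached DataFrame.
# getRDDStorageInfo is a standard SparkContext API; `df` and `sc` are assumed.
df.cache()
df.count()  # force materialization so the blocks appear in storage info

for rdd_info in sc._jsc.sc().getRDDStorageInfo():
    print(rdd_info.name(), rdd_info.memSize(), "bytes in memory,",
          rdd_info.diskSize(), "bytes on disk")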
