@justinnaldzin
Created July 18, 2018 19:29
Estimate size of Spark DataFrame in bytes
from pyspark.serializers import AutoBatchedSerializer, PickleSerializer

# Function to convert an RDD of Python objects to an RDD of Java objects
def _to_java_object_rdd(rdd):
    """Return a JavaRDD of Object by unpickling.

    Converts each Python object into a Java object via Pyrolite,
    whether or not the RDD is serialized in batches.
    """
    rdd = rdd._reserialize(AutoBatchedSerializer(PickleSerializer()))
    return rdd.ctx._jvm.org.apache.spark.mllib.api.python.SerDe.pythonToJava(rdd._jrdd, True)

# Convert the DataFrame to an RDD of Java objects
java_obj = _to_java_object_rdd(df.rdd)

# Estimate size in bytes (named size_bytes to avoid shadowing the built-in `bytes`)
size_bytes = sc._jvm.org.apache.spark.util.SizeEstimator.estimate(java_obj)
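
A minimal usage sketch (assuming an active SparkSession named `spark`; the example DataFrame is hypothetical, and the helper is the one defined above):

# Hypothetical usage: estimate the size of a small example DataFrame
# and print the result in megabytes.
sc = spark.sparkContext
df = spark.range(0, 150)  # 150-row example DataFrame with a single "id" column
java_obj = _to_java_object_rdd(df.rdd)
size_bytes = sc._jvm.org.apache.spark.util.SizeEstimator.estimate(java_obj)
print("Estimated size: {:.2f} MB".format(size_bytes / (1024 ** 2)))

Note that SizeEstimator reports the deserialized, in-memory footprint of the JVM objects, which depends on the JVM, the Spark version, and partitioning, so it will generally not match serialized or on-disk sizes.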
itsmano1993 commented Aug 14, 2020

Hi Justin, thanks for this info. I would like to ask a question: when I use this function locally, I get a DataFrame size of 3 MB for a 150-row dataset, but when I run the same code in Databricks I get 30 MB. Any thoughts?

@Ezraorich
I am getting 43 MB with your code, but my storage stats show that this DataFrame has 82 MB. Any suggestions?
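
One likely source of such gaps: SizeEstimator measures the deserialized in-memory JVM footprint, while storage stats usually report serialized (often compressed) sizes, so the two figures can differ by a large factor. A sketch for cross-checking against Spark's own storage accounting (assuming the DataFrame fits in memory; getRDDStorageInfo is a developer API and may change between versions):

# Cache the DataFrame, materialize it, then read Spark's storage accounting.
df.cache()
df.count()  # force materialization of the cache
for info in sc._jsc.sc().getRDDStorageInfo():
    print(info.name(), "memory:", info.memSize(), "disk:", info.diskSize())
df.unpersist()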
