
@gfranxman
Created October 16, 2020 17:46
PySpark / Databricks DataFrame size estimation
from pyspark.serializers import PickleSerializer, AutoBatchedSerializer


def _to_java_object_rdd(rdd):
    """Return a JavaRDD of Object by unpickling.

    Pyrolite converts each Python object into a Java object,
    whether or not the RDD is serialized in batches.
    """
    rdd = rdd._reserialize(AutoBatchedSerializer(PickleSerializer()))
    return rdd.ctx._jvm.org.apache.spark.mllib.api.python.SerDe.pythonToJava(rdd._jrdd, True)


def estimate_df_size(df):
    """Estimate the in-memory size of a DataFrame, in bytes."""
    java_obj = _to_java_object_rdd(df.rdd)
    # Use the RDD's own SparkContext instead of relying on a global `sc`.
    nbytes = df.rdd.ctx._jvm.org.apache.spark.util.SizeEstimator.estimate(java_obj)
    return nbytes