Created October 16, 2020 17:46
Pyspark / DataBricks DataFrame size estimation
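from pyspark.serializers import PickleSerializer, AutoBatchedSerializer


def _to_java_object_rdd(rdd):
    """Return a JavaRDD of Object by unpickling.

    It will convert each Python object into a Java object by Pyrolite, whether the
    RDD is serialized in batch or not.
    """
    rdd = rdd._reserialize(AutoBatchedSerializer(PickleSerializer()))
    # Assumption: the original line is truncated; this call mirrors PySpark's own MLlib helper.
    return rdd.ctx._jvm.org.apache.spark.mllib.api.python.SerDe.pythonToJava(rdd._jrdd, True)


def estimate_df_size(df):
    """Return a rough size estimate, in bytes, for a DataFrame."""
    JavaObj = _to_java_object_rdd(df.rdd)
    # Assumption: the original right-hand side is missing; Spark's JVM SizeEstimator is used here.
    nbytes = df.rdd.ctx._jvm.org.apache.spark.util.SizeEstimator.estimate(JavaObj)
    return nbytes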