Created October 16, 2020 17:46
Pyspark / DataBricks DataFrame size estimation
from pyspark.serializers import PickleSerializer, AutoBatchedSerializer

def _to_java_object_rdd(rdd):
    """Return a JavaRDD of Object by unpickling.

    Converts each Python object into a Java object via Pyrolite,
    whether or not the RDD is serialized in batches.
    """
    rdd = rdd._reserialize(AutoBatchedSerializer(PickleSerializer()))
    return rdd.ctx._jvm.org.apache.spark.mllib.api.python.SerDe.pythonToJava(rdd._jrdd, True)

def estimate_df_size(df):
    """Estimate the in-memory size of a DataFrame, in bytes, via Spark's SizeEstimator."""
    java_obj = _to_java_object_rdd(df.rdd)
    # Use the DataFrame's own SparkContext rather than relying on a global `sc`.
    nbytes = df.rdd.ctx._jvm.org.apache.spark.util.SizeEstimator.estimate(java_obj)
    return nbytes