@klesouza
Created February 8, 2021 10:12
Calculating the total compressed size of a Parquet column
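import pyarrow.parquet

def calculate_column_size(column, file):
    """Return (compressed bytes, fraction of the file total) for a column and its nested fields."""
    row_groups = pyarrow.parquet.read_metadata(file).to_dict()["row_groups"]
    chunks = [c for rg in row_groups for c in rg["columns"]]
    # Match the column itself or a nested field below it, so "a" matches "a.b" but not "ab".
    sz = sum(c["total_compressed_size"] for c in chunks
             if c["path_in_schema"] == column or c["path_in_schema"].startswith(column + "."))
    totalsize = sum(c["total_compressed_size"] for c in chunks)
    # A file without row groups has a total of 0; avoid dividing by zero.
    return sz, (sz / totalsize if totalsize else 0.0)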