@rjurney
Last active October 24, 2020 19:03
PyArrow now takes forever to load partitioned Parquet data. Why?
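import pyarrow.parquet as pq

# LOCAL_PATH, filesystem, tickers, columns, and index_columns are assumed to be defined earlier.

# Prepare the partition filter: keep only partitions whose Ticker is in `tickers`
filters = [
    [('Ticker', 'in', tickers)]
]
# Open the partitioned dataset with the filter applied
dataset = pq.ParquetDataset(
    path_or_paths=LOCAL_PATH,
    filesystem=filesystem,
    filters=filters,
    metadata_nthreads=4,
)
# Read only the requested columns into an Arrow table
table = dataset.read_pandas(
    columns=columns + index_columns,
    use_threads=True,
)
# Convert the Arrow table to a pandas DataFrame
df = table.to_pandas(
    use_threads=True,
)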
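One way to narrow down where the time goes is to time each of the three stages separately (opening the dataset, reading the table, converting to pandas). The following is a minimal sketch using only the standard library; the `timed` helper and the `timings` dict are hypothetical names introduced for illustration and are not part of PyArrow.

```python
import time
from contextlib import contextmanager

# Hypothetical helper: collects wall-clock seconds per labelled stage.
timings = {}

@contextmanager
def timed(label):
    """Record the wall-clock time spent inside the with-block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start

# Stand-in workload so the sketch runs on its own.
with timed("example"):
    total = sum(range(1000))

# Intended use with the snippet above (assumes its variables exist):
#   with timed("open dataset"):
#       dataset = pq.ParquetDataset(...)
#   with timed("read table"):
#       table = dataset.read_pandas(...)
#   with timed("to pandas"):
#       df = table.to_pandas(...)
for label, seconds in timings.items():
    print(f"{label}: {seconds:.3f}s")
```

If most of the time lands in the first stage, the slowdown is likely in opening the dataset rather than in reading row data, though that would need to be confirmed against the actual dataset.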