@lpillmann
Last active November 16, 2023 05:52
Read partitioned Parquet files from Google Cloud Storage into a pandas DataFrame using PyArrow
import gcsfs
import pyarrow.parquet as pq


def read_parquet(gs_directory_path, to_pandas=True):
    """
    Read multiple (partitioned) parquet files from a GCS directory,
    e.g. 'gs://<bucket>/<directory>' (without a trailing /).
    """
    gs = gcsfs.GCSFileSystem()
    arrow_dataset = pq.ParquetDataset(gs_directory_path, filesystem=gs)
    if to_pandas:
        return arrow_dataset.read_pandas().to_pandas()
    return arrow_dataset
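
A minimal usage sketch (the bucket name and prefix below are hypothetical, and gcsfs is assumed to pick up Google Cloud application-default credentials):

# Load the whole partitioned dataset into pandas
df = read_parquet("gs://my-bucket/events/date_partitioned")
print(df.head())

# Or keep the PyArrow dataset, e.g. to inspect the schema before materializing
dataset = read_parquet("gs://my-bucket/events/date_partitioned", to_pandas=False)
print(dataset.schema)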
@samos123

It worked perfectly for me! Thanks a bunch!
