
@lpillmann
Last active November 16, 2023 05:52
Read partitioned parquet files into pandas DataFrame from Google Cloud Storage using PyArrow
import gcsfs
import pyarrow.parquet


def read_parquet(gs_directory_path, to_pandas=True):
    """
    Read multiple (partitioned) parquet files from a GCS directory,
    e.g. 'gs://<bucket>/<directory>' (without a trailing /).
    """
    gs = gcsfs.GCSFileSystem()
    arrow_dataset = pyarrow.parquet.ParquetDataset(gs_directory_path, filesystem=gs)
    if to_pandas:
        return arrow_dataset.read_pandas().to_pandas()
    return arrow_dataset
@uchiiii commented Aug 11, 2021

Hi everyone!
Unfortunately, I got errors like the ones below.

OSError: Passed non-file path:  gs://<bucket>/<folder>

or

ArrowInvalid: Parquet file size is 0 bytes

I found another way here to achieve the same thing, which will hopefully help someone.

Note that pandas does not support this.

@freedomtowin

Cool, thank you!

@samos123

It worked perfectly for me! Thanks a bunch!
