@lpillmann
Last active November 16, 2023 05:52
Read partitioned parquet files into pandas DataFrame from Google Cloud Storage using PyArrow
import gcsfs
import pyarrow.parquet


def read_parquet(gs_directory_path, to_pandas=True):
    """
    Reads multiple (partitioned) parquet files from a GCS directory,
    e.g. 'gs://<bucket>/<directory>' (without trailing /).
    """
    gs = gcsfs.GCSFileSystem()  # authenticates via default Google credentials
    # ParquetDataset discovers all parquet files under the directory,
    # including Hive-style partition subdirectories
    arrow_df = pyarrow.parquet.ParquetDataset(gs_directory_path, filesystem=gs)
    if to_pandas:
        return arrow_df.read_pandas().to_pandas()
    return arrow_df
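
A minimal usage sketch (the bucket and directory names below are placeholders, not from the gist):

    # Hypothetical path: replace with your own bucket/directory
    df = read_parquet("gs://my-bucket/my-dataset")
    print(df.head())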
@felipejardimf

Hey @lpillmann!

Yeah, if I set this path I can reach the files: gs://bucket/folder/DATA_PART=201801

But how do I access paths like this: gs://bucket/folder/*?

I ask because in other environments I can usually read from a wildcard path like that.

Thank you for your help!!

@lpillmann
Author

Got it, @felipejardimf.

I'd expect PyArrow to be able to read from that path if you pass gs://bucket/folder as gs_directory_path.

However, I'm not able to test it right now. You might want to take a look at the pyarrow.parquet.ParquetDataset documentation and see whether any parameters need tweaking for that to work.
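
A rough sketch of what I mean (untested on my end; the bucket/folder names and the DATA_PART column are taken from your example above):

    # Read the whole partitioned directory, then filter on the
    # partition column in pandas instead of using a wildcard path.
    df = read_parquet("gs://bucket/folder")
    df_201801 = df[df["DATA_PART"] == "201801"]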

@uchiiii

uchiiii commented Aug 11, 2021

Hi everyone!
Unfortunately, I got errors like the ones below.

OSError: Passed non-file path:  gs://<bucket>/<folder>

or

ArrowInvalid: Parquet file size is 0 bytes

I found another way here to achieve the same, which will hopefully help someone.

Note that pandas does not support this.
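
For reference, one alternative that avoids passing a gs:// URI directly to ParquetDataset (just a sketch, not necessarily the approach linked above):

    import gcsfs
    import pyarrow.parquet as pq

    # Pass the filesystem explicitly and use a scheme-less path, which
    # sidesteps the "Passed non-file path" error on some pyarrow versions.
    fs = gcsfs.GCSFileSystem()
    table = pq.read_table("<bucket>/<folder>", filesystem=fs)
    df = table.to_pandas()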

@freedomtowin

Cool, thank you!

@samos123

It worked perfectly for me! Thanks a bunch!
