@rjurney
Created August 28, 2019 20:09
How does one load Parquet from S3 in Pandas/PyArrow?
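import pandas as pd
import pyarrow  # Parquet engine used by pandas (engine='pyarrow')
import s3fs     # lets pandas resolve s3:// URLs

# Read only the post body and the 24 label columns from the Parquet file on S3
posts_df = pd.read_parquet(
    's3://stackoverflow-events/08-05-2019/Questions.Stratified.Final.50000.parquet',
    columns=['_Body'] + ['label_{}'.format(i) for i in range(0, 24)],
    engine='pyarrow'
)
posts_df.head(5)

# The read_parquet call above fails with:
# >>>> FileNotFoundError: stackoverflow-events/08-05-2019/Questions.Stratified.Final.50000.parquet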
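One way to narrow down the error above is to take pandas' URL handling out of the picture: open the object through s3fs directly, check that the key exists first, and only then hand the open file to pandas. The following is a minimal sketch under assumptions, not a confirmed fix. The helper names (label_columns, strip_scheme, load_parquet_from_s3) are hypothetical, and it assumes s3fs can find valid AWS credentials and that the path refers to a single Parquet file rather than a directory of part files.

```python
def label_columns(n_labels=24):
    """Column list from the snippet above: '_Body' plus label_0..label_{n-1}."""
    return ['_Body'] + ['label_{}'.format(i) for i in range(n_labels)]


def strip_scheme(uri):
    """Drop a leading 's3://' so the path is in the 'bucket/key' form s3fs uses."""
    prefix = 's3://'
    return uri[len(prefix):] if uri.startswith(prefix) else uri


def load_parquet_from_s3(uri, columns, fs=None):
    """Check that the key exists, then read it with pandas via an open file handle.

    Hypothetical helper; `fs` is any object with s3fs-style exists() and open().
    """
    if fs is None:
        import s3fs
        # Assumption: credentials come from the environment or ~/.aws
        fs = s3fs.S3FileSystem()

    key = strip_scheme(uri)
    if not fs.exists(key):
        # Fail early with a message that says what was actually looked up
        raise FileNotFoundError(
            'No object at {!r}; check the bucket, key, and credentials'.format(key)
        )

    # Imported here so the existence check above runs even without pandas
    import pandas as pd
    with fs.open(key, 'rb') as f:
        return pd.read_parquet(f, columns=columns, engine='pyarrow')


if __name__ == '__main__':
    posts_df = load_parquet_from_s3(
        's3://stackoverflow-events/08-05-2019/Questions.Stratified.Final.50000.parquet',
        columns=label_columns(),
    )
    print(posts_df.head(5))
```

If the existence check fails, listing the parent prefix with fs.ls('stackoverflow-events/08-05-2019') may show whether the name is really a directory of part files or differs in spelling. If the check passes but the original call still fails, the installed pandas and s3fs versions are worth comparing.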