@soaxelbrooke
Created August 7, 2023 01:31
Reading/Querying Parquet Datasets from Self-Hosted S3-Compatible Block Storage with s3fs + PyArrow + Polars
# Assumes credentials are already exported in the environment:
# export AWS_ACCESS_KEY_ID=youraccesskey
# export AWS_SECRET_ACCESS_KEY=yoursecretkey
import pyarrow.dataset as ds
import polars as pl
import s3fs
S3_ENDPOINT = "http://your.s3.endpoint:3900"
fs = s3fs.S3FileSystem(client_kwargs={"endpoint_url": S3_ENDPOINT})
# Paths are bucket-relative: do not include the s3:// prefix when passing an s3fs filesystem
foo_ds = ds.dataset("yourbucket/foo/", filesystem=fs, format="parquet")
bar_ds = ds.dataset("yourbucket/bar/", filesystem=fs, format="parquet")
# Create lazy frames backed by the dataset metadata
dataframes = {
    "foo": pl.scan_pyarrow_dataset(foo_ds),
    "bar": pl.scan_pyarrow_dataset(bar_ds),
}
sql = pl.SQLContext(frames=dataframes)
# Now query!
sql.execute("""
    select
        foo_id,
        avg(bar_rating) as rating_avg,
        count(*) as count
    from foo
    join bar using (foo_id)
    group by foo_id
""").collect()