@mneedham
Created October 14, 2022 18:24
An intro to Apache Parquet
# The NYC Taxis Dataset - https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
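# Inspect the file from the command line with parquet-cli, which provides the parq command
pip install parquet-cli
# File-level metadata
parq data/yellow_tripdata_2022-01.parquet
# Column names and types
parq data/yellow_tripdata_2022-01.parquet --schema
# First and last 10 rows
parq data/yellow_tripdata_2022-01.parquet --head 10
parq data/yellow_tripdata_2022-01.parquet --tail 10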
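# Read the same file from Python with PyArrow
import pyarrow.parquet as pq
file = pq.ParquetFile("data/yellow_tripdata_2022-01.parquet")
file.metadata  # file-level metadata: row count, row groups, writer
file.schema    # column names and types
# Load the whole file into a pandas DataFrame once and reuse it below
df = file.read().to_pandas()
df
# Export to CSV and newline-delimited JSON to compare sizes against Parquet
df.to_csv("trips.csv")
df.to_json("trips.json", orient="records", lines=True)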
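# Compare file sizes. stat -f %z is the BSD/macOS form (GNU stat on Linux uses stat -c %s);
# numfmt comes from GNU coreutils.
stat -f %z data/yellow_tripdata_2022-01.parquet | numfmt --to=iec
stat -f %z trips.csv | numfmt --to=iec
stat -f %z trips.json | numfmt --to=iec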
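# A portable alternative to the stat/numfmt lines above, as a minimal sketch using only the
# Python standard library. It assumes the three files sit at the paths used earlier.
# human_size and report_sizes are hypothetical helper names, and the formatting only
# approximates numfmt --to=iec (rounding may differ).

```python
import os


def human_size(num_bytes: int) -> str:
    """Format a byte count with 1024-based suffixes (hypothetical helper)."""
    size = float(num_bytes)
    for suffix in ("", "K", "M", "G", "T"):
        # Stop once the value fits under 1024, or at the largest suffix we know.
        if size < 1024 or suffix == "T":
            return f"{size:.0f}" if suffix == "" else f"{size:.1f}{suffix}"
        size /= 1024


def report_sizes(paths):
    """Print the on-disk size of each path, skipping files that do not exist."""
    for path in paths:
        if os.path.exists(path):
            print(f"{path}: {human_size(os.path.getsize(path))}")
        else:
            print(f"{path}: not found")


if __name__ == "__main__":
    # Paths assumed from the earlier steps in this gist.
    report_sizes([
        "data/yellow_tripdata_2022-01.parquet",
        "trips.csv",
        "trips.json",
    ])
```

# Running this after the export step prints one line per file, which makes the Parquet vs
# CSV vs JSON size difference easy to compare on any platform.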