By Anthony Vilarim Caliani
This is an example of working with partitioned Parquet. Here you will find how to read and write partitioned Parquet files.
In this example I'm using a Netflix Shows dataset. Thanks to Shivam Bansal for sharing it.
The important thing here is the code, but if you want to execute it, there is a run.sh script to help you out.
# First, run ingestion script...
bash run.sh ingest
# And then you can read some data...
bash run.sh read # Read all data
bash run.sh read 2009 # Read all data from "2009"
bash run.sh read 2009 5 # Read all data from "May 2009"
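The optional year and month arguments line up with the `release_year=` and `release_month=` directories in the partitioned layout. Here is a minimal sketch of how such arguments could be turned into a partition path. The helper name `partition_path` is hypothetical and not taken from run.sh, and the real script may filter differently.

```python
from pathlib import PurePosixPath
from typing import Optional

# Dataset root as it appears in this README.
BASE = PurePosixPath("data/netflix/shows.parquet")


def partition_path(year: Optional[int] = None, month: Optional[int] = None) -> str:
    """Build the directory for the requested slice (hypothetical helper)."""
    if month is not None and year is None:
        raise ValueError("a month can only be given together with a year")
    path = BASE
    if year is not None:
        path = path / f"release_year={year}"
    if month is not None:
        path = path / f"release_month={month}"
    return str(path)


print(partition_path())         # whole dataset
print(partition_path(2009))     # everything from 2009
print(partition_path(2009, 5))  # May 2009 only
```

If the files are read with Spark (the `_SUCCESS` marker and `part-*` file names suggest so, but that is an assumption), pointing the reader at a partition directory drops the partition columns from the result unless the `basePath` option is set to the dataset root. Loading the root and filtering on `release_year` and `release_month` is the other common choice.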
Partitioned by default
data/netflix/shows-default.parquet
├── _SUCCESS
└── part-00000-341edcc1-f245-46d7-85f9-9f54a8b862ac-c000.snappy.parquet
Partitioned by Year and Month
data/netflix/shows.parquet
├── _SUCCESS
├── release_year=2008
│   ├── release_month=1
│   │   └── part-00000-b60d3cb1-629f-4034-bebc-c75e0341e1b4.c000.snappy.parquet
│   └── release_month=2
│       └── part-00000-b60d3cb1-629f-4034-bebc-c75e0341e1b4.c000.snappy.parquet
└── release_year=2009
    ├── release_month=11
    │   └── part-00000-b60d3cb1-629f-4034-bebc-c75e0341e1b4.c000.snappy.parquet
    └── release_month=5
        └── part-00000-b60d3cb1-629f-4034-bebc-c75e0341e1b4.c000.snappy.parquet
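The directory names in this layout follow the `column=value` convention, so the partition values of any file can be recovered from its path alone. Here is a small sketch of that idea. The helper name `partition_values` is hypothetical, and values come back as strings because the path carries no type information.

```python
from pathlib import PurePosixPath
from typing import Dict


def partition_values(file_path: str) -> Dict[str, str]:
    """Collect every `column=value` path segment into a dict (hypothetical helper)."""
    values: Dict[str, str] = {}
    for segment in PurePosixPath(file_path).parts:
        key, sep, value = segment.partition("=")
        # Keep only segments that really look like `column=value`.
        if sep and key and value:
            values[key] = value
    return values


example = (
    "data/netflix/shows.parquet/release_year=2009/release_month=5/"
    "part-00000-b60d3cb1-629f-4034-bebc-c75e0341e1b4.c000.snappy.parquet"
)
print(partition_values(example))
# prints {'release_year': '2009', 'release_month': '5'}
```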
If you want to execute this code locally, you have to download the dataset from Kaggle and then add the file to the ./data/ folder.