@mneedham
Created October 21, 2022 13:58
Queries against DuckDB
SELECT count(*)
FROM 'data/*.parquet';
SELECT *
FROM 'data/*.parquet'
LIMIT 10;
DESCRIBE
SELECT *
FROM 'data/yellow_tripdata_2011-07.parquet';
DESCRIBE
SELECT *
FROM 'data/yellow_tripdata_2009-10.parquet';
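-- A possible way to compare the two schemas in one result (a sketch, not part
-- of the original queries). It assumes parquet_schema accepts the same glob as
-- parquet_metadata and reports file_name with the 'data/' prefix.
SELECT file_name, name, type
FROM parquet_schema('data/*.parquet')
WHERE file_name IN (
'data/yellow_tripdata_2011-07.parquet', 'data/yellow_tripdata_2009-10.parquet'
)
ORDER BY file_name, name;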
-- The two DESCRIBE results show completely different schemas for these files,
-- so the glob queries above are presumably ignoring a bunch of the Parquet files
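-- One way to check that assumption (a sketch, not part of the original queries):
-- the filename option of read_parquet adds a column holding each row's source
-- file, so a per-file row count shows which files the glob actually reads.
-- Assumes a DuckDB version that supports filename = true; some versions may
-- raise a schema mismatch error here instead of skipping files.
SELECT filename, count(*) AS row_count
FROM read_parquet('data/*.parquet', filename = true)
GROUP BY filename
ORDER BY filename;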
SELECT string_split(file_name, '_')[-1] AS filename, stats_min, stats_max
FROM parquet_metadata('data/*.parquet')
WHERE path_in_schema IN (
'tpep_pickup_datetime', 'tpep_dropoff_datetime'
)
ORDER BY stats_min
LIMIT 20;