Skip to content

Instantly share code, notes, and snippets.

@lambrospetrou
Last active January 19, 2024 12:39
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save lambrospetrou/77f441d55cf75679b1b7fba8ee0e553c to your computer and use it in GitHub Desktop.
Save lambrospetrou/77f441d55cf75679b1b7fba8ee0e553c to your computer and use it in GitHub Desktop.
DuckDB JSON to Parquet in S3

DuckDB JSON to Parquet

Command to generate sample NDJSON files:

copy (select * from (select a.range as 'a' from range(1000) as a), (select b.range as 'b' from range(1000) as b)) to 'range.ndjson';

Creates a 2x1000000 table:

select count(1) from 'range.ndjson';
┌──────────┐
│ count(1) │
│  int64   │
├──────────┤
│  1000000 │
└──────────┘

select * from 'range.ndjson' limit 10;
┌───────┬───────┐
│   a   │   b   │
│ int64 │ int64 │
├───────┼───────┤
│     00 │
│     10 │
│     20 │
│     30 │
│     40 │
│     50 │
│     60 │
│     70 │
│     80 │
│     90 │
├───────┴───────┤
│    10 rows    │
└───────────────┘

Convert JSON files to Parquet

copy (select * from 'range.ndjson') to 'range.parquet';

select count(1) from 'range.parquet';
┌──────────┐
│ count(1) │
│  int64   │
├──────────┤
│  1000000 │
└──────────┘

Or for multiple JSON files:

copy (select * from 'prefix*.ndjson') to 'range.parquet';

We can also directly upload to S3 instead of a local file, and it will automatically use S3 multipart upload!

copy (select * from 'prefix*.ndjson') to 's3://some-bucket/range.parquet';

And comparing the sizes of NDJSON vs Parquet:

ll range.*
-rw-r--r-- 1 lambros lambros  17M Jan 18 22:11 range.ndjson
-rw-r--r-- 1 lambros lambros 1.2M Jan 18 22:11 range.parquet

References:

Example files:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment