lambrospetrou/readme.md

## readme.md

      
    Raw
  

              readme.md
            
          
    DuckDB JSON to Parquet

Command to generate sample NDJSON files:
copy (select * from (select a.range as 'a' from range(1000) as a), (select b.range as 'b' from range(1000) as b)) to 'range.ndjson';
Creates a 2x1000000 table:
select count(1) from 'range.ndjson';
┌──────────┐
│ count(1) │
│  int64   │
├──────────┤
│  1000000 │
└──────────┘

select * from 'range.ndjson' limit 10;
┌───────┬───────┐
│   a   │   b   │
│ int64 │ int64 │
├───────┼───────┤
│     0 │     0 │
│     1 │     0 │
│     2 │     0 │
│     3 │     0 │
│     4 │     0 │
│     5 │     0 │
│     6 │     0 │
│     7 │     0 │
│     8 │     0 │
│     9 │     0 │
├───────┴───────┤
│    10 rows    │
└───────────────┘
Convert JSON files to Parquet

copy (select * from 'range.ndjson') to 'range.parquet';

select count(1) from 'range.parquet';
┌──────────┐
│ count(1) │
│  int64   │
├──────────┤
│  1000000 │
└──────────┘
Or for multiple JSON files:
copy (select * from 'prefix*.ndjson') to 'range.parquet';
We can also directly upload to S3 instead of a local file, and it will automatically use S3 multipart upload!
copy (select * from 'prefix*.ndjson') to 's3://some-bucket/range.parquet';
And comparing the sizes of NDJSON vs Parquet:
ll range.*
-rw-r--r-- 1 lambros lambros  17M Jan 18 22:11 range.ndjson
-rw-r--r-- 1 lambros lambros 1.2M Jan 18 22:11 range.parquet
References:

https://duckdb.org/docs/extensions/json
https://duckdb.org/docs/sql/statements/copy
https://duckdb.org/docs/extensions/httpfs#writing

Example files:

https://shell.duckdb.org/data/tpch/0_01/parquet/customer.parquet
https://shell.duckdb.org/data/tpch/0_01/parquet/nation.parquet
https://raw.githubusercontent.com/tobilg/aws-iam-data/main/data/parquet/aws_services.parquet
https://raw.githubusercontent.com/tobilg/aws-edge-locations/main/data/aws-edge-locations.parquet
https://raw.githubusercontent.com/tobilg/public-cloud-provider-ip-ranges/main/data/providers/all.parquet
https://raw.githubusercontent.com/tobilg/public-cloud-provider-ip-ranges/main/data/providers/all.csv