Access Parquet files from S3 using ClickHouse

Accessing the prompt

Start clickhouse-client and connect it to your ClickHouse server:

$ clickhouse-client --host=<host> --port=<port> --user=<user> --password=<password>

You should be greeted by the ClickHouse prompt:

ClickHouse client version 21.12.3.32 (official build).
Connecting to localhost:9000 as user admin.
Connected to ClickHouse server version 21.12.3 revision 54452.

localhost :) 

At this prompt we can execute our queries.
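
As a quick sanity check, we can run a trivial query to confirm that the connection works:

SELECT version();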

Access individual Parquet files using S3

Here we are going to utilize the s3 table function:

SELECT *
FROM s3(
  'http://s3.amazonaws.com/[bucket_name]/[path]/filename.parquet',
  '<access-key>',
  '<secret-key>',
  'Parquet',
  '`column_name_a` Int64, `column_name_b` String'
)
LIMIT 10;
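
Side note: on this ClickHouse version the format and structure arguments are required, but newer releases (roughly 22.1 and later) can infer the schema from the Parquet file itself, so there you may be able to inspect it first with something like:

DESCRIBE TABLE s3(
  'http://s3.amazonaws.com/[bucket_name]/[path]/filename.parquet',
  '<access-key>',
  '<secret-key>',
  'Parquet'
);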

Notice that, in the structure argument, we surrounded the column names with backtick (`) characters. This allows us to reference column names containing dots (".") and other reserved characters.
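
For example, a file whose schema contains dotted field names (the names below are made up for illustration) could be queried like this:

SELECT `meta.source`, `nested.value.count`
FROM s3(
  'http://s3.amazonaws.com/[bucket_name]/[path]/filename.parquet',
  '<access-key>',
  '<secret-key>',
  'Parquet',
  '`meta.source` String, `nested.value.count` Int64'
)
LIMIT 10;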

Access Parquet files using wildcards

For example, we can select from all files in a partitioned dataset between 10 AM and 12 PM (noon) on 2022-01-11 using the following query:

SELECT *
FROM s3(
  'http://s3.amazonaws.com/[bucket_name]/year=2022/month=1/day=11/hour={10..11}/*.parquet',
  '<access-key>',
  '<secret-key>',
  'Parquet',
  '`column_name_a` Int64, `column_name_b` String'
)
LIMIT 10;
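
Besides {N..M} numeric ranges, the path also accepts * wildcards and {a,b} alternation. For instance, counting the rows for two specific days of the same hypothetical layout:

SELECT count()
FROM s3(
  'http://s3.amazonaws.com/[bucket_name]/year=2022/month=1/day={10,11}/hour=*/*.parquet',
  '<access-key>',
  '<secret-key>',
  'Parquet',
  '`column_name_a` Int64, `column_name_b` String'
);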

More complex filtering using _path or _file virtual columns

In addition to the access patterns above, we can also utilize the _path or _file virtual columns for filtering. To illustrate, say that we have a file structure in S3 with files like this:

year=2022/month=1/day=3/hour=21/5c454b8b35e8466aafc74960847970a0.parquet

This layout is quite common for partitioned datasets produced by tools such as PrestoDB or Apache Hive.

What if we want to access only files in a given date period, but the date is not stored inside the Parquet files themselves? Then we can utilize the virtual _path column, which contains the string above, and parse the date out of it.

First, how to parse such a string into a DateTime:

SELECT
  parseDateTimeBestEffortOrNull(
    replaceRegexpOne(
      'year=2022/month=1/day=3/hour=21/5c454b8b35e8466aafc74960847970a0.parquet',
      'year=(\\d+)/month=(\\d+)/day=(\\d+)/hour=(\\d+)/.*',
      '\\1-\\2-\\3 \\4:00:00'
    )
  ) as path_date;
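
This should yield something like:

┌───────────path_date─┐
│ 2022-01-03 21:00:00 │
└─────────────────────┘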

We can then construct queries like this over specific DateTime periods:

SELECT column_name_b,
  parseDateTimeBestEffortOrNull(
    replaceRegexpOne(
      _path,
      'year=(\\d+)/month=(\\d+)/day=(\\d+)/hour=(\\d+)/.*',
      '\\1-\\2-\\3 \\4:00:00'
    )
  ) as path_date
FROM s3(
  'http://s3.amazonaws.com/[bucket_name]/year=*/month=*/day=*/hour=*/*.parquet',
  '<access-key>',
  '<secret-key>',
  'Parquet',
  '`column_name_a` Int64, `column_name_b` String'
)
WHERE toYYYYMMDD(path_date) BETWEEN 20220113 AND 20220114
LIMIT 10;

Resulting in output like:

┌─column_name_b──────────────┬───────────path_date─┐
│ 540ea21f-a8b8-9fa703509eeb │ 2022-01-13 22:00:00 │
│ 3356eb45-35ef-09a726b18726 │ 2022-01-13 22:00:00 │
│ f8034960-6361-bf20826518c9 │ 2022-01-13 22:00:00 │
│ cc0bc4fb-1247-ac6d3e0d58c4 │ 2022-01-13 22:00:00 │
│ ec6dcafc-43d9-66afb8cdd48c │ 2022-01-13 22:00:00 │
│ f082fbe0-eae7-78da244c9649 │ 2022-01-13 22:00:00 │
│ 5b7c8ecc-6298-3dcbfce7b3ce │ 2022-01-13 22:00:00 │
│ 823e0606-905c-e0307813ed8a │ 2022-01-13 22:00:00 │
│ 99c9bcd3-437d-09ab6493324d │ 2022-01-13 22:00:00 │
│ eec25407-9766-aeebccb33adc │ 2022-01-13 22:00:00 │
└────────────────────────────┴─────────────────────┘
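
The _file virtual column works in the same way, except that it contains only the file name rather than the full path. For example, to filter on the file name itself (hypothetical pattern shown):

SELECT _file, column_name_a
FROM s3(
  'http://s3.amazonaws.com/[bucket_name]/year=2022/month=1/day=11/hour=*/*.parquet',
  '<access-key>',
  '<secret-key>',
  'Parquet',
  '`column_name_a` Int64, `column_name_b` String'
)
WHERE _file LIKE '5c45%'
LIMIT 10;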