Access Parquet files from S3 using ClickHouse

Accessing the prompt

Start clickhouse-client and connect it to your ClickHouse server:

$ clickhouse-client --host=<host> --port=<port> --user=<user> --password=<password>

You should be greeted by the ClickHouse prompt:

ClickHouse client version 21.12.3.32 (official build).
Connecting to localhost:9000 as user admin.
Connected to ClickHouse server version 21.12.3 revision 54452.

localhost :) 

At this prompt we can execute our queries.
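
As a quick sanity check, we can run a trivial query to confirm that the connection works:

SELECT version();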

Access individual Parquet files using S3

Here we are going to utilize the s3 table function:

SELECT *
FROM s3(
  'http://s3.amazonaws.com/[bucket_name]/[path]/filename.parquet',
  '<access-key>',
  '<secret-key>',
  'Parquet',
  '`column_name_a` Int64, `column_name_b` String'
)
LIMIT 10;
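
Side note: on this ClickHouse version the format and structure arguments are required, but newer releases (roughly 22.1 and later) can infer the schema from the Parquet file itself, so there you may be able to inspect it first with something like:

DESCRIBE TABLE s3(
  'http://s3.amazonaws.com/[bucket_name]/[path]/filename.parquet',
  '<access-key>',
  '<secret-key>',
  'Parquet'
);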

Notice that, in the structure argument, we surrounded the column names with backtick (`) characters. This allows us to reference column names containing dots (".") and other reserved characters.
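
For example, a file whose schema contains dotted field names (the names below are made up for illustration) could be queried like this:

SELECT `meta.source`, `nested.value.count`
FROM s3(
  'http://s3.amazonaws.com/[bucket_name]/[path]/filename.parquet',
  '<access-key>',
  '<secret-key>',
  'Parquet',
  '`meta.source` String, `nested.value.count` Int64'
)
LIMIT 10;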

Access Parquet files using wildcards

For example, we can select from all files in a partitioned dataset between 10 AM and 12 PM (noon) on 2022-01-11 using the following query:

SELECT *
FROM s3(
  'http://s3.amazonaws.com/[bucket_name]/year=2022/month=1/day=11/hour={10..11}/*.parquet',
  '<access-key>',
  '<secret-key>',
  'Parquet',
  '`column_name_a` Int64, `column_name_b` String'
)
LIMIT 10;
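
Besides {N..M} numeric ranges, the path also accepts * wildcards and {a,b} alternation. For instance, counting the rows for two specific days of the same hypothetical layout:

SELECT count()
FROM s3(
  'http://s3.amazonaws.com/[bucket_name]/year=2022/month=1/day={10,11}/hour=*/*.parquet',
  '<access-key>',
  '<secret-key>',
  'Parquet',
  '`column_name_a` Int64, `column_name_b` String'
);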

More complex filtering using _path or _file virtual columns

In addition to the access patterns above, we can also utilize the _path or _file virtual columns for filtering. To illustrate, say that we have a file structure in S3 with files like this:

year=2022/month=1/day=3/hour=21/5c454b8b35e8466aafc74960847970a0.parquet

This layout is quite common for partitioned datasets produced by tools such as PrestoDB or Apache Hive.

What if we want to access only files in a given date period, but the date is not stored inside the Parquet files themselves? Then we can utilize the virtual _path column, which contains the string above, and parse the date out of it.

First, how to parse such a string into a DateTime:

SELECT
  parseDateTimeBestEffortOrNull(
    replaceRegexpOne(
      'year=2022/month=1/day=3/hour=21/5c454b8b35e8466aafc74960847970a0.parquet',
      'year=(\\d+)/month=(\\d+)/day=(\\d+)/hour=(\\d+)/.*',
      '\\1-\\2-\\3 \\4:00:00'
    )
  ) as path_date;
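
This should yield something like:

┌───────────path_date─┐
│ 2022-01-03 21:00:00 │
└─────────────────────┘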

We can then construct queries like this over specific DateTime periods:

SELECT column_name_b,
  parseDateTimeBestEffortOrNull(
    replaceRegexpOne(
      _path,
      'year=(\\d+)/month=(\\d+)/day=(\\d+)/hour=(\\d+)/.*',
      '\\1-\\2-\\3 \\4:00:00'
    )
  ) as path_date
FROM s3(
  'http://s3.amazonaws.com/[bucket_name]/year=*/month=*/day=*/hour=*/*.parquet',
  '<access-key>',
  '<secret-key>',
  'Parquet',
  '`column_name_a` Int64, `column_name_b` String'
)
WHERE toYYYYMMDD(path_date) BETWEEN 20220113 AND 20220114
LIMIT 10;

Resulting in output like:

┌─column_name_b──────────────┬───────────path_date─┐
│ 540ea21f-a8b8-9fa703509eeb │ 2022-01-13 22:00:00 │
│ 3356eb45-35ef-09a726b18726 │ 2022-01-13 22:00:00 │
│ f8034960-6361-bf20826518c9 │ 2022-01-13 22:00:00 │
│ cc0bc4fb-1247-ac6d3e0d58c4 │ 2022-01-13 22:00:00 │
│ ec6dcafc-43d9-66afb8cdd48c │ 2022-01-13 22:00:00 │
│ f082fbe0-eae7-78da244c9649 │ 2022-01-13 22:00:00 │
│ 5b7c8ecc-6298-3dcbfce7b3ce │ 2022-01-13 22:00:00 │
│ 823e0606-905c-e0307813ed8a │ 2022-01-13 22:00:00 │
│ 99c9bcd3-437d-09ab6493324d │ 2022-01-13 22:00:00 │
│ eec25407-9766-aeebccb33adc │ 2022-01-13 22:00:00 │
└────────────────────────────┴─────────────────────┘
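
The _file virtual column works in the same way, except that it contains only the file name rather than the full path. For example, to filter on the file name itself (hypothetical pattern shown):

SELECT _file, column_name_a
FROM s3(
  'http://s3.amazonaws.com/[bucket_name]/year=2022/month=1/day=11/hour=*/*.parquet',
  '<access-key>',
  '<secret-key>',
  'Parquet',
  '`column_name_a` Int64, `column_name_b` String'
)
WHERE _file LIKE '5c45%'
LIMIT 10;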