This gist uses UK Property Price Paid example dataset.
mac3.local :) select count(*) from uk_price_paid
┌──count()─┐
│ 27359802 │
└──────────┘
mac3.local :) select formatReadableSize(total_bytes) from system.tables where name = 'uk_price_paid'
┌─formatReadableSize(total_bytes)─┐
│ 291.09 MiB │
└─────────────────────────────────┘
Using S3 Table Functions
Inserting data from local clickhouse instance to s3:
INSERT INTO FUNCTION s3(
'https://ruslan-clickhouse-backup-experiment.s3.eu-north-1.amazonaws.com/csv/uk_price_paid_{_partition_id}.csv.lz4',
-- see https://clickhouse.com/docs/en/integrations/s3/s3-table-engine#managing-credentials
'access_key_id',
'secret_access_key',
'CSV' -- or https://clickhouse.com/docs/en/interfaces/formats/#csvwithnames
) PARTITION BY rand() % 10
-- a where clause can be added here to only upload prev period's (day, week, month) data
SELECT * FROM uk_price_paid;
Ok.
0 rows in set. Elapsed: 25.004 sec. Processed 27.36 million rows, 1.31 GB (1.09 million rows/s., 52.37 MB/s.)
Checking s3 bucket contents:
> aws s3 ls s3://ruslan-clickhouse-backup-experiment/csv/
2022-08-12 17:05:41 62139785 uk_price_paid_0.csv.lz4
2022-08-12 17:05:41 62145476 uk_price_paid_1.csv.lz4
2022-08-12 17:05:41 62136800 uk_price_paid_2.csv.lz4
2022-08-12 17:05:41 62183195 uk_price_paid_3.csv.lz4
2022-08-12 17:05:41 62148444 uk_price_paid_4.csv.lz4
2022-08-12 17:05:41 62157473 uk_price_paid_5.csv.lz4
2022-08-12 17:05:41 62177198 uk_price_paid_6.csv.lz4
2022-08-12 17:05:41 62160887 uk_price_paid_7.csv.lz4
2022-08-12 17:05:41 62160395 uk_price_paid_8.csv.lz4
2022-08-12 17:05:41 62143395 uk_price_paid_9.csv.lz4
Trying to read a single column from s3 bucket:
mac3.local :) select c2 as date from s3('https://ruslan-clickhouse-backup-experiment.s3.eu-north-1.amazonaws.com/csv/uk_price_paid_*.csv.lz4', 'access_key_id', 'secret_access_key', 'CSV') limit 10;
┌─date───────┐
│ 1997-10-31 │
│ 1997-06-27 │
│ 1997-02-10 │
│ 1999-03-09 │
│ 2000-12-22 │
│ 2000-10-25 │
│ 2000-02-21 │
│ 2001-08-10 │
│ 2000-08-21 │
│ 1996-09-06 │
└────────────┘
10 rows in set. Elapsed: 1.834 sec.
Creating a table to restore data into:
mac3.local :) CREATE TABLE uk_price_paid_2
(
price UInt32,
date Date,
postcode1 LowCardinality(String),
postcode2 LowCardinality(String),
type Enum8('terraced' = 1, 'semi-detached' = 2, 'detached' = 3, 'flat' = 4, 'other' = 0),
is_new UInt8,
duration Enum8('freehold' = 1, 'leasehold' = 2, 'unknown' = 0),
addr1 String,
addr2 String,
street LowCardinality(String),
locality LowCardinality(String),
town LowCardinality(String),
district LowCardinality(String),
county LowCardinality(String),
category UInt8
) ENGINE = MergeTree ORDER BY (postcode1, postcode2, addr1, addr2);
Restoring the data from s3:
mac3.local :) insert into uk_price_paid_2 select * from s3('https://ruslan-clickhouse-backup-experiment.s3.eu-north-1.amazonaws.com/csv/uk_price_paid_*.csv.lz4', 'access_key_id', 'secret_access_key', 'CSV');
Ok.
0 rows in set. Elapsed: 57.638 sec. Processed 27.36 million rows, 6.43 GB (474.68 thousand rows/s., 111.55 MB/s.)
mac3.local :) select count() from uk_price_paid_2
┌──count()─┐
│ 27359802 │
└──────────┘
This is so cool :)