- Backend developer
- Ruby / Shell / PostgreSQL / Vim
- Ruby since 2007.
- PostgreSQL since 2004.
- The weather dataset - our playground today!
- SQL walkthrough - statistics with PostgreSQL.
- The TimescaleDB walkthrough - time-series with superpowers.
- The Ruby walkthrough with the timescaledb gem.
OpenWeather:
- Free data from the entire world.
- Free API.
- Statistics from anywhere.
- Time-series data.
The focus will be on weather metrics.
Connecting to the open_weather database via psql:
psql open_weather
Describe the weather_metrics table:
\d weather_metrics
The Hypertable 5W1H!
- Who: The timescaledb extension
- What: Hypertable
- When: you need to handle time-series data (insert, select, update, delete)
- Where: In your PostgreSQL database
- Why: to optimize time-series throughput
- How: using table partitions to compress, parallelize, and manage smaller chunks of data.
\d+ weather_metrics
In psql we can enable query timing:
\timing
SELECT count(1) FROM weather_metrics ; # => 4092484
Time: 227.889 ms
TimescaleDB offers a different counting approach that closely approximates the real count.
SELECT approximate_row_count('weather_metrics') ; # => 4092484
Time: 14.310 ms
Note that 227.889 / 14.310 ≈ 16, so it's roughly 16 times faster.
Understanding a bit of the execution plan:
EXPLAIN SELECT count(1) FROM weather_metrics;
Divide and conquer!
-> Parallel Append (cost=0.29..56957.24 rows=1023114 width=0)
   -> Parallel Index Only Scan using _hyper_1_...idx on _hyper_1_...
SELECT time, temp_c
FROM weather_metrics
WHERE city_name = 'New York'
AND time BETWEEN '2022-06-01' AND '2022-06-02'
ORDER BY 1 LIMIT 20;
Get the average temperature grouped by one-hour buckets:
SELECT time_bucket('1 hour', time) AS bucket,
avg(temp_c)
FROM weather_metrics
WHERE city_name = 'New York'
AND time BETWEEN '2022-06-01' AND '2022-06-02'
GROUP BY 1 ORDER BY 1;
time_bucket also supports timestamps with time zones.
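Conceptually, time_bucket truncates each timestamp to the start of its fixed-width bucket. A plain-Ruby sketch of the idea (the time_bucket helper below is hypothetical, for illustration only):

```ruby
require 'time'

# Truncate a timestamp to the start of its fixed-width bucket,
# mirroring what SQL's time_bucket('1 hour', time) does.
def time_bucket(width_seconds, t)
  Time.at((t.to_i / width_seconds) * width_seconds).utc
end

t = Time.utc(2022, 6, 1, 10, 42, 13)
time_bucket(3600, t) # => 2022-06-01 10:00:00 UTC
```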
Now, let's get a bit more detail by adding min and max:
SELECT time_bucket('1 hour'::interval, time) AS bucket,
avg(temp_c)::numeric(4,2),
min(temp_c), max(temp_c)
FROM weather_metrics
WHERE city_name = 'New York'
AND time BETWEEN '2022-06-01' AND '2022-06-02'
GROUP BY 1 ORDER BY 1;
Now, we can also check the standard deviation:
SELECT time_bucket('1 hour'::interval, time) AS bucket,
avg(temp_c)::numeric(4,2),
min(temp_c), max(temp_c), stddev(temp_c)
FROM weather_metrics
WHERE city_name = 'New York'
AND time BETWEEN '2022-06-01' AND '2022-06-02'
GROUP BY 1 ORDER BY 1;
Let's go deep into a single record to understand the standard deviation:
SELECT time_bucket('1 hour'::interval, time) AS bucket,
count(*),
avg(temp_c)::numeric(4,2),
min(temp_c), max(temp_c), stddev(temp_c)
FROM weather_metrics
WHERE city_name = 'New York'
AND time BETWEEN '2022-06-01 00:00:00' AND '2022-06-01 01:00:00'
GROUP BY 1 ORDER BY 1;
Now, let's look at the individual values inside this hour:
SELECT time_bucket('1 hour'::interval, time) AS bucket,
array_agg( temp_c)
FROM weather_metrics
WHERE city_name = 'New York'
AND time BETWEEN '2022-06-01 00:00:00' AND '2022-06-01 01:00:00'
GROUP BY 1 ORDER BY 1;
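To build intuition, here is what avg, min, max, and stddev compute over the values array_agg returned for one bucket, sketched in plain Ruby (the sample temperatures are made up):

```ruby
temps = [20.3, 21.1, 19.8, 20.7] # illustrative readings for one hour

avg = temps.sum / temps.size
# PostgreSQL's stddev is the sample standard deviation (divide by n - 1)
variance = temps.sum { |t| (t - avg)**2 } / (temps.size - 1)
stddev = Math.sqrt(variance)

[avg, temps.min, temps.max, stddev]
```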
An overview of the percentile_agg function:
SELECT time_bucket('1 hour'::interval, time) AS bucket,
percentile_agg( temp_c)
FROM weather_metrics
WHERE city_name = 'New York'
AND time BETWEEN '2022-06-01 00:00:00' AND '2022-06-01 01:00:00'
GROUP BY 1 ORDER BY 1;
The _agg suffix indicates that several statistical aggregates can be pre-computed once and reused later, saving computation.
Now, getting quartiles and the median from percentiles:
SELECT time_bucket('1 month'::interval, time) AS bucket,
approx_percentile(0.25, percentile_agg( temp_c)) AS q1,
approx_percentile(0.5, percentile_agg( temp_c)) AS median,
approx_percentile(0.75, percentile_agg( temp_c)) AS q3
FROM weather_metrics
WHERE city_name = 'New York'
AND time BETWEEN '2021-06-01 00:00:00' AND '2022-06-01 01:00:00'
GROUP BY 1 ORDER BY 1;
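For intuition, here is an exact percentile over a small sorted array in plain Ruby; approx_percentile estimates the same quantities from a sketch, trading a little accuracy for far less work over millions of rows (the sample values are made up):

```ruby
# Exact percentile with linear interpolation between closest ranks
def percentile(sorted, p)
  rank = p * (sorted.size - 1)
  lo, hi = sorted[rank.floor], sorted[rank.ceil]
  lo + (hi - lo) * (rank - rank.floor)
end

temps = [18.0, 19.5, 20.1, 21.4, 22.0, 23.3, 24.8]
q1     = percentile(temps, 0.25)
median = percentile(temps, 0.5) # => 21.4
q3     = percentile(temps, 0.75)
```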
Pre-computing aggregations in a CTE lets us reuse the previously calculated percentile_agg:
WITH one_month AS (
SELECT time_bucket('1 month'::interval, time) AS bucket,
percentile_agg( temp_c)
FROM weather_metrics
WHERE city_name = 'New York'
AND time BETWEEN '2021-06-01 00:00:00' AND '2022-07-01 01:00:00'
GROUP BY 1 ORDER BY 1
)
SELECT bucket,
approx_percentile(0.25, percentile_agg) AS q1,
approx_percentile(0.5, percentile_agg) AS median,
approx_percentile(0.75, percentile_agg) AS q3
FROM one_month;
Statistical aggregates (in one or two dimensions) pre-compute a statistical summary.
SELECT time_bucket('1 hour'::interval, time) AS bucket,
stats_agg( temp_c) AS hourly_agg
FROM weather_metrics
WHERE city_name = 'New York'
AND time BETWEEN '2022-06-01 00:00:00' AND '2022-07-01 01:00:00'
GROUP BY 1 ORDER BY 1;
Computing an average from stats aggs is easy:
SELECT time_bucket('1 hour'::interval, time) AS bucket,
average(stats_agg( temp_c)) AS hourly_average
FROM weather_metrics
WHERE city_name = 'New York'
AND time BETWEEN '2022-06-01 00:00:00' AND '2022-07-01 01:00:00'
GROUP BY 1 ORDER BY 1;
Using CTE to reuse the stats aggs pre-computed data:
WITH agg AS (
SELECT time_bucket('1 hour'::interval, time) AS bucket,
stats_agg( temp_c)
FROM weather_metrics
WHERE city_name = 'New York'
AND time BETWEEN '2022-06-01 00:00:00' AND '2022-07-01 01:00:00'
GROUP BY 1
ORDER BY 1
)
SELECT bucket, average(stats_agg) FROM agg;
Rollup can combine stats aggs in different time frames:
WITH hourly AS (
SELECT time_bucket('1 hour'::interval, time) AS hour_bucket,
stats_agg( temp_c)
FROM weather_metrics
WHERE city_name = 'New York'
AND time BETWEEN '2022-06-01 00:00:00' AND '2022-07-01 01:00:00'
GROUP BY 1 ORDER BY 1
)
SELECT time_bucket('1 day', hour_bucket),
average(rollup(stats_agg))
FROM hourly GROUP BY 1;
Cascading rollups can reuse previous stats aggs:
WITH hourly AS ( SELECT time_bucket('1 hour'::interval, time) AS bucket,
stats_agg( temp_c) AS hourly_agg
FROM weather_metrics
WHERE city_name = 'New York'
AND time BETWEEN '2021-06-01 00:00:00' AND '2022-07-01 01:00:00'
GROUP BY 1 ORDER BY 1
),
daily AS ( SELECT time_bucket('1 day', bucket) AS bucket,
rollup(hourly_agg) AS daily_agg
FROM hourly GROUP BY 1
),
monthly AS ( SELECT time_bucket('1 month', bucket) AS bucket,
rollup(daily_agg) AS monthly_agg
FROM daily GROUP BY 1
)
SELECT bucket, average(monthly_agg) FROM monthly;
Adding variance and stddev without an expensive computing process:
-- previous stats aggs example
SELECT bucket,
average(monthly_agg),
variance(monthly_agg),
stddev(monthly_agg)
FROM monthly;
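The reason rollups stay cheap: a stats agg only stores mergeable partial sums (count, sum, sum of squares), from which average, variance, and stddev can be derived at any level without re-reading the raw rows. A hypothetical plain-Ruby analogue (not the toolkit's actual representation):

```ruby
# Mergeable summary: count, sum, and sum of squares are enough to
# recover average, sample variance, and stddev after any rollup.
StatsAgg = Struct.new(:n, :sum, :sumsq) do
  def self.of(values)
    new(values.size, values.sum, values.sum { |v| v * v })
  end

  # Combine two partial summaries without touching raw values
  def rollup(other)
    StatsAgg.new(n + other.n, sum + other.sum, sumsq + other.sumsq)
  end

  def average
    sum / n
  end

  def variance
    (sumsq - sum * sum / n) / (n - 1) # sample variance
  end

  def stddev
    Math.sqrt(variance)
  end
end

hour1 = StatsAgg.of([20.0, 21.0])
hour2 = StatsAgg.of([22.0, 23.0])
day = hour1.rollup(hour2) # the "daily rollup" of two hourly aggs
day.average # => 21.5
```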
Querying number of values from pre-computed stats aggs:
WITH hourly AS (
SELECT time_bucket('1 hour'::interval, time) AS bucket,
stats_agg( temp_c) AS hourly_agg
FROM weather_metrics
WHERE city_name = 'New York'
AND time BETWEEN '2021-06-01 00:00:00' AND '2022-06-01 01:00:00'
GROUP BY 1 ORDER BY 1
)
SELECT bucket, average(hourly_agg), num_vals(hourly_agg) FROM hourly;
AKA Continuous Aggregates ;)
Materialized views for hypertables.
CREATE MATERIALIZED VIEW ny_hourly_agg
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 hour'::interval, time) AS bucket,
stats_agg( temp_c) AS hourly_agg
FROM weather_metrics
WHERE city_name = 'New York'
GROUP BY 1;
Materialized data can be combined with real-time data from open time frames.
CREATE MATERIALIZED VIEW ny_daily_agg
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 day',bucket),
rollup(hourly_agg) AS daily_agg
FROM ny_hourly_agg GROUP BY 1;
Nesting continuous aggregates like this is not allowed, but we can still save processing with a regular view:
CREATE VIEW ny_daily_agg AS
SELECT time_bucket('1 day',bucket),
rollup(hourly_agg) AS daily_agg
FROM ny_hourly_agg GROUP BY 1;
The timescaledb gem:
- wrapper for TimescaleDB functions
- wrapper for Toolkit utilities (WIP)
- command-line utility to navigate your tsdb
gem install timescaledb
https://github.com/jonatas/timescaledb
tsdb is a Ruby playground for TimescaleDB instances.
tsdb $PG_URI --stats
Pry console - psql for Rubyists ;)
tsdb $PG_URI --console
tsdb generates models on the fly, allowing you to query from read-only apps:
WeatherMetric
Allows you to access all hypertable metadata:
WeatherMetric.hypertable
Has wrappers for hypertable utilities:
WeatherMetric.hypertable.approximate_rows_count
ny = WeatherMetric.where(city_name: "New York"); nil
Build a query from the previous scope:
ny.select("time_bucket('1y',time) as time, avg(temp_c) as temp_c").group(1)
- Ease all things analytics when using TimescaleDB.
- Focus on developer ergonomics and performance.
require 'timescaledb/toolkit'
WeatherMetric.acts_as_time_vector value_column: "temp_c"
acts_as_time_vector can also specify time_column and segment_by.
WeatherMetric.yesterday.volatility.map(&:attributes)
segment_by groups the data by a given segment. Several time-vector functions use it, so it's handy to have it pre-configured.
WeatherMetric.acts_as_time_vector value_column: "temp_c", segment_by: "city_name"
And then volatility will become:
WeatherMetric.yesterday.volatility.map(&:attributes)
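For reference, the volatility the toolkit computes corresponds to the sum of absolute deltas between consecutive values (per segment, when segment_by is set). In plain Ruby it looks roughly like:

```ruby
# Sum of absolute differences between consecutive readings
def volatility(values)
  values.each_cons(2).sum { |a, b| (b - a).abs }
end

volatility([20.0, 22.0, 21.0]) # => 3.0
```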
Downsampling reduces the number of points to a given threshold.
WeatherMetric.lttb(threshold: 50)
Comparison to Ruby.
https://jonatas.github.io/timescaledb/toolkit_lttb_tutorial/
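LTTB (Largest Triangle Three Buckets) keeps the first and last points and, for each intermediate bucket, the point forming the largest triangle with the previously kept point and the next bucket's average. A plain-Ruby sketch of the algorithm (not the toolkit's implementation, which runs in the database):

```ruby
# points: array of [x, y] pairs, assumed sorted by x
def lttb(points, threshold)
  return points if threshold >= points.size || threshold < 3

  sampled = [points.first]
  bucket_size = (points.size - 2).to_f / (threshold - 2)
  a = 0 # index of the last point we kept

  (0...(threshold - 2)).each do |i|
    # average of the NEXT bucket, used as the triangle's third vertex
    nxt_start = ((i + 1) * bucket_size).floor + 1
    nxt_end   = [((i + 2) * bucket_size).floor + 1, points.size].min
    nxt = points[nxt_start...nxt_end]
    avg_x = nxt.sum { |p| p[0] }.to_f / nxt.size
    avg_y = nxt.sum { |p| p[1] }.to_f / nxt.size

    # pick the point in the CURRENT bucket with the largest triangle area
    cur_start = (i * bucket_size).floor + 1
    cur_end   = ((i + 1) * bucket_size).floor + 1
    ax, ay = points[a]
    best, best_area = cur_start, -1.0
    (cur_start...cur_end).each do |j|
      bx, by = points[j]
      area = ((ax - avg_x) * (by - ay) - (ax - bx) * (avg_y - ay)).abs
      best, best_area = j, area if area > best_area
    end

    sampled << points[best]
    a = best
  end

  sampled << points.last
  sampled
end
```

Handy for comparing outputs at small scale against what `WeatherMetric.lttb` returns.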
Open, High, Low, Close (OHLC) values for grouped data:
ohlc = ny.select("time_bucket('1y',time) as time,
toolkit_experimental.ohlc(time, temp_c)").group(1)
WeatherMetric.from("(#{ohlc.to_sql}) AS ohlc")
.select("time,
toolkit_experimental.open(ohlc),
toolkit_experimental.high(ohlc),
toolkit_experimental.low(ohlc),
toolkit_experimental.close(ohlc)")
.map(&:attributes)
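OHLC summarizes each bucket's time-ordered readings as the first (open), maximum (high), minimum (low), and last (close) values. A plain-Ruby sketch (sample data made up):

```ruby
# values must be ordered by time within the bucket
def ohlc(values)
  { open: values.first, high: values.max, low: values.min, close: values.last }
end

ohlc([21.0, 23.5, 19.8, 22.1])
```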
- https://ideia.me/using-the-timescale-gem-with-ruby
- https://ideia.me/timescale-continuous-aggregates-with-ruby
- https://github.com/jonatas/timescaledb
- https://timescale.com/community
- @jonatasdp on {Twitter,Instagram,Linkedin}
- Github: @jonatas