
@jonatas
Last active October 21, 2022 17:05
ADA Meetup - Processing Data with TimescaleDB and Ruby

ADA.rb

Ruby & Timescale

Jônatas Davi Paganini

@jonatasdp

  • Backend developer

  • Ruby/Shell/PostgreSQL/Vim

    • Ruby since 2007.
    • PostgreSQL since 2004.

twitter: @jonatasdp

github: @jonatas

Agenda

  • The weather dataset - our playground today!
  • SQL walkthrough - statistics with PostgreSQL.
  • The TimescaleDB walkthrough - time-series with super-powers.
  • The Ruby walkthrough with the TimescaleDB gem.

Dataset

Open weather:

https://openweathermap.org

  • Free data from the entire world.
  • Free API.
  • Statistics from anywhere.
  • Time-series data.

The focus will be weather metrics.

DDL

Connecting to the open_weather database via psql:

psql open_weather

Describe the weather_metrics table:

\d weather_metrics

5W1H

The Hypertable 5W1H!

  • Who: The timescaledb extension
  • What: Hypertable
  • When: you need to handle time-series data (insert, select, update, delete)
  • Where: In your PostgreSQL database
  • Why: to optimize time-series throughput
  • How: using table partitions to compress, parallelize and manage smaller chunks of data.

\d+ weather_metrics

Timing

In psql we can enable timing:

\timing

Counting

SELECT count(1) FROM weather_metrics; -- => 4092484
Time: 227.889 ms

Approx. Count

TimescaleDB offers a different counting approach whose result is a close approximation of the real count:

SELECT approximate_row_count('weather_metrics'); -- => 4092484
Time: 14.310 ms

Note that 228 / 14 ≈ 16, so it is about 16 times faster.

Explain

Understanding a bit of the execution plan:

EXPLAIN SELECT count(1) FROM weather_metrics;

Divide and conquer!

->  Parallel Append  (cost=0.29..56957.24 rows=1023114 width=0)
  ->  Parallel Index Only Scan using _hyper_1_...idx on _hyper_1_...

simple query

SELECT time, temp_c
FROM weather_metrics
WHERE city_name = 'New York'
AND time BETWEEN '2022-06-01' AND '2022-06-02'
ORDER BY 1 LIMIT 20;

time bucket

Get the average temperature grouped into one-hour buckets:

SELECT time_bucket('1 hour', time) AS bucket,
  avg(temp_c)
FROM weather_metrics
WHERE city_name = 'New York'
  AND time BETWEEN '2022-06-01' AND '2022-06-02'
GROUP BY 1 ORDER BY 1;

time_bucket also supports timestamps with time zones.
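As a mental model, time_bucket truncates each timestamp to the start of its fixed-width bucket. A minimal Ruby sketch of that idea (fixed-width buckets only; calendar units such as months need real date math):

```ruby
# Truncate a timestamp to the start of its fixed-width bucket.
# Width is given in seconds (e.g. 3600 for '1 hour').
def time_bucket(width_in_seconds, t)
  Time.at((t.to_i / width_in_seconds) * width_in_seconds).utc
end

t = Time.utc(2022, 6, 1, 10, 42, 7)
time_bucket(3600, t) # => 2022-06-01 10:00:00 UTC
```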

Min / Max

Now, let's get a bit more detail by adding the min and max:

SELECT time_bucket('1 hour'::interval, time) AS bucket,
  avg(temp_c)::numeric(4,2),
  min(temp_c), max(temp_c)
FROM weather_metrics
WHERE city_name = 'New York'
  AND time BETWEEN '2022-06-01' AND '2022-06-02'
GROUP BY 1 ORDER BY 1;

Stddev

Now, we can also check the standard deviation:

SELECT time_bucket('1 hour'::interval, time) AS bucket,
  avg(temp_c)::numeric(4,2),
  min(temp_c), max(temp_c), stddev(temp_c)
FROM weather_metrics
WHERE city_name = 'New York'
  AND time BETWEEN '2022-06-01' AND '2022-06-02'
GROUP BY 1 ORDER BY 1;

Sample

Let's zoom into a single bucket to understand the standard deviation:

SELECT time_bucket('1 hour'::interval, time) AS bucket,
  count(*),
  avg(temp_c)::numeric(4,2),
  min(temp_c), max(temp_c), stddev(temp_c)
FROM weather_metrics
WHERE city_name = 'New York'
  AND time BETWEEN '2022-06-01 00:00:00' AND '2022-06-01 01:00:00'
GROUP BY 1 ORDER BY 1;

array_agg

Now, going deeper into the individual values inside this hour:

SELECT time_bucket('1 hour'::interval, time) AS bucket,
  array_agg(temp_c)
FROM weather_metrics
WHERE city_name = 'New York'
  AND time BETWEEN '2022-06-01 00:00:00' AND '2022-06-01 01:00:00'
GROUP BY 1 ORDER BY 1;
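For reference, the stddev() used above is the sample standard deviation over exactly those individual values. A plain-Ruby sketch with made-up temperatures:

```ruby
# Sample standard deviation, mirroring what SQL's stddev() computes
# over the raw readings inside a bucket.
def stddev(values)
  mean = values.sum.to_f / values.size
  Math.sqrt(values.sum { |v| (v - mean)**2 } / (values.size - 1))
end

# Made-up temperature readings, for illustration only.
stddev([21.4, 21.9, 22.3, 21.7]) # ≈ 0.38
```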

percentile_agg

An overview of the percentile_agg function:

SELECT time_bucket('1 hour'::interval, time) AS bucket,
  percentile_agg(temp_c)
FROM weather_metrics
WHERE city_name = 'New York'
  AND time BETWEEN '2022-06-01 00:00:00' AND '2022-06-01 01:00:00'
GROUP BY 1 ORDER BY 1;

The _agg suffix indicates that several statistical aggregates can be pre-computed, saving computation later.

quartiles

Now, getting the quartiles and the median from percentiles:

SELECT time_bucket('1 month'::interval, time) AS bucket,
  approx_percentile(0.25, percentile_agg(temp_c)) AS q1,
  approx_percentile(0.5, percentile_agg(temp_c)) AS median,
  approx_percentile(0.75, percentile_agg(temp_c)) AS q3
FROM weather_metrics
WHERE city_name = 'New York'
  AND time BETWEEN '2021-06-01 00:00:00' AND '2022-06-01 01:00:00'
GROUP BY 1 ORDER BY 1;
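approx_percentile estimates what an exact percentile would return. A plain-Ruby sketch of the exact version, sorting and linearly interpolating (illustrative only; the toolkit estimates from a compact sketch instead of sorting everything):

```ruby
# Exact percentile by sorting and linear interpolation.
# p is a fraction between 0 and 1 (0.5 = median).
def percentile(values, p)
  sorted = values.sort
  rank = p * (sorted.size - 1)
  lower, upper = sorted[rank.floor], sorted[rank.ceil]
  lower + (upper - lower) * (rank - rank.floor)
end

# Made-up temperatures, for illustration only.
temps = [18.0, 19.5, 21.0, 22.5, 30.0]
percentile(temps, 0.5) # => 21.0 (the median)
```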

CTE

Pre-computing the aggregation in a CTE lets us reuse the previously calculated percentile_agg:

WITH one_month AS (
  SELECT time_bucket('1 month'::interval, time) AS bucket,
    percentile_agg( temp_c)
  FROM weather_metrics
  WHERE city_name = 'New York'
    AND time BETWEEN '2021-06-01 00:00:00' AND '2022-07-01 01:00:00'
  GROUP BY 1 ORDER BY 1
)
SELECT bucket,
  approx_percentile(0.25, percentile_agg) AS q1,
  approx_percentile(0.5, percentile_agg) AS median,
  approx_percentile(0.75, percentile_agg) AS q3
FROM one_month;

Stats aggs

Statistical aggregates, in one or two dimensions, pre-compute a statistics summary:

SELECT time_bucket('1 hour'::interval, time) AS bucket,
  stats_agg(temp_c) AS hourly_agg
FROM weather_metrics
WHERE city_name = 'New York'
  AND time BETWEEN '2022-06-01 00:00:00' AND '2022-07-01 01:00:00'
GROUP BY 1 ORDER BY 1;

Average

Computing an average from stats aggs is very easy:

SELECT time_bucket('1 hour'::interval, time) AS bucket,
  average(stats_agg(temp_c)) AS hourly_average
FROM weather_metrics
WHERE city_name = 'New York'
  AND time BETWEEN '2022-06-01 00:00:00' AND '2022-07-01 01:00:00'
GROUP BY 1 ORDER BY 1;

Alias

Using CTE to reuse the stats aggs pre-computed data:

WITH agg AS (
  SELECT time_bucket('1 hour'::interval, time) AS bucket,
    stats_agg( temp_c)
  FROM weather_metrics
  WHERE city_name = 'New York'
  AND time BETWEEN '2022-06-01 00:00:00' AND '2022-07-01 01:00:00'
  GROUP BY 1
  ORDER BY 1
)
SELECT bucket, average(stats_agg) FROM agg;

Rollup

Rollup can combine stats aggs across different time frames:

WITH hourly AS (
  SELECT time_bucket('1 hour'::interval, time) AS hour_bucket,
    stats_agg( temp_c)
  FROM weather_metrics
  WHERE city_name = 'New York'
  AND time between '2022-06-01 00:00:00' AND '2022-07-01 01:00:00'
  GROUP BY 1 ORDER BY 1
)
SELECT time_bucket('1 day', hour_bucket),
  average(rollup(stats_agg))
FROM hourly GROUP BY 1;
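Rollup works because a stats agg stores merge-friendly partials rather than final answers. A minimal Ruby sketch of the idea, using a hypothetical PartialAgg that keeps count, sum and sum of squares (not the gem's API):

```ruby
# A partial aggregate: count, sum and sum of squares are enough to
# derive average and variance, and trivially merge (rollup) without
# rescanning the raw data.
PartialAgg = Struct.new(:n, :sum, :sum_sq) do
  def self.of(values)
    new(values.size, values.sum.to_f, values.sum { |v| v.to_f * v })
  end

  def rollup(other)
    PartialAgg.new(n + other.n, sum + other.sum, sum_sq + other.sum_sq)
  end

  def average
    sum / n
  end

  def variance
    (sum_sq - sum**2 / n) / (n - 1) # sample variance
  end
end

# Two hourly partials combined into a daily one (made-up readings).
hour1 = PartialAgg.of([20.0, 21.0, 22.0])
hour2 = PartialAgg.of([23.0, 24.0, 25.0])
day   = hour1.rollup(hour2)
day.average # => 22.5
```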

cascade

Cascading rollups can reuse previous stats aggs:

WITH hourly AS ( SELECT time_bucket('1 hour'::interval, time) AS bucket,
    stats_agg( temp_c) AS hourly_agg
  FROM weather_metrics
  WHERE city_name = 'New York'
    AND time BETWEEN '2021-06-01 00:00:00' AND '2022-07-01 01:00:00'
  GROUP BY 1 ORDER BY 1
),
daily AS ( SELECT time_bucket('1 day', bucket) AS bucket,
    rollup(hourly_agg) AS daily_agg
  FROM hourly GROUP BY 1
),
monthly AS ( SELECT time_bucket('1 month', bucket) AS bucket,
   rollup(daily_agg) AS monthly_agg
 FROM daily GROUP BY 1
)
SELECT bucket, average(monthly_agg) FROM monthly;

Variance

Adding variance and stddev without an expensive computing process:

-- previous stats aggs example
SELECT bucket,
  average(monthly_agg),
  variance(monthly_agg),
  stddev(monthly_agg)
FROM monthly;

num_vals

Querying number of values from pre-computed stats aggs:

WITH hourly AS (
  SELECT time_bucket('1 hour'::interval, time) AS bucket,
    stats_agg( temp_c) AS hourly_agg
  FROM weather_metrics
  WHERE city_name = 'New York'
  AND time BETWEEN '2021-06-01 00:00:00' AND '2022-06-01 01:00:00'
  GROUP BY 1 ORDER BY 1
)
SELECT bucket, average(hourly_agg), num_vals(hourly_agg) FROM hourly;

CAggs

AKA Continuous Aggregates ;)

Materialized views for hypertables.

CREATE MATERIALIZED VIEW ny_hourly_agg
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 hour'::interval, time) AS bucket,
   stats_agg( temp_c) AS hourly_agg
FROM weather_metrics
 WHERE city_name = 'New York'
GROUP BY 1;

Materialized data can be combined with real-time data from open time frames.

caggs^2?

CREATE MATERIALIZED VIEW ny_daily_agg
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 day',bucket),
rollup(hourly_agg) AS daily_agg
FROM ny_hourly_agg GROUP BY 1;

Not allowed, but you can still save processing with a regular view:

CREATE VIEW ny_daily_agg AS
SELECT time_bucket('1 day',bucket),
    rollup(hourly_agg) AS daily_agg
FROM ny_hourly_agg GROUP BY 1;

Ruby

The timescaledb gem:

  • wrapper for TimescaleDB functions
  • wrapper for Toolkit utilities (WIP)
  • command-line utility to navigate your tsdb

gem install timescaledb

https://github.com/jonatas/timescaledb

tsdb

tsdb is a Ruby playground for TimescaleDB instances:

tsdb $PG_URI --stats

Pry Console - psql for rubyists ;)

tsdb $PG_URI --console

Models

The tsdb utility generates models on the fly, allowing you to query from read-only apps:

WeatherMetric

Metadata

Allows you to access all hypertable metadata:

WeatherMetric.hypertable

Has wrappers for hypertable utilities:

WeatherMetric.hypertable.approximate_rows_count

Scope

ny = WeatherMetric.where(city_name: "New York"); nil

Nesting

Build query from previous scope:

ny.select("time_bucket('1y',time) as time, avg(temp_c) as temp_c").group(1)

Toolkit

  • Ease all things analytics when using TimescaleDB.
  • Focus on developer ergonomics and performance.

require 'timescaledb/toolkit'
WeatherMetric.acts_as_time_vector value_column: "temp_c"

acts_as_time_vector can also specify time_column and segment_by.

Volatility

WeatherMetric.yesterday.volatility.map(&:attributes)
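As a mental model, volatility over a time vector sums the absolute deltas between consecutive values (my reading of the toolkit's volatility pipeline; plain Ruby, no gem required):

```ruby
# Volatility of a series: the sum of absolute deltas between
# consecutive values.
def volatility(values)
  values.each_cons(2).sum { |a, b| (b - a).abs }
end

# Made-up temperatures: |21.5-20.0| + |20.5-21.5| + |22.0-20.5| = 4.0
volatility([20.0, 21.5, 20.5, 22.0]) # => 4.0
```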

Segment by

Segment by groups the data by some segment. It is used by several time-vector functions, so it's good to have it pre-configured.

WeatherMetric.acts_as_time_vector value_column: "temp_c", segment_by: "city_name"

And then volatility will become:

WeatherMetric.yesterday.volatility.map(&:attributes)

LTTB

A downsampling method that reduces the number of points to a threshold.

WeatherMetric.lttb(threshold: 50)
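For intuition, here is a plain-Ruby sketch of the LTTB (Largest-Triangle-Three-Buckets) algorithm itself: for each bucket it keeps the point forming the largest triangle with the previously kept point and the next bucket's average:

```ruby
# Downsample [[x, y], ...] points to `threshold` points with LTTB.
def lttb(points, threshold)
  n = points.size
  return points.dup if threshold >= n || threshold < 3

  bucket_size = (n - 2).to_f / (threshold - 2)
  sampled = [points[0]] # always keep the first point
  a = 0                 # index of the last selected point

  (threshold - 2).times do |i|
    start_idx = (i * bucket_size).floor + 1
    end_idx   = [((i + 1) * bucket_size).floor + 1, n - 1].min

    # Average point of the *next* bucket (fallback: the last point).
    next_pts = points[end_idx...[((i + 2) * bucket_size).floor + 1, n - 1].min]
    next_pts = [points[n - 1]] if next_pts.empty?
    avg_x = next_pts.sum { |p| p[0] }.fdiv(next_pts.size)
    avg_y = next_pts.sum { |p| p[1] }.fdiv(next_pts.size)

    # Keep the candidate with the largest triangle area.
    ax, ay = points[a]
    best = (start_idx...end_idx).max_by do |j|
      x, y = points[j]
      ((ax - avg_x) * (y - ay) - (ax - x) * (avg_y - ay)).abs
    end
    sampled << points[best]
    a = best
  end

  sampled << points[n - 1] # always keep the last point
end
```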

LTTB web

Comparing the database implementation to Ruby:

https://jonatas.github.io/timescaledb/toolkit_lttb_tutorial/

OHLC

A summary with Open, High, Low, Close values for grouped data.

ohlc = ny.select("time_bucket('1y',time) as time,
  toolkit_experimental.ohlc(time, temp_c)").group(1)
WeatherMetric.from("(#{ohlc.to_sql}) AS ohlc")
  .select("time,
    toolkit_experimental.open(ohlc),
    toolkit_experimental.high(ohlc),
    toolkit_experimental.low(ohlc),
    toolkit_experimental.close(ohlc)")
  .map(&:attributes)
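Conceptually, ohlc summarizes each bucket of a time-ordered series by its first, maximum, minimum and last value. A plain-Ruby sketch:

```ruby
# OHLC over time-ordered values: open = first, high = max,
# low = min, close = last.
def ohlc(values)
  { open: values.first, high: values.max, low: values.min, close: values.last }
end

# Made-up temperatures for one bucket.
ohlc([21.0, 23.5, 19.0, 22.0])
```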

Extra Resources

Thanks

Jônatas Davi Paganini
