## ada-meetup-walkthrough-timescale-and-ruby.md

      
            
          
    watch the talk here

ADA.rb

Ruby

&

Timescale

Jônatas Davi Paganini

@jonatasdp



Backend developer


Ruby/Shell/PostgreSQL/Vim

Ruby since 2007.
PostgreSQL since 2004.


twitter: @jonatasdp

github: @jonatas

Agenda


The weather dataset: our playground today!
SQL walkthrough - Statistics with PostgreSQL.
The TimescaleDB walkthrough - time-series with super-powers.
The Ruby walkthrough with the TimescaleDB gem.

Dataset

Open weather:
https://openweathermap.org

Free data from the entire world.
Free API.
Statistics from anywhere.
Time-series data.


The focus will be weather metrics.

DDL

Connecting to the open_weather database via psql:
psql open_weather
Describe the weather_metrics table:
\d weather_metrics
5WH

The Hypertable 5WH!

Who: The timescaledb extension
What: Hypertable
When: you need to handle time-series data (insert, select, update, delete)
Where: In your PostgreSQL database
Why: to optimize time-series throughput
How: using table partitions to compress, parallelize and manage smaller chunks of data.

\d+ weather_metrics
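To picture the "How", here is a minimal Ruby sketch of time-based partitioning. It is only an illustration: `CHUNK_INTERVAL`, `chunk_range` and the sample rows are hypothetical names and values, and aligning chunks to the Unix epoch is a simplification, not a description of how TimescaleDB lays out chunks internally.

```ruby
require "time"

# Hypothetical chunk width of 7 days, in seconds (assumption for this sketch).
CHUNK_INTERVAL = 7 * 24 * 60 * 60

# Returns the [start, end) time range of the chunk that would hold +time+,
# assuming chunks are aligned to the Unix epoch.
def chunk_range(time, interval = CHUNK_INTERVAL)
  start = time.to_i - (time.to_i % interval)
  [Time.at(start).utc, Time.at(start + interval).utc]
end

# Made-up rows, shaped like weather_metrics.
rows = [
  { time: Time.utc(2022, 6, 2, 12, 0), temp_c: 18.2 },
  { time: Time.utc(2022, 6, 5, 8, 0),  temp_c: 21.4 },
  { time: Time.utc(2022, 6, 20, 8, 0), temp_c: 25.0 }
]

# Each row lands in exactly one chunk, so a query with a time filter
# only needs to touch the chunks whose range overlaps that filter.
chunks = rows.group_by { |row| chunk_range(row[:time]) }
chunks.each do |(from, to), chunk_rows|
  puts "#{from.iso8601}..#{to.iso8601}: #{chunk_rows.size} row(s)"
end
```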
Timing

In psql we can enable timing:
\timing

Counting

SELECT count(1) FROM weather_metrics; -- => 4092484
Time: 227.889 ms
Approx. Count

TimescaleDB offers an approximate count, estimated from table statistics, that is usually very close to the real count.
SELECT approximate_row_count('weather_metrics'); -- => 4092484
Time: 14.310 ms
Note that 227.889 / 14.310 ≈ 16, so it is about 16 times faster.
Explain

Understanding a bit of the execution plan:
EXPLAIN SELECT count(1) FROM weather_metrics;
Divide and conquer!
->  Parallel Append  (cost=0.29..56957.24 rows=1023114 width=0)
  ->  Parallel Index Only Scan using _hyper_1_...idx on _hyper_1_...

simple query

SELECT time, temp_c
FROM weather_metrics
WHERE city_name = 'New York'
AND time BETWEEN '2022-06-01' AND '2022-06-02'
ORDER BY 1 LIMIT 20;
time bucket

Get the average temperature grouped by hour:
SELECT time_bucket('1 hour', time) AS bucket,
  avg(temp_c)
FROM weather_metrics
 WHERE city_name = 'New York'
AND time BETWEEN '2022-06-01' AND '2022-06-02'
GROUP BY 1 ORDER BY 1;
The time_bucket also supports timestamps with time zones.
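For readers more at home in Ruby, the same idea can be sketched in plain Ruby. This is only an illustration of what bucketing does, not the gem's API: the `time_bucket` helper below is a hypothetical Ruby function that takes a width in seconds, and the readings are made up.

```ruby
# Truncate a Time to the start of its bucket (width in seconds).
def time_bucket(width, time)
  Time.at(time.to_i - (time.to_i % width)).utc
end

# Made-up [time, temp_c] readings.
readings = [
  [Time.utc(2022, 6, 1, 0, 5),  18.0],
  [Time.utc(2022, 6, 1, 0, 35), 19.0],
  [Time.utc(2022, 6, 1, 1, 10), 21.0]
]

# GROUP BY bucket, then avg(temp_c), then ORDER BY bucket.
hourly_avg = readings
  .group_by { |time, _| time_bucket(3600, time) }
  .transform_values { |rows| rows.sum { |_, temp| temp } / rows.size }
  .sort
  .to_h

hourly_avg.each { |bucket, avg| puts "#{bucket} #{avg}" }
```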
Min / Max

Now, let's get a bit more detail by adding the min and max:
SELECT time_bucket('1 hour'::interval, time) AS bucket,
  avg(temp_c)::numeric(4,2),
  min(temp_c), max(temp_c)
FROM weather_metrics
WHERE city_name = 'New York'
  AND time BETWEEN '2022-06-01' AND '2022-06-02'
GROUP BY 1 ORDER BY 1;
Stddev

Now, we can also check the standard deviation:
SELECT time_bucket('1 hour'::interval, time) AS bucket,
  avg(temp_c)::numeric(4,2),
  min(temp_c), max(temp_c), stddev(temp_c)
FROM weather_metrics
WHERE city_name = 'New York'
  AND time BETWEEN '2022-06-01' AND '2022-06-02'
GROUP BY 1 ORDER BY 1;
Sample

Let's look closely at a single hour to understand the standard deviation:
SELECT time_bucket('1 hour'::interval, time) AS bucket,
count(*),
  avg(temp_c)::numeric(4,2),
  min(temp_c), max(temp_c), stddev(temp_c)
FROM weather_metrics
WHERE city_name = 'New York'
  AND time BETWEEN '2022-06-01 00:00:00' AND '2022-06-01 01:00:00'
GROUP BY 1 ORDER BY 1;
array_agg

Now going deep into individual values inside this hour:
 SELECT time_bucket('1 hour'::interval, time) AS bucket,
array_agg( temp_c)
FROM weather_metrics
 WHERE city_name = 'New York'
AND time BETWEEN '2022-06-01 00:00:00' AND '2022-06-01 01:00:00'
GROUP BY 1 ORDER BY 1;
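With the individual values at hand, the standard deviation can be checked by hand. A minimal Ruby sketch, assuming made-up temperatures: `sample_stddev` is a hypothetical helper that uses the sample formula (n - 1 in the denominator), which is what PostgreSQL's stddev computes.

```ruby
# Sample standard deviation: sqrt(sum((v - mean)^2) / (n - 1)).
# Returns nil for fewer than two values, as PostgreSQL returns NULL.
def sample_stddev(values)
  return nil if values.size < 2

  mean = values.sum / values.size.to_f
  squared_diffs = values.sum { |v| (v - mean)**2 }
  Math.sqrt(squared_diffs / (values.size - 1))
end

temps = [18.0, 19.0, 20.0, 21.0] # hypothetical values from one hour
puts sample_stddev(temps)
```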
percentile_agg

To get an overview of the percentile_agg function:
SELECT time_bucket('1 hour'::interval, time) AS bucket,
   percentile_agg( temp_c)
FROM weather_metrics
 WHERE city_name = 'New York'
AND time BETWEEN '2022-06-01 00:00:00' AND '2022-06-01 01:00:00'
GROUP BY 1 ORDER BY 1;
The _agg suffix indicates that the function builds a pre-computed aggregate,
from which several statistics can be derived later without rescanning the data.
quartiles

Now, getting the quartiles and the median from percentiles:
SELECT time_bucket('1 month'::interval, time) AS bucket,
    approx_percentile(0.25, percentile_agg( temp_c)) AS q1,
    approx_percentile(0.5, percentile_agg( temp_c)) AS median,
    approx_percentile(0.75, percentile_agg( temp_c)) AS q3
FROM weather_metrics
 WHERE city_name = 'New York'
AND time BETWEEN '2021-06-01 00:00:00' AND '2022-06-01 01:00:00'
GROUP BY 1 ORDER BY 1;
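As a reference point for what these numbers mean, here is a small Ruby sketch of an exact percentile using linear interpolation between neighbouring values. It is not how approx_percentile works (that one answers from an approximation built by percentile_agg); `percentile` and the sample values are hypothetical.

```ruby
# Exact percentile with linear interpolation between the two nearest ranks.
def percentile(values, fraction)
  sorted = values.sort
  rank = fraction * (sorted.size - 1)
  lower = sorted[rank.floor]
  upper = sorted[rank.ceil]
  lower + (upper - lower) * (rank - rank.floor)
end

temps = [24.0, 12.0, 18.0, 15.0, 21.0] # made-up temperatures

q1     = percentile(temps, 0.25)
median = percentile(temps, 0.5)
q3     = percentile(temps, 0.75)

puts "q1=#{q1} median=#{median} q3=#{q3}"
```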
CTE

Pre-computing the aggregate in a CTE lets the outer query reuse the already calculated percentile_agg:
WITH one_month AS (
  SELECT time_bucket('1 month'::interval, time) AS bucket,
    percentile_agg( temp_c)
  FROM weather_metrics
  WHERE city_name = 'New York'
    AND time BETWEEN '2021-06-01 00:00:00' AND '2022-07-01 01:00:00'
  GROUP BY 1 ORDER BY 1
)
SELECT bucket,
  approx_percentile(0.25, percentile_agg) AS q1,
  approx_percentile(0.5, percentile_agg) AS median,
  approx_percentile(0.75, percentile_agg) AS q3
FROM one_month;
Stats aggs

stats_agg builds statistical aggregates in one or two dimensions, pre-computing a statistical summary:
SELECT time_bucket('1 hour'::interval, time) AS bucket,
   stats_agg( temp_c) AS hourly_agg
FROM weather_metrics
 WHERE city_name = 'New York'
AND time BETWEEN '2022-06-01 00:00:00' AND '2022-07-01 01:00:00'
GROUP BY 1 ORDER BY 1;
Average

Computing an average from a stats agg is very easy:
 SELECT time_bucket('1 hour'::interval, time) AS bucket,
   average(stats_agg( temp_c)) AS hourly_average
FROM weather_metrics
 WHERE city_name = 'New York'
AND time BETWEEN '2022-06-01 00:00:00' AND '2022-07-01 01:00:00'
GROUP BY 1 ORDER BY 1;
Alias

Using a CTE to reuse the pre-computed stats agg:
WITH agg AS (
  SELECT time_bucket('1 hour'::interval, time) AS bucket,
    stats_agg( temp_c)
  FROM weather_metrics
  WHERE city_name = 'New York'
  AND time BETWEEN '2022-06-01 00:00:00' AND '2022-07-01 01:00:00'
  GROUP BY 1
  ORDER BY 1
)
SELECT bucket, average(stats_agg) FROM agg;
Rollup

rollup combines stats aggs from smaller time frames into larger ones:
WITH hourly AS (
  SELECT time_bucket('1 hour'::interval, time) AS hour_bucket,
    stats_agg( temp_c)
  FROM weather_metrics
  WHERE city_name = 'New York'
  AND time between '2022-06-01 00:00:00' AND '2022-07-01 01:00:00'
  GROUP BY 1 ORDER BY 1
)
SELECT time_bucket('1 day', hour_bucket),
  average(rollup(stats_agg))
FROM hourly GROUP BY 1;
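Why can a rollup answer without going back to the raw rows? A rough Ruby sketch of the idea, assuming a summary that keeps only the count, the sum and the sum of squares. `StatsSummary` and its members are hypothetical names, and this is not the Toolkit's internal representation, only an illustration of a mergeable summary.

```ruby
# A tiny mergeable summary: enough state to derive average, variance and stddev.
StatsSummary = Struct.new(:n, :total, :total_sq) do
  # Build a summary from raw values (what stats_agg does conceptually).
  def self.from(values)
    new(values.size, values.sum.to_f, values.sum { |v| v * v }.to_f)
  end

  # Combine two summaries without touching the raw values.
  def rollup(other)
    StatsSummary.new(n + other.n, total + other.total, total_sq + other.total_sq)
  end

  def average
    total / n
  end

  # Sample variance derived from the stored sums.
  def variance
    (total_sq - total * total / n) / (n - 1)
  end

  def stddev
    Math.sqrt(variance)
  end
end

hour_1 = StatsSummary.from([18.0, 19.0]) # made-up hourly values
hour_2 = StatsSummary.from([20.0, 21.0])
day    = hour_1.rollup(hour_2)

puts "n=#{day.n} avg=#{day.average} stddev=#{day.stddev}"
```

Because the merged summary equals the summary of all raw values, hourly summaries can be rolled up to days and days to months, and the average, variance, standard deviation and number of values all come from the same small structure.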
cascade

Cascading rollups can reuse previous stats aggs:
WITH hourly AS ( SELECT time_bucket('1 hour'::interval, time) AS bucket,
    stats_agg( temp_c) AS hourly_agg
  FROM weather_metrics
  WHERE city_name = 'New York'
    AND time BETWEEN '2021-06-01 00:00:00' AND '2022-07-01 01:00:00'
  GROUP BY 1 ORDER BY 1
),
daily AS ( SELECT time_bucket('1 day', bucket) AS bucket,
    rollup(hourly_agg) AS daily_agg
  FROM hourly GROUP BY 1
),
monthly AS ( SELECT time_bucket('1 month', bucket) AS bucket,
   rollup(daily_agg) AS monthly_agg
 FROM daily GROUP BY 1
)
SELECT bucket, average(monthly_agg) from monthly;
Variance

Adding variance and stddev without recomputing anything from the raw data:
-- reusing the hourly, daily and monthly CTEs from the previous example
SELECT bucket,
  average(monthly_agg),
  variance(monthly_agg),
  stddev(monthly_agg)
FROM monthly;
num_vals

Querying the number of values from pre-computed stats aggs:
WITH hourly AS (
  SELECT time_bucket('1 hour'::interval, time) AS bucket,
    stats_agg( temp_c) AS hourly_agg
  FROM weather_metrics
  WHERE city_name = 'New York'
  AND time BETWEEN '2021-06-01 00:00:00' AND '2022-06-01 01:00:00'
  GROUP BY 1 ORDER BY 1
)
SELECT bucket, average(hourly_agg), num_vals(hourly_agg) from hourly;
CAggs


AKA Continuous Aggregates ;)

Materialized views for hypertables.
CREATE MATERIALIZED VIEW ny_hourly_agg
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 hour'::interval, time) AS bucket,
   stats_agg( temp_c) AS hourly_agg
FROM weather_metrics
 WHERE city_name = 'New York'
GROUP BY 1;
Materialized data can be combined with real-time data from time frames that are still open (not yet materialized).
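A rough way to picture that combination, sketched in plain Ruby: buckets before a watermark come from the materialization, newer rows are aggregated on the fly, and the two are merged. All names and values here (`materialized`, `watermark`, `raw_rows`, `time_bucket`) are hypothetical, and the real mechanics inside TimescaleDB are more involved.

```ruby
# Truncate a Time to the start of its bucket (width in seconds).
def time_bucket(width, time)
  Time.at(time.to_i - (time.to_i % width)).utc
end

# Hourly averages that were already materialized (made-up values).
materialized = {
  Time.utc(2022, 6, 1, 0) => 18.5,
  Time.utc(2022, 6, 1, 1) => 21.0
}

# Everything before the watermark is served from the materialization.
watermark = Time.utc(2022, 6, 1, 2)

raw_rows = [
  [Time.utc(2022, 6, 1, 1, 30), 21.0], # already covered by the materialization
  [Time.utc(2022, 6, 1, 2, 10), 22.0],
  [Time.utc(2022, 6, 1, 2, 40), 24.0]
]

# Aggregate only the rows at or after the watermark.
fresh = raw_rows
  .select { |time, _| time >= watermark }
  .group_by { |time, _| time_bucket(3600, time) }
  .transform_values { |rows| rows.sum { |_, temp| temp } / rows.size }

real_time_view = materialized.merge(fresh)
real_time_view.each { |bucket, avg| puts "#{bucket} #{avg}" }
```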
caggs^2?

CREATE MATERIALIZED VIEW ny_daily_agg
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 day',bucket),
rollup(hourly_agg) AS daily_agg
FROM ny_hourly_agg group by 1;
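This was not allowed at the time of the talk (TimescaleDB 2.9 and later support continuous aggregates on top of continuous aggregates), but a regular view can still save processing: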
CREATE VIEW ny_daily_agg AS
SELECT time_bucket('1 day',bucket),
    rollup(hourly_agg) AS daily_agg
FROM ny_hourly_agg GROUP BY 1;
Ruby

The timescaledb gem:

wrapper for TimescaleDB functions
wrapper for Toolkit utilities (WIP)
command line utility to navigate into your tsdb

gem install timescaledb
https://github.com/jonatas/timescaledb
tsdb

The tsdb is a Ruby playground for TimescaleDB instances.
tsdb $PG_URI --stats
Pry Console - psql for Rubyists ;)
tsdb $PG_URI --console
Models

The tsdb utility generates models on the fly, so you can query the data even
from read-only apps:
WeatherMetric
Metadata

Allows you to access all hypertable metadata:
WeatherMetric.hypertable
Has wrappers for hypertable utilities:
WeatherMetric.hypertable.approximate_rows_count
Scope

ny = WeatherMetric.where(city_name: "New York"); nil
Nesting

Build query from previous scope:
ny.select("time_bucket('1y',time) as time, avg(temp_c) as temp_c").group(1)
Toolkit


Ease all things analytics when using TimescaleDB.
Focus on developer ergonomics and performance.

require 'timescaledb/toolkit'
WeatherMetric.acts_as_time_vector value_column: "temp_c"
acts_as_time_vector can also specify time_column and segment_by.
Volatility

WeatherMetric.yesterday.volatility.map(&:attributes)
Segment by

Segment by groups the data by a given column. Several time vector functions
use it, so it's good to have it pre-configured.
WeatherMetric.acts_as_time_vector value_column: "temp_c", segment_by: "city_name"
And then volatility will become:
WeatherMetric.yesterday.volatility.map(&:attributes)
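To make the effect of segment_by concrete, here is a plain-Ruby sketch. It assumes that volatility means the sum of the absolute differences between consecutive values ordered by time; that definition, the `volatility` helper and the rows are assumptions for illustration, not the gem's implementation.

```ruby
# Sum of absolute deltas between consecutive values, ordered by time.
def volatility(rows)
  rows.sort_by { |row| row[:time] }
      .map { |row| row[:temp_c] }
      .each_cons(2)
      .sum { |previous, current| (current - previous).abs }
end

# Made-up rows for two cities.
rows = [
  { city_name: "New York", time: Time.utc(2022, 6, 1, 0), temp_c: 18.0 },
  { city_name: "New York", time: Time.utc(2022, 6, 1, 1), temp_c: 21.0 },
  { city_name: "New York", time: Time.utc(2022, 6, 1, 2), temp_c: 19.0 },
  { city_name: "Lisbon",   time: Time.utc(2022, 6, 1, 0), temp_c: 22.0 },
  { city_name: "Lisbon",   time: Time.utc(2022, 6, 1, 1), temp_c: 22.5 }
]

# Without a segment, deltas would mix both cities; segmenting keeps them apart.
by_city = rows
  .group_by { |row| row[:city_name] }
  .transform_values { |city_rows| volatility(city_rows) }

by_city.each { |city, value| puts "#{city}: #{value}" }
```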
LTTB

LTTB (Largest Triangle Three Buckets) is a downsampling method that reduces the number of points to a threshold while preserving the visual shape of the series.
WeatherMetric.lttb(threshold: 50)
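For intuition, the algorithm can be sketched in plain Ruby. This is a simplified, unoptimized version under stated assumptions: points are `[x, y]` pairs with numeric `x` (for example epoch seconds), and the `lttb` function below is a hypothetical standalone helper, not the gem's or the Toolkit's code.

```ruby
# Largest Triangle Three Buckets: keep the first and last points and, for each
# bucket in between, the point forming the largest triangle with the previously
# selected point and the average of the next bucket.
def lttb(data, threshold)
  return data if threshold >= data.size || threshold < 3

  every = (data.size - 2).to_f / (threshold - 2) # bucket width
  sampled = [data.first]
  selected = 0 # index of the last selected point

  (0...(threshold - 2)).each do |i|
    # Average point of the next bucket (or the last point for the final bucket).
    avg_from = ((i + 1) * every).floor + 1
    avg_to = [((i + 2) * every).floor + 1, data.size].min
    next_bucket = data[avg_from...avg_to]
    avg_x = next_bucket.sum { |x, _| x } / next_bucket.size.to_f
    avg_y = next_bucket.sum { |_, y| y } / next_bucket.size.to_f

    # Candidates in the current bucket.
    from = (i * every).floor + 1
    to = ((i + 1) * every).floor + 1
    ax, ay = data[selected]

    # Twice the triangle area is enough for comparing candidates.
    selected = (from...to).max_by do |j|
      x, y = data[j]
      ((ax - avg_x) * (y - ay) - (ax - x) * (avg_y - ay)).abs
    end
    sampled << data[selected]
  end

  sampled << data.last
end

# Ten made-up points with a single spike at x = 4.
points = (0..9).map { |x| [x, x == 4 ? 10 : 0] }
downsampled = lttb(points, 4)
p downsampled
```

The spike survives the downsampling, which is the point of the method: it prefers the points that matter visually over evenly spaced ones.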
LTTB web

Comparison to Ruby.
https://jonatas.github.io/timescaledb/toolkit_lttb_tutorial/
OHLC

Open, High, Low and Close values for grouped data (the values behind a candlestick chart).
ohlc = ny.select("time_bucket('1y',time) as time,
  toolkit_experimental.ohlc(time, temp_c)").group(1)
WeatherMetric.from("(#{ohlc.to_sql}) AS ohlc")
  .select("time,
    toolkit_experimental.open(ohlc),
    toolkit_experimental.high(ohlc),
    toolkit_experimental.low(ohlc),
    toolkit_experimental.close(ohlc)")
  .map(&:attributes)
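What the four values mean can be shown in a few lines of plain Ruby. A minimal sketch with made-up readings; the `ohlc` helper is hypothetical and only mirrors the idea, it does not call the Toolkit.

```ruby
# Open and close follow time order; high and low are the extremes.
def ohlc(rows)
  values = rows.sort_by { |time, _| time }.map { |_, value| value }
  { open: values.first, high: values.max, low: values.min, close: values.last }
end

# Made-up [time, temp_c] readings spanning two years.
readings = [
  [Time.utc(2021, 3, 1),   4.0],
  [Time.utc(2021, 7, 1),   29.0],
  [Time.utc(2021, 12, 31), -2.0],
  [Time.utc(2022, 1, 15),  -5.0],
  [Time.utc(2022, 5, 30),  24.0]
]

# Group by year, like time_bucket('1y', time) in the query above.
yearly = readings
  .group_by { |time, _| time.year }
  .transform_values { |rows| ohlc(rows) }

yearly.each { |year, candle| puts "#{year}: #{candle}" }
```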
Extra Resources


https://ideia.me/using-the-timescale-gem-with-ruby
https://ideia.me/timescale-continuous-aggregates-with-ruby
https://github.com/jonatas/timescaledb
https://timescale.com/community

Thanks


@jonatasdp on {Twitter,Instagram,LinkedIn}
GitHub: @jonatas

Jônatas Davi Paganini