@ClaytonJY
Last active January 28, 2021 00:19
Timescale scans all chunks when joining more than one key!

Timescale Selection Query Plans

I've seen some odd behavior trying to select specific items from Timescale hypertables, which I reproduce here with randomized data.

Setup

We'll need a vanilla Timescale instance; I like Docker.

# start a PG12 instance w/ Timescale 2.0
docker run --rm -d -p 5432:5432 -e POSTGRES_PASSWORD=password --name timescale timescale/timescaledb:latest-pg12

# connect to it
docker exec -it timescale psql -U postgres

First we need data, with both a timestamp and an ID column, as well as some value columns.

We'll create six months' worth of per-minute data for 10 devices, adding some random perturbation to the timestamps. We'll also add a primary key on (sensor_id, time) and make the table a hypertable.

CREATE TABLE sensor_data AS
SELECT
  time + (interval '1 minute' * random()) AS time,
  sensor_id,
  random() AS cpu,
  random()*100 AS temperature
FROM
  generate_series(now() - interval '6 months', now(), interval '1 minute') AS g1(time),
  generate_series(1,10,1) AS g2(sensor_id)
ORDER BY time;

ALTER TABLE sensor_data ADD CONSTRAINT sensor_data_pk PRIMARY KEY (sensor_id, time);

SELECT FROM create_hypertable('sensor_data', 'time', migrate_data=>true);

SELECT * FROM timescaledb_information.hypertables;
 hypertable_schema | hypertable_name |  owner   | num_dimensions | num_chunks | compression_enabled | is_distributed | replication_factor | data_nodes | tablespaces
-------------------+-----------------+----------+----------------+------------+---------------------+----------------+--------------------+------------+-------------
 public            | sensor_data     | postgres |              1 |         27 | f                   | f              |                    |            |
(1 row)

Let's see some of the oldest values, from chunks we expect are on disk.

SELECT * FROM sensor_data
ORDER BY time
LIMIT 10;
             time              | sensor_id |         cpu         |    temperature
-------------------------------+-----------+---------------------+--------------------
 2020-07-27 22:44:29.963116+00 |         5 |  0.9907297858145725 |  2.443502657772356
 2020-07-27 22:44:30.884572+00 |         9 | 0.36664094875715136 |  94.92922018026597
 2020-07-27 22:44:33.943573+00 |         6 |  0.6114390001389793 | 61.091366210785125
 2020-07-27 22:44:36.798245+00 |         8 |  0.6320730080727373 |  8.323516658384378
 2020-07-27 22:44:54.981908+00 |         7 |  0.9271099918152501 | 11.218798468656033
 2020-07-27 22:45:02.815279+00 |         4 |  0.5134860056074046 | 11.483146614569861
 2020-07-27 22:45:05.306625+00 |         1 |  0.5120288493981846 | 30.728807731221153
 2020-07-27 22:45:23.069828+00 |         3 |  0.2670812694924365 | 28.216946641256513
 2020-07-27 22:45:25.775646+00 |         2 |  0.9039883996903981 |  27.59812841150371
 2020-07-27 22:45:27.690306+00 |        10 |  0.8509384567249754 |  60.05163794338095
(10 rows)

The Problem

If we want to select the values for a known ID & time, we can stick them in the WHERE clause, and we get a nice query plan:

EXPLAIN
SELECT * FROM sensor_data
WHERE sensor_id = 5
  AND time = '2020-07-27 22:44:29.963116+00';
                                                QUERY PLAN
----------------------------------------------------------------------------------------------------------
 Index Scan using "1_1_sensor_data_pk" on _hyper_1_1_chunk  (cost=0.29..2.51 rows=1 width=28)
   Index Cond: ((sensor_id = 5) AND ("time" = '2020-07-27 22:44:29.963116+00'::timestamp with time zone))
(2 rows)

There are other ways to do this that will generalize better to multiple points, like joining on a table of our known key-pairs:

EXPLAIN
WITH keys AS (
  SELECT * FROM (VALUES
    (5, '2020-07-27 22:44:29.963116+00'::TIMESTAMPTZ)
  ) AS _(sensor_id, time)
)
SELECT * FROM sensor_data
INNER JOIN keys
  USING (sensor_id, time)
;
                                                QUERY PLAN
----------------------------------------------------------------------------------------------------------
 Index Scan using "1_1_sensor_data_pk" on _hyper_1_1_chunk  (cost=0.29..2.51 rows=1 width=28)
   Index Cond: ((sensor_id = 5) AND ("time" = '2020-07-27 22:44:29.963116+00'::timestamp with time zone))
(2 rows)

Still good! What if we expand to another point? We'll also show real timings now with ANALYZE.

EXPLAIN ANALYZE
WITH keys AS (
  SELECT * FROM (VALUES
    (5, '2020-07-27 22:44:29.963116+00'::TIMESTAMPTZ),
    (9, '2020-07-27 22:44:30.884572+00')
  ) AS _(sensor_id, time)
)
SELECT * FROM sensor_data
INNER JOIN keys
  USING (sensor_id, time)
;
                                                                     QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.29..143.00 rows=132 width=28) (actual time=0.014..0.021 rows=2 loops=1)
   ->  Values Scan on "*VALUES*"  (cost=0.00..0.03 rows=2 width=12) (actual time=0.001..0.002 rows=2 loops=1)
   ->  Append  (cost=0.29..71.22 rows=27 width=28) (actual time=0.007..0.007 rows=1 loops=2)
         ->  Index Scan using "1_1_sensor_data_pk" on _hyper_1_1_chunk  (cost=0.29..2.51 rows=1 width=28) (actual time=0.006..0.006 rows=1 loops=2)
               Index Cond: ((sensor_id = "*VALUES*".column1) AND ("time" = "*VALUES*".column2))
         ->  Index Scan using "2_2_sensor_data_pk" on _hyper_1_2_chunk  (cost=0.42..2.64 rows=1 width=28) (never executed)
               Index Cond: ((sensor_id = "*VALUES*".column1) AND ("time" = "*VALUES*".column2))
<...abbreviated...>
         ->  Index Scan using "26_26_sensor_data_pk" on _hyper_1_26_chunk  (cost=0.42..2.64 rows=1 width=28) (never executed)
               Index Cond: ((sensor_id = "*VALUES*".column1) AND ("time" = "*VALUES*".column2))
         ->  Index Scan using "27_27_sensor_data_pk" on _hyper_1_27_chunk  (cost=0.42..2.64 rows=1 width=28) (never executed)
               Index Cond: ((sensor_id = "*VALUES*".column1) AND ("time" = "*VALUES*".column2))
 Planning Time: 5.322 ms
 Execution Time: 0.492 ms
(59 rows)

We scanned every chunk index!

The following approaches produce the same query plan:

-- lateral join
EXPLAIN ANALYZE
WITH keys AS (
  SELECT * FROM (VALUES
    (5, '2020-07-27 22:44:29.963116+00'::TIMESTAMPTZ),
    (9, '2020-07-27 22:44:30.884572+00')
  ) AS _(sensor_id, time)
)
SELECT sensor_data.* FROM keys
INNER JOIN LATERAL (
  SELECT * FROM sensor_data
  WHERE sensor_id = keys.sensor_id
    AND time      = keys.time
) AS sensor_data ON TRUE
;

-- UNION ALL + INNER JOIN
EXPLAIN ANALYZE
WITH keys AS (
  SELECT * FROM (VALUES
    (5, '2020-07-27 22:44:29.963116+00'::TIMESTAMPTZ),
    (9, '2020-07-27 22:44:30.884572+00')
  ) AS _(sensor_id, time)
)
SELECT * FROM sensor_data
INNER JOIN (SELECT * FROM keys OFFSET 0 LIMIT 1) AS _
  USING (sensor_id, time)
UNION ALL
SELECT * FROM sensor_data
INNER JOIN (SELECT * FROM keys OFFSET 1 LIMIT 1) AS _
  USING (sensor_id, time)
;

Also note that forcing the CTE to materialize (hello, PG12!) doesn't help with any of these.
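For reference, the materialized variant of the join query looks like this; on PG12 the `MATERIALIZED` keyword forces the CTE to be computed once up front, but the resulting plan still scans every chunk index:

```sql
EXPLAIN ANALYZE
WITH keys AS MATERIALIZED (
  SELECT * FROM (VALUES
    (5, '2020-07-27 22:44:29.963116+00'::TIMESTAMPTZ),
    (9, '2020-07-27 22:44:30.884572+00')
  ) AS _(sensor_id, time)
)
SELECT * FROM sensor_data
INNER JOIN keys
  USING (sensor_id, time)
;
```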

Solutions

Taking a UNION ALL approach with an INNER JOIN still scans every chunk, but UNION ALL-ing individual WHERE-clause selections forces the query planner to behave better:

EXPLAIN ANALYZE
SELECT * FROM sensor_data
WHERE sensor_id = 5
  AND time = '2020-07-27 22:44:29.963116+00'
UNION ALL
SELECT * FROM sensor_data
WHERE sensor_id = 9
  AND time = '2020-07-27 22:44:30.884572+00'
;
                                                                           QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Append  (cost=0.29..5.04 rows=2 width=28) (actual time=0.011..0.018 rows=2 loops=1)
   ->  Index Scan using "1_1_sensor_data_pk" on _hyper_1_1_chunk  (cost=0.29..2.51 rows=1 width=28) (actual time=0.010..0.011 rows=1 loops=1)
         Index Cond: ((sensor_id = 5) AND ("time" = '2020-07-27 22:44:29.963116+00'::timestamp with time zone))
   ->  Index Scan using "1_1_sensor_data_pk" on _hyper_1_1_chunk _hyper_1_1_chunk_1  (cost=0.29..2.51 rows=1 width=28) (actual time=0.005..0.005 rows=1 loops=1)
         Index Cond: ((sensor_id = 9) AND ("time" = '2020-07-27 22:44:30.884572+00'::timestamp with time zone))
 Planning Time: 0.629 ms
 Execution Time: 0.039 ms
(7 rows)

By forcing the query planner to only look in the appropriate chunks, we get a roughly 10x speedup in execution time. As you can imagine, the join approach only gets worse as the number of chunks grows!

We can get a slightly different but similarly fast plan by collapsing the UNION ALL into a single OR'd WHERE clause:

EXPLAIN ANALYZE
SELECT * FROM sensor_data
WHERE (sensor_id = 5 AND time = '2020-07-27 22:44:29.963116+00')
   OR (sensor_id = 9 AND time = '2020-07-27 22:44:30.884572+00')
;
                                                                                                 QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on _hyper_1_1_chunk  (cost=2.80..3.92 rows=1 width=28) (actual time=0.012..0.013 rows=2 loops=1)
   Recheck Cond: (((sensor_id = 5) AND ("time" = '2020-07-27 22:44:29.963116+00'::timestamp with time zone)) OR ((sensor_id = 9) AND ("time" = '2020-07-27 22:44:30.884572+00'::timestamp with time zone)))
   Heap Blocks: exact=1
   ->  BitmapOr  (cost=2.80..2.80 rows=1 width=0) (actual time=0.009..0.010 rows=0 loops=1)
         ->  Bitmap Index Scan on "1_1_sensor_data_pk"  (cost=0.00..1.40 rows=1 width=0) (actual time=0.006..0.006 rows=1 loops=1)
               Index Cond: ((sensor_id = 5) AND ("time" = '2020-07-27 22:44:29.963116+00'::timestamp with time zone))
         ->  Bitmap Index Scan on "1_1_sensor_data_pk"  (cost=0.00..1.40 rows=1 width=0) (actual time=0.003..0.003 rows=1 loops=1)
               Index Cond: ((sensor_id = 9) AND ("time" = '2020-07-27 22:44:30.884572+00'::timestamp with time zone))
 Planning Time: 0.646 ms
 Execution Time: 0.040 ms
(10 rows)

Unfortunately, if you have a table of values of interest (suppose you cache the last time per sensor with an upserting trigger into a separate table), it's very unnatural to take either of the fast approaches above. Dynamic query building is certainly an option, but not a great one.
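One way to do that dynamic building without leaving the database is to let Postgres generate the fast UNION ALL text itself with `format()` and `string_agg()`. This is just a sketch: `last_seen` is a hypothetical table caching one (sensor_id, time) pair per sensor, and the generated text would still need to be run via `EXECUTE` in a PL/pgSQL function or sent back from the application:

```sql
-- build the per-key UNION ALL query text from a (hypothetical) key table;
-- %L quotes each timestamp as a literal, %s interpolates the integer ID
SELECT string_agg(
         format(
           'SELECT * FROM sensor_data WHERE sensor_id = %s AND time = %L',
           sensor_id, time
         ),
         E'\nUNION ALL\n'
       ) AS query_text
FROM last_seen;
```

This keeps the chunk-excluding plan shape of the UNION ALL approach while still driving the selection from a table, at the cost of a round trip (or a PL/pgSQL wrapper) to execute the generated text.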
