Skip to content

Instantly share code, notes, and snippets.

@judell
Last active November 17, 2023 03:19
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save judell/fb3799d2a356c9377f4f39185719f887 to your computer and use it in GitHub Desktop.
Save judell/fb3799d2a356c9377f4f39185719f887 to your computer and use it in GitHub Desktop.
join-vs-where

case 1: match literal followed_account_id, finds 3 rows

explain analyze 

with data as (
  select 
    to_char(created_at, 'YYYY-MM') as created
  from 
    mastodon_follower
  where 
    followed_account_id = '110147269591397842'
 limit 3
) 
select 
  created, 
  count(*)
from 
  data
group by
  created
order by
  created

+---------------------------------------------------------------------------------------------------------------------->
| QUERY PLAN                                                                                                           >
+---------------------------------------------------------------------------------------------------------------------->
| GroupAggregate  (cost=60000000.06..60000000.11 rows=3 width=40) (actual time=2.584..2.587 rows=3 loops=1)            >
|   Group Key: (to_char(mastodon_follower.created_at, 'YYYY-MM'::text))                                                >
|   ->  Sort  (cost=60000000.06..60000000.07 rows=3 width=32) (actual time=2.577..2.578 rows=3 loops=1)                >
|         Sort Key: (to_char(mastodon_follower.created_at, 'YYYY-MM'::text))                                           >
|         Sort Method: quicksort  Memory: 25kB                                                                         >
|         ->  Limit  (cost=0.00..60000000.01 rows=3 width=32) (actual time=2.484..2.565 rows=3 loops=1)                >
|               ->  Foreign Scan on mastodon_follower  (cost=0.00..20000000002500.00 rows=1000000 width=32) (actual tim>
|                     Filter: (followed_account_id = '110147269591397842'::text)                                       >
| Planning Time: 4.437 ms                                                                                              >
| Execution Time: 3.783 ms                                                                                             >
+----------------------------------------------------------------------------------------------------------------------

case 2: where followed_account_id = SUBQUERY, finds 0 rows

first, observe:

> select ( select id::text from mastodon_my_account limit 1) = '110147269591397842' as match
+-------+
| match |
+-------+
| true  |
+-------+
explain analyze 

with data as (
  select 
    to_char(created_at, 'YYYY-MM') as created
  from 
    mastodon_follower
  where 
    followed_account_id::text = ( select id::text from mastodon_my_account limit 1)
 limit 3
) 
select 
  created, 
  count(*)
from 
  data
group by
  created
order by
  created

+---------------------------------------------------------------------------------------------------------------------->
| QUERY PLAN                                                                                                           >
+---------------------------------------------------------------------------------------------------------------------->
| GroupAggregate  (cost=70000000.06..70000000.11 rows=3 width=40) (actual time=220.996..220.999 rows=0 loops=1)        >
|   Group Key: (to_char(mastodon_follower.created_at, 'YYYY-MM'::text))                                                >
|   ->  Sort  (cost=70000000.06..70000000.07 rows=3 width=32) (actual time=220.995..220.996 rows=0 loops=1)            >
|         Sort Key: (to_char(mastodon_follower.created_at, 'YYYY-MM'::text))                                           >
|         Sort Method: quicksort  Memory: 25kB                                                                         >
|         ->  Limit  (cost=10000000.00..70000000.01 rows=3 width=32) (actual time=220.981..220.983 rows=0 loops=1)     >
|               InitPlan 1 (returns $0)                                                                                >
|                 ->  Limit  (cost=0.00..10000000.00 rows=1 width=100) (never executed)                                >
|                       ->  Foreign Scan on mastodon_my_account  (cost=0.00..10000000000000.00 rows=1000000 width=100) >
|               ->  Foreign Scan on mastodon_follower  (cost=0.00..20000000002500.00 rows=1000000 width=32) (actual tim>
|                     Filter: (followed_account_id = $0)                                                               >
| Planning Time: 5.547 ms                                                                                              >
| Execution Time: 222.762 ms                                                                                           >
+---------------------------------------------------------------------------------------------------------------------->

case 3: join via cte finds 3 rows

explain analyze

with my_account_id as (
  select id::text from mastodon_my_account limit 1
)
select 
  to_char(mf.created_at, 'yyyy-mm') as created, 
  count(*)
from 
  mastodon_follower mf
join 
  my_account_id mai on mf.followed_account_id::text = mai.id
group by 
  created
order by 
  created
limit 3;

+---------------------------------------------------------------------------------------------------------------------------------------------->
| QUERY PLAN                                                                                                                                   >
+---------------------------------------------------------------------------------------------------------------------------------------------->
| Limit  (cost=10000719.72..10000720.32 rows=3 width=40) (actual time=19385.227..19385.237 rows=3 loops=1)                                     >
|   ->  GroupAggregate  (cost=10000719.72..10000759.72 rows=200 width=40) (actual time=19385.226..19385.235 rows=3 loops=1)                    >
|         Group Key: (to_char(mf.created_at, 'yyyy-mm'::text))                                                                                 >
|         ->  Sort  (cost=10000719.72..10000732.22 rows=5000 width=32) (actual time=19385.216..19385.220 rows=47 loops=1)                      >
|               Sort Key: (to_char(mf.created_at, 'yyyy-mm'::text))                                                                            >
|               Sort Method: quicksort  Memory: 130kB                                                                                          >
|               ->  Nested Loop  (cost=0.00..10000412.53 rows=5000 width=32) (actual time=1412.765..19383.986 rows=1732 loops=1)               >
|                     ->  Limit  (cost=0.00..10000000.00 rows=1 width=100) (actual time=883.771..883.773 rows=1 loops=1)                       >
|                           ->  Foreign Scan on mastodon_my_account  (cost=0.00..10000000000000.00 rows=1000000 width=100) (actual time=883.770>
|                     ->  Foreign Scan on mastodon_follower mf  (cost=0.00..400.00 rows=2 width=200) (actual time=528.414..18497.811 rows=1732 >
|                           Filter: (followed_account_id = mastodon_my_account.id)                                                             >
| Planning Time: 6.612 ms                                                                                                                      >
| Execution Time: 19387.510 ms                                                                                                                 >
+---------------------------------------------------------------------------------------------------------------------------------------------->

case 4: join via temp table finds 3 rows

create temp table temp_my_account_id as
select id::text from mastodon_my_account limit 1;


explain analyze

select 
  to_char(mf.created_at, 'yyyy-mm') as created, 
  count(*)
from 
  mastodon_follower mf
join 
  temp_my_account_id mai on mf.followed_account_id::text = mai.id
group by 
  created
order by 
  created
limit 3;


+---------------------------------------------------------------------------------------------------------------------------------------------->
| QUERY PLAN                                                                                                                                   >
+---------------------------------------------------------------------------------------------------------------------------------------------->
| Limit  (cost=595055.88..595055.89 rows=3 width=40) (actual time=19292.631..19292.633 rows=3 loops=1)                                         >
|   ->  Sort  (cost=595055.88..595056.38 rows=200 width=40) (actual time=19292.630..19292.631 rows=3 loops=1)                                  >
|         Sort Key: (to_char(mf.created_at, 'yyyy-mm'::text))                                                                                  >
|         Sort Method: top-N heapsort  Memory: 25kB                                                                                            >
|         ->  HashAggregate  (cost=595050.80..595053.30 rows=200 width=40) (actual time=19292.605..19292.613 rows=81 loops=1)                  >
|               Group Key: to_char(mf.created_at, 'yyyy-mm'::text)                                                                             >
|               Batches: 1  Memory Usage: 40kB                                                                                                 >
|               ->  Nested Loop  (cost=0.00..561050.80 rows=6800000 width=32) (actual time=1188.948..19291.621 rows=1732 loops=1)              >
|                     ->  Seq Scan on temp_my_account_id mai  (cost=0.00..23.60 rows=1360 width=32) (actual time=0.009..0.013 rows=1 loops=1)  >
|                     ->  Foreign Scan on mastodon_follower mf  (cost=0.00..400.00 rows=2 width=200) (actual time=1188.527..19289.546 rows=1732>
|                           Filter: (followed_account_id = mai.id)                                                                             >
| Planning Time: 5.384 ms                                                                                                                      >
| Execution Time: 19293.293 ms                                                                                                                 >
+---------------------------------------------------------------------------------------------------------------------------------------------->

case 5: where followed_account_id = SUBQUERY (of temp table), finds 0 rows

explain analyze

select 
  to_char(mf.created_at, 'yyyy-mm') as created, 
  count(*)
from 
  mastodon_follower mf
where 
  followed_account_id::text = ( select id::text from temp_my_account_id limit 1)
group by 
  created
order by 
  created
limit 3;


+---------------------------------------------------------------------------------------------------------------------------------------------->
| QUERY PLAN                                                                                                                                   >
+---------------------------------------------------------------------------------------------------------------------------------------------->
| Limit  (cost=20000000126083.86..20000000126196.40 rows=3 width=40) (actual time=1.741..1.743 rows=0 loops=1)                                 >
|   InitPlan 1 (returns $0)                                                                                                                    >
|     ->  Limit  (cost=0.00..0.02 rows=1 width=32) (never executed)                                                                            >
|           ->  Seq Scan on temp_my_account_id  (cost=0.00..23.60 rows=1360 width=32) (never executed)                                         >
|   ->  GroupAggregate  (cost=20000000126083.84..20000000133586.34 rows=200 width=40) (actual time=1.740..1.741 rows=0 loops=1)                >
|         Group Key: (to_char(mf.created_at, 'yyyy-mm'::text))                                                                                 >
|         ->  Sort  (cost=20000000126083.84..20000000128583.84 rows=1000000 width=32) (actual time=1.739..1.739 rows=0 loops=1)                >
|               Sort Key: (to_char(mf.created_at, 'yyyy-mm'::text))                                                                            >
|               Sort Method: quicksort  Memory: 25kB                                                                                           >
|               ->  Foreign Scan on mastodon_follower mf  (cost=0.00..20000000002500.00 rows=1000000 width=32) (actual time=1.732..1.732 rows=0>
|                     Filter: (followed_account_id = $0)                                                                                       >
| Planning Time: 4.346 ms                                                                                                                      >
| Execution Time: 2.193 ms                                                                                                                     >
+---------------------------------------------------------------------------------------------------------------------------------------------->

analysis

prompt

Consider these 5 cases. It looks like a join strategy against a foreign table makes it more likely the foreign table will be used than a where strategy.

First, do you agree the evidence suggests that?

Second, if so, why?

response (chatgpt)

The evidence from your five cases does suggest that using a join strategy (Cases 3 and 4) is more effective in ensuring the foreign table (mastodon_my_account) is factored into the query results compared to using a subquery in the WHERE clause (Cases 2 and 5). Here's a closer look at why this might be happening:

Analysis of the Cases

Case 1 (Literal Comparison): Works as expected. This is straightforward and doesn't involve any subqueries or joins.

Case 2 (Subquery in WHERE Clause): Returns 0 rows despite the subquery being logically equivalent to the literal value used in Case 1. The subquery isn’t executed as part of the query plan.

Case 3 (Join with CTE): Successfully returns rows. The subquery result is used in a CTE, which is then joined to the main table.

Case 4 (Join with Temporary Table): Also successfully returns rows. The subquery result is stored in a temporary table, which is then joined.

Case 5 (Subquery in WHERE Clause with Temporary Table): Returns 0 rows, similar to Case 2, despite using a temporary table.

Why Join Strategy Seems More Effective

Execution Plan Handling: SQL query optimizers might handle joins differently compared to subqueries in WHERE clauses. In joins, the optimizer can clearly see the relationship between two datasets and might be better at executing these queries, especially with foreign tables.

Materialization of Subqueries: In Cases 3 and 4, the subquery results are materialized (made concrete) either in a CTE or a temporary table. This means the result of the subquery is computed and stored, and then the stored result is used in the join. In contrast, subqueries in WHERE clauses (Cases 2 and 5) are typically evaluated in the context of each row, and it's possible the optimizer is failing to execute these subqueries properly, especially when dealing with foreign scans.

Foreign Table Optimization: When dealing with foreign tables, the database's query planner might have limitations or specific optimizations that affect how subqueries are executed. Joins, particularly with materialized results (like CTEs or temp tables), might be more predictable and effectively utilized by the query planner.

Subquery Execution Context: Subqueries in WHERE clauses can sometimes be optimized out or not executed as expected, depending on the query planner's logic. This seems to be happening in your cases, where the subquery, although valid, is marked as never executed.

Conclusion

In summary, join strategies appear more reliable in your scenario, likely due to how the SQL query optimizer handles joins versus subqueries, especially in the context of foreign tables. Materializing the subquery results before joining ensures that the optimizer has a concrete dataset to work with, leading to more predictable and effective query execution.

response (sourcegraph cody)

Based on the provided code snippets, I agree that the evidence suggests a join strategy against a foreign table makes it more likely the foreign table will be used compared to a WHERE strategy.

There are a few potential reasons for this:

With a join, the query planner knows it needs to access the foreign table to perform the join, so it will plan to use the foreign table. With a WHERE clause, the planner may think it can get the results it needs more efficiently from just the local data if statistics are not accurate.

Joins may get planned before WHERE clauses, so the foreign table access gets baked into the plan early. WHERE clauses may get added later as a filter on top of a plan that doesn't include the foreign table.

Join conditions can provide more specific criteria (joining on a key column) versus a WHERE clause with a broader filter. This may make the planner more confident that limiting the foreign table access with the join condition will produce the correct result set.

Joining forces accessing the foreign table because the query needs data from both sources. A WHERE clause may be evaluated just against local data first before considering the foreign table.

So in summary, the join approach signals clearly that the foreign table is needed upfront, while the WHERE approach leaves more room for optimization and avoiding the foreign table if possible. But the specifics likely depend on the query planner implementation.

response (copilot chat)

Yes, based on the provided cases, it appears that using a join strategy against a foreign table is more likely to result in the foreign table being used compared to a where strategy.

The evidence suggests this because in cases 1, 3, and 4 where a join strategy is used, the foreign table (mastodon_follower) is explicitly mentioned in the query plan and is involved in the execution of the query. In contrast, in cases 2 and 5 where a where strategy is used, the foreign table is not explicitly mentioned in the query plan and is not involved in the execution of the query.

This suggests that the join strategy explicitly includes the foreign table in the query plan and execution, making it more likely for the foreign table to be used. On the other hand, the where strategy does not explicitly include the foreign table, potentially leading to the optimizer choosing a different execution plan that does not involve the foreign table.

response (claude)

Yes, I agree the evidence from the 5 cases suggests that joining against a foreign table makes it more likely that the foreign table will be used compared to using a WHERE clause strategy.

The key difference seems to be that when using a join, PostgreSQL is able to see that it needs to scan the foreign table in order to evaluate the join condition. This leads it to include the foreign table scan in the query plan.

In contrast, when using a WHERE clause with a subquery, PostgreSQL first executes the subquery to get the value to compare against. It doesn't "see" the foreign table at that stage. So the subquery is planned separately from the main query, and the foreign table scan doesn't get included in the main query plan.

As to why this occurs - it likely has to do with how PostgreSQL plans queries recursively and optimizes them based on the information it has available at each planning stage. When doing a join, it has visibility into both tables at the same stage so it can optimize them together. With a subquery, the optimization happens in isolation so it misses potential optimizations across the queries.

The use of temporary tables seems to behave more like a join, giving PostgreSQL the full picture to optimize the query plan. So in general, joining appears to work better than a WHERE clause subquery for including foreign data scans in the optimized query plan.

response (unblocked)

Based on the provided SQL execution plans, it does appear that using a JOIN strategy against a foreign table is more likely to result in the foreign table being used than a WHERE strategy.

The reason for this behavior is related to how PostgreSQL (the underlying database for Steampipe) optimizes queries. When you use a JOIN, PostgreSQL can use statistics about the tables involved to make better decisions about how to execute the query. This can include choosing to scan the foreign table if it determines that this would be more efficient.

In contrast, when you use a WHERE clause with a subquery, PostgreSQL has to execute the subquery first and then use the result in the main query. This can be less efficient, especially if the subquery returns a large number of rows. In your case, the subquery is returning a single value, but PostgreSQL doesn't know this until it executes the subquery. As a result, it may choose a less efficient plan that doesn't involve scanning the foreign table.

It's also worth noting that the cost estimates in the execution plans are significantly higher for the WHERE strategies (cases 2 and 5) compared to the JOIN strategies (cases 3 and 4). This further suggests that PostgreSQL considers the JOIN strategies to be more efficient.

However, these are general observations and the actual performance can depend on various factors, including the specific data and indexes in your tables, the PostgreSQL configuration, and the complexity of your queries.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment