How to do multi-threaded / parallel operations with dbt

@jeremyyeo · Last active March 26, 2024

🚨 WARNING 🚨
Just because you could does not mean you should!

The examples below use PostgreSQL; however, the same idea applies across adapters (Snowflake, etc.).

To run arbitrary SQL commands with dbt, we would typically use hooks or operations. Imagine for a second that we have to run an operation that performs some action multiple times:

insert into foo_1 ...;
insert into foo_2 ...;
insert into foo_3 ...;
insert into foo_4 ...;
insert into foo_5 ...;
insert into foo_6 ...;

If we ran those SQL statements in a single run-operation, they would be executed serially. Let's look at an example.

First, let's create a function that simulates a "slow" operation:

create or replace function wait_then_return(t integer)
returns integer as $$
begin
  perform pg_sleep(t); -- sleep for t seconds.
  return t;            -- return t.
end;
$$ language plpgsql;

With that function created, any query that calls it takes t seconds to execute:

postgres=# select wait_then_return(10) as c;
 c  
----
 10
(1 row)

Time: 10012.288 ms (00:10.012)

postgres=# select wait_then_return(10) as c; select wait_then_return(10) as c; select wait_then_return(10) as c;
 c  
----
 10
(1 row)

Time: 10012.910 ms (00:10.013)
 c  
----
 10
(1 row)

Time: 10012.328 ms (00:10.012)
 c  
----
 10
(1 row)

Time: 10012.681 ms (00:10.013)

In a dbt project, things would look something like:

# dbt_project.yml
name: my_dbt_project
profile: all
config-version: 2
version: "1.0.0"
-- macros/op.sql
{% macro op() %}
    {% do log('Procedure starting: ' ~ modules.datetime.datetime.now(), true) %}
    {% set query %}
        select wait_then_return(10) as c;
        select wait_then_return(10) as c;
        select wait_then_return(10) as c;
        select wait_then_return(10) as c;
        select wait_then_return(10) as c;
        select wait_then_return(10) as c;
    {% endset %}
    {% do run_query(query) %}
    {% do log('Procedure finished: ' ~  modules.datetime.datetime.now(), true) %}
{% endmacro %}
$ dbt run-operation op
01:30:25  Running with dbt=1.6.9
01:30:25  Registered adapter: postgres=1.6.9
01:30:26  Found 0 sources, 0 exposures, 0 metrics, 353 macros, 0 groups, 0 semantic models
01:30:26  Procedure starting: 2023-11-24 14:30:26.350642
01:31:26  Procedure finished: 2023-11-24 14:31:26.425314

We can see it took 1 minute to execute those 6 queries serially. What if we want to run them in parallel? dbt has threads; however, run-operations cannot be multi-threaded. Only the building of models can take advantage of multiple threads.

Since dbt can build multiple models at the same time, let's use that to our advantage.

  1. Create a new materialization:
-- macros/m13n.sql
{% materialization m13n, default %}
    {%- set identifier = model['alias'] -%}
    {%- set target_relation = api.Relation.create(identifier=identifier, schema=schema, database=database, type='view') -%}
    {% call statement('main') -%}
        {{ sql }}
    {%- endcall %}
    {{ return({'relations': [target_relation]}) }}
{% endmaterialization %}

This materialization does nothing except run the templated SQL statement {{ sql }}, which comes from the SQL text in a model's .sql file.
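As an aside, instead of configuring the materialization project-wide (as in the next step), a single model could also opt in via its own config block; a minimal sketch:

-- models/foo_1.sql
{{ config(materialized='m13n') }}
select wait_then_return(10) as c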

  2. Create multiple models that all use our custom materialization:
# dbt_project.yml
name: my_dbt_project
profile: all
config-version: 2
version: "1.0.0"

models:
  my_dbt_project:
    +materialized: m13n
    
on-run-start: "{% do log('Procedure starting: ' ~ modules.datetime.datetime.now(), true) %}"
on-run-end: "{% do log('Procedure finished: ' ~  modules.datetime.datetime.now(), true) %}"

I've added some on-run-x hooks so we can see some timestamps being logged.

-- models/foo_1.sql
select wait_then_return(10) as c
-- models/foo_2.sql
select wait_then_return(10) as c
-- models/foo_3.sql
select wait_then_return(10) as c
-- models/foo_4.sql
select wait_then_return(10) as c
-- models/foo_5.sql
select wait_then_return(10) as c
-- models/foo_6.sql
select wait_then_return(10) as c

And now let's run:

$ dbt run --threads 1
01:43:48  Running with dbt=1.6.9
01:43:48  Registered adapter: postgres=1.6.9
01:43:49  Procedure starting: 2023-11-24 14:43:49.342662
01:43:49  Procedure finished: 2023-11-24 14:43:49.350480
01:43:49  Found 6 models, 2 operations, 0 sources, 0 exposures, 0 metrics, 354 macros, 0 groups, 0 semantic models
01:43:49  
01:43:49  
01:43:49  Running 1 on-run-start hook
01:43:49  Procedure starting: 2023-11-24 14:43:49.601148
01:43:49  1 of 1 START hook: my_dbt_project.on-run-start.0 ............................... [RUN]
01:43:49  1 of 1 OK hook: my_dbt_project.on-run-start.0 .................................. [OK in 0.00s]
01:43:49  
01:43:49  Concurrency: 1 threads (target='pg-local')
01:43:49  
01:43:49  1 of 6 START sql m13n model public.foo_1 ....................................... [RUN]
01:43:59  1 of 6 OK created sql m13n model public.foo_1 .................................. [SELECT 1 in 10.06s]
01:43:59  2 of 6 START sql m13n model public.foo_2 ....................................... [RUN]
01:44:09  2 of 6 OK created sql m13n model public.foo_2 .................................. [SELECT 1 in 10.06s]
01:44:09  3 of 6 START sql m13n model public.foo_3 ....................................... [RUN]
01:44:19  3 of 6 OK created sql m13n model public.foo_3 .................................. [SELECT 1 in 10.05s]
01:44:19  4 of 6 START sql m13n model public.foo_4 ....................................... [RUN]
01:44:29  4 of 6 OK created sql m13n model public.foo_4 .................................. [SELECT 1 in 10.06s]
01:44:29  5 of 6 START sql m13n model public.foo_5 ....................................... [RUN]
01:44:39  5 of 6 OK created sql m13n model public.foo_5 .................................. [SELECT 1 in 10.05s]
01:44:39  6 of 6 START sql m13n model public.foo_6 ....................................... [RUN]
01:44:49  6 of 6 OK created sql m13n model public.foo_6 .................................. [SELECT 1 in 10.05s]
01:44:50  
01:44:50  Running 1 on-run-end hook
01:44:50  Procedure finished: 2023-11-24 14:44:50.018001
01:44:50  1 of 1 START hook: my_dbt_project.on-run-end.0 ................................. [RUN]
01:44:50  1 of 1 OK hook: my_dbt_project.on-run-end.0 .................................... [OK in 0.00s]
01:44:50  
01:44:50  
01:44:50  Finished running 6 m13n models, 2 hooks in 0 hours 1 minutes and 0.60 seconds (60.60s).
01:44:50  
01:44:50  Completed successfully
01:44:50  
01:44:50  Done. PASS=6 WARN=0 ERROR=0 SKIP=0 TOTAL=6

$ dbt run --threads 6
01:45:13  Running with dbt=1.6.9
01:45:13  Registered adapter: postgres=1.6.9
01:45:14  Procedure starting: 2023-11-24 14:45:14.079126
01:45:14  Procedure finished: 2023-11-24 14:45:14.087990
01:45:14  Found 6 models, 2 operations, 0 sources, 0 exposures, 0 metrics, 354 macros, 0 groups, 0 semantic models
01:45:14  
01:45:14  
01:45:14  Running 1 on-run-start hook
01:45:14  Procedure starting: 2023-11-24 14:45:14.325350
01:45:14  1 of 1 START hook: my_dbt_project.on-run-start.0 ............................... [RUN]
01:45:14  1 of 1 OK hook: my_dbt_project.on-run-start.0 .................................. [OK in 0.00s]
01:45:14  
01:45:14  Concurrency: 6 threads (target='pg-local')
01:45:14  
01:45:14  1 of 6 START sql m13n model public.foo_1 ....................................... [RUN]
01:45:14  2 of 6 START sql m13n model public.foo_2 ....................................... [RUN]
01:45:14  3 of 6 START sql m13n model public.foo_3 ....................................... [RUN]
01:45:14  4 of 6 START sql m13n model public.foo_4 ....................................... [RUN]
01:45:14  5 of 6 START sql m13n model public.foo_5 ....................................... [RUN]
01:45:14  6 of 6 START sql m13n model public.foo_6 ....................................... [RUN]
01:45:24  2 of 6 OK created sql m13n model public.foo_2 .................................. [SELECT 1 in 10.33s]
01:45:24  1 of 6 OK created sql m13n model public.foo_1 .................................. [SELECT 1 in 10.34s]
01:45:24  4 of 6 OK created sql m13n model public.foo_4 .................................. [SELECT 1 in 10.33s]
01:45:24  6 of 6 OK created sql m13n model public.foo_6 .................................. [SELECT 1 in 10.32s]
01:45:24  3 of 6 OK created sql m13n model public.foo_3 .................................. [SELECT 1 in 10.34s]
01:45:24  5 of 6 OK created sql m13n model public.foo_5 .................................. [SELECT 1 in 10.33s]
01:45:24  
01:45:24  Running 1 on-run-end hook
01:45:24  Procedure finished: 2023-11-24 14:45:24.739074
01:45:24  1 of 1 START hook: my_dbt_project.on-run-end.0 ................................. [RUN]
01:45:24  1 of 1 OK hook: my_dbt_project.on-run-end.0 .................................... [OK in 0.00s]
01:45:24  
01:45:24  
01:45:24  Finished running 6 m13n models, 2 hooks in 0 hours 0 minutes and 10.58 seconds (10.58s).
01:45:24  
01:45:24  Completed successfully
01:45:24  
01:45:24  Done. PASS=6 WARN=0 ERROR=0 SKIP=0 TOTAL=6

$ dbt run --threads 3
01:45:38  Running with dbt=1.6.9
01:45:38  Registered adapter: postgres=1.6.9
01:45:39  Procedure starting: 2023-11-24 14:45:39.289995
01:45:39  Procedure finished: 2023-11-24 14:45:39.297111
01:45:39  Found 6 models, 2 operations, 0 sources, 0 exposures, 0 metrics, 354 macros, 0 groups, 0 semantic models
01:45:39  
01:45:39  
01:45:39  Running 1 on-run-start hook
01:45:39  Procedure starting: 2023-11-24 14:45:39.492161
01:45:39  1 of 1 START hook: my_dbt_project.on-run-start.0 ............................... [RUN]
01:45:39  1 of 1 OK hook: my_dbt_project.on-run-start.0 .................................. [OK in 0.00s]
01:45:39  
01:45:39  Concurrency: 3 threads (target='pg-local')
01:45:39  
01:45:39  1 of 6 START sql m13n model public.foo_1 ....................................... [RUN]
01:45:39  2 of 6 START sql m13n model public.foo_2 ....................................... [RUN]
01:45:39  3 of 6 START sql m13n model public.foo_3 ....................................... [RUN]
01:45:49  1 of 6 OK created sql m13n model public.foo_1 .................................. [SELECT 1 in 10.09s]
01:45:49  3 of 6 OK created sql m13n model public.foo_3 .................................. [SELECT 1 in 10.09s]
01:45:49  2 of 6 OK created sql m13n model public.foo_2 .................................. [SELECT 1 in 10.09s]
01:45:49  4 of 6 START sql m13n model public.foo_4 ....................................... [RUN]
01:45:49  5 of 6 START sql m13n model public.foo_5 ....................................... [RUN]
01:45:49  6 of 6 START sql m13n model public.foo_6 ....................................... [RUN]
01:45:59  4 of 6 OK created sql m13n model public.foo_4 .................................. [SELECT 1 in 10.10s]
01:45:59  5 of 6 OK created sql m13n model public.foo_5 .................................. [SELECT 1 in 10.10s]
01:45:59  6 of 6 OK created sql m13n model public.foo_6 .................................. [SELECT 1 in 10.10s]
01:45:59  
01:45:59  Running 1 on-run-end hook
01:45:59  Procedure finished: 2023-11-24 14:45:59.751741
01:45:59  1 of 1 START hook: my_dbt_project.on-run-end.0 ................................. [RUN]
01:45:59  1 of 1 OK hook: my_dbt_project.on-run-end.0 .................................... [OK in 0.00s]
01:45:59  
01:45:59  
01:45:59  Finished running 6 m13n models, 2 hooks in 0 hours 0 minutes and 20.39 seconds (20.39s).
01:45:59  
01:45:59  Completed successfully
01:45:59  
01:45:59  Done. PASS=6 WARN=0 ERROR=0 SKIP=0 TOTAL=6

By controlling the number of threads, we control the number of "models" dbt executes at any one time. When we set that equal to the number of models (6), the whole run took roughly as long as a single execution (~10 seconds) instead of ~1 minute:

01:45:24  Finished running 6 m13n models, 2 hooks in 0 hours 0 minutes and 10.58 seconds (10.58s).
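
As an aside, besides the --threads CLI flag, the thread count can also be set on the target in profiles.yml; a minimal sketch (the connection details are placeholders):

# profiles.yml
all:
  target: pg-local
  outputs:
    pg-local:
      type: postgres
      host: localhost      # placeholder connection details
      user: postgres
      password: postgres
      port: 5432
      dbname: postgres
      schema: public
      threads: 6           # default thread count for this target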

We could of course achieve the same outcome with normal table/view materializations plus pre/post-hooks, but then I would also have to drop those models after they were created, since I want to keep my database clear of random tables/views.
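For illustration, a minimal sketch of that hook-based alternative (assuming the Postgres adapter; note the EntityNotFoundException discussion in the comments below, where dropping the view in a post-hook caused trouble on dbt-athena):

-- models/foo_1.sql
{{ config(materialized='view', post_hook='drop view if exists {{ this }}') }}
select wait_then_return(10) as c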

One caveat with this pattern is that you end up with random nodes showing up in your dbt DAG (even if we choose to hide those nodes via show: false):

[image: dbt DAG showing the foo_* nodes]
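
For reference, hiding the nodes from the generated docs looks something like this (the models still run and still appear in the DAG itself):

# dbt_project.yml
models:
  my_dbt_project:
    +materialized: m13n
    +docs:
      show: false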

@juan-yunis

@ezequielscott I think the idea is to only run the SQL without creating a table for it; the new materialization exists just to run the SQL without creating a table named after the model file.

@SoumayaMauthoorMOJ

SoumayaMauthoorMOJ commented Mar 8, 2024

How about using the view materialization, giving the model files unique names, and then doing the insert as a post-hook? You can also delete the view in a post-hook if you don't like having lots of unnecessary views. All the view models will then be able to run and insert data concurrently.
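
A minimal sketch of that suggestion (the destination table and insert statement are hypothetical):

-- models/insert_part_1.sql
{{ config(
    materialized='view',
    post_hook=[
        "insert into my_destination_table select wait_then_return(10) as c",  -- hypothetical insert
        "drop view if exists {{ this }}"
    ]
) }}
select 1 as placeholder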

@pcc-mtiner

I just ran into a similar problem myself this week; here is how my approach differs:

  • Rather than create a custom materialization, I created a macro with the logic for each query I wanted to parallelize (each different query gets its own macro), which then accepted a 'partition' parameter so the same macro could be reused for the various splits of one query. A sketch of this idea follows the list.
  • This macro would then be called as a pre/post hook on however many splits you want to do. As Soumaya mentioned above me, these can just be called from models that have 'view' materialization (along with some dummy model definition like 'SELECT 1'), followed by dropping these views in a hook after they finish. I think I could probably adapt your custom materialization here to avoid the need to create/drop the view.
  • I also needed one model that is a sort of "pre-model". Its only purpose is to create the table if it does not exist, and to truncate (or run other cleanup operations) if necessary, as I'm not sure these processes could be cleanly left to the parallel threads. If anything, I figured it would be clearer to have a first model step that makes sure the required destination tables are in place before kicking off the various parallel view models. The 'depends_on' tag can be used to ensure the pre-model step runs before the parallel steps.
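
A minimal sketch of the per-partition macro idea from the first bullet (macro name, table names, and partitioning column are all hypothetical; the macro returns the SQL for the hook to execute):

-- macros/insert_partition.sql
{% macro insert_partition(partition) %}
    insert into my_destination_table
    select * from my_source_table
    where partition_key = '{{ partition }}'  -- hypothetical partitioning column
{% endmacro %}

-- models/part_1.sql
{{ config(materialized='view', pre_hook="{{ insert_partition('1') }}") }}
select 1 as placeholder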

All in all, my final conclusion is that while it works, it's a bit more convoluted than I would like, and I'm thinking it could be a massive value-add for dbt if they formalized this directly. I feel like it should be possible to enable some type of configuration that lets us define a single model file, along with some config that defines how many parts to run it in and what those parts are. Potentially an additional config could limit the maximum number of concurrent parts (for example, you might have 16 parts but only want to run 2 at a time because they are expensive). It seems there are many occasions when joins can be localized to key relationships that can cleanly be broken into chunks, which allows much more flexibility in memory usage (as well as helping maximize concurrent use of compute resources). This would be so amazing for my team if we could toggle it on for any model by just setting a few configs.

@SoumayaMauthoorMOJ

@pcc-mtiner You've mentioned a tag; I assume you mean -- depends_on: {{ ref('<model>') }}? Do you have any thoughts about how to pass this dynamically when you have a large number of inserts? I've tried all the suggestions here but no luck.

@SoumayaMauthoorMOJ

SoumayaMauthoorMOJ commented Mar 17, 2024

Also, it seems you can't delete the view as a post-hook, because dbt expects the entity to be available:

Unhandled error while executing target/run/dbt_athena_iceberg_datalake/models/<model>.sql
An error occurred (EntityNotFoundException) when calling the GetTableVersions operation: Table <model> not found.

@pcc-mtiner

@pcc-mtiner You've mentioned a tag, I assume you mean -- depends_on: {{ ref('<model>') }}? Do you have any thoughts about how to pass this dynamically in the case where you have a large number of inserts? I've tried all the suggestions here but no luck.

That's correct, I was talking about that feature. So far I haven't found a way to make it dynamic, so if I want to expand/reduce the number of parallel splits it can be a little tedious.
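
For readers unfamiliar with the feature, the static form being discussed looks like this (model and table names are hypothetical):

-- models/run_finalization.sql
-- depends_on: {{ ref('part_1') }}
-- depends_on: {{ ref('part_2') }}
select * from my_destination_table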

Also it seems you can't delete the view as a post-hook because dbt expects the entity to be available:

Unhandled error while executing target/run/dbt_athena_iceberg_datalake/models/<model>.sql
An error occurred (EntityNotFoundException) when calling the GetTableVersions operation: Table <model> not found.

I haven't run into this yet myself; running several views in parallel and then dropping the views worked for my case. I suspect it might have something to do with trying to use the model named by the dropped view in some downstream dependency. I think I have an idea that might accommodate this. The main process would consist of three main steps:

  1. Start with a 'run-initialization' view model that creates your destination table (or does any cleanup tasks), then drops its view in a post-hook. The destination table will be created by manually calling DDL using the dbt run_query macro. This table should have a different name than the run-initialization model itself.
  2. Run several parallel view models that all call a macro to insert into the table created in step 1, with post-hooks to drop each view afterwards. Each model should be dependent on the initialization model, to ensure the table is ready to receive records.
  3. Have a 'run-finalization' model. This model should have the same name as the destination table created in step 1; that way you can reference the destination table in downstream models. Since the table is already created in step 1, you will want to set some type of materialization that won't try to recreate the table in this last step. You can either create a custom materialization, or perhaps use the 'incremental' type, since this won't try to alter the table if it already exists. It could then just run some simple script that won't actually insert any further records. -- depends_on: tags could also be consolidated in this one model if you want to add/remove parallel splits.

You may potentially need to do all the DROP VIEW post-hooks in the run-finalization model, rather than in the individual models. Hopefully that helps get you through your issue! A rough sketch of the step-1 idea follows.
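
A minimal sketch of the run-initialization model from step 1 (the table name and column are hypothetical; assumes Postgres-style DDL):

-- models/run_initialization.sql
{{ config(
    materialized='view',
    pre_hook="create table if not exists my_destination_table (c integer)",  -- hypothetical DDL
    post_hook="drop view if exists {{ this }}"
) }}
select 1 as placeholder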

@SoumayaMauthoorMOJ

SoumayaMauthoorMOJ commented Mar 18, 2024

@pcc-mtiner thanks for confirming, I won't spend more time making depends_on dynamic. As for the deleted view, it's good to know it's possible; I'll check to see where my code might be calling it.

@SoumayaMauthoorMOJ

SoumayaMauthoorMOJ commented Mar 18, 2024

@pcc-mtiner what version of dbt are you using? I downgraded from dbt-core 1.7.8 to 1.6.5 and the EntityNotFoundException disappeared. I'm using dbt-athena as my adapter and downgraded it from 1.7.2 to 1.6.2. I'm wondering whether to raise this with dbt-athena or dbt-core, so it would be helpful to know if you also get the error with dbt-core 1.7.8.

@pcc-mtiner

@SoumayaMauthoorMOJ apologies, I missed replying to this; here are all the dbt-related packages in my environment:
dbt-clickhouse 1.7.1
dbt-core 1.7.10
dbt-extractor 0.5.1
dbt-metricflow 0.4.0
dbt-semantic-interfaces 0.4.4

@SoumayaMauthoorMOJ

Thanks, it must be a problem with dbt-athena in that case.
