@jeremyyeo
Last active February 26, 2022 08:06

Adding custom generic dbt tests


dbt comes with a handful of built-in generic tests, and even more tests are available from community dbt packages.

Sometimes, though, you might have bespoke business logic that you want to test against - follow this guide to see how you can get started with a very simple example.

Note: All of this information is also available in the official dbt documentation.
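
For context, the built-in generic tests (unique, not_null, accepted_values, relationships) are applied from a schema file like the one below - the model and column names here are just placeholders:

# models/schema.yml

version: 2

models:
  - name: some_model
    columns:
      - name: user_id
        tests:
          - unique
          - not_null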


A basic generic test

In our example below, we're going to add a test that essentially checks that 2 tables are equivalent via the SQL minus operator (Snowflake's alias for the standard except operator).
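
As a quick refresher, minus returns the rows of the first query that do not appear in the second - so a sketch like the one below (table names are purely illustrative) returns the rows of table_a that are missing from table_b:

-- rows present in table_a but absent from table_b
select user_id, user_name from table_a
 minus
select user_id, user_name from table_b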

  1. Add 2 simple models that we'll use to simulate an error:
-- my_model.sql
select 1 as user_id, 'alice' as user_name
 union
select 2 as user_id, 'bob' as user_name


-- my_model_1.sql
select 2 as user_id, 'bob' as user_name
 union
select 3 as user_id, 'eve' as user_name
  2. Add a generic test macro to your tests/generic folder:
-- tests/generic/test_is_the_same.sql

{% test is_the_same(model, to) %}

with validation as (
    select * from {{ model }}
),

reference as (
    select * from {{ to }}
)

select * from validation
 minus
select * from reference

{% endtest %}

The test above is configured to fail if the model that the test is applied to contains any rows that are not present in the reference model - in other words, if the query returns any rows at all, the test has failed.
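
For example, against the dbt_jyeo schema used in the logs below, the test applied to my_model compiles to something roughly like this (the exact relation names depend on your target):

-- roughly the compiled body of is_the_same applied to my_model
with validation as (
    select * from dbt_jyeo.my_model
),

reference as (
    select * from dbt_jyeo.my_model_1
)

select * from validation
 minus
select * from reference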

  3. Add the above test to your model's schema:
# models/schema.yml

version: 2

models:
  - name: my_model
    tests:
      - is_the_same:
          to: ref('my_model_1')
  4. Run and test your models:
$ dbt run && dbt test

00:30:14  Running with dbt=1.0.1
00:30:14  Found 2 models, 1 test, 0 snapshots, 0 analyses, 180 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics
00:30:14  
00:30:20  Concurrency: 4 threads (target='dev')
00:30:20  
00:30:20  2 of 2 START table model dbt_jyeo.my_model_1.................................... [RUN]
00:30:20  1 of 2 START table model dbt_jyeo.my_model...................................... [RUN]
00:30:24  2 of 2 OK created table model dbt_jyeo.my_model_1............................... [SUCCESS 1 in 4.12s]
00:30:24  1 of 2 OK created table model dbt_jyeo.my_model................................. [SUCCESS 1 in 4.12s]
00:30:24  
00:30:24  Finished running 2 table models in 9.54s.
00:30:24  
00:30:24  Completed successfully
00:30:24  
00:30:24  Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2
00:30:29  Running with dbt=1.0.1
00:30:29  Found 2 models, 1 test, 0 snapshots, 0 analyses, 180 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics
00:30:29  
00:30:31  Concurrency: 4 threads (target='dev')
00:30:31  
00:30:31  1 of 1 START test is_the_same_my_model_ref_my_model_1_.......................... [RUN]
00:30:34  1 of 1 FAIL 1 is_the_same_my_model_ref_my_model_1_.............................. [FAIL 1 in 3.03s]
00:30:34  
00:30:34  Finished running 1 test in 5.75s.
00:30:34  
00:30:34  Completed with 1 error and 0 warnings:
00:30:34  
00:30:34  Failure in test is_the_same_my_model_ref_my_model_1_ (models/schema.yml)
00:30:34    Got 1 result, configured to fail if != 0
00:30:34  
00:30:34    compiled SQL at target/compiled/snowflake/models/schema.yml/is_the_same_my_model_ref_my_model_1_.sql
00:30:34  
00:30:34  Done. PASS=0 WARN=0 ERROR=1 SKIP=0 TOTAL=1

Notice that we had a failed test (as expected) since the two tables ARE NOT the same.

  5. Let's try modifying our models so that they are both equivalent:
-- my_model.sql
select 1 as user_id, 'alice' as user_name
 union
select 2 as user_id, 'bob' as user_name


-- my_model_1.sql
select 1 as user_id, 'alice' as user_name
 union
select 2 as user_id, 'bob' as user_name
  6. Rerun and retest your models:
$ dbt run && dbt test

00:33:49  Running with dbt=1.0.1
00:33:49  Found 2 models, 1 test, 0 snapshots, 0 analyses, 180 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics
00:33:49  
00:33:54  Concurrency: 4 threads (target='dev')
00:33:54  
00:33:54  2 of 2 START table model dbt_jyeo.my_model_1.................................... [RUN]
00:33:54  1 of 2 START table model dbt_jyeo.my_model...................................... [RUN]
00:33:58  2 of 2 OK created table model dbt_jyeo.my_model_1............................... [SUCCESS 1 in 3.62s]
00:33:58  1 of 2 OK created table model dbt_jyeo.my_model................................. [SUCCESS 1 in 3.63s]
00:33:58  
00:33:58  Finished running 2 table models in 9.33s.
00:33:58  
00:33:58  Completed successfully
00:33:58  
00:33:58  Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2
00:34:03  Running with dbt=1.0.1
00:34:03  Found 2 models, 1 test, 0 snapshots, 0 analyses, 180 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics
00:34:03  
00:34:06  Concurrency: 4 threads (target='dev')
00:34:06  
00:34:06  1 of 1 START test is_the_same_my_model_ref_my_model_1_.......................... [RUN]
00:34:09  1 of 1 PASS is_the_same_my_model_ref_my_model_1_................................ [PASS in 3.21s]
00:34:09  
00:34:09  Finished running 1 test in 5.89s.
00:34:09  
00:34:09  Completed successfully
00:34:09  
00:34:09  Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1

Notice now that your test has passed because the two tables ARE the same.


A more complex example

Let's now work on a more complex example where we will have to override one of the built-in dbt testing macros. One use case for this: say we want to use Snowflake's policy simulation to test that the correct column masking policies are applied to the different roles (e.g. only certain Snowflake roles should be able to see values in the user_name column of a table).

For the purposes of this exercise, we're going to test that the role jeremy_test_role does not have access to the user_name column on the my_source table (by checking that it returns ***** instead of the actual value). It is also expected that you have already created the masking policy and applied it in Snowflake.
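
If you need a starting point, a minimal masking policy for this scenario might look like the sketch below - the policy name is made up for illustration, and the TRANSFORMER role matches the role used at the end of this guide:

-- illustrative only: mask user_name for every role except TRANSFORMER
create masking policy dbt_jyeo.mask_user_name as (val string) returns string ->
  case
    when current_role() in ('TRANSFORMER') then val
    else '*****'
  end;

alter table dbt_jyeo.my_source
  modify column user_name set masking policy dbt_jyeo.mask_user_name;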

  1. First, we need to add our custom generic test code, just as we did above.
-- tests/generic/test_is_masked.sql

{% test is_masked(model, column_name, masked_value, header) %}

{{ config(sql_header = header) }}

select * from {{ model }}
  where {{ column_name }} != '{{ masked_value }}'

{% endtest %}

Notice that we're creating a column-level test is_masked that can take in additional arguments (masked_value, header), and that we assign the header argument to a sql_header config (we will use this config in our override next).

  2. Override the built-in get_test_sql() macro (see source).
-- macros/custom_get_test_sql.sql

{% macro get_test_sql(main_sql, fail_calc, warn_if, error_if, limit) -%}
    {% set header = config.get('sql_header') or '' %}
    {{ header }}
    select
      {{ fail_calc }} as failures,
      {{ fail_calc }} {{ warn_if }} as should_warn,
      {{ fail_calc }} {{ error_if }} as should_error
    from (
      {{ main_sql }}
      {{ "limit " ~ limit if limit != none }}
    ) dbt_internal_test
{%- endmacro %}

If you compare the above to the built-in macro, you will see that the only new lines are the {% set header = ... %} and {{ header }} lines - basically, we're injecting a header into the SQL text that is executed when the test runs.
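
For comparison, the default implementation that ships with dbt 1.0.x looks roughly like the following (check the linked source for the exact version you are running):

{% macro default__get_test_sql(main_sql, fail_calc, warn_if, error_if, limit) -%}
    select
      {{ fail_calc }} as failures,
      {{ fail_calc }} {{ warn_if }} as should_warn,
      {{ fail_calc }} {{ error_if }} as should_error
    from (
      {{ main_sql }}
      {{ "limit " ~ limit if limit != none }}
    ) dbt_internal_test
{%- endmacro %}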

Note that the {{ main_sql }} Jinja block is where the entire SQL text of your custom generic test is placed - this means that if you had written your generic test as something like the following (instead of what we had in step 1):

{% test is_masked(model, column_name, masked_value) %}

execute using policy_context(current_role => 'JEREMY_TEST_ROLE') as
select * from {{ model }}
  where {{ column_name }} != '{{ masked_value }}'

{% endtest %}

then our test, when executed, would compile to:

select
  count(*) as failures,
  count(*) != 0 as should_warn,
  count(*) != 0 as should_error
from (
  execute using policy_context(current_role => 'JEREMY_TEST_ROLE') as
  select * from my_source
  where my_column != '*****'
) dbt_internal_test

This is invalid if executed on Snowflake - the execute using policy_context(... clause has to be at the start of the query, like so:

execute using policy_context(current_role => 'JEREMY_TEST_ROLE') as
select
  count(*) as failures,
  count(*) != 0 as should_warn,
  count(*) != 0 as should_error
from (
  select * from my_source
  where my_column != '*****'
) dbt_internal_test
  3. Add our new test to a model / source and try running a test.
version: 2

sources:
  - name: dbt_jyeo
    tables:
      - name: my_source
        columns:
          - name: user_name
            tests:
              - not_null
              - is_masked:
                  masked_value: "*****"
                  header: "execute using policy_context(current_role => 'JEREMY_TEST_ROLE') as"

Note that we're adding a built-in not_null test just to check that our macro override has not broken any default testing behaviour.
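
Because the not_null test sets no sql_header, the {{ header }} line in our override renders to an empty string, so the compiled not_null test should look roughly like the stock version (relation names are illustrative):

-- no sql_header configured, so the injected header renders as nothing
select
  count(*) as failures,
  count(*) != 0 as should_warn,
  count(*) != 0 as should_error
from (
  select * from dbt_jyeo.my_source
  where user_name is null
) dbt_internal_test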

$ dbt test

23:52:07  Running with dbt=1.0.2
23:52:08  Found 1 model, 2 tests, 0 snapshots, 0 analyses, 181 macros, 0 operations, 0 seed files, 1 source, 0 exposures, 0 metrics
23:52:08  
23:52:10  Concurrency: 4 threads (target='dev')
23:52:10  
23:52:10  1 of 2 START test source_is_masked_dbt_jyeo_my_source_user_name__execute_using_policy_context_current_role_JEREMY_TEST_ROLE_as___ [RUN]
23:52:10  2 of 2 START test source_not_null_dbt_jyeo_my_source_user_name.................. [RUN]
23:52:12  2 of 2 PASS source_not_null_dbt_jyeo_my_source_user_name........................ [PASS in 2.30s]
23:52:13  1 of 2 PASS source_is_masked_dbt_jyeo_my_source_user_name__execute_using_policy_context_current_role_JEREMY_TEST_ROLE_as___ [PASS in 2.46s]
23:52:13  
23:52:13  Finished running 2 tests in 4.64s.
23:52:13  
23:52:13  Completed successfully
23:52:13  
23:52:13  Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2

All tests are passing (there are no nulls and the column is masked for the role). Let's try with a role that does have access to the column (so we're expecting an error instead):

version: 2

sources:
  - name: dbt_jyeo
    tables:
      - name: my_source
        columns:
          - name: user_name
            tests:
              - not_null
              - is_masked:
                  masked_value: "*****"
                  header: "execute using policy_context(current_role => 'TRANSFORMER') as"
$ dbt test

23:54:38  Running with dbt=1.0.2
23:54:39  Found 1 model, 2 tests, 0 snapshots, 0 analyses, 181 macros, 0 operations, 0 seed files, 1 source, 0 exposures, 0 metrics
23:54:39  
23:54:41  Concurrency: 4 threads (target='dev')
23:54:41  
23:54:41  1 of 2 START test source_is_masked_dbt_jyeo_my_source_user_name__execute_using_policy_context_current_role_TRANSFORMER_as___ [RUN]
23:54:41  2 of 2 START test source_not_null_dbt_jyeo_my_source_user_name.................. [RUN]
23:54:43  2 of 2 PASS source_not_null_dbt_jyeo_my_source_user_name........................ [PASS in 2.21s]
23:54:43  1 of 2 FAIL 5 source_is_masked_dbt_jyeo_my_source_user_name__execute_using_policy_context_current_role_TRANSFORMER_as___ [FAIL 5 in 2.30s]
23:54:43  
23:54:43  Finished running 2 tests in 4.56s.
23:54:43  
23:54:43  Completed with 1 error and 0 warnings:
23:54:43  
23:54:43  Failure in test source_is_masked_dbt_jyeo_my_source_user_name__execute_using_policy_context_current_role_TRANSFORMER_as___ (models/sources.yml)
23:54:43    Got 5 results, configured to fail if != 0
23:54:43  
23:54:43    compiled SQL at target/compiled/snowflake/models/sources.yml/source_is_masked_dbt_jyeo_my_s_b85ec2bbb5a5ac1d694c02574117265d.sql
23:54:43  
23:54:43  Done. PASS=1 WARN=0 ERROR=1 SKIP=0 TOTAL=2

The not_null test passed, but there were 5 rows that were not masked - as expected, since the TRANSFORMER role can see the unmasked values.
