Environment: DEV
Tasks:
- feature branch
- define sources
- develop tests to validate the sources
- develop models
- define the sources
- develop modular models
- develop tests (unit tests and generic tests) to validate the modular models
- develop tests to validate the final output
20220517 Stakeholder-friendly model names: model naming conventions that give context | dbt Developer Blog
├── dbt_project.yml
└── models
├── marts
| └── {database_name or project_id}
| └── {schema_name or dataset_id}
| ├── STG_{source_table_name}.yml
| ├── STG_{source_table_name}.sql
| ├── STG_CUSTOMERS.yml
| ├── STG_CUSTOMERS.sql
| ├── ...
| ├── INT_{object_name}.yml
| ├── INT_{object_name}.sql
| ├── INT_CUSTOMERS__UNIONED.sql
| ├── INT_CUSTOMERS__GROUPED.sql
| ├── ...
| ├── DIM_{dimension_name}.sql
| ├── DIM_CUSTOMERS.sql
| ├── ...
| ├── DWD_{deailed_fact_name}.sql
| ├── DWD_CUSTOMER_ORDERS.sql
| ├── ...
| ├── DWS_{summarized_fact_name}.sql
| └── DWS_CUSTOMER_ORDERS_1D.sql
|
└── raw
└── {database_name or project_id}
└── {schema_name or dataset_id}
├── {table_name}.yml
├── {table_name}.docs
├── CUSTOMERS.yml
└── CUSTOMERS.docs
The data in any of our projects has three distinct checkpoints:
- Sources: Schemas and tables in a source-conformed structure (i.e. tables and columns in a structure based on what an API returns), loaded by a third party tool.
- These configuration files (.yml and .sql) will be located under
models/raw
directory. - These configuration files are not prefixed, such as
CUSTOMERS.sql
.
- These configuration files (.yml and .sql) will be located under
- Staging models: The atomic unit of data modeling. Each model bears a one-to-one relationship with the source data table it represents. It has the same granularity, but the columns have been renamed, recast, or usefully reconsidered into a consistent format.
- These configuration files (.yml and .sql) will be located under
models/marts
directory. - These configuration files are prefixed with
STG_
, such asstg_CUSTOMERS.sql
.
- These configuration files (.yml and .sql) will be located under
- Intermediate models: Intermediate transformations required to get to a fact or dimension model.
- These configuration files (.yml and .sql) will be located under
models/marts
directory. - These configuration files are prefixed with
INT_
and end with a past tense verb indicating the action performed on the object, such as:INT_CUSTOMERS__UNIONED
.
- These configuration files (.yml and .sql) will be located under
- Marts models: Models that represent business processes and entities, abstracted from the data sources that they are based on.
- These configuration files (.yml and .sql) will be located under
models/marts
directory. - These configuration files are prefixed with
DIM_
orDWD_
orDWS_
:dim_
: dimensions tablesdwd_
: detailed fact tablesdws_
: summarized fact tables
- These configuration files (.yml and .sql) will be located under
- All objects should be plural, such as:
stg_stripe__invoices
-
Model-specific attributes (like sort/dist keys) should be specified in the model file (
.sql
). -
If a particular configuration applies to all models in a directory, it should be specified in the
dbt_project.yml
file.# dbt_project.yml models: <project>: # Config indicated by + and applies to all files under models/example/ +materialized: 'ephemeral'
-
Marts models should always be materialized as tables.
# dim_CUSTOMERS.sql {{ config( materialized='incremental' ) }}
- Only
stg_
models should select fromsources
. - All other models should only select from other models.
- Every model (sources, staging, intermediate, marts) should have a
.yml
file in which the model is tested (with the Generic tests ).- At a minimum,
unique
andnot_null
tests should be applied to the primary key of each model.
- At a minimum,
20200630 Unit Testing on dbt Models Using a Static Test Dataset in Snowflake | by Venkat Sekar | HashmapInc | Medium, EN
unit test:
-
Prepare one CSV file in
seeds
directory as the input.- The filename should be the source model name prefixed with
TEST_INPUT_
.- For example, MODEL_B, which will be tested, refers to one source model MODEL_A. A file
seeds/TEST_INPUT_MODEL_A.csv
will include the some sample data of MODEL_A.
- For example, MODEL_B, which will be tested, refers to one source model MODEL_A. A file
- The filename should be the source model name prefixed with
-
Prepre one CSV file in
seeds
directory as the expected output.- The filename should be the name of model been tested prefixed with
TEST_OUTPUT_
.- For example, a file
seeds/TEST_OUTPUT_MODEL_B.csv
will include the expected output generated by MODEL_B with input fromseeds/TEST_INPUT_MODEL_A.csv
.
- For example, a file
- The filename should be the name of model been tested prefixed with
-
Prepare one yml file in
seeds
directory to define the test configurations of the expected output.- The filename would be the same as the expected output name with
yml
suffix.- For example,
seeds/TEST_OUTPUT_MODEL_B.yml
- For example,
- The filename would be the same as the expected output name with
seeds:
- name: TEST_OUTPUT_MODEL_B
tests:
- dbt_utils.equal_rowcount:
compare_model: ref('MODEL_B')
- dbt_utils.equality:
compare_model: ref('MODEL_B')
compare_columns:
- L_ORDERKEY
- NO_OF_LINES
-
Run the
dbt seed command
command to upload seeds- two new tables will be created in your warehouse in your target schema named
- For example,
TEST_INPUT_MODEL_A
andTEST_OUTPIT_MODEL_B
tables were created.
- For example,
- two new tables will be created in your warehouse in your target schema named
-
Develop and Deploy the model
- use a custom macro to replace the source of model defined in the
ref
block- For example,
custom_ref{{'MODEL_A', 'TEST_INPUT_MODEL_A'}}
intead ofref{{'MODEL_A'}}
- For example,
- use a custom macro to replace the source of model defined in the
-
Run the test
- use
dbt test --select
- For example, run
dbt test --select seeds/TEST_OUTPUT_MODEL_B.yml
- For example, run
- use
- Use names based on the business terminology, rather than the source terminology.
- Each model should have a primary key and should be tested in the
.yml
file.
This python script is used to generate source yml files.
Because target.database
and target.schema
can't be used in the snapshot config block, we could only use them in snapshots
section of dbt_project.yml
it works:
# dbt_project.yml
snapshots:
{your_project_name}:
+target_database: "{{ target.database }}"
+target_schema: "{{ target.schema }}"
it doesn't work:
# snapshots/xxx.sql
{% snapshot snapshotname %}
{{
config(
target_schema="target.schema"
,unique_key="id"
,strategy="check"
,check_cols="all"
)
}}
select
*
,{{ dbt_utils.surrogate_key([
'fab',
'piid',
]) }} as id
from {{ source("ABC","DEF") }}
{% endsnapshot %}
The template:
# model.sql
{{ config(
materialized='incremental'
,unique_key='AAA'
,incremental_strategy='merge'
) }}
{% set filter_for_incremental_load %}
and date_day >= (select max(date_day) from {{ this }})
{% endset %}
{% set filter_for_init_load %}
and date_day
between
timestamp_sub( current_timestamp(), interval 2 DAY )
and
timestamp_sub( current_timestamp(), interval 1 DAY )
{% endset %}
select
XXX
from
{{ ref("YYY") }}
where true
{% if is_incremental() %}
{{ filter_for_incremental_load }}
{% else %}
{{ filter_for_init_load }}
{% endif %}
The {{ filter_for_full_load }}
will be applied when you use dbt run --full-refresh
to rebuild the incremental model.