Style Guide

Development Processes

Environment: DEV

Tasks:

  • feature branch
  • define sources
    • develop tests to validate the sources
  • develop models
    • define the sources
    • develop modular models
      • develop tests (unit tests and generic tests) to validate the modular models
    • develop tests to validate the final output

Model Naming

├── dbt_project.yml
└── models
    ├── marts
    |   └── {database_name or project_id}
    |       └── {schema_name or dataset_id}
    |           ├── STG_{source_table_name}.yml
    |           ├── STG_{source_table_name}.sql
    |           ├── STG_CUSTOMERS.yml
    |           ├── STG_CUSTOMERS.sql
    |           ├── ...
    |           ├── INT_{object_name}.yml
    |           ├── INT_{object_name}.sql
    |           ├── INT_CUSTOMERS__UNIONED.sql
    |           ├── INT_CUSTOMERS__GROUPED.sql
    |           ├── ...
    |           ├── DIM_{dimension_name}.sql
    |           ├── DIM_CUSTOMERS.sql
    |           ├── ...
    |           ├── DWD_{detailed_fact_name}.sql
    |           ├── DWD_CUSTOMER_ORDERS.sql
    |           ├── ...
    |           ├── DWS_{summarized_fact_name}.sql
    |           └── DWS_CUSTOMER_ORDERS_1D.sql
    |
    └── raw
        └── {database_name or project_id}
            └── {schema_name or dataset_id}
                ├── {table_name}.yml
                ├── {table_name}.docs
                ├── CUSTOMERS.yml
                └── CUSTOMERS.docs

The data in any of our projects has four distinct checkpoints:

  • Sources: Schemas and tables in a source-conformed structure (i.e. tables and columns in a structure based on what an API returns), loaded by a third party tool.
    • These configuration files (.yml and .docs) will be located under the models/raw directory.
    • These configuration files are not prefixed, such as CUSTOMERS.yml.
  • Staging models: The atomic unit of data modeling. Each model bears a one-to-one relationship with the source data table it represents. It has the same granularity, but the columns have been renamed, recast, or usefully reconsidered into a consistent format (see the sketch after this list).
    • These configuration files (.yml and .sql) will be located under the models/marts directory.
    • These configuration files are prefixed with STG_, such as STG_CUSTOMERS.sql.
  • Intermediate models: Intermediate transformations required to get to a fact or dimension model.
    • These configuration files (.yml and .sql) will be located under the models/marts directory.
    • These configuration files are prefixed with INT_ and end with a past tense verb indicating the action performed on the object, such as: INT_CUSTOMERS__UNIONED.
  • Marts models: Models that represent business processes and entities, abstracted from the data sources that they are based on.
    • These configuration files (.yml and .sql) will be located under the models/marts directory.
    • These configuration files are prefixed with DIM_, DWD_, or DWS_:
      • DIM_: dimension tables
      • DWD_: detailed fact tables
      • DWS_: summarized fact tables
  • All object names should be plural, such as: STG_STRIPE__INVOICES
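
As a sketch of the staging layer described above: the source name, table, and column names below are hypothetical, used only to illustrate the renaming and recasting of source columns into a consistent format.

# STG_CUSTOMERS.sql (hypothetical sketch)
select
    cast(ID as string)         as CUSTOMER_ID
    ,trim(NAME)                as CUSTOMER_NAME
    ,cast(CREATED_AT as date)  as CREATED_DATE
from {{ source('raw_crm', 'CUSTOMERS') }}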

Model configuration

  • Model-specific attributes (like sort/dist keys) should be specified in the model file (.sql).

  • If a particular configuration applies to all models in a directory, it should be specified in the dbt_project.yml file.

    # dbt_project.yml
    models:
      <project>:
        # Config indicated by + and applies to all files under models/example/
        example:
          +materialized: 'ephemeral'
    
  • Marts models should always be materialized as tables (using the table or incremental materialization).

    # DIM_CUSTOMERS.sql
    {{ config(
        materialized='incremental'
    ) }}

dbt conventions

  • Only STG_ models should select from sources.
  • All other models should only select from other models.
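
A minimal sketch of these two rules together (the model names follow the naming section above; the columns are assumptions for illustration):

# INT_CUSTOMERS__GROUPED.sql (illustrative sketch)
-- non-staging models never call source(); they build on other models via ref()
select
    CUSTOMER_ID
    ,count(*) as NO_OF_ORDERS
from {{ ref('STG_CUSTOMERS') }}
group by CUSTOMER_ID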

Testing

  • Every model (sources, staging, intermediate, marts) should have a .yml file in which the model is tested with generic tests.
    • At a minimum, unique and not_null tests should be applied to the primary key of each model.
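
A minimal .yml sketch for these generic tests (STG_CUSTOMERS comes from the naming section above; the CUSTOMER_ID key column is an assumption):

# STG_CUSTOMERS.yml
version: 2

models:
  - name: STG_CUSTOMERS
    columns:
      - name: CUSTOMER_ID
        tests:
          - unique
          - not_null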

unit test:

  • Prepare one CSV file in the seeds directory as the input.

    • The filename should be the upstream model name prefixed with TEST_INPUT_.
      • For example, MODEL_B, the model under test, reads from one upstream model, MODEL_A. The file seeds/TEST_INPUT_MODEL_A.csv will contain some sample data of MODEL_A.
  • Prepare one CSV file in the seeds directory as the expected output.

    • The filename should be the name of the model being tested, prefixed with TEST_OUTPUT_.
      • For example, the file seeds/TEST_OUTPUT_MODEL_B.csv will contain the expected output generated by MODEL_B from the input in seeds/TEST_INPUT_MODEL_A.csv.
  • Prepare one .yml file in the seeds directory to define the test configurations of the expected output.

    • The filename should be the same as the expected output name, with a .yml suffix.
      • For example, seeds/TEST_OUTPUT_MODEL_B.yml

    # seeds/TEST_OUTPUT_MODEL_B.yml
    version: 2

    seeds:
      - name: TEST_OUTPUT_MODEL_B
        tests:
          - dbt_utils.equal_rowcount:
              compare_model: ref('MODEL_B')
          - dbt_utils.equality:
              compare_model: ref('MODEL_B')
              compare_columns:
                - L_ORDERKEY
                - NO_OF_LINES
  • Run the dbt seed command to upload the seeds

    • Two new tables will be created in the target schema of your warehouse.
      • For example, the TEST_INPUT_MODEL_A and TEST_OUTPUT_MODEL_B tables are created.
  • Develop and deploy the model

    • use a custom macro to replace the upstream model referenced in the ref() call (see the macro sketch after this list)
      • For example, {{ custom_ref('MODEL_A', 'TEST_INPUT_MODEL_A') }} instead of {{ ref('MODEL_A') }}
  • Run the test

    • use dbt test --select
      • For example, run dbt test --select seeds/TEST_OUTPUT_MODEL_B.yml
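
The custom_ref macro referenced above is not reproduced in this guide; one possible sketch, assuming a var named unit_testing is used to switch between the real model and the seed, is:

# macros/custom_ref.sql (one possible sketch; the var name unit_testing is an assumption)
{% macro custom_ref(model_name, test_input_name) %}
    {%- if var('unit_testing', false) -%}
        {{ ref(test_input_name) }}
    {%- else -%}
        {{ ref(model_name) }}
    {%- endif -%}
{% endmacro %}

With this sketch, the swap is enabled at run time, e.g. dbt run --select MODEL_B --vars '{unit_testing: true}'.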

Naming and field conventions

  • Use names based on the business terminology, rather than the source terminology.
  • Each model should have a primary key and should be tested in the .yml file.
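
Where no natural primary key exists, a surrogate key can be generated with dbt_utils.surrogate_key (the same macro used in the snapshot example later in this guide); the model and column names below are illustrative only.

# DWD_CUSTOMER_ORDERS.sql (illustrative sketch)
select
    *
    ,{{ dbt_utils.surrogate_key([
            'CUSTOMER_ID',
            'ORDER_ID'
    ]) }} as CUSTOMER_ORDER_ID
from {{ ref('INT_CUSTOMERS__UNIONED') }}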

configuration

sources

This Python script is used to generate the source .yml files.
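
The script is not reproduced here; the files it generates follow the standard dbt sources schema and, per the directory layout above, would look roughly like this (project, dataset, and source names are placeholders):

# models/raw/my-gcp-project/crm/CUSTOMERS.yml
version: 2

sources:
  - name: crm
    database: my-gcp-project
    schema: crm
    tables:
      - name: CUSTOMERS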

snapshot

Because target.database and target.schema can't be used in a snapshot's config block, we can only set them in the snapshots section of dbt_project.yml.

it works:

# dbt_project.yml
snapshots:
  {your_project_name}:
    +target_database: "{{ target.database }}"
    +target_schema: "{{ target.schema }}"

it doesn't work:

# snapshots/xxx.sql
{% snapshot snapshotname %}

{{ 
    config(
        target_schema="target.schema"
        ,unique_key="id"
        ,strategy="check"
        ,check_cols="all"
        
    ) 
}}

select
    *
    ,{{ dbt_utils.surrogate_key([
            'fab', 
            'piid',
    ]) }} as id 
    
from {{ source("ABC","DEF") }}

{% endsnapshot %}


incremental

The template:

# model.sql

{{ config(
    materialized='incremental'
    ,unique_key='AAA'
    ,incremental_strategy='merge'
) }}

{% set filter_for_incremental_load %}

    and date_day >= (select max(date_day) from {{ this }})

{% endset %}

{% set filter_for_init_load %}

    and date_day 
       between 
          timestamp_sub( current_timestamp(), interval 2 DAY ) 
       and 
          timestamp_sub( current_timestamp(), interval 1 DAY ) 

{% endset %}


select 
   XXX
from
   {{ ref("YYY") }}
where true

{% if is_incremental() %}

    {{ filter_for_incremental_load }}

{% else %}

    {{ filter_for_init_load }}
    
{% endif %}

The {{ filter_for_init_load }} will be applied when you use dbt run --full-refresh to rebuild the incremental model.
