Style Guide

Development Processes

Environment: DEV

Tasks:

  • feature branch
  • define sources
    • develop tests to validate the sources
  • develop models
    • define the sources
    • develop modular models
      • develop tests (unit tests and generic tests) to validate the modular models
    • develop tests to validate the final output

Model Naming

├── dbt_project.yml
└── models
    ├── marts
    |   └── {database_name or project_id}
    |       └── {schema_name or dataset_id}
    |           ├── STG_{source_table_name}.yml
    |           ├── STG_{source_table_name}.sql
    |           ├── STG_CUSTOMERS.yml
    |           ├── STG_CUSTOMERS.sql
    |           ├── ...
    |           ├── INT_{object_name}.yml
    |           ├── INT_{object_name}.sql
    |           ├── INT_CUSTOMERS__UNIONED.sql
    |           ├── INT_CUSTOMERS__GROUPED.sql
    |           ├── ...
    |           ├── DIM_{dimension_name}.sql
    |           ├── DIM_CUSTOMERS.sql
    |           ├── ...
    |           ├── DWD_{detailed_fact_name}.sql
    |           ├── DWD_CUSTOMER_ORDERS.sql
    |           ├── ...
    |           ├── DWS_{summarized_fact_name}.sql
    |           └── DWS_CUSTOMER_ORDERS_1D.sql
    |
    └── raw
        └── {database_name or project_id}
            └── {schema_name or dataset_id}
                ├── {table_name}.yml
                ├── {table_name}.docs
                ├── CUSTOMERS.yml
                └── CUSTOMERS.docs

The data in any of our projects has four distinct checkpoints:

  • Sources: Schemas and tables in a source-conformed structure (i.e. tables and columns in a structure based on what an API returns), loaded by a third party tool.
    • These configuration files (.yml and .docs) will be located under the models/raw directory.
    • These configuration files are not prefixed, such as CUSTOMERS.yml.
  • Staging models: The atomic unit of data modeling. Each model bears a one-to-one relationship with the source data table it represents. It has the same granularity, but the columns have been renamed, recast, or usefully reconsidered into a consistent format (see the sketch after this list).
    • These configuration files (.yml and .sql) will be located under the models/marts directory.
    • These configuration files are prefixed with STG_, such as STG_CUSTOMERS.sql.
  • Intermediate models: Intermediate transformations required to get to a fact or dimension model.
    • These configuration files (.yml and .sql) will be located under the models/marts directory.
    • These configuration files are prefixed with INT_ and end with a past tense verb indicating the action performed on the object, such as: INT_CUSTOMERS__UNIONED.
  • Marts models: Models that represent business processes and entities, abstracted from the data sources that they are based on.
    • These configuration files (.yml and .sql) will be located under the models/marts directory.
    • These configuration files are prefixed with DIM_, DWD_, or DWS_:
      • DIM_: dimension tables
      • DWD_: detailed fact tables
      • DWS_: summarized fact tables
  • All object names should be plural, such as: STG_STRIPE__INVOICES
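
As a sketch of the staging layer described above: the source name, table, and column names below are hypothetical, used only to illustrate the renaming and recasting of source columns into a consistent format.

# STG_CUSTOMERS.sql (hypothetical sketch)
select
    cast(ID as string)         as CUSTOMER_ID
    ,trim(NAME)                as CUSTOMER_NAME
    ,cast(CREATED_AT as date)  as CREATED_DATE
from {{ source('raw_crm', 'CUSTOMERS') }}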

Model configuration

  • Model-specific attributes (like sort/dist keys) should be specified in the model file (.sql).

  • If a particular configuration applies to all models in a directory, it should be specified in the dbt_project.yml file.

    # dbt_project.yml
    models:
      <project>:
        # Config indicated by + and applies to all files under models/example/
        example:
          +materialized: 'ephemeral'
    
  • Marts models should always be materialized as tables (using the table or incremental materialization).

    # DIM_CUSTOMERS.sql
    {{ config(
        materialized='incremental'
    ) }}

dbt conventions

  • Only STG_ models should select from sources.
  • All other models should only select from other models.
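
A minimal sketch of these two rules together (the model names follow the naming section above; the columns are assumptions for illustration):

# INT_CUSTOMERS__GROUPED.sql (illustrative sketch)
-- non-staging models never call source(); they build on other models via ref()
select
    CUSTOMER_ID
    ,count(*) as NO_OF_ORDERS
from {{ ref('STG_CUSTOMERS') }}
group by CUSTOMER_ID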

Testing

  • Every model (sources, staging, intermediate, marts) should have a .yml file in which the model is tested with generic tests.
    • At a minimum, unique and not_null tests should be applied to the primary key of each model.
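
A minimal .yml sketch for these generic tests (STG_CUSTOMERS comes from the naming section above; the CUSTOMER_ID key column is an assumption):

# STG_CUSTOMERS.yml
version: 2

models:
  - name: STG_CUSTOMERS
    columns:
      - name: CUSTOMER_ID
        tests:
          - unique
          - not_null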

unit test:

  • Prepare one CSV file in the seeds directory as the input.

    • The filename should be the upstream model name prefixed with TEST_INPUT_.
      • For example, MODEL_B, the model under test, reads from one upstream model, MODEL_A. The file seeds/TEST_INPUT_MODEL_A.csv will contain some sample data of MODEL_A.
  • Prepare one CSV file in the seeds directory as the expected output.

    • The filename should be the name of the model being tested, prefixed with TEST_OUTPUT_.
      • For example, the file seeds/TEST_OUTPUT_MODEL_B.csv will contain the expected output generated by MODEL_B from the input in seeds/TEST_INPUT_MODEL_A.csv.
  • Prepare one .yml file in the seeds directory to define the test configurations of the expected output.

    • The filename should be the same as the expected output name, with a .yml suffix.
      • For example, seeds/TEST_OUTPUT_MODEL_B.yml

    # seeds/TEST_OUTPUT_MODEL_B.yml
    version: 2

    seeds:
      - name: TEST_OUTPUT_MODEL_B
        tests:
          - dbt_utils.equal_rowcount:
              compare_model: ref('MODEL_B')
          - dbt_utils.equality:
              compare_model: ref('MODEL_B')
              compare_columns:
                - L_ORDERKEY
                - NO_OF_LINES
  • Run the dbt seed command to upload the seeds

    • Two new tables will be created in the target schema of your warehouse.
      • For example, the TEST_INPUT_MODEL_A and TEST_OUTPUT_MODEL_B tables are created.
  • Develop and deploy the model

    • use a custom macro to replace the upstream model referenced in the ref() call (see the macro sketch after this list)
      • For example, {{ custom_ref('MODEL_A', 'TEST_INPUT_MODEL_A') }} instead of {{ ref('MODEL_A') }}
  • Run the test

    • use dbt test --select
      • For example, run dbt test --select seeds/TEST_OUTPUT_MODEL_B.yml
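
The custom_ref macro referenced above is not reproduced in this guide; one possible sketch, assuming a var named unit_testing is used to switch between the real model and the seed, is:

# macros/custom_ref.sql (one possible sketch; the var name unit_testing is an assumption)
{% macro custom_ref(model_name, test_input_name) %}
    {%- if var('unit_testing', false) -%}
        {{ ref(test_input_name) }}
    {%- else -%}
        {{ ref(model_name) }}
    {%- endif -%}
{% endmacro %}

With this sketch, the swap is enabled at run time, e.g. dbt run --select MODEL_B --vars '{unit_testing: true}'.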

Naming and field conventions

  • Use names based on the business terminology, rather than the source terminology.
  • Each model should have a primary key and should be tested in the .yml file.
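
Where no natural primary key exists, a surrogate key can be generated with dbt_utils.surrogate_key (the same macro used in the snapshot example later in this guide); the model and column names below are illustrative only.

# DWD_CUSTOMER_ORDERS.sql (illustrative sketch)
select
    *
    ,{{ dbt_utils.surrogate_key([
            'CUSTOMER_ID',
            'ORDER_ID'
    ]) }} as CUSTOMER_ORDER_ID
from {{ ref('INT_CUSTOMERS__UNIONED') }}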

configuration

sources

This Python script is used to generate the source .yml files.
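
The script is not reproduced here; the files it generates follow the standard dbt sources schema and, per the directory layout above, would look roughly like this (project, dataset, and source names are placeholders):

# models/raw/my-gcp-project/crm/CUSTOMERS.yml
version: 2

sources:
  - name: crm
    database: my-gcp-project
    schema: crm
    tables:
      - name: CUSTOMERS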

snapshot

Because target.database and target.schema can't be used in a snapshot's config block, we can only set them in the snapshots section of dbt_project.yml.

it works:

# dbt_project.yml
snapshots:
  {your_project_name}:
    +target_database: "{{ target.database }}"
    +target_schema: "{{ target.schema }}"

it doesn't work:

# snapshots/xxx.sql
{% snapshot snapshotname %}

{{ 
    config(
        target_schema="target.schema"
        ,unique_key="id"
        ,strategy="check"
        ,check_cols="all"
        
    ) 
}}

select
    *
    ,{{ dbt_utils.surrogate_key([
            'fab', 
            'piid',
    ]) }} as id 
    
from {{ source("ABC","DEF") }}

{% endsnapshot %}


incremental

The template:

# model.sql

{{ config(
    materialized='incremental'
    ,unique_key='AAA'
    ,incremental_strategy='merge'
) }}

{% set filter_for_incremental_load %}

    and date_day >= (select max(date_day) from {{ this }})

{% endset %}

{% set filter_for_init_load %}

    and date_day 
       between 
          timestamp_sub( current_timestamp(), interval 2 DAY ) 
       and 
          timestamp_sub( current_timestamp(), interval 1 DAY ) 

{% endset %}


select 
   XXX
from
   {{ ref("YYY") }}
where true

{% if is_incremental() %}

    {{ filter_for_incremental_load }}

{% else %}

    {{ filter_for_init_load }}
    
{% endif %}

The {{ filter_for_init_load }} will be applied when you use dbt run --full-refresh to rebuild the incremental model.
