Skip to content

Instantly share code, notes, and snippets.

View jeremyyeo's full-sized avatar

Jeremy Yeo jeremyyeo

View GitHub Profile
@jeremyyeo
jeremyyeo / README.md
Last active April 3, 2024 01:21
Converting a dbt jinja macro to a python function for use in a python model #dbt

Converting a dbt jinja macro to a python function for use in a python model

Here's a quick example of converting a dbt jinja macro used in a sql model into a python function that is used in a python model instead. It is currently not possible to use a jinja macro as is in a python model.

# dbt_project.yml
name: my_dbt_project
profile: all
config-version: 2
version: "1.0.0"
@jeremyyeo
jeremyyeo / README.md
Last active March 19, 2024 04:25
Debugging dbt snapshots #dbt

Debugging dbt snapshots

Loom: https://www.loom.com/share/84f2ae5463fa48048b9c578244ceb440

At a high level, what we're trying to do here is to:

  1. At the start of the run, make backups of the snapshot snappy and the data it is snapshotting - the model raw.
  2. Test that after the snapshot has completed - if the dbt_scd_id is still unique.
  3. If it passes, everything is good - so we can drop the backups we created in (1).
  4. If it fails, it means some duplicates has gotten into the snapshot somehow - here we leave the backups we created in (1) so that we can come back the next day and replay all the steps so as to figure out what rows caused the duplicate. Here we also create a backup copy of raw at the end of the run so that we have the exact copy of rawas of the time we used it to insert into snappy. (This is because we may have other processes or runs that try to modify raw before we had the time to do our debugging).
import csv
import datetime
import string
from faker import Faker # make sure to `pip install Faker`
NUMS = 10000
fake = Faker()
current_time = datetime.datetime.utcnow()
# header = [
@jeremyyeo
jeremyyeo / README.md
Last active February 2, 2024 06:23
Which dbt nodes respect the generate_schema_name macro? #dbt

Which dbt nodes respect the generate_schema_name macro?

dbt has many types of "nodes"/"resources" - e.g. models, sources, seeds - so which of them actually respect the generate_schema_name() macro? Let's have a look.

The following is tested using:

Core:
  - installed: 1.7.7
  - latest:    1.7.7 - Up to date!
@jeremyyeo
jeremyyeo / README.md
Last active January 22, 2024 21:47
Stopping dbt from auto-expanding column types #dbt

Stopping dbt from auto-expanding column types

By default dbt has functionality that auto-expands similar column types (i.e. varchar(3) to varchar(4)) if the incoming data is too large (https://docs.getdbt.com/faqs/Snapshots/snapshot-schema-changes). We can see this happening like so:

-- models/foo.sql
{{ config(materialized='incremental') }}
select 'foo' as c
@jeremyyeo
jeremyyeo / README.md
Last active December 12, 2023 01:09
A quick explainer of the "dbt was unable to infer all dependencies for the model" error #dbt
@jeremyyeo
jeremyyeo / README.md
Last active December 9, 2023 22:44
How to get models to run only on local timezones if your scheduler only supports UTC #dbt

How to get models to run only on local timezones if your orchestration platform only supports UTC

If you're using an orchestration platform that only supports UTC timezones - you may find it difficult cumbersome to schedule jobs to run on local timezones, accounting for daylight savings time switchovers, etc. Let's see how we can write a dbt macro using some builtin python modules to help us out.

-- macros/is_nz_business_hours.sql
{% macro is_nz_business_hours() %}
    {% set flag = 0 %}
    {% set dt = modules.datetime %}
    {% set pz = modules.pytz %}
@jeremyyeo
jeremyyeo / README.md
Last active March 26, 2024 15:34
How to do multi-threaded / parallel operations with dbt #dbt

How to do multi-threaded / parallel operations with dbt

-!  🚨                                          WARNING                                          🚨  !-
Just because you could does not mean you should!

Examples below are using PostgreSQL however the same idea apply across adapters (Snowflake,etc).

In order to run arbitrary SQL commands with dbt - we would typically use hooks or operations. Imagine for a second, we have to run an operation that performs some action multiple times:

@jeremyyeo
jeremyyeo / README.md
Last active November 10, 2023 21:06
How to correctly use a macro that returns a value to a hook #dbt
@jeremyyeo
jeremyyeo / README.md
Last active November 8, 2023 04:08
Using alternative hosts for dbt hub packages #dbt

Using alternative hosts for dbt hub packages

Full documentation for dbt packages are available here: https://docs.getdbt.com/docs/build/packages and this writeup just reuses what's already there.

The most common pattern of using dbt packages is to use one from the dbt Package hub (https://hub.getdbt.com/). For example:

# packages.yml
packages:
  - package: dbt-labs/dbt_utils