@jeremyyeo
Last active November 8, 2023 02:33
Using query comments to identify how things came to be #dbt

This is written for Databricks but the same approach applies to any other dbt-supported data warehouse. Note that this also uses dbt Cloud-specific environment variables, but they are not strictly required.

Have you ever asked yourself how things came to be? How did a table get created? How did a schema get created?

Well, fret no more: using both query comments and dbt Cloud's environment variables (DBT_CLOUD_JOB_ID and DBT_CLOUD_RUN_ID), we can figure out exactly which dbt Cloud job run (if any) was responsible for a query - and hence attribute blame credit 😉

By default, when dbt sends a SQL query to the data warehouse, there will be a query comment appended to the top of the query - for example:

/* {"app": "dbt", "dbt_version": "1.6.7", "dbt_databricks_version": "1.6.6", "databricks_sql_connector_version": "2.9.3", "profile_name": "all", "target_name": "db", "connection_name": "create_main_dbt_jyeo_dbt_jyeo"} */
create schema if not exists `main`.`dbt_jyeo_dbt_jyeo`

However, a query comment like that does not tell us too much about which dbt Cloud job or run was responsible for that DDL. Let's try and override the default query comment by adding the following macro:

-- macros/query_comment.sql

{% macro query_comment(node) %}
    {%- set comment_dict = {} -%}
    {%- do comment_dict.update(
        app='dbt',
        dbt_version=dbt_version,
        profile_name=target.get('profile_name'),
        target_name=target.get('target_name'),
        dbt_cloud_job_id=env_var('DBT_CLOUD_JOB_ID', 'not-a-dbt-cloud-job'),
        dbt_cloud_run_id=env_var('DBT_CLOUD_RUN_ID', 'not-a-dbt-cloud-run')
    ) -%}
    {%- if node is not none -%}
      {%- do comment_dict.update(
        file=node.original_file_path,
        node_id=node.unique_id,
        node_name=node.name,
        resource_type=node.resource_type,
        package_name=node.package_name,
        relation={
            "database": node.database,
            "schema": node.schema,
            "identifier": node.identifier
        }
      ) -%}
    {% else %}
      {%- do comment_dict.update(node_id='internal') -%}
    {%- endif -%}
    {% do return(tojson(comment_dict)) %}
{% endmacro %}
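To make the macro's branching concrete, here is a plain-Python sketch of the JSON it emits for a non-node (internal) query such as `create schema`. The literal field values are illustrative copies of the examples below; in reality dbt fills them in from its own context:

```python
import json
import os

def build_query_comment():
    # Sketch of what query_comment() produces when node is None,
    # i.e. for internal queries such as "create schema".
    # dbt_version / profile_name / target_name values are illustrative.
    comment = {
        "app": "dbt",
        "dbt_version": "1.6.7",
        "profile_name": "all",
        "target_name": "db",
        # env_var(name, default) semantics: use the env var if set,
        # otherwise fall back - i.e. when not running in dbt Cloud.
        "dbt_cloud_job_id": os.environ.get("DBT_CLOUD_JOB_ID", "not-a-dbt-cloud-job"),
        "dbt_cloud_run_id": os.environ.get("DBT_CLOUD_RUN_ID", "not-a-dbt-cloud-run"),
        "node_id": "internal",
    }
    return json.dumps(comment)
```

The key design point is the fallback defaults: because `env_var()` is given a second argument, the macro never errors when the variables are unset - it just stamps the query as not coming from a dbt Cloud job.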

And in the dbt_project.yml file we also want to call query_comment() like so:

# dbt_project.yml
...
query-comment: "{{ query_comment(node) }}"

With that in place, the appended query comment now has additional information - for example:

/* {"app": "dbt", "dbt_version": "1.6.7", "profile_name": "all", "target_name": "db", "dbt_cloud_job_id": "not-a-dbt-cloud-job", "dbt_cloud_run_id": "not-a-dbt-cloud-run", "node_id": "internal"} */
create schema if not exists `main`.`dbt_jyeo`

^ We can see that the above dbt_cloud_job_id and dbt_cloud_run_id keys were not attributable to a job or run - i.e. the query was most likely invoked outside of dbt Cloud or in an IDE session.
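To reproduce the attributed variant locally, you can export the two variables yourself before invoking dbt (the 123/456 values here are illustrative; in a real dbt Cloud job run they are set automatically):

```shell
# Illustrative values - dbt Cloud sets these automatically for real job runs
export DBT_CLOUD_JOB_ID=123
export DBT_CLOUD_RUN_ID=456
# Any dbt invocation in this shell now picks them up via env_var()
echo "job=${DBT_CLOUD_JOB_ID} run=${DBT_CLOUD_RUN_ID}"
```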

/* {"app": "dbt", "dbt_version": "1.6.7", "profile_name": "all", "target_name": "db", "dbt_cloud_job_id": "123", "dbt_cloud_run_id": "456", "node_id": "internal"} */
create schema if not exists `main`.`dbt_jyeo`
/* {"app": "dbt", "dbt_version": "1.6.7", "profile_name": "all", "target_name": "db", "dbt_cloud_job_id": "123", "dbt_cloud_run_id": "456", "file": "models/foo.sql", "node_id": "model.my_dbt_project.foo", "node_name": "foo", "resource_type": "model", "package_name": "my_dbt_project", "relation": {"database": "main", "schema": "dbt_not_jyeo", "identifier": "foo"}} */
create or replace table `main`.`dbt_jyeo`.`foo` using delta as
select
  1 id

^ We can see that these queries, on the other hand, have dbt_cloud_job_id and dbt_cloud_run_id keys that are directly attributable to the job (123) and run (456).

Note that not all keys from the default query comment can be reproduced in this manner - for example, dbt_databricks_version and databricks_sql_connector_version are not something we can reuse in our custom override. This is because those are baked directly into the adapter and are not retrievable in the query comment compilation context. Even so, I find it more useful to identify exactly which run was responsible for doing a thing.
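Once the comments are in place, attributing a query found in your warehouse's query history is just a matter of parsing the leading comment back out. A minimal sketch (the regex assumes the comment is the first thing in the query text, which is how dbt emits it in the examples above):

```python
import json
import re

def parse_query_comment(query_text):
    # Pull out the leading /* {...} */ JSON comment that dbt prepends
    # to each query; returns None if the query has no such comment.
    match = re.match(r"\s*/\*\s*(\{.*?\})\s*\*/", query_text, re.DOTALL)
    return json.loads(match.group(1)) if match else None

# Example query text as it might appear in query history
query = (
    '/* {"app": "dbt", "dbt_cloud_job_id": "123", "dbt_cloud_run_id": "456"} */\n'
    "create schema if not exists `main`.`dbt_jyeo`"
)
meta = parse_query_comment(query)
# meta["dbt_cloud_run_id"] is "456" - directly attributable to run 456
```

From there, grouping query history by dbt_cloud_job_id / dbt_cloud_run_id tells you which job run created which object.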
