Why can't I upload the run_results.json file (using dbt_artifacts) in an on-run-end hook?

TL;DR - the run_results.json file is only written to disk AFTER the on-run-end hooks are executed without error and so isn't available DURING the on-run-end hook.

You've got the following setup:

# packages.yml
packages:
  - package: brooklyn-data/dbt_artifacts
    version: 0.8.0
# dbt_project.yml
name: "my_dbt_project"
version: "1.0.0"
config-version: 2
clean-targets:
  - "logs"
  - "target"
  # Commenting this out so that we don't clean the packages folder with a `dbt clean`
  # command which will then require a `dbt deps` to reinstall the dbt_artifacts package.
  # - "dbt_packages"
on-run-end:
  - "{{ dbt_artifacts.upload_dbt_artifacts(['run_results']) }}"

And perhaps a simple source (note that it doesn't matter if the test passes or fails):

# models/sources.yml

version: 2
sources:
  - name: dbt_jyeo
    tables:
      - name: foo
        columns:
          - name: user_id
            tests:
              - not_null

The intention here is that we run a test and then, in the on-run-end hook, upload the run_results.json artifact for subsequent processing. The reason we want to use on-run-end is that in dbt Cloud, if a prior step has an error (e.g. a test fails), then a subsequent dbt run-operation step which uploads the artifact will not run.
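For example, you might be tempted to split this into two dbt Cloud job steps, roughly like the hypothetical sketch below (the run-operation invocation here is an assumption based on the dbt_artifacts package docs) - but if step 1 errors, step 2 never runs:

# Hypothetical dbt Cloud job steps:
dbt test --select source:*
# ^ if this step errors (e.g. a test fails), the step below is skipped
#   and the artifact never gets uploaded:
dbt run-operation upload_dbt_artifacts --args '{filenames: [run_results]}'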

Now what happens when we do a dbt clean && dbt test --select source:*?¹

~ dbt clean && dbt test --select source:*

02:00:44  Running with dbt=1.1.0
02:00:46  Clearing existing files from Stage: 
    remove @development.dbt_jyeo.my_dbt_project-on-run-end-0 pattern='.*\/0ecc2a03-53b8-41f9-b8e2-bad294522a74\/.*\.json.gz';

02:00:46  Uploading run_results.json to Stage: 
        put file://target/run_results.json @development.dbt_jyeo.my_dbt_project-on-run-end-0/0ecc2a03-53b8-41f9-b8e2-bad294522a74 auto_compress=true;
    
02:00:46  Copying run_results.json from Stage: 

        -- Merge to avoid duplicates
        merge into development.dbt_jyeo.my_dbt_project-on-run-end-0 as old_data using (
            select
            $1 as data,
            $1:metadata:generated_at::timestamp_ntz as generated_at,
            metadata$filename as path,
            regexp_substr(metadata$filename, '([a-z_]+.json)') as artifact_type
            from  @development.dbt_jyeo.my_dbt_project-on-run-end-0/0ecc2a03-53b8-41f9-b8e2-bad294522a74
        ) as new_data
        on old_data.generated_at = new_data.generated_at
        -- NB: No clause for "when matched" - as matching rows should be skipped.
        when not matched then insert (
            data,
            generated_at,
            path,
            artifact_type
        ) values (
            new_data.data,
            new_data.generated_at,
            new_data.path,
            new_data.artifact_type
        )

    
02:00:46  Clearing new files from Stage: 
    remove @development.dbt_jyeo.my_dbt_project-on-run-end-0 pattern='.*\/0ecc2a03-53b8-41f9-b8e2-bad294522a74\/.*\.json.gz';

02:00:47  Found 21 models, 17 tests, 0 snapshots, 0 analyses, 427 macros, 1 operation, 0 seed files, 5 sources, 0 exposures, 0 metrics
02:00:47  
02:00:49  Concurrency: 4 threads (target='dev')
02:00:49  
02:00:49  1 of 1 START test source_not_null_dbt_jyeo_foo_user_id ......................... [RUN]
02:00:51  1 of 1 PASS source_not_null_dbt_jyeo_foo_user_id ............................... [PASS in 1.84s]
02:00:51  
02:00:51  Running 1 on-run-end hook
02:00:51  Clearing existing files from Stage: 
    remove @development.dbt_artifacts.artifacts pattern='.*\/0ecc2a03-53b8-41f9-b8e2-bad294522a74\/.*\.json.gz';

02:00:52  Uploading run_results.json to Stage: 
        put file://target/run_results.json @development.dbt_artifacts.artifacts/0ecc2a03-53b8-41f9-b8e2-bad294522a74 auto_compress=true;
    
02:00:53  Database error while running on-run-end
02:00:55  Encountered an error:
Database Error
  253006: File doesn't exist: ['target/run_results.json']

We get an error that says the run_results.json file doesn't exist in our target folder. And if you check which files are actually in the target folder:

~ ls target
compiled              
graph.gpickle         
manifest.json         
partial_parse.msgpack 
run

Indeed, there isn't a run_results.json file. So here comes the first lesson: if we run into an error during the on-run-end hook (not a test failure, mind you), a run_results.json file isn't produced.

You can test the above out yourself with a macro:²

-- macros/my_macro.sql
{% macro my_macro() %}
    {% if execute %}
        {{ exceptions.raise_compiler_error('ops') }}
    {% endif %}
{% endmacro %}

Which we then use in our hook:

# dbt_project.yml
...
on-run-end:
  - "{{ my_macro() }}"
~ dbt clean && dbt test --select source:*
02:22:22  Running with dbt=1.1.0
02:22:24  Found 21 models, 17 tests, 0 snapshots, 0 analyses, 428 macros, 1 operation, 0 seed files, 5 sources, 0 exposures, 0 metrics
02:22:24  
02:22:26  Concurrency: 4 threads (target='dev')
02:22:26  
02:22:26  1 of 1 START test source_not_null_dbt_jyeo_foo_user_id ......................... [RUN]
02:22:27  1 of 1 PASS source_not_null_dbt_jyeo_foo_user_id ............................... [PASS in 1.73s]
02:22:27  
02:22:27  Running 1 on-run-end hook
02:22:27  Database error while running on-run-end
02:22:28  Encountered an error:
Compilation Error in operation my_dbt_project-on-run-end-0 (./dbt_project.yml)
  ops
  
  > in macro my_macro (macros/my_macro.sql)
  > called by operation my_dbt_project-on-run-end-0 (./dbt_project.yml)
  > called by operation my_dbt_project-on-run-end-0 (./dbt_project.yml)

~ ls target
compiled              
graph.gpickle         
manifest.json         
partial_parse.msgpack 
run

When we encounter an error in our hook, a run_results.json isn't generated (and to be honest, I'm not quite sure why yet). However, that isn't the main issue here. Let's try changing our macro slightly:³

-- macros/my_macro.sql
{% macro my_macro() %}
    {% if execute %}
        {{ debug() }}
        {{ log('ops', true) }}
    {% endif %}
{% endmacro %}

With the debug() macro (which I only recently learned about from @ChenyuLInx), dbt execution will stop right before we log our 'ops' string. In order to do this though, you need to make sure you've installed ipdb and set the DBT_MACRO_DEBUGGING environment variable.

Here it is useful to open up two shell/terminal sessions so we can list out the contents of the target folder while dbt is running:

~ pip install ipdb
~ export DBT_MACRO_DEBUGGING=True
~ dbt clean && dbt test --select source:*

02:32:26  Running with dbt=1.1.0
02:32:27  Found 21 models, 17 tests, 0 snapshots, 0 analyses, 428 macros, 1 operation, 0 seed files, 5 sources, 0 exposures, 0 metrics
02:32:27  
02:32:30  Concurrency: 4 threads (target='dev')
02:32:30  
02:32:30  1 of 1 START test source_not_null_dbt_jyeo_foo_user_id ......................... [RUN]
02:32:32  1 of 1 PASS source_not_null_dbt_jyeo_foo_user_id ............................... [PASS in 1.93s]
02:32:32  
02:32:32  Running 1 on-run-end hook
> /Users/jeremy/src/dbt-sandcastles/dbt/snowflake/aff28188abc1d3ee9f951ce6(26)macro()

ipdb> 

While dbt is running, it should suddenly stop just before logging the 'ops' string and drop you into the ipdb debugger. Before you continue, you can open up a new shell/terminal and list the files in your target folder at this very moment:

~ ls target
compiled              
graph.gpickle         
manifest.json         
partial_parse.msgpack 
run

See how there isn't a run_results.json file? Since this file doesn't exist at this time (i.e. during the on-run-end hook execution), we CANNOT upload it, and thus the File doesn't exist: ['target/run_results.json'] error we experienced is not unexpected. It's probably worth pointing out that we do have manifest.json available (it is generated at the start of the run), which consequently means it can be uploaded at this time.
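As an aside, that means a hook which only uploads artifacts that do exist at this point should work. A minimal sketch, assuming dbt_artifacts accepts 'manifest' in the filenames list the same way it accepts 'run_results':

# dbt_project.yml
on-run-end:
  - "{{ dbt_artifacts.upload_dbt_artifacts(['manifest']) }}"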

Feel free to continue dbt's execution by entering c and pressing ENTER:

ipdb> c
02:35:42  ops
02:35:42  1 of 1 START hook: my_dbt_project.on-run-end.0 ................................. [RUN]
02:35:42  1 of 1 OK hook: my_dbt_project.on-run-end.0 .................................... [OK in 0.00s]
02:35:42  
02:35:42  
02:35:42  Finished running 1 test, 1 hook in 194.15s.
02:35:42  
02:35:42  Completed successfully
02:35:42  
02:35:42  Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1

And checking the target folder now, we can see that run_results.json is available AFTER on-run-end completes:

~ ls target
compiled              
graph.gpickle         
manifest.json         
partial_parse.msgpack 
run                   
run_results.json

This concludes the second and primary lesson: a run_results.json is not written to disk until AFTER on-run-end completes.

So what can we do about this? Well, during on-run-end, we have access to a special results context variable that contains the same information that would be written to the run_results.json file, so we can use that instead. It requires more work, but I've written about that here: https://gist.github.com/jeremyyeo/064106e480106b49cd337f33a765ef20.
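To give a taste of that approach, here's a minimal sketch of a hook macro that reads the results variable directly, logging each node's status instead of uploading a file (the node, status and execution_time attributes are standard on dbt 1.x result objects; anything fancier, like inserting rows into a table, is left to you):

-- macros/log_results.sql
{% macro log_results(results) %}
    {% if execute %}
        {% for res in results %}
            {# each result carries the node it ran plus its status and timing #}
            {{ log(res.node.unique_id ~ ' | ' ~ res.status ~ ' | ' ~ res.execution_time, info=true) }}
        {% endfor %}
    {% endif %}
{% endmacro %}

# dbt_project.yml
on-run-end:
  - "{{ log_results(results) }}"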

Footnotes

  1. If you don't select only your sources, you will also be running the tests that come with the dbt_artifacts package, which we don't want to do for this exercise.

  2. Here we're using the if execute conditional because otherwise we would error even before we execute our tests (see the sketch after these footnotes) - read more about it in our docs.

  3. We're using the debug() macro - read more about that in our docs.
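To illustrate footnote 2: hooks in dbt_project.yml are rendered at parse time as well (when execute is False), so a version of the macro without the guard - sketched below - would raise the compilation error before any test even starts:

-- macros/my_macro.sql
{% macro my_macro() %}
    {# without the `if execute` guard, this renders during parsing too, #}
    {# so dbt errors out before the tests get a chance to run #}
    {{ exceptions.raise_compiler_error('ops') }}
{% endmacro %}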
