TL;DR: the run_results.json file is only written to disk AFTER the on-run-end hooks have executed without error, so it isn't available DURING the on-run-end hook.
Suppose you have the following setup:
```yaml
# packages.yml
packages:
  - package: brooklyn-data/dbt_artifacts
    version: 0.8.0
```

```yaml
# dbt_project.yml
name: "my_dbt_project"
version: "1.0.0"
config-version: 2

clean-targets:
  - "logs"
  - "target"
  # Commenting this out so that we don't clean the packages folder with a `dbt clean`
  # command which will then require a `dbt deps` to reinstall the dbt_artifacts package.
  # - "dbt_packages"

on-run-end:
  - "{{ dbt_artifacts.upload_dbt_artifacts(['run_results']) }}"
```
And perhaps a simple source with a test (note that it doesn't matter whether the test passes or fails):
```yaml
# models/sources.yml
version: 2

sources:
  - name: dbt_jyeo
    tables:
      - name: foo
        columns:
          - name: user_id
            tests:
              - not_null
```
The intention here is that we run a test, and then we upload the run_results.json artifact during the on-run-end hook for subsequent processing. The reason we want to use the on-run-end hook is that in dbt Cloud, if a prior step has an error (e.g. a test failure), then a subsequent dbt run-operation step which uploads the artifact will not run.
Now, what happens when we do a `dbt clean && dbt test --select source:*`?¹
```
~ dbt clean && dbt test --select source:*
02:00:44 Running with dbt=1.1.0
02:00:46 Clearing existing files from Stage:
remove @development.dbt_jyeo.my_dbt_project-on-run-end-0 pattern='.*\/0ecc2a03-53b8-41f9-b8e2-bad294522a74\/.*\.json.gz';
02:00:46 Uploading run_results.json to Stage:
put file://target/run_results.json @development.dbt_jyeo.my_dbt_project-on-run-end-0/0ecc2a03-53b8-41f9-b8e2-bad294522a74 auto_compress=true;
02:00:46 Copying run_results.json from Stage:
-- Merge to avoid duplicates
merge into development.dbt_jyeo.my_dbt_project-on-run-end-0 as old_data using (
    select
        $1 as data,
        $1:metadata:generated_at::timestamp_ntz as generated_at,
        metadata$filename as path,
        regexp_substr(metadata$filename, '([a-z_]+.json)') as artifact_type
    from @development.dbt_jyeo.my_dbt_project-on-run-end-0/0ecc2a03-53b8-41f9-b8e2-bad294522a74
) as new_data
on old_data.generated_at = new_data.generated_at
-- NB: No clause for "when matched" - as matching rows should be skipped.
when not matched then insert (
    data,
    generated_at,
    path,
    artifact_type
) values (
    new_data.data,
    new_data.generated_at,
    new_data.path,
    new_data.artifact_type
)
02:00:46 Clearing new files from Stage:
remove @development.dbt_jyeo.my_dbt_project-on-run-end-0 pattern='.*\/0ecc2a03-53b8-41f9-b8e2-bad294522a74\/.*\.json.gz';
02:00:47 Found 21 models, 17 tests, 0 snapshots, 0 analyses, 427 macros, 1 operation, 0 seed files, 5 sources, 0 exposures, 0 metrics
02:00:47
02:00:49 Concurrency: 4 threads (target='dev')
02:00:49
02:00:49 1 of 1 START test source_not_null_dbt_jyeo_foo_user_id ......................... [RUN]
02:00:51 1 of 1 PASS source_not_null_dbt_jyeo_foo_user_id ............................... [PASS in 1.84s]
02:00:51
02:00:51 Running 1 on-run-end hook
02:00:51 Clearing existing files from Stage:
remove @development.dbt_artifacts.artifacts pattern='.*\/0ecc2a03-53b8-41f9-b8e2-bad294522a74\/.*\.json.gz';
02:00:52 Uploading run_results.json to Stage:
put file://target/run_results.json @development.dbt_artifacts.artifacts/0ecc2a03-53b8-41f9-b8e2-bad294522a74 auto_compress=true;
02:00:53 Database error while running on-run-end
02:00:55 Encountered an error:
Database Error
  253006: File doesn't exist: ['target/run_results.json']
```
We get an error that says the run_results.json file doesn't exist in our target folder. And if you actually check the files that are in the target folder:
```
~ ls target
compiled
graph.gpickle
manifest.json
partial_parse.msgpack
run
```
Indeed, there isn't a run_results.json file. So here comes the first lesson: if we run into an error during the on-run-end hook (not a test failure, mind you), a run_results.json file isn't produced.
You can test the above out yourself with a macro:²
```sql
-- macros/my_macro.sql
{% macro my_macro() %}
  {% if execute %}
    {{ exceptions.raise_compiler_error('ops') }}
  {% endif %}
{% endmacro %}
```
Which we then use in our hook:
```yaml
# dbt_project.yml
...
on-run-end:
  - "{{ my_macro() }}"
```
```
~ dbt clean && dbt test --select source:*
02:22:22 Running with dbt=1.1.0
02:22:24 Found 21 models, 17 tests, 0 snapshots, 0 analyses, 428 macros, 1 operation, 0 seed files, 5 sources, 0 exposures, 0 metrics
02:22:24
02:22:26 Concurrency: 4 threads (target='dev')
02:22:26
02:22:26 1 of 1 START test source_not_null_dbt_jyeo_foo_user_id ......................... [RUN]
02:22:27 1 of 1 PASS source_not_null_dbt_jyeo_foo_user_id ............................... [PASS in 1.73s]
02:22:27
02:22:27 Running 1 on-run-end hook
02:22:27 Database error while running on-run-end
02:22:28 Encountered an error:
Compilation Error in operation my_dbt_project-on-run-end-0 (./dbt_project.yml)
  ops
  > in macro my_macro (macros/my_macro.sql)
  > called by operation my_dbt_project-on-run-end-0 (./dbt_project.yml)
  > called by operation my_dbt_project-on-run-end-0 (./dbt_project.yml)
```
```
~ ls target
compiled
graph.gpickle
manifest.json
partial_parse.msgpack
run
```
When we encounter an error in our hook, a run_results.json isn't generated (and to be honest, I'm not quite sure why yet). However, that isn't the main issue here. Let's try changing our macro slightly:³
```sql
-- macros/my_macro.sql
{% macro my_macro() %}
  {% if execute %}
    {{ debug() }}
    {{ log('ops', true) }}
  {% endif %}
{% endmacro %}
```
With the debug() macro (I only recently learned about this from @ChenyuLInx), dbt execution will stop right before we log our 'ops' string. In order to do this though, you need to make sure you've installed ipdb and that the env var DBT_MACRO_DEBUGGING is set. Here it is useful to open up two shell/terminal sessions so we can list the contents of the target folder while dbt is running:
```
~ pip install ipdb
~ export DBT_MACRO_DEBUGGING=True
~ dbt clean && dbt test --select source:*
02:32:26 Running with dbt=1.1.0
02:32:27 Found 21 models, 17 tests, 0 snapshots, 0 analyses, 428 macros, 1 operation, 0 seed files, 5 sources, 0 exposures, 0 metrics
02:32:27
02:32:30 Concurrency: 4 threads (target='dev')
02:32:30
02:32:30 1 of 1 START test source_not_null_dbt_jyeo_foo_user_id ......................... [RUN]
02:32:32 1 of 1 PASS source_not_null_dbt_jyeo_foo_user_id ............................... [PASS in 1.93s]
02:32:32
02:32:32 Running 1 on-run-end hook
> /Users/jeremy/src/dbt-sandcastles/dbt/snowflake/aff28188abc1d3ee9f951ce6(26)macro()
ipdb>
```
While dbt is running, it should suddenly stop just before logging the 'ops' string and drop you into the ipdb debugger. Before you continue, open up a new shell/terminal and try listing the files in your target folder at this very time:
```
~ ls target
compiled
graph.gpickle
manifest.json
partial_parse.msgpack
run
```
See how there isn't a run_results.json file? Since this file doesn't exist at this time (within the on-run-end hook execution), we CANNOT upload it, and thus the `File doesn't exist: ['target/run_results.json']` error we experienced is not unexpected. It's probably worth pointing out that we do have manifest.json available (it is generated at the start of the dbt invocation), which means it can be uploaded at this time.
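For example, since manifest.json is already on disk at this point, the same package hook could upload that artifact instead. A minimal sketch, assuming the dbt_artifacts 0.8.0 macro accepts 'manifest' in its artifact list (check the package's docs for the exact supported names):

```yaml
# dbt_project.yml
on-run-end:
  # manifest.json is written at the start of the invocation,
  # so it DOES exist on disk while this hook runs
  - "{{ dbt_artifacts.upload_dbt_artifacts(['manifest']) }}"
```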
Feel free to continue dbt's execution by entering c and pressing ENTER:
```
ipdb> c
02:35:42 ops
02:35:42 1 of 1 START hook: my_dbt_project.on-run-end.0 ................................. [RUN]
02:35:42 1 of 1 OK hook: my_dbt_project.on-run-end.0 .................................... [OK in 0.00s]
02:35:42
02:35:42
02:35:42 Finished running 1 test, 1 hook in 194.15s.
02:35:42
02:35:42 Completed successfully
02:35:42
02:35:42 Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1
```
And then checking the target folder, we see that run_results.json is now available AFTER on-run-end completes:
```
~ ls target
compiled
graph.gpickle
manifest.json
partial_parse.msgpack
run
run_results.json
```
This concludes the second and primary lesson: we do not have a run_results.json written to disk until AFTER the on-run-end hooks complete.
So what can we do about this? Well, during the on-run-end hook, we have access to a special results context variable that contains the same information as the run_results.json file, so we can use that instead (which requires more work, but I've written about that here: https://gist.github.com/jeremyyeo/064106e480106b49cd337f33a765ef20).
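As a minimal sketch of that idea (the macro name here is hypothetical; the linked gist goes much further), an on-run-end macro can iterate over the results variable directly, with no file on disk required:

```sql
-- macros/log_results.sql (hypothetical example macro)
{% macro log_results(results) %}
  {% if execute %}
    {% for res in results %}
      {# each result carries the node plus its status, timing, message, etc. #}
      {{ log(res.node.unique_id ~ ' -> ' ~ res.status, info=true) }}
    {% endfor %}
  {% endif %}
{% endmacro %}
```

Wired up via `on-run-end: "{{ log_results(results) }}"` in dbt_project.yml, this executes at exactly the point where we saw run_results.json was still missing.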
Footnotes

1. If you don't select only your sources, you will also be testing the ones that come with the dbt_artifacts package, which we don't want to do for this exercise.
2. Here we're using the `if execute` conditional because otherwise we would error even before we execute our tests - read more about it in the dbt docs.
3. We're using the `debug()` macro - read more about that in the dbt docs.