2022-11-23: Changed catalog_2.sql to filter graph nodes for models only. Previously we were retrieving all nodes, which included tests. Added catalog_3.sql, which can filter for specific model tags and only retrieve metadata for those.

A small workaround for dbt-labs/dbt-bigquery#115.
If you have a BigQuery dbt project that:
- Uses a small number of sources from a dataset with many unrelated tables/views.
- Writes to a dataset with many other unrelated tables/views.
It is likely you will run into the issue linked above. This is because when dbt builds the catalog.json file, it has to iterate through each dataset in your dbt project AND retrieve all tables/views in it, whether they are relevant to your dbt project or not.
Let's take a look at a quick example.
- Setup BigQuery.

```sql
create or replace table `cse-sandbox-319708.dbt_jyeo_src.my_source_a` as select 1 as user_id;
create or replace table `cse-sandbox-319708.dbt_jyeo_src.my_source_b` as select 2 as user_id;
create or replace table `cse-sandbox-319708.dbt_jyeo_src.my_source_c` as select 3 as user_id;
create or replace table `cse-sandbox-319708.dbt_jyeo.baz_19700101` as select 1 as user_id;
create or replace table `cse-sandbox-319708.dbt_jyeo.baz_19700102` as select 2 as user_id;
create or replace table `cse-sandbox-319708.dbt_jyeo.baz_19700103` as select 3 as user_id;
create or replace view `cse-sandbox-319708.dbt_jyeo.qux_19700101` as select 1 as user_id;
create or replace view `cse-sandbox-319708.dbt_jyeo.qux_19700102` as select 2 as user_id;
create or replace view `cse-sandbox-319708.dbt_jyeo.qux_19700103` as select 3 as user_id;
```
I made sure those datasets were empty to begin with, so after running the statements above, those are all the relations they contain. The dbt_jyeo dataset is where we will output our dbt models, and the dbt_jyeo_src dataset will be used as a source in our project.
- Setup dbt project.

```sql
-- models/bar.sql
select 1 as user_id
```

```sql
-- models/foo.sql
select 1 as user_id
```
^ Make sure these write to the dbt_jyeo dataset.
```yaml
# models/sources.yml
version: 2
sources:
  - name: dbt_jyeo_src
    tables:
      - name: my_source_a
```
- Run `dbt run` to build your models.
- Override the catalog macro and put it in your own project.
```sql
-- macros/catalog.sql
{% macro get_catalog(information_schema, schemas) -%}
...
{%- endmacro %}
```
I've modified mine slightly (see catalog_1.sql attached to this gist) so I can print out some debugging information. Mainly, I want to log the query being used and print out the number of rows returned by the query for each dataset.
- Generate the docs and see the output.

```sh
$ dbt docs generate
04:22:33 Running with dbt=1.2.0
04:22:35 Found 2 models, 0 tests, 0 snapshots, 0 analyses, 286 macros, 0 operations, 0 seed files, 1 source, 0 exposures, 0 metrics
04:22:35
04:22:38 Concurrency: 1 threads (target='dev')
04:22:38
04:22:38 Done.
04:22:38 Building catalog
04:22:49 {'dbt_jyeo'} - rows returned: 6
04:22:58 {'dbt_jyeo_src'} - rows returned: 3
04:22:58 Catalog written to /Users/jeremy/src/dbt-sandcastles/dbt/bigquery/target/catalog.json
```
^ So we have 2 models in our project and use 1 source, yet a total of 9 rows were retrieved (most of them unrelated to our dbt project).
- Go to our debug logs and run the exact query dbt is running in BigQuery itself.

The query for the dbt_jyeo dataset above is returning 6 rows when only 2 are relevant to our dbt project (same thing if you run the query for the dbt_jyeo_src dataset). 1 2
- Basically, the fix is as already proposed by Jerco in the issue: we should filter the query to only include relations of interest.
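The shape of that filter can be sketched in Python (a hypothetical helper; the column name `table_name` is illustrative and the real macro renders the predicate in Jinja):

```python
# Build an extra predicate that restricts the catalog query to relations
# actually present in the dbt project. "table_name" is an illustrative
# column name, not necessarily what the adapter's catalog query uses.
def relation_filter(relations):
    quoted = ", ".join("'{}'".format(r) for r in sorted(set(relations)))
    return "and table_name in ({})".format(quoted)

print(relation_filter(["foo", "bar", "my_source_a"]))
# → and table_name in ('bar', 'foo', 'my_source_a')
```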
- Replace the overridden macro with the one we have in this gist (catalog_2.sql). This macro tweaks a couple of things:
  - Makes it so date-sharded tables/views show up as a single row and not many rows (dbt-labs/dbt-bigquery#115 (comment)).
  - Builds a list of all the model names and source names in our project using the `graph` context variable, which we then use to filter on later.
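The two tweaks can be approximated in Python (names are illustrative; catalog_2.sql does this in Jinja using the `graph` context variable):

```python
import re

# Collapse date-sharded relation names (e.g. baz_19700101, baz_19700102)
# down to a single base name, then dedupe, so each sharded relation
# contributes only one entry to the filter list.
def collapse_and_dedupe(names):
    base_names = [re.sub(r"_\d{8}$", "", n) for n in names]
    return sorted(set(base_names))

shards = ["baz_19700101", "baz_19700102", "qux_19700101", "foo", "bar"]
print(collapse_and_dedupe(shards))  # → ['bar', 'baz', 'foo', 'qux']
```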
- Let's regenerate our docs and look at the output with our new override.
```sh
$ dbt docs generate
04:41:28 Running with dbt=1.2.0
04:41:30 Found 2 models, 0 tests, 0 snapshots, 0 analyses, 286 macros, 0 operations, 0 seed files, 1 source, 0 exposures, 0 metrics
04:41:30
04:41:33 Concurrency: 1 threads (target='dev')
04:41:33
04:41:33 Done.
04:41:33 Building catalog
04:41:44 {'dbt_jyeo'} - rows returned: 2
04:41:51 {'dbt_jyeo_src'} - rows returned: 1
04:41:51 Catalog written to /Users/jeremy/src/dbt-sandcastles/dbt/bigquery/target/catalog.json
```
^ We have significantly reduced the number of rows retrieved to build our catalog by filtering down to only the ones that truly matter.
- Jerco notes that the limitation is how big the actual list of relations gets: if you have too many of them in your project, the query that dbt sends to BigQuery could exceed the 1 MB query text limit. I tested this by doing something like the following (and removing the `{%- set relations_in_project = set(relations_in_project) | list -%}` line):

```sql
{%- for n in range(33333) -%}
    {%- for node in graph.nodes.values() -%}
        {%- do relations_in_project.append(node.alias) -%}
    {%- endfor -%}
    {%- for source in graph.sources.values() -%}
        {%- do relations_in_project.append(source.name) -%}
    {%- endfor -%}
{%- endfor -%}
{% do log(relations_in_project | length, True) %}
```

That put 3 x 33333 (~100k) names into the filtering `and` clause that we added, and it ran just fine. When I tried `range(100000)` (~300k names), I ran into the "The query is too large. The maximum standard SQL query length is 1024.00K characters, including comments and white space characters." error. Where the cutoff lies probably depends on how long your model names / source table names are, since in this example they are quite short (e.g. foo).
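As a rough back-of-the-envelope check (a hypothetical helper, not part of the gist), you can estimate how many characters a quoted IN-list of relation names would add to the query relative to BigQuery's 1024.00K-character limit:

```python
MAX_QUERY_CHARS = 1024 * 1000  # BigQuery's standard SQL query length limit

# Length of "'name', 'name', ..." — each name gains two quote characters,
# and names are joined with ", " (2 characters per separator).
def in_list_length(names):
    return sum(len(n) + 2 for n in names) + 2 * max(len(names) - 1, 0)

print(in_list_length(["foo"] * 100000) > MAX_QUERY_CHARS)  # → False (~700k chars)
print(in_list_length(["foo"] * 300000) > MAX_QUERY_CHARS)  # → True (~2.1M chars)
```

This ignores the base query's own length, so the real ceiling is a bit lower, but it matches the ~100k-fine / ~300k-too-large behaviour observed above.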
Some projects may have many thousands of intermediate models, so perhaps we really only want to narrow down the catalog query to certain "key" models. We can do that by utilizing tags.
Let's add a tag to one of our models:
```sql
-- models/bar.sql
{{ config(tags=['important']) }}
select 1 as user_id
```

```sql
-- models/foo.sql
select 1 as user_id
```
P.s. While I'm tagging via the model's config block, you can indeed tag a whole folder of models in the dbt_project.yml file instead of doing it model by model here.
And then set up a var in our dbt_project.yml:

```yaml
# dbt_project.yml
...
vars:
  docs_include_tags: ['important']
...
```
Add our override macro, catalog_3.sql.
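The idea behind catalog_3.sql can be sketched as follows (simplified node dicts standing in for dbt's graph context; names are illustrative):

```python
# Keep only the aliases of models whose tags intersect the allow-list,
# mirroring the docs_include_tags var from dbt_project.yml.
def relations_to_catalog(nodes, include_tags):
    return [
        node["alias"]
        for node in nodes
        if set(node.get("tags", [])) & set(include_tags)
    ]

nodes = [
    {"alias": "bar", "tags": ["important"]},
    {"alias": "foo", "tags": []},
]
print(relations_to_catalog(nodes, ["important"]))  # → ['bar']
```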
Then let's build our docs:
```sh
$ dbt docs generate
21:47:59 Running with dbt=1.3.0
21:48:01 Found 2 models, 0 tests, 0 snapshots, 0 analyses, 320 macros, 0 operations, 0 seed files, 1 source, 0 exposures, 0 metrics
21:48:01
21:48:02 Concurrency: 1 threads (target='dev')
21:48:02
21:48:03 Done.
21:48:03 Building catalog
21:48:09 {'dbt_jyeo'} - rows returned: 1
21:48:14 {'dbt_jyeo_src'} - rows returned: 1
21:48:14 Catalog written to /Users/jeremy/src/dbt-sandcastles/dbt/bigquery/target/catalog.json
```
One of the downsides here is that only the tagged model(s) will have metadata. But perhaps that's a-okay.
Footnotes
1. Ftr, it looks like if you have many hundreds or thousands of rows to be returned, the BigQuery UI can elegantly display them via pagination, but because dbt has to "download all those rows" to the local machine, it can mean a really bad time (https://github.com/dbt-labs/dbt-bigquery/issues/115#issuecomment-1035304410). ↩
2. There should be 1 row per table-column (since all my toy tables only have 1 column, we only see 1 row per table here). ↩