If you download the debug logs from a dbt Cloud run, you can see the exact query dbt executes during the source freshness step. In the case above, it executed the following:
```sql
select
    max(updated_at) as max_loaded_at,
    convert_timezone('UTC', current_timestamp()) as snapshotted_at
from development.dbt_jyeo.my_source_a
```
For a source that I have defined in my project like so:
```yaml
# models/sources.yml
version: 2

sources:
  - name: dbt_jyeo
    loaded_at_field: updated_at
    freshness:
      warn_after: { count: 12, period: hour }
      error_after: { count: 24, period: hour }
    tables:
      - name: my_source_a
```
Given the above and these docs from Snowflake... I found that the query above ran while scanning 0 bytes:
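If you want to double-check this yourself outside the dbt Cloud UI, one way is to look the query up in Snowflake's query history and inspect its `BYTES_SCANNED`. A minimal sketch, assuming you have access to the `SNOWFLAKE.ACCOUNT_USAGE` share (which can lag by up to 45 minutes):

```sql
-- Find recent freshness-style queries and how many bytes each one scanned.
-- Assumes access to the SNOWFLAKE.ACCOUNT_USAGE share; results can lag.
select query_text, bytes_scanned, total_elapsed_time
from snowflake.account_usage.query_history
where query_text ilike '%max(updated_at)%'
order by start_time desc
limit 5;
```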
Note that we can use the `explain` command to show us the query plan too:
Typically, the way Snowflake works with tables is that it stores metadata about each of the columns in a table. This means that operations such as the count, min, max, and so on of a column can be answered without doing any table scan. You can see this when I explain the same source freshness query run against a Snowflake sample table (RAW_TPCH.WEATHER.HOURLY_14_TOTAL) that has about 1B rows (1.7 TB in size):
Snowflake didn't have to scan any bytes because the metadata on the `t` column is already stored.
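For reference, here's a sketch of what that `explain` looks like, reusing the sample table and `t` column named above (the exact plan output will vary by account):

```sql
-- Explain the freshness-style query against the Snowflake sample table.
-- In the plan, partitionsAssigned / bytesAssigned should come back as 0,
-- because Snowflake can answer max(t) from column metadata alone.
explain
select
    max(t) as max_loaded_at,
    convert_timezone('UTC', current_timestamp()) as snapshotted_at
from raw_tpch.weather.hourly_14_total;
```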
While the above is specific to Snowflake, I think something similar should apply to BigQuery too.
Update from Jerco:
Even though the query is very fast, BigQuery does charge $$ to scan the column 😞 (just the column, though). There are legacySQL ways to access the table's partition metadata instead, which are free (editor's note: though this appears to be quite gnarly).
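For completeness, a hedged sketch of that legacySQL approach: it only works for partitioned tables, it must run in legacy SQL mode (not standard SQL), and the project/dataset names below are hypothetical.

```sql
#legacySQL
-- Query the __PARTITIONS_SUMMARY__ meta-table, which is served from
-- metadata and doesn't scan the table itself. The project/dataset/table
-- names here are hypothetical; substitute your own.
SELECT
  partition_id,
  MSEC_TO_TIMESTAMP(last_modified_time) AS last_modified
FROM [my_project:my_dataset.my_source_a$__PARTITIONS_SUMMARY__]
ORDER BY last_modified DESC
LIMIT 1;
```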