beckjake/dbt_internals.md

## dbt_internals.md

      
When you run dbt compile, dbt does the following at a very high level:

1. Read your dbt_project.yml in, rendering most fields with jinja (hooks and query comments are deferred until later, when more information is available). Any projects in your modules directory are also read in and rendered; consider them part of "your project" when reading.
2. Read your profiles.yml in, rendering everything with jinja.
3. Find all the relevant files (.sql, .yml), as defined in your dbt_project.yml, and read them in.
4. Render each sql file with jinja, primarily collecting calls to ref, source, and config. The string result of the actual rendering is then discarded. The model's materialization type is finalized here, as are any other relevant model-level configuration items (database/schema/alias are easy examples here!).
5. Build a dependency graph using the ref information and use the command-line arguments to decide which nodes to iterate over.
6. For each selected node, in "graph order":

   a) "Compile the node" by rendering the jinja and collecting the resulting sql into a string; this is what's written to target/compiled. The result of rendering here is stored.
   b) If you're running dbt run, render another jinja document, the materialization, with the sql generated in the previous step as "sql".
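
The parse, graph, and compile steps above can be sketched in a few lines of Python. This is a toy illustration, not dbt's code: a regular expression stands in for jinja, the model names, SQL, and the `analytics` schema are invented, and `graphlib` (Python 3.9+) supplies the "graph order".

```python
import re
from graphlib import TopologicalSorter

# Hypothetical two-model project; names and SQL are invented for illustration.
MODELS = {
    "orders": "select * from raw.orders",
    "order_totals": "select id, sum(amount) as total from {{ ref('orders') }} group by id",
}

# Toy stand-in for jinja: it only recognises {{ ref('name') }}.
REF = re.compile(r"\{\{\s*ref\('([^']+)'\)\s*\}\}")


def parse(models):
    """Parse pass: record each model's refs; no rendered text is kept."""
    return {name: set(REF.findall(sql)) for name, sql in models.items()}


def compile_node(sql, schema="analytics"):
    """Compile pass: render again, this time keeping the resulting SQL."""
    return REF.sub(lambda m: f"{schema}.{m.group(1)}", sql)


deps = parse(MODELS)                                     # ref collection
graph_order = list(TopologicalSorter(deps).static_order())  # parents first
compiled = {name: compile_node(MODELS[name]) for name in graph_order}

for name in graph_order:
    print(name, "->", compiled[name])
```

The point of the two passes is that the graph has to exist before any node can be compiled in order, so the first render is only useful for what it collects.
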
FAQs:

Where do hooks come in?

on-run-start and on-run-end hooks are rendered with jinja and run before and after step 6, respectively. Model hooks are rendered and run before and after step 6b as appropriate. Neither kind has a real "parse" phase; they basically go straight to compiling.
How do ephemeral models work?

Ephemeral models are treated specially. During compilation dbt converts them into a partial CTE (including a name), and their execution is skipped. Also during compilation, models that depend upon ephemeral models have the appropriate with statements injected: the dependent models look up the ephemeral models' compiled results and use that information to create a CTE.
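
As a rough sketch of that injection, assuming the dependent model has no with clause of its own: the function name, the `__cte__` naming scheme, and the example SQL below are all hypothetical, and dbt's real implementation handles more cases than this.

```python
def inject_ephemeral_ctes(model_sql, ephemeral):
    """Prepend a CTE for every ephemeral model that model_sql refers to.

    `ephemeral` maps a (hypothetical) CTE name to that model's compiled SQL.
    Assumes model_sql does not already start with its own `with` clause.
    """
    used = {name: sql for name, sql in ephemeral.items() if name in model_sql}
    if not used:
        return model_sql
    ctes = ",\n".join(f"{name} as (\n  {sql}\n)" for name, sql in used.items())
    return f"with {ctes}\n{model_sql}"


# The ephemeral model is never executed; only its compiled SQL is stored.
ephemeral = {
    "__cte__recent_orders": "select * from analytics.orders where created_at > '2020-01-01'",
}
model_sql = "select count(*) from __cte__recent_orders"
result = inject_ephemeral_ctes(model_sql, ephemeral)
print(result)
```
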
How do tests work?

When dbt parses schema.yml files and finds tests, it creates some jinja that calls the relevant macro (the test name with test_ prefixed). This includes a ref to any referenced model. After that, dbt treats schema tests and data tests the same way: it handles them normally, going through the parsing/compiling steps and then executing the resulting SQL. The result is expected to have exactly one row with exactly one column, which should be the number of rows that failed the test.
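
The one-row, one-column contract can be illustrated with an in-memory SQLite database standing in for the warehouse. The table, the test SQL, and the `run_test` helper are assumptions made for this sketch; they are not what dbt generates or calls internally.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table orders (id integer, customer_id integer)")
conn.executemany("insert into orders values (?, ?)", [(1, 10), (2, None), (3, None)])

# Hypothetical SQL that a not-null test on orders.customer_id might compile to.
test_sql = "select count(*) from orders where customer_id is null"


def run_test(connection, sql):
    """Execute a compiled test and return the number of failing rows."""
    rows = connection.execute(sql).fetchall()
    if len(rows) != 1 or len(rows[0]) != 1:
        raise ValueError("a test must return exactly one row with one column")
    return rows[0][0]


failures = run_test(conn, test_sql)
status = "PASS" if failures == 0 else f"FAIL {failures}"
print(status)
```
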
What parts of dbt are concurrent?

Almost all of dbt's concurrency happens in step 6. The on-run-start and on-run-end hooks are deliberately not concurrent, as they can very reasonably have dependencies.
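
One way to picture concurrency in step 6 is a thread pool that starts a node as soon as all of its parents have finished. The sketch below uses only the Python standard library; `run_in_graph_order` and the three-node graph are hypothetical, and dbt's own scheduler is more involved.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter


def run_in_graph_order(deps, run_node, threads=4):
    """Run nodes on a thread pool, starting each once its parents are done."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    finished = []
    pending = {}  # future -> node name
    with ThreadPoolExecutor(max_workers=threads) as pool:
        while sorter.is_active():
            for node in sorter.get_ready():
                pending[pool.submit(run_node, node)] = node
            done, _ = wait(list(pending), return_when=FIRST_COMPLETED)
            for future in done:
                future.result()  # re-raise any error from the worker
                node = pending.pop(future)
                finished.append(node)
                sorter.done(node)  # unblocks this node's children
    return finished


# "a" and "b" are independent and may run together; "c" needs both.
deps = {"a": set(), "b": set(), "c": {"a", "b"}}
order = run_in_graph_order(deps, run_node=lambda node: None)
print(order)
```

Hooks like on-run-start would sit outside this function in a plain sequential loop, which is what lets one hook rely on the effects of the one before it.
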
How can I avoid X happening during parsing?

In your model, you may need to react slightly differently in parsing vs compiling: for example, you might have a log statement that should run at runtime, but makes no sense at parse-time. The execute value is provided for that purpose: it's False in parsing and True during compilation. You can use it to wrap your statements, like so:

```jinja
{% if execute %}
  {{ log('executing the model', info=False) }}
{% endif %}
```

Note that when execute=False, things can be a little funky! Because dbt doesn't know what your final database/schema/identifier values are, it fills them with dummy values that could be completely wrong.
Definitely don't use execute to change the shape of the graph between parsing and runtime by choosing ref targets based on the value: dbt will do bad stuff like drop .. cascade relations that have dependencies and break those dependent models.
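
To make the two passes concrete, here is a toy model of rendering with an execute flag, written in plain Python rather than jinja. The `render` helper, the callable-as-model convention, and the shape of the dummy relation name are all assumptions for illustration.

```python
def render(model, execute, relations):
    """Render a toy model once, either at parse time or at compile time."""
    refs, logs = [], []

    def ref(name):
        refs.append(name)
        # At parse time the final relation names are not known yet,
        # so a placeholder is returned instead (its format is invented here).
        return relations[name] if execute else "dummy_db.dummy_schema." + name

    sql = model(ref=ref, log=logs.append, execute=execute)
    return sql, refs, logs


def my_model(ref, log, execute):
    if execute:  # guard side effects that make no sense while parsing
        log("executing the model")
    return f"select * from {ref('orders')}"


relations = {"orders": "prod.analytics.orders"}
parse_sql, parse_refs, parse_logs = render(my_model, False, relations)
run_sql, run_refs, run_logs = render(my_model, True, relations)
print(parse_sql)
print(run_sql)
```

Note that `ref('orders')` is called in both passes, so the set of refs is identical; only the side effect is guarded. A model that picked a different ref target when execute is true would produce a graph at runtime that parsing never saw.
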
This model has a jinja syntax error and I want to skip parsing!

There's no way for dbt to skip files in your models directory for parsing. Because you can enable/disable models with config(enabled=...), dbt can't tell if a model is enabled until it's been parsed. Instead, to avoid parsing a bad file, you can rename the file to not have a .sql extension, or you can move it out of your models directory.
Where does the cache come in?

The cache is populated in run, seed, snapshot, and test just before on-run-start hooks run. It enumerates all the databases and schemas referenced by your project, and collects information about what currently exists and its table type. This information is relevant for materializations, which might, for example, have to drop a view and create a table in its place for an incremental model that used to be a view.
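
A minimal sketch of how a materialization might consult such a cache, assuming it is a simple mapping from (schema, identifier) to relation type. The dictionary layout and the `must_drop_first` helper are hypothetical; dbt's actual cache is a richer structure.

```python
# Hypothetical cache contents gathered before any model runs:
# (schema, identifier) -> relation type currently in the database.
cache = {
    ("analytics", "orders"): "view",
    ("analytics", "customers"): "table",
}


def must_drop_first(cache, schema, identifier, wanted_type):
    """True when something of a different type already holds the name."""
    existing = cache.get((schema, identifier))
    return existing is not None and existing != wanted_type


# An incremental model needs a table, but "orders" is currently a view.
print(must_drop_first(cache, "analytics", "orders", "table"))
print(must_drop_first(cache, "analytics", "customers", "table"))
```
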