Full documentation for dbt packages is available here: https://docs.getdbt.com/docs/build/packages and this writeup just reuses what's already there.
The most common pattern of using dbt packages is to use one from the dbt Package hub (https://hub.getdbt.com/). For example:
# packages.yml
packages:
  - package: dbt-labs/dbt_utils
    version: 1.1.1
What happens when we install dbt packages (i.e. run dbt deps):
$ dbt --debug deps
03:30:12 Sending event: {'category': 'dbt', 'action': 'invocation', 'label': 'start', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x104cbf8e0>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x1078eed00>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x107927310>]}
03:30:12 Running with dbt=1.6.7
03:30:12 running dbt with arguments {'printer_width': '80', 'indirect_selection': 'eager', 'log_cache_events': 'False', 'write_json': 'True', 'partial_parse': 'False', 'cache_selected_only': 'False', 'profiles_dir': '/Users/jeremy/.dbt', 'version_check': 'True', 'fail_fast': 'False', 'log_path': '/Users/jeremy/src/dbt-basic/logs', 'debug': 'True', 'warn_error': 'None', 'use_colors': 'True', 'use_experimental_parser': 'False', 'no_print': 'None', 'quiet': 'False', 'warn_error_options': 'WarnErrorOptions(include=[], exclude=[])', 'invocation_command': 'dbt --debug deps', 'introspect': 'True', 'static_parser': 'True', 'target_path': 'None', 'log_format': 'default', 'send_anonymous_usage_stats': 'True'}
03:30:12 Sending event: {'category': 'dbt', 'action': 'project_id', 'label': '74d35237-88e5-42ed-837e-ea3f6e001a81', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x1078ee3a0>]}
03:30:12 Set downloads directory='/var/folders/ql/36kn0w_d03q56l3zphv3znmm0000gp/T/dbt-downloads-9e7lppxm'
03:30:12 Making package index registry request: GET https://hub.getdbt.com/api/v1/index.json
03:30:13 Response from registry index: GET https://hub.getdbt.com/api/v1/index.json 200
03:30:13 Making package registry request: GET https://hub.getdbt.com/api/v1/dbt-labs/dbt_utils.json
03:30:14 Response from registry: GET https://hub.getdbt.com/api/v1/dbt-labs/dbt_utils.json 200
03:30:14 Installing dbt-labs/dbt_utils
03:30:16 Installed from version 1.1.1
03:30:16 Up to date!
03:30:16 Sending event: {'category': 'dbt', 'action': 'package', 'label': '74d35237-88e5-42ed-837e-ea3f6e001a81', 'property_': 'install', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x1079914c0>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x107991a60>]}
03:30:16 Command `dbt deps` succeeded at 16:30:16.340594 after 3.74 seconds
03:30:16 Sending event: {'category': 'dbt', 'action': 'invocation', 'label': 'end', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x104cbf8e0>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x10793da30>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x107991af0>]}
03:30:16 Flushing usage events
First, we retrieve a JSON file from the package hub (https://hub.getdbt.com/api/v1/dbt-labs/dbt_utils.json). That file contains a direct link to the tarball for each version:
"1.1.1": {
  "id": "dbt-labs/dbt_utils/1.1.1",
  "name": "dbt_utils",
  "version": "1.1.1",
  "published_at": "1970-01-01T00:00:00.000000+00:00",
  "packages": [],
  "require_dbt_version": [">=1.3.0", "<2.0.0"],
  "works_with": [],
  "_source": {
    "type": "github",
    "url": "https://github.com/dbt-labs/dbt-utils/tree/1.1.1/",
    "readme": "https://raw.githubusercontent.com/dbt-labs/dbt-utils/1.1.1/README.md"
  },
  "downloads": {
    "tarball": "https://codeload.github.com/dbt-labs/dbt-utils/tar.gz/1.1.1",
    "format": "tgz",
    "sha1": "73a0f4a598e11d18525603991ffef9c0fa36cb1f"
  }
}
We then download that tarball and extract its contents into the dbt_packages/ folder.
All hub packages are hosted on github.com.
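To make the two steps above concrete, here is a minimal Python sketch: look up the tarball link for a version, then unpack a downloaded tarball into a folder. This is only an illustration, not dbt's own code. The function names tarball_url and extract_package are hypothetical, the versions argument is assumed to be a mapping shaped like the excerpt above (where that mapping sits inside the full hub JSON is not shown here), and details such as how dbt names the final folder are not reproduced.

```python
import io
import tarfile
from pathlib import Path


def tarball_url(versions: dict, version: str) -> str:
    """Return the download link recorded for one version of a package.

    `versions` is assumed to map a version string to an entry shaped like
    the hub excerpt above (with a "downloads" -> "tarball" key).
    """
    try:
        return versions[version]["downloads"]["tarball"]
    except KeyError as exc:
        raise LookupError(f"no tarball listed for version {version!r}") from exc


def extract_package(tarball_bytes: bytes, dest: Path) -> list:
    """Unpack a gzipped tarball held in memory into dest; return member names."""
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(fileobj=io.BytesIO(tarball_bytes), mode="r:gz") as archive:
        names = archive.getnames()
        # Use the safer "data" extraction filter where this Python supports it.
        kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        archive.extractall(dest, **kwargs)
    return names
```

Fetching the bytes (for example with urllib.request.urlopen on the URL returned by tarball_url) is left out on purpose: that network call is exactly the part that can fail when the hub or GitHub is unreachable.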
Occasionally, there can be connectivity issues to:
- The package hub itself, so we can't retrieve the json file to figure out the tarball download path.
- GitHub itself, so even if we retrieve the json above, GitHub does not serve us the download.
Either issue means the package cannot be downloaded, which causes the dbt job to fail.
- The dbt-utils package repository is available at https://github.com/dbt-labs/dbt-utils, so we can skip the hub and download it straight from GitHub with a git package entry (just like the docs show):
# packages.yml
packages:
  - git: "https://github.com/dbt-labs/dbt-utils.git"
    revision: 1.1.1
- Let's mirror from that public repository into yet another public repository. Here, I'm using GitLab instead (https://gitlab.com/dbt-packages-mirror/dbt-utils). Now that we have our mirror, we can make use of it like so:
# packages.yml
packages:
  - git: "https://gitlab.com/dbt-packages-mirror/dbt-utils.git"
    revision: 1.1.1
We will not be covering how to mirror here - refer to your own git provider's documentation. Additionally, if your mirror is a private repo, you'll need to add extra bits like the Git Token (https://docs.getdbt.com/docs/build/packages#private-packages) to the URL - this will not be covered here either.
- Let's use an env var to switch between GitHub and GitLab.
# packages.yml
# export DBT_PACKAGE_MIRROR='https://github.com/dbt-labs'
# export DBT_PACKAGE_MIRROR='https://gitlab.com/dbt-packages-mirror'
packages:
  - git: "{{ env_var('DBT_PACKAGE_MIRROR', 'https://github.com/dbt-labs') }}/dbt-utils.git"
    revision: 1.1.1
We're using the two-argument form of env_var, meaning that if we forget to set the env var DBT_PACKAGE_MIRROR, it resolves to the default of 'https://github.com/dbt-labs' instead of just erroring.
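The fallback behaviour described above can be sketched in plain Python. This is not dbt's implementation: resolve_env_var and package_git_url are hypothetical names, and the sketch only mimics the lookup order (environment first, then the default, otherwise an error).

```python
import os


def resolve_env_var(name, default=None):
    """Mimic the lookup order described above: environment first, then default."""
    if name in os.environ:
        return os.environ[name]
    if default is not None:
        return default
    # One-argument form with the variable unset: fail loudly.
    raise KeyError(f"env var {name} is required but not set")


def package_git_url(repo="dbt-utils"):
    """Build the git URL the packages.yml entry above would resolve to."""
    base = resolve_env_var("DBT_PACKAGE_MIRROR", "https://github.com/dbt-labs")
    return f"{base}/{repo}.git"
```

With DBT_PACKAGE_MIRROR unset, package_git_url() points at GitHub. With it exported as 'https://gitlab.com/dbt-packages-mirror', the same call points at the GitLab mirror, with no change to packages.yml.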
Now, what we can do is to set the env var accordingly and then switch them whenever we need to (the following shows dbt Cloud UI - but you can do this on core/cli by exporting the env var):
By having the ability to quickly swap (by simply changing an env var) - we can change from one host to another when things are on 🔥.
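If you run dbt Core from a script, the swap can also be automated. The following is a rough sketch only, assuming a wrapper script is acceptable in your setup: MIRRORS, is_reachable, pick_mirror and run_deps are all hypothetical names, and a plain HTTP GET against the mirror's base URL is just one possible health check, not something dbt does for you.

```python
import os
import subprocess
import urllib.request

# Candidate hosts in order of preference (the two used in this writeup).
MIRRORS = [
    "https://github.com/dbt-labs",
    "https://gitlab.com/dbt-packages-mirror",
]


def is_reachable(url, timeout=5.0):
    """Return True if an HTTP GET to url gets a non-error response in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except OSError:  # covers URLError, HTTPError and socket timeouts
        return False


def pick_mirror(candidates, probe=is_reachable):
    """Return the first candidate the probe reports as reachable."""
    for base in candidates:
        if probe(base):
            return base
    raise RuntimeError("no package mirror is reachable")


def run_deps(candidates=MIRRORS, probe=is_reachable):
    """Run `dbt deps` with DBT_PACKAGE_MIRROR set to a reachable host."""
    env = dict(os.environ, DBT_PACKAGE_MIRROR=pick_mirror(candidates, probe))
    return subprocess.run(["dbt", "deps"], env=env, check=True)
```

The probe is passed in as a parameter so the selection logic can be exercised without touching the network, for example with pick_mirror(MIRRORS, probe=lambda url: "gitlab" in url).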
Now if GitHub is down AND GitLab is down too... [insert paywall content].