@jeremyyeo
Last active November 8, 2023 04:08
Using alternative hosts for dbt hub packages #dbt

Full documentation for dbt packages is available here: https://docs.getdbt.com/docs/build/packages. This writeup just reuses what's already there.

The most common pattern of using dbt packages is to use one from the dbt Package hub (https://hub.getdbt.com/). For example:

# packages.yml
packages:
  - package: dbt-labs/dbt_utils
    version: 1.1.1

What happens when we install dbt packages (i.e. run dbt deps):

$ dbt --debug deps
03:30:12  Sending event: {'category': 'dbt', 'action': 'invocation', 'label': 'start', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x104cbf8e0>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x1078eed00>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x107927310>]}
03:30:12  Running with dbt=1.6.7
03:30:12  running dbt with arguments {'printer_width': '80', 'indirect_selection': 'eager', 'log_cache_events': 'False', 'write_json': 'True', 'partial_parse': 'False', 'cache_selected_only': 'False', 'profiles_dir': '/Users/jeremy/.dbt', 'version_check': 'True', 'fail_fast': 'False', 'log_path': '/Users/jeremy/src/dbt-basic/logs', 'debug': 'True', 'warn_error': 'None', 'use_colors': 'True', 'use_experimental_parser': 'False', 'no_print': 'None', 'quiet': 'False', 'warn_error_options': 'WarnErrorOptions(include=[], exclude=[])', 'invocation_command': 'dbt --debug deps', 'introspect': 'True', 'static_parser': 'True', 'target_path': 'None', 'log_format': 'default', 'send_anonymous_usage_stats': 'True'}
03:30:12  Sending event: {'category': 'dbt', 'action': 'project_id', 'label': '74d35237-88e5-42ed-837e-ea3f6e001a81', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x1078ee3a0>]}
03:30:12  Set downloads directory='/var/folders/ql/36kn0w_d03q56l3zphv3znmm0000gp/T/dbt-downloads-9e7lppxm'
03:30:12  Making package index registry request: GET https://hub.getdbt.com/api/v1/index.json
03:30:13  Response from registry index: GET https://hub.getdbt.com/api/v1/index.json 200
03:30:13  Making package registry request: GET https://hub.getdbt.com/api/v1/dbt-labs/dbt_utils.json
03:30:14  Response from registry: GET https://hub.getdbt.com/api/v1/dbt-labs/dbt_utils.json 200
03:30:14  Installing dbt-labs/dbt_utils
03:30:16  Installed from version 1.1.1
03:30:16  Up to date!
03:30:16  Sending event: {'category': 'dbt', 'action': 'package', 'label': '74d35237-88e5-42ed-837e-ea3f6e001a81', 'property_': 'install', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x1079914c0>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x107991a60>]}
03:30:16  Command `dbt deps` succeeded at 16:30:16.340594 after 3.74 seconds
03:30:16  Sending event: {'category': 'dbt', 'action': 'invocation', 'label': 'end', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x104cbf8e0>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x10793da30>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x107991af0>]}
03:30:16  Flushing usage events

First, dbt retrieves a JSON file for the package from the package hub (https://hub.getdbt.com/api/v1/dbt-labs/dbt_utils.json). That file contains a direct link to the tarball for each version:

    "1.1.1": {
      "id": "dbt-labs/dbt_utils/1.1.1",
      "name": "dbt_utils",
      "version": "1.1.1",
      "published_at": "1970-01-01T00:00:00.000000+00:00",
      "packages": [],
      "require_dbt_version": [">=1.3.0", "<2.0.0"],
      "works_with": [],
      "_source": {
        "type": "github",
        "url": "https://github.com/dbt-labs/dbt-utils/tree/1.1.1/",
        "readme": "https://raw.githubusercontent.com/dbt-labs/dbt-utils/1.1.1/README.md"
      },
      "downloads": {
        "tarball": "https://codeload.github.com/dbt-labs/dbt-utils/tar.gz/1.1.1",
        "format": "tgz",
        "sha1": "73a0f4a598e11d18525603991ffef9c0fa36cb1f"
      }
    }

We then download that tarball and extract its contents into the dbt_packages/ folder.
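To make the lookup concrete, here is a minimal Python sketch of pulling the tarball URL for one version out of hub metadata. It is an illustration, not dbt's own implementation: the helper name `tarball_url` is hypothetical, and the top-level `versions` key is an assumption, since the excerpt above only shows a single version entry.

```python
import json

# Trimmed sample shaped like the hub response excerpt above.
# Assumption: version entries sit under a top-level "versions" key.
SAMPLE = json.loads("""
{
  "versions": {
    "1.1.1": {
      "name": "dbt_utils",
      "version": "1.1.1",
      "downloads": {
        "tarball": "https://codeload.github.com/dbt-labs/dbt-utils/tar.gz/1.1.1",
        "format": "tgz"
      }
    }
  }
}
""")


def tarball_url(metadata: dict, version: str) -> str:
    """Return the tarball download URL recorded for `version` (hypothetical helper)."""
    try:
        return metadata["versions"][version]["downloads"]["tarball"]
    except KeyError as exc:
        # Any missing level means the hub metadata has no download for this version.
        raise LookupError(f"no tarball recorded for version {version!r}") from exc
```

Note that the tarball URL points at codeload.github.com, which is why a GitHub outage breaks the install even when the hub itself is reachable.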

All hub packages are hosted on github.com

Occasionally, there can be connectivity issues with:

  • The package hub itself, in which case we can't retrieve the JSON file to figure out the tarball download path.
  • GitHub itself, in which case GitHub does not serve us the download even if we did retrieve the JSON file above.

Either way, the packages cannot be downloaded and the dbt job fails.


Downloading packages straight from GitHub (with a mirror backup)

  1. The dbt-utils package repository is available at https://github.com/dbt-labs/dbt-utils, so we can download it straight from GitHub (just like the docs show):
# packages.yml
packages:
  - git: "https://github.com/dbt-labs/dbt-utils.git"
    revision: 1.1.1
  2. Let's mirror that public repository into another public repository. Here, I'm using GitLab (https://gitlab.com/dbt-packages-mirror/dbt-utils). Now that we have our mirror, we can use it like so:
# packages.yml
packages:
  - git: "https://gitlab.com/dbt-packages-mirror/dbt-utils.git"
    revision: 1.1.1

We will not cover how to mirror here; refer to your own git provider's documentation. Additionally, if your mirror is a private repo, you'll need to add things like a git token to the URL (https://docs.getdbt.com/docs/build/packages#private-packages), which is also not covered here.

  3. Let's use an env var to switch between GitHub and GitLab.
# packages.yml

# export DBT_PACKAGE_MIRROR='https://github.com/dbt-labs'
# export DBT_PACKAGE_MIRROR='https://gitlab.com/dbt-packages-mirror'

packages:
  - git: "{{ env_var('DBT_PACKAGE_MIRROR', 'https://github.com/dbt-labs')}}/dbt-utils.git"
    revision: 1.1.1

We're using the two-argument form of env_var, meaning that if we forget to set the env var DBT_PACKAGE_MIRROR, it resolves to the default of 'https://github.com/dbt-labs' instead of erroring.
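The fallback behaviour can be mimicked outside of dbt with a few lines of Python. This is only a sketch of the resolution logic, not dbt's code: `package_git_url` and `DEFAULT_HOST` are hypothetical names, and the optional `environ` argument exists purely so the function can be exercised without touching the real environment.

```python
import os

# Same default as the second argument to env_var in packages.yml.
DEFAULT_HOST = "https://github.com/dbt-labs"


def package_git_url(repo: str, environ=None) -> str:
    """Build the git URL the way the two-argument env_var call resolves it."""
    env = os.environ if environ is None else environ
    # Fall back to the default host when DBT_PACKAGE_MIRROR is not set.
    host = env.get("DBT_PACKAGE_MIRROR", DEFAULT_HOST)
    return f"{host}/{repo}.git"
```

With an empty environment, `package_git_url("dbt-utils", {})` resolves to the GitHub URL; passing `{"DBT_PACKAGE_MIRROR": "https://gitlab.com/dbt-packages-mirror"}` resolves to the GitLab mirror instead.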

Now we can set the env var accordingly and switch it whenever we need to (the following shows the dbt Cloud UI, but you can do the same with dbt Core on the CLI by exporting the env var):

[Two screenshots: setting the env var in the dbt Cloud UI]

By having the ability to quickly swap (by simply changing an env var) - we can change from one host to another when things are on 🔥.
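If you run dbt Core from a wrapper script, the swap could in principle be automated. The sketch below is one possible approach under stated assumptions, not something dbt provides: `pick_mirror` and `http_probe` are hypothetical helpers, and a plain HTTP GET against the host URL is only a rough proxy for "git clone will work".

```python
import urllib.request
from typing import Callable, Iterable, Optional

# Hosts from this writeup, in order of preference.
MIRRORS = [
    "https://github.com/dbt-labs",
    "https://gitlab.com/dbt-packages-mirror",
]


def http_probe(host: str, timeout: float = 5.0) -> bool:
    """Rough reachability check: True if an HTTP GET to `host` succeeds."""
    try:
        with urllib.request.urlopen(host, timeout=timeout):
            return True
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False


def pick_mirror(candidates: Iterable[str], is_up: Callable[[str], bool]) -> Optional[str]:
    """Return the first candidate host that the probe reports as reachable."""
    for host in candidates:
        if is_up(host):
            return host
    return None
```

A wrapper could call `pick_mirror(MIRRORS, http_probe)`, export the result as DBT_PACKAGE_MIRROR, and then invoke dbt deps. Keeping the probe as a parameter makes the selection logic easy to test without network access.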

Now if GitHub is down AND GitLab is down too... [insert paywall content].
