Rez - Package Payload Repository

REP-005: Remote package repositories

Things to cover:

  • This lays the groundwork and doesn't intend to cover everything possible. We expect that future REPs will extend this REP.
  • Packing/unpacking (zip, tar, etc) with plugin type
  • Package definition database
  • Package payload storage
  • What about the memcached cache?
  • TODO: default_cachable_per_repository and default_relocatable_per_repository. How do they play with remote repos?

Abstract

This REP describes the concept of remote package repositories. As part of this REP, a new plugin type is introduced to split the responsibility of storing package definitions and package payloads. It also introduces the ability to compress package payloads at release time and to automatically uncompress them at rez-env time. Additionally, some pre-existing concepts will be clarified.

It tries to stay at a relatively high level to avoid dictating what the implementation should look like.

Rationale

Since its inception, Rez has relied on files being available on an accessible filesystem (shared or local) as the default mechanism to store and access both package definitions and package payloads. This is implemented using a plugin system where two plugins are currently available: filesystem (the default) and memory.

The memory plugin is used for development purposes only.

The filesystem plugin made sense in a world where there were no cloud or container workloads and where artists worked in the office, connected directly to a studio's internal network and shared filesystems, and it still makes sense today for a lot of cases. It's also beginner friendly because it doesn't require any infrastructure to evaluate and bootstrap rez in a studio. For a lot of studios, this will be enough and will work perfectly for them.

But there are studios for which filesystem repositories don't work well or are simply not enough. Some studios support artists working remotely or use cloud-backed workstations. Some studios also use cloud-hosted machines in their render farm. Others have a multitude of micro-services that use containers. When it comes to these use cases, accessing a common shared filesystem can be difficult, unwanted or even impossible due to multiple factors like latency, security, costs, complexity, etc.

Some studios would prefer to let their filesystems be used for actual production data and load, and not have rez consume space and resources on their precious filesystems.

Package payload caching was introduced to help fill some gaps with the filesystem storage. It copies package payloads locally to avoid running the code from a network filesystem. One excellent use case for this is DCCs, which are renowned for being unstable and slow when run from a network filesystem. It's a great solution in itself and is already used by many of our users. But it still requires access to a shared filesystem.

When it comes to rendering in the cloud, cloud workstations, or even satellite studios, studios have to come up with custom workflows to sync packages from their main studio to other locations. This can sometimes be complex and could be simplified if packages and their payloads were stored somewhere other than a filesystem.

The solution to these problems is to extend Rez so that it can support new ways of storing package definitions and payloads.

Terminology

  • Package Repository: A repository of package definitions. This can be a simple filesystem, a database, etc.
  • Artifacts Store (new): A storage for package artifacts. This can be as simple as a shared filesystem, or more complex like a cloud-backed object storage solution (AWS S3, Azure Blob, GCP GCS, MinIO, etc), artifacts repositories (Nexus, Artifactory, etc), etc.
  • Bundler (new): A plugin that takes care of preparing a payload to be stored in an artifact store and also takes care of unbundling it. This could be zip/unzip, tar/untar, etc.
  • Credentials Provider (new): A new plugin type that will allow the different plugins to load credentials based on their needs.
  • Artifacts Cache (rename): This is what we currently call "package caching". I'm proposing to use a new name to help differentiate everything clearly.

Considerations

  • Speed of downloading/uploading, packing/unpacking.
  • Interactivity and usability.

Requirements

  • Must be extensible by using plugins.
  • Plugins should live outside of Rez's source code to encourage faster iterations and discourage tight coupling to Rez's internals.
  • Packages will need to be relocatable. This should be clearly stated in the documentation.
  • rez-cp should be adapted or a new tool should be created to easily migrate from the filesystem-based repository to a remote repository.
  • Existing documentation must be updated.
  • Complete documentation covering use cases, tutorials, etc.
  • rez-build should continue to use a local filesystem repository by default.

MVP

  • At least one implementation of a remote package repository and an artifact store, but two implementations would be desirable to confirm that the design works for more than one implementation.

Proposed design

We propose a new scheme where package definitions and payloads are completely independent of each other. They can be stored in different places. Some examples that will become possible:

  • Store package definitions in a DB-backed package repository and store package payloads in S3. The DB could be PostgreSQL, MongoDB, etc.
  • Store both package definitions and payloads on a filesystem, but store the payloads as zip files.

Configuration

Rez currently supports setting package repositories by using the packages_path setting. By default, each path is considered a "filesystem" type path. It is possible to specify the type of repository by using the <type>@<location> syntax where type is the plugin type and location is whatever the plugin accepts.
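
For reference, a rough illustration of the current syntax is shown below. The exact paths and the memory location string are illustrative only:

# rezconfig.py: current way of pointing rez at repositories.
packages_path = [
    '/studio/path/to/repo',                   # no prefix: treated as a "filesystem" type path
    'filesystem@/studio/path/to/other_repo',  # explicit <type>@<location>
    'memory@any',                             # the in-memory development plugin (location string is illustrative)
]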

We could continue to use it, but the fact that it contains "path" in its name would be confusing. Also, fitting a lot of information into a single string can quickly become complex.

A new setting called package_repositories is introduced that accepts a list of dictionaries. Each dictionary corresponds to a package repository. It can also specify an artifact store to use. The setting name is clear and concise.

package_repositories = [
    # Example 1: Both definitions and payloads are stored on a filesystem at the same location.
    {
        'url': 'filesystem:///studio/path/to/repo',
    },
    # Example 2: Both definitions and payloads are stored on the filesystem but at different locations.
    {
        'url': 'filesystem:///studio/path/to/repo',
        'store': {
            'url': 'filesystem:///studio/path/to/payloads'
        }
    },
    # Example 3: Definitions stored in MongoDB and payloads in S3.
    {
        'url': 'mongodb://localhost:1234/definitions',
        'store': {
            'url': 's3://prefix/repo',
            'credentials_provider': {
                'type': 'hashicorp_vault'
            }
        },
        'credentials_provider': {
            'type': 'env_vars'
        },
    }
]

Question: How will we support rez-{cp,depends,env,plugins,search} --paths, rez-{build,mv} --path, rez-pip --prefix, rez-{pkg-ignore,rm} PATH, rez-bind --install-path, etc? We should catalog all flags and the different ways that repositories are set right now.

Package repository

This stores the package definitions.

Responsibilities:

  • Store all information about packages (name, version, variants, requirements, etc). A rough sketch of a possible interface is shown after this list.
  • It does not need to know where a package payload is stored. The idea is that the artifact store can be given a package definition or variant hash, which the store can use to organize/store artifacts.
  • Details are a little bit blurry to me and need more work.
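
To make the split concrete, here is a very rough sketch of what such a plugin could expose. None of these class or method names exist in rez today; they are purely illustrative:

# Hypothetical package repository plugin interface (illustrative only, not an existing rez API).
from abc import ABC, abstractmethod

class PackageRepositoryPlugin(ABC):
    @abstractmethod
    def iter_package_families(self):
        """Yield the names of all package families stored in the repository."""

    @abstractmethod
    def iter_packages(self, name):
        """Yield every package definition (all versions) for a given family."""

    @abstractmethod
    def publish(self, package_definition):
        """Store a new package definition (name, version, variants, requires, ...)."""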

Artifact store

The artifact store plugin takes package definitions from a package repository to determine the location in the store. It will then do whatever is needed to make the payload accessible to a rez environment.

For example, an artifact store might use AWS S3, in which case the artifact plugin could download and upload the artifacts using the AWS CLI or the AWS SDK.

  • Given a package definition, the store plugin can determine the location of the artifact. This is left vague because it might be storage dependent. For example, one could use variant hashes, or use the good old variant filesystem structure.
  • Has the responsibility of copying payloads to/from the store.
  • Can implement its own internal cache to avoid useless re-downloads, similar to pip, rpm, etc. That internal cache would be different from the artifacts cache. If possible, that cache should be shared between users. This is essential in a render farm where local disk space might be limited. The location of that cache should also be configurable. It is unknown right now whether part of it should be provided by rez to make it consistent across plugins or not. A rough sketch of such a store plugin is shown after this list.
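
As an illustration only, an S3-backed store could look something like the sketch below. The class name, method names, and the variant-hash layout are assumptions, not an existing rez API:

# Hypothetical S3 artifact store (illustrative sketch, not an existing rez plugin).
import os
import boto3

class S3ArtifactStore:
    def __init__(self, bucket, prefix='', cache_dir='~/.cache/rez/artifacts'):
        self._s3 = boto3.client('s3')
        self._bucket = bucket
        self._prefix = prefix
        self._cache_dir = os.path.expanduser(cache_dir)

    def _key(self, variant_hash):
        # One possible layout: artifacts addressed by variant hash.
        return f'{self._prefix}{variant_hash}.zip'

    def upload(self, variant_hash, bundle_path):
        # Copy a bundled payload into the store at release time.
        self._s3.upload_file(bundle_path, self._bucket, self._key(variant_hash))

    def download(self, variant_hash):
        # Fetch a bundled payload, reusing the plugin's internal cache when possible.
        local_path = os.path.join(self._cache_dir, f'{variant_hash}.zip')
        if not os.path.exists(local_path):
            os.makedirs(self._cache_dir, exist_ok=True)
            self._s3.download_file(self._bucket, self._key(variant_hash), local_path)
        return local_path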

Bundler

The responsibility of a bundler is to take a folder containing the content of a package's payload and bundle it. An implementation could zip all the files in the directory for example. It also takes care of extracting the bundle.

Using plugins will allow for reusability and will also allow for flexibility. For example, someone might want to optimize for size while others might want to optimize for speed. Also, a studio might prefer zip files over tarballs because zip files are a little bit more portable across platforms.

Some artifact stores will mandate a specific bundle type, but others won't. One example is S3, where the user will be free to choose whichever bundle format works for them.

We need to consider that a bundler can also be a no-op. We don't know why someone would prefer to upload individual files one by one to cloud object storage, for example, but we should design for it as long as it doesn't make the implementation more complex.
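
As a rough sketch only (the bundler plugin interface does not exist yet; the class and method names below are illustrative), a zip bundler could be as simple as:

# Hypothetical zip bundler (illustrative sketch).
import shutil

class ZipBundler:
    extension = '.zip'

    def bundle(self, payload_dir, bundle_path):
        # Pack the package payload directory into a single zip archive.
        # shutil.make_archive appends the '.zip' extension itself.
        return shutil.make_archive(bundle_path, 'zip', root_dir=payload_dir)

    def unbundle(self, bundle_path, dest_dir):
        # Extract a previously created bundle into dest_dir.
        shutil.unpack_archive(bundle_path, dest_dir, format='zip')
        return dest_dir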

Question: Where do we plug this? Is it the artifact store that calls the right bundler plugin?

Credentials provider

A credential provider is a plugin type that will initially be used by both package repository and artifact store plugins. The job of a provider is to provide the right set of credentials when needed. The key point is to allow users to decide how they want to fetch and provide credentials to plugins and rez. This will allow flexible workflows that won't require forking rez.

We could provide a built-in generic environment variable based plugin that would work for a lot of use cases.
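
For illustration, such a provider could be as simple as the sketch below. The class name, the mapping format, and get_credentials are assumptions; the consuming plugin would define which keys it needs:

# Hypothetical environment-variable-based credentials provider (illustrative sketch).
import os

class EnvVarsCredentialsProvider:
    def __init__(self, mapping):
        # mapping: credential key -> environment variable name, e.g.
        # {'access_key': 'REZ_S3_ACCESS_KEY', 'secret_key': 'REZ_S3_SECRET_KEY'}
        self._mapping = mapping

    def get_credentials(self):
        # Return the credentials in the format requested by the consuming plugin.
        # A real implementation would handle missing variables gracefully.
        return {key: os.environ[var] for key, var in self._mapping.items()}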

Using a plugin system will help with securing other areas of rez, like the AMQP context tracking events for example. Right now it relies on credentials in config files. With a plugin system, it would be possible to provide credentials using a multitude of methods.

A plugin that requires credentials will provide information on the format of credentials it accepts. More specific providers could be implemented. For example, an AWS provider could be implemented, which plugins could use. But it's important to note that it's the plugin requiring credentials that defines how the credentials should look. TODO: This requires some more thought.

Open questions

  • What's the relationship with package payload caching? Can both live together? If so, how?
  • From Thorsten:
    • One of the beauties of the fully self-contained packages-in-a-folder approach is the simplicity of building, distributing and deploying. With a DB we need to a) make sure things end up in the DB, b) make sure the DB and filesystem stay in sync, and c) clearly define how multiple repositories work. E.g. we may want multiple DBs for latency and also multiple payload repos. What happens if DBs contradict? What happens if a payload is in multiple places? And what if not? I think that not everything that currently "just works incidentally" will translate easily into a DB vs. payload store setup.
    • I also think that it might be beneficial to make both the file transfer and packing parts pluggable. But that's really just a thought right now, and it might be moving towards over-engineering. But there are fairly big differences in speed even for local copies. VFX here has a system to pull zip payloads too. To my knowledge they did not fork rez though, they built it into packages (basically a pre_commands check that the payload is where it should be, pulling the zip if not). I think they also did benchmarking and it turned out that directly unzipping from the fileshare was way faster than copy -> extract.
  • How can we make sure that the downloaded artifacts match what's expected? For example, say we have one DB and multiple stores, and a variant artifact is present in all stores. Should we store checksums, and if so, where? Should both the package repo and the artifact store contain the checksums?
  • Should we design with signing in mind? For example should we make it possible to cryptographically sign artifacts? We might want to look at https://github.com/ossf/wg-securing-software-repos.
  • How does this potentially play with future rez recipes or community repositories?

Prior art

Known use cases

  • This could be used as a package sync system between studios, or at least part of one. For example, a studio could use S3 to store and access their packages. S3 is accessible from everywhere. On the first resolve, a package could be installed into a shared filesystem. S3 transfers between regions are not free, but they could still be cheaper than a full-grown file transfer solution. Package payloads would be on the "edge", maybe fronted by a CDN (CloudFront in the case of AWS), etc.

Future additions

  • Composable repositories with fallbacks: https://academysoftwarefdn.slack.com/archives/C0321B828FM/p1677280013597799?thread_ts=1677245578.118999&cid=C0321B828FM. For example you may have an artifact repo that downloads package payloads from S3, and copies them into shared posix storage, where they're consumed by multiple users at your studio. A different artifact repo might unzip files from shared storage, into directories directly on users' local disk. It could even make sense to chain artifact repos together - perhaps your S3 repo downloads zipped artifacts to shared storage, then a "localization" repo unzips them onto the local disk on demand.
@ttrently

Thanks for this great writeup @JeanChristopheMorinPerso ! It's exciting to see this concept take a much more concrete shape.

On the configuration, I like this new proposed setup for storing that information as it keeps it extremely readable and easy to extend. You mention the question of how this fits into the existing rez commands, and it's one we ran into as well, since the amount of information that needs to be provided is greater and more complex. I think a case could be made for simply naming / aliasing these package_repositories for easy shorthand access, not only internally within the plugins but also from a user point of view.

Ex. Adding a name or alias entry to a package repository definition.

package_repositories = [
    # Example 1: Both definitions and payloads are stored on a filesystem at the same location.
    {
        'name': 'studioLocal',
        'url': 'filesystem:///studio/path/to/repo',
    }
]

Then in a rez-build command.

rez-build --repository studioLocal

This may also allow for or address a concern that Thorsten raised about having definitions in multiple DB locations that point to different payloads.

# Example 1: Builds the package and stores information in all databases associated with the repository
rez-build --repository studioLocal

# Example 2: Builds the package and stores information in a specific database.
rez-build --repository studioLocal/<store>

Typing the above also makes me realize that maybe we need to support store as a list so as to allow multiple database associations?

@JeanChristopheMorinPerso

Yeah, aliases would be nice indeed. store is the place where package payloads are stored. I'm not entirely sure why we would need a list of stores. @ttrently Do you have a use case in mind? Or maybe you were thinking about repositories?

@ttrently commented Mar 6, 2023

Possibly both? This was something we experimented with before realizing it was introducing too much complexity and shutting it down.

The one thing we did keep though was letting a singular repository access multiple stores. For example - we wanted to manage one MongoDB instance but limit access for teams to different S3 stores. So entries would look something like:

foo-package | ... | ... | s3://address-1
foo-package | ... | ... | s3://address-2

Ultimately it made some management easier, but we had to account for this in build commands, lookups, etc.

I'd be interested to know if this rings true for any others interested in this behavior. Otherwise it may be best to do a one-to-one association between repositories and stores vs. one-to-many for ease of entry.

@JeanChristopheMorinPerso commented Mar 6, 2023

@ttrently You could achieve what you want by defining multiple repositories like this:

[
    {
        'url': 'mongodb://localhost:1234/definitions',
        'store': {
            'url': 's3://prefix/repo1',
            'credentials_provider': {
                'type': 'hashicorp_vault'
            }
        },
        'credentials_provider': {
            'type': 'env_vars'
        },
    },
    {
        'url': 'mongodb://localhost:1234/definitions',
        'store': {
            'url': 's3://prefix/repo2',
            'credentials_provider': {
                'type': 'hashicorp_vault'
            }
        },
        'credentials_provider': {
            'type': 'env_vars'
        },
    }
]

So two repositories that use the same DB but different stores.

(Edited because I mixed up store and repo)

@ttrently commented Mar 6, 2023

Ah! Yeah that is exactly the functionality.

@JeanChristopheMorinPerso

Note to myself, maybe storage instead of store.
