Rez - Package Payload Repository

REP-005: Remote package repositories

Things to cover:

  • This lays the groundwork and doesn't intend to cover everything possible. We expect that future REPs will extend this REP.
  • Packing/unpacking (zip, tar, etc) with plugin type
  • Package definition database
  • Package payload storage
  • What about the memcached cache?
  • TODO: default_cachable_per_repository and default_relocatable_per_repository. How do they play with remote repos?

Abstract

This REP describes the concept of remote package repositories. As part of this REP, a new plugin type is introduced to split the responsibility of storing package definitions and package payloads. It also introduces the ability to compress package payloads at release time and to automatically uncompress them at rez-env time. Additionally, some pre-existing concepts will be clarified.

It tries to stay at a relatively high level to avoid dictating what the implementation should look like.

Rationale

Since its inception, Rez has relied on files being available on an accessible filesystem (shared or local) as the default mechanism to store and access both package definitions and package payloads. This is implemented using a plugin system where two plugins are currently available: filesystem (the default) and memory.

The memory plugin is used for development purposes only.

The filesystem plugin made sense in a world where there were no cloud or container workloads and where artists worked in the office, connected directly to a studio's internal network and shared filesystems, and it still makes sense today for a lot of cases. It's also beginner friendly because it doesn't require any infrastructure to evaluate and bootstrap rez in a studio. For a lot of studios, this will be enough and will work perfectly for them.

But there are studios for which filesystem repositories don't work well or are simply not enough. Some studios support artists working remotely or use cloud-backed workstations. Some studios also use cloud-hosted machines in their render farm. Others have a multitude of micro-services that use containers. When it comes to these use cases, accessing a common shared filesystem can be difficult, unwanted or even impossible due to multiple factors like latency, security, costs, complexity, etc.

Some studios would prefer to let their filesystems be used for actual production data and load, and not have rez consume space and resources on their precious filesystems.

Package payload caching was introduced to help fill some gaps with the filesystem storage. It copies package payloads locally to avoid running the code from a network filesystem. One excellent use case for this is DCCs, which are renowned for being unstable and slow when run from a network filesystem. It's a great solution in itself and is already used by many of our users. But it still requires access to a shared filesystem.

When it comes to rendering in the cloud, cloud workstations, or even satellite studios, studios have to come up with custom workflows to sync packages from their main studio to other locations. This can sometimes be complex and could be simplified if packages and their payloads were stored somewhere other than a filesystem.

The solution to these problems is to extend Rez so that it can support new ways of storing package definitions and payloads.

Terminology

  • Package Repository: A repository of package definitions. This can be a simple filesystem, a database, etc.
  • Artifacts Store (new): A storage for package artifacts. This can be as simple as a shared filesystem, or more complex like a cloud-backed object storage solution (AWS S3, Azure Blob, GCP GCS, MinIO, etc), artifacts repositories (Nexus, Artifactory, etc), etc.
  • Bundler (new): A plugin that takes care of preparing a payload to be stored in an artifact store and also takes care of unbundling it. This could be zip/unzip, tar/untar, etc.
  • Credentials Provider (new): A new plugin type that will allow the different plugins to load credentials based on their needs.
  • Artifacts Cache (rename): This is what we currently call "package caching". I'm proposing to use a new name to help differentiate everything clearly.

Considerations

  • Speed of downloading/uploading, packing/unpacking.
  • Interactivity and usability.

Requirements

  • Must be extensible by using plugins.
  • Plugins should live outside of Rez's source code to encourage faster iterations and discourage tight coupling to Rez's internals.
  • Packages will need to be relocatable. This should be clearly stated in the documentation.
  • rez-cp should be adapted or a new tool should be created to easily migrate from the filesystem-based repository to a remote repository.
  • Existing documentation must be updated.
  • Complete documentation covering use cases, tutorials, etc.
  • rez-build should continue to use a local filesystem repository by default.

MVP

  • At least one implementation of a remote package repository and an artifact store, but two implementations would be desirable to confirm that the design works for more than one implementation.

Proposed design

We propose a new scheme where package definitions and payloads are completely independent of each other. They can be stored in different places. Some examples that will become possible:

  • Store package definitions in a DB-backed package repository and store package payloads in S3. The DB could be PostgreSQL, MongoDB, etc.
  • Store both package definitions and payloads on a filesystem, but store the payloads as zip files.

Configuration

Rez currently supports setting package repositories by using the packages_path setting. By default, each path is considered a "filesystem" type path. It is possible to specify the type of repository by using the <type>@<location> syntax where type is the plugin type and location is whatever the plugin accepts.
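
For reference, a rough illustration of the current syntax is shown below. The exact paths and the memory location string are illustrative only:

# rezconfig.py: current way of pointing rez at repositories.
packages_path = [
    '/studio/path/to/repo',                   # no prefix: treated as a "filesystem" type path
    'filesystem@/studio/path/to/other_repo',  # explicit <type>@<location>
    'memory@any',                             # the in-memory development plugin (location string is illustrative)
]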

We could continue to use it, but the fact that it contains "path" in its name would be confusing. Also, fitting a lot of information into a single string can quickly become complex.

A new setting called package_repositories is introduced that accepts a list of dictionaries. Each dictionary corresponds to a package repository. It can also specify an artifact store to use. The setting name is clear and concise.

package_repositories = [
    # Example 1: Both definitions and payloads are stored on a filesystem at the same location.
    {
        'url': 'filesystem:///studio/path/to/repo',
    },
    # Example 2: Both definitions and payloads are stored on the filesystem but at different locations.
    {
        'url': 'filesystem:///studio/path/to/repo',
        'store': {
            'url': 'filesystem:///studio/path/to/payloads'
        }
    },
    # Example 3: Definitions stored in MongoDB and payloads in S3.
    {
        'url': 'mongodb://localhost:1234/definitions',
        'store': {
            'url': 's3://prefix/repo',
            'credentials_provider': {
                'type': 'hashicorp_vault'
            }
        },
        'credentials_provider': {
            'type': 'env_vars'
        },
    }
]

Question: How will we support rez-{cp,depends,env,plugins,search} --paths, rez-{build,mv} --path, rez-pip --prefix, rez-{pkg-ignore,rm} PATH, rez-bind --install-path, etc? We should catalog all flags and the different ways that repositories are set right now.

Package repository

This stores the package definitions.

Responsibilities:

  • Store all information about packages (name, version, variants, requirements, etc). A rough sketch of a possible interface is shown after this list.
  • It does not need to know where a package payload is stored. The idea is that the artifact store can be given a package definition or variant hash, which the store can use to organize/store artifacts.
  • Details are a little bit blurry to me and need more work.
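
To make the split concrete, here is a very rough sketch of what such a plugin could expose. None of these class or method names exist in rez today; they are purely illustrative:

# Hypothetical package repository plugin interface (illustrative only, not an existing rez API).
from abc import ABC, abstractmethod

class PackageRepositoryPlugin(ABC):
    @abstractmethod
    def iter_package_families(self):
        """Yield the names of all package families stored in the repository."""

    @abstractmethod
    def iter_packages(self, name):
        """Yield every package definition (all versions) for a given family."""

    @abstractmethod
    def publish(self, package_definition):
        """Store a new package definition (name, version, variants, requires, ...)."""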

Artifact store

The artifact store plugin takes package definitions from a package repository to determine the location in the store. It will then do whatever is needed to make the payload accessible to a rez environment.

For example, an artifact store might use AWS S3, in which case the artifact plugin could download and upload the artifacts using the AWS CLI or the AWS SDK.

  • Given a package definition, the store plugin can determine the location of the artifact. This is left vague because it might be storage dependent. For example, one could use variant hashes, or use the good old variant filesystem structure.
  • Has the responsibility of copying payloads to/from the store.
  • Can implement its own internal cache to avoid useless re-downloads, similar to pip, rpm, etc. That internal cache would be different from the artifacts cache. If possible, that cache should be shared between users. This is essential in a render farm where local disk space might be limited. The location of that cache should also be configurable. It is unknown right now whether part of it should be provided by rez to make it consistent across plugins or not. A rough sketch of such a store plugin is shown after this list.
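
As an illustration only, an S3-backed store could look something like the sketch below. The class name, method names, and the variant-hash layout are assumptions, not an existing rez API:

# Hypothetical S3 artifact store (illustrative sketch, not an existing rez plugin).
import os
import boto3

class S3ArtifactStore:
    def __init__(self, bucket, prefix='', cache_dir='~/.cache/rez/artifacts'):
        self._s3 = boto3.client('s3')
        self._bucket = bucket
        self._prefix = prefix
        self._cache_dir = os.path.expanduser(cache_dir)

    def _key(self, variant_hash):
        # One possible layout: artifacts addressed by variant hash.
        return f'{self._prefix}{variant_hash}.zip'

    def upload(self, variant_hash, bundle_path):
        # Copy a bundled payload into the store at release time.
        self._s3.upload_file(bundle_path, self._bucket, self._key(variant_hash))

    def download(self, variant_hash):
        # Fetch a bundled payload, reusing the plugin's internal cache when possible.
        local_path = os.path.join(self._cache_dir, f'{variant_hash}.zip')
        if not os.path.exists(local_path):
            os.makedirs(self._cache_dir, exist_ok=True)
            self._s3.download_file(self._bucket, self._key(variant_hash), local_path)
        return local_path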

Bundler

The responsibility of a bundler is to take a folder containing the content of a package's payload and bundle it. An implementation could zip all the files in the directory for example. It also takes care of extracting the bundle.

Using plugins will allow for reusability and will also allow for flexibility. For example, someone might want to optimize for size while others might want to optimize for speed. Also, a studio might prefer zip files over tarballs because zip files are a little bit more portable across platforms.

Some artifact stores will mandate a specific bundle type, but others won't. One example is S3, where the user will be free to choose whichever bundle format works for them.

We need to consider that a bundler can also be a no-op. We don't know why someone would prefer to upload individual files one by one to cloud object storage, for example, but we should design for it as long as it doesn't make the implementation more complex.
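
As a rough sketch only (the bundler plugin interface does not exist yet; the class and method names below are illustrative), a zip bundler could be as simple as:

# Hypothetical zip bundler (illustrative sketch).
import shutil

class ZipBundler:
    extension = '.zip'

    def bundle(self, payload_dir, bundle_path):
        # Pack the package payload directory into a single zip archive.
        # shutil.make_archive appends the '.zip' extension itself.
        return shutil.make_archive(bundle_path, 'zip', root_dir=payload_dir)

    def unbundle(self, bundle_path, dest_dir):
        # Extract a previously created bundle into dest_dir.
        shutil.unpack_archive(bundle_path, dest_dir, format='zip')
        return dest_dir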

Question: Where do we plug this? Is it the artifact store that calls the right bundler plugin?

Credentials provider

A credential provider is a plugin type that will initially be used by both package repository and artifact store plugins. The job of a provider is to provide the right set of credentials when needed. The key point is to allow users to decide how they want to fetch and provide credentials to plugins and rez. This will allow flexible workflows that won't require forking rez.

We could provide a built-in generic environment variable based plugin that would work for a lot of use cases.
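
For illustration, such a provider could be as simple as the sketch below. The class name, the mapping format, and get_credentials are assumptions; the consuming plugin would define which keys it needs:

# Hypothetical environment-variable-based credentials provider (illustrative sketch).
import os

class EnvVarsCredentialsProvider:
    def __init__(self, mapping):
        # mapping: credential key -> environment variable name, e.g.
        # {'access_key': 'REZ_S3_ACCESS_KEY', 'secret_key': 'REZ_S3_SECRET_KEY'}
        self._mapping = mapping

    def get_credentials(self):
        # Return the credentials in the format requested by the consuming plugin.
        # A real implementation would handle missing variables gracefully.
        return {key: os.environ[var] for key, var in self._mapping.items()}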

Using a plugin system will help with securing other areas of rez, like the AMQP context tracking events for example. Right now it relies on credentials in config files. With a plugin system, it would be possible to provide credentials using a multitude of methods.

A plugin that requires credentials will provide information on the format of credentials it accepts. More specific providers could be implemented. For example, an AWS provider could be implemented, which plugins could use. But it's important to note that it's the plugin requiring credentials that defines how the credentials should look. TODO: This requires some more thought.

Open questions

  • What's the relationship with package payload caching? Can both live together? If so, how?
  • From Thorsten:
    • One of the beauties of the fully self-contained packages-in-a-folder approach is the simplicity of building, distributing and deploying. With a DB we need to a) make sure things end up in the DB, b) make sure the DB and filesystem stay in sync, and c) clearly define how multiple repositories work. E.g. we may want multiple DBs for latency and also multiple payload repos. What happens if DBs contradict? What happens if a payload is in multiple places? And what if not? I think that not everything that currently "just works incidentally" will translate easily into a DB vs. payload store setup.
    • I also think that it might be beneficial to make both the file transfer and packing parts pluggable. But that's really just a thought right now, and it might be moving towards over-engineering. But there are fairly big differences in speed even for local copies. VFX here has a system to pull zip payloads too. To my knowledge they did not fork rez though, they built it into packages (basically a pre_commands check that the payload is where it should be, pulling the zip if not). I think they also did benchmarking and it turned out that directly unzipping from the fileshare was way faster than copy -> extract.
  • How can we make sure that the downloaded artifacts match what's expected? For example, say we have one DB and multiple stores, and a variant artifact is present in all stores. Should we store checksums, and if so, where? Should both the package repo and the artifact store contain the checksums?
  • Should we design with signing in mind? For example should we make it possible to cryptographically sign artifacts? We might want to look at https://github.com/ossf/wg-securing-software-repos.
  • How does this potentially play with future rez recipes or community repositories?

Prior art

Known use cases

  • This could be used as a package sync system between studios, or at least part of one. For example, a studio could use S3 to store and access their packages. S3 is accessible from everywhere. On the first resolve, a package could be installed into a shared filesystem. S3 transfers between regions are not free, but they could still be cheaper than a full-grown file transfer solution. Package payloads would be on the "edge", maybe fronted by a CDN (CloudFront in the case of AWS), etc.

Future additions

  • Composable repositories with fallbacks: https://academysoftwarefdn.slack.com/archives/C0321B828FM/p1677280013597799?thread_ts=1677245578.118999&cid=C0321B828FM. For example you may have an artifact repo that downloads package payloads from S3, and copies them into shared posix storage, where they're consumed by multiple users at your studio. A different artifact repo might unzip files from shared storage, into directories directly on users' local disk. It could even make sense to chain artifact repos together - perhaps your S3 repo downloads zipped artifacts to shared storage, then a "localization" repo unzips them onto the local disk on demand.
@ttrently

Thanks for this great writeup @JeanChristopheMorinPerso ! It's exciting to see this concept take a much more concrete shape.

On the configuration, I like this new proposed setup for storing that information as it keeps it extremely readable and easy to extend. You mention the question of how this fits into the existing rez commands, and it's one we ran into as well, since the amount of information that needs to be provided is greater and more complex. I think a case could be made for simply naming / aliasing these package_repositories for easy shorthand access, not only internally within the plugins but also from a user point of view.

Ex. Adding a name or alias entry to a package repository definition.

package_repositories = [
    # Example 1: Both definitions and payloads are stored on a filesystem at the same location.
    {
        'name': 'studioLocal',
        'url': 'filesystem:///studio/path/to/repo',
    }
]

Then in a rez-build command.

rez-build --repository studioLocal

This may also allow for or address a concern that Thorsten raised about having definitions in multiple DB locations that point to different payloads.

# Example 1: Builds the package and stores information in all databases associated with the repository
rez-build --repository studioLocal

# Example 2: Builds the package and stores information in a specific database.
rez-build --repository studioLocal/<store>

Typing the above also makes me realize that maybe we need to support store as a list so as to allow multiple database associations?

@JeanChristopheMorinPerso

Yeah, aliases would be nice indeed. store is the place where package payloads are stored. I'm not entirely sure why we would need a list of stores. @ttrently Do you have a use case in mind? Or maybe you were thinking about repositories?

@ttrently commented Mar 6, 2023

Possibly both? This was something we experimented with before realizing it was introducing too much complexity and shutting it down.

The one thing we did keep though was letting a singular repository access multiple stores. For example - we wanted to manage one MongoDB instance but limit access for teams to different S3 stores. So entries would look something like:

foo-package | ... | ... | s3://address-1
foo-package | ... | ... | s3://address-2

Ultimately it made some management easier, but we had to account for this in build commands, lookups, etc.

I'd be interested to know if this rings true for any others interested in this behavior. Otherwise it may be best to do a one-to-one association between repositories and stores vs. one-to-many for ease of entry.

@JeanChristopheMorinPerso commented Mar 6, 2023

@ttrently You could achieve what you want by defining multiple repositories like this:

[
    {
        'url': 'mongodb://localhost:1234/definitions',
        'store': {
            'url': 's3://prefix/repo1',
            'credentials_provider': {
                'type': 'hashicorp_vault'
            }
        },
        'credentials_provider': {
            'type': 'env_vars'
        },
    },
    {
        'url': 'mongodb://localhost:1234/definitions',
        'store': {
            'url': 's3://prefix/repo2',
            'credentials_provider': {
                'type': 'hashicorp_vault'
            }
        },
        'credentials_provider': {
            'type': 'env_vars'
        },
    }
]

So two repositories that use the same DB but different stores.

(Edited because I mixed up store and repo)

@ttrently commented Mar 6, 2023

Ah! Yeah that is exactly the functionality.

@JeanChristopheMorinPerso

Note to myself, maybe storage instead of store.
