anthony-c-martin/spec.md

Stacks + Bicep Extensibility

Problem Statement

Deployment Stacks requires the ability to manage the full lifecycle (including deletion) of all resources defined inside a Stack. Currently this is only possible for Azure resources. It is not possible for extensible resources, because they lack an id field and there is no authentication mechanism for an extensible control plane.
Goals


Provide an id equivalent which Stacks can use to uniquely identify an extensible resource for change tracking.
Provide a mechanism to allow Stacks to submit Delete & Get operations against extensible resources.

This spec is broken down into three parts: Part 1 covers authentication, Part 2 covers IDs, and Part 3 covers resource deletion.
Part 1: Authentication

JSON Mechanics

Each provider can choose to expose a config property named auth. This property will have special handling in the Deployment Engine, which will understand that it contains the instructions to fetch a secret, rather than the secret itself (similar to the handling of KeyVault secret references).
The auth property is defined as an object, with a required type field. The type field is used as a discriminator to provide validation for the various authentication mechanisms available.
The following auth types will be available in template deployments:

UserProvided:
{
  "type": "UserProvided",
  "value": "<raw secret value>"
}

KeyVaultSecret:
{
  "type": "KeyVaultSecret",
  "keyVaultId": "<key vault resource id>",
  "secretName": "<name of the secret to fetch>"
}
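
To illustrate how the type discriminator could drive validation, here is a minimal Python sketch. The field names are taken from the two auth types above; the function name `validate_auth`, the `REQUIRED_FIELDS` table and the error messages are hypothetical and not part of any existing Deployment Engine API.

```python
# Hypothetical validator: field names come from the spec above, everything
# else (names, error text) is illustrative only.
REQUIRED_FIELDS = {
    "UserProvided": {"value"},
    "KeyVaultSecret": {"keyVaultId", "secretName"},
}


def validate_auth(auth: dict) -> str:
    """Return the auth type if the object is well formed, else raise ValueError."""
    auth_type = auth.get("type")
    if auth_type not in REQUIRED_FIELDS:
        raise ValueError(f"unknown auth type: {auth_type!r}")
    # The discriminator selects which other fields must (and may) be present.
    expected = REQUIRED_FIELDS[auth_type] | {"type"}
    present = set(auth)
    missing = expected - present
    extra = present - expected
    if missing or extra:
        raise ValueError(f"missing={sorted(missing)} extra={sorted(extra)}")
    return auth_type
```

A KeyVaultSecret object with both keyVaultId and secretName passes; dropping either field, or supplying an unrecognised type, raises an error.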


NOTE: We could consider extending the built-in auth mechanisms to simplify certain scenarios - e.g. for Kubernetes:
auth: {
  type: 'AksResource'
  id: aks.id
}
This is out of scope for this spec.


NOTE: auth should be used for control planes which do not support AAD OBO tokens. Once the platform has the capability to support OBO, we should ensure it becomes a first-class experience.


NOTE: We are working on improving the ergonomics of the import statement, such that config values can be passed in by the parent module. When we do this work, we should aim to align on a syntax similar to kv.getSecret() for setting keyvault-provided credentials.

Bicep Syntax

param kv resource 'Microsoft.KeyVault/vaults@2022-07-01'

import kubernetes as k8s {
  auth: {
    type: 'KeyVaultSecret'
    keyVaultId: kv.id
    secretName: 'myKubeConfig'
  }
  namespace: 'default'
}

NOTE: We may want to provide syntactic sugar to simplify authoring; conceivably this could look something like:
import kubernetes as k8s {
  auth: kv.getSecret('myKubeConfig')
  namespace: 'default'
}
This is out of scope for this spec.

Interop with Stacks

If an extensible resource is configured using auth, the body of the resource in the outputResources section of the deployment must contain the evaluated auth property. For UserProvided auth type, the value property must not be present.

UserProvided example:
"outputResources": [
  {
    "id": "...",
    "auth": {
      "type": "UserProvided"
    }
    // other properties omitted
  }
]

KeyVaultSecret example:
"outputResources": [
  {
    "id": "...",
    "auth": {
      "type": "KeyVaultSecret",
      "keyVaultId": "/subscriptions/...",
      "secretName": "myKubeConfig"
    }
    // other properties omitted
  }
]
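
The rule above (keep the evaluated auth property, but never emit the raw value for UserProvided) can be sketched in Python as follows. The helper name `to_output_auth` is hypothetical; only the field names come from this spec.

```python
import copy


def to_output_auth(auth: dict) -> dict:
    """Return the auth object as it should appear in outputResources.

    Assumption: the input is the already-evaluated auth property.
    """
    evaluated = copy.deepcopy(auth)  # never mutate the caller's object
    if evaluated.get("type") == "UserProvided":
        # The raw secret must not be persisted in deployment output.
        evaluated.pop("value", None)
    return evaluated
```

KeyVaultSecret objects pass through unchanged, since they only hold a reference to the secret rather than the secret itself.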


NOTE: Stacks will not support the UserProvided option, because it has no capability to store credentials securely, and has no capability for interactive authentication when performing cleanup. It is however trivial to insert credentials into a Key Vault during a deployment, and thus use the KeyVaultSecret mode.

Part 2: IDs

Extensible resources do not contain a well-known name or id field, and instead can consist of one or more identifying fields with different keys (e.g. for Kubernetes, metadata.namespace & metadata.name). Stacks however needs an identifier (or set of identifiers) which it can use to accurately track the lifecycle of a resource.
This identifier must be sufficiently unique such that it cannot be confused with other resources in the same deployment, but it must only be composed of properties that identify the resource. There are also cases where there is no direct property in the resource body which can be mapped to the identifier.
For example, a Kubernetes resource must contain the namespace & the name of the resource. However, if a single deployment targets multiple clusters, namespace + name may not be sufficiently unique - therefore it is necessary to also inject the cluster name into the id.
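
The collision described above can be shown with a small Python sketch. The cluster names `cluster-a` and `cluster-b` and both key functions are made up for illustration; they are not part of the proposed contract.

```python
def naive_key(resource: dict) -> tuple:
    # Identifier built only from properties in the resource body.
    return (resource["namespace"], resource["name"])


def cluster_key(resource: dict) -> tuple:
    # Identifier that also injects the (hypothetical) cluster name.
    return (resource["cluster"], resource["namespace"], resource["name"])


# Two distinct resources deployed to different clusters in one deployment.
resources = [
    {"cluster": "cluster-a", "namespace": "default", "name": "foo"},
    {"cluster": "cluster-b", "namespace": "default", "name": "foo"},
]

naive_ids = {naive_key(r) for r in resources}
cluster_ids = {cluster_key(r) for r in resources}
```

Here `naive_ids` collapses to a single entry, so the two resources would be indistinguishable for change tracking, while `cluster_ids` keeps them apart.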
Authoring Mechanics

The Bicep type provider must indicate which fields in a resource body compose the identifier, as we must verify they are always set in a resource declaration, and are the only properties set in an existing resource declaration. This capability exists in Bicep today, but should instead be moved into types.json so that the type provider is able to make this decision when authoring types.
Overall, the Bicep authoring experience will be unchanged.
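
As a rough sketch of the two checks described above, assuming identifier fields are expressed as dotted paths such as metadata.name (the helper names and the error strings are hypothetical, and this is not how Bicep implements the check today):

```python
def get_path(body: dict, path: str):
    """Look up a dotted path such as 'metadata.name'; return None if absent."""
    current = body
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def leaf_paths(body: dict, prefix: str = ""):
    """Yield the dotted path of every non-object value in the body."""
    for key, value in body.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            yield from leaf_paths(value, path + ".")
        else:
            yield path


def check_declaration(body: dict, identifier_paths: list, existing: bool) -> list:
    """Return a list of validation errors (empty when the declaration is valid)."""
    errors = []
    # Identifier fields must always be set.
    for path in identifier_paths:
        if get_path(body, path) is None:
            errors.append(f"missing identifier property: {path}")
    # An 'existing' declaration may set nothing but identifier fields.
    if existing:
        for path in leaf_paths(body):
            if path not in identifier_paths:
                errors.append(f"property not allowed on existing resource: {path}")
    return errors
```

With identifier paths metadata.namespace and metadata.name, an existing declaration that also sets spec.replicas would be rejected, while the same body is accepted for a regular declaration.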
Extensibility Contract - Changes

New GetId endpoint

Each extensibility provider must implement a mandatory /GetId endpoint for obtaining the predicted id field, given a resource body. This will be used at the start of a deployment operation, to obtain the id for logging. The POST body will be of the same format as the /Save & /PreviewSave endpoints. The response body will be of similar format to other API responses, but just containing the id property:
{
  "resource": {
    "id": "cluster/ant-test-cluster/metadata.namespace/default/metadata.name/foo",
    "type": "apps/Deployment",
    "apiVersion": "v1"
  }
}
Returning of Ids

Since the extensibility provider will be responsible for defining the format of the id field, the extensibility response contract will be updated to include a mandatory id string for all APIs:
{
  "resource": {
    "id": "cluster/ant-test-cluster/metadata.namespace/default/metadata.name/foo",
    "type": "apps/Deployment",
    "apiVersion": "v1",
    "properties": {
      ...
    }
  }
}
Get & Delete changes

Stacks will need to be able to issue a /Get or a /Delete purely using the id, so these methods will be modified to fetch or delete a resource by id:
{
  "import": {...},
  "resource": {
    "id": "cluster/ant-test-cluster/metadata.namespace/default/metadata.name/foo",
    "type": "apps/Deployment",
    "apiVersion": "v1"
  }
}

If the Deployment Engine needs to fetch an existing resource, it will need to first perform a /GetId followed by a /Get, using this id.
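
The /GetId then /Get sequence can be sketched against an in-memory stand-in for a provider. Everything here is hypothetical: `FakeProvider`, the shape of the /GetId request body (assumed to be an import block plus a resource with type, apiVersion and properties) and the id format, which in practice is defined by each provider.

```python
class FakeProvider:
    """In-memory stand-in for an extensibility provider (illustrative only)."""

    def __init__(self):
        self.store = {}  # id -> stored resource

    def get_id(self, request: dict) -> dict:
        res = request["resource"]
        cluster = request["import"]["config"]["cluster"]
        meta = res["properties"]["metadata"]
        # Assumed id format, mirroring the samples in this spec.
        rid = (
            f"cluster/{cluster}"
            f"/metadata.namespace/{meta['namespace']}"
            f"/metadata.name/{meta['name']}"
        )
        return {"resource": {"id": rid, "type": res["type"], "apiVersion": res["apiVersion"]}}

    def get(self, request: dict) -> dict:
        # /Get is addressed purely by id, type and apiVersion.
        return {"resource": self.store[request["resource"]["id"]]}


def fetch_existing(provider, import_block: dict, resource: dict) -> dict:
    """Resolve the id first, then fetch the resource by that id."""
    id_response = provider.get_id({"import": import_block, "resource": resource})
    get_request = {"import": import_block, "resource": id_response["resource"]}
    return provider.get(get_request)
```

The point of the sketch is the ordering: the engine never builds the id itself, it asks the provider for it and then reuses the returned resource block as the /Get payload.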

Deployment Engine - Changes

The Deployment Engine will execute a /GetId at the start of a resource deployment, to obtain the id field. This will be used to save deployment operation results, and for logging.
Interop with Stacks

To add to the "outputResources" body described in Part 1, the Deployment Engine now needs to ensure it provides the full import configuration, along with a list of identifiers for the resource.
"outputResources": [
  {
    "import": {
      "provider": "Kubernetes",
      "version": "0.1",
      "config": {
        "cluster": "myCluster",
        "auth": {
          "type": "KeyVaultSecret",
          "keyVaultId": "/subscriptions/...",
          "secretName": "myKubeConfig"
        }
      }
    },
    "type": "apps/Deployment",
    "apiVersion": "v1",
    "id": "cluster/ant-test-cluster/metadata.namespace/default/metadata.name/foo"
  }
]
Other Changes

In the samples in this document, I have also proposed we split up the "type" field in the Deployments representation as well as the extensibility contract (which currently contains the type AND the apiVersion). This makes it easier to communicate to Stacks which properties should & shouldn't contribute to the uniqueness of a resource. I propose we make this change at the same time as we introduce the id field.

DISCUSSION TOPIC: Is there any reason to make id a string? We could instead define it as a set of keys & values:
{ "cluster": "ant-test-cluster", "metadata.namespace": "default", "metadata.name": "foo" }
For the purpose of logging & deployment operations, we can come up with a consistent mechanism for converting it to a string.
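
One possible conversion, sketched in Python, assumes the provider supplies an explicit key order so the result is deterministic. The function name `id_to_string` and the `key_order` parameter are hypothetical; escaping of values that contain "/" is deliberately left out.

```python
def id_to_string(identifiers: dict, key_order: list) -> str:
    """Join identifier keys and values as key/value/key/value in a fixed order."""
    if set(key_order) != set(identifiers):
        raise ValueError("key_order must list exactly the identifier keys")
    parts = []
    for key in key_order:
        parts.append(key)
        parts.append(str(identifiers[key]))
    return "/".join(parts)


example = id_to_string(
    {"metadata.name": "foo", "cluster": "ant-test-cluster", "metadata.namespace": "default"},
    ["cluster", "metadata.namespace", "metadata.name"],
)
```

Because the order comes from `key_order` rather than from the dictionary, the same set of identifiers always yields the same string, regardless of how the dictionary was built.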

Part 3: Resource Deletion

BulkDelete API

The Deployments service will provide a new ARM API which Stacks can invoke to clean up any extraneous resources (on a Stacks PUT removing resources, or on a Stacks deletion). This will accept an array of resources, in the format described above for the outputResources property.
This API will batch up and send ARM resources to be deleted by ARM's /bulkDelete API. Extensible resources will be handled internally by the Deployment Engine code, using a similar algorithm to the one powering ARM's bulk delete. The format described above will be sufficient to generate the Delete request body which the Deployment Engine needs to submit to the Extensibility Host to perform each resource cleanup.
Similar to the behavior on a template PUT, the Deployment Engine will need the capability to resolve auth credentials in order to submit the Delete request to the Extensibility Host.
For example:
{
  "import": {
    "provider": "Kubernetes",
    "version": "0.1",
    "config": {
      "cluster": "myCluster",
      "auth": {
        "type": "KeyVaultSecret",
        "keyVaultId": "/subscriptions/...",
        "secretName": "myKubeConfig"
      }
    }
  },
  "type": "apps/Deployment",
  "apiVersion": "v1",
  "id": "cluster/ant-test-cluster/metadata.namespace/default/metadata.name/foo"
}
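
A minimal Python sketch of how the new API might split its input and build the Delete request bodies is shown below. It assumes that extensible entries are recognisable by the presence of an import block and that ARM entries carry only an id; the function names are hypothetical, and credential resolution is not shown.

```python
def is_extensible(resource: dict) -> bool:
    # Assumption: only extensible resources carry an "import" block.
    return "import" in resource


def to_delete_request(resource: dict) -> dict:
    """Build the Delete body in the 'Get & Delete changes' format from Part 2."""
    return {
        # The auth inside "import" is still a reference here; the engine
        # would resolve it before calling the Extensibility Host.
        "import": resource["import"],
        "resource": {
            "id": resource["id"],
            "type": resource["type"],
            "apiVersion": resource["apiVersion"],
        },
    }


def plan_bulk_delete(resources: list) -> tuple:
    """Split input into ARM ids (for ARM's /bulkDelete) and extensible requests."""
    arm_ids = [r["id"] for r in resources if not is_extensible(r)]
    requests = [to_delete_request(r) for r in resources if is_extensible(r)]
    return arm_ids, requests
```

The outputResources entry carries everything needed for the request body, so no additional lookup is required before submitting the Delete.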
Stacks RP changes

The changes required for Stacks to support the end-to-end:

Persist the format used for extensible resources in the outputResources body, and supply it on a /bulkDelete request.
Use the new Deployments /bulkDelete API for cleanup.
Understand which properties of the output resource definition constitute a unique identifier for a resource, for change tracking.
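
The change-tracking step can be sketched as a set difference over a tracking key. Which properties belong in that key is exactly what Stacks needs to be told; the choice below (provider, type and id, excluding apiVersion, version and auth) is an assumption for illustration, and both function names are hypothetical.

```python
def tracking_key(resource: dict) -> tuple:
    """Return the tuple used to decide whether two entries are the same resource."""
    if "import" not in resource:
        # ARM resources are already uniquely identified by their id.
        return ("arm", resource["id"])
    # Assumed key for extensible resources: provider + type + id.
    return ("ext", resource["import"]["provider"], resource["type"], resource["id"])


def extraneous(previous: list, current: list) -> list:
    """Resources tracked by the previous Stack state but absent from the new one."""
    current_keys = {tracking_key(r) for r in current}
    return [r for r in previous if tracking_key(r) not in current_keys]
```

The entries returned by `extraneous` are the ones Stacks would pass to the Deployments /bulkDelete API. Under this assumed key, a change of apiVersion alone does not cause a resource to be treated as removed.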

Design Notes/Assumptions


This spec doesn't aim to solve the repetition associated with import statements in Bicep; there are separate proposals for this. I am however making the assumption that this will be a solved problem in the future.
Some of the samples use proposed syntax (e.g. Resources as parameters) for simplicity, but do not require this syntax to exist.
We would like to retain the capability to pass raw credentials for local-mode (non-Azure) evaluation, but do not need to support this mode with Stacks on Azure.
The current Kubernetes provider has already implicitly introduced the concept of defaulting properties in the import configuration block. I have removed this capability to simplify the generation of resource IDs, with the understanding that it'll be brought back with a more generic proposal - to avoid the repetition of having to specify the Kubernetes namespace multiple times in a file.

Challenges/Considerations

Immutability

Any id generated by Bicep must uniquely identify a resource, and be immutable for the lifecycle of that resource. This means that we need to be careful that it is composed of all the unique identifying characteristics of a resource, in a deterministic order. It will also be important to ensure that the id identifies the same resource across different versions of a provider.
Example

In the context of Kubernetes, we will want to include the type, namespace, name & identifying cluster information in the id.
Auth

Extensible resources generally require an authentication context to communicate with an external control plane. Deployment Stacks will need to be able to access this same authentication context in order to submit a Delete request.
Dependency on 'runtime' values

In certain scenarios, the authentication context will not be known at the start of the deployment - for example using an Azure list* method to access the kubeConfig for an AKS cluster. It's not clear how Deployment Stacks would be able to capture and utilize this operation to obtain the auth needed to clean up a deleted resource.
Locking

Deployment Stacks supports locking of ARM resources to prevent external modification or deletion. This isn't something that can be generically extended to other control planes. We could consider adding this to the extensibility contract in future if we have a compelling reason to do so (e.g. a particular control plane supports it).
Layering with the Deployment Stacks RP

The current Deployment Stacks design uses the Deployments RP in order to orchestrate a deployment, but uses the ARM bulk delete API to perform cleanup. Deployments extensibility is purely built into the Deployments RP, and doesn't (and conceptually shouldn't) involve ARM. This may necessitate the creation of an API on the Deployments RP to perform cleanup.
Uniqueness

There will be resources which cannot have a globally unique identifier - for example a private Kubernetes cluster with a non-public DNS name. We should design the feature in a way that mixing up two resources is difficult, but it will not be possible to define a globally-unique id.