@rynowak
Last active September 6, 2023 19:00
Notes on the implementation of tracked resources

Tracked Resources Notes

Synchronous Flow

Today, it looks something like this:

---
title: Synchronous resource flow
---

sequenceDiagram
    Client->>UCP: PUT /planes/.../providers/Applications.Core/applications/myapp
    UCP->>Applications RP: Proxy PUT /planes/.../providers/Applications.Core/applications/myapp
    Applications RP->>UCP: Respond HTTP 200 OK
    UCP->>Client: Respond HTTP 200 OK

We can add tracked behavior by updating our stored state in UCP when the RP responds.

---
title: Synchronous tracked resource flow
---

sequenceDiagram
    Client->>UCP: PUT /planes/.../providers/Applications.Core/applications/myapp
    UCP->>Applications RP: Proxy PUT /planes/.../providers/Applications.Core/applications/myapp
    Applications RP->>UCP: Respond HTTP 200 OK
    UCP-->>UCP: Update stored record /planes/.../providers/Applications.Core/applications/myapp
    UCP->>Client: Respond HTTP 200 OK
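
The tracked synchronous flow above can be sketched in a few lines. This is a minimal illustration, not the Radius implementation: `TrackingProxy`, `ResourceProvider`, and `echo_provider` are hypothetical names, and the stored record is assumed to be simply the body the RP returned.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple

# Hypothetical stand-in for a resource provider: takes (method, resource_id, body)
# and returns (status_code, response_body).
ResourceProvider = Callable[[str, str, dict], Tuple[int, dict]]


@dataclass
class TrackingProxy:
    """Illustrative stand-in for UCP; names and shapes are assumptions."""

    provider: ResourceProvider
    records: Dict[str, dict] = field(default_factory=dict)

    def put(self, resource_id: str, body: dict) -> Tuple[int, dict]:
        # Proxy the PUT to the resource provider.
        status, response = self.provider("PUT", resource_id, body)
        # Update the stored record only after the RP reports a synchronous success.
        if status == 200:
            self.records[resource_id] = response
        # Relay the RP's response to the client unchanged.
        return status, response


def echo_provider(method: str, resource_id: str, body: dict) -> Tuple[int, dict]:
    """Fake RP that accepts every request synchronously."""
    return 200, {"id": resource_id, "properties": body}
```

The record is written after the RP responds and before the client sees the response, matching the order of the arrows in the diagram.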

Asynchronous Flow

For an async operation, the flow includes an initial async response (HTTP 201 or 202) with an operation URL. The operation is used for monitoring and is polled by the client.

---
title: Asynchronous resource flow
---

sequenceDiagram
    Client->>UCP: PUT /planes/.../providers/Applications.Core/containers/mycontainer
    UCP->>Applications RP: Proxy PUT /planes/.../providers/Applications.Core/containers/mycontainer
    activate Applications RP
    Applications RP->>UCP: Respond HTTP 202 Accepted - Operation 9bcbe4d8-c103-483b-b2a2-fd5ca0c99a48
    UCP->>Client: Respond HTTP 202 Accepted
    activate Client
    Client-->Client: Monitor operation 9bcbe4d8-c103-483b-b2a2-fd5ca0c99a48
    Applications RP-->>Applications RP: Complete operation 9bcbe4d8-c103-483b-b2a2-fd5ca0c99a48 processing
    deactivate Applications RP
    Client->>UCP: GET operation 9bcbe4d8-c103-483b-b2a2-fd5ca0c99a48
    UCP->>Applications RP: Proxy GET operation 9bcbe4d8-c103-483b-b2a2-fd5ca0c99a48
    Applications RP->>UCP: Respond HTTP 200 OK operation 9bcbe4d8-c103-483b-b2a2-fd5ca0c99a48 complete
    UCP->>Client: Respond HTTP 200 OK operation 9bcbe4d8-c103-483b-b2a2-fd5ca0c99a48 complete
    deactivate Client

To add tracking to this flow, UCP also needs to monitor the completion of the operation.

Note: In this diagram, monitoring by the client is omitted for simplicity. It may still occur, but it does not influence the behavior of UCP. If the client disconnects or shuts down, UCP will still update its state; it does not rely on the client remaining active.

---
title: Asynchronous tracked resource flow
---

sequenceDiagram
    Client->>UCP: PUT /planes/.../providers/Applications.Core/containers/mycontainer
    UCP->>Applications RP: Proxy PUT /planes/.../providers/Applications.Core/containers/mycontainer
    activate Applications RP
    Applications RP->>UCP: Respond HTTP 202 Accepted - Operation 9bcbe4d8-c103-483b-b2a2-fd5ca0c99a48
    activate UCP
    UCP-->UCP: Monitor operation 9bcbe4d8-c103-483b-b2a2-fd5ca0c99a48
    UCP->>Client: Respond HTTP 202 Accepted
    Applications RP-->>Applications RP: Complete operation 9bcbe4d8-c103-483b-b2a2-fd5ca0c99a48 processing
    deactivate Applications RP
    UCP->>Applications RP: GET operation 9bcbe4d8-c103-483b-b2a2-fd5ca0c99a48
    Applications RP->>UCP: Respond HTTP 200 OK operation 9bcbe4d8-c103-483b-b2a2-fd5ca0c99a48 complete
    deactivate UCP
    UCP-->UCP: Update stored record /planes/.../providers/Applications.Core/containers/mycontainer
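
The monitoring step in this diagram might look roughly like the loop below. It is a sketch under stated assumptions: `monitor_operation`, `get_operation_status`, and `get_resource` are hypothetical names, the terminal status strings follow the ARM convention (`Succeeded`, `Failed`, `Canceled`), and a real implementation would run as a scheduled background job rather than a blocking loop.

```python
import time
from typing import Callable, Dict, Optional

# Terminal states in the ARM async-operation convention (assumed to apply here).
TERMINAL_STATES = {"Succeeded", "Failed", "Canceled"}


def monitor_operation(
    get_operation_status: Callable[[str], str],
    get_resource: Callable[[str], dict],
    records: Dict[str, dict],
    resource_id: str,
    operation_id: str,
    poll_interval: float = 1.0,
    max_polls: int = 30,
) -> Optional[str]:
    """Poll the operation until it is terminal, then update the stored record.

    Returns the terminal status, or None if max_polls was exhausted.
    """
    for _ in range(max_polls):
        status = get_operation_status(operation_id)
        if status in TERMINAL_STATES:
            if status == "Succeeded":
                # Retrieve the resource state and store it, independent of the client.
                records[resource_id] = get_resource(resource_id)
            return status
        time.sleep(poll_interval)
    return None
```

Nothing in the loop depends on the client, which is the point of the note above: UCP finishes the bookkeeping even if the caller goes away.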

Data races

Seems simple, right? This approach makes some heavy assumptions about ordering that are not true in practice:

  • Assuming that only one async operation is in progress at a time.
  • Assuming that the order of operations observed by UCP will be the order of operations processed by the RP.
  • Assuming that there are no data races between UCP observing the results of an operation and beginning another.

Consider the following cases:

  • Two PUT operations arrive at UCP at the same time.
    • Depending on ordering the resource could end in state A or state B.
    • The goal is for UCP and the RP to agree on either A or B.
    • If the RP does not support concurrent execution then one of these operations will be rejected.
    • If the RP supports concurrent execution then the completion order is determined by the RP (UCP can make no assumptions).
      • If UCP attempts to enforce ordering using a monotonic counter (like generation in Kubernetes) then it may observe a different order than the RP.
      • The RP contract does not include a monotonic counter, so the RP has no way to communicate its internal ordering.
  • A PUT and a DELETE operation arrive at UCP at the same time.
    • Depending on ordering, the resource either exists or does not (Schrödinger's Resource).
    • The goal is for UCP and the RP to agree on whether the resource exists.
    • All of the same problems exist.

This does not even require asynchrony to cause issues. The same kinds of data races exist for synchronous resource operations.
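
To make the two-PUT case concrete, here is a tiny deterministic illustration. The orderings are made up for the example: it assumes the RP happened to process the PUTs as B then A, while UCP observed the completions as A then B and naively kept the last one it saw.

```python
def apply_in_order(bodies):
    """Return the final state after applying PUT bodies in the given order."""
    state = None
    for body in bodies:
        state = body  # each PUT replaces the whole resource
    return state


# Hypothetical orderings for two concurrent PUTs with bodies "A" and "B".
rp_order = ["B", "A"]   # order in which the RP processed the PUTs
ucp_order = ["A", "B"]  # order in which UCP observed the completions

rp_state = apply_in_order(rp_order)
ucp_state = apply_in_order(ucp_order)

print(rp_state, ucp_state)  # prints "A B": UCP and the RP disagree
```

Both sides applied the same two operations, yet they end in different states, and nothing in the RP contract lets UCP detect which order the RP actually used.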

Solutions

What if UCP buffered resource operations and acted as flow-control?

  • UCP could enqueue resource operations.
    • This would require strong ordering inside UCP.
    • This would require UCP to generate operation IDs on behalf of the RP.
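
A rough sketch of what buffering could mean, assuming a per-resource FIFO queue and UCP-generated UUIDs as operation IDs. `OperationQueue` and its methods are hypothetical, and an in-memory queue sidesteps the hard part (strong ordering across UCP replicas).

```python
import uuid
from collections import defaultdict, deque
from typing import Optional, Tuple

Operation = Tuple[str, str, Optional[dict]]  # (operation_id, method, body)


class OperationQueue:
    """Illustrative per-resource FIFO buffer; not a durable or distributed queue."""

    def __init__(self) -> None:
        self._queues = defaultdict(deque)

    def enqueue(self, resource_id: str, method: str, body: Optional[dict] = None) -> str:
        # UCP generates the operation ID itself, since the RP has not seen the request yet.
        operation_id = str(uuid.uuid4())
        self._queues[resource_id].append((operation_id, method, body))
        return operation_id

    def next_operation(self, resource_id: str) -> Optional[Operation]:
        # Operations for one resource are released strictly in arrival order.
        queue = self._queues[resource_id]
        return queue.popleft() if queue else None
```

Because only one operation per resource would be forwarded at a time, the order UCP records becomes the order the RP sees, at the cost of UCP owning the operation lifecycle.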

What if UCP relied on optimistic concurrency control in its database?

  • UCP monitors all resource operations and captures the ETag of its stored data at the beginning of each operation. When an operation completes, UCP does an OCC update of its stored data. If the ETag does not match, UCP re-queries the RP, because the current state of the resource may not match the result of the operation that just completed.
    • This ensures that all operations are "completed" by UCP (no missed updates).
    • We only observe/store/notify the current state.
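
The OCC idea can be sketched as follows. Everything here is an assumption for illustration: `Store` is an in-memory stand-in for UCP's database with an integer ETag, `query_rp` is a hypothetical callback that fetches the resource's current state from the RP, and the retry limit is arbitrary.

```python
from typing import Any, Callable, Dict, Tuple


class ETagMismatch(Exception):
    """Raised when the stored record changed since its ETag was captured."""


class Store:
    """In-memory stand-in for UCP's database (integer ETags, assumed)."""

    def __init__(self) -> None:
        self._data: Dict[str, Tuple[int, Any]] = {}

    def get(self, resource_id: str) -> Tuple[int, Any]:
        return self._data.get(resource_id, (0, None))

    def update(self, resource_id: str, state: Any, expected_etag: int) -> None:
        current_etag, _ = self.get(resource_id)
        if current_etag != expected_etag:
            raise ETagMismatch(resource_id)
        self._data[resource_id] = (current_etag + 1, state)


def complete_operation(
    store: Store,
    resource_id: str,
    etag_at_start: int,
    operation_result: Any,
    query_rp: Callable[[str], Any],
    max_retries: int = 3,
) -> Any:
    """Store the operation result, falling back to the RP's current state on conflict."""
    state, etag = operation_result, etag_at_start
    for _ in range(max_retries):
        try:
            store.update(resource_id, state, etag)
            return state
        except ETagMismatch:
            # Another operation updated the record first, so this result may be stale.
            # Capture the new ETag before asking the RP, so a later change is still detected.
            etag, _ = store.get(resource_id)
            state = query_rp(resource_id)
    raise ETagMismatch(resource_id)
```

On a conflict, UCP stops trusting the result of the operation it was watching and stores whatever the RP reports as current, which is why only the current state is ever observed.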

Tracked Resources

  • Status: Pending
  • Author: Ryan Nowak (@rynowak)

Overview

UCP provides rich resource audit and lifecycle management features for all kinds of resources. In particular we're inspired by the capabilities of ARM (Azure Resource Manager) that are both user-facing and internal. As an open-source project both of these categories are vital - user-facing features enable users, and internal features enable extensibility. As motivating examples consider 1) resource groups as a user-facing feature, 2) notifications for lifecycle events of resources as an internal feature that enables extensibility. Users can use resource groups to organize, audit and bulk-delete resources with a related lifecycle. Radius can use internal notifications to implement and maintain the application graph - a complex data-structure that spans the whole system.

These capabilities of ARM are based on a centralized feature called tracked resources: the ability for the centralized parts of the control-plane (UCP in our case) to understand and track the state and lifecycle of each resource as it is created, modified, and deleted. This functionality is centralized because it simplifies the contract and dependencies on each resource provider that contributes resources to the overall system.

This document details the design of tracked resources as generic functionality upon which other stateful features can be built. The initial proposal also describes a new feature and scenario: listing all resources in a resource group, regardless of their resource type. This is the most basic feature that can be layered on top of tracked resources, and serves as verification that the system behaves as expected and works as a foundation for us to build additional features.

Terms and definitions

An understanding of the following concepts pages and their associated terminology is required for this document:

Objectives

Issue Reference: radius-project/radius#4844

This issue is the best fit, but it is lacking in detail. Most of our issues track user experiences or scenarios, and so many of them relate to this work.

Goals

  • Enrich UCP with knowledge of the state and lifecycle operations of each resource
  • Follow the ARM precedent for the design of tracked resources and related features
    • We're far behind ARM in terms of our capabilities, and the road ahead of us is very smooth if we stick to it.
    • Let's not assume we know better or reinvent the wheel.
  • Set ourselves up for success with our future capabilities
    • Resource group list (in scope for initial work)
    • Resource group deletion (out of scope for initial work)
    • Notifications (out of scope for initial work)
    • Support for stapling of non-Radius resources into a resource group (non-idempotent AWS resources)
  • Build this right without creating technical debt for our future selves.
    • Support decentralized databases.
    • Rely only on the ARM-RPC contract.
    • No assumptions about concurrency support for individual resources.

Non goals

  • Introduce more complexity than necessary to build tracked resources. These are features we will eventually do but are not required to build tracked resources.
    • Proxy resource support. (Resources that are top-level but NOT tracked resources)
    • Registration of resource types.
  • Build all possible tracked-resource scenarios at once
    • Resource group deletion is out of scope for initial work
    • Notifications are out of scope for initial work
    • Non-Radius resources appearing in a resource group (resource stapling)

User scenarios (optional)

Listing resources

Rod is trying out Radius and has deployed a few sample applications as part of the Radius tutorials. A few days later he wants to clean up his Kubernetes cluster, but before he does that he wants to understand what was deployed. It would be unwise to do a bulk-deletion when one is ignorant of what is being deleted ... and Rod is a very wise man.

To understand what resources will be deleted he runs rad resource list --all. This command lists all of the resources in all resource groups at the command line. Now that he understands, he can wisely continue.

.... Rod will return in part 2: Revenge of the resource group

Design

UCP is responsible for proxying resource lifecycle operations to the appropriate resource provider. Lifecycle operations are either synchronous or asynchronous. During this proxy flow UCP can observe all of the mutating request traffic and update its internal state after the request (synchronous operation) or after the asynchronous operation completes (asynchronous operation). Observing the results of asynchronous operations requires that UCP schedule its own asynchronous jobs to monitor the resource. We require that all resource lifecycle traffic flow through UCP for RBAC and correctness reasons, and so we can assume that UCP has full visibility into the state.

We can think of this design as having the following parts:

  • Proxying operations to the resource provider (already implemented)
  • Classifying an operation and its result based on the HTTP response to the operation
    • Mutating or non-mutating
    • Success or failure
    • Synchronous or asynchronous
  • Waiting for operation completion (in the case of asynchronous operations)
  • Retrieving resource state
  • Updating UCP's store of resource state

Most critical is UCP's ability to classify the operation being performed as we cannot behave correctly without a correct understanding of this.
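
As a sketch of the classification step, the function below sorts a request/response pair along the three axes listed above. The rules are assumptions for illustration: PUT, PATCH, and DELETE are treated as mutating, any 2xx status as success, and a 201 or 202 that carries an operation URL as asynchronous (following the async flow described earlier). `Classification` and `classify` are hypothetical names.

```python
from dataclasses import dataclass

# Assumed set of mutating methods; POST actions are ignored in this sketch.
MUTATING_METHODS = {"PUT", "PATCH", "DELETE"}


@dataclass(frozen=True)
class Classification:
    mutating: bool
    success: bool
    asynchronous: bool


def classify(method: str, status_code: int, has_operation_url: bool = False) -> Classification:
    """Classify an operation from its HTTP method and the RP's response."""
    mutating = method.upper() in MUTATING_METHODS
    success = 200 <= status_code < 300
    # Async responses are 201 or 202 accompanied by an operation URL to poll.
    asynchronous = mutating and status_code in (201, 202) and has_operation_url
    return Classification(mutating, success, asynchronous)
```

For example, `classify("PUT", 202, has_operation_url=True)` is mutating, successful, and asynchronous, so UCP would need to wait for completion before updating its store, while `classify("GET", 200)` is non-mutating and needs no update at all.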

Design details

API design (if applicable)

Alternatives considered

Test plan

Security

Compatibility (optional)

Monitoring

Development plan

Open issues
