[Observability] Core and Custom Metrics in Radius

Author: Yetkin Timocin (@ytimocin)
Last Updated: 11/15/2022
Status: Work in Progress

This document proposes a design for the set of metrics in Radius.

Introduction

This proposal covers the addition of OpenTelemetry metrics to Radius; it does not cover instrumenting user applications. The main goal is the supportability and instrumentation of Radius using industry-standard patterns, which means we will rely on existing, widely-used solutions to instrument Radius.

Goals

  • Instrumentation of Radius with OpenTelemetry.
  • Metrics for AppCoreRP, UCP, Deployment Engine.
  • A working example for Radius customers using Prometheus and Grafana.

Non-Goals

  • Traces and logs.
  • CLI Metrics.

Tools

As mentioned above, we will use OpenTelemetry, a vendor-neutral, open-source observability framework. For the working example, we will use Prometheus and Grafana.

  • Client SDK: OpenTelemetry Metrics SDK
  • Telemetry Backend: Prometheus
  • Analytics/Dashboard: Grafana
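
As a rough sketch of how these pieces fit together for the Go services, the example below registers an OpenTelemetry MeterProvider backed by the Prometheus exporter and exposes a /metrics endpoint for scraping. This is illustrative only, not Radius's actual bootstrap code; module paths are the upstream OpenTelemetry Go packages, and exact APIs may vary across SDK versions.

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/prometheus"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func main() {
	// The Prometheus exporter doubles as an SDK metric Reader and
	// registers its collector with the default Prometheus registry.
	exporter, err := prometheus.New()
	if err != nil {
		log.Fatal(err)
	}

	// Register the provider globally so any component can create meters.
	provider := sdkmetric.NewMeterProvider(sdkmetric.WithReader(exporter))
	otel.SetMeterProvider(provider)

	// Expose the metrics in Prometheus text format for scraping.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":2112", nil))
}
```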

OpenTelemetry

"OpenTelemetry is a collection of tools, APIs, and SDKs. Use it to instrument, generate, collect, and export telemetry data (metrics, logs, and traces) to help you analyze your software's performance and behavior." (https://opentelemetry.io/)

Why OpenTelemetry?

  1. Open Standards and Data Portability: The format is vendor-neutral, which means you can change the backend that consumes your telemetry data without changing the way you collect it.
  2. Contributors and Adopters: OpenTelemetry SDKs provide strong automatic instrumentation thanks to their community of contributors and vendors. You can see a full list of adopters here: https://github.com/open-telemetry/community/blob/main/ADOPTERS.md.
  3. CNCF: OpenTelemetry is one of the two most active CNCF open-source projects, which reinforces the previous point.

Alternatives

There are few compelling alternatives to OpenTelemetry in this space.

High-Level Design

[High-level design diagram]

Radius Metrics

We will gather metrics from the Radius services and from the clients used within those services.

Radius Services are as follows:

  1. Core RP
  2. Link RP
  3. UCP
  4. Deployment Engine

Helper Services:

  1. Worker Server (for Async Operations)

Clients that we use are as follows:

  1. Azure Client
  2. AWS Client
  3. Kubernetes Client
  4. Data Store Client
  5. Queue Client
  6. Secret Client

Proposed Radius Metrics

System Metrics

  • CPU Metrics
  • Memory Metrics
  • Network Metrics

Goroutine Metrics
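
For the Go services, system and goroutine metrics could come from the OpenTelemetry Go contrib runtime instrumentation rather than hand-rolled collectors. A minimal sketch, assuming the go.opentelemetry.io/contrib/instrumentation/runtime package and a meter provider already registered globally as in the earlier example:

```go
package main

import (
	"log"
	"time"

	"go.opentelemetry.io/contrib/instrumentation/runtime"
)

func main() {
	// Starts background collection of Go runtime metrics (heap usage,
	// GC pauses, goroutine counts, ...) and publishes them through the
	// globally registered meter provider.
	if err := runtime.Start(runtime.WithMinimumReadMemStatsInterval(time.Second)); err != nil {
		log.Fatal(err)
	}

	// ... start the rest of the service here ...
	select {} // block forever in this sketch
}
```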

.NET Runtime Metrics

Common HTTP Server Metrics

  • Request Duration (Histogram)
  • Request Size (Histogram)
  • Response Size (Histogram)
  • Active Requests (UpDownCounter)
  • Response Grouped By Status Code (500s, 400s, 200s...)
  • Response Grouped By Resource and Action (e.g., PUT Container)
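
To make these concrete, the sketch below shows one way to record the duration histogram and active-request UpDownCounter from Go HTTP middleware using the OpenTelemetry metric API. Instrument names loosely follow the OpenTelemetry HTTP semantic conventions [7]; the middleware and the statusRecorder wrapper are illustrative, not existing Radius code.

```go
package metrics

import (
	"net/http"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// statusRecorder captures the status code written by the handler so it
// can be attached to the duration measurement as an attribute.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// Middleware records request duration and the number of in-flight requests.
func Middleware(next http.Handler) http.Handler {
	meter := otel.Meter("radius/http-server")

	// Instrument-creation errors are elided for brevity in this sketch.
	duration, _ := meter.Float64Histogram("http.server.duration",
		metric.WithUnit("ms"),
		metric.WithDescription("Duration of inbound HTTP requests"))
	active, _ := meter.Int64UpDownCounter("http.server.active_requests",
		metric.WithDescription("Number of in-flight HTTP requests"))

	return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		ctx := req.Context()
		active.Add(ctx, 1)
		defer active.Add(ctx, -1)

		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		start := time.Now()
		next.ServeHTTP(rec, req)

		// Attributes allow grouping by method and status code on the backend.
		duration.Record(ctx, float64(time.Since(start).Milliseconds()),
			metric.WithAttributes(
				attribute.String("http.method", req.Method),
				attribute.Int("http.status_code", rec.status)))
	})
}
```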

Common Client Metrics

  • Request Duration (Histogram)
  • Request Size (Histogram)
  • Response Size (Histogram)
  • Response Grouped By Requested Action
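
On the client side, the same pattern can be applied by wrapping an HTTP transport. The sketch below is illustrative only; clients such as the Azure and AWS SDK clients have their own instrumentation hooks, and a raw http.RoundTripper wrapper is shown just to convey the idea.

```go
package metrics

import (
	"net/http"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// metricsTransport records a duration histogram for every outbound request.
type metricsTransport struct {
	base     http.RoundTripper
	duration metric.Float64Histogram
}

func NewMetricsTransport(base http.RoundTripper) http.RoundTripper {
	meter := otel.Meter("radius/http-client")
	duration, _ := meter.Float64Histogram("http.client.duration",
		metric.WithUnit("ms")) // error elided for brevity
	return &metricsTransport{base: base, duration: duration}
}

func (t *metricsTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	start := time.Now()
	resp, err := t.base.RoundTrip(req)

	attrs := []attribute.KeyValue{attribute.String("http.method", req.Method)}
	if resp != nil {
		attrs = append(attrs, attribute.Int("http.status_code", resp.StatusCode))
	}
	t.duration.Record(req.Context(), float64(time.Since(start).Milliseconds()),
		metric.WithAttributes(attrs...))
	return resp, err
}
```

A client would then be built as &http.Client{Transport: NewMetricsTransport(http.DefaultTransport)}, so every outbound call is measured without touching call sites.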

Core RP and Link RP Metrics

  • Common HTTP Server Metrics
  • Additional Metrics

UCP Metrics

  • Common HTTP Server Metrics
  • Requests Grouped By Plane
  • Additional Metrics

Deployment Engine Metrics

  • Common HTTP Server Metrics
  • Provisioning Time
  • Response Grouped By Provisioning Status (Failed, Completed...)

Azure Client

  • Common Client Metrics
  • Additional Metrics

AWS Client

  • Common Client Metrics
  • Additional Metrics

Kubernetes Client

  • Common Client Metrics
  • Additional Metrics

Data Store Client

  • Common Client Metrics
  • Additional Metrics

Queue Client

  • Common Client Metrics
  • Group By Operation Result and Resource
  • Message Count

Worker Server

  • Number of Workers per Service (Core RP, Link RP)
  • Status of each Worker
  • Average number of Messages processed per Worker
  • Average time it takes to process a Message
  • Number of Successful and Failed Processing Attempts
  • Number of Extended Messages
  • Messages grouped by Resource Type and Action (e.g., Container PUT)
  • Number of Duplicated Messages
  • Number of Timed out Operations
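
Several of these map directly onto OpenTelemetry instruments. The sketch below is illustrative and uses a hypothetical helper name (recordProcessed); it is not existing Radius worker code.

```go
package worker

import (
	"context"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

var (
	meter = otel.Meter("radius/worker")

	// Counter of processed messages, grouped by resource type, action,
	// and result. Instrument-creation errors elided for brevity.
	processed, _ = meter.Int64Counter("worker.messages.processed")

	// Histogram of per-message processing time.
	processTime, _ = meter.Float64Histogram("worker.message.duration",
		metric.WithUnit("ms"))
)

// recordProcessed is a hypothetical helper a worker could call after
// handling a message.
func recordProcessed(ctx context.Context, resourceType, action string, err error, elapsed time.Duration) {
	result := "succeeded"
	if err != nil {
		result = "failed"
	}
	attrs := metric.WithAttributes(
		attribute.String("resource_type", resourceType),
		attribute.String("action", action),
		attribute.String("result", result))
	processed.Add(ctx, 1, attrs)
	processTime.Record(ctx, float64(elapsed.Milliseconds()), attrs)
}
```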

Phases of Work

The main goal is to get the foundation up and running in phase 1. After building the foundation, the work can be parallelized between scrum teams based on the expertise of each team.

  • Phase 1 (Laying the foundation)
    • Metrics:
      • System Metrics
      • Goroutine Metrics
  • Phase 2 (Laying the foundation for the HTTP Servers)
    • Prerequisites:
      • Phase 1
    • Metrics:
      • Common HTTP Server Metrics
    • Services:
      • UCP
      • Core RP
      • Link RP
  • Phase 3 (Laying the foundation for the Clients)
    • Prerequisites:
      • Phase 1
    • Metrics:
      • Common Client Metrics
    • Clients:
      • Azure Client
      • AWS Client
  • Phase 4 (Clients cont'd)
    • Prerequisites:
      • Phase 3
    • Clients:
      • Data Store Client
      • Kubernetes Client
      • Queue Client
      • Worker Server
  • Phase 5 (Working on the Deployment Engine side)
    • Prerequisites:
      • Phase 1
    • Metrics:
      • .NET Runtime Metrics
    • Service:
      • Deployment Engine
  • Phase 6 (Creating the Example)
    • Prerequisites:
      • Almost all phases if we want to showcase all the services and clients

Risks

This is a low-risk project. Metrics collection adds only minimal overhead, so it should not have a serious performance effect on the system.

Future Work

  • CLI Metrics: CLI metrics could be useful mainly because the team may want to know which commands are used the most.
  • More Examples: We can provide more examples using other third-party observability tools offered by Azure and AWS.

Open Questions

Could we collect metrics from end-to-end resource creation to get the end-to-end completion rate?

References

  1. https://margara.faculty.polimi.it/papers/2020_debs_kaiju.pdf
  2. https://github.com/kubernetes/design-proposals-archive/blob/main/instrumentation/monitoring_architecture.md
  3. https://github.com/dapr/dapr/blob/master/docs/development/dapr-metrics.md
  4. https://opentelemetry.io/docs/
  5. https://github.com/open-telemetry/opentelemetry-specification/tree/main/specification/resource/semantic_conventions
  6. https://medium.com/jaegertracing/jaeger-embraces-opentelemetry-collector-90a545cbc24
  7. https://opentelemetry.io/docs/reference/specification/metrics/semantic_conventions/http-metrics/

AB#5058

@lakshmimsft commented Nov 18, 2022

I asked Yetkin how we're approaching data sizing, defaults, and possible configuration of data retention policies, and about long-term thoughts on helping Radius clients who want to store their data long term.
Found some initial documentation:
https://prometheus.io/docs/prometheus/latest/storage/
https://prometheus.io/docs/prometheus/1.8/storage/
We can possibly add some info to this proposal.

@youngbupark commented Nov 18, 2022

> (quoting @lakshmimsft's comment above)

Data retention is a matter of how the metric collector and observability platform (Prometheus, ELK, etc.) are set up, which is out of scope for this feature. Each user has their own observability platform, such as Datadog, Azure Monitor, Stackdriver, or CloudWatch, and data retention configuration is their own preference based on their org policy. All of these observability platforms can collect Prometheus metrics from each service without installing an additional Prometheus instance, because the Prometheus pull-model endpoint is now the de facto standard in cloud-native environments. We want to use the Prometheus metric protocol, but not force users to use only the Prometheus collector and its pipeline.

In other words, even though we can provide a sample Prometheus setup tutorial, it is just an example of how to collect Radius metrics. The whole infrastructure setup is up to Radius users.

@snehabandla commented

Would it be relevant to break some of these down by resource type? It might help in understanding which resources are most commonly used, and also where we need to improve response time.
