[Observability] Core and Custom Metrics in Radius

Author: Yetkin Timocin (@ytimocin) Last Updated: 11/15/2022 Status: [Work in Progress]

This document proposes a design for the set of metrics in Radius.

Introduction

This proposal covers the addition of OpenTelemetry metrics for Radius. This proposal is not for instrumenting user applications. The main goal is the supportability and instrumentation of Radius using industry-standard patterns. That means we will use existing, widely used solutions to instrument Radius.

Goals

  • Instrumentation of Radius with OpenTelemetry.
  • Metrics for AppCoreRP, UCP, Deployment Engine.
  • A working example for Radius customers using Prometheus and Grafana.

Non-Goals

  • Traces and logs.
  • CLI Metrics.

Tools

As mentioned above, we will use the vendor-neutral, open-source observability framework OpenTelemetry. For the working example, we will use Prometheus and Grafana.

  • Client SDK: OpenTelemetry Metrics SDK
  • Telemetry Backend: Prometheus
  • Analytics/Dashboard: Grafana
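
As a rough illustration of how these pieces fit together, the sketch below creates a MeterProvider backed by the OpenTelemetry Prometheus exporter and exposes a /metrics endpoint for Prometheus to scrape. This is not the final Radius wiring, and API details vary between OpenTelemetry Go SDK versions; the port and the assumption that the exporter registers with the default Prometheus registry are illustrative.

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/prometheus"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func main() {
	// The Prometheus exporter acts as an OpenTelemetry metric Reader and
	// exposes collected metrics in the Prometheus exposition format.
	exporter, err := prometheus.New()
	if err != nil {
		log.Fatal(err)
	}

	// Register the provider globally so any component can create instruments
	// via otel.Meter(...).
	provider := sdkmetric.NewMeterProvider(sdkmetric.WithReader(exporter))
	otel.SetMeterProvider(provider)

	// Expose a /metrics endpoint for the Prometheus server to scrape.
	// Assumes the exporter registered with the default Prometheus registry.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9090", nil))
}
```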

OpenTelemetry

"OpenTelemetry is a collection of tools, APIs, and SDKs. Use it to instrument, generate, collect, and export telemetry data (metrics, logs, and traces) to help you analyze your software's performance and behavior." (https://opentelemetry.io/)

Why OpenTelemetry?

  1. Open Standards and Data Portability: Its format is vendor neutral, which means you can change the backend that consumes your telemetry data without changing the way you collect it.
  2. Contributors and Adopters: OpenTelemetry SDKs provide great automatic instrumentation because of their community of contributors and vendors. You can see a full list of adopters here: https://github.com/open-telemetry/community/blob/main/ADOPTERS.md.
  3. CNCF: OpenTelemetry is one of the two most active CNCF open-source projects, which reinforces the previous point about its community.

Alternatives

There are few compelling alternatives to OpenTelemetry in this space.

High-Level Design

(High-level design diagram.)

Radius Metrics

We will gather metrics from the Radius services and from the clients used within those services.

Radius Services are as follows:

  1. Core RP
  2. Link RP
  3. UCP
  4. Deployment Engine

Helper Services:

  1. Worker Server (for Async Operations)

Clients that we use are as follows:

  1. Azure Client
  2. AWS Client
  3. Kubernetes Client
  4. Data Store Client
  5. Queue Client
  6. Secret Client

Proposed Radius Metrics

System Metrics

  • CPU Metrics
  • Memory Metrics
  • Network Metrics

Goroutine Metrics
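
For the system and goroutine metrics, one option is the OpenTelemetry Go contrib instrumentation packages, which collect host/process CPU, memory, and network metrics and Go runtime metrics (heap, GC, goroutine count) respectively. The sketch below assumes those packages and a globally registered MeterProvider; it is an illustration, not the final Radius implementation.

```go
package telemetry

import (
	"time"

	"go.opentelemetry.io/contrib/instrumentation/host"
	"go.opentelemetry.io/contrib/instrumentation/runtime"
)

// StartSystemMetrics begins collecting host/process metrics (CPU, memory,
// network) and Go runtime metrics (heap, GC, goroutine count) using the
// globally registered MeterProvider.
func StartSystemMetrics() error {
	if err := runtime.Start(runtime.WithMinimumReadMemStatsInterval(10 * time.Second)); err != nil {
		return err
	}
	return host.Start()
}
```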

DotNet Runtime Metrics

Common HTTP Server Metrics

  • Request Duration (Histogram)
  • Request Size (Histogram)
  • Response Size (Histogram)
  • Active Requests (UpDownCounter)
  • Response Grouped By Status Code (500s, 400s, 200s...)
  • Response Grouped By Resource and Action (ex: PUT Container)
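
A hedged sketch of how these server metrics could be recorded in a Go HTTP middleware follows. The instrument names, the resourceTypeFromPath helper, and the exact SDK calls are illustrative assumptions rather than the final Radius implementation; an existing instrumentation package such as otelhttp could also cover the common parts. The key point is that status code, method, and resource/action are attached as attributes so the backend can aggregate by them.

```go
package middleware

import (
	"net/http"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// statusRecorder captures the status code written by the wrapped handler.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// resourceTypeFromPath is a hypothetical helper that would map a request path
// to a resource type such as "Applications.Core/containers".
func resourceTypeFromPath(path string) string { return path }

// Metrics wraps an http.Handler and records request duration and active
// requests, attaching status code, method, and resource type as attributes.
func Metrics(next http.Handler) http.Handler {
	meter := otel.Meter("radius.http.server")
	// Error handling omitted for brevity in this sketch.
	duration, _ := meter.Float64Histogram("http.server.duration", metric.WithUnit("ms"))
	active, _ := meter.Int64UpDownCounter("http.server.active_requests")

	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		active.Add(r.Context(), 1)
		defer active.Add(r.Context(), -1)

		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		next.ServeHTTP(rec, r)

		duration.Record(r.Context(), float64(time.Since(start).Milliseconds()),
			metric.WithAttributes(
				attribute.Int("http.status_code", rec.status),
				attribute.String("http.method", r.Method),
				attribute.String("resource_type", resourceTypeFromPath(r.URL.Path)),
			))
	})
}
```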

Common Client Metrics

  • Request Duration (Histogram)
  • Request Size (Histogram)
  • Response Size (Histogram)
  • Response Grouped By Requested Action
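
Because the outgoing Azure, AWS, and Kubernetes SDK clients can typically be configured with a custom http.RoundTripper, a single transport wrapper can record these client metrics for all of them. The following is a sketch under that assumption; the instrument names and the "client" attribute are illustrative, not the final Radius names.

```go
package clients

import (
	"net/http"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// metricsTransport wraps an http.RoundTripper and records the duration of
// every outgoing request, tagged with the owning client and the response code.
type metricsTransport struct {
	base     http.RoundTripper
	client   string // e.g. "azure", "aws", "kubernetes"
	duration metric.Float64Histogram
}

// NewMetricsTransport builds a transport wrapper for one of the outgoing
// clients. It can be plugged in wherever the SDK accepts a custom RoundTripper.
func NewMetricsTransport(base http.RoundTripper, client string) (http.RoundTripper, error) {
	if base == nil {
		base = http.DefaultTransport
	}
	meter := otel.Meter("radius.http.client")
	duration, err := meter.Float64Histogram("http.client.duration", metric.WithUnit("ms"))
	if err != nil {
		return nil, err
	}
	return &metricsTransport{base: base, client: client, duration: duration}, nil
}

func (t *metricsTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	start := time.Now()
	resp, err := t.base.RoundTrip(req)

	status := 0
	if resp != nil {
		status = resp.StatusCode
	}
	t.duration.Record(req.Context(), float64(time.Since(start).Milliseconds()),
		metric.WithAttributes(
			attribute.String("client", t.client),
			attribute.String("http.method", req.Method),
			attribute.Int("http.status_code", status),
		))
	return resp, err
}
```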

Core RP and Link RP Metrics

  • Common HTTP Server Metrics
  • Additional Metrics

UCP Metrics

  • Common HTTP Server Metrics
  • Requests Grouped By Plane
  • Additional Metrics

Deployment Engine Metrics

  • Common HTTP Server Metrics
  • Provisioning Time
  • Response Grouped By Provisioning Status (Failed, Completed...)

Azure Client

  • Common Client Metrics
  • Additional Metrics

AWS Client

  • Common Client Metrics
  • Additional Metrics

Kubernetes Client

  • Common Client Metrics
  • Additional Metrics

Data Store Client

  • Common Client Metrics
  • Additional Metrics

Queue Client

  • Common Client Metrics
  • Group By Operation Result and Resource
  • Message Count

Worker Server

  • Number of Workers per Service (Core RP, Link RP)
  • Status of each Worker
  • Average number of Messages processed per Worker
  • Average time it takes to process a Message
  • Number of Successful and Failed processings
  • Number of Extended Messages
  • Messages grouped by Resource Type and Action (ex: Container PUT)
  • Number of Duplicated Messages
  • Number of Timed out Operations
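
A sketch of how a worker could report the success/failure counts and per-message processing time listed above, with resource type and action attached as attributes, is shown below. The instrument and attribute names are assumptions for illustration, not the final Radius names.

```go
package worker

import (
	"context"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// Metrics bundles the instruments an async-operation worker uses to report
// message processing.
type Metrics struct {
	processed metric.Int64Counter     // total processed messages, tagged with result
	duration  metric.Float64Histogram // per-message processing time
}

func NewMetrics() (*Metrics, error) {
	meter := otel.Meter("radius.worker")
	processed, err := meter.Int64Counter("worker.messages.processed")
	if err != nil {
		return nil, err
	}
	duration, err := meter.Float64Histogram("worker.message.duration", metric.WithUnit("ms"))
	if err != nil {
		return nil, err
	}
	return &Metrics{processed: processed, duration: duration}, nil
}

// RecordMessage is called once per processed message with its outcome.
func (m *Metrics) RecordMessage(ctx context.Context, resourceType, action string, elapsed time.Duration, succeeded bool) {
	result := "failed"
	if succeeded {
		result = "succeeded"
	}
	attrs := metric.WithAttributes(
		attribute.String("resource_type", resourceType), // e.g. "Applications.Core/containers"
		attribute.String("action", action),              // e.g. "PUT"
		attribute.String("result", result),
	)
	m.processed.Add(ctx, 1, attrs)
	m.duration.Record(ctx, float64(elapsed.Milliseconds()), attrs)
}
```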

Phases of Work

The main goal is to get the foundation up and running in phase 1. After building the foundation, the work can be parallelized between scrum teams based on the expertise of each team.

  • Phase 1 (Laying the foundation)
    • Metrics:
      • System Metrics
      • Goroutine Metrics
  • Phase 2 (Laying the foundation for the HTTP Servers)
    • Prerequisites:
      • Phase 1
    • Metrics:
      • Common HTTP Server Metrics
    • Services:
      • UCP
      • Core RP
      • Link RP
  • Phase 3 (Laying the foundation for the Clients)
    • Prerequisites:
      • Phase 1
    • Metrics:
      • Common Client Metrics
    • Clients:
      • Azure Client
      • AWS Client
  • Phase 4 (Clients cont'd)
    • Prerequisites:
      • Phase 3
    • Clients:
      • Data Store Client
      • Kubernetes Client
      • Queue Client
      • Worker Server
  • Phase 5 (Working on the Deployment Engine side)
    • Prerequisites:
      • Phase 1
    • Metrics:
      • Dotnet Metrics
    • Service:
      • Deployment Engine
  • Phase 6 (Creating the Example)
    • Prerequisites:
      • Almost all phases if we want to showcase all the services and clients

Risks

This is a low-risk project. Metric collection is not expected to have a significant performance impact on the system.

Future Work

  • CLI Metrics: CLI metrics could be useful mainly because a team may want to know which commands are used the most.
  • More Examples: We can provide more examples with other third-party tools offered by Azure and AWS.

Open Questions

Could we collect metrics from end-to-end resource creation to get the end-to-end completion rate?

References

  1. https://margara.faculty.polimi.it/papers/2020_debs_kaiju.pdf
  2. https://github.com/kubernetes/design-proposals-archive/blob/main/instrumentation/monitoring_architecture.md
  3. https://github.com/dapr/dapr/blob/master/docs/development/dapr-metrics.md
  4. https://opentelemetry.io/docs/
  5. https://github.com/open-telemetry/opentelemetry-specification/tree/main/specification/resource/semantic_conventions
  6. https://medium.com/jaegertracing/jaeger-embraces-opentelemetry-collector-90a545cbc24
  7. https://opentelemetry.io/docs/reference/specification/metrics/semantic_conventions/http-metrics/

AB#5058

@youngbupark

  • Why do we need to collect CLI metrics? How would we collect users' data? Will we spin up an ingestion pipeline for this? As we spoke, the CLI is out of scope in the current metrics work since the CLI runs in the user's context, not the Radius service context. Also, we can collect CLI behaviors from the server side.
  • Can you define the DE metrics consistently with the other RPs? These are server-side metrics, so they should have the same names and dimensions.
  • What are the goals and non-goals of this proposal?
  • Gathering Kubernetes metrics is out of scope. This needs to focus only on Radius metrics.
  • What is the proposal for the telemetry backend service infra? Which telemetry client will you use, e.g., OpenTelemetry?
  • The below metrics are missing:
    1. Client SDK metrics too, such as kube client / Azure client SDK / UCP client / etc.
    2. Backend worker metrics, such as how many operations failed or succeeded.
  • What minimum set of attributes (a.k.a. dimensions) will be included in each metric?

@bjoginapally

  • We should include UCP metrics, as that is going to be part of the public release.
  • Are you imagining a separate Prometheus handler for each of the arrows (-> /metrics)? If you look at the implementation in Core RP, it spins up a server for the handler. Do we have control over kube components to do that, for example kube-controller-manager or kube-scheduler?
  • Are we going to package a single Prometheus collector for each component like Core RP, Link RP, UCP, and the other components? How would this Prometheus collector run out of the box?
  • Maybe you can divide this into phases, starting with UCP since we are targeting it for the public release, and then move on to other components?

@rynowak

rynowak commented Nov 15, 2022

Good start!

Why do we need to collect CLI metrics? How would we collect users' data? Will we spin up an ingestion pipeline for this? As we spoke, the CLI is out of scope in the current metrics work since the CLI runs in the user's context, not the Radius service context. Also, we can collect CLI behaviors from the server side.

+1

We should avoid telemetry in the CLI. Collecting telemetry in CLI/developer tools always surprises people in an open-source project.

Client SDK metrics too, such as kube client / Azure client SDK / UCP client / etc.

Strong ack from me on tracking metrics for our outgoing calls and operations. The data store and secret clients are things we definitely need to instrument.

@youngbupark

youngbupark commented Nov 16, 2022

  1. System metrics - There is no good way to collect network metrics; also, network metrics will be covered by the server/client metrics. So we can deliver CPU/memory/goroutine metrics (optional).

System Metrics
CPU Metrics
Memory Metrics
Network Metrics

  2. These are database-side metrics. I do not think Radius can collect these metrics as a client.

Common Database Metrics
This will only apply if active Radius is using a SQL or a NoSQL Database as the Data Store.
Connection Usage (UpDownCounter)
Pending Requests (UpDownCounter)
Average Operation Time
Operation Types

  3. RPS/percentile needs to be covered.

Common HTTP Server Metrics
Request Duration (Histogram)
Request Size (Histogram)
Response Size (Histogram)
Active Requests (UpDownCounter)

Is this RPS? Don't we collect percentiles of failed and successful requests?

Response Status Codes

This will be part of the attributes, so Prometheus can aggregate by status code; it should not be a separate metric.

  4. Looks like async operation worker metrics are missing. Can you please add async operation worker metrics too? Also, Link RP and Core RP share the same metrics; we do not need to separate the two.

  5. Please describe the high-level metrics infrastructure in three sections - client / backend / analytics UI tool.
    For instance:

  • Client SDK: OpenTelemetry metrics SDK
  • Telemetry backend: Prometheus
  • Analytics/Dashboard: Grafana

  6. I recommend delivering all three parts (SDK, backend, dashboard template) in each phase, just as we deliver both RP and CLI changes for new features. Please add the summary/goal for each phase.

  7. Please double-check whether the AWS/Azure/Kube clients use Go's built-in HTTP client; then we can create one HTTP client middleware for all three clients and get those client metrics for free.

@rynowak

rynowak commented Nov 16, 2022

For database metrics we should still collect these when we're using the APIServer store:

  • Average Operation Time
  • Operation Types

In particular I'm curious to see performance data for the APIServer store. There's definitely a scale limit to how far it can go and we'll want to advise users about when to switch to something else.


One thing that's missing here is overall operation time for our async operations. We should be able to track and report the e2e time of each async operation even though it crosses a component boundary.


Something that's missing here is dimensions. All of these are described right now as a single value, but metrics are more flexible than that.

For example, for metrics in our RPs, the operation (e.g., Applications.Core/containers/PUT) might be one of the most important dimensions. It's entirely reasonable to expect that the performance of different resources will have different characteristics.

@youngbupark

For database metrics we should still collect these when we're using the APIServer store:

  • Average Operation Time
  • Operation Types

Yup, these metrics are outgoing client-side metrics, so they should be tracked.

@vinayada1

For UCP, these metrics could be useful:

  • Upstream response time
  • Total response time
  • Error rate by status code, e.g., number of 400s / total requests
  • Request Rate
  • Number of requests by plane

@youngbupark

youngbupark commented Nov 17, 2022

As Ryan and I mentioned above, can you please add async operation progress-related metrics, such as processing time per operation or per resource type, and the number of failures and successes? Also, please add more context on the high-level goals for each phase.

@lakshmimsft

lakshmimsft commented Nov 18, 2022

I asked Yetkin questions about how we're approaching data sizing, defaults, and the configuration possible for data retention policies, as well as long-term thoughts on helping Radius clients who want to store their data long term.
Found some initial documentation:
https://prometheus.io/docs/prometheus/latest/storage/
https://prometheus.io/docs/prometheus/1.8/storage/
We can possibly add some info to this proposal.

@youngbupark

youngbupark commented Nov 18, 2022

I asked Yetkin questions on how we're approaching data sizing, defaults and configuration possible on data retention policies and long term thoughts on helping radius clients wanting to store their data long term etc. Found some initial documentation: https://prometheus.io/docs/prometheus/latest/storage/ https://prometheus.io/docs/prometheus/1.8/storage/ We can possibly add some info to this proposal.

Data retention is related to how the metric collector and observability platform (Prometheus, ELK, etc.) are set up. This is out of scope for this feature. Each user has their own observability platform, such as Datadog, Azure Monitor, Stackdriver, CloudWatch, etc., and data retention configuration is their own preference based on their org policy. All of these observability platforms can collect Prometheus metrics from each service without installing an additional Prometheus server, because a Prometheus pull-mode endpoint is now the de facto approach in cloud-native environments. We want to use the Prometheus metrics protocol, but not force users to use only the Prometheus collector and its pipeline.

In other words, even though we can provide a sample Prometheus setup tutorial, it is just an example showing how to collect Radius metrics. The whole infrastructure setup is up to Radius users.

@snehabandla

Would it be relevant to break some of these down by resource type? It might help in understanding which resources are most commonly used and where we need to improve response time.
