Author: Yetkin Timocin (@ytimocin)
Last Updated: 11/15/2022
Status: [Work in Progress]
This document proposes a design for the set of metrics in Radius.
This proposal covers adding OpenTelemetry metrics to Radius itself; it does not cover instrumenting user applications. The main goal is to make Radius supportable and observable using industry-standard patterns, which means relying on existing, widely used instrumentation solutions.
- Instrumentation of Radius with OpenTelemetry.
- Metrics for AppCoreRP, UCP, Deployment Engine.
- A working example for Radius customers using Prometheus and Grafana.
- Traces and logs.
- CLI Metrics.
As mentioned above, we will use OpenTelemetry, a vendor-neutral, open-source observability framework. The working example will use Prometheus and Grafana.
- Client SDK: OpenTelemetry Metrics SDK
- Telemetry Backend: Prometheus
- Analytics/Dashboard: Grafana
"OpenTelemetry is a collection of tools, APIs, and SDKs. Use it to instrument, generate, collect, and export telemetry data (metrics, logs, and traces) to help you analyze your software's performance and behavior." (https://opentelemetry.io/)
- Open Standards and Data Portability: Its data format is vendor-neutral, which means you can change the backend that consumes your telemetry data without changing how you collect it.
- Contributors and Adopters: OpenTelemetry SDKs benefit from a large community of contributors and vendors, which shows in the breadth of their automatic instrumentation. You can see a full list of adopters here: https://github.com/open-telemetry/community/blob/main/ADOPTERS.md.
- CNCF: OpenTelemetry is one of the two most active CNCF projects, which reinforces the previous point about contributors and adopters.
There are few compelling alternatives to OpenTelemetry in this space.
We will gather metrics from the Radius Services and from the Clients that those services use.
Radius Services are as follows:
- Core RP
- Link RP
- UCP
- Deployment Engine
Helper Services:
- Worker Server (for Async Operations)
Clients that we use are as follows:
- Azure Client
- AWS Client
- Kubernetes Client
- Data Store Client
- Queue Client
- Secret Client
- CPU Metrics
- Memory Metrics
- Network Metrics
- Request Duration (Histogram)
- Request Size (Histogram)
- Response Size (Histogram)
- Active Requests (UpDownCounter)
- Response Grouped By Status Code (500s, 400s, 200s...)
- Response Grouped By Resource and Action (ex: PUT Container)
- Request Duration (Histogram)
- Request Size (Histogram)
- Response Size (Histogram)
- Response Grouped By Requested Action
- Common HTTP Server Metrics
- Additional Metrics
- Common HTTP Server Metrics
- Requests Grouped By Plane
- Additional Metrics
- Common HTTP Server Metrics
- Provisioning Time
- Response Grouped By Provisioning Status (Failed, Completed...)
- Common Client Metrics
- Additional Metrics
- Common Client Metrics
- Additional Metrics
- Common Client Metrics
- Additional Metrics
- Common Client Metrics
- Additional Metrics
- Common Client Metrics
- Group By Operation Result and Resource
- Message Count
- Number of Workers per Service (Core RP, Link RP)
- Status of each Worker
- Average number of Messages processed per Worker
- Average time it takes to process a Message
- Number of Successful and Failed processings
- Number of Extended Messages
- Messages grouped by Resource Type and Action (ex: Container PUT)
- Number of Duplicated Messages
- Number of Timed out Operations
The main goal of Phase 1 is to get the foundation up and running. Once the foundation is in place, the remaining work can be parallelized across scrum teams based on each team's expertise.
- Phase 1 (Laying the foundation)
  - Metrics:
    - System Metrics
    - Goroutine Metrics
- Phase 2 (Laying the foundation for the HTTP Servers)
  - Prerequisites:
    - Phase 1
  - Metrics:
    - Common HTTP Server Metrics
  - Services:
    - UCP
    - Core RP
    - Link RP
- Phase 3 (Laying the foundation for the Clients)
  - Prerequisites:
    - Phase 1
  - Metrics:
    - Common Client Metrics
  - Clients:
    - Azure Client
    - AWS Client
- Phase 4 (Clients cont'd)
  - Prerequisites:
    - Phase 3
  - Clients:
    - Data Store Client
    - Kubernetes Client
    - Queue Client
    - Worker Server
- Phase 5 (Working on the Deployment Engine side)
  - Prerequisites:
    - Phase 1
  - Metrics:
    - Dotnet Metrics
  - Service:
    - Deployment Engine
- Phase 6 (Creating the Example)
  - Prerequisites:
    - Almost all phases if we want to showcase all the services and clients
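The system and goroutine metrics named in Phase 1 can largely be read from the Go runtime. The following is a minimal sketch, assuming the Go services would sample these values periodically and report them through OpenTelemetry instruments. The `runtimeSnapshot` type, its field names, and the `collect` function are hypothetical and do not represent an agreed metric schema.

```go
package main

import (
	"fmt"
	"runtime"
)

// runtimeSnapshot holds a few process-level values that could be exported.
// The type and field names are illustrative only.
type runtimeSnapshot struct {
	Goroutines int    // goroutines that currently exist
	HeapAlloc  uint64 // bytes of allocated heap objects
	Sys        uint64 // total bytes of memory obtained from the OS
	NumGC      uint32 // completed garbage collection cycles
	CPUs       int    // logical CPUs usable by the process
}

// collect samples the Go runtime once. Note that ReadMemStats briefly
// stops the world, so it should not be called in a tight loop.
func collect() runtimeSnapshot {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	return runtimeSnapshot{
		Goroutines: runtime.NumGoroutine(),
		HeapAlloc:  ms.HeapAlloc,
		Sys:        ms.Sys,
		NumGC:      ms.NumGC,
		CPUs:       runtime.NumCPU(),
	}
}

func main() {
	// Park a few goroutines so the goroutine count is visibly above one.
	done := make(chan struct{})
	for i := 0; i < 5; i++ {
		go func() { <-done }()
	}

	s := collect()
	close(done)

	fmt.Printf("goroutines=%d heap_alloc_bytes=%d sys_bytes=%d gc_cycles=%d cpus=%d\n",
		s.Goroutines, s.HeapAlloc, s.Sys, s.NumGC, s.CPUs)
}
```

The Deployment Engine in Phase 5 is a .NET service, so it would need an equivalent based on the .NET runtime rather than this Go code.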
This is a low-risk project. It is not expected to have a serious performance impact on the system.
- CLI Metrics: These could be useful mainly because a team may want to know which commands are used the most.
- More Examples: We can provide more examples using other third-party tools provided by Azure and AWS.
Could we collect metrics from end-to-end resource creation to get the end-to-end completion rate?
- https://margara.faculty.polimi.it/papers/2020_debs_kaiju.pdf
- https://github.com/kubernetes/design-proposals-archive/blob/main/instrumentation/monitoring_architecture.md
- https://github.com/dapr/dapr/blob/master/docs/development/dapr-metrics.md
- https://opentelemetry.io/docs/
- https://github.com/open-telemetry/opentelemetry-specification/tree/main/specification/resource/semantic_conventions
- https://medium.com/jaegertracing/jaeger-embraces-opentelemetry-collector-90a545cbc24
- https://opentelemetry.io/docs/reference/specification/metrics/semantic_conventions/http-metrics/
I asked Yetkin how we are approaching data sizing, which defaults and configuration options are possible for data retention policies, and what our long-term thinking is for helping Radius customers who want to store their data long term.
Found some initial documentation:
https://prometheus.io/docs/prometheus/latest/storage/
https://prometheus.io/docs/prometheus/1.8/storage/
We can possibly add some info to this proposal.