[Observability] Core and Custom Metrics in Radius

Author: Yetkin Timocin (@ytimocin) Last Updated: 11/15/2022 Status: [Work in Progress]

This document proposes a design for the set of metrics in Radius.

Introduction

This proposal covers the addition of OpenTelemetry metrics for Radius. This proposal is not for instrumenting user applications. The main goal is the supportability and instrumentation of Radius using industry-standard patterns. That means we will use existing, widely used solutions to instrument Radius.

Goals

  • Instrumentation of Radius with OpenTelemetry.
  • Metrics for AppCoreRP, UCP, Deployment Engine.
  • A working example for Radius customers using Prometheus and Grafana.

Non-Goals

  • Traces and logs.
  • CLI Metrics.

Tools

As mentioned above, we will use the vendor-neutral, open-source observability framework OpenTelemetry. For the working example, we will use Prometheus and Grafana.

  • Client SDK: OpenTelemetry Metrics SDK
  • Telemetry Backend: Prometheus
  • Analytics/Dashboard: Grafana
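
As a rough illustration of how these pieces fit together, the sketch below creates a MeterProvider backed by the OpenTelemetry Prometheus exporter and exposes a /metrics endpoint for Prometheus to scrape. This is not the final Radius wiring, and API details vary between OpenTelemetry Go SDK versions; the port and the assumption that the exporter registers with the default Prometheus registry are illustrative.

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/prometheus"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func main() {
	// The Prometheus exporter acts as an OpenTelemetry metric Reader and
	// exposes collected metrics in the Prometheus exposition format.
	exporter, err := prometheus.New()
	if err != nil {
		log.Fatal(err)
	}

	// Register the provider globally so any component can create instruments
	// via otel.Meter(...).
	provider := sdkmetric.NewMeterProvider(sdkmetric.WithReader(exporter))
	otel.SetMeterProvider(provider)

	// Expose a /metrics endpoint for the Prometheus server to scrape.
	// Assumes the exporter registered with the default Prometheus registry.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9090", nil))
}
```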

OpenTelemetry

"OpenTelemetry is a collection of tools, APIs, and SDKs. Use it to instrument, generate, collect, and export telemetry data (metrics, logs, and traces) to help you analyze your software's performance and behavior." (https://opentelemetry.io/)

Why OpenTelemetry?

  1. Open Standards and Data Portability: Its format is vendor neutral, which means you can change the backend that consumes your telemetry data without changing the way you collect it.
  2. Contributors and Adopters: OpenTelemetry SDKs provide great automatic instrumentation because of their community of contributors and vendors. You can see a full list of adopters here: https://github.com/open-telemetry/community/blob/main/ADOPTERS.md.
  3. CNCF: OpenTelemetry is one of the two most active CNCF open-source projects, which reinforces the previous point about its community.

Alternatives

There are few compelling alternatives to OpenTelemetry in this space.

High-Level Design

(High-level design diagram.)

Radius Metrics

We will gather metrics from the Radius services and from the clients used within those services.

Radius Services are as follows:

  1. Core RP
  2. Link RP
  3. UCP
  4. Deployment Engine

Helper Services:

  1. Worker Server (for Async Operations)

Clients that we use are as follows:

  1. Azure Client
  2. AWS Client
  3. Kubernetes Client
  4. Data Store Client
  5. Queue Client
  6. Secret Client

Proposed Radius Metrics

System Metrics

  • CPU Metrics
  • Memory Metrics
  • Network Metrics

Goroutine Metrics
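
For the system and goroutine metrics, one option is the OpenTelemetry Go contrib instrumentation packages, which collect host/process CPU, memory, and network metrics and Go runtime metrics (heap, GC, goroutine count) respectively. The sketch below assumes those packages and a globally registered MeterProvider; it is an illustration, not the final Radius implementation.

```go
package telemetry

import (
	"time"

	"go.opentelemetry.io/contrib/instrumentation/host"
	"go.opentelemetry.io/contrib/instrumentation/runtime"
)

// StartSystemMetrics begins collecting host/process metrics (CPU, memory,
// network) and Go runtime metrics (heap, GC, goroutine count) using the
// globally registered MeterProvider.
func StartSystemMetrics() error {
	if err := runtime.Start(runtime.WithMinimumReadMemStatsInterval(10 * time.Second)); err != nil {
		return err
	}
	return host.Start()
}
```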

DotNet Runtime Metrics

Common HTTP Server Metrics

  • Request Duration (Histogram)
  • Request Size (Histogram)
  • Response Size (Histogram)
  • Active Requests (UpDownCounter)
  • Response Grouped By Status Code (500s, 400s, 200s...)
  • Response Grouped By Resource and Action (ex: PUT Container)
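
A hedged sketch of how these server metrics could be recorded in a Go HTTP middleware follows. The instrument names, the resourceTypeFromPath helper, and the exact SDK calls are illustrative assumptions rather than the final Radius implementation; an existing instrumentation package such as otelhttp could also cover the common parts. The key point is that status code, method, and resource/action are attached as attributes so the backend can aggregate by them.

```go
package middleware

import (
	"net/http"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// statusRecorder captures the status code written by the wrapped handler.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// resourceTypeFromPath is a hypothetical helper that would map a request path
// to a resource type such as "Applications.Core/containers".
func resourceTypeFromPath(path string) string { return path }

// Metrics wraps an http.Handler and records request duration and active
// requests, attaching status code, method, and resource type as attributes.
func Metrics(next http.Handler) http.Handler {
	meter := otel.Meter("radius.http.server")
	// Error handling omitted for brevity in this sketch.
	duration, _ := meter.Float64Histogram("http.server.duration", metric.WithUnit("ms"))
	active, _ := meter.Int64UpDownCounter("http.server.active_requests")

	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		active.Add(r.Context(), 1)
		defer active.Add(r.Context(), -1)

		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		next.ServeHTTP(rec, r)

		duration.Record(r.Context(), float64(time.Since(start).Milliseconds()),
			metric.WithAttributes(
				attribute.Int("http.status_code", rec.status),
				attribute.String("http.method", r.Method),
				attribute.String("resource_type", resourceTypeFromPath(r.URL.Path)),
			))
	})
}
```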

Common Client Metrics

  • Request Duration (Histogram)
  • Request Size (Histogram)
  • Response Size (Histogram)
  • Response Grouped By Requested Action
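
Because the outgoing Azure, AWS, and Kubernetes SDK clients can typically be configured with a custom http.RoundTripper, a single transport wrapper can record these client metrics for all of them. The following is a sketch under that assumption; the instrument names and the "client" attribute are illustrative, not the final Radius names.

```go
package clients

import (
	"net/http"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// metricsTransport wraps an http.RoundTripper and records the duration of
// every outgoing request, tagged with the owning client and the response code.
type metricsTransport struct {
	base     http.RoundTripper
	client   string // e.g. "azure", "aws", "kubernetes"
	duration metric.Float64Histogram
}

// NewMetricsTransport builds a transport wrapper for one of the outgoing
// clients. It can be plugged in wherever the SDK accepts a custom RoundTripper.
func NewMetricsTransport(base http.RoundTripper, client string) (http.RoundTripper, error) {
	if base == nil {
		base = http.DefaultTransport
	}
	meter := otel.Meter("radius.http.client")
	duration, err := meter.Float64Histogram("http.client.duration", metric.WithUnit("ms"))
	if err != nil {
		return nil, err
	}
	return &metricsTransport{base: base, client: client, duration: duration}, nil
}

func (t *metricsTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	start := time.Now()
	resp, err := t.base.RoundTrip(req)

	status := 0
	if resp != nil {
		status = resp.StatusCode
	}
	t.duration.Record(req.Context(), float64(time.Since(start).Milliseconds()),
		metric.WithAttributes(
			attribute.String("client", t.client),
			attribute.String("http.method", req.Method),
			attribute.Int("http.status_code", status),
		))
	return resp, err
}
```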

Core RP and Link RP Metrics

  • Common HTTP Server Metrics
  • Additional Metrics

UCP Metrics

  • Common HTTP Server Metrics
  • Requests Grouped By Plane
  • Additional Metrics

Deployment Engine Metrics

  • Common HTTP Server Metrics
  • Provisioning Time
  • Response Grouped By Provisioning Status (Failed, Completed...)

Azure Client

  • Common Client Metrics
  • Additional Metrics

AWS Client

  • Common Client Metrics
  • Additional Metrics

Kubernetes Client

  • Common Client Metrics
  • Additional Metrics

Data Store Client

  • Common Client Metrics
  • Additional Metrics

Queue Client

  • Common Client Metrics
  • Group By Operation Result and Resource
  • Message Count

Worker Server

  • Number of Workers per Service (Core RP, Link RP)
  • Status of each Worker
  • Average number of Messages processed per Worker
  • Average time it takes to process a Message
  • Number of Successful and Failed processings
  • Number of Extended Messages
  • Messages grouped by Resource Type and Action (ex: Container PUT)
  • Number of Duplicated Messages
  • Number of Timed out Operations
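
A sketch of how a worker could report the success/failure counts and per-message processing time listed above, with resource type and action attached as attributes, is shown below. The instrument and attribute names are assumptions for illustration, not the final Radius names.

```go
package worker

import (
	"context"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// Metrics bundles the instruments an async-operation worker uses to report
// message processing.
type Metrics struct {
	processed metric.Int64Counter     // total processed messages, tagged with result
	duration  metric.Float64Histogram // per-message processing time
}

func NewMetrics() (*Metrics, error) {
	meter := otel.Meter("radius.worker")
	processed, err := meter.Int64Counter("worker.messages.processed")
	if err != nil {
		return nil, err
	}
	duration, err := meter.Float64Histogram("worker.message.duration", metric.WithUnit("ms"))
	if err != nil {
		return nil, err
	}
	return &Metrics{processed: processed, duration: duration}, nil
}

// RecordMessage is called once per processed message with its outcome.
func (m *Metrics) RecordMessage(ctx context.Context, resourceType, action string, elapsed time.Duration, succeeded bool) {
	result := "failed"
	if succeeded {
		result = "succeeded"
	}
	attrs := metric.WithAttributes(
		attribute.String("resource_type", resourceType), // e.g. "Applications.Core/containers"
		attribute.String("action", action),              // e.g. "PUT"
		attribute.String("result", result),
	)
	m.processed.Add(ctx, 1, attrs)
	m.duration.Record(ctx, float64(elapsed.Milliseconds()), attrs)
}
```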

Phases of Work

The main goal is to get the foundation up and running in phase 1. After building the foundation, the work can be parallelized between scrum teams based on the expertise of each team.

  • Phase 1 (Laying the foundation)
    • Metrics:
      • System Metrics
      • Goroutine Metrics
  • Phase 2 (Laying the foundation for the HTTP Servers)
    • Prerequisites:
      • Phase 1
    • Metrics:
      • Common HTTP Server Metrics
    • Services:
      • UCP
      • Core RP
      • Link RP
  • Phase 3 (Laying the foundation for the Clients)
    • Prerequisites:
      • Phase 1
    • Metrics:
      • Common Client Metrics
    • Clients:
      • Azure Client
      • AWS Client
  • Phase 4 (Clients cont'd)
    • Prerequisites:
      • Phase 3
    • Clients:
      • Data Store Client
      • Kubernetes Client
      • Queue Client
      • Worker Server
  • Phase 5 (Working on the Deployment Engine side)
    • Prerequisites:
      • Phase 1
    • Metrics:
      • Dotnet Metrics
    • Service:
      • Deployment Engine
  • Phase 6 (Creating the Example)
    • Prerequisites:
      • Almost all phases if we want to showcase all the services and clients

Risks

This is a low-risk project. Metric collection is not expected to have a significant performance impact on the system.

Future Work

  • CLI Metrics: CLI metrics could be useful mainly because a team may want to know which commands are used the most.
  • More Examples: We can provide more examples with other third-party tools offered by Azure and AWS.

Open Questions

Could we collect metrics from end-to-end resource creation to get the end-to-end completion rate?

References

  1. https://margara.faculty.polimi.it/papers/2020_debs_kaiju.pdf
  2. https://github.com/kubernetes/design-proposals-archive/blob/main/instrumentation/monitoring_architecture.md
  3. https://github.com/dapr/dapr/blob/master/docs/development/dapr-metrics.md
  4. https://opentelemetry.io/docs/
  5. https://github.com/open-telemetry/opentelemetry-specification/tree/main/specification/resource/semantic_conventions
  6. https://medium.com/jaegertracing/jaeger-embraces-opentelemetry-collector-90a545cbc24
  7. https://opentelemetry.io/docs/reference/specification/metrics/semantic_conventions/http-metrics/

AB#5058

@youngbupark

  • Why do we need to collect CLI metrics? How would we collect users' data? Will we spin up an ingestion pipeline for this? As we spoke, the CLI is out of scope in the current metrics work since the CLI runs in the user's context, not the Radius service context. Also, we can collect CLI behaviors from the server side.
  • Can you define the DE metrics consistently with the other RPs? These are server-side metrics, so they should have the same names and dimensions.
  • What are the goals and non-goals of this proposal?
  • Gathering Kubernetes metrics is out of scope. This needs to focus only on Radius metrics.
  • What is the proposal for the telemetry backend service infra? Which telemetry client will you use, e.g., OpenTelemetry?
  • The below metrics are missing:
    1. Client SDK metrics too, such as kube client / Azure client SDK / UCP client / etc.
    2. Backend worker metrics, such as how many operations failed or succeeded.
  • What minimum set of attributes (a.k.a. dimensions) will be included in each metric?

@bjoginapally

  • We should include UCP metrics, as that is going to be part of the public release.
  • Are you imagining a separate Prometheus handler for each of the arrows (-> /metrics)? If you look at the implementation in Core RP, it spins up a server for the handler. Do we have control over kube components to do that, for example kube-controller-manager or kube-scheduler?
  • Are we going to package a single Prometheus collector for each component like Core RP, Link RP, UCP, and the other components? How would this Prometheus collector run out of the box?
  • Maybe you can divide this into phases, starting with UCP since we are targeting it for the public release, and then move on to other components?

@rynowak

rynowak commented Nov 15, 2022

Good start!

Why do we need to collect CLI metrics? How would we collect users' data? Will we spin up an ingestion pipeline for this? As we spoke, the CLI is out of scope in the current metrics work since the CLI runs in the user's context, not the Radius service context. Also, we can collect CLI behaviors from the server side.

+1

We should avoid telemetry in the CLI. Collecting telemetry in CLI/developer tools always surprises people in an open-source project.

Client SDK metrics too, such as kube client / Azure client SDK / UCP client / etc.

Strong ack from me on tracking metrics for our outgoing calls and operations. The data store and secret clients are things we definitely need to instrument.

@youngbupark

youngbupark commented Nov 16, 2022

  1. System metrics - There is no good way to collect network metrics; also, network metrics will be covered by the server/client metrics. So we can deliver CPU/memory/goroutine metrics (optional).

System Metrics
CPU Metrics
Memory Metrics
Network Metrics

  2. These are database-side metrics. I do not think Radius can collect these metrics as a client.

Common Database Metrics
This will only apply if active Radius is using a SQL or a NoSQL Database as the Data Store.
Connection Usage (UpDownCounter)
Pending Requests (UpDownCounter)
Average Operation Time
Operation Types

  3. RPS/percentile needs to be covered.

Common HTTP Server Metrics
Request Duration (Histogram)
Request Size (Histogram)
Response Size (Histogram)
Active Requests (UpDownCounter)

Is this RPS? Don't we collect percentiles of failed and successful requests?

Response Status Codes

This will be part of the attributes, so Prometheus can aggregate by status code; it should not be a separate metric.

  4. Looks like async operation worker metrics are missing. Can you please add async operation worker metrics too? Also, Link RP and Core RP share the same metrics; we do not need to separate the two.

  5. Please describe the high-level metrics infrastructure in three sections - client / backend / analytics UI tool.
    For instance:

  • Client SDK: OpenTelemetry metrics SDK
  • Telemetry backend: Prometheus
  • Analytics/Dashboard: Grafana

  6. I recommend delivering all three parts (SDK, backend, dashboard template) in each phase, just as we deliver both RP and CLI changes for new features. Please add the summary/goal for each phase.

  7. Please double-check whether the AWS/Azure/Kube clients use Go's built-in HTTP client; then we can create one HTTP client middleware for all three clients and get those client metrics for free.

@rynowak

rynowak commented Nov 16, 2022

For database metrics we should still collect these when we're using the APIServer store:

  • Average Operation Time
  • Operation Types

In particular I'm curious to see performance data for the APIServer store. There's definitely a scale limit to how far it can go and we'll want to advise users about when to switch to something else.


One thing that's missing here is overall operation time for our async operations. We should be able to track and report the e2e time of each async operation even though it crosses a component boundary.


Something that's missing here is dimensions. All of these are described right now as a single value, but metrics are more flexible than that.

For example, for metrics in our RPs, the operation (e.g., Applications.Core/containers/PUT) might be one of the most important dimensions. It's entirely reasonable to expect that the performance of different resources will have different characteristics.

@youngbupark

For database metrics we should still collect these when we're using the APIServer store:

  • Average Operation Time
  • Operation Types

Yup, these metrics are outgoing client-side metrics, so they should be tracked.

@vinayada1

For UCP, these metrics could be useful:

  • Upstream response time
  • Total response time
  • Error rate by status code, e.g., number of 400s / total requests
  • Request Rate
  • Number of requests by plane

@youngbupark

youngbupark commented Nov 17, 2022

As Ryan and I mentioned above, can you please add async operation progress-related metrics, such as processing time per operation or per resource type, and the number of failures and successes? Also, please add more context on the high-level goals for each phase.

@lakshmimsft

lakshmimsft commented Nov 18, 2022

I asked Yetkin questions about how we're approaching data sizing, defaults, and the configuration possible for data retention policies, as well as long-term thoughts on helping Radius clients who want to store their data long term.
Found some initial documentation:
https://prometheus.io/docs/prometheus/latest/storage/
https://prometheus.io/docs/prometheus/1.8/storage/
We can possibly add some info to this proposal.

@youngbupark

youngbupark commented Nov 18, 2022

I asked Yetkin questions on how we're approaching data sizing, defaults and configuration possible on data retention policies and long term thoughts on helping radius clients wanting to store their data long term etc. Found some initial documentation: https://prometheus.io/docs/prometheus/latest/storage/ https://prometheus.io/docs/prometheus/1.8/storage/ We can possibly add some info to this proposal.

Data retention is related to how the metric collector and observability platform (Prometheus, ELK, etc.) are set up. This is out of scope for this feature. Each user has their own observability platform, such as Datadog, Azure Monitor, Stackdriver, CloudWatch, etc., and data retention configuration is their own preference based on their org policy. All of these observability platforms can collect Prometheus metrics from each service without installing an additional Prometheus server, because a Prometheus pull-mode endpoint is now the de facto approach in cloud-native environments. We want to use the Prometheus metrics protocol, but not force users to use only the Prometheus collector and its pipeline.

In other words, even though we can provide a sample Prometheus setup tutorial, it is just an example showing how to collect Radius metrics. The whole infrastructure setup is up to Radius users.

@snehabandla

Would it be relevant to break some of these down by resource type? It might help in understanding which resources are most commonly used and where we need to improve response time.
