lmolkova/metrics_azure_sdk.md

## metrics_azure_sdk.md

      
Metrics in Azure SDK for Java

The immediate goal is to report metrics from Azure messaging SDKs (EventHubs and ServiceBus) to help customers detect and investigate configuration issues, performance bottlenecks, and application and SDK bugs.
It can be broken down into smaller goals:

define metrics essential for messaging scenarios
define the Metrics API in azure-core
provide metrics plugin implementations

Scenarios

User scenarios

We expect users to want to know how many messages were received, processed and checkpointed; the delay of the messages consumers receive; batch size; success rate of network operations; and other key metrics we're going to define.
Some of these metrics can be calculated from traces, but not all of them, and we're going to focus on the latter. Metrics would provide a cheaper, more performant and more production-ready solution than tracing.
We expect users to already have some metrics solution in their app. Based on a Spring One survey, 90%+ of attendees use an APM tool (for logs, metrics, or traces); of them, ~20%+ use Prometheus and ~30% use Azure Monitor.
SDK scenarios


Supportability: our TSGs should include steps that ask users to check metrics emitted by the SDK instead of verbose logs. This would help narrow down problems without reconfiguring logging and reproducing the issue.
Stress tests: assuming SDKs report metrics, stress tests would be just a regular user of this feature. If we see an issue in a stress test run, we can use built-in metrics to investigate it in the same way users would.

Usage beyond messaging SDKs


HTTP-based SDKs: limited scope that can be handled in core. Metrics can be reported automatically from tracing calls, before sampling.
Thick clients: CosmosDB (already uses Micrometer in Java, has a similar ask for .NET)

Beyond Java


.NET: OTel Metrics are included in DiagnosticSource 6.0.
Python: OTel metrics API is in RC
JS: OTel metrics are in development
Go: Alpha
C++: Alpha

The proposal here is to polish the scenario in Java, where we have a partner ask, and learn from it before doing any work in other languages.
Metric solution

We're going to pick the solution that

works for Spring Cloud
works with Azure Monitor and Prometheus
is compatible with a variety of other APM vendors
has a fair amount of existing instrumentations


Summary


Define Meter API abstractions in azure-core
Provide an OTel-based implementation of the Meter APIs
Spring will keep using Micrometer and will provide a Micrometer-based implementation similar to this sample

OTel vs Micrometer analysis
Metrics API

Closely follow the OTel metrics API; Micrometer APIs are quite similar.
Implement only a subset, since performance is more important than convenience:

Arch board review
azure-core API view
azure-core-metrics-opentelemetry: API view

Naming choice: Azure prefix is added to avoid collision with OTel Meter.
Example of usage in Client libraries

// Attributes, including the possible error status, can be created upfront, usually along with the client instance.
Map<String, Object> successAttributes = createAttributes("http://service-endpoint.azure.com", false);
Map<String, Object> errorAttributes = createAttributes("http://service-endpoint.azure.com", true);

// Create instruments for possible error codes. This can be done lazily, once a specific error code is received.
AzureLongCounter successfulHttpConnections = defaultMeter.createLongCounter("az.core.http.connections",
    "Number of created HTTP connections", null, successAttributes);

AzureLongCounter failedHttpConnections = defaultMeter.createLongCounter("az.core.http.connections",
    "Number of created HTTP connections", null, errorAttributes);

boolean success = false;
try {
    success = connect();
} finally {
    if (success) {
        successfulHttpConnections.add(1, currentContext);
    } else {
        failedHttpConnections.add(1, currentContext);
    }
}
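The snippet above calls a `createAttributes` helper without showing it. A minimal sketch of what such a helper might look like follows. The class name `ConnectionAttributes` and the attribute keys `endpoint` and `error` are assumptions made for illustration; they are not names from the azure-core API or from any published convention.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, shown only to make the usage example above self-explanatory.
final class ConnectionAttributes {
    // Assumed attribute keys; the real convention may use different names.
    static final String ENDPOINT_KEY = "endpoint";
    static final String ERROR_KEY = "error";

    private ConnectionAttributes() {
    }

    /**
     * Builds an immutable attribute map once, so the same instance can be
     * reused for every measurement instead of being allocated per call.
     */
    static Map<String, Object> createAttributes(String endpoint, boolean error) {
        Map<String, Object> attributes = new LinkedHashMap<>();
        attributes.put(ENDPOINT_KEY, endpoint);
        attributes.put(ERROR_KEY, error);
        return Collections.unmodifiableMap(attributes);
    }
}
```

Building the map once per client and per outcome keeps attribute allocation off the hot path, which matches the preference for performance over convenience stated earlier.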
Users apps

Basic

// configure OpenTelemetry SDK as usual and register global configuration
SdkMeterProvider meterProvider = SdkMeterProvider.builder()
    .registerMetricReader(PeriodicMetricReader.builder(OtlpGrpcMetricExporter.builder().build()).build())
    .build();

OpenTelemetrySdk.builder()
    .setMeterProvider(meterProvider)
    .setPropagators(ContextPropagators.create(W3CTraceContextPropagator.getInstance()))
    .buildAndRegisterGlobal();

// configure Azure Client, no metric configuration needed, client will use global OTel configuration
AzureClient sampleClient = new AzureClientBuilder()
    .endpoint("https://my-client.azure.com")
    .build();

// use client as usual; if it emits metrics, they will be exported
sampleClient.methodCall("get items", Context.NONE);
Custom configuration along with tracing

// configure OpenTelemetry SDK as usual
SdkMeterProvider meterProvider = SdkMeterProvider.builder()
    .registerMetricReader(PeriodicMetricReader.builder(OtlpGrpcMetricExporter.builder().build()).build())
    .build();

OpenTelemetry openTelemetry = OpenTelemetrySdk.builder()
    .setMeterProvider(meterProvider)
    .setPropagators(ContextPropagators.create(W3CTraceContextPropagator.getInstance()))
    .build();

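// Pass OTel meterProvider to MetricsOptions - it will be used instead of the implicit global singleton.
MetricsOptions customMetricsOptions = new MetricsOptions()
    .setProvider(meterProvider);

// configure Azure Client and hand customMetricsOptions to it through ClientOptions
// (assumes the sample builder exposes clientOptions(...), as azure-core-based builders usually do)
AzureClient sampleClient = new AzureClientBuilder()
    .endpoint("https://my-client.azure.com")
    .clientOptions(new ClientOptions().setMetricsOptions(customMetricsOptions))
    .build();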

Span span = openTelemetry.getTracer("azure-core-samples")
    .spanBuilder("doWork")
    .startSpan();

try (Scope scope = span.makeCurrent()) {
    // do some work

    // Current context flows to OpenTelemetry metrics and is used to populate exemplars
    String response = sampleClient.methodCall("get items");
    // do more work
}

span.end();
Messaging Metrics

Prior art


Current EventHubs broker metrics

Requests: incoming/outgoing, success, throttles
Messages: incoming, outgoing, captured
Bytes: incoming, outgoing, size of EH


Track 1 ServiceBus performance counters

SendMessage/ReceiveMessage/CompleteMessage/AcceptSession/CancelScheduled

count (error/success)
rate (error/success)
duration
per namespace and per entity


Exceptions: count and rate (by type)
TokenAcquisition: rate (success/error), latency
Pending ReceiveMessage/AcceptMessageSession/AcceptMessageSessionByNamespace: count
EventProcessor process: latency, batch size
Connections: reset count (per entity), redirect count
Prefetch queue size and depth(?) per entity
Throughput (in/out): byte rate (per ns/entity)


XBox EventHubs Perf Counters - internal

Blob offset store: time since last offset flush
Producer (per topic):

Latency
Throughput
Request rate, retry rate, timeout rate, error receive rate
Transmission error rate
Event rate per partition


Buffered producer

Queue size
Queue full rate
Enqueue rate
Batch size
Event Time in queue


Consumer (per topic, per partition)

Lag (which is two other metrics)

last published (seqNo) - last received (seqNo)
Receive rate (success/error)
Producer-to-consumer latency (receive timestamp - enqueued-time) - approx?
Seconds to Zero: Lag / consumption rate
Consumption rate
Consumer queue size
Delivery queue: size, incoming rate, delivery rate, delivery failure rate
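The consumer lag and "seconds to zero" counters in this list are derived values. A minimal sketch of the arithmetic follows, reading lag as the distance between the last published and the last received sequence numbers, and seconds to zero as lag divided by consumption rate (the dimensionally consistent reading, since messages divided by messages per second yields seconds). The class and method names are hypothetical.

```java
// Hypothetical helper illustrating how the derived consumer metrics relate.
final class ConsumerLag {
    private ConsumerLag() {
    }

    /** Lag in messages: how far the consumer is behind the last published message. */
    static long lag(long lastPublishedSeqNo, long lastReceivedSeqNo) {
        // Clamp at zero in case the two readings were taken at slightly different times.
        return Math.max(0L, lastPublishedSeqNo - lastReceivedSeqNo);
    }

    /**
     * Seconds needed to drain the lag at the given consumption rate,
     * assuming nothing new is published in the meantime.
     */
    static double secondsToZero(long lag, double consumptionRatePerSecond) {
        if (lag == 0L) {
            return 0.0;
        }
        if (consumptionRatePerSecond <= 0.0) {
            // A stalled consumer never catches up.
            return Double.POSITIVE_INFINITY;
        }
        return lag / consumptionRatePerSecond;
    }
}
```

For example, a consumer that last received sequence number 1000 while the broker is at 1500 has a lag of 500 messages; at 50 messages per second it would need 10 seconds to catch up.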


Current Kafka metrics

Producer:

batch size, splits
throughput: outgoing bytes, compression rate
metadata age
throttle time
record: errors, rate, time in send buffer, retries, size
request: rate, size, active
response rate, bytes


Connections: close, creation, io stats
Consumer:

fetch: latency, rate, size, throttle time, counts
records: rate, lag, batch size
bytes consumed
consumer groups: partitions, commit latency, rate, join rate, etc


Kafka proposal

intent: Kafka client library internals observability
metrics:

connections creations/active, errors
requests: rate, rtt, errors
internal queue latency, size
client io wait time
producer queue size, bytes
consumer

poll interval, latency, last time
consumer queue count, bytes
consumer group: errors, rebalance, partitions counts


DataDog article on Key RabbitMq metrics

broker side, mostly irrelevant


DataDog article on Kafka metrics

producer: request/response rate, latency, io wait time, batch size, throughput produced and batch-compression rate
consumer: records lag, records rate, fetch rate, throughput consumed


OTel proposal, early WIP

EventHubs Metrics Proposal

Report metrics that are useful for customers when operating applications with EventHubs or ServiceBus.
We can add more to expose internals later.
All


| Metric | Type | Comment |
| --- | --- | --- |
| Last offset on broker | counter | [TODO] Opt-in, offset of the last message published successfully |
| Last sequence number on broker | counter | [TODO] Opt-in, sequence number of the last message published successfully |
| AMQP link: errors | counter | link errors counter by error code |
| AMQP session: errors | counter | session errors counter by error code |
| AMQP Connections: active | up-down-counter | Number of active connections; available on broker, not per client process |
| AMQP Connections: creations | counter | Number of created connections; available on broker, not per client |

Dimensions:

Namespace
Entity
EntityPath

The last offset and last sequence number on the broker can only be reported as opt-in metrics (additional charges apply); customers would be expected to opt in on either the producer or the consumer.
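The link and session error counters in this section are reported per error code and sliced by the listed dimensions. A minimal sketch of a counter keyed by dimension values follows, using only the JDK. The class `DimensionedErrorCounter` is hypothetical and is not the azure-core or OTel implementation; it only illustrates how one logical metric fans out into one time series per unique combination of dimension values.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical illustration only; not part of azure-core.
final class DimensionedErrorCounter {
    // One LongAdder per unique (namespace, entity, entityPath, errorCode) combination.
    private final Map<List<String>, LongAdder> counts = new ConcurrentHashMap<>();

    /** Records one error for the given dimension values. */
    void increment(String namespace, String entity, String entityPath, String errorCode) {
        counts.computeIfAbsent(key(namespace, entity, entityPath, errorCode), k -> new LongAdder())
            .increment();
    }

    /** Returns the count for one exact combination of dimension values. */
    long get(String namespace, String entity, String entityPath, String errorCode) {
        LongAdder adder = counts.get(key(namespace, entity, entityPath, errorCode));
        return adder == null ? 0L : adder.sum();
    }

    /** Slices by error code: sums over all namespaces and entities. */
    long totalForErrorCode(String errorCode) {
        long total = 0L;
        for (Map.Entry<List<String>, LongAdder> entry : counts.entrySet()) {
            // Index 3 is the error code position used by key(...).
            if (Objects.equals(entry.getKey().get(3), errorCode)) {
                total += entry.getValue().sum();
            }
        }
        return total;
    }

    private static List<String> key(String namespace, String entity, String entityPath, String errorCode) {
        return Collections.unmodifiableList(Arrays.asList(namespace, entity, entityPath, errorCode));
    }
}
```

Every distinct combination creates a new series, so the number of dimensions and their cardinality directly drive storage cost, which is one reason to keep high-cardinality metrics opt-in.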
Producer


| Metric | Type | Comment |
| --- | --- | --- |
| Send: duration | histogram | Number of milliseconds a ProducerClient.Send call takes, including all retries |
| Send: messages in batch | counter | Number of messages sent per Producer.Send call |
| Send: bytes in batch | histogram | Number of bytes sent per Producer.Send call |
| AMQP link: send duration | histogram | Response time (in milliseconds) of an AMQP request |


Dimensions:

Namespace
Entity
EntityPath
Error code (or success)

Notes:

The attempts metric can be calculated from the metrics above, e.g. avg attempts # = count(link_duration)/count(send_duration). If that proves insufficient, we can come up with a better one.
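The derivation in the note can be sketched as follows: every attempt records one AMQP link send duration and every user-visible send call records one send duration, so the ratio of the two measurement counts approximates the average number of attempts per call. The class and method names are hypothetical.

```java
// Hypothetical helper: derives average attempts per send from two histogram counts.
final class SendAttempts {
    private SendAttempts() {
    }

    /**
     * @param linkSendCount number of measurements in the "AMQP link: send duration" histogram
     * @param sendCallCount number of measurements in the "Send: duration" histogram
     * @return average attempts per send call, or 0.0 when no calls were recorded
     */
    static double averageAttempts(long linkSendCount, long sendCallCount) {
        if (sendCallCount == 0L) {
            return 0.0;
        }
        return (double) linkSendCount / sendCallCount;
    }
}
```

For example, 150 link-level sends observed across 100 send calls gives an average of 1.5 attempts, meaning roughly every second call needed one retry.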

Consumer


| Metric | Type | Comment |
| --- | --- | --- |
| AMQP: messages received | counter | Number of messages received per Consumer.Receive call |
| AMQP: credits requested | counter | Number of credits requested from broker. |
| Processor: duration | histogram | available on broker, not per client |
| Processor: error handler | counter | Error Handler Invocations |
| Checkpoint: duration | histogram | available on broker, not per client |
| Checkpoint: last offset checkpointed | counter | |
| Checkpoint: last sequence number checkpointed | counter | |


Dimensions:

Namespace
Entity
Error code (or success)
EntityPath
Consumer GroupId

This will allow the following views, with slicing, dicing and filtering by any dimension:

Histogram: count, rate, percentiles, avg, max
Gauge: count, rate, max, avg, sum
Counters: count, rate, total, avg, max
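To make the histogram views concrete, here is a minimal sketch that computes count, average, max and a percentile from raw recorded values. It is an illustration under simplifying assumptions: real metrics backends usually aggregate measurements into buckets and estimate percentiles from them, while this sketch keeps every value and uses the nearest-rank method. The class name `HistogramView` is hypothetical.

```java
import java.util.Arrays;

// Hypothetical illustration of the aggregations a histogram view exposes.
final class HistogramView {
    private HistogramView() {
    }

    static long count(double[] values) {
        return values.length;
    }

    static double avg(double[] values) {
        if (values.length == 0) {
            return 0.0;
        }
        double sum = 0.0;
        for (double value : values) {
            sum += value;
        }
        return sum / values.length;
    }

    static double max(double[] values) {
        double max = Double.NEGATIVE_INFINITY;
        for (double value : values) {
            max = Math.max(max, value);
        }
        return max;
    }

    /** Nearest-rank percentile; p must be in (0, 100] and values must be non-empty. */
    static double percentile(double[] values, double p) {
        if (values.length == 0 || p <= 0.0 || p > 100.0) {
            throw new IllegalArgumentException("need non-empty values and p in (0, 100]");
        }
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        // Smallest rank such that at least p percent of the values are at or below it.
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }
}
```

With send durations of 10, 20, 30, 40 and 100 milliseconds, the average is 40 while the median is 30 and the 95th percentile is 100, which shows why percentiles are listed next to the average for latency-style metrics.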

...
[WIP] Spec: https://gist.github.com/lmolkova/489a2b280b8fa68e4c3780c2afaa3b39
OpenTelemetry


status (5/9/2022): API and SDK stable as of 1.14
OpenTelemetry micrometer plugin: alpha
Application Insights agent: supports otel metrics in 3.3.0-beta release
Azure Monitor exporter: does not support metrics - TBD - roadmap
Other exporters: OTLP - stable, Prometheus - alpha
OTel exporter registry - backends that support metrics: AWS CloudWatch, Datadog, Dynatrace, Elastic, Graphite, Influx, Instana, JMX, NewRelic, Stackdriver, Sumologic, Logzio, Honeycomb, Prometheus, SignalFx, StatsD (as a source), Wavefront. Not in the Micrometer backend list: Sumologic, Logzio, Honeycomb.
OTel instrumentations registry - an enormous list covering traces (and metrics derived from traces).
Semantics: OTel attempts to standardize metrics, dimensions and attribute names across languages for generic scenarios (e.g. messaging)

Micrometer


Status: stable
Application Insights agent: supports micrometer (stable)
OpenTelemetry Java agent: supports micrometer (stable)
OpenTelemetry micrometer plugin: alpha
Micrometer backend registry - AppOptics, Atlas, AWS CloudWatch, Datadog, Dynatrace, Elastic, Ganglia, Graphite, Humio, Influx, Instana, JMX, KairosDB, NewRelic, Prometheus, SignalFx, Stackdriver, StatsD, Wavefront. Not in the OTel exporter list: AppOptics, Atlas, Ganglia, Humio, KairosDB.
Micrometer instrumentations - Spring Boot, JVM, Cache, OkHttp, Jetty and Jersey.
Micrometer does not have guidance or standards on attributes for generic scenarios

OpenTelemetry has a lot of instrumentations available; supporting it would mean minimizing the future list of dependencies for users.
Micrometer is the more stable solution, though.
OTel and Micrometer provide similar sets of Meters (sync and callback-based): counters, gauges, histograms.

OTel supports exemplars, which let users see examples of traces corresponding to a specific measurement
OTel allows efficient and convenient use of dynamic attribute values


Plan


 Custom attributes (OTel baggage, Micrometer tags)

OTel baggage is not supported yet
Micrometer: registry.config().commonTags("custom-tag", "foo");


 Core changes and API review: done and released
 AMQP core changes: in progress
 ServiceBus changes
 EventHubs changes
 release otel plugin
 micrometer to samples
 document our metrics conventions
 AzMon review
 Blog
 Update EH/SB tsgs