The immediate goal is to report metrics from Azure messaging SDKs (EventHubs and ServiceBus) to help customers detect and investigate configuration issues, performance bottlenecks, and application and SDK bugs.
It can be broken down into smaller goals:
- define metrics essential for messaging scenarios
- define Metrics API in azure-core
- metrics plugin implementations
We expect users to want to know how many messages were received, processed, and checkpointed; how long messages take to reach consumers; batch sizes; the success rate of network operations; and other key metrics we're going to define. Some of these metrics can be calculated from traces, but not all of them, and we're going to focus on the ones that cannot. Metrics would provide a more performant, cheaper, and more production-ready solution than tracing.
We expect users to already have some metrics solution in their app. Based on a Spring One survey, 90%+ of attendees use an APM tool (for logs, metrics, or traces); of those, ~20%+ use Prometheus and ~30% use Azure Monitor.
- Supportability: our TSGs should include steps that ask users to check metrics emitted by the SDK instead of verbose logs. This would help narrow down problems without reconfiguring logging and reproducing the issue.
- Stress tests: assuming SDKs report metrics, stress tests would be just another user of this feature. If we see an issue in a stress test run, we can use built-in metrics to investigate it in the same way users would.
- HTTP-based SDK: limited and can be done in core. Can be done automagically in tracing calls before sampling.
- thick clients: CosmosDB (already uses Micrometer in Java, has similar ask for .NET)
- .NET: OTel Metrics are included in DiagnosticSource 6.0.
- Python: OTel metrics API is in RC
- JS: OTel metrics are in development
- Go: Alpha
- C++: Alpha
The proposal here is to polish the scenario in Java, where we have a partner ask, and learn from it before doing any work in other languages.
We're going to pick the solution that
- works for Spring Cloud
- works with Azure Monitor and Prometheus
- is compatible with a variety of other APM vendors
- has a fair amount of existing instrumentations
- we'll have Meter API abstractions in azure-core
- Provide OTel-based implementation for Meter APIs
- Spring will keep using Micrometer and will provide a Micrometer-based implementation similar to this sample
Closely follow the OTel metrics API (the Micrometer APIs are quite similar). Implement only a subset; perf is more important than convenience:
Naming choice: the `Azure` prefix is added to avoid collision with the OTel `Meter`.
// Attributes for each possible outcome can be created upfront, usually along with the client instance.
Map<String, Object> successAttributes = createAttributes("http://service-endpoint.azure.com", false);
Map<String, Object> errorAttributes = createAttributes("http://service-endpoint.azure.com", true);
// Create instruments for possible error codes. Can be done lazily once a specific error code is received.
AzureLongCounter successfulHttpConnections = defaultMeter.createLongCounter("az.core.http.connections",
    "Number of created HTTP connections", null, successAttributes);
AzureLongCounter failedHttpConnections = defaultMeter.createLongCounter("az.core.http.connections",
    "Number of created HTTP connections", null, errorAttributes);
boolean success = false;
try {
    success = connect();
} finally {
    // Record the outcome on the counter whose attributes match it.
    if (success) {
        successfulHttpConnections.add(1, currentContext);
    } else {
        failedHttpConnections.add(1, currentContext);
    }
}
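To make the "attributes bound at creation time" idea in the sample above concrete, here is a minimal, stdlib-only sketch. All type and method names in it (`MeterSketch`, `InMemoryMeter`, `LongCounter`) are hypothetical and exist only for illustration; they are not the azure-core API, and the context argument from the sample is omitted for brevity. The point is only that the time series is resolved once when the counter is created, so `add` stays cheap.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Illustrative sketch only; these types are hypothetical and are not the azure-core API. */
final class MeterSketch {

    /** Counter whose attributes were fixed when it was created. */
    interface LongCounter {
        void add(long value);
    }

    /** In-memory meter that keeps one running total per (metric name, attribute set). */
    static final class InMemoryMeter {
        private final Map<String, LongAdder> totals = new ConcurrentHashMap<>();

        LongCounter createLongCounter(String name, String description, String unit,
            Map<String, Object> attributes) {
            // Resolve the time series once, at creation time, so that add()
            // needs no map lookup on the hot path.
            LongAdder total = totals.computeIfAbsent(seriesKey(name, attributes), k -> new LongAdder());
            return total::add;
        }

        long total(String name, Map<String, Object> attributes) {
            LongAdder total = totals.get(seriesKey(name, attributes));
            return total == null ? 0L : total.sum();
        }

        private static String seriesKey(String name, Map<String, Object> attributes) {
            // TreeMap yields a stable key regardless of attribute insertion order.
            return name + new TreeMap<>(attributes);
        }
    }

    public static void main(String[] args) {
        InMemoryMeter meter = new InMemoryMeter();
        Map<String, Object> success = Map.of("endpoint", "service-endpoint.azure.com", "error", false);
        Map<String, Object> failure = Map.of("endpoint", "service-endpoint.azure.com", "error", true);

        LongCounter succeeded = meter.createLongCounter("az.core.http.connections",
            "Number of created HTTP connections", null, success);
        LongCounter failed = meter.createLongCounter("az.core.http.connections",
            "Number of created HTTP connections", null, failure);

        succeeded.add(1);
        succeeded.add(1);
        failed.add(1);

        System.out.println("successful: " + meter.total("az.core.http.connections", success));
        System.out.println("failed: " + meter.total("az.core.http.connections", failure));
    }
}
```

A real plugin would delegate to OTel or Micrometer instead of a `LongAdder`; the sketch only illustrates why pre-built attribute sets keep the per-measurement cost low.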
// configure OpenTelemetry SDK as usual and register global configuration
SdkMeterProvider meterProvider = SdkMeterProvider.builder()
    .registerMetricReader(PeriodicMetricReader.builder(OtlpGrpcMetricExporter.builder().build()).build())
    .build();
OpenTelemetrySdk.builder()
    .setMeterProvider(meterProvider)
    .setPropagators(ContextPropagators.create(W3CTraceContextPropagator.getInstance()))
    .buildAndRegisterGlobal();
// configure Azure Client, no metric configuration needed, client will use global OTel configuration
AzureClient sampleClient = new AzureClientBuilder()
    .endpoint("https://my-client.azure.com")
    .build();
// use client as usual; if it emits metrics, they will be exported
sampleClient.methodCall("get items", Context.NONE);
// configure OpenTelemetry SDK as usual
SdkMeterProvider meterProvider = SdkMeterProvider.builder()
    .registerMetricReader(PeriodicMetricReader.builder(OtlpGrpcMetricExporter.builder().build()).build())
    .build();
OpenTelemetry openTelemetry = OpenTelemetrySdk.builder()
    .setMeterProvider(meterProvider)
    .setPropagators(ContextPropagators.create(W3CTraceContextPropagator.getInstance()))
    .build();
// Pass OTel meterProvider to MetricsOptions - it will be used instead of implicit global singleton.
MetricsOptions customMetricsOptions = new MetricsOptions()
    .setProvider(meterProvider);
// configure Azure Client and pass the custom metrics options to it
AzureClient sampleClient = new AzureClientBuilder()
    .endpoint("https://my-client.azure.com")
    .clientOptions(new ClientOptions().setMetricsOptions(customMetricsOptions))
    .build();
Span span = openTelemetry.getTracer("azure-core-samples")
    .spanBuilder("doWork")
    .startSpan();
try (Scope scope = span.makeCurrent()) {
    // do some work
    // Current context flows to OpenTelemetry metrics and is used to populate exemplars
    String response = sampleClient.methodCall("get items");
    // do more work
}
span.end();
- Current EventHubs broker metrics
  - Requests: incoming/outgoing, success, throttles
  - Messages: incoming, outgoing, captured
  - Bytes: incoming, outgoing, size of EH
- Track -1 ServiceBus performance counters
  - SendMessage/ReceiveMessage/CompleteMessage/AcceptSession/CancelScheduled
    - count (error/success)
    - rate (error/success)
    - duration
    - per namespace and per entity
  - Exceptions: count and rate (by type)
  - TokenAcquisition: rate (success/error), latency
  - Pending ReceiveMessage/AcceptMessageSession/AcceptMessageSessionByNamespace: count
  - EventProcessor process: latency, batch size
  - Connections: reset count (per entity), redirect count
  - Prefetch queue size and depth(?) per entity
  - Throughput (in/out): byte rate (per ns/entity)
- XBox EventHubs Perf Counters - internal
  - Blob offset store: time since last offset flush
  - Producer (per topic):
    - Latency
    - Throughput
    - Request rate, retry rate, timeout rate, error receive rate
    - Transmission error rate
    - Event rate per partition
  - Buffered producer
    - Queue size
    - Queue full rate
    - Enqueue rate
    - Batch size
    - Event time in queue
  - Consumer (per topic, per partition)
    - Lag (which is two other metrics)
      - last received (seqNo) - last published (seqNo)
    - Receive rate (success/error)
    - Producer-to-consumer latency (receive timestamp - enqueued-time) - approx?
    - Seconds to Zero: Lag / consumption rate
    - Consumption rate
    - Consumer queue size
    - Delivery queue: size, incoming rate, delivery rate, delivery failure rate
- Current Kafka metrics
  - Producer:
    - batch size, splits
    - throughput: outgoing bytes, compression rate
    - metadata age
    - throttle time
    - record: errors, rate, time in send buffer, retries, size
    - request: rate, size, active
    - response rate, bytes
    - Connections: close, creation, io stats
  - Consumer:
    - fetch: latency, rate, size, throttle time, counts
    - records: rate, lag, batch size
    - bytes consumed
    - consumer groups: partitions, commit latency, rate, join rate, etc
- Kafka proposal
  - intent: Kafka client library internals observability
  - metrics:
    - connections creations/active, errors
    - requests: rate, rtt, errors
    - internal queue latency, size
    - client io wait time
    - producer queue size, bytes
    - consumer
      - poll interval, latency, last time
      - consumer queue count, bytes
      - consumer group: errors, rebalance, partitions counts
- DataDog article on Key RabbitMq metrics
  - broker side, mostly irrelevant
- DataDog article on Kafka metrics
  - producer: request/response rate, latency, io wait time, batch size, throughput produced and batch-compression rate
  - consumer: record lag, records rate, fetch rate, throughput consumed
- OTel proposal, early WIP
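Several of the consumer metrics in the prior art above are derived from others (lag, seconds to zero, producer-to-consumer latency). The following stdlib-only sketch shows one way to read them. The class and method names are hypothetical, and it assumes that lag is the difference between the last published and last received sequence numbers and that seconds to zero is lag divided by the rate at which the consumer catches up.

```java
import java.time.Duration;
import java.time.Instant;

/** Illustrative sketch; names and formulas are assumptions, not an SDK API. */
final class ConsumerLagSketch {

    /** Events published to the partition but not yet received by this consumer. */
    static long lag(long lastPublishedSeqNo, long lastReceivedSeqNo) {
        return Math.max(0L, lastPublishedSeqNo - lastReceivedSeqNo);
    }

    /**
     * Estimated time until the consumer catches up. catchUpEventsPerSecond is the
     * consumption rate minus the publish rate; a non-positive value means the
     * consumer never catches up.
     */
    static double secondsToZero(long lag, double catchUpEventsPerSecond) {
        if (lag == 0) {
            return 0.0;
        }
        if (catchUpEventsPerSecond <= 0) {
            return Double.POSITIVE_INFINITY;
        }
        return lag / catchUpEventsPerSecond;
    }

    /**
     * Approximate only: the enqueued time is stamped by the broker and the receive
     * time by the client, so clock skew between them affects the result.
     */
    static Duration producerToConsumerLatency(Instant enqueuedTime, Instant receivedTime) {
        return Duration.between(enqueuedTime, receivedTime);
    }

    public static void main(String[] args) {
        long lag = lag(1_500, 1_200);
        System.out.println("lag: " + lag);
        System.out.println("seconds to zero: " + secondsToZero(lag, 50.0));
        System.out.println("latency: " + producerToConsumerLatency(
            Instant.parse("2022-05-09T10:00:00Z"), Instant.parse("2022-05-09T10:00:02.500Z")));
    }
}
```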
Report metrics that are useful for customers when operating applications with EventHubs or ServiceBus. We can add more to expose internals later.
Metric | Type | Comment |
---|---|---|
Last offset on broker | counter | [TODO] Opt-in, offset of the last message published successfully |
Last sequence number on broker | counter | [TODO] Opt-in, sequence number of the last message published successfully |
AMQP link: errors | counter | link errors counter by error code |
AMQP session: errors | counter | session errors counter by error code |
AMQP Connections: active | up-down-counter | Number of active connections; available on broker, not per client process |
AMQP Connections: creations | counter | Number of created connections; available on broker, not per client |
Dimensions:
- Namespace
- Entity
- EntityPath
Both the last offset and the last sequence number can only be reported as opt-in metrics (additional charges apply); customers would be expected to opt in on either the producer or the consumer.
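As an illustration of "errors counter by error code" with the Namespace and EntityPath dimensions, here is a small stdlib-only sketch that creates a counter lazily the first time a given error code is seen. The class name, the dimension keys, and the example error codes are assumptions made for the sketch; a real implementation would create an instrument through the metrics plugin instead of a `LongAdder`.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Illustrative sketch; not an SDK API. */
final class LinkErrorCountersSketch {
    private final String namespace;
    private final String entityPath;
    // One counter per error code, created on first use.
    private final Map<String, LongAdder> countsByErrorCode = new ConcurrentHashMap<>();

    LinkErrorCountersSketch(String namespace, String entityPath) {
        this.namespace = namespace;
        this.entityPath = entityPath;
    }

    /** Increments the counter for the error code, creating it lazily if needed. */
    void recordError(String errorCode) {
        countsByErrorCode.computeIfAbsent(errorCode, code -> new LongAdder()).increment();
    }

    long count(String errorCode) {
        LongAdder counter = countsByErrorCode.get(errorCode);
        return counter == null ? 0L : counter.sum();
    }

    /** Dimensions that would be attached to the time series for this error code. */
    Map<String, String> dimensions(String errorCode) {
        return Map.of("namespace", namespace, "entityPath", entityPath, "errorCode", errorCode);
    }

    public static void main(String[] args) {
        LinkErrorCountersSketch errors =
            new LinkErrorCountersSketch("contoso.servicebus.windows.net", "orders");
        errors.recordError("amqp:link:detach-forced");
        errors.recordError("amqp:link:detach-forced");
        errors.recordError("amqp:connection:forced");
        System.out.println(errors.count("amqp:link:detach-forced"));
        System.out.println(errors.dimensions("amqp:connection:forced"));
    }
}
```

Lazy creation keeps the number of time series proportional to the error codes actually observed, which matters when each additional series has a cost.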
Metric | Type | Comment |
---|---|---|
Send: duration | histogram | Duration in milliseconds of a ProducerClient.Send call, including all retries |
Send: messages in batch | counter | Number of messages sent per Producer.Send call |
Send: bytes in batch | histogram | Number of bytes sent per Producer.Send call |
AMQP link: send duration | histogram | Response time (in milliseconds) of AMQP request |
Dimensions:
- Namespace
- Entity
- EntityPath
- Error code (or success)
Notes:
- Attempt metrics can be calculated from these, e.g. avg number of attempts = count(link_duration)/count(send_duration). If that proves insufficient, we can come up with a better one.
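A worked example of the attempts calculation from the note above, as a minimal sketch. The class and method names are hypothetical; the inputs are the measurement counts of the two histograms (link-level send duration and producer-level send duration).

```java
/** Illustrative sketch of avg attempts = count(link_duration) / count(send_duration). */
final class SendAttemptsSketch {

    /**
     * @param linkSendCount     number of measurements in the "AMQP link: send duration" histogram
     * @param producerSendCount number of measurements in the "Send: duration" histogram
     * @return average number of link-level attempts per producer send, or NaN if nothing was sent
     */
    static double averageAttempts(long linkSendCount, long producerSendCount) {
        if (producerSendCount == 0) {
            return Double.NaN;
        }
        return (double) linkSendCount / producerSendCount;
    }

    public static void main(String[] args) {
        // 100 Send calls that needed 150 link-level sends in total: 1.5 attempts on average.
        System.out.println(averageAttempts(150, 100));
    }
}
```

A value close to 1 means sends usually succeed on the first attempt; a rising value points at retries even when the overall send success rate still looks healthy.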
Metric | Type | Comment |
---|---|---|
AMQP: messages received | counter | Number of messages received per Consumer.Receive call |
AMQP: credits requested | counter | Number of credits requested from broker. |
Processor: duration | histogram | available on broker, not per client |
Processor: error handler | counter | Error Handler Invocations |
Checkpoint: duration | histogram | available on broker, not per client |
Checkpoint: last offset checkpointed | counter | |
Checkpoint: last sequence number checkpointed | counter | |
Dimensions:
- Namespace
- Entity
- Error code (or success)
- EntityPath
- Consumer GroupId
It will allow the following views, with slicing, dicing and filtering by any dimension:
- Histogram: count, rate, percentiles, avg, max
- Gauge: count, rate, max, avg, sum
- Counters: count, rate, total, avg, max
...
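To show how the histogram views listed above (count, avg, max, percentiles) can come out of a single bucketed instrument, here is a minimal stdlib-only sketch. The class name and bucket boundaries are assumptions, and the percentile is only an upper-bound estimate at bucket granularity, which is what most backends derive from explicit-bucket histograms. The sketch is not thread-safe.

```java
/** Illustrative explicit-bucket histogram; not an SDK or OTel API. Not thread-safe. */
final class DurationHistogramSketch {
    private final double[] upperBoundsMs; // sorted ascending
    private final long[] bucketCounts;    // last slot is the overflow bucket
    private long count;
    private double sum;
    private double max;

    DurationHistogramSketch(double... upperBoundsMs) {
        this.upperBoundsMs = upperBoundsMs.clone();
        this.bucketCounts = new long[upperBoundsMs.length + 1];
    }

    void record(double durationMs) {
        int bucket = 0;
        while (bucket < upperBoundsMs.length && durationMs > upperBoundsMs[bucket]) {
            bucket++;
        }
        bucketCounts[bucket]++;
        count++;
        sum += durationMs;
        max = Math.max(max, durationMs);
    }

    long count() {
        return count;
    }

    double average() {
        return count == 0 ? Double.NaN : sum / count;
    }

    double max() {
        return max;
    }

    /** Upper bound of the bucket containing the p-th quantile (0 < p <= 1). */
    double percentileUpperBound(double p) {
        if (count == 0) {
            return Double.NaN;
        }
        long rank = (long) Math.ceil(p * count);
        long seen = 0;
        for (int i = 0; i < bucketCounts.length; i++) {
            seen += bucketCounts[i];
            if (seen >= rank) {
                // The overflow bucket has no upper bound, so fall back to the observed max.
                return i < upperBoundsMs.length ? upperBoundsMs[i] : max;
            }
        }
        return max;
    }

    public static void main(String[] args) {
        DurationHistogramSketch sendDuration = new DurationHistogramSketch(10, 50, 100);
        for (double ms : new double[] {5, 8, 20, 40, 60, 250}) {
            sendDuration.record(ms);
        }
        System.out.println("count=" + sendDuration.count() + " max=" + sendDuration.max());
        System.out.println("p50 <= " + sendDuration.percentileUpperBound(0.5));
        System.out.println("p99 <= " + sendDuration.percentileUpperBound(0.99));
    }
}
```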
[WIP] Spec: https://gist.github.com/lmolkova/489a2b280b8fa68e4c3780c2afaa3b39
- status (5/9/2022): API and SDK stable as of 1.14
- OpenTelemetry micrometer plugin: alpha
- Application Insights agent: supports otel metrics in 3.3.0-beta release
- Azure Monitor exporter: does not support metrics - TBD - roadmap
- Other exporters: OTLP - stable, Prometheus - alpha
- OTel exporter registry: backends that support metrics (diff with Micrometer in bold): AWS CloudWatch, Datadog, Dynatrace, Elastic, Graphite, Influx, Instana, JMX, NewRelic, Stackdriver, **Sumologic**, **Logzio**, **Honeycomb**, Prometheus, SignalFx, StatsD (as a source), Wavefront
- OTel instrumentations registry: an enormous list of instrumentations for traces (and metrics from traces).
- Semantics: OTel attempts to standardize metrics, dimensions and attribute names across languages for generic scenarios (e.g. messaging)
- Status: stable
- Application Insights agent: supports micrometer (stable)
- OpenTelemetry Java agent: supports micrometer (stable)
- OpenTelemetry micrometer plugin: alpha
- Micrometer backend registry (diff with OTel in bold): **AppOptics**, **Atlas**, AWS CloudWatch, Datadog, Dynatrace, Elastic, **Ganglia**, Graphite, **Humio**, Influx, Instana, JMX, **KairosDB**, NewRelic, Prometheus, SignalFx, Stackdriver, StatsD, Wavefront.
- Micrometer instrumentations - Spring Boot, JVM, Cache, OkHttp, Jetty and Jersey.
- Micrometer does not have guidance or standards on attributes for generic scenarios
OpenTelemetry has a lot of instrumentations available; supporting it would mean minimizing the future list of dependencies for users. Micrometer is the more stable solution, though.
OTel and Micrometer provide similar sets of meters (sync and callback-based): counters, gauges, histograms.
- OTel supports exemplars, which let users see example traces corresponding to a specific measurement
- OTel makes it efficient and convenient to use dynamic attribute values
- Custom attributes (OTel baggage, Micrometer tags)
  - OTel baggage is not supported YET
  - Micrometer: `registry.config().commonTags("custom-tag", "foo");`
- Core changes and API review: done and released
- AMQP core changes: in progress
- ServiceBus changes
- EventHubs changes
- release OTel plugin
- add Micrometer to samples
- document our metrics conventions
- AzMon review
- Blog
- Update EH/SB TSGs