@jsuereth
Created October 5, 2022 16:47
Semantic Conventions - Forward Progress proposal

Problem

When investigating #2775, the TC decided to look into expanding the notion of Resource to include Entity.

During this discussion, we identified several hard problems that OpenTelemetry must tackle going forward, including:

  • Telemetry Identity evolving as "scope" increases. E.g. a Jaeger instance running in a single k8s cluster may not need to know the identity of the k8s cluster, as there's only one. However, a datastore spanning multiple k8s clusters WILL need this information.
  • A simple "Service" model for OpenTelemetry SDKs (e.g. requiring a service.name attribute) works in the short run, but begins to struggle in large distributed systems.
  • Defining a consistent guideline on what resource / entity means in practice, how to choose one and what our Entity <-> Signal modelling needs to look like in the long run.

We do believe that an Entity model that allows identity enrichment is the right path forward. These problems are large, thorny, and require deep thought and attention. However, halting progress on instrumentation to tackle them could stall OpenTelemetry's forward momentum and put instrumentation efforts at great risk. We expect fully solving all of these problems to take on the order of years, not months or weeks.

Criteria

This proposal aims to unblock language instrumentation driven through SDKs (i.e. not opentelemetry-collector). Specifically, if accepted, this proposal would allow the continuation of:

  • Trace Instrumentation Semantic Conventions (HTTP, RPC, Messaging, etc.)
  • Metric Instrumentation Semantic Conventions (HTTP, RPC, Java, etc., but not Process, Host, etc.)
  • RUM / Client-side Instrumentation

In addition, this should allow progress towards Logging semantic conventions and community-convergence discussions with Elastic Common Schema.

Proposal

The proposal is split into a few components, but hinges on requiring all SDKs to use a "service" as their de facto Resource and identity for metrics. It consists of the following tasks / changes to the OpenTelemetry Specification:

  • Update Service resource Semantic Conventions to require SDKs to provide service.instance.id
  • Update Resource SDK specification to require Service resource attributes to be discovered first
  • Update OpenTelemetry Metrics Data Model such that only identifying attributes in a Resource participate in time series identity
    • For the Service resource, this would include service.name, service.instance.id and, when present, service.namespace.
    • Allow other resource types to be defined with identity on an ad-hoc / necessity basis. This should only be done to unblock major instrumentation efforts and when a forwards-compatible / "fixable" / "future-proof" design can be made.
  • Update Semantic Conventions to include a "sharable" flag for attributes to indicate applicability of sharing an attribute between metrics and other signals.
    • This flag should denote whether the expected cardinality of an attribute is acceptable for most metric backends.
    • This will help prevent issues like java-instrumentation#5307, where http.url (a high cardinality label) was encoded in latency metrics.
  • Allow Prometheus Metrics Exporters to:
    • Drop service.* resource attributes, as is expected in Prometheus, where service discovery provides them.
    • Ensure OTLP => Prometheus-Remote-Write leverages these identifying attributes.
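
To make the two data-model changes above more concrete, here is a minimal Python sketch. It is not part of the specification or any SDK: the function names, the SHARABLE table and its values, and the choice to treat unlisted attributes as not sharable are all assumptions made for illustration. It shows time series identity built only from the identifying Service attributes, and a "sharable" flag used to keep high-cardinality attributes such as http.url out of metrics.

```python
# Identifying attributes for a Service resource, as listed in the proposal.
SERVICE_IDENTIFYING_KEYS = ("service.name", "service.instance.id", "service.namespace")

# Hypothetical "sharable" flags. Real values would come from the semantic
# conventions; these are illustrative only.
SHARABLE = {
    "http.method": True,
    "http.status_code": True,
    "http.url": False,  # high cardinality, should not become a metric label
}


def timeseries_identity(metric_name, resource, attributes):
    """Return a hashable identity that ignores non-identifying resource attributes."""
    identifying = tuple(
        (key, resource[key]) for key in SERVICE_IDENTIFYING_KEYS if key in resource
    )
    return (metric_name, identifying, tuple(sorted(attributes.items())))


def metric_attributes(signal_attributes):
    """Keep only attributes flagged as sharable with metrics.

    Assumption: attributes without a flag are treated as not sharable.
    """
    return {k: v for k, v in signal_attributes.items() if SHARABLE.get(k, False)}
```

Under this sketch, two resources that differ only in a descriptive attribute such as host.name map to the same time series, while a different service.instance.id yields a distinct one.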

Concerns

The RUM / Client-instrumentation SIG has already raised concerns over forcing a "Service" abstraction everywhere. We also know this is a concern for the OpenTelemetry Collector, e.g. the hostmetricsreceiver. We intend to lift this restriction as progress is made on the underlying issues around Entity, Identity and the topology of signal-generators. However, we see aligning on a simplistic model as a first step towards unblocking instrumentation, one that is in line with common industry standards and that we can evolve over time to address these concerns.

@yurishkuro

yurishkuro commented Oct 5, 2022

  1. I am not clear on the use of the word Entity as self-explaining. To me Entity sounds just like another name for Resource. Fwiw, I prefer (Observable) Entity over Resource as Entity sounds less compute-oriented, i.e. I can model a business process or workflow as Entity, but it's a stretch to call a workflow a Resource. That digression aside though, I am not clear how "entity model" mentioned here is different.
  2. I am concerned with over-indexing on "service". Say my metric is the number of orders made through the system, how is "service" relevant?
