caddy metrics proposal

I've been reading the metrics-related issues and I must admit they're a little hard to follow.

Performance and metric quality #4644

We instrument every handler in the chain, which is both expensive and, for complex setups, not that useful.

As someone who recently had to configure metrics and dashboards for a caddy instance, I was underwhelmed.

The core issue (for me) is that a handler's metrics don't provide much value. On one hand, as was mentioned before, with deep chains they tend to be very redundant; on the other, they're not granular enough - I can see I'm serving N qps on file_server, but which one?

I posit that there are two main use cases for http metrics:

  1. having metrics per server/host
  2. having metrics per route

They are not necessarily mutually exclusive; I'll explore what this can look like in more detail later on.

Metric stability guarantees #4016

We must ensure metrics are stable during the lifetime of a caddy process. We claim this is at odds with how caddy works (with config reloads), but I don't understand why that is the case.

This could very well be just a lack of context/understanding of the underlying issue on my part, but it seems to me the two concepts are unrelated, and that the conclusion we've come to is based on how we implemented metrics.

There are two main vectors I can see for a metric to change: changing its configuration (like histogram buckets) and changing its label set (effectively the same thing, but I see them as two distinct classes of change).

However, AFAICT this is only an issue because we use global metrics, since this restriction applies at the Registry level.
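
To make it concrete, here's a minimal standalone sketch (plain client_golang, metric names made up) of why the restriction sits at the Registry level: a single registry refuses the same metric name with a different label set, which is exactly what a config reload could produce against a global registry.

package main

import (
    "fmt"

    "github.com/prometheus/client_golang/prometheus"
)

func main() {
    reg := prometheus.NewRegistry()

    // "Old" config: a counter with a single label.
    reg.MustRegister(prometheus.NewCounterVec(
        prometheus.CounterOpts{Name: "example_requests_total", Help: "Requests."},
        []string{"handler"},
    ))

    // "Reloaded" config: same name, different label set. The registry
    // rejects it, because descriptors must stay consistent for the
    // lifetime of that registry.
    err := reg.Register(prometheus.NewCounterVec(
        prometheus.CounterOpts{Name: "example_requests_total", Help: "Requests."},
        []string{"handler", "host"},
    ))
    fmt.Println(err)
}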

So I think it's worth considering what having dynamic registries (and associated handlers for serving the metrics) could look like.

The registry bit should be more or less straightforward; conceptually it's just a caddy module. I'm not sure how we would tackle the handler for exposition though, maybe each app could reserve a port for a /metrics endpoint.

Allowing the exposition of other metrics #4140

This comes up with reverse_proxy, for example, where users care about information that is (technically) unrelated to caddy itself.

So any design put forth must be able to support this use case.

Design

This will require a significant review from core members as I do not yet grasp all of the details of how caddy works.

Registries and Apps

I propose that each app be responsible for exposing its metrics. This entails each app managing its own Registry and /metrics endpoint.

A naive approach would be to let each app start a listener on a user-definable port and have complete separation of concerns.

In order for this to be feasible, metric gathering would have to be disabled by default and the user would be required to configure a port for each app that has metrics. Otherwise we would either have to pre-define a port for each app (which would require external buy-in for non-standard apps) or just let it crash, which would provide a poor user experience, since the most likely outcome is a generic "addr already in use" message without much context, unless apps (including non-standard ones) handle it gracefully.
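
As a sketch of what the per-app knob could look like (all names here are hypothetical, not an existing caddy API), the metrics guest module's config could carry nothing more than a user-definable listen address:

// MetricsConfig is a hypothetical config block an app would embed for its
// metrics guest module. If Listen is empty, no registry or listener is
// created and metric gathering stays disabled for that app.
type MetricsConfig struct {
    // Address to serve the /metrics endpoint on, e.g. "localhost:9180".
    Listen string `json:"listen,omitempty"`
}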

By letting the user be the one to pick which app gets which port, if/when there is a conflict the user is already "in the loop" and will hopefully understand what that means and how to fix it. This could be a big assumption though.

The other downside is that their scraping config gets a little more complex, but it also provides a certain degree of freedom, as they might want to handle different scrape targets differently. It is unclear to me where this sits on the advantage vs. disadvantage spectrum, as my prometheus setups are fairly basic and I haven't needed this kind of customization yet, so citation needed.

A less naive approach would be to have a single instance-level listener on which apps may register their endpoints.

This would perhaps simplify scraping config, as it's a single target, but it might also limit users' options (citation needed). It would also require a more complex implementation, where every app needs to communicate with a central component, and will undoubtedly involve a certain degree of magic.

As of right now, I'm leaning towards option one (one scrape target per app) but we need some input here.

In this scenario, every app would manage its own Registry, associated /metrics endpoint + listener, and config.

When there is a config change, the previous Registry should be discarded and a new one created (exact semantics TBC).

This is the first step to ensure metric stability, as a new config might mean there are different metrics (explored further at a later point).

This is in essence the same workflow the http app does (albeit for http things) so it ought to be feasible. We can provide a reference implementation for this workflow as it is generic enough to apply to every app.

Conceptually it could look something like this:

type MetricsServer interface {
    // Creates a new `Registry` and associated handler and starts the listener.
    Start() error
    // Stops the listener.
    Stop() error

    // Returns a handle to the current `Registry`.
    // A caller is not allowed to store this.
    Registerer() prometheus.Registerer
}

* we might need to stick a Provision in there too.

Assuming that a config reload means creating a new app and calling Start on it, this should be a sufficient API surface. Looking at the http app, I'm not seeing any synchronization, so this would imply the above holds and that we can safely share the underlying registry without managing it through an atomic pointer, but citation needed.
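
For reference, a rough sketch of what such an implementation could look like, backed by a plain net/http server and promhttp (all of this is an assumption about shape, not a final API):

package metrics

import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// Server is a sketch of the MetricsServer interface above. A fresh Registry
// is created on every Start, so a config reload (new app, new Start) yields
// a new, stable metric set.
type Server struct {
    Addr string // user-configured listen address, e.g. "localhost:9180"

    reg *prometheus.Registry
    srv *http.Server
}

func (s *Server) Start() error {
    s.reg = prometheus.NewRegistry()

    mux := http.NewServeMux()
    mux.Handle("/metrics", promhttp.HandlerFor(s.reg, promhttp.HandlerOpts{}))
    s.srv = &http.Server{Addr: s.Addr, Handler: mux}

    go func() {
        // Error handling elided; "addr already in use" would surface here.
        _ = s.srv.ListenAndServe()
    }()
    return nil
}

func (s *Server) Stop() error {
    return s.srv.Close()
}

// Registerer returns the current Registry. Callers must not retain it
// across config reloads.
func (s *Server) Registerer() prometheus.Registerer {
    return s.reg
}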

All an app would need to do to wire this up is treat it as a guest module, making sure to forward its Start and Stop signals to it. The lifetime of this module is inherently tied to the lifetime of the host module given it's a guest.

At this stage we have a way to have multiple metric sets in caddy, without globals, and in a way that ensures the metrics in a given Registry are always stable, as a config reload will simply create a new one and start exposing it.

Config reloads are treated as "counter resets" from the POV of Prometheus. I consider this correct behaviour as we're effectively restarting the child apps. If we were to extrapolate this to a simple go service, the most typical behaviour for a config change is indeed a restart.

Registering metrics

All metrics should be registered at the Provision stage; the app should forward its registry through Context.

Child modules that need to create metrics may do so by yanking the registry out of the context (or perhaps we should formalize this with some sort of accessor/helper methods in the context itself).
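
For illustration, a child module's Provision could then look roughly like this; MyHandler and ctx.MetricsRegisterer are hypothetical, the latter being the kind of accessor/helper mentioned above:

package myapp

import (
    "github.com/caddyserver/caddy/v2"
    "github.com/prometheus/client_golang/prometheus"
)

// MyHandler is a hypothetical child module that owns one metric.
type MyHandler struct {
    requests *prometheus.CounterVec
}

// Provision registers the module's metrics exactly once; they are treated
// as immutable afterwards. A config reload provisions a new instance
// against a new registry.
func (m *MyHandler) Provision(ctx caddy.Context) error {
    m.requests = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "myapp_requests_total",
            Help: "Requests handled by this module.",
        },
        []string{"route"},
    )
    // MetricsRegisterer is a hypothetical helper on caddy.Context that
    // returns the registry forwarded by the host app.
    return ctx.MetricsRegisterer().Register(m.requests)
}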

Child modules must guarantee that they register stable metrics; that is to say, metrics must be registered once and treated as immutable thereafter. Modules are not allowed to add new metrics at any point other than Provision.

As previously established, a config change will inherently trigger a new cycle.

Dealing with dynamic metrics descriptors

I consider a "dynamic metric descriptor" to be any metric that, in the isolated context of a module, is not fully known ahead of time because it's dependent on child module configuration.

This use case manifests itself when a user wants to add their own label pairs to an existing metric.

This can be supported without violating any of the rules/assumptions made above if a host module delays instantiation of its metrics until its child modules have been instantiated in Provision.

In this scenario, a child module may request additional labels to be added to the host's metrics through Context.

I see three ways of making this work, each with its own tradeoffs:

  1. New labels requested by the guest modules get applied unconditionally to every metric of the host module
  2. The host module creates metrics in two stages. In the first stage, it creates the metric descriptors and passes them through Context (somehow), allowing guest modules to mutate them. After every guest module is instantiated, it's guaranteed that no more mutations can occur in the descriptors, and as such the host module is free to register them in the registry.
  3. Labels must be configured ahead of time in the app itself (labels only, does not apply to values)

While approach #1 keeps things rather simple, at least superficially, it can have a detrimental effect on cardinality as every label will apply to every metric, even if that was not the user's intent.

Approach #2 solves this but forces the guest modules (or, more likely, the user) to know what metrics they're dealing with. I don't see this as a problem; it is one more knob to be aware of when configuring caddy, but I claim that if a user is customizing metric labels they are already fully aware of what metrics are available, and this shouldn't be an impediment.
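
A very rough sketch of what approach #2 could look like, with every name below hypothetical: the host exposes mutable descriptors while its guests are provisioned, and only builds and registers the real collectors once provisioning is done.

package myapp

import "github.com/prometheus/client_golang/prometheus"

// MetricSpec is a mutable descriptor the host would expose (e.g. through
// Context) while its guest modules are being provisioned.
type MetricSpec struct {
    Name   string
    Help   string
    Labels []string
}

// AddLabel is what a guest module would call to extend one specific metric
// instead of every metric of the host.
func (s *MetricSpec) AddLabel(label string) {
    s.Labels = append(s.Labels, label)
}

// buildMetrics runs after all guests are provisioned; from this point on
// the descriptors are frozen and registered, so the registry stays stable.
func buildMetrics(reg prometheus.Registerer, specs []*MetricSpec) ([]*prometheus.CounterVec, error) {
    out := make([]*prometheus.CounterVec, 0, len(specs))
    for _, spec := range specs {
        cv := prometheus.NewCounterVec(
            prometheus.CounterOpts{Name: spec.Name, Help: spec.Help},
            spec.Labels,
        )
        if err := reg.Register(cv); err != nil {
            return nil, err
        }
        out = append(out, cv)
    }
    return out, nil
}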

Approach #3 combines the best of #1 and #2: the implementation is simple and the user keeps control of which labels apply to which metrics, at the cost of some redundancy/friction, since if a user wants to add a label pair they will first need to register that label in the metrics module. We would also need a mechanism to detect this error ahead of time for a better user experience.

Dealing with dynamic label values

This is not an issue per se, as it will not cause any of the problems mentioned beforehand. It's just a matter of landing on an API to support this, likely with Replacer; there is some prior art / suggestions in the issues.
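
As an example of the kind of shape this could take (the placeholder-as-label-value idea is an assumption, but caddy.Replacer, ReplaceAll, and the replacer stored in the request context are real), a handler could resolve a configured placeholder into a label value at observation time:

package myapp

import (
    "net/http"

    "github.com/caddyserver/caddy/v2"
    "github.com/prometheus/client_golang/prometheus"
)

// observe resolves a user-configured placeholder (e.g. "{http.request.host}")
// into a label value at request time using the request's Replacer.
func observe(requests *prometheus.CounterVec, labelPlaceholder string, r *http.Request) {
    repl := r.Context().Value(caddy.ReplacerCtxKey).(*caddy.Replacer)
    requests.WithLabelValues(repl.ReplaceAll(labelPlaceholder, "")).Inc()
}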

TODOs:

  • api spec
  • caddyfile support for configuring apps