@matthiasr
Created May 24, 2022 14:18
On autoscaling

Kubernetes defaults to scaling by CPU usage, because that is what is always available. However, this is not a great metric to scale on. Most backend services are not CPU-bound – they mostly wait for responses from other services, caches, or databases. If one of those gets slow, or worse, talking to one of those gets slow, CPU-based scaling will tear down resources rather than scaling out, because all it sees is "idle" instances. This is especially bad if the contended resource is concurrency on those network requests. If many requests are waiting to check out a connection from the database connection pool, scaling down is what you want the least.

My favorite metric to scale on is the number of requests currently ongoing (per instance). There is a relationship between latency, request rate, and this number of ongoing requests: a 20% increase in latency at the same request rate, or a 20% increase in request rate at the same latency, each results in a 20% increase in ongoing requests (and consequently, you should scale out by 20%).
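
In symbols (this is Little's Law, which comes up again below), writing L for the average number of ongoing requests, λ for the request rate, and W for the average latency:

L = λ × W
(1.2 × λ) × W = 1.2 × L    (20% more traffic at the same latency)
λ × (1.2 × W) = 1.2 × L    (20% higher latency at the same traffic)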

We get very good results from this approach. In practical terms, we scale the application down under normal load to see what the per-instance concurrency is near the breaking point, and set the autoscaling target to half of that. You can target more if you want to conserve resources, but that means less headroom for short-term spikes.
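
As a worked example with made-up numbers: if the load test shows things breaking down around 40 concurrent requests per instance, the autoscaling target would be 20 per instance. An autoscaler working on average values (like the HPA with an AverageValue target) then sizes the deployment roughly as

desired replicas = ceil(total concurrency / 20)

so a total of 130 ongoing requests across the deployment would scale it to 7 instances, each sitting at about half of what it can take.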

The "physical" reasoning for this scaling strategy is that most of the limitations in your app are not actually on CPU. even when it's sitting there waiting for e.g. the database, an ongoing request is taking up resources – memory, file handles, database connections from the pool, lock contention. Pure CPU usage degrades rather nicely, these all degrade rather badly and in a non-linear way, going from "fine" to "everything is broken" with only a relatively small incremental load. By scaling on concurrent requests, you are implicitly scaling on these limiting factors. Whatever CPU usage results from that is what you feed back to the Kubernetes scheduler via CPU requests. We set requests from the load, rather than trying to scale to the request.

While I haven't used it myself, this is in line with KEDA's "queue depth" approach, which should make setting this up with KEDA a lot easier than wrangling prometheus-adapter and custom metrics.

Ideally, you have the number of ongoing requests directly available as a metric. If not, in Prometheus-instrumented services, summaries and histograms give us a very neat way of deriving it. Both include a _sum time series, the total observed time (in this case, time spent handling requests), and a _count time series, the number of observations (requests handled). Little's Law says that the average concurrency is the product of the average latency and the request rate. In Prometheus, we calculate the average latency with something like

sum(rate(http_request_latency_sum[1m])) / sum(rate(http_request_latency_count[1m]))

and the request rate with

sum(rate(http_request_latency_count[1m]))

Multiplying the two, the request rate cancels out, and we can directly calculate the average concurrency using only the _sum time series:

sum(rate(http_request_latency_sum[1m]))

Another way of looking at this is that we are calculating the seconds spent answering requests, per second. If, in 1 second, we answered 2 seconds' worth of requests, we were answering 2 requests at the same time on average.
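
Note that this query gives the total concurrency across all instances. For the per-instance number the target is expressed in, divide by the number of instances actually serving, for example (the job label is a placeholder):

sum(rate(http_request_latency_sum[1m])) / count(up{job="myapp"} == 1)

An autoscaler working on an average-value target effectively does this division itself, so feeding it the total together with a per-instance target works just as well.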

This is sufficient information for concurrency-based autoscaling. The averaging over time helps smooth the scaling, and it typically takes some time for new instances to come up anyway. The concurrency headroom you choose bridges the gap between an increase in load (or latency) and new instances coming online.
