codefromthecrypt/obess-duration.md

## obess-duration.md

      
I've been very obsessed with the duration of Zipkin's POST endpoint, more than with how many bytes of memory are used while processing a POST (I also obsess about that, but it doesn't keep me up at night). The duration of an endpoint that receives telemetry data is the part of the response time that you can control.

Callers of Zipkin's POST endpoint are usually little HTTP loops inside an application. Even when these are done neatly, with bounds etc., blocking those loops causes damage (lost spans), and also causes more overhead as their queues fill to capacity. Crazy but true: sometimes people literally POST to Zipkin inline (ad hoc, or sometimes in PHP)! While we shouldn't optimize for this, it is crazy how much impact a slow endpoint can have.

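
To make the "lost spans" point concrete, here is a minimal sketch of a bounded in-app buffer that drops rather than blocks. The `BoundedSpanQueue` class and its methods are hypothetical names for illustration, not the API of Zipkin or of any real reporter library. The assumption is that the sender loop is stuck on a slow POST, so nothing drains while the app keeps reporting.

```python
from collections import deque


class BoundedSpanQueue:
    """Hypothetical in-app span buffer (illustration only, not a real reporter API)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pending = deque()
        self.dropped = 0

    def report(self, span):
        # Never block the application thread: when the buffer is full, drop the span.
        if len(self.pending) >= self.capacity:
            self.dropped += 1
            return False
        self.pending.append(span)
        return True

    def drain(self, max_spans):
        # Called by the sender loop before each POST; a slow POST delays the next drain.
        batch = []
        while self.pending and len(batch) < max_spans:
            batch.append(self.pending.popleft())
        return batch


queue = BoundedSpanQueue(capacity=3)
# Assumption: the sender is still waiting on a slow POST, so no drain happens here.
accepted = [queue.report(f"span-{i}") for i in range(5)]
print(accepted, "dropped:", queue.dropped)
```

With a capacity of 3 and no drain in between, the last two spans are dropped. The longer each POST takes, the more often the app finds the buffer in this state.
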
For this reason, we need to succeed fast, and we also need to fail fast. We want these requests to clear or fail quickly (e.g. in case of failure, the client can try another node, right?). This "fast" must apply to a reasonable percentage of requests, because you don't want half your apps hosed, right? P99.something is relevant because you are talking about impact caused under duress.

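
The "try another node" idea can be sketched from the client side as follows. This is only an illustration under assumptions: `post_with_failover`, the `send` callable, the node names, and the 0.5 second timeout are all made up here, and a real client would plug in its own HTTP call. The point is that a failure that arrives quickly leaves time to move on.

```python
def post_with_failover(nodes, payload, send, timeout_s=0.5):
    """Try each node once with a short timeout.

    Returns the node that accepted the payload, or None if all failed.
    `send` is a hypothetical callable: send(node, payload, timeout_s).
    """
    for node in nodes:
        try:
            send(node, payload, timeout_s)
            return node
        except (TimeoutError, ConnectionError):
            # A fast failure lets the loop try the next node promptly.
            continue
    return None


def fake_send(node, payload, timeout_s):
    # Stand-in for an HTTP POST: pretend the first node is overloaded.
    if node == "zipkin-a":
        raise TimeoutError(f"{node} did not answer within {timeout_s}s")


chosen = post_with_failover(["zipkin-a", "zipkin-b"], b"[]", fake_send)
print(chosen)
```

If the server took many seconds to fail instead, the same loop would spend that time blocked on the first node while the app's span buffer filled up.
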
It might seem unintuitive to bother making failures fast if the requests might be useless one way or another. However, there's usually a constant in telemetry: at the end of the day, you are optional. You don't have the right to make production apps fail! Tying up queues, acting awful, etc. indirectly hurts the app. That's why I obsess about this part rather than the health of Zipkin itself. Zipkin could use more memory etc. if it must, but that would only be "Zipkin's problem".

Sadly, we must be real about our very secondary place in application architecture, and putting request handling first does that.