codefromthecrypt/auto-trace-trouble.md

## auto-trace-trouble.md

      
    Raw
  

              auto-trace-trouble.md
            
          
    The number one problem to solve in tracing is propagation, at least carrying trace identifiers from one
side of your process to the other. Usually, some thread local state is involved with this. In I'd say most
libraries apart from OpenTracing, this is a separate functionality than timing things. For example, you
put variables in scope with one api and you do timing with another. It might look like this.
span = startSpan();
try (Scope scope = propagation.scope(span.ids())){
  // work here can see the span ids via something like Trace.ids() 
} finally {
  span.finish();
}
This design was discussed quite a bit, and in use for years in things like finagle or servlet filters to handle request variables. Particularly bogdandrutu from Google+Dapper+Census+Stackdriver+gRPC taught me
some things about this I didn't initially grok. Anyway, it ended up in zipkin's core library Brave, inspired by work in Census (google's tracing libary), and most feedback has
been great.
This looks ok, but some java folks might say.. hey, this looks like repetition! Let's do this instead.
try (ScopedScope scopedSpan = tracer.startScopedSpan()) {
  // work here can see the span ids via something like Trace.ids() 
} // no need to close things.
The above approach was added to Census, and more recently Nike Wingtips for reasons like
making it easy for someone to quickly (manually) trace a synchronous call. In Brave (zipkin), we didn't do that for reasons
including most tracing can be made automatic, obviating a lot of scenarios where you need to tell end users how to trace
things. However, it is all tradeoffs.. Basially the scoped span thing trades less things to track explicitly for difficulty
in error handling. This part was most succinctly pointed out to me by raphw who worked on Instana tracing (perhaps more famously for the bytebuddy library).
The problem in the simpler api above is that the span representing the operation closes before you get to error
or finally handling. In other words, anything in catch or finally would happen after the span has finished, and this
can be surprising to people. If you need more concreteness, paste the below into your favorite IDE and run it.
    try (AutoCloseable foo = () -> System.out.println("span.finish()")) {
      System.out.println("span.start()");
      throw new RuntimeException();
    } catch (RuntimeException e) {
      System.out.println("span.tag(exception)");
    }
Other reasons that led to Brave not implementing this (immediately anyway) are that auto-close can be interfered with if
the user calls close by accident. This sort of thing can be mitigated with good docs and logging, for example nicmunroe's
recent work on wingtips has warning for identifying malpractice.
Of course, some apis go well beyond this, for example an ActiveSpan api designed by Ben S in OpenTracing
tried to establish a chain of continuations used by things like callbacks. The last to call close finishes the span (a close race condition).
This approach is troubling to me as it is complex, can easily be messed up, and puts a burden on tracing libraries they don't deserve. Through a lot of hard work, notably a
discussion about python at the distributed tracing workshop,
an integration issue on brave, and with
significant effort from a few folks like ivantopo... hopefully this thing will be removed from OpenTracing api,
ideally permanently, but sometimes things removed just get added back later (ex logging apis in OpenTracing were removed then added back). We'll see.
Anyway, back to the point.. so we are looking to ease users from having to do code like this..
(this code is real code in brave for apache hc client tracing, which is synchronous, and properly handles errors)
    return (route, request, clientContext, execAware) -> {
      Span span = handler.nextSpan(request);
      CloseableHttpResponse response = null;
      Throwable error = null;
      try (SpanInScope ws = tracer.withSpanInScope(span)) {
        return response = protocolExec.execute(route, request, clientContext, execAware);
      } catch (IOException | HttpException | RuntimeException | Error e) {
        error = e;
        throw e;
      } finally {
        handler.handleReceive(response, error, span);
      }
    };
One argument I have is that users shouldn't be doing ^^, framework authors should! In zipkin at least, a lot of folks in many
languages make helpers for users to auto-trace things, severaly reducing the need to ever deal with lifecycle. For example,
there are folks defining java @Trace annotations which completely hide things that can be enclosed in a method. For the hard
libraries, well, my argument is do it right! This code once written rarely changes. Even better if you are lucky enough to
use an automatic framework or agent such that someone else is doing that code. There's really no glory in trying to force
everyone to learn tracing code.
Another argument is that it is increasingly rare that RPC calls are synchronous anyway, and even if they are, they are not
guaranteed to close over a method neatly. I found there was no place this could be used for instrumentation, not unless we
stopped doing error handling. So if we added this in Brave (zipkin java tracing), it would be only for users (who again are
often served by aspectj or bytebuddy etc).
Finally, it is really important to stop conflating scoping apis with lifecyle, at least it should be very clear that these
are separate functions. The most sincere lock-in possible is to force scoping only via a tracer api. This means all the
insrumentation needed for deadlines, user id propagation etc, have to use an api that changes much more frequently. I'm
hopeful folks like CodingFabian can help make a "just scoping" api which even if used by tracing, can be reused by other
consumers of scoped variables such as those needed in service meshi. Some of this is already present in the tags api in
census,
and starting to brew in a TraceContext spec discussion. Ideally, this
results in a very small api that anyone can use to push things across one side of a process to another, and in a meaningful
way. IOTW, not just unguaranteed junk, rather if you place a deadline, you can surely read it on the other side. I'm positive
that at least in zipkin (and spring cloud sleuth) we will participate in something like this.
Anyway, I hope the ramble was of some interest even if the scope of it is narrow :)