Skip to content

Instantly share code, notes, and snippets.

@codefromthecrypt
Last active September 21, 2017 21:53
Show Gist options
  • Save codefromthecrypt/bc3ff070b6cf4e21ab83b44e0d64a016 to your computer and use it in GitHub Desktop.
Save codefromthecrypt/bc3ff070b6cf4e21ab83b44e0d64a016 to your computer and use it in GitHub Desktop.
Auto-close tracing can be troublesome

The number one problem to solve in tracing is propagation, at least carrying trace identifiers from one side of your process to the other. Usually, some thread local state is involved with this. In I'd say most libraries apart from OpenTracing, this is a separate functionality than timing things. For example, you put variables in scope with one api and you do timing with another. It might look like this.

span = startSpan();
try (Scope scope = propagation.scope(span.ids())){
  // work here can see the span ids via something like Trace.ids() 
} finally {
  span.finish();
}

This design was discussed quite a bit, and in use for years in things like finagle or servlet filters to handle request variables. Particularly bogdandrutu from Google+Dapper+Census+Stackdriver+gRPC taught me some things about this I didn't initially grok. Anyway, it ended up in zipkin's core library Brave, inspired by work in Census (google's tracing libary), and most feedback has been great.

This looks ok, but some java folks might say.. hey, this looks like repetition! Let's do this instead.

try (ScopedScope scopedSpan = tracer.startScopedSpan()) {
  // work here can see the span ids via something like Trace.ids() 
} // no need to close things.

The above approach was added to Census, and more recently Nike Wingtips for reasons like making it easy for someone to quickly (manually) trace a synchronous call. In Brave (zipkin), we didn't do that for reasons including most tracing can be made automatic, obviating a lot of scenarios where you need to tell end users how to trace things. However, it is all tradeoffs.. Basially the scoped span thing trades less things to track explicitly for difficulty in error handling. This part was most succinctly pointed out to me by raphw who worked on Instana tracing (perhaps more famously for the bytebuddy library).

The problem in the simpler api above is that the span representing the operation closes before you get to error or finally handling. In other words, anything in catch or finally would happen after the span has finished, and this can be surprising to people. If you need more concreteness, paste the below into your favorite IDE and run it.

    try (AutoCloseable foo = () -> System.out.println("span.finish()")) {
      System.out.println("span.start()");
      throw new RuntimeException();
    } catch (RuntimeException e) {
      System.out.println("span.tag(exception)");
    }

Other reasons that led to Brave not implementing this (immediately anyway) are that auto-close can be interfered with if the user calls close by accident. This sort of thing can be mitigated with good docs and logging, for example nicmunroe's recent work on wingtips has warning for identifying malpractice.

Of course, some apis go well beyond this, for example an ActiveSpan api designed by Ben S in OpenTracing tried to establish a chain of continuations used by things like callbacks. The last to call close finishes the span (a close race condition). This approach is troubling to me as it is complex, can easily be messed up, and puts a burden on tracing libraries they don't deserve. Through a lot of hard work, notably a discussion about python at the distributed tracing workshop, an integration issue on brave, and with significant effort from a few folks like ivantopo... hopefully this thing will be removed from OpenTracing api, ideally permanently, but sometimes things removed just get added back later (ex logging apis in OpenTracing were removed then added back). We'll see.

Anyway, back to the point.. so we are looking to ease users from having to do code like this..

(this code is real code in brave for apache hc client tracing, which is synchronous, and properly handles errors)

    return (route, request, clientContext, execAware) -> {
      Span span = handler.nextSpan(request);
      CloseableHttpResponse response = null;
      Throwable error = null;
      try (SpanInScope ws = tracer.withSpanInScope(span)) {
        return response = protocolExec.execute(route, request, clientContext, execAware);
      } catch (IOException | HttpException | RuntimeException | Error e) {
        error = e;
        throw e;
      } finally {
        handler.handleReceive(response, error, span);
      }
    };

One argument I have is that users shouldn't be doing ^^, framework authors should! In zipkin at least, a lot of folks in many languages make helpers for users to auto-trace things, severaly reducing the need to ever deal with lifecycle. For example, there are folks defining java @Trace annotations which completely hide things that can be enclosed in a method. For the hard libraries, well, my argument is do it right! This code once written rarely changes. Even better if you are lucky enough to use an automatic framework or agent such that someone else is doing that code. There's really no glory in trying to force everyone to learn tracing code.

Another argument is that it is increasingly rare that RPC calls are synchronous anyway, and even if they are, they are not guaranteed to close over a method neatly. I found there was no place this could be used for instrumentation, not unless we stopped doing error handling. So if we added this in Brave (zipkin java tracing), it would be only for users (who again are often served by aspectj or bytebuddy etc).

Finally, it is really important to stop conflating scoping apis with lifecyle, at least it should be very clear that these are separate functions. The most sincere lock-in possible is to force scoping only via a tracer api. This means all the insrumentation needed for deadlines, user id propagation etc, have to use an api that changes much more frequently. I'm hopeful folks like CodingFabian can help make a "just scoping" api which even if used by tracing, can be reused by other consumers of scoped variables such as those needed in service meshi. Some of this is already present in the tags api in census, and starting to brew in a TraceContext spec discussion. Ideally, this results in a very small api that anyone can use to push things across one side of a process to another, and in a meaningful way. IOTW, not just unguaranteed junk, rather if you place a deadline, you can surely read it on the other side. I'm positive that at least in zipkin (and spring cloud sleuth) we will participate in something like this.

Anyway, I hope the ramble was of some interest even if the scope of it is narrow :)

@codefromthecrypt
Copy link
Author

Drill-down notes.. so what's wrong tag after finish?

span.start()
span.finish()
span.tag(exception)

In tracers I've seen, ^^ will result in late data. For example, the tag corresponding to the exception misses the boat,
and is essentially harvested later (ex 30 seconds later in brave or finagle, any dangling data will be flushed).

That's when you code correctly! If you used a tool like "currentSpan" in the finally block, the error would misallocate
to the parent span. At any rate, there are some nuance to using try-finally = span lifecycle which folks need to be
aware of, so that they aren't surprised later.

@ivantopo
Copy link

hey @adriancole, cool post! btw, maybe you meant to link to this or this other instead?

@codefromthecrypt
Copy link
Author

clarified that there is no race condition in try/finally as auto-close always wins. the race condition is in the related ActiveSpan api in OpenTracing

@codefromthecrypt
Copy link
Author

@ivantopo thx copy pasta error

@codefromthecrypt
Copy link
Author

census-instrumentation/opencensus-java#646 will spike a revised opinion (which might mean keep it) in census

@nicmunroe
Copy link

Well said on everything. I may regret putting the AutoCloseable stuff into Wingtips. :) Definitely tradeoffs, and it's hard to know where the line is.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment