wu-sheng/my-misunderstandings.md

## my-misunderstandings.md

      
This is a reaction post to Misunderstanding "Open Tracing" for the Enterprise by @jkowall.
This is opinion with citations (imagine that!). This is not Wikipedia, sorry. I didn't run this by anyone, Jonah or otherwise. I do not represent OpenTracing or OpenCensus (or my employer, or whatever you might think) in this view. I will give critical thoughts on both from a technical view, as people mostly complained to me about a lack of details. I do have "a dog in the race", but it isn't what you might think. Yes, I'm the primary maintainer of Zipkin, but my goal is not to disparage anything. It is to keep the community healthy, with the options that exist, and free from the suffering caused, in my opinion, by a complete lack of technical view on what things do. This is particularly dangerous in interop, and I'll get to that.
I have experience with both tools. Though I left within months, I was implicated in the beginning of OpenTracing. I still maintain (fix, enhance, etc.) the Zipkin Java binding to it. I am involved as much as I can be with OpenCensus. Personally, I like their culture and their approach to interop better, so I am biased by this. I am also biased as a primary maintainer of Zipkin: I get personal impact (less couch time) the more apis break or the more mistruths are told about tracing. Anyway, both projects (like most things) are flawed works in progress, even if I feel one is better at interop. If you are looking for OpenTracing balloons, that's not what you'll get here. I'm one of the few people willing to criticize it in any way. For example, a widely read gist of mine is unfortunately a go-to because, as an industry, we like handing out balloons, I guess? Hopefully I will not be handing out OpenCensus balloons either. As I have limited time for this, I'm just going to discuss Jonah's blog.
I'll break down things by quoting parts top-to-bottom and trying to unpack them. Not everything, just things I feel notable or that folks asked me about.
## Preamble

Preamble is that OpenTracing is a library api definition for tracing. If used by frameworks and pinned to an implementation, you'd ideally see a coherent view of a distributed operation, like a request hitting memcached. This may or may not look like what you would see if you were to use direct integrations or alternatives to OpenTracing.
It is not architecture. It is not cloud or mesh or any other buzzword. It is literally a header or Java interface or pick-your-language interface. Unlike other instrumentation projects, a goal has been for "users" not only to code their own traces, but to do so routinely, and to use the same api for metrics and alerting, ripping out log4j for opentracing, security token propagation, because why not? I don't have time to get into how horrible many of these ideas are, but the point is that it wants you to use it exclusively for a myriad of concerns at all layers of the stack. Anyway.. vendors have to provide a "Tracer" with the features defined by OpenTracing, and this includes their own data formats etc. There is a grey area about what "OpenTracing" support means, as there is no chart I'm aware of. For example, APM agents typically don't swap their internal code. They can provide an interface to users while still using a better api internally.
OpenCensus is a toolkit for instrumentors... it is currently not marketed (not much at all, anyway) to "Jo developer". It provides a suite of apis for metrics, tracing and tagging. You do not have to use the tracing api to use the metrics api. You do not have to route the data through the same pipeline. Each thing is opt-in and there is no likelihood that a future exception reporting or even logging api will be required.
By the way.. I'm talking about data?? Why is that? Census is an api with an implementation. In other words, you "plug in" to this with exporters and things that control headers. The internals are batteries included. Interestingly, this allows the community to make what's otherwise only available commercially, like lazy pull collectors. It has a data model and an api. I genuinely hope they don't market this as a log4j replacement, and I don't expect they will. For example, Google have a large stake in gRPC. If they wanted to push census down anyone's throat, they likely would have made gRPC require census. It doesn't: Zipkin's gRPC instrumentation can be used instead, for example. Anyway.. if you are a vendor, you plug in some things. It is relatively easy, but it also constrains how data moves to the "census way".
## Presumed motivation of Jonah's blog

First, motivation! This is motivated by a Gartner post which suggests agent-based instrumentation is not advised, and that literally mesh or OpenTracing is instead. Not OpenTracing or OpenCensus or AWS format or ... You have one option and that's OpenTracing. The link is paid.. guess who is likely to pay for this? Enterprise!! That's why you'll see a focus on enterprise in Jonah's blog.. Why are enterprise folks being recommended OpenTracing as the answer to agents? (I'll completely dodge the mesh advice.)
I'm assuming this advice was fed to Gartner by OpenTracing people directly or indirectly via hype. I don't know this to be true. However, the quest of OpenTracing people against automatic or agent based instrumentation is interesting. The marketing slide mentioning motivations changed last year. It has this doozy:
"Auto-instrumentation doesn't scale: it must be explicit" Sorry, "Jo developer", you have to learn tracing apis now? Anyway, I think (don't know) that this world view difference exists. OpenTracing presumes routine developers need to write custom things to succeed in tracing. This is different from the AppDynamics (or non-LightStep OpenTracing vendors) POV, which is inclusive of automatic and manual instrumentation. I didn't run my guess of Jonah's motivation by him, so he'll tell me if I'm wrong...
TL;DR; I think it is because there is increasing messaging that there is one right way to do tracing (which is forcing everything through opentracing v0.31 I guess) and that doesn't gel with what actually happens in enterprise environments.
## Long commentary on things I feel need examples


> Great, so if this is open this will solve all interoperability issues we have, and allow me to use multiple APM and tracing tools at once? It will help avoid vendor or project lock-in, unlock cloud services which are opaque or invisible? Nope! Why not?

^^ May sound normal to you if you've only read hype, or horrid to you if you've actually tried to change things. I suspect many feel like they have lock-in freedom solely because they are told they do, or because a few simple examples of similar systems appear to work. This sentence summarizes a lot of the types of questions I get on GitHub issues from users, or planted by OpenTracing people.

> The Enterprise uses a wide range of technologies which must be cobbled together to make their applications work. Some of the custom apps written in Java, .NET and other languages, much of it a decade old. Other parts of the stack are packaged applications such as those provided by Oracle, Microsoft, SAP, and many more. These often work with messaging systems which span both open source and commercial tools using proprietary protocols such as those offered by Tibco, Mulesoft, Oracle, Microsoft, and open source projects such as ActiveMQ and RabbitMQ.

As mentioned in the preamble, I think Jonah's trying to suggest things that OpenTracing isn't doing (especially not with its anti-agent bias). But even if it did have a bias these things wouldn't work!
Instana (one of the poor souls working in OpenTracing but not able to control it) were one of the early contributors to Brave (Zipkin's Java library)... Out of the gate, they asked for Java 6 support (for those decade-old apps). Even last year we had folks asking for Spring 2.5 support (that's literally over 10 years old!). Meanwhile, there is a routine set of questions that go unanswered even in Zipkin, which tries to take care of older apps. For example, "websphere", a word that makes most OSS developers run and hide. I'm not dissing modern WebSphere, but that's not what people ask for.
Why did I just talk about this? I'm literally not talking about the OpenTracing or Census api experience.. hmm.. this is because people asking to "turn on tracing" are literally not wanting to, or not able to, change the code in their apps. The only way to accomplish this is a configuration-packaging approach. I fully expect the full-time staff behind OpenTracing core (notably LightStep, Uber, Red Hat) are capable of hands-off. Outside core OT, DataDog write the only "native" OpenTracing agent I'm aware of. Why is auto-instrumentation demonized? It only makes OpenTracing less relevant for enterprise applications, and further complicates positioning with competitors. Still, even if the effort was put in, would it match features of WebLogic 5.1? APMs do this.. is your lock-in freedom (to the degree it exists) a circa 2017 story?
Census do not demonize agents. They know that there are things you just can't do without them, such as solving propagation problems without touching code. That's why they pack in an agent. When census started externally, it went to vendors themselves with an aim to have vendor SDKs layer on it. Being ok with not being the only thing doing tracing is a strength, as it allows others to solve WebLogic 5.1.

> In many Enterprise organizations, each of these "hops" are managed by a different team or subject matter expert who often uses another siloed tool for monitoring and diagnostics. The enterprise APM providers build end to end views across these technologies, both old and new. Unified views are incredibly hard to do technically and culturally, and even more difficult in production, under heavy load, with minimally affecting the performance of the instrumented transactions.

To the degree OpenTracing can tell you about overhead, it is only the view of the tools on top, not the knock-on effects of things like parsing maps and the internals of tracers. Vendors express to me a very defensive view.. that they are blamed for all overhead.
Fear of overhead or errors makes relying on community instrumentation (community sometimes means planted by vendors) suspect. While possible, I highly doubt Dynatrace or anyone in enterprise environments is going to exclusively use the OpenTracing servlet filter. Not to rag on it, but it literally only works with newer servlet specs, and has almost no issues filed against it. Highly reused software has perpetual feedback. So.. if folks aren't using the opentracing things.. are they truly in a swappable situation? Are they going to get a similar experience? Sounds like a bake-off to me more than a swap-out!
OpenCensus will have this problem too, once it starts manufacturing instrumentation at the pace Red Hat do for OpenTracing, but it does have options not available in OpenTracing. The key differentiator between OpenTracing and OpenCensus is that the latter defines a data format and also an implementation (a means to control end-to-end overhead). This means there can be an inclusive story with other products in the ecosystem, potentially proprietary, in and outside the process, at least with open data. This means you can share and cooperate in the same trace without a wrestling match to lock in a specific api or version of it.
On the other hand, this has to prove itself out. There is an earnestly run specs site, but there's honestly little practice yet. Some aspects of the data model are contentious and matter more outside Google than inside it. For example, the span kind used in Zipkin (and OpenTracing for that matter) wasn't an idea sourced from Google, and we'll see how that goes. Moreover, while there are a lot of vendors writing exporters and things, there's less direct engagement on the spec itself. The degree to which in-process integration or data sharing is possible is conjecture at this point, so you should look at this again a year from now.

> The retrospective to the Enterprise is smaller more modern companies who build things differently. They create customized stacks of open source, develop tools and technologies which can go very deep into the infrastructure layers. This subset of companies are not the Enterprise; they have typically been running in cloud or containers since the inception of the company. They make different decisions and are happy to leverage open source projects which are often customized extensively. They are writing their instrumentation for various purposes including monitoring and root-cause analysis. OpenTracing attempts to standardize the custom coded instrumentation with a standard API and language.

This part is an interesting differentiator of world view. For example, in the golang community, there is a clear cultural willingness to code straight to the bottom as necessary. In such a view, you might choose to focus on what code looks like when you do monitoring and tracing and logging on your own in the same file, vs what platforms like gokit might do for you. Peter's presentation on mixed concerns shows how you can remove what looks like a bunch of crap from your app by using gokit (which provides several choices to implement tracing, btw).

> Once the developer writes the instrumentation, how a tool or platform consumes data is not part of the standard, it's not a problem OpenTracing is trying to solve.
>
> The result of this decision is that if a user has implemented OpenTracing with a specific vendor and language, let's call it "Tool A" they must release libraries or implementations for OpenTracing for each language.

Via backchannel, I realized this really made some people upset. For starters, people do want to be able to swap things.. they like the idea of this. This is one of the reasons why projects like OpenTracing and OpenCensus can be attractive. Now, why did Jonah use a harsher tone here? If it was to instill some doubt, maybe it is warranted.. let's move on.

> Implementation of a new tool requires code changes associated with connecting the library and any language specific implementation changes. For a mature language such as Java, this would require changing the library and the implementation at the same time since the propagation formats to the tools are incompatible.

b...b..bbut OpenTracing is swappable.. Jonah's just a mean vendor! Please, regardless of your camp, re-read this carefully. Is this not true? For example, in Java, interface compatibility is required, especially pre-Java 9. The OpenTracing project changed api in most versions, so yes... sorry, but if you use the same api at the top of the stack (user) and the bottom of the stack (http or database), you do need to release new code for every component. In two years this has been the norm, and it has caused rev-lock for projects that were keen to use opentracing. For example, if you are in the rare position where you actually have a use case for multiple "tracers", you have this rev-lock. It happened in StageMonitor, which had to wait until the Zipkin and Jaeger tracers were on the same version.
Propagation (trace headers), to the degree it is compatible at all in OpenTracing, is vendor specific. For example, if you use Jaeger, you can choose B3 format, as it is packed in. This creates leverage for Jaeger, as you can't get Zipkin compatibility in OpenTracing alone.. you have to use a Zipkin tracer or Jaeger (because Jaeger themselves have a bridge). Here's another dimension of lock-in.
OpenCensus, on the other hand, requires support of B3 format, and all things can be swapped. So, for example, you can be compatible on headers and other things that affect your production requests, decoupled from whether data is even sent externally to a system. This matches reality. For example, Nike Wingtips use Splunk, but still rely on quality instrumentation and stable propagation. I'm not saying Wingtips is obviated by census, rather that there's an implementation here, and it can provide independence guarantees that a programming interface alone cannot.
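To make the propagation point concrete, here is a minimal sketch of B3 multi-header injection and extraction. The header names are the ones the B3 format defines; the `TraceContext` class and the `inject`/`extract` helpers are hypothetical names made up for illustration, not an OpenCensus or Zipkin api. Note that nothing in it decides whether span data is ever exported anywhere.

```python
from dataclasses import dataclass
from typing import Dict, Mapping, Optional

# Header names defined by the B3 multi-header propagation format.
TRACE_ID = "X-B3-TraceId"
SPAN_ID = "X-B3-SpanId"
PARENT_SPAN_ID = "X-B3-ParentSpanId"
SAMPLED = "X-B3-Sampled"


@dataclass(frozen=True)
class TraceContext:
    """Hypothetical holder for the identifiers that travel with a request."""
    trace_id: str                         # 16 or 32 lower-hex characters
    span_id: str                          # 16 lower-hex characters
    parent_span_id: Optional[str] = None
    sampled: Optional[bool] = None        # None means "no decision made yet"


def inject(ctx: TraceContext, headers: Dict[str, str]) -> None:
    """Write the context into outgoing request headers."""
    headers[TRACE_ID] = ctx.trace_id
    headers[SPAN_ID] = ctx.span_id
    if ctx.parent_span_id is not None:
        headers[PARENT_SPAN_ID] = ctx.parent_span_id
    if ctx.sampled is not None:
        headers[SAMPLED] = "1" if ctx.sampled else "0"


def extract(headers: Mapping[str, str]) -> Optional[TraceContext]:
    """Read the context from incoming headers, or None if it is absent."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {k.lower(): v for k, v in headers.items()}
    trace_id = lowered.get(TRACE_ID.lower())
    span_id = lowered.get(SPAN_ID.lower())
    if not trace_id or not span_id:
        return None
    raw = lowered.get(SAMPLED.lower())
    # "1"/"0" are the current values; "true"/"false" come from older tracers.
    sampled = None if raw is None else raw in ("1", "true")
    return TraceContext(trace_id, span_id,
                        lowered.get(PARENT_SPAN_ID.lower()), sampled)
```

Any two libraries that agree on these four headers can join the same trace, whatever api the application code uses and wherever (if anywhere) the data ends up.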
TL;DR; when OpenTracing changes, it affects 2 layers of swapping ability because the same api exists at the top and the bottom of the system. In OpenCensus, there's an exporter interface responsible for vendor tie-in, which moves independently of user code and is provably more pliable. The same approach has worked for years in Zipkin with zipkin-reporter, which is downloaded far more often than the "zipkin tracer" and has a very stable interface coupled only to data format.
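A rough sketch of that separation follows. All names here (`SpanData`, `SpanExporter`, `InMemoryExporter`, `Tracer`) are hypothetical and chosen for illustration; this is not the OpenCensus or zipkin-reporter api. The point is only the shape: user code touches the tracer, while the vendor tie-in sits behind an exporter that depends on nothing but finished span data.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Protocol, TypeVar

T = TypeVar("T")


@dataclass
class SpanData:
    """Hypothetical finished-span record: the only thing exporters see."""
    name: str
    start_us: int
    duration_us: int
    tags: Dict[str, str] = field(default_factory=dict)


class SpanExporter(Protocol):
    """The vendor tie-in point. It is coupled to the data, not the tracing api."""
    def export(self, spans: List[SpanData]) -> None: ...


class InMemoryExporter:
    """Stand-in for a vendor exporter; a real one would encode and ship spans."""
    def __init__(self) -> None:
        self.spans: List[SpanData] = []

    def export(self, spans: List[SpanData]) -> None:
        self.spans.extend(spans)


class Tracer:
    """What instrumentation and user code call. It never names a vendor."""
    def __init__(self, exporter: SpanExporter,
                 clock: Callable[[], float] = time.time) -> None:
        self._exporter = exporter
        self._clock = clock

    def record(self, name: str, work: Callable[[], T], **tags: str) -> T:
        """Time `work`, then hand one finished span to the exporter."""
        start = self._clock()
        try:
            return work()
        finally:
            duration = self._clock() - start
            self._exporter.export(
                [SpanData(name, int(start * 1e6), int(duration * 1e6), dict(tags))])
```

Swapping vendors in this sketch means passing a different exporter to `Tracer(...)`; the call sites that use `record` do not change or need a new release.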

> OpenTracing doesn't solve the interoperability problem, so what does the "open standard" attempting to solve? Well, for one thing, is that it allows those making gateways, proxies, and frameworks the ability to write instrumentation. That should, in theory, make it easier to get traces connected, but once again the requirement to change implementation details for each tool is a problem.

I don't fully agree with Jonah here, as I really wouldn't recommend OpenTracing for gateways and proxies: they should be focused on wire compatibility and performance.. something which, as discussed earlier, you can't see the full picture of without an implementation. Sampling.. like export of data?! There's literally no api in OpenTracing for this, yet it is widely known as a key part of tracing. So how do you do sampling on your proxy? Vendor specific, or use some goofy tag? This code or tag heuristic locks you in to a vendor in a way Census doesn't. So yeah, I really don't recommend OpenTracing as solving the interop problem, even for proxies.
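For what it's worth, the decision a proxy has to make is small. Below is a sketch, assuming B3 headers are in use, that honors an upstream `X-B3-Sampled` value when there is one and otherwise falls back to a deterministic rate keyed on the trace ID. The function name `should_sample` and its parameters are hypothetical, and this is not how any particular tracer implements sampling.

```python
from typing import Optional


def should_sample(trace_id: str, upstream_sampled: Optional[str],
                  rate: float) -> bool:
    """Decide whether a proxy should record this request.

    trace_id:         lower-hex trace ID (16 or 32 characters)
    upstream_sampled: raw X-B3-Sampled header value, or None if absent
    rate:             fraction of new traces to keep, from 0.0 to 1.0
    """
    # An upstream decision always wins, so one trace is never half-recorded.
    if upstream_sampled is not None:
        return upstream_sampled in ("1", "true")
    # Otherwise derive the decision from the low 64 bits of the trace ID.
    # Every hop applying the same rate then reaches the same answer.
    # Raises ValueError if trace_id is not valid hex.
    low64 = int(trace_id[-16:], 16)
    return low64 < rate * 2 ** 64
```

Because the decision depends only on wire data (a header and an ID), it can be made the same way by components written against different tracing apis, or against none at all.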
The "requirement to change implementation details for each tool is a problem" part is a big deal, as it makes people mad who think they don't have to. If history is a lesson, look at how many folks are doing "multi-cloud" now vs when tools came out in the first place.. It isn't really concrete until you try. What actually works between, for example, New Relic (no offence) and Instana (also no offence) with OpenTracing is only known at bake-off. All you get to see are toy examples of Jaeger vs Zipkin (wow.. a fork works with the same data as what it forked..) or LightStep screenshots. Is this what we base interop on? Why people aren't asking for more is beyond me.. with the resources OpenTracing has, they clearly could.

> Don't forget that Enterprise APM tools do a lot more than just backend tracing, and they capture metrics and traces on the front end (mobile/web), infrastructure, log capture, and even correlation to other APIs. These additional capabilities are out of scope for the OpenTracing project.

Ironically, depending on who you hear from, the above is in scope of OpenTracing, because we are supposed to use it for everything. Most APM agents are closed source, so we just don't know, but I'd have to agree. Here are some interesting data points from open source, though:

- https://github.com/opentracing-contrib/java-agent < Written almost entirely by Red Hat, an OT committer.
  - A year old project with 304 downloads last month from maven central.
- https://github.com/DataDog/dd-trace-java < DataDog are not OT committers (who knows why as they've been heavily involved)
  - Data is only from opentracing, but format and propagation obviously not defined there
- https://github.com/apache/incubator-skywalking < Also not OT committers (despite writing chinese translation and other big efforts)
  - Data can come from OT api or their own apis. Also mixes with stats.

Census is frankly too new to be telling any stories here, yet. There are no vendors using it except Google (which, to be fair, is a pretty serious vendor, big not only externally but inside Google itself). There are exporters written, but the degree to which they are used seriously is likely limited to gRPC (as census is packed in there). I may be unfair here, but whatev.

> So if the goals are solving any of the problems an open standard would solve, OpenTracing doesn't do a lot. If you read the marketing put out by various vendors and foundations, one would think its the panacea, but the reality is far from the truth. OpenTracing does not standardize metrics, logs, or other structured data which tools consume.

This is a bit of grey area. OpenTracing do specify tags and their version of logs. However, it is not detailed enough to be used by Amazon X-Ray, for example, who will throw out data unless they get it all. That's because OpenTracing defines no model type that can enforce its collection. They may do that at some point, similarly to how Zipkin does with Brave (conjecture) or otherwise.
Let's take the X-Ray example to Census, as there's no OpenTracing effort to help with Amazon (at least from the core project). Because Census has a data model, a discussion can be had about whether, generically speaking, Amazon can work or not. It currently isn't interop tested! However, it is more straightforward, as you can discuss purely the data translation concerns independently from the entire implementation.
All the above is rather "trace centric". In OpenCensus, stats can be defined, as they have apis for it. The challenge OpenCensus will face in stats will be similar to the ones felt at Micrometer: will you be able to collect and represent data in ways that vendors will be able to, and desire to, absorb? The jury is out until there's more experience imho.
Regardless, I'll share a quote from @wu-sheng on a related topic: "Explicit tracing can't, or should say not good at, access meta data of codes, something like line number, real class name(e.g. Spring class generation mechanism). But there are solutions for them, but not cool, and need to add many codes." In other words, there are some things, data in nature, that agents are doing which explicit coding cannot do, or cannot do efficiently, even if you defined the data field!

> For some reason there is a presumption that swapping agents out of an APM tool is difficult, this is entirely not the case. Vendors replace each other all the time, if interested in these stories, many published, and countless others remain behind closed doors. It's much easier to change instrumentation done at runtime versus those hardcoded into your products. This is especially the case when the API and standards are always changing, and many of the changes in OpenTracing have been breaking changes to the APIs.

This is back to the "swap out" story. When APM agents are installed, the only exposure your code has, if any, is to whatever apis that agent ties into. For example, an OpenTracing contributor is working at Elasticsearch now on a new agent. Felix smartly defines a much smaller customization api, as any such api is a lock-in risk, regardless of whether it has the word open on it or not. OpenTracing's api is huge in comparison, and again it is used at the bottom of the stack, potentially in the agent itself in some of the cases I mentioned above. This creates huge friction. Moreover, many simply "turn on the agent", which means they get tracing, stats etc. when it is loaded, and don't when it isn't.
This isn't to say coexistence with agents is easy. For example, if you follow @raphw on twitter, he'll occasionally yell about this problem: how agents are coded and how they interfere with each other. Also, an agent, unless it uses something with an export format, does nothing for interop, even if it can do something about code-based lock-in.

> Enterprise APM vendors must build a lot of instrumentation to support these various technologies, languages, and frameworks. Each of the companies had teams of people just maintaining framework instrumentation. Each tool vendor has dozens of engineers dedicated to writing these, which is a waste of resources, and could be more efficient if there were an actual open standard.

This is a big deal.. people see hype and buzz, but forget that real users and hard work go into things that matter. We get hype-based curiosity that people end up needing to take time to respond to. Meanwhile, there are not only dozens, but sometimes a hundred-plus engineers at vendors working on stuff actually used in the enterprise.
## Outro

If you don't like how Jonah says things, fine. However, he, like many others who talk to me, is frustrated with the hype, and that's amplified when it gets into Gartner advice to enterprise folks. We are sorely overdue in paying the debt we borrow when presenting hype-based solutions to very hard problems. If we say things like vendor-neutral interop, we pretty damned well better have major vendors there and test it!
So, anyway let's make sure that it can actually solve the problems of lock in we scare people about. Both OpenTracing and OpenCensus can play a role here, as can other OSS projects like Zipkin and Micrometer.. as can commercial solutions.. as can ELK! Anyway, I hope I've added some value to this chat and apologize to the hype gods for not offering donuts. I now prepare myself for the heat of telling you stuff.