hazelweakly/o11y-thoughts.md

## o11y-thoughts.md

      
The tl;dr would be "otel-js should look much more like tracing from the rust ecosystem or otel from ruby, and less like otel from the java ecosystem."
I think a lot of the friction of otel-js comes from two main areas: not writing a library that takes advantage of most modern conveniences, and writing one that doesn't take into account most people's use cases.
Some high level points:

JS essentially has two main splits, giving 4 camps: front-end vs back-end, and toolchain vs "raw JS". Importantly, these are often not exclusive camps.
Many JS codebases run on the frontend and the backend with the same code; this is extremely difficult currently and brings almost all of the pain points into full display immediately.
Ergonomically speaking, the library feels like a very low-level library for building a real otel library out of, but it isn't, and there isn't a high level one.
Lastly, the most important style of documentation here is going to be (imo) copy-pasta cookbooks; this is almost entirely missing from otel-js.

I'm going to use a running example of an isomorphic codebase using a toolchain because that's getting increasingly common and also runs into every single ergonomics issue at the same time.
Let's say you have:
import { Button } from "design-system";
// 25 random otel imports go here

const App = () => {
  // tracing goes here lol
  return (
    <div>
      <Button onClick={doWork}>click me </Button>
    </div>
  );
};

const doWork = (event) => {
  // tracing also goes here
  return;
};
I'm going to point out the immediate issues:

This code will run on the server, and the browser. Any code that expects to only run on one of those is going to have a bad time.
This code gets transpiled, minified, manipulated, and so on. Any code that expects global stuff to get run "coherently" is going to have a bad time.
This code is invalid but will probably work. Why? Because it's possible for transpilers to detect usage of anonymous functions and rearrange code so that it's valid according to the spec.

You can't rely on a language spec for dictating allowed behavior. You rely on what browsers and Node.js actually do and how various compilers/transpilers mangle your code in reality. Implementation dictates reality.


This code uses ESM imports which work differently than requires. The code may be mangled to be transformed into one of 3-5 different import/export module bundling formats which may or may not use ESM import syntax or node require syntax.

Basically, give up. Do not assume your code can be MITM screwed with, do not assume require can be overloaded, do not assume imports can be messed with, do not assume they're loaded in a controlled (or consistent) order, or that they're in separate files, or...
Consequently, there is a huge emphasis in the toolchain ecosystem of JS for having "side effect free" modules (aka "code that is well behaved regardless of any transforms that are done to it or its module layout").
Unfortunately, all of the automatic instrumentation in JS assumes side effectful code that isn't transpiled; it will break, it will do so inconsistently and unpredictably, it will appear to work until it doesn't, do not rely on it for anything important, ideally you shouldn't use it at all.


Configuring this code to be traced appropriately is almost impossible:

How do I get server only globals instantiated correctly? What about browser only globals? Good luck. I currently do this in a way that doesn't break any transpilers and results in code that "runs" but it has subtle errors in it. I currently have to just ignore them.
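One way to sidestep the "which globals exist here?" question is to keep the setup module free of import-time side effects and decide what to run at call time. A minimal sketch, assuming hypothetical `detectRuntime` and `setupTracing` helpers (neither is part of otel-js) and caller-supplied `web` and `server` setup functions:

```javascript
// Hypothetical helpers, not otel-js API. Nothing runs at import time;
// the global object is passed in so the logic is easy to test.
function detectRuntime(globalObject = globalThis) {
  const hasDom =
    typeof globalObject.window !== "undefined" &&
    typeof globalObject.document !== "undefined";
  if (hasDom) return "browser";

  const proc = globalObject.process;
  if (proc && proc.versions && proc.versions.node) return "server";

  return "unknown";
}

// Runs exactly one of the supplied setup functions, or none at all.
function setupTracing({ web, server }, globalObject = globalThis) {
  const runtime = detectRuntime(globalObject);
  if (runtime === "browser") return web();
  if (runtime === "server") return server();
  return null; // unknown runtime: do nothing rather than guess
}
```

Environments that fake a DOM inside Node (jsdom, for example) would be reported as "browser" here, so treat the detection order as an assumption to revisit.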


Now let's instrument it
// this code needs to be on the top to run correctly.
// however, good luck making sure it's actually at the top of the generated javascript. oh well. good luck.
// this is actually invalid. Imports must come before requires and they can't be mixed.
// you can usually force a bundler to do "something" and make it work. But, well, it does "something"
// require('./tracing.web'); // this code is surrounded by if window !== undefined. It mostly works.
// require('./tracing.server'); // this code is surrounded by if window === undefined. It mostly works.
//
// consequently, we use imports here because the transpiler will yell at us
// anyway and forcing it is more error prone than not forcing it... probably?
import "./tracing.web"; // i sure hope these enabled async support correctly
import "./tracing.server"; // i sure hope these enabled async support correctly

import { Button } from "design-system";
import tracing from "...";
import SemanticConstantsAndSTuff from "something else";
import otelApiThings from "the api i guess";
// 25 random otel imports go here

const App = () => {
  const theGlobal = tracing.getTracer(
    "i hope this isn't a typo lol",
    "some version number goes here"
  );
  const span = theGlobal.getActiveContext(); // might be wrong, you should check the docs and pick one of 4 different ways to write this
  // 4 other lines of setting up context, wiring it together, gluing it into a span, etc. Dealing with conditionally missing return values,...

  // how do I pass span to `doWork`?  jokes. you don't. refactor all your code to be tracing aware.
  // return (
  //   <div>
  //     <Button onClick={doWork(span)}>click me </Button>
  //   </div>
  // );
  //
  // except that won't work correctly. You really want
  const doWorkInSpan = theGlobal.startActiveSpan(
    name,
    bunchaOptionalStuff,
    (span) => doWork(span)
  );
  // except if doWorkInSpan might outlive the lifetime of the button function.
  // in which case you want to link it together rather than starting a nested span.
  return (
    <div>
      <Button onClick={doWorkInSpan}>click me </Button>
    </div>
  );
};

const doWork = (span) => (event) => {
  span.addEvent("event happened");
  span.addAttributes(); // this is actually 20 lines of code with no convenient object syntax shorthands whatsoever.
  return 4; // a random number, chosen by fair dice roll
};
You can see where my biases are, now, and a lot of the frustrations I have with the API.
I hope the sarcasm comes across here as empathetic rather than rude and biting, because that's the intent; API design is difficult, and choosing how to make a cross-language spec feel "language native" is REALLY hard.
So yeah, the big issues I see here are that, ergonomically speaking, the library feels like very low level code for building a real otel library out of, but it fails to actually do so.
The amount of imports required, the amount of syntactic overhead, the amount of noise, etc., really pushes you towards massive monolithic functions.
It feels like this sort of code would be semi ergonomic inside classy javascript where you create these super large classes and then OOP the fuck out of them.
More functional style programming doesn't get to benefit from the type of code reuse that would reduce the boilerplate here and it's super obvious.
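As an illustration of the kind of small, reusable helper that would cut the attribute boilerplate, here is a sketch that flattens a nested object into dotted attribute keys. `flattenAttributes` is a hypothetical name, not something otel-js exports, and the example values are made up:

```javascript
// Hypothetical helper: turns nested objects into flat, dotted attribute keys.
// Arrays and primitives are kept as-is; null and undefined are dropped.
function flattenAttributes(value, prefix = "", out = {}) {
  const isPlainObject =
    value !== null && typeof value === "object" && !Array.isArray(value);

  if (isPlainObject) {
    for (const [key, nested] of Object.entries(value)) {
      flattenAttributes(nested, prefix ? `${prefix}.${key}` : key, out);
    }
  } else if (value !== undefined && value !== null && prefix) {
    out[prefix] = value;
  }
  return out;
}

const attributes = flattenAttributes({
  "app.button.onClick": { target: "BUTTON#save", value: "42" },
  "app.user": { plan: "free", beta: false },
});
// attributes now has four flat keys, e.g. "app.button.onClick.target"
```

The result is a plain object, so it could be handed to whatever "set many attributes at once" call the tracing API offers.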
Lastly, async needs to be a thing. You have two choices because of how adding it into your code has to work: a) give up and get nonsense traces when async gets involved, b) take the non-negligible performance hit. For basically all apps, b is the correct and only take.
Consequently, enabling async should be the default and opt-out rather than opt-in.
It's worthwhile to note that if you're using modern Angular, or most reactive JS frameworks, you've already taken this performance hit (or bigger ones).
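To make option (a) concrete, here is a small sketch of how "nonsense traces" appear when the active span is tracked in a single shared variable and two async tasks interleave. It uses no otel-js API; `currentSpan`, `handle`, and `demo` are hypothetical names, and the `gate` promise stands in for any I/O wait:

```javascript
// Naive approach: one module-level "active span" shared by everything.
let currentSpan = null;

async function handle(requestName, gate, log) {
  currentSpan = requestName; // "start" a span for this request
  await gate;                // stand-in for a network call or timer
  // After the await, another request may have overwritten currentSpan.
  log.push({ request: requestName, attributedTo: currentSpan });
}

async function demo() {
  const log = [];
  let open;
  const gate = new Promise((resolve) => { open = resolve; });

  const a = handle("A", gate, log); // sets currentSpan = "A", then waits
  const b = handle("B", gate, log); // sets currentSpan = "B", then waits
  open();
  await Promise.all([a, b]);
  return log; // request "A" ends up attributed to span "B"
}
```

Avoiding this requires carrying the active context across each `await`, which is where the performance cost of option (b) comes from. On Node, `AsyncLocalStorage` from `node:async_hooks` is the built-in mechanism for that kind of propagation.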
Another thing that might not be immediately obvious but has a very large effect on the code is that it doesn't take advantage of any of javascript's actual capabilities or metaprogramming facilities to reduce the amount of boilerplate required.
But... You shouldn't actually use javascript's metaprogramming facilities, as a library author. They're too unreliable. Compile time metaprogramming is the only consistent way to make it work, and even then, it'll be tricky.
Really, the big ergonomic gains are going to come from using better library authoring tooling and taking advantage of toolchains to provide additional layers of convenience.
That said, a few API choices can make a big difference.
This is heavily modeled after ruby's way of doing things which, tbh, feels very ergonomic and possible in javascript since they're pretty similar in capabilities here (I fake the "do" syntax feel here with promises, but classes could work too; using promises is semantically closer though).
// This is what I think is possible without taking advantage of any toolchains. Ie, it should be possible to achieve this UX in un-transpiled "raw" browser+server JS.

// import or require works fine.
import "./tracing-setup"; // async enabled by default. This code is still fairly verbose and boilerplate-y
import { span, SemanticConventions as SC } from "otel"; // contains _everything_
import { Button } from "design-system";

const App = () => {
  // Using promises to get automatic span ending.
  return span.inSpan("app").then(() => (
    <div>
      <Button onClick={doWork}>click me </Button>
    </div>
  ));
};

// first way
const doWork = (event) => {
  span.startSpan("doWork");
  span.addEvent("event happened");
  span.addAttributes({ "app.button.onClick": { target: event.target, value: event.target.value } });
  span.end();
  return 4; // a random number, chosen by fair dice roll
};

// other way
const doWork = (event) => span.inSpan('doWork', {
  "app.button.onClick": { target: event.target, value: event.target.value }
}).then(() => {
  span.addEvent("event happened");
  return 4; // a random number, chosen by fair dice roll
});
That feels a lot better to me. I'd really like to see that be possible and remove as many footguns as possible from the API and then clearly document all of the basic and less basic use cases in a cookbook style format that people can just copy and paste and have "just work". That'd be awesome.
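For a feel of how little machinery the "span ends itself" part needs, here is a toy sketch of an `inSpan` helper over an in-memory tracer. Everything in it (`inSpan`, `addEvent`, `finished`, `open`) is hypothetical and not otel-js API. It takes a callback instead of returning a promise, and it only tracks the current span synchronously, so it deliberately ignores the async propagation problem discussed above:

```javascript
// Toy in-memory tracer. Synchronous only: the "current span" is a stack.
const finished = []; // spans that have ended, in the order they ended
const open = [];     // spans currently in progress

function addEvent(name) {
  const current = open[open.length - 1];
  if (current) current.events.push(name);
}

// Starts a span, runs `body`, and always ends the span, even if `body` throws.
function inSpan(name, attributes, body) {
  const parent = open[open.length - 1];
  const span = {
    name,
    attributes,
    events: [],
    parent: parent ? parent.name : null,
  };
  open.push(span);
  try {
    return body(span);
  } finally {
    open.pop();
    finished.push(span);
  }
}

const doWork = (event) =>
  inSpan("doWork", { "app.button.onClick.value": event.target.value }, () => {
    addEvent("event happened");
    return 4;
  });
```

Calling `doWork({ target: { value: "x" } })` returns the callback's result and leaves one finished span carrying the event, with no explicit `end()` call at the call site.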
Lastly, there's also the elephant in the room: javascript in javascript is not the only option or even potentially the most desirable one. The JS toolchain ecosystem has a million different code generation and AST transformation options out there and they all are largely compatible with each other.
It's entirely possible to make an incredibly ergonomic API that's opt-in if you use a code transpiler; we have JSX after all. If JS can figure out how to turn fake XML into code, I'm pretty sure we can figure out how to get a more ergonomic API for JS.
For example, here's a really cool API I'd like to see, inspired by tracing in rust, which, imo, is the gold standard of an ergonomic API that doesn't hide details but also makes things really pleasant to use.
I'm going to pleasantly assume decorators get accepted as is; this is a highly optimistic scenario.
// This is what I think is possible with transpiling.

// import or require works fine.
// import of a tracing setup is not required.
// You can configure it with a tracing.config.js file that gets consumed at build time.
import { span, SemanticConventions as SC } from "otel"; // contains _everything_
import { Button } from "design-system";
Not having to write and wire in a tracing setup is a big improvement already.
// option 1: automatic span injection
const App = () => {
  return (
    <div>
      <Button onClick={doWork}>click me </Button>
    </div>
  );
};
// turns into span.inSpan('App').then ...

// By default, adds attributes to the event:
// - the namespace defaults to "app.button.onClick.doWork" (the "codepath" of the function)
// - all the function arguments are added with the values serialized to JSON
const doWork = (event) => {
  span.addEvent("event happened");
  return 4; // a random number, chosen by fair dice roll
};
// option 2: decorators
@instrument
const App = () => {
  return (
    <div>
      <Button onClick={doWork}>click me </Button>
    </div>
  );
};

// By default, adds attributes to the event:
// - the namespace defaults to "app.button.onClick.doWork" (the "codepath" of the function)
// - all the function arguments are added with the values serialized to JSON
@event("event happened") // shorthand for @instrument({event: "event happened"})
const doWork = (event) => 4; // a random number, chosen by fair dice roll
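Until decorator syntax on plain functions is available through a transpiler, a similar effect can be approximated with a higher-order function. The sketch below is an assumption-laden stand-in: `instrument`, `recorded`, and the `code.arguments` attribute key are hypothetical, and spans are just objects pushed onto an array:

```javascript
// Hypothetical stand-in for @instrument / @event: wraps a function so each
// call records a span-like object with JSON-serialized arguments.
const recorded = [];

function instrument(fn, options = {}) {
  const name = options.name || fn.name || "anonymous";
  return function instrumented(...args) {
    const span = {
      name,
      // Assumes arguments are JSON-serializable; real DOM events are not.
      attributes: { "code.arguments": JSON.stringify(args) },
      events: options.event ? [options.event] : [],
    };
    try {
      return fn.apply(this, args);
    } finally {
      recorded.push(span); // the span "ends" whether fn returns or throws
    }
  };
}

const doWork = instrument((event) => 4, {
  name: "doWork",
  event: "event happened",
});
```

This keeps the function body free of tracing code, which is the main ergonomic win the decorator version is after. The explicit `name` option is needed here because an inline arrow function has no name to infer.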
Probably one of the most important things transpiling would do would be to take care of context propagation for you, and allow for a much more graceful handling of fork/join patterns.

Between these two, you shouldn't really ever have to write manual links anymore:

Dangling promises can be detected and converted into links.
Fire and forget (void functions) can have links added to them in case they outlive the parent.


Baggage can be transparently dealt with implicitly.
Auto-instrumentation can be added "correctly" at compile time for libraries using AST transformation. No performance hit required, and it would be much more robust.
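The child-versus-link decision described above can be sketched with a toy data model. This is not otel-js API: `startSpan` and the `awaited` flag are hypothetical, and the flag stands in for what a compile-time transform could infer from whether a call is awaited:

```javascript
// Toy model: awaited work becomes a child span, fire-and-forget work becomes
// a separate root span that links back to where it was started.
let nextId = 1;

function startSpan(name, origin, { awaited }) {
  const span = { id: nextId++, name, parentId: null, links: [] };
  if (origin) {
    if (awaited) {
      span.parentId = origin.id;   // joins before the origin ends: nest it
    } else {
      span.links.push(origin.id);  // may outlive the origin: link it
    }
  }
  return span;
}

const click = startSpan("click", null, { awaited: true });
const save = startSpan("save", click, { awaited: true });
const analytics = startSpan("sendAnalytics", click, { awaited: false });
```

Here `save` is nested under `click`, while `analytics` has no parent and carries a link to `click`, so it can finish later without distorting the parent span.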


Some stream of consciousness thoughts about the larger opentelemetry library ecosystem:
Thinking of stuff from a broader perspective, what's really happening here imo is we're trying to make an ergonomic DSL for constructing an annotated embedded DAG in the codebase with as little syntactic noise as possible. This is... Tricky. The ergonomics will essentially demand a per language approach, but since a lot of languages are pretty similar, I think it'll be decently approachable to create a couple different ways of thinking about it and going from there.

Decorators with automatic propagation and some sort of reflection or build time metaprogramming are probably the most ergonomic method of doing this. Languages which can do this probably should. Java, C#, JavaScript + transpiling, Rust...
Languages which support ergonomic monad syntax will really benefit from an inSpan method like ruby's. Ruby, JavaScript, Haskell, etc, all fall in here.
Languages with ergonomic scope blocking benefit from taking advantage of that. Rust can do this, C# has "using", and python has "with".

Runtime analysis or build time static code path analysis will be required for making links more ergonomic. However, since there's a "collector" in the code, you always have runtime code tracking stuff anyway, so it's probably possible to do some automatic annotations and tracking (perhaps in debug mode only?) to get rudimentary fork/join style detection and tracing so that people can write their code as a tree but if it turns into a full DAG, that structure can maybe be inferred at runtime to some extent.
Trees are syntactically convenient in code structure and can be automatically built, DAGs are ugly.
The more I think about it, the more I think the major ergonomic challenges essentially all revolve around pruning and annotating the tree, representing DAGs, and propagating enriched context across process boundaries. Our inability to cleanly solve the first one feels like the "reason" so many libraries end up being very syntactically noisy and heavy.
Well, that and a lot of library ecosystems don't really let you (conveniently) build a tree implicitly in a codebase. Javascript is a notable exception to that, but most approaches to implicitly building it incur a fairly solid runtime performance hit.

When answering "can we build a high level library on top of the existing otel-js stuff":
There's a few things that are going to be non trivial to address to really boost ergonomics, I think.

tree shaking so that you can have a single import for everything relevant
the use of globals and how they're implemented makes isomorphic javascript extremely difficult to get right
the very strong assumption that code is loaded essentially in a "this global code is run once and then all the other stuff runs in linear order" just doesn't work with code bundling and transpiling. It also really really doesn't work with SPA style codebases running in the browser

If it weren't for those I'd assume a higher level library could be written comprised mostly of re-exports and higher level utility functions, but fixing some of those requires a literal rewriting of how things work under the hood.
The biggest one, though, is that auto-instrumentation is fundamentally broken in javascript and you just simply can't do it correctly; it has to be a transpilation step or a fully different approach.
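On the load-order assumption specifically, one pattern that avoids depending on "setup ran first" is lazy, idempotent initialization: nothing happens at import time, and the first caller triggers setup. A minimal sketch, with `lazyOnce` and the tracer object as hypothetical placeholders rather than otel-js API:

```javascript
// Hypothetical helper: defers initialization until first use and runs it once.
function lazyOnce(init) {
  let value;
  let done = false;
  return () => {
    if (!done) {
      value = init(); // if init throws, done stays false and the next call retries
      done = true;
    }
    return value;
  };
}

let initCalls = 0;
const getTracer = lazyOnce(() => {
  initCalls += 1;
  return { name: "app-tracer" }; // placeholder for real tracer setup
});
```

This does not help if a bundler duplicates the module itself, since each copy would hold its own state. That case needs a different fix.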