Skip to content

Instantly share code, notes, and snippets.

Show Gist options
  • Star 1 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save Qard/faad53ba2368db54c95828365751d7bc to your computer and use it in GitHub Desktop.
Save Qard/faad53ba2368db54c95828365751d7bc to your computer and use it in GitHub Desktop.

Why lack of thenables support in async_hooks is problematic

Any use of thenables within the ecosystem has a high risk of breaking internal context propagation. Linear context propagation is fine, it's only the points in the async call graph where execution branches and potentially converges later which are impacted. This might not seem like a huge deal at first, but not all control flow is linear, especially for observability and diagnostics systems. It's important to have state consistency from anywhere that code is reachable, otherwise eventually someone will try something unusual and break the system. When you're dealing with a userbase as large as Node.js, it doesn't take long for someone to do something unique and unexpected.

See below for a code example demonstrating the exact scenarios when context loss will and will not happen.

const assert = require('assert')
const { AsyncLocalStorage } = require('async_hooks')

const store = new AsyncLocalStorage()
const data = Symbol('verifier')

function thenable() {
  return {
    // This then(...) method, unless called directly with promise.then(...)
    // in user code, will be executed within a microtask that is IN NO WAY
    // wrapped by async_hooks. This means the execution async id of the task
    // will be 0 and therefore the trigger id of any async tasks run within
    // the task will also be 0.
    // 
    // Because the ids are zero at this point, async_hooks will think the
    // current resource is the top-level resource. This could result in
    // pollution of the top-level resource which could then propagate into all
    // future descending resources in the async execution tree, depending on
    // how your continuation local storage system is implemented. Fortunately
    // `AsyncLocalStorage` is immune to this, but most of the APM ecosystem
    // currently uses their own CLS and some may continue to do so due to
    // potential for additional resource-tracking insights.
    then(cb) {
      assert.equal(store.getStore(), data) // Fail
      setImmediate(cb)
    }
  }
}

// The following four `store.run(...)` samples will "upgrade" the thenable to
// a real promise within the unwrapped microtask mentioned previously.

// Await a thenable
store.run(data, async () => {
  assert.equal(store.getStore(), data) // Pass
  await thenable()

  // Linear propagation restored to the correct context
  assert.equal(store.getStore(), data) // Pass
})

// Returning a thenable in an async function
store.run(data, async () => {
  try {
    assert.equal(store.getStore(), data) // Pass
    return thenable()
  } finally {
    // Linear propagation restored to the correct context
    assert.equal(store.getStore(), data) // Pass
  }
})

// Resolving a thenable
store.run(data, () => {
  assert.equal(store.getStore(), data) // Pass
  Promise.resolve(thenable())

  // Linear propagation restored to the correct context
  assert.equal(store.getStore(), data) // Pass
})

// Returning a thenable in a then handler
store.run(data, () => {
  assert.equal(store.getStore(), data) // Pass
  Promise.resolve().then(() => thenable())

  // Linear propagation restored to the correct context
  assert.equal(store.getStore(), data) // Pass
})

Fixing thenables context loss with AsyncResource

The current "recommended" way to deal with context loss issues is using the AsyncResource API. To fix the thenable above, one might do something like:

function thenable() {
  const resource = new AsyncResource('thenable')
  return {
    then(cb) {
      resource.runInAsyncScope(() => {
        assert.equal(store.getStore(), data) // Pass
        setImmediate(cb)
      })
    }
  }
}

This approach will work, but this approach will not, and only situationally:

function thenable() {
  return {
    then(cb) {
      const resource = new AsyncResource('thenable')
      resource.runInAsyncScope(() => {
        assert.equal(store.getStore(), data) // Fail
        setImmediate(cb)
      })
    }
  }
}

In this simple example, the former is not any more difficult than the later, but in a more real-world situation, the point at which the resource needs to be created may be less obvious.

Incorrect AsyncResource uses

It's very easy to use AsyncResource incorrectly. The docs suggest subclassing but also show direct construction which can very often lead to issues.

Here are a few examples of incorrect AsyncResource uses I've seen due to it being unclear when inits happen and what the context would be at that point.

class Query {
  // Most modules I've seen do this. It dispatches the query to a
  // promise-returning method of a shared client connection.
  then(...fns) {
    // The async_hooks ids and resource stack are empty here so we get an
    // orphaned init and resource that doesn't link to anything useful.
    const resource = new AsyncResource('thing')
    return resource.runInAsyncScope(() => {
      // Not only that, any async activity from this client.query(...) call
      // will link to this orphaned resource and therefore will also be
      // detached from the expected context. All descending asyncrony will
      // occur within this phantom branch which connects to nothing therefore
      // all following activity an APM might be interested in capturing can no
      // longer be attributed to anything useful.
      return this.client.query(this).then(...fns)
    })
  }
}

That previous example is a fairly straightforward demonstration of the issue, and the most immediately obvious way to fix it would be to move the AsyncResource construction out of the then(...) method. A naive approach might be to just put it in a "terminal" method like end, run, execute, etc. However, it's easy to accidentally direct to that call from another unwrapped thenable.

class Query {
  // This looks fine right? If we call `query.execute()` in our user code we
  // will get the `AsyncResource` constructed in a valid context.
  execute() {
    const resource = new AsyncResource('thing')
    return resource.runInAsyncScope(() => {
      return this.client.query(this)
    })
  }

  // But then what if we just make a little alias...
  then(...fns) {
    // Nope. The context is gone here so it will also be gone in `execute`.
    return this.execute().then(...fns)
  }
}

Here's another mysterious one I've seen...

class Query {
  // If we invoke the promise method outside the `AsyncResource` and only
  // wrap the attachment of the `then(...)` handler, we actually get
  // _nothing_ linking to the `AsyncResource` context. The promise that
  // is created when calling `then(...)` links directly to the promise in
  // the outer context because the current design of async_hooks considers
  // the internal resolve of the parent promise to be the init, not the
  // attachment of a `then(...)` handler, even though that is producing a
  // new promise.
  then(...fns) {
    const result = this.client.query(this)

    const resource = new AsyncResource('thing')
    return resource.runInAsyncScope(() => {
      return result.then(...fns)
    })
  }
}

The difficult of finding the right place to create the AsyncResource might suggest that perhaps only the subclass form should be supported, however that imposes significant restrictions on how a developer can write their code. Most cases where AsyncResource is needed currently are in database modules which typically have their own class hierarchy that could make introducing another parent class a rather complicated task.

Not all control flow is linear

The reason this gap is larger than some might think is that control flow is not always linear. You can get many complicated diverging and converging systems in an async environment, and each branch produces its own chain of asyncronous activity. If there is a break in that chain, an entire branch of execution context can be lost.

Here's a generic exponential backoff system someone might build...

async function sendMessageToUserById(id, message) {
  // We have a backoff class that takes min and max timeout values and a task
  // to try and perform. When the timeout is reached, the task rejects with a
  // `Timeout` error to indicate to the user that they should try again.
  const attempt = new BackOff(100, 10000, async () => {
    const user = await User.findById()
    await user.sendMessage(message)
  })

  while (true) {
    try {
      // We await the instance to let it make an attempt at doing the task.
      await attempt
      // If it resolves, we are done and can break out of the loop.
      break
    } catch (err) {
      // Otherwise, we need to decide if we want to log and continue, or
      // reject with the given error.
      if (err instanceof Timeout) {
        console.warn(err.message)
      } else {
        throw err
      }
    }
  }
}

class Timeout extends Error {
  constructor(ms) {
    super(`Timed out after ${ms}ms, retrying...`)
  }
}

class BackOff {
  constructor (min, max, task) {
    this.min = min
    this.max = max
    this.task = task
    this.current = min
  }

  // Using thenables, we can have a reusable object to await until it succeeds.
  // Unfortunately, the context inside this `then(...)` method will be empty
  // if it is triggered via an await, `Promise.resolve(...)` or `then(...)`
  // handler return value. This _also_ means that _all_ activity in the task
  // function will link back to this null context, making all that activity
  // impossible to correlate to the execution of `sendMessageToUserById`.
  then(resolve, reject) {
    const backoff = this.current
    this.current = Math.min(this.max, this.current * 2)

    function done() {
      clearTimeout(timer)
      isDone = true
    }

    let isDone = false
    const timer = setTimeout(() => {
      reject(new Timeout(backoff))
      done()
    }, backoff)

    this.task().then(result => {
      if (isDone) return
      done()
      resolve(result)
    }, error => {
      if (isDone) return
      done()
      reject(error)
    })
  }
}

You can argue all you want about if the code is "bad practice" or not, but it doesn't change that people will write code like this simply because it is possible to do so.

It's a good idea to advocate for best practices in how people use Node.js, but because Node.js is a platform that relies heavily on the module ecosystem, we must consider what the impact would be if a major module happened to introduce behaviour like this. A crash is easy to diagnose, but context loss can be very tricky, especially when it comes as a result of someone else's code. Trust me, this is a headache you do not want to have. 😬

Database module AsyncResource status

About half of the popular database/cache libraries use native promises or are callback-only and expect users to promisify things themselves. In the callback-only case, there are often hundreds of different styled promisifying libraries which vary widely in how they achieve that, many doing it in a way that breaks the context.

The other half of popular database/cache libraries are known to be broken and have been for a long time.

Here's a few examples:

The approach from mongodb I think is a notable one to examine because we passed the cognitive burden of context management to the module developers and then they just passed it further along to the end-user.

Because the majority of Node.js apps use database or cache libraries and the majority of these libraries break context, this is a very significant issue. Both async_hooks and AsyncLocalStorage will produce incomplete context coverage in most applications.

So when are microtasks run exactly, and why is there no context?

The issue comes down to code structure in the InternalCallbackScope function. The function does a MicrotaskScope::PerformCheckpoint(...) call at a point which is just after the JavaScript execution scope has finished and the async_hooks "after" event has been fired along with the resource being removed from the resource stack. A naive approach might be to move this before those happen or use a DefaultTriggerAsyncIdScope to encapsulate that block. However, the problem is that it needs to be in execution scope, not trigger scope, and each individual microtask should have its own separate execution scope.

Currently we have a queueMicrotask function which does effectively the same thing as what happens with the thenable resolve, but that function explicitly wraps itself in an AsyncResource.

This wrap needs to exist for all microtasks, not just the ones we create directly.

How do we fix it?

Well, I shared a high-level way to fix it for AsyncLocalStorage. If you're reading this, you probably already know that was not received well. 😅

To fix it at a lower-level will be quite complicated, but the current theory I'm exploring is that we could provide our own custom MicrotaskQueue which wraps the EnqueueMicrotask(...) methods.

The custom EnqueueMicrotask(...) method would bind to the context of the last init to occur before the method was called and then inject microtasks around the true microtask to enter the execution context before it runs and exit the execution context after.

The other possibility I've investigated is if we could get V8 to emit the PromiseHook events for thenables, however I haven't been able to figure out a clear way to achieve that without breaking their existing internal use of it.

I would very much appreciate feedback on this idea or any guidance you may be able to provide. 🙂

@puzpuzpuz
Copy link

It's very easy to use AsyncResource incorrectly. The docs suggest subclassing but also show direct construction which can very often lead to issues.

This sounds like something that could be improved in the documentation.

Because the majority of Node.js apps use database or cache libraries and the majority of these libraries break context, this is a very significant issue. Both async_hooks and AsyncLocalStorage will produce incomplete context coverage in most applications.

The proposed fix (nodejs/node#33189) won't fix most problems with context loss in client libraries, as it's aimed towards a specific scenario. I find this part a bit misleading as it gives readers an impression that the fix resolves all problems with custom thenables, while it's not.

To fix it at a lower-level will be quite complicated, but the current theory I'm exploring is that we could provide our own custom MicrotaskQueue which wraps the EnqueueMicrotask(...) methods.

It would be great to see such low-level fix, as all async_hooks based code could benefit from it, not only ALS.

@puzpuzpuz
Copy link

A naive approach might be to just put it in a "terminal" method like end, run, execute, etc. However, it's easy to accidentally direct to that call from another unwrapped thenable.

This looks like an oversimplified judgement. A proper implementation should create and return a separate thenable instance per "terminal" operation. Such approach matches the way promises work and it's not naive at all. Integrating such thenables with AsyncResource will ensure correct context propagation.

@Qard
Copy link
Author

Qard commented May 11, 2020

Sure, but that's also a lot more work. Telling all the module authors out there "Hey, we don't like your code. You should rewrite all of this." is not going to go well. 😬

As I said before, the ecosystem is enormous and there's no way we can reliably steer all of it in the direction we want. The only reasonable approach I see is to do what we can to prevent apps from falling apart if they happen to be written in a way that's unexpected to us. That's why we had things like the restoration and extended deprecation of _headers. The users wanted to write code a certain way and when we broke their code, they got angry at us. 🤷

As for your note about that PR not fixing all the context issues, you are correct. It only fixes thenables, but that accounts for all the issues I've found in all the databases modules I listed short of one connection pooling issue in mongoose, but that's a more straightforward fix.

Improving the docs of AsyncResource would certainly be good too, but the problem is that most cases where it's needed, no one knows it's needed or that it even exists. Good docs help no one if they don't know there's something there they should be reading about. Usually when you go to the docs it's because you have a specific task in mind. You think "I want to read a file" and go to the fs docs, or "I want to make an http request" and go to the http docs. When the context breaks, it's usually not clear where or why, and it's usually broken by someone else's code so the author is not even aware of it. The number of times I've seen perfectly reasonable code break the context with no reason for the author to know or care about it is frankly uncountable. Reducing that number even just a bit would be a huge help to APM folks.

@AndreasMadsen
Copy link

I think some are misunderstandings regarding the use case for AsyncResource. AsyncResource is for changing the async graph, not fixing a broken graph. Async context should, with only a few exceptions such as process.on('exit') or native addons, never be lost. With "lost" and "broken graph" I mean that the executionAsyncId is set to 0.

Pointing to AsyncResource as a solution to fixing a broken graph, is fundamentally misguided. The entire point of async_hooks is that it solves the issue of async context, without requiring user/module-specific code.

@Qard Thanks for writing this up, I think it will help clarify things.

@puzpuzpuz
Copy link

puzpuzpuz commented May 13, 2020

I think some are misunderstandings regarding the use case for AsyncResource. AsyncResource is for changing the async graph, not fixing a broken graph. Async context should, with only a few exceptions such as process.on('exit') or native addons, never be lost. With "lost" and "broken graph" I mean that the executionAsyncId is set to 0.

Yes, that's correct. If you refer to https://gist.github.com/Qard/faad53ba2368db54c95828365751d7bc#gistcomment-3297617, that thread was about changing the graph in a CLS friendly manner, so there is no context loss.

Pointing to AsyncResource as a solution to fixing a broken graph, is fundamentally misguided. The entire point of async_hooks is that it solves the issue of async context, without requiring user/module-specific code.

Nobody disagrees with that as well. That's why a low-level fix for nodejs/node#22360 would be great.

@Qard
Copy link
Author

Qard commented May 13, 2020

@AndreasMadsen That's a good way to put it. Unfortunately AsyncResource often seems to instead be pushed as a way to fix the graph, especially since it applies globally rather than an indivudual hook subscriber being able to say "I want hooks, but with these couple adjustments". I agree that it should not be pushed as a way to patch the holes, but in-practice that seems to be how it's getting used.

In an ideal world, the language would provide this functionality and would have a way to ask to bind a context to the point a function is called or the point a function is defined, which is basically what AsyncResource is abstractly trying to achieve, albeit not well. There's a lot of complicated AST and lexical analysis required to draw a path back to all the possible points a context consumer could care about. They could care about where a callback gets called, they could care about where a callback is given to the most user-facing async function, or they might care about where that function is declared. With anonymous functions the last two are typically the same place, but that's not necessarily the case and is basically what the connection pooling problem is that led to needing AsyncResource to bind a callback to the point it was given not to the point it was called.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment