Qard/why-lack-of-thenables-support-in-async_hooks-is-problematic.md Secret

## why-lack-of-thenables-support-in-async_hooks-is-problematic.md

      
    Raw
  

              why-lack-of-thenables-support-in-async_hooks-is-problematic.md
            
          
    Why lack of thenables support in async_hooks is problematic

Any use of thenables within the ecosystem has a high risk of breaking internal
context propagation. Linear context propagation is fine, it's only the points
in the async call graph where execution branches and potentially converges later
which are impacted. This might not seem like a huge deal at first, but not all
control flow is linear, especially for observability and diagnostics systems.
It's important to have state consistency from anywhere that code is reachable,
otherwise eventually someone will try something unusual and break the system.
When you're dealing with a userbase as large as Node.js, it doesn't take long
for someone to do something unique and unexpected.
See below for a code example demonstrating the exact scenarios when context
loss will and will not happen.
const assert = require('assert')
const { AsyncLocalStorage } = require('async_hooks')

const store = new AsyncLocalStorage()
const data = Symbol('verifier')

function thenable() {
  return {
    // This then(...) method, unless called directly with promise.then(...)
    // in user code, will be executed within a microtask that is IN NO WAY
    // wrapped by async_hooks. This means the execution async id of the task
    // will be 0 and therefore the trigger id of any async tasks run within
    // the task will also be 0.
    // 
    // Because the ids are zero at this point, async_hooks will think the
    // current resource is the top-level resource. This could result in
    // pollution of the top-level resource which could then propagate into all
    // future descending resources in the async execution tree, depending on
    // how your continuation local storage system is implemented. Fortunately
    // `AsyncLocalStorage` is immune to this, but most of the APM ecosystem
    // currently uses their own CLS and some may continue to do so due to
    // potential for additional resource-tracking insights.
    then(cb) {
      assert.equal(store.getStore(), data) // Fail
      setImmediate(cb)
    }
  }
}

// The following four `store.run(...)` samples will "upgrade" the thenable to
// a real promise within the unwrapped microtask mentioned previously.

// Await a thenable
store.run(data, async () => {
  assert.equal(store.getStore(), data) // Pass
  await thenable()

  // Linear propagation restored to the correct context
  assert.equal(store.getStore(), data) // Pass
})

// Returning a thenable in an async function
store.run(data, async () => {
  try {
    assert.equal(store.getStore(), data) // Pass
    return thenable()
  } finally {
    // Linear propagation restored to the correct context
    assert.equal(store.getStore(), data) // Pass
  }
})

// Resolving a thenable
store.run(data, () => {
  assert.equal(store.getStore(), data) // Pass
  Promise.resolve(thenable())

  // Linear propagation restored to the correct context
  assert.equal(store.getStore(), data) // Pass
})

// Returning a thenable in a then handler
store.run(data, () => {
  assert.equal(store.getStore(), data) // Pass
  Promise.resolve().then(() => thenable())

  // Linear propagation restored to the correct context
  assert.equal(store.getStore(), data) // Pass
})
Fixing thenables context loss with AsyncResource

The current "recommended" way to deal with context loss issues is using the
AsyncResource API. To fix the thenable above, one might do something like:
function thenable() {
  const resource = new AsyncResource('thenable')
  return {
    then(cb) {
      resource.runInAsyncScope(() => {
        assert.equal(store.getStore(), data) // Pass
        setImmediate(cb)
      })
    }
  }
}
This approach will work, but this approach will not, and only situationally:
function thenable() {
  return {
    then(cb) {
      const resource = new AsyncResource('thenable')
      resource.runInAsyncScope(() => {
        assert.equal(store.getStore(), data) // Fail
        setImmediate(cb)
      })
    }
  }
}
In this simple example, the former is not any more difficult than the later,
but in a more real-world situation, the point at which the resource needs to
be created may be less obvious.
Incorrect AsyncResource uses

It's very easy to use AsyncResource incorrectly. The docs suggest subclassing
but also show direct construction which can very often lead to issues.
Here are a few examples of incorrect AsyncResource uses I've seen due to it
being unclear when inits happen and what the context would be at that point.
class Query {
  // Most modules I've seen do this. It dispatches the query to a
  // promise-returning method of a shared client connection.
  then(...fns) {
    // The async_hooks ids and resource stack are empty here so we get an
    // orphaned init and resource that doesn't link to anything useful.
    const resource = new AsyncResource('thing')
    return resource.runInAsyncScope(() => {
      // Not only that, any async activity from this client.query(...) call
      // will link to this orphaned resource and therefore will also be
      // detached from the expected context. All descending asyncrony will
      // occur within this phantom branch which connects to nothing therefore
      // all following activity an APM might be interested in capturing can no
      // longer be attributed to anything useful.
      return this.client.query(this).then(...fns)
    })
  }
}
That previous example is a fairly straightforward demonstration of the issue,
and the most immediately obvious way to fix it would be to move the
AsyncResource construction out of the then(...) method. A naive approach
might be to just put it in a "terminal" method like end, run, execute,
etc. However, it's easy to accidentally direct to that call from another
unwrapped thenable.
class Query {
  // This looks fine right? If we call `query.execute()` in our user code we
  // will get the `AsyncResource` constructed in a valid context.
  execute() {
    const resource = new AsyncResource('thing')
    return resource.runInAsyncScope(() => {
      return this.client.query(this)
    })
  }

  // But then what if we just make a little alias...
  then(...fns) {
    // Nope. The context is gone here so it will also be gone in `execute`.
    return this.execute().then(...fns)
  }
}
Here's another mysterious one I've seen...
class Query {
  // If we invoke the promise method outside the `AsyncResource` and only
  // wrap the attachment of the `then(...)` handler, we actually get
  // _nothing_ linking to the `AsyncResource` context. The promise that
  // is created when calling `then(...)` links directly to the promise in
  // the outer context because the current design of async_hooks considers
  // the internal resolve of the parent promise to be the init, not the
  // attachment of a `then(...)` handler, even though that is producing a
  // new promise.
  then(...fns) {
    const result = this.client.query(this)

    const resource = new AsyncResource('thing')
    return resource.runInAsyncScope(() => {
      return result.then(...fns)
    })
  }
}
The difficult of finding the right place to create the AsyncResource might
suggest that perhaps only the subclass form should be supported, however
that imposes significant restrictions on how a developer can write their code.
Most cases where AsyncResource is needed currently are in database modules
which typically have their own class hierarchy that could make introducing
another parent class a rather complicated task.
Not all control flow is linear

The reason this gap is larger than some might think is that control flow is not
always linear. You can get many complicated diverging and converging systems in
an async environment, and each branch produces its own chain of asyncronous
activity. If there is a break in that chain, an entire branch of execution
context can be lost.
Here's a generic exponential backoff system someone might build...
async function sendMessageToUserById(id, message) {
  // We have a backoff class that takes min and max timeout values and a task
  // to try and perform. When the timeout is reached, the task rejects with a
  // `Timeout` error to indicate to the user that they should try again.
  const attempt = new BackOff(100, 10000, async () => {
    const user = await User.findById()
    await user.sendMessage(message)
  })

  while (true) {
    try {
      // We await the instance to let it make an attempt at doing the task.
      await attempt
      // If it resolves, we are done and can break out of the loop.
      break
    } catch (err) {
      // Otherwise, we need to decide if we want to log and continue, or
      // reject with the given error.
      if (err instanceof Timeout) {
        console.warn(err.message)
      } else {
        throw err
      }
    }
  }
}

class Timeout extends Error {
  constructor(ms) {
    super(`Timed out after ${ms}ms, retrying...`)
  }
}

class BackOff {
  constructor (min, max, task) {
    this.min = min
    this.max = max
    this.task = task
    this.current = min
  }

  // Using thenables, we can have a reusable object to await until it succeeds.
  // Unfortunately, the context inside this `then(...)` method will be empty
  // if it is triggered via an await, `Promise.resolve(...)` or `then(...)`
  // handler return value. This _also_ means that _all_ activity in the task
  // function will link back to this null context, making all that activity
  // impossible to correlate to the execution of `sendMessageToUserById`.
  then(resolve, reject) {
    const backoff = this.current
    this.current = Math.min(this.max, this.current * 2)

    function done() {
      clearTimeout(timer)
      isDone = true
    }

    let isDone = false
    const timer = setTimeout(() => {
      reject(new Timeout(backoff))
      done()
    }, backoff)

    this.task().then(result => {
      if (isDone) return
      done()
      resolve(result)
    }, error => {
      if (isDone) return
      done()
      reject(error)
    })
  }
}
You can argue all you want about if the code is "bad practice" or not, but it
doesn't change that people will write code like this simply because it is
possible to do so.
It's a good idea to advocate for best practices in how people use Node.js, but
because Node.js is a platform that relies heavily on the module ecosystem, we
must consider what the impact would be if a major module happened to introduce
behaviour like this. A crash is easy to diagnose, but context loss can be very
tricky, especially when it comes as a result of someone else's code. Trust me,
this is a headache you do not want to have. 😬
Database module AsyncResource status

About half of the popular database/cache libraries use native promises or are
callback-only and expect users to promisify things themselves. In the
callback-only case, there are often hundreds of different styled promisifying
libraries which vary widely in how they achieve that, many doing it in a way
that breaks the context.
The other half of popular database/cache libraries are known to be broken and
have been for a long time.
Here's a few examples:

mongodb decided to let the user deal with it
mongoose rejected AsyncResource due to experimental status
pg has no support
knex query builder has no support

The approach from mongodb I think is a notable one to examine because we
passed the cognitive burden of context management to the module developers
and then they just passed it further along to the end-user.
Because the majority of Node.js apps use database or cache libraries and
the majority of these libraries break context, this is a very significant
issue. Both async_hooks and AsyncLocalStorage will produce incomplete
context coverage in most applications.
So when are microtasks run exactly, and why is there no context?

The issue comes down to code structure in the InternalCallbackScope function.
The function does a
MicrotaskScope::PerformCheckpoint(...)
call at a point which is just after the JavaScript execution scope has finished
and the async_hooks "after" event has been fired along with the resource being
removed from the resource stack. A naive approach might be to move this before
those happen or use a DefaultTriggerAsyncIdScope to encapsulate that block.
However, the problem is that it needs to be in execution scope, not trigger
scope, and each individual microtask should have its own separate execution
scope.
Currently we have a queueMicrotask function which does effectively the same
thing as what happens with the thenable resolve, but that function
explicitly wraps itself in an AsyncResource.
This wrap needs to exist for all microtasks, not just the ones we create
directly.
How do we fix it?

Well, I shared a high-level way to fix it for AsyncLocalStorage. If you're
reading this, you probably already know that was not received well. 😅
To fix it at a lower-level will be quite complicated, but the current theory
I'm exploring is that we could provide our own custom MicrotaskQueue which
wraps the EnqueueMicrotask(...) methods.
The custom EnqueueMicrotask(...) method would bind to the context of the
last init to occur before the method was called and then inject microtasks
around the true microtask to enter the execution context before it runs
and exit the execution context after.
The other possibility I've investigated is if we could get V8 to emit the
PromiseHook events for thenables, however I haven't been able to figure
out a clear way to achieve that without breaking their existing internal
use of it.
I would very much appreciate feedback on this idea or any guidance you may
be able to provide. 🙂