@schmichael
Last active June 21, 2018 18:39
Global Context Trees for Services

aka Service Lifecycle Contexts

Request-scoped Contexts

Request-scoped contexts are unambiguously good. Other than a brief mention of main(), they're the only use case covered by the official context announcement and documentation. Every single feature of contexts makes sense in a request/response scenario:

  • Cancellation provides a unified API for canceling work whose result is no longer needed.
  • Deadlines and timeouts provide a unified API for preventing requests from blocking indefinitely.
  • Values provide a unified API for tracing and other request-scoped data without expecting all libraries and frameworks to be aware of their types.
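As a rough sketch of the three features above working together in one request, consider the following. The handle function, the traceKey type, and the "abc123" trace ID are hypothetical names invented for illustration, not part of any real service:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// traceKey is a hypothetical unexported key type, so values stored under it
// cannot collide with keys defined by other packages.
type traceKey struct{}

// handle simulates request-scoped work that needs workDur to finish. It
// returns early with ctx.Err() if the request is canceled or its deadline
// passes first.
func handle(ctx context.Context, workDur time.Duration) (string, error) {
	id, _ := ctx.Value(traceKey{}).(string) // request-scoped value
	select {
	case <-time.After(workDur):
		return "done trace=" + id, nil
	case <-ctx.Done(): // cancellation or deadline
		return "", ctx.Err()
	}
}

// demo runs one simulated request and reports the outcome. Both arguments
// are in milliseconds.
func demo(timeoutMs, workMs int) string {
	ctx := context.WithValue(context.Background(), traceKey{}, "abc123")
	ctx, cancel := context.WithTimeout(ctx, time.Duration(timeoutMs)*time.Millisecond)
	defer cancel()

	res, err := handle(ctx, time.Duration(workMs)*time.Millisecond)
	if errors.Is(err, context.DeadlineExceeded) {
		return "timed out"
	}
	if err != nil {
		return err.Error()
	}
	return res
}

func main() {
	fmt.Println(demo(1000, 1)) // work finishes well inside the deadline
	fmt.Println(demo(1, 1000)) // deadline fires long before the work finishes
}
```

Note that handle never needs to know who set the deadline or the trace ID, which is what makes the API feel natural for requests.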

Pipeline cancellation is another documented use of contexts, but it can be viewed as a form of chained/continuous request/responses, so all of the same arguments and principles for request-scoped context apply to pipelines as well.
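A minimal sketch of pipeline cancellation, assuming two hypothetical stages (gen and square) that each stop as soon as the shared context is canceled:

```go
package main

import (
	"context"
	"fmt"
)

// gen emits 0, 1, 2, ... until ctx is canceled, then closes its output.
func gen(ctx context.Context) <-chan int {
	out := make(chan int)
	go func() {
		defer close(out)
		for i := 0; ; i++ {
			select {
			case out <- i:
			case <-ctx.Done():
				return
			}
		}
	}()
	return out
}

// square reads from in and emits each value squared until in is closed or
// ctx is canceled.
func square(ctx context.Context, in <-chan int) <-chan int {
	out := make(chan int)
	go func() {
		defer close(out)
		for v := range in {
			select {
			case out <- v * v:
			case <-ctx.Done():
				return
			}
		}
	}()
	return out
}

// firstN takes n results, cancels the pipeline, and then drains the output
// until it is closed, which only happens once the last stage has exited.
func firstN(n int) []int {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	out := square(ctx, gen(ctx))
	got := make([]int, 0, n)
	for i := 0; i < n; i++ {
		got = append(got, <-out)
	}

	cancel()
	for range out {
		// discard anything sent before the stage noticed the cancellation
	}
	return got
}

func main() {
	fmt.Println(firstN(3))
}
```

The consumer here behaves like a client that stops caring about the response, which is why the request-scoped reasoning carries over.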

Service Contexts

The question I'm considering is: should you have a context that represents the lifecycle of a long lived service (and therefore cancellation would signal shutdown). Long-lived services such as Nomad agents have complex shutdown semantics:

  • On one hand they must be "crash safe": an agent should be able to die at any point and recover on startup. The worst case scenario is that the agent must refuse to restart due to corrupted state, although this should be considered a bug. At no point should forcibly restarting a process cause incorrect behavior: either recover or refuse to run.
  • On the other hand they must make a best effort at a graceful shutdown.

Graceful Shutdown in Nomad Agents

There are numerous places where making a best effort to gracefully shutdown is a critical Nomad feature:

  1. Consul TTL Healthchecks are heartbeated on shutdown to make a best effort at preventing TTL expirations during agent restarts.
     • Uses a 2 channel shutdown-signal + shutdown-complete with timeout approach. Shutdown signal could be replaced with a context, but that context must not be used when making Consul API calls.
  2. When run in -dev mode the Nomad agent cleans up all running tasks before exiting.
     • Uses a 2 channel shutdown-signal + shutdown-complete approach. Shutdown signal could be replaced with a context, but that context must not be used when communicating with drivers (e.g. Docker API, executor RPCs, exec'ing rkt commands).
  3. TODO local and/or remote state sync'ing?

As noted above, uses 1 and 2 could receive the shutdown signal from a Context.Done() chan, but this has 2 gotchas:

  1. The shutdown context must not be used for communicating with Consul or drivers. Doing so would cancel these operations and defeat the purpose of attempting to shutdown gracefully.
  2. The parent that cancels the context must know it needs to wait for its children to exit.
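Both gotchas can be sketched in a few lines. This is not Nomad's actual code: heartbeater, its beat callback, and the timeouts are hypothetical stand-ins for a TTL check updater and a Consul API call.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// heartbeater is a hypothetical stand-in for a component that keeps a TTL
// check alive. beat represents the remote API call.
type heartbeater struct {
	shutdownCtx context.Context // canceled to signal shutdown
	doneCh      chan struct{}   // closed once run has returned
	beat        func(ctx context.Context) error
}

// run heartbeats on every tick and once more when shutdown is signaled.
func (h *heartbeater) run(interval time.Duration) {
	defer close(h.doneCh)
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			h.callBeat()
		case <-h.shutdownCtx.Done():
			h.callBeat() // final best-effort heartbeat
			return
		}
	}
}

// callBeat deliberately builds a fresh context with its own timeout (gotcha
// 1). Passing shutdownCtx would cancel the final heartbeat before it starts.
func (h *heartbeater) callBeat() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	_ = h.beat(ctx) // best effort: errors are ignored in this sketch
}

// demo signals shutdown and then waits, with a timeout, for the heartbeater
// to finish (gotcha 2). It reports whether the final heartbeat saw a live
// context and whether shutdown completed in time.
func demo() (finalCtxAlive, graceful bool) {
	shutdownCtx, cancel := context.WithCancel(context.Background())
	var lastErr error // written by run, read only after doneCh is closed
	h := &heartbeater{
		shutdownCtx: shutdownCtx,
		doneCh:      make(chan struct{}),
		beat: func(ctx context.Context) error {
			lastErr = ctx.Err()
			return nil
		},
	}
	go h.run(time.Hour)

	cancel() // shutdown signal
	select {
	case <-h.doneCh: // shutdown complete
		return lastErr == nil, true
	case <-time.After(2 * time.Second):
		return false, false
	}
}

func main() {
	fmt.Println(demo())
}
```

The context only replaces the shutdown-signal channel. The shutdown-complete channel and the parent's wait are still needed.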

Gotcha 2 might not seem like a big deal, but it means every parent of a goroutine that requires a coordinated shutdown must itself implement a coordinated shutdown. For example, even if Agent could just cancel Client's context and exit because Agent doesn't care about any "results" from Client, Client cares about waiting for drivers to exit in dev mode. So Client knows to wait on a graceful shutdown of drivers, but Agent also needs to wait on Client.

In practice this means contexts only complicate shutdown for goroutines that are not leaves (or close to leaves). As soon as some descendant goroutine requires a coordinated shutdown, it infects every parent and defeats much of the simplicity of using a context for shutting down.
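To make the "infection" concrete, here is a sketch in which a hypothetical client owns a single driver goroutine whose cleanup takes 50ms. The types and timings are invented for illustration and are far simpler than Nomad's real Agent and Client:

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// client is a hypothetical mid-level component that owns a driver goroutine.
type client struct {
	cancel  context.CancelFunc
	wg      sync.WaitGroup
	cleaned int32 // set to 1 once the driver has finished cleaning up
}

func newClient(parent context.Context) *client {
	ctx, cancel := context.WithCancel(parent)
	c := &client{cancel: cancel}
	c.wg.Add(1)
	go c.driver(ctx)
	return c
}

// driver simulates a leaf goroutine whose cleanup takes time.
func (c *client) driver(ctx context.Context) {
	defer c.wg.Done()
	<-ctx.Done()
	time.Sleep(50 * time.Millisecond) // pretend to stop running tasks
	atomic.StoreInt32(&c.cleaned, 1)
}

// Shutdown signals the driver and blocks until its cleanup has finished.
func (c *client) Shutdown() {
	c.cancel()
	c.wg.Wait()
}

// stopAgent models the agent shutting down. With wait=false it only cancels
// the parent context; with wait=true it calls the client's blocking
// Shutdown. It reports whether driver cleanup had finished by the time the
// agent would exit.
func stopAgent(wait bool) bool {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	c := newClient(ctx)
	if wait {
		c.Shutdown()
	} else {
		cancel() // fire and forget: nothing waits for the driver
	}
	return atomic.LoadInt32(&c.cleaned) == 1
}

func main() {
	fmt.Println("cancel only:", stopAgent(false))
	fmt.Println("coordinated:", stopAgent(true))
}
```

Canceling the parent context alone lets the agent exit before the driver has cleaned up, so the agent ends up needing the same blocking Shutdown call that the context was supposed to replace.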

Implementing a Global Context Tree

The open question in my mind is:

Does the benefit of having a global context tree representing the lifecycle of a service outweigh the cognitive overhead of knowing when to use a simple context cancellation vs a coordinated shutdown mechanism?

I believe the only way to answer it is to look at APIs of possible implementations.

Using a global context

Let's see an example of using a global context tree with a struct that requires a graceful (blocking) shutdown:

type T struct {
  // ctx is cancelled to signal a shutdown
  ctx    context.Context
  
  // cancel T's context to signal a shutdown
  cancel context.CancelFunc
  
  // doneCh is closed when graceful shutdown is complete
  doneCh chan struct{}
}

// NewT creates a T that exits when pctx is cancelled or Shutdown is called.
func NewT(pctx context.Context) *T {
  t := &T{
    doneCh: make(chan struct{}),
  }
  t.ctx, t.cancel = context.WithCancel(pctx)
  return t
}

// Run is called in a goroutine by T's parent.
func (t *T) Run() {
  defer close(t.doneCh)
  
  work := make(chan int)
  
  go someAncillaryProcess(t.ctx)
  
  for {
    select {
      case <-work:
        // do work
        
      case <-t.ctx.Done():
        // cancelled; exit
        return
    }
  }
}

// Shutdown gracefully
func (t *T) Shutdown() {
  t.preShutdown()
  t.cancel()
  <-t.doneCh
  t.postShutdown()
}

The first question is: should the parent context be canceled before or after calling Shutdown? There's no way for the canceler to know. If t.preShutdown() requires someAncillaryProcess(...) to be running, the parent must call Shutdown first. However, since Shutdown cancels the local context itself, there's no point in passing in a parent context: the child context is always canceled before the parent is.
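The ordering hazard can be observed with a reduced version of T. This is only a sketch: the ancillaryDone field and the boolean returned from Shutdown are hypothetical additions that stand in for whatever t.preShutdown() would actually need from the ancillary process.

```go
package main

import (
	"context"
	"fmt"
)

// T is a reduced version of the struct above. ancillaryDone is a
// hypothetical field added so the example can wait for the ancillary
// goroutine to exit.
type T struct {
	ctx           context.Context
	cancel        context.CancelFunc
	ancillaryDone chan struct{}
}

func NewT(pctx context.Context) *T {
	t := &T{ancillaryDone: make(chan struct{})}
	t.ctx, t.cancel = context.WithCancel(pctx)
	go func() {
		// Stand-in for someAncillaryProcess: it stops as soon as the
		// context is canceled.
		defer close(t.ancillaryDone)
		<-t.ctx.Done()
	}()
	return t
}

// Shutdown reports whether the ancillary process had not yet been told to
// stop when the pre-shutdown step ran.
func (t *T) Shutdown() bool {
	available := t.ctx.Err() == nil // what a preShutdown step would see
	t.cancel()
	<-t.ancillaryDone
	return available
}

// shutdownOrder optionally cancels the parent context before calling
// Shutdown and returns what Shutdown observed.
func shutdownOrder(parentFirst bool) bool {
	pctx, pcancel := context.WithCancel(context.Background())
	defer pcancel()

	t := NewT(pctx)
	if parentFirst {
		pcancel()
	}
	return t.Shutdown()
}

func main() {
	fmt.Println("parent canceled first:", shutdownOrder(true))
	fmt.Println("Shutdown called first:", shutdownOrder(false))
}
```

When the parent is canceled first, the pre-shutdown step finds the ancillary process already stopping. When Shutdown is called first, the parent context never contributes anything.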

Obviously a developer would want to document such dependencies, and a common pattern could be established to prevent errors, but I am left wondering whether the parent context is ever useful.

Other Context Features for Service Lifecycles

Other than cancellation, none of the features of contexts make sense for service lifecycles. This leads me to believe service lifecycle context trees are not idiomatic.

Timeouts and Deadlines

Timeouts and deadlines make no sense for "service" goroutines. The only case I can imagine is as a failsafe when testing, but the Go test tool already provides a timeout mechanism that is much more robust.

Values

Don't do it for services. Explicitly pass dependencies.
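A small sketch of the difference, using a hypothetical store dependency and service type:

```go
package main

import (
	"context"
	"fmt"
)

// store is a hypothetical dependency, e.g. a handle to local state.
type store struct{ name string }

// ctxKey is the key a context-based lookup would use.
type ctxKey struct{}

// storeFromContext shows the hidden-dependency style: nothing in a caller's
// signature says a store is required, and a missing value is only
// discovered at run time.
func storeFromContext(ctx context.Context) (*store, bool) {
	s, ok := ctx.Value(ctxKey{}).(*store)
	return s, ok
}

// service takes its dependency explicitly, so forgetting it is a compile
// error at the call site rather than a failed lookup at run time.
type service struct{ store *store }

func newService(s *store) *service { return &service{store: s} }

// demo returns whether the context lookup found a store (it does not, since
// nobody set one) and the name of the explicitly passed store.
func demo() (bool, string) {
	_, found := storeFromContext(context.Background())
	svc := newService(&store{name: "state.db"})
	return found, svc.store.name
}

func main() {
	fmt.Println(demo())
}
```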

