@jnaulty
Created February 12, 2020 00:31

The Gotchas of Zero-Downtime Traffic /w Kubernetes

Speaker: Leigh Capili, Weaveworks

  • Video Link
  • Demo GitHub Link

Pod Shutdown Procedure

  • kube-apiserver receives the delete request
  • Pod is marked as Terminating, which kicks off asynchronous consequences
  • Endpoints controller removes the Pod from its Services' Endpoints
  • kubelet runs the preStop hooks
  • PID 1 of all containers receives SIGTERM
  • kubelet waits (the preStop hook and the post-SIGTERM wait share the terminationGracePeriodSeconds window)
  • PID 1 of any container still running receives SIGKILL
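
A minimal sketch of where that wait is configured, assuming the demo's nginx image (terminationGracePeriodSeconds is the real Pod-spec field; the values here are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  # total window between the start of termination and SIGKILL;
  # the preStop hook and the post-SIGTERM wait both count against it
  terminationGracePeriodSeconds: 30   # 30s is also the default
  containers:
    - name: nginx
      image: nginx:1.17               # illustrative tag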

Gotcha #1

Shells don't pass signals like you would 'expect' them to.

  • use the exec (array) syntax, e.g. ["nginx"], instead of the shell form, as sketched below
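
A hedged sketch of the exec form in a Pod spec (image and flags are illustrative): with the array syntax, nginx runs as PID 1 and receives SIGTERM directly instead of being wrapped in a shell that may swallow the signal.

containers:
  - name: nginx
    image: nginx:1.17
    # exec form: no /bin/sh wrapper, so PID 1 is nginx itself
    command: ["nginx", "-g", "daemon off;"]
    # shell form (command: ["/bin/sh", "-c", "nginx -g 'daemon off;'"]) would make
    # sh PID 1, and sh typically does not forward SIGTERM to its child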

Gotcha #2

STOPSIGNAL: Kubernetes gives you no way to override the stop signal in the Pod spec. With the Docker and containerd runtimes, you can instead set STOPSIGNAL in the image (e.g. STOPSIGNAL SIGQUIT in the Dockerfile) so that signal is delivered in place of SIGTERM.

Gotcha #3

Readiness/Liveness Probes

The liveness probe checks whether the process is still healthy; if it fails, the kubelet kills and restarts the container.

The readiness probe checks whether the Pod should receive traffic; while it fails, the Pod is kept out of Service endpoints. It is especially important at startup for apps that need a long warm-up period.
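
A hedged sketch of both probes on the demo's nginx container, with a longer period for the liveness probe than for readiness (paths, ports, and timings are illustrative):

containers:
  - name: nginx
    image: nginx:1.17
    readinessProbe:
      httpGet:
        path: /
        port: 80
      initialDelaySeconds: 5
      periodSeconds: 5          # gate traffic quickly once the app is warm
    livenessProbe:
      httpGet:
        path: /
        port: 80
      initialDelaySeconds: 15
      periodSeconds: 20         # check less aggressively; failures trigger a restart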

Gotcha #4

Use a preStop lifecycle hook, or... handle draining with in-app integration.

Endpoints update asynchronously, independent of the Pod lifecycle.

kube-proxy and ingress controllers depend on Endpoints.

When the preStop hook is running or SIGTERM has been sent, your app will likely still be receiving connections.

It takes time for the endpoint removal to propagate across nodes.

So when termination starts, the goal is not to "stop receiving connections" but to "start draining connections": keep serving requests until the Endpoints update has propagated everywhere, as the preStop sleep in the demo below does.

Demo

Deploying nginx app

Testing with siege -v 172.17.0.2 -c2

Monitoring with kubectl get po,ep

Production-ready preStop hook:

lifecycle:
    preStop:
      exec:
        # sleep before SIGTERM so the endpoint removal can propagate
        # to kube-proxy and ingress controllers on every node
        command:
          - /bin/sleep
          - "20"

Gotcha #5

rollingUpdate.maxUnavailable: use a percentage, or set it to 0 when the replica count is 1, so a rollout never removes your only Pod before its replacement is Ready.
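
A hedged sketch of the strategy block for a single-replica apps/v1 Deployment (values are illustrative):

spec:
  replicas: 1
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # never take the only Pod down before the new one is Ready
      maxSurge: 1         # run one extra Pod during the rollout instead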

Gotcha #6

Make sure the app can stay warm for these periods: minReadySeconds and progressDeadlineSeconds (both set on the Deployment spec).

Also take care that strategy.rollingUpdate.maxSurge does not push the rollout past your available capacity.
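
A hedged sketch of where these fields live on an apps/v1 Deployment (values are illustrative):

apiVersion: apps/v1
kind: Deployment
spec:
  minReadySeconds: 10            # a new Pod must stay Ready this long before it counts as available
  progressDeadlineSeconds: 600   # how long the rollout may stall before it is reported as failed
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1                # extra Pods created during the rollout; must fit cluster capacity
      maxUnavailable: 0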

Gotcha #7

Mismatched signal lifecycle with sidecars:

Example:

If you're using cloudsql-proxy to connect your app to your DB, your preStop hooks and graceful-shutdown periods should be synchronized, or at least scheduled so they do not race: if your app is still shutting down gracefully while the proxy is not sleeping, the proxy will exit and drop your DB connections.

The easiest way, short of writing elaborate synchronization logic, is to add a sleep. Seriously: just sleep longer in the sidecar, as sketched below.
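
A hedged sketch of that idea, assuming a cloudsql-proxy sidecar next to the app container (names, image tags, and durations are illustrative):

containers:
  - name: app
    image: my-app:1.0                                 # illustrative app image
    lifecycle:
      preStop:
        exec:
          command: ["/bin/sleep", "20"]               # drain app connections first
  - name: cloudsql-proxy
    image: gcr.io/cloudsql-docker/gce-proxy:1.16      # illustrative tag
    lifecycle:
      preStop:
        exec:
          command: ["/bin/sleep", "30"]               # outlive the app's shutdown so DB connections survive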

Rules of Uptime

  1. the entrypoint should handle or pass signals
  2. STOPSIGNAL may need to be changed
  3. use different periods for liveness and readiness probes (longer for liveness)
  4. sleep in preStop hooks to drain connections
  5. use the newer apps/v1 Deployment
  6. keep your app warm during a RollingUpdate
  7. synchronize shutdown of sidecars