@jnaulty
Created February 12, 2020 00:31

The Gotchas of Zero-Downtime Traffic /w Kubernetes

Speaker: Leigh Capili, Weaveworks

  • Video Link
  • Demo GitHub Link

Pod Shutdown Procedure

  • kube-apiserver receives the delete request
  • Pod is marked as Terminating, which kicks off asynchronous consequences
  • Endpoints controller removes the Pod from its Services' Endpoints
  • kubelet runs the preStop hooks
  • PID 1 of all containers receives SIGTERM
  • kubelet waits (the preStop hook and the post-SIGTERM wait share the terminationGracePeriodSeconds window)
  • PID 1 of any container still running receives SIGKILL
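
A minimal sketch of where that wait is configured, assuming the demo's nginx image (terminationGracePeriodSeconds is the real Pod-spec field; the values here are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  # total window between the start of termination and SIGKILL;
  # the preStop hook and the post-SIGTERM wait both count against it
  terminationGracePeriodSeconds: 30   # 30s is also the default
  containers:
    - name: nginx
      image: nginx:1.17               # illustrative tag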

Gotcha #1

Shells don't pass signals like you would 'expect' them to.

  • use the exec (array) syntax, e.g. ["nginx"], instead of the shell form, as sketched below
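
A hedged sketch of the exec form in a Pod spec (image and flags are illustrative): with the array syntax, nginx runs as PID 1 and receives SIGTERM directly instead of being wrapped in a shell that may swallow the signal.

containers:
  - name: nginx
    image: nginx:1.17
    # exec form: no /bin/sh wrapper, so PID 1 is nginx itself
    command: ["nginx", "-g", "daemon off;"]
    # shell form (command: ["/bin/sh", "-c", "nginx -g 'daemon off;'"]) would make
    # sh PID 1, and sh typically does not forward SIGTERM to its child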

Gotcha #2

STOPSIGNAL: Kubernetes gives you no way to override the stop signal in the Pod spec. With the Docker and containerd runtimes, you can instead set STOPSIGNAL in the image (e.g. STOPSIGNAL SIGQUIT in the Dockerfile) so that signal is delivered in place of SIGTERM.

Gotcha #3

Readiness/Liveness Probes

The liveness probe checks whether the process is still healthy; if it fails, the kubelet kills and restarts the container.

The readiness probe checks whether the Pod should receive traffic; while it fails, the Pod is kept out of Service endpoints. It is especially important at startup for apps that need a long warm-up period.
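
A hedged sketch of both probes on the demo's nginx container, with a longer period for the liveness probe than for readiness (paths, ports, and timings are illustrative):

containers:
  - name: nginx
    image: nginx:1.17
    readinessProbe:
      httpGet:
        path: /
        port: 80
      initialDelaySeconds: 5
      periodSeconds: 5          # gate traffic quickly once the app is warm
    livenessProbe:
      httpGet:
        path: /
        port: 80
      initialDelaySeconds: 15
      periodSeconds: 20         # check less aggressively; failures trigger a restart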

Gotcha #4

Use a preStop lifecycle hook, or... handle draining with in-app integration.

Endpoints update asynchronously, independent of the Pod lifecycle.

kube-proxy and ingress controllers depend on Endpoints.

When the preStop hook is running or SIGTERM has been sent, your app will likely still be receiving connections.

It takes time for the endpoint removal to propagate across nodes.

So when termination starts, the goal is not to "stop receiving connections" but to "start draining connections": keep serving requests until the Endpoints update has propagated everywhere, as the preStop sleep in the demo below does.

Demo

Deploying nginx app

Testing with siege -v 172.17.0.2 -c2

Monitoring with kubectl get po,ep

Production-ready preStop hook:

lifecycle:
    preStop:
      exec:
        # sleep before SIGTERM so the endpoint removal can propagate
        # to kube-proxy and ingress controllers on every node
        command:
          - /bin/sleep
          - "20"

Gotcha #5

rollingUpdate.maxUnavailable: use a percentage, or set it to 0 when the replica count is 1, so a rollout never removes your only Pod before its replacement is Ready.
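
A hedged sketch of the strategy block for a single-replica apps/v1 Deployment (values are illustrative):

spec:
  replicas: 1
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # never take the only Pod down before the new one is Ready
      maxSurge: 1         # run one extra Pod during the rollout instead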

Gotcha #6

Make sure the app can stay warm for these periods: minReadySeconds and progressDeadlineSeconds (both set on the Deployment spec).

Also take care that strategy.rollingUpdate.maxSurge does not push the rollout past your available capacity.
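
A hedged sketch of where these fields live on an apps/v1 Deployment (values are illustrative):

apiVersion: apps/v1
kind: Deployment
spec:
  minReadySeconds: 10            # a new Pod must stay Ready this long before it counts as available
  progressDeadlineSeconds: 600   # how long the rollout may stall before it is reported as failed
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1                # extra Pods created during the rollout; must fit cluster capacity
      maxUnavailable: 0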

Gotcha #7

Mismatched signal lifecycle with sidecars:

Example:

If you're using cloudsql-proxy to connect your app to your DB, your preStop hooks and graceful-shutdown periods should be synchronized, or at least scheduled so they do not race: if your app is still shutting down gracefully while the proxy is not sleeping, the proxy will exit and drop your DB connections.

The easiest way, short of writing elaborate synchronization logic, is to add a sleep. Seriously: just sleep longer in the sidecar, as sketched below.
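
A hedged sketch of that idea, assuming a cloudsql-proxy sidecar next to the app container (names, image tags, and durations are illustrative):

containers:
  - name: app
    image: my-app:1.0                                 # illustrative app image
    lifecycle:
      preStop:
        exec:
          command: ["/bin/sleep", "20"]               # drain app connections first
  - name: cloudsql-proxy
    image: gcr.io/cloudsql-docker/gce-proxy:1.16      # illustrative tag
    lifecycle:
      preStop:
        exec:
          command: ["/bin/sleep", "30"]               # outlive the app's shutdown so DB connections survive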

Rules of Uptime

  1. the entrypoint should handle or pass signals
  2. STOPSIGNAL may need to be changed
  3. use different periods for liveness and readiness probes (longer for liveness)
  4. sleep in preStop hooks to drain connections
  5. use the newer apps/v1 Deployment
  6. keep your app warm during a RollingUpdate
  7. synchronize shutdown of sidecars