@krx252525
Last active May 23, 2018 02:44
Kubernetes node healthcheck via ELB with DNS change
Title: Kubernetes node health monitoring
# Render here: https://bramp.github.io/js-sequence-diagrams/
node->dns: resolve ip-address
node->node: cache ip ttl 60s
node->elb: healthy
master->node: ok (2xx)
node->node: wait 10s
note over node,elb: Repeat heartbeat every \n node-status-update-frequency (default 10s)
elb->master: healthy
note right of master: Check status of node every \nnode-monitor-period (default 5s)
master->master: node healthy
elb->elb: change ip
node->elb: healthy (wrong ip)
note over node,elb: Wait for connection timeout. Retries \nnodeStatusUpdateRetry number of times (constant 5)
master->master: node missed healthcheck
Note right of master: set node not healthy after \nnode-monitor-grace-period seconds (default 40s)
note over node,master: heartbeat fails 3 more times
master->master: node not healthy
node->elb: healthy (wrong ip)
node->node: wait 10s
note over node,elb: one more failed healthcheck until the cached ip ttl expires\n and the new ip is resolved
node->dns: resolve ip-address
node->node: cache ip ttl 60s
node->elb: healthy
elb->master: healthy
master->node: ok (2xx)
master->master: node healthy
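
The timing in the diagram above can be checked with a little arithmetic. The following is a minimal Python sketch, not Kubernetes code: `outage_window` and its parameters are hypothetical names, heartbeats are assumed to go out at exact multiples of the update frequency counted from the moment DNS was resolved, and connection timeouts and retries are ignored.

```python
import math


def outage_window(ip_change_at, dns_ttl=60.0, heartbeat_every=10.0, grace=40.0):
    """Estimate the heartbeat gap caused by an ELB IP change.

    Time 0 is the moment the node resolved and cached the ELB IP.
    Assumes 0 <= ip_change_at < dns_ttl (hypothetical simplification).
    Returns (last_ok, first_ok_again, marked_unhealthy_at); the last
    element is None if the master never sees the grace period exceeded.
    """
    # Heartbeats sent at or before the IP change still reach the master.
    last_ok = math.floor(ip_change_at / heartbeat_every) * heartbeat_every
    # The first heartbeat at or after TTL expiry re-resolves and succeeds.
    first_ok_again = math.ceil(dns_ttl / heartbeat_every) * heartbeat_every
    gap = first_ok_again - last_ok
    # The master marks the node not healthy once the grace period has
    # passed since the last heartbeat it received.
    marked_unhealthy_at = last_ok + grace if gap > grace else None
    return last_ok, first_ok_again, marked_unhealthy_at


if __name__ == "__main__":
    print(outage_window(5.0))   # IP changes shortly after resolution
    print(outage_window(35.0))  # IP changes late in the TTL window
```

With the defaults from the diagram, an IP change 5s after resolution leaves a 60s gap, so the node is marked not healthy at 40s and recovers at 60s. A change at 35s leaves only a 30s gap, which stays inside the grace period.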
Title: Kubernetes node health monitoring
# Render here: https://bramp.github.io/js-sequence-diagrams/
participant node
participant dns
participant ELBi1
participant ELBi2
participant master
node->dns: resolve ip-address
node->ELBi1: healthy
ELBi1->master: healthy
master->node: ok (2xx)
node->node: close.body()
node->ELBi1: keep-alive
ELBi1->master: keep-alive
note over node,ELBi1: Repeat heartbeat every \n node-status-update-frequency (default 10s)\npreventing tcp connection closing
node->ELBi1: healthy
ELBi1->master: healthy
master->node: ok (2xx)
note right of master: Check status of node every \nnode-monitor-period (default 5s)
master->master: node healthy
dns->dns: Remove ELBi1; \nAdd ELBi2
note over node,master: The advertised address, and underlying host, for the ELB has changed. \nThe previous address remains resolvable for 60 minutes.\nSince we have a persistent connection open, AWS is kind enough to keep the\nprevious instance alive for up to around 1 week.
node->ELBi1: healthy (wrong ip)
ELBi1->master: healthy
master->node: ok (2xx)
note over node,master: a week passes and we are still speaking to the master via the old ELB
ELBi1->ELBi1: die
note over node,master: The ELBi1 hop on the persistent connection has disappeared without saying goodbye.\nA heartbeat is sent but remains unacknowledged.\nThe kernel keeps resending with exponential backoff until it receives a response.\nThe kernel eventually kills the TCP connection after 15m25s if there is no response.
node->ELBi1: healthy (wrong ip)
Note right of master: set node not healthy after \nnode-monitor-grace-period seconds \n(default 40s)
note over node,master: 40 seconds pass since master received last heartbeat from node
master->master: node unhealthy
note over node,master: 15min25s pass since last heartbeat was initially sent and still no response
node->node: kill tcp connection
node->dns: resolve ip-address
node->ELBi2: healthy
ELBi2->master: healthy
master->node: ok (2xx)
master->master: node healthy
note over node,master: Keep connection open indefinitely
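
The 15m25s figure in the diagram above is consistent with the Linux defaults for TCP retransmission, although the diagram itself does not state which kernel settings were in use. As a rough sketch under that assumption: with `tcp_retries2` at its default of 15, a retransmission timeout starting at 200ms, doubling on each attempt and capped at 120s, the waits add up to about 924.6s. The function names below are hypothetical, and the real timeout depends on the measured round-trip time.

```python
def tcp_give_up_seconds(retries=15, rto_min=0.2, rto_max=120.0):
    """Sum the waits after the original send and after each retransmission.

    Assumes Linux-style defaults: the retransmission timeout starts at
    rto_min, doubles after every unacknowledged attempt, and never
    exceeds rto_max.
    """
    total = 0.0
    rto = rto_min
    # One wait after the original segment plus one per retransmission.
    for _ in range(retries + 1):
        total += rto
        rto = min(rto * 2, rto_max)
    return total


def as_min_sec(seconds):
    """Round to whole seconds and split into (minutes, seconds)."""
    return divmod(round(seconds), 60)


if __name__ == "__main__":
    total = tcp_give_up_seconds()
    minutes, secs = as_min_sec(total)
    print(f"{total:.1f}s is about {minutes}m{secs}s")
```

The first ten waits (0.2s up to 102.4s) sum to 204.6s, and the remaining six are capped at 120s each, which gives the 924.6s total.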
Title: Simulate cluster failure resulting from ELB disappearing
# Render here: https://bramp.github.io/js-sequence-diagrams/
participant infra
participant node
participant dnsmasq
participant m2elb
participant melb
participant master
note over dnsmasq: dnsmasq will run\non node
infra->dnsmasq:set dnsmasq to point \nm2elb's dns entry to m2elb
node->dnsmasq: resolve m2elb ip
node->m2elb: heartbeat
m2elb->master: heartbeat
master->node: ok (2xx)
note over node,master: connection to master will now be persistent via m2elb
infra->dnsmasq: set dnsmasq to point \nm2elb's dns entry to melb
infra->node: use iptables rules to drop all\npackets to and from m2elb IP
note over node,master: watch as the node becomes unhealthy and remains\nunhealthy for around 15min25s.
node->node: kill tcp connection
node->dnsmasq: node will now re-resolve m2elb dns\nand get melb ip from dnsmasq
node->melb: healthy
melb->master: healthy
master->node: ok (2xx)
note over node,master: see the node become healthy again as the connection is reset and \n heartbeats reach master again
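
The two infra steps in the simulation above (repointing the DNS entry and dropping packets) could be scripted roughly as follows. This is a sketch under assumptions, not the setup actually used: the hostname and IP addresses are placeholders, the helper names are hypothetical, and the code only builds the dnsmasq config line and the iptables argument lists without running anything.

```python
def dnsmasq_address_line(hostname, ip):
    """Build a dnsmasq config line that pins hostname to ip."""
    return f"address=/{hostname}/{ip}"


def drop_rules(ip):
    """Build iptables commands that drop all packets to and from ip."""
    return [
        # Drop everything arriving from the old ELB address.
        ["iptables", "-I", "INPUT", "-s", ip, "-j", "DROP"],
        # Drop everything the node sends to the old ELB address.
        ["iptables", "-I", "OUTPUT", "-d", ip, "-j", "DROP"],
    ]


if __name__ == "__main__":
    # Placeholder values from the documentation address range.
    m2elb_host = "m2elb.example.internal"
    m2elb_ip = "192.0.2.10"
    melb_ip = "192.0.2.20"

    # Step 1: point m2elb's dns entry at melb.
    print(dnsmasq_address_line(m2elb_host, melb_ip))
    # Step 2: black-hole the established connection via m2elb.
    for rule in drop_rules(m2elb_ip):
        print(" ".join(rule))
```

On a real node the printed commands would need root privileges, and dnsmasq would need a restart to pick up the changed config line.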