Last active
May 23, 2018 02:44
Kubernetes node healthcheck via ELB with DNS change
Title: Kubernetes node health monitoring
# Render here: https://bramp.github.io/js-sequence-diagrams/
node->dns: resolve ip-address
node->node: cache ip ttl 60s
node->elb: healthy
elb->master: healthy
master->node: ok (2xx)
node->node: wait 10s
note over node,elb: Repeat heartbeat every \n node-status-update-frequency (default 10s)
note right of master: Check status of node every \nnode-monitor-period (default 5s)
master->master: node healthy
elb->elb: change ip
node->elb: healthy (wrong ip)
note over node,elb: Wait for connection timeout. Retries \nnodeStatusUpdateRetry number of times (constant 5)
master->master: node missed healthcheck
note right of master: set node not healthy after \nnode-monitor-grace-period seconds (default 40s)
note over node,master: heartbeat fails 3 more times
master->master: node not healthy
node->elb: healthy (wrong ip)
node->node: wait 10s
note over node,elb: one more healthcheck until the cached elb ip's ttl expires\n and the new ip is resolved
node->dns: resolve ip-address
node->node: cache ip ttl 60s
node->elb: healthy
elb->master: healthy
master->node: ok (2xx)
master->master: node healthy
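
The timing in the diagram above can be checked with a little arithmetic. The following is a minimal sketch, not the Kubernetes implementation: the function names are hypothetical, the defaults are the ones quoted in the diagram notes, and it assumes the master checks at multiples of node-monitor-period and marks a node unhealthy only once strictly more than node-monitor-grace-period has passed since the last heartbeat.

```python
def missed_heartbeats_before_unhealthy(update_frequency_s=10, grace_period_s=40):
    """Number of heartbeat intervals that fit inside the grace period."""
    return grace_period_s // update_frequency_s


def first_unhealthy_check(last_heartbeat_s=0, monitor_period_s=5, grace_period_s=40):
    """Time of the first master check that sees the node as stale.

    Assumption: checks run at t = 0, 5, 10, ... and the node is stale
    when (t - last_heartbeat_s) is strictly greater than the grace period.
    """
    t = 0
    while t - last_heartbeat_s <= grace_period_s:
        t += monitor_period_s
    return t


if __name__ == "__main__":
    print(missed_heartbeats_before_unhealthy())  # one missed heartbeat plus 3 more
    print(first_unhealthy_check())
```

With the defaults, four heartbeat intervals fit inside the grace period, which matches the "missed healthcheck" followed by "fails 3 more times" sequence in the diagram.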
Title: Kubernetes node health monitoring
# Render here: https://bramp.github.io/js-sequence-diagrams/
participant node
participant dns
participant ELBi1
participant ELBi2
participant master
node->dns: resolve ip-address
node->ELBi1: healthy
ELBi1->master: healthy
master->node: ok (2xx)
node->node: close.body()
node->ELBi1: keep-alive
ELBi1->master: keep-alive
note over node,ELBi1: Repeat heartbeat every \n node-status-update-frequency (default 10s)\npreventing tcp connection closing
node->ELBi1: healthy
ELBi1->master: healthy
master->node: ok (2xx)
note right of master: Check status of node every \nnode-monitor-period (default 5s)
master->master: node healthy
dns->dns: Remove ELBi1; \nAdd ELBi2
note over node,master: The advertised address, and underlying host, for the ELB has changed. \nThe previous address remains resolvable for 60 minutes.\nSince we have a persistent connection open, AWS is kind enough to keep the\nprevious instance alive for up to around 1 week.
node->ELBi1: healthy (wrong ip)
ELBi1->master: healthy
master->node: ok (2xx)
note over node,master: a week passes by and we're still speaking to the master on our route via the old ELB
ELBi1->ELBi1: die
note over node,master: The ELBi1 hop on the persistent connection has disappeared without saying goodbye.\n A heartbeat is sent but remains unacknowledged.\nThe kernel will keep resending with exponential backoff until it receives a response. \nThe kernel eventually kills the TCP connection after 15m25s if there is no response.
node->ELBi1: healthy (wrong ip)
note right of master: set node not healthy after \nnode-monitor-grace-period seconds \n(default 40s)
note over node,master: 40 seconds pass since master received last heartbeat from node
master->master: node unhealthy
note over node,master: 15m25s pass since last heartbeat was initially sent and still no response
node->node: kill tcp connection
node->dns: resolve ip-address
node->ELBi2: healthy
ELBi2->master: healthy
master->node: ok (2xx)
master->master: node healthy
note over node,master: Keep connection open indefinitely
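
The 15m25s figure in the diagram above is consistent with the Linux kernel's default retransmission limit. The sketch below reproduces the arithmetic under stated assumptions: `tcp_retries2` at its default of 15, a retransmission timeout that starts at the 200 ms minimum, doubles after each attempt, and is capped at 120 s. On a real connection the starting timeout depends on the measured round-trip time, so the actual duration varies. The function name is hypothetical.

```python
def tcp_retransmit_timeout(retries=15, rto_min=0.2, rto_max=120.0):
    """Total seconds before the kernel gives up on an unacknowledged segment.

    Assumption: the timer fires retries + 1 times, and each interval is
    double the previous one, capped at rto_max.
    """
    rto = rto_min
    total = 0.0
    for _ in range(retries + 1):
        total += rto
        rto = min(rto * 2, rto_max)
    return total


if __name__ == "__main__":
    total = tcp_retransmit_timeout()
    minutes, seconds = divmod(total, 60)
    # Roughly 924.6 s, i.e. about 15m25s.
    print(f"{total:.1f}s = {int(minutes)}m{seconds:.1f}s")
```

Lowering `net.ipv4.tcp_retries2` shortens this window, at the cost of giving up sooner on connections that are merely slow.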
Title: Simulate cluster failure resulting from ELB disappearing
# Render here: https://bramp.github.io/js-sequence-diagrams/
participant infra
participant node
participant dnsmasq
participant m2elb
participant melb
participant master
note over dnsmasq: dnsmasq will run\non node
infra->dnsmasq: set dnsmasq to point \nm2elb's dns entry to m2elb
node->dnsmasq: resolve m2elb ip
node->m2elb: heartbeat
m2elb->master: heartbeat
master->node: ok (2xx)
note over node,master: connection to master will now be persistent via m2elb
infra->dnsmasq: set dnsmasq to point \nm2elb's dns entry to melb
infra->node: use iptables rules to drop all\npackets to and from m2elb IP
note over node,master: watch as the node becomes unhealthy and remains\nunhealthy for around 15m25s.
node->node: kill tcp connection
node->dnsmasq: node will now re-resolve m2elb dns\nand get melb ip from dnsmasq
node->melb: healthy
melb->master: healthy
master->node: ok (2xx)
note over node,master: see the node become healthy again as the connection is reset and \n heartbeats reach master again
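
The infra steps in the simulation above could be scripted along these lines. This is only a sketch: the hostname and IP addresses are hypothetical placeholders (taken from documentation address ranges), the helper names are made up for illustration, and the script only builds the dnsmasq config line and iptables commands as strings without running anything.

```python
def dnsmasq_address_line(hostname, ip):
    """dnsmasq config line that pins a hostname to a fixed IP."""
    return f"address=/{hostname}/{ip}"


def drop_rules(ip):
    """iptables commands that drop all packets to and from one IP."""
    return [
        f"iptables -A OUTPUT -d {ip} -j DROP",
        f"iptables -A INPUT -s {ip} -j DROP",
    ]


if __name__ == "__main__":
    # Hypothetical values, not taken from the original setup.
    m2elb_host = "m2elb.example.internal"
    m2elb_ip = "192.0.2.10"
    melb_ip = "198.51.100.20"

    # Step 1: point the m2elb DNS entry at m2elb itself.
    print(dnsmasq_address_line(m2elb_host, m2elb_ip))
    # Step 2: repoint the same entry at melb, then black-hole the m2elb IP.
    print(dnsmasq_address_line(m2elb_host, melb_ip))
    for rule in drop_rules(m2elb_ip):
        print(rule)
```

Dropping packets silently, rather than rejecting them, is what mimics an ELB that disappears without closing the connection, so the node is left waiting on TCP retransmissions.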