krx252525/initial-analysis-wrong

## initial-analysis-wrong
Title: Kubernetes node health monitoring
# Render here: https://bramp.github.io/js-sequence-diagrams/
node->dns: resolve ip-address
node->node: cache ip ttl 60s
node->elb: healthy
master->node: ok (2xx)
node->node: wait 10s
note over node,elb: Repeat heartbeat every \n node-status-update-frequency (default 10s)
elb->master: healthy
note right of master: Check status of node every \nnode-monitor-period (default 5s)
master->master: node healthy
elb->elb: change ip
node->elb: healthy (wrong ip)
note over node,elb: Wait for connection timeout. Retries \nnodeStatusUpdateRetry times (constant 5)
master->master: node missed healthcheck
Note right of master: set node not healthy after \nnode-monitor-grace-period seconds (default 40s)
note over node,master: heartbeat fails 3 more times
master->master: node not healthy
node->elb: healthy (wrong ip)
node->node: wait 10s
note over node,elb: one more healthcheck until the cached elb ip ttl expires\n and the new ip is resolved
node->dns: resolve ip-address
node->node: cache ip ttl 60s
node->elb: healthy
elb->master: healthy
master->node: ok (2xx)
master->master: node healthy
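
The timing parameters named in the diagram above can be checked with a little arithmetic. The following is a minimal sketch, not the controller's actual logic: it assumes the master checks at exact multiples of node-monitor-period starting at t=0, and the function names are hypothetical. With the defaults from the diagram (10s heartbeat, 40s grace period), four consecutive heartbeats must be missed, which matches the "missed healthcheck" plus "fails 3 more times" steps.

```python
import math


def missed_heartbeats_before_unhealthy(grace_period_s=40.0, update_frequency_s=10.0):
    """Number of consecutive heartbeats that must be missed before the
    grace period (node-monitor-grace-period) is used up."""
    return math.ceil(grace_period_s / update_frequency_s)


def first_unhealthy_check(last_heartbeat_s, grace_period_s=40.0, monitor_period_s=5.0):
    """Time of the first master check at which the grace period has elapsed.

    Assumption for illustration: checks happen at t = 0, 5, 10, ... seconds
    (multiples of node-monitor-period).
    """
    deadline = last_heartbeat_s + grace_period_s
    return math.ceil(deadline / monitor_period_s) * monitor_period_s


if __name__ == "__main__":
    print(missed_heartbeats_before_unhealthy())
    print(first_unhealthy_check(last_heartbeat_s=3.0))
```

Under these assumptions a node whose last heartbeat arrived at t=3s would be marked not healthy at the t=45s check.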

## more-detailed-analysis
Title: Kubernetes node health monitoring
# Render here: https://bramp.github.io/js-sequence-diagrams/
participant node
participant dns
participant ELBi1
participant ELBi2
participant master

node->dns: resolve ip-address
node->ELBi1: healthy
ELBi1->master: healthy
master->node: ok (2xx)
node->node: resp.Body.Close()
node->ELBi1: keep-alive
ELBi1->master: keep-alive
note over node,ELBi1: Repeat heartbeat every \n node-status-update-frequency (default 10s)\npreventing tcp connection closing
node->ELBi1: healthy
ELBi1->master: healthy
master->node: ok (2xx)
note right of master: Check status of node every \nnode-monitor-period (default 5s)
master->master: node healthy
dns->dns: Remove ELBi1; \nAdd ELBi2
note over node,master: The advertised address, and underlying host, of the ELB has changed. \nThe previous address remains resolvable for 60 minutes.\nSince we have a persistent connection open, AWS keeps the\nprevious instance alive for up to around 1 week.
node->ELBi1: healthy (wrong ip)
ELBi1->master: healthy
master->node: ok (2xx)
note over node,master: A week passes and we are still speaking to the master via the old ELB
ELBi1->ELBi1: die
note over node,master: The ELBi1 hop on the persistent connection has disappeared without saying goodbye.\nA heartbeat is sent but remains unacknowledged.\nThe kernel keeps resending with exponential backoff until it receives a response.\nThe kernel eventually kills the TCP connection after 15m25s if there is no response.
node->ELBi1: healthy (wrong ip)
Note right of master: set node not healthy after \nnode-monitor-grace-period seconds \n(default 40s)
note over node,master: 40 seconds pass since master received last heartbeat from node
master->master: node unhealthy
note over node,master: 15m25s pass since the last heartbeat was initially sent and still no response
node->node: kill tcp connection
node->dns: resolve ip-address
node->ELBi2: healthy
ELBi2->master: healthy
master->node: ok (2xx)
master->master: node healthy
note over node,master: Keep connection open indefinitely
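
The 15m25s figure in the diagram above is consistent with Linux TCP retransmission defaults. The sketch below is a worked calculation under stated assumptions that are not in the diagram itself: `tcp_retries2 = 15`, a minimum retransmission timeout of 200ms, and a maximum of 120s. The real timeout depends on the measured round-trip time, so treat the result as a hypothetical lower-bound style estimate. The function name is illustrative.

```python
def retransmission_timeout_s(retries=15, initial_rto_s=0.2, max_rto_s=120.0):
    """Sum the waits for the original send plus `retries` retransmissions.

    The wait doubles after every unacknowledged send (exponential backoff)
    and is capped at max_rto_s. All defaults are assumed Linux values.
    """
    total = 0.0
    rto = initial_rto_s
    for _ in range(retries + 1):  # one wait per send: original + retries
        total += rto
        rto = min(rto * 2, max_rto_s)
    return total


if __name__ == "__main__":
    total = retransmission_timeout_s()
    minutes, seconds = divmod(total, 60)
    print(f"{total:.1f}s = {int(minutes)}m{seconds:.1f}s")
```

With these defaults the waits are 0.2s doubling up to 102.4s (ten sends), followed by six waits at the 120s cap, for a total of about 924.6s, which is roughly 15m25s.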

## simulate-cluster-failure
Title: Simulate cluster failure resulting from ELB disappearing
# Render here: https://bramp.github.io/js-sequence-diagrams/

participant infra
participant node
participant dnsmasq
participant m2elb
participant melb
participant master

note over dnsmasq: dnsmasq will run\non node
infra->dnsmasq:set dnsmasq to point \nm2elb's dns entry to m2elb
node->dnsmasq: resolve m2elb ip
node->m2elb: heartbeat
m2elb->master: heartbeat
master->node: ok (2xx)
note over node,master: connection to master will now be persistent via m2elb
infra->dnsmasq: set dnsmasq to point \nm2elb's dns entry to melb
infra->node: use iptables rules to drop all\npackets to and from m2elb IP
note over node,master: watch as the node becomes unhealthy and remains\nunhealthy for around 15m25s.
node->node: kill tcp connection
node->dnsmasq: node will now re-resolve m2elb dns\nand get melb ip from dnsmasq
node->melb: healthy
melb->master: healthy
master->node: ok (2xx)
note over node,master: see the node become healthy again as the connection is reset and \n heartbeats reach master again
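
The infra steps in the diagram above could be prepared along the following lines. This is only a sketch: the hostname and IP addresses are hypothetical placeholders (documentation ranges), and the helper names are made up for illustration. It builds a dnsmasq `address=/name/ip` override line and the two iptables DROP rules, without executing anything.

```python
def dnsmasq_address_line(hostname, ip):
    """dnsmasq config line that forces `hostname` to resolve to `ip`."""
    return f"address=/{hostname}/{ip}"


def drop_rules(ip):
    """iptables commands (as argv lists) dropping all packets to and from `ip`."""
    return [
        ["iptables", "-A", "OUTPUT", "-d", ip, "-j", "DROP"],  # packets to the old ELB
        ["iptables", "-A", "INPUT", "-s", ip, "-j", "DROP"],   # packets from the old ELB
    ]


if __name__ == "__main__":
    # Hypothetical values: replace with the real m2elb name and the two ELB IPs.
    m2elb_name = "m2elb.example.internal"
    m2elb_ip, melb_ip = "203.0.113.10", "203.0.113.20"

    print(dnsmasq_address_line(m2elb_name, m2elb_ip))  # step 1: point name at m2elb
    print(dnsmasq_address_line(m2elb_name, melb_ip))   # step 2: repoint name at melb
    for rule in drop_rules(m2elb_ip):                  # step 3: black-hole m2elb
        print(" ".join(rule))
```

Keeping the commands as data makes it easy to review them before running them on a node with root privileges.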