@bensternthal
Created September 18, 2015 00:18
12:29 You have joined the channel
12:29 Mode: +nt
12:29 Created at: Aug 31, 2015, 6:50 PM
15:13 jgmize
pmac: tmux -S /tmp/shareds attach -t shared
15:43 krutten
Hello
15:43 krutten
the host is out of file handles, we are looking for why
15:45 jgmize
I think it's probably the tiny nat host on that cluster
15:46 jgmize
it's a bottleneck for outbound traffic so there's possibly too many fh open for tcp sockets
15:46 jgmize
I resized it on the other cluster but forgot to go back and do it on this one
15:47 jgmize
could be wrong of course and I'm open to other possibilities
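For reference, file-handle and socket pressure on the NAT host can be checked with standard tools; these commands are illustrative and were not run in the session:

    cat /proc/sys/fs/file-nr   # allocated, unused, and max file handles system-wide
    ss -s                      # summary counts of open TCP/UDP sockets
    ulimit -n                  # per-process open-file limit for the current shell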
15:47 krutten
can we adjust it given the lack of file handles?
15:53 krutten
jgmize: can you restart or resize the nat host?
15:55 jgmize
krutten: looking into that now
15:56 krutten
Fleet is just trying to talk to etcd, so we believe when the handles are freed up, it should recover
15:57 krutten
we should be able to see it clearly in the journal after the NAT restart
16:05 jgmize
krutten: as you and carmstrong can probably tell from the shared tmux session, I'm trying to figure out what I need to set to run the update-vpc.sh against this specific cluster-- do you happen to know?
16:06 jgmize
if not, and krancour isn't busy, maybe he could tell us since he wrote it? :)
16:06 krutten
checking
16:11 krutten
jgmize: what's the stack name for us-west?
16:11 jgmize
deis-vpc
16:12 krutten
you just want to apply this to the one cluster right?
16:12 jgmize
right
16:12 jgmize
I think I just need to set the region
16:12 jgmize
to us-west-2
16:12 krutten
that makes sense
16:13 jgmize
the CloudFormation stack name is the same in both regions; I'm just trying to figure out where to specify it
16:13 jgmize
it being the region
16:15 krutten
jgmize: can you look at
16:15 krutten
~/.aws/config
16:15 jgmize
ok I think the AWS_DEFAULT_REGION env var should work
16:15 krutten
[ruby-2.1.5] Aries:.aws krutten$ cat config
16:15 krutten
[default]
16:15 jgmize
running it now
16:15 krutten
region = us-east-1
16:15 krutten
[ruby-2.1.5] Aries:.aws krutten$
16:15 krutten
ENV should work also
16:16 krutten
or `aws configure`
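Putting the pieces together, a plausible invocation against the us-west-2 stack would look like the line below; the actual arguments update-vpc.sh expects are not shown in the session, so treat this as a sketch:

    AWS_DEFAULT_REGION=us-west-2 ./update-vpc.sh   # stack name deis-vpc is the same in both regions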
16:19 jgmize
ok, the nat host is now an m4.large instead of a t2.micro
16:21 krutten
jgmize: can you run
16:21 krutten
journalctl -n 50 -u fleet --no-pager
16:24 jgmize
sure and you can drive now if you want
16:27 krutten
jgmize: localhost was missing from /etc/hosts
16:27 jgmize
yes, is that a known issue or something new?
16:27 krutten
so the traffic was hitting DNS
16:28 jgmize
is this something in the cloudformation template?
16:28 krutten
There are some pull requests krancour and chris are looking at.
16:28 jgmize
can you give me a link?
16:29 krutten
https://github.com/deis/deis/pull/4221
16:29 jgmize
thanks
16:29 krutten
We are still looking at other things
16:30 krutten
jgmize: can we go on all the hosts and add localhost to the 127.0.0.1 line in /etc/hosts?
16:30 jgmize
krutten: yes
16:31 krutten
I'll let you drive as you know the instances better than I
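The change being applied on each node is roughly the following; the exact method used to reach each instance is not shown here, so this is only a sketch:

    # append "localhost" to the 127.0.0.1 entry if it is not already present
    grep -qw localhost /etc/hosts || sudo sed -i '/^127\.0\.0\.1/ s/$/ localhost/' /etc/hosts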
16:34 jgmize
ok, all the nodes in that cluster have that change applied. should we restart any services?
16:35 krutten
it should start to heal
16:35 krutten
I'd be curious to run fleetctl list-machines on each node
16:38 krutten
lookup localhost: too many open files
16:39 krutten
jgmize: can Chris drive for a minute?
16:40 jgmize
krutten: sure
16:40 krutten
looks like restarting fleet may be needed when it's in this state
16:40 krutten
update hosts and restart fleet. it's not rereading hosts :-/
16:41 krutten
which I expected to happen at the libc level
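Since fleet runs as a systemd unit on these hosts, the restart-and-verify sequence would presumably be something like:

    sudo systemctl restart fleet
    journalctl -u fleet -n 50 --no-pager   # confirm the lookup errors stop
    fleetctl list-machines                 # each node should now list the full cluster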
16:43 krutten
jgmize: I don't have the IP of the last machine
16:44 jgmize
omp
16:44 krutten
found it
16:45 krutten
matches etcd's list
16:46 krutten
so the lack of localhost in /etc/hosts put pressure on the NAT server until fleet started to fail the lookup of localhost:4001
16:46 krutten
so we fixed both halves of the issue: NAT capacity and the localhost lookup
16:47 jgmize
ok, I'll apply this fix to the other cluster as well, thanks for your help
16:48 krutten
Managing /etc/hosts should fall to the OS and provisioning service for multiple reasons, which is why we don't try to manage it (we removed that part)
16:48 jgmize
as for the other half, that cluster already has the larger NAT host
16:48 krutten
but I've suggested a check on install to verify localhost is present and warn the user/installer
16:48 jgmize
sounds good
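A pre-flight check along the lines krutten suggests might look like this; it is purely hypothetical and not an actual Deis patch:

    # warn at install time if localhost does not resolve locally
    if ! getent hosts localhost >/dev/null; then
      echo "warning: localhost does not resolve; add '127.0.0.1 localhost' to /etc/hosts" >&2
    fi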
16:49 krutten
jgmize: after the NAT was resized, I'm not sure why that didn't "solve" it (cover it up)
16:52 krutten
jgmize: do things look healthy now?
16:58 jgmize
on this cluster, yes. I'm still applying the /etc/hosts changes to the other cluster, but since that one uses k8s do you think I should go ahead and restart it instead of fleet?
17:00 krutten
I'd watch the logs. if it's not failing, leaving it up may be a good test
17:00 krutten
if it's production, then I would do a rolling restart to be safe though
17:06 jgmize
the other one has been having issues too, but it's a much more experimental config and we wanted to focus on this one first. I think the /etc/hosts issue has been affecting both clusters though; we just didn't see the extent of the issues until we started doing stress testing