@bensternthal
Created September 18, 2015 00:18
12:29 You have joined the channel
12:29 Mode: +nt
12:29 Created at: Aug 31, 2015, 6:50 PM
15:13 jgmize
pmac: tmux -S /tmp/shareds attach -t shared
15:43 krutten
Hello
15:43 krutten
the host is out of file handles, we are looking for why
15:45 jgmize
I think it's probably the tiny nat host on that cluster
15:46 jgmize
it's a bottleneck for outbound traffic so there's possibly too many fh open for tcp sockets
15:46 jgmize
I resized it on the other cluster but forgot to go back and do it on this one
15:47 jgmize
could be wrong of course and I'm open to other possibilities
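For reference, file-handle and socket pressure on the NAT host can be checked with standard tools; these commands are illustrative and were not run in the session:

    cat /proc/sys/fs/file-nr   # allocated, unused, and max file handles system-wide
    ss -s                      # summary counts of open TCP/UDP sockets
    ulimit -n                  # per-process open-file limit for the current shell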
15:47 krutten
can we adjust it given the lack of file handles?
15:53 krutten
jgmize: can you restart or resize the nat host?
15:55 jgmize
krutten: looking into that now
15:56 krutten
Fleet is just trying to talk to etcd, so we believe when the handles are freed up, it should recover
15:57 krutten
we should be able to see it clearly in the journal after the NAT restart
16:05 jgmize
krutten: as you and carmstrong can probably tell from the shared tmux session, I'm trying to figure out what I need to set to run the update-vpc.sh against this specific cluster-- do you happen to know?
16:06 jgmize
if not, and krancour isn't busy, maybe he could tell us since he wrote it? :)
16:06 krutten
checking
16:11 krutten
jgmize: what's the stack name for us-west?
16:11 jgmize
deis-vpc
16:12 krutten
you just want to apply this to the one cluster right?
16:12 jgmize
right
16:12 jgmize
I think I just need to set the region
16:12 jgmize
to us-west-2
16:12 krutten
that makes sense
16:13 jgmize
the CloudFormation stack name is the same in both regions; I'm just trying to figure out where to specify it
16:13 jgmize
it being the region
16:15 krutten
jgmize: can you look at
16:15 krutten
~/.aws/config
16:15 jgmize
ok I think the AWS_DEFAULT_REGION env var should work
16:15 krutten
[ruby-2.1.5] Aries:.aws krutten$ cat config
16:15 krutten
[default]
16:15 jgmize
running it now
16:15 krutten
region = us-east-1
16:15 krutten
[ruby-2.1.5] Aries:.aws krutten$
16:15 krutten
ENV should work also
16:16 krutten
or `aws configure`
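Putting the pieces together, a plausible invocation against the us-west-2 stack would look like the line below; the actual arguments update-vpc.sh expects are not shown in the session, so treat this as a sketch:

    AWS_DEFAULT_REGION=us-west-2 ./update-vpc.sh   # stack name deis-vpc is the same in both regions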
16:19 jgmize
ok, the nat host is now an m4.large instead of a t2.micro
16:21 krutten
jgmize: can you run
16:21 krutten
journalctl -n 50 -u fleet --no-pager
16:24 jgmize
sure and you can drive now if you want
16:27 krutten
jgmize: localhost was missing from /etc/hosts
16:27 jgmize
yes, is that a known issue or something new?
16:27 krutten
so the traffic was hitting DNS
16:28 jgmize
is this something in the cloudformation template?
16:28 krutten
There are some pull requests krancour and chris are looking at.
16:28 jgmize
can you give me a link?
16:29 krutten
https://github.com/deis/deis/pull/4221
16:29 jgmize
thanks
16:29 krutten
We are still looking at other things
16:30 krutten
jgmize: can we go on all the hosts and add localhost to the 127.0.0.1 line in /etc/hosts?
16:30 jgmize
krutten: yes
16:31 krutten
I'll let you drive as you know the instances better than I
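The change being applied on each node is roughly the following; the exact method used to reach each instance is not shown here, so this is only a sketch:

    # append "localhost" to the 127.0.0.1 entry if it is not already present
    grep -qw localhost /etc/hosts || sudo sed -i '/^127\.0\.0\.1/ s/$/ localhost/' /etc/hosts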
16:34 jgmize
ok, all the nodes in that cluster have that change applied. should we restart any services?
16:35 krutten
it should start to heal
16:35 krutten
I'd be curious to run fleetctl list-machines on each node
16:38 krutten
lookup localhost: too many open files
16:39 krutten
jgmize: can Chris drive for a minute?
16:40 jgmize
krutten: sure
16:40 krutten
looks like restarting fleet may be needed when it's in this state
16:40 krutten
update hosts and restart fleet. it's not rereading hosts :-/
16:41 krutten
which I expected to happen at the libc level
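Since fleet runs as a systemd unit on these hosts, the restart-and-verify sequence would presumably be something like:

    sudo systemctl restart fleet
    journalctl -u fleet -n 50 --no-pager   # confirm the lookup errors stop
    fleetctl list-machines                 # each node should now list the full cluster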
16:43 krutten
jgmize: I don't have the IP of the last machine
16:44 jgmize
omp
16:44 krutten
found it
16:45 krutten
matches etcd's list
16:46 krutten
so the lack of localhost in /etc/hosts put pressure on the NAT server until fleet started to fail the lookup of localhost:4001
16:46 krutten
so we fixed both halves of the issue: NAT capacity and the localhost lookup
16:47 jgmize
ok, I'll apply this fix to the other cluster as well, thanks for your help
16:48 krutten
Managing /etc/hosts should fall to the OS and provisioning service for multiple reasons, which is why we don't try to manage it (we removed that part)
16:48 jgmize
as for the other half, that cluster already has the larger NAT host
16:48 krutten
but I've suggested a check on install to verify localhost is present and warn the user/installer
16:48 jgmize
sounds good
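A pre-flight check along the lines krutten suggests might look like this; it is purely hypothetical and not an actual Deis patch:

    # warn at install time if localhost does not resolve locally
    if ! getent hosts localhost >/dev/null; then
      echo "warning: localhost does not resolve; add '127.0.0.1 localhost' to /etc/hosts" >&2
    fi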
16:49 krutten
jgmize: after the NAT was resized, I'm not sure why that didn't "solve" it (cover it up)
16:52 krutten
jgmize: do things look healthy now?
16:58 jgmize
on this cluster, yes. I'm still applying the /etc/hosts changes to the other cluster, but since that one uses k8s do you think I should go ahead and restart it instead of fleet?
17:00 krutten
I'd watch the logs. if it's not failing, leaving it up may be a good test
17:00 krutten
if it's production, then I would do a rolling restart to be safe though
17:06 jgmize
the other one has been having issues too, but it's a much more experimental config and we wanted to focus on this one first. I think the /etc/hosts issue has been affecting both clusters though; we just didn't see the extent of the issues until we started doing stress testing