@Paxxi
Created April 24, 2017 16:48
docker 17.04 issues
Env: Ubuntu 16.04
Docker: 17.03, 3 manager nodes running in swarm mode and one worker node
drain manager 1
update machine with apt-get update and reboot
node visible in swarm, things look ok
try to set availability to available, get error "message":"rpc error: code = 2 desc = update out of sequence" no matter which node it's run on
revert back to 17.03 and reboot machine
update availability to available => success
try update again without drain on all 3 nodes.
Changing availability works as expected again
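For anyone retracing the steps above, here is a minimal sketch of the drain/activate toggle, assuming it is done through the standard docker CLI (the original steps may well have gone through the API or a UI instead). The helper names and the node name `manager1` are hypothetical. Note that the CLI's `--availability` flag accepts `active`, `pause` or `drain`, so "available" corresponds to `active` here.

```python
import subprocess

# Values accepted by `docker node update --availability`.
VALID_AVAILABILITY = ("active", "pause", "drain")


def node_update_cmd(node, availability):
    """Build the docker CLI argument list for changing a node's availability."""
    if availability not in VALID_AVAILABILITY:
        raise ValueError(
            f"availability must be one of {VALID_AVAILABILITY}, got {availability!r}"
        )
    return ["docker", "node", "update", "--availability", availability, node]


def set_availability(node, availability):
    """Run the command; this has to be executed on a manager node."""
    subprocess.run(node_update_cmd(node, availability), check=True)
```

Something like `set_availability("manager1", "drain")` before the upgrade and `set_availability("manager1", "active")` after the reboot would mirror the sequence described above.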
Searched for the error, didn't find anything related. Never logged a ticket as I didn't have any time for possible follow up questions.
We're also affected by the two issues below and have seen them since at least 17.03; I don't remember if we saw them on 1.13 as well.
https://github.com/moby/moby/issues/32079
https://github.com/moby/moby/issues/32738
@BretFisher

Hmm, after digging through all those issues (the wormhole took me to 6+ issues semi-related to intermittent overlay networking connections on various docker versions), I've been lucky enough with my own deployments and my clients, some of which are major web players, not to see that issue. Or at least they aren't telling me they're dealing with it.

moby/moby#32079
moby/moby#32738
moby/moby#32841
moby/moby#32195
moby/moby#30321

Judging by the traffic on those GitHub issues, quite a few people are dealing with something overlay or VIP related that's intermittent, which is always harder to troubleshoot or recreate.

My config was:
DigitalOcean, Ubuntu 16.04.2, 0.5 GB
Docker 17.05-ce
3 managers, one worker
one service of nginx with 8 replicas spread across all 4 nodes
httping nginx every 10ms via external interfaces (ingress routing mesh port 80) for about an hour (I know, not very long 🤔 )

With the persistent httping running, I was able to change a node from drain to active, reboot it, etc. without any failures, as long as I did only one node at a time.
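If httping isn't at hand, a rough stand-in for that kind of probe can be sketched with the Python standard library. This is not the tool I used, just an illustration of the idea; the function names, the defaults and the URL in the usage note are assumptions. Any status from 200 to 399 counts as a success, and connection errors or timeouts count as failures.

```python
import http.client
import time
import urllib.error
import urllib.request


def http_status(url, timeout):
    """GET `url` and return the HTTP status code (error statuses included)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses still carry a status code.
        return err.code


def probe(url, count, interval=0.01, timeout=1.0, fetch=http_status):
    """Request `url` `count` times and return (succeeded, failed)."""
    ok = failed = 0
    for _ in range(count):
        try:
            status = fetch(url, timeout)
            if 200 <= status < 400:
                ok += 1
            else:
                failed += 1
        except (OSError, http.client.HTTPException):
            # Refused connections, timeouts, resets and malformed responses.
            failed += 1
        time.sleep(interval)
    return ok, failed
```

Calling `probe("http://<node-ip>/", count=1000)` against the published port while draining or rebooting one node at a time gives a quick count of dropped requests. The `fetch` parameter is only there so the loop can be exercised without a network.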
I also updated and rebooted manager nodes one at a time, without issue, without draining first.
Note: with 3 managers, at least 2 must be healthy and able to reach each other at all times, or Raft loses quorum and the swarm can't be managed until two are healthy again.
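The arithmetic behind that note is the generic Raft majority rule, sketched below. This is not taken from Docker's code, and the function names are made up for illustration.

```python
def quorum(managers):
    """Smallest number of managers that forms a majority."""
    return managers // 2 + 1


def tolerated_failures(managers):
    """How many managers can be down while a majority is still reachable."""
    return managers - quorum(managers)


if __name__ == "__main__":
    for n in (1, 3, 4, 5, 7):
        print(f"{n} managers: quorum {quorum(n)}, tolerates {tolerated_failures(n)} down")
```

So 3 managers tolerate 1 being down and 5 tolerate 2, while 4 managers still only tolerate 1, which is why odd manager counts are the usual choice.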

I'm not saying you didn't have valid issues, but maybe the issue you had found was resolved or maybe it was related to the number of healthy managers that were online?

If you need a small swarm, you could create 3 managers set to drain so they won't run containers, and make them smaller instances if you need to cut costs, or run 5 managers that are also workers to increase your fault tolerance. I'm not sure the number of healthy managers was your issue, though. Happy to take more questions on Twitter or by email at bret@bretfisher.com, and sorry for the delay in responding. Swarm is far from perfect, but I deal with it in prod often, and the show-stopping bugs have decreased over time since it launched last summer.

Thanks for reaching out.
