@superhero
Last active September 2, 2021 13:46
When docker swarm fucks with you

From a new installation

sudo usermod -aG docker $USER

Adds your user to the docker group. OBS! Log out and back in afterwards, otherwise the group change does not apply to your session.
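To check that the group change is active in the current session, something like the following could work. It is a minimal sketch: has_docker_group is a made-up helper name, and it only looks at the group list that id -nG prints.

```shell
# Hypothetical helper: succeeds if the space-separated group list on stdin
# contains the group "docker" as a whole word.
has_docker_group() {
  tr ' ' '\n' | grep -qx docker
}

# Usage in the session you want to check:
#   id -nG | has_docker_group && echo "docker group is active"
```

If the check fails right after running usermod, the session most likely still has the old group list.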


From a manager node

OBS! If you have network issues in the swarm, try rebooting the managers in series (one at a time) before you try anything else. Rebooting the managers carries a small risk of complications compared to most other fixes.


docker node ls

ID                            HOSTNAME            STATUS              AVAILABILITY        MANAGER STATUS      ENGINE VERSION
b7t0j9rs63go8l1wa8z24e9tl *   manager-01          Ready               Active              Leader              18.09.3
fg992na75y3tmbiei2o6tc6l8     manager-02          Ready               Active              Reachable           18.09.3
fmg39j952a4lnm5o3uapux3g2     manager-03          Ready               Active              Reachable           18.09.3
um3j5adapapux3g27nmldb9pf     worker-01           Ready               Active                                  18.09.3
y0pcyjb6vlpqz1x62bbsazxp3     worker-02           Down                Drain                                   19.03.15
vzpgz07b64f712t0kam1kfymq     worker-03           Ready               Active                                  19.03.1
oivlajzn6dgbi7nk2osejx0hx     worker-04           Down                Drain                                   18.09.3
32ohhrmdage2rl637ckywsapj     worker-05           Ready               Active                                  20.10.6
...

Shows which nodes are down or active. Look for nodes that are down that you do not expect to be down.
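With many nodes it can be easier to filter the list than to read it. A minimal sketch, assuming the --format placeholders .Hostname, .Status and .Availability of docker node ls; down_nodes is a made-up helper name.

```shell
# Hypothetical helper: reads "hostname status availability" lines on stdin
# and prints every node whose status is Down, together with its availability.
down_nodes() {
  awk '$2 == "Down" { print $1, $3 }'
}

# Usage on a manager:
#   docker node ls --format '{{.Hostname}} {{.Status}} {{.Availability}}' | down_nodes
```

A node that is Down but still Active is probably the more suspicious case, since a drained node was most likely taken out on purpose.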


If you see an error saying that there is no leader:

Error response from daemon: rpc error: code = Unknown desc = The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online.

...and you need to reclaim the cluster from a manager that is expected to already be a manager for the cluster, then re-initializing the swarm with the --force-new-cluster flag should solve the issue for you.

docker swarm init --force-new-cluster

You may need to define which IP to advertise with the --advertise-addr flag if the host has more than one IP. It tells the joining nodes how to connect to the leader.
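Before passing an address to --advertise-addr it can help to confirm that the host actually owns it. A minimal sketch: is_local_address is a made-up helper name, hostname -I is assumed to be available (it is on most Linux distributions, but not everywhere), and 10.156.0.2 is just the manager address that appears in the join command further down.

```shell
# Hypothetical helper: succeeds if the address given as $1 appears in the
# whitespace-separated address list on stdin.
is_local_address() {
  tr -s ' \t' '\n' | grep -qxF "$1"
}

# Usage on the manager you want to reclaim the cluster from:
#   hostname -I | is_local_address 10.156.0.2 \
#     && docker swarm init --force-new-cluster --advertise-addr 10.156.0.2
```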


docker service ls

ID                  NAME                              MODE                REPLICAS            IMAGE                                                           PORTS
7bmj6rbsclh1        acl                               replicated          3/3                 registry.example.com:5/acl:latest                               
9ev0xlyq4fx2        acl-staging                       replicated          3/3                 registry.example.com:5/acl-staging:latest                       
ykccgabqlu1o        acl-uat                           replicated          3/3                 registry.example.com:5/acl-uat:latest                           
2z62o15rvzoe        api                               replicated          3/3                 registry.example.com:5/api:latest                               
n28wor05ekb0        b2b-notification-tool             replicated          1/1                 registry.example.com:5/b2b-notification-tool:latest             
9ikwwcots325        b2b-notification-tool-ui          replicated          1/1                 registry.example.com:5/b2b-notification-tool-ui:latest          
8pnvjoonjucf        bss-adaptor                       replicated          1/1                 mrsuperhero/bss-adaptor:latest                                  
11deiin32cez        bss-adaptor-staging               replicated          1/1                 registry.example.com:5/bss-adaptor-staging:latest               
eh0lcm9jsg76        bss-adaptor-uat                   replicated          1/1                 mrsuperhero/bss-adaptor-uat:latest                              
8pjil6qbw7lb        cadmin-repository                 replicated          1/1                 registry.example.com:5/cadmin-repository:latest                 
m7oycnd82sme        calamares-adaptor                 replicated          1/1                 registry.adamo.es:5000/calamares-adaptor:latest
...

Shows which services are running in the cluster. Look for incomplete services, where fewer replicas are running than desired, such as 0/1 or 1/3.
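Spotting a 1/3 in a long list is easy to miss, so the check can be scripted. A minimal sketch, assuming the --format placeholders .Name and .Replicas of docker service ls; incomplete_services is a made-up helper name.

```shell
# Hypothetical helper: reads "name running/desired" lines on stdin and prints
# the services where the running count differs from the desired count.
incomplete_services() {
  awk '{ split($2, r, "/"); if (r[1] != r[2]) print $1, $2 }'
}

# Usage on a manager:
#   docker service ls --format '{{.Name}} {{.Replicas}}' | incomplete_services
```

No output means every service has the number of replicas it asked for.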


docker node ps worker-05

ID                  NAME                                                          IMAGE                                                           NODE                DESIRED STATE       CURRENT STATE                ERROR                              PORTS
vyzeoi0f86b6        proxy.32ohhrmdage2rl637ckywsapj                               mrsuperhero/proxy:latest                                        worker-05           Running             Running 39 minutes ago                                          
ndkkx374oxu1         \_ proxy.32ohhrmdage2rl637ckywsapj                           mrsuperhero/proxy:latest                                        worker-05           Shutdown            Shutdown 39 minutes ago                                         
l5x3yz12nhf5        find-ont-ui.32ohhrmdage2rl637ckywsapj                         registry.example.com:5/find-ont-ui:latest                       worker-05           Running             Running 41 minutes ago                                          
po8kb12p3u5b         \_ find-ont-ui.32ohhrmdage2rl637ckywsapj                     registry.example.com:5/find-ont-ui:latest                       worker-05           Shutdown            Shutdown 41 minutes ago                                         
e4rd2fylnto0        eurologistica-adapter-staging.32ohhrmdage2rl637ckywsapj       registry.example.com:5/eurologistica-adapter-staging:latest     worker-05           Running             Running 42 minutes ago                                          
touqd4u8gdd5         \_ eurologistica-adapter-staging.32ohhrmdage2rl637ckywsapj   registry.example.com:5/eurologistica-adapter-staging:latest     worker-05           Shutdown            Shutdown 41 minutes ago                                         
hcc20ynvvgvv        eurologistica-adapter.32ohhrmdage2rl637ckywsapj               registry.example.com:5/eurologistica-adapter:latest             worker-05           Running             Running 42 minutes ago                                          
vbsykzk7e26j         \_ eurologistica-adapter.32ohhrmdage2rl637ckywsapj           registry.example.com:5/eurologistica-adapter:latest             worker-05           Shutdown            Shutdown 42 minutes ago                                         
uc34coz95lub         \_ eurologistica-adapter.32ohhrmdage2rl637ckywsapj           registry.example.com:5/eurologistica-adapter:latest             worker-05           Shutdown            Shutdown 42 minutes ago                                         
nq4xrgr62ak2        proxy.32ohhrmdage2rl637ckywsapj                               mrsuperhero/proxy                                               worker-05           Shutdown            Shutdown 39 minutes ago                                         
02epczf81udm        eurologistica-adapter-staging.32ohhrmdage2rl637ckywsapj       registry.example.com:5/eurologistica-adapter-staging:latest     worker-05           Shutdown            Shutdown 42 minutes ago                                         
7babrn6xa4fv        find-ont-ui.32ohhrmdage2rl637ckywsapj                         registry.example.com:5/find-ont-ui:latest                       worker-05           Shutdown            Shutdown 41 minutes ago                                         
y7nqf3n7n7bv        proxy.32ohhrmdage2rl637ckywsapj                               mrsuperhero/proxy                                               worker-05           Shutdown            Failed about an hour ago     "error while removing network:…"   
8kmk37626i5t        find-ont-ui.32ohhrmdage2rl637ckywsapj                         registry.example.com:5/find-ont-ui:latest                       worker-05           Shutdown            Failed about an hour ago     "error while removing network:…"   
s4i2dkz6p742        eurologistica-adapter-staging.32ohhrmdage2rl637ckywsapj       registry.example.com:5/eurologistica-adapter-staging:latest     worker-05           Shutdown            Shutdown about an hour ago                                  
...

Shows which service tasks are running on the specified node, including earlier tasks that were shut down or failed.


docker service ps acl

ID                  NAME                IMAGE                               NODE                        DESIRED STATE       CURRENT STATE                ERROR               PORTS
uwwrptqqwmvj        acl.1               registry.example.com:5/acl:latest   worker-05                   Running             Running about an hour ago                        
i9b8pv6em9rl         \_ acl.1           registry.example.com:5/acl:latest   worker-01                   Shutdown            Shutdown about an hour ago                       
hwx8vo9yagls         \_ acl.1           registry.example.com:5/acl:latest   worker-01                   Shutdown            Shutdown about an hour ago                       
eg89rm0htcaw         \_ acl.1           registry.example.com:5/acl:latest   worker-01                   Shutdown            Complete 4 hours ago                             
wh0fx31ky1g1         \_ acl.1           registry.example.com:5/acl          yn3r1lamszgq6ub06tg3ahe6c   Shutdown            Running 11 hours ago                             
tdl29w1g12y5        acl.2               registry.example.com:5/acl:latest   worker-05                   Running             Running about an hour ago                        
9hrf14vx0m8d         \_ acl.2           registry.example.com:5/acl:latest   worker-05                   Shutdown            Shutdown about an hour ago                       
8hmdu51vuykg         \_ acl.2           registry.example.com:5/acl:latest   worker-01                   Shutdown            Shutdown about an hour ago                       
t5qp12m23yot         \_ acl.2           registry.example.com:5/acl:latest   worker-01                   Shutdown            Complete 2 hours ago                             
tsbam1hr2b9c         \_ acl.2           registry.example.com:5/acl          yn3r1lamszgq6ub06tg3ahe6c   Shutdown            Running 11 hours ago                             
pbq6ew1wv3pn        acl.3               registry.example.com:5/acl:latest   worker-01                   Running             Running about an hour ago                        
qagmo6uqrrdc         \_ acl.3           registry.example.com:5/acl:latest   worker-05                   Shutdown            Shutdown about an hour ago                       
7865qajaj2jq         \_ acl.3           registry.example.com:5/acl:latest   worker-01                   Shutdown            Shutdown about an hour ago                       
cms2tpm2lrvl         \_ acl.3           registry.example.com:5/acl:latest   worker-01                   Shutdown            Complete 4 hours ago                             
yjrba4scedzc         \_ acl.3           registry.example.com:5/acl          yn3r1lamszgq6ub06tg3ahe6c   Shutdown            Running 11 hours ago 

Shows the task history of the service and which nodes the tasks ran on. Look for errors in the ERROR column.
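The ERROR column is truncated by default, as in the "error while removing network:…" rows of the docker node ps output earlier. A minimal sketch for pulling out only the failing tasks with the full message: it assumes the --no-trunc flag and the .Name, .Node and .Error placeholders of docker service ps, and that the docker CLI expands \t in the format string to a tab. tasks_with_errors is a made-up helper name.

```shell
# Hypothetical helper: reads tab-separated "name<TAB>node<TAB>error" lines on
# stdin and prints only the tasks that have a non-empty error.
tasks_with_errors() {
  awk -F '\t' '$3 != "" { print $1 " on " $2 ": " $3 }'
}

# Usage on a manager:
#   docker service ps --no-trunc --format '{{.Name}}\t{{.Node}}\t{{.Error}}' acl | tasks_with_errors
```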


docker node update --availability drain worker-05

worker-05

Stops the swarm from using a node that you have identified as corrupt or suspect to be a problem. Its tasks get rescheduled on other nodes.


docker node update --availability active worker-05

worker-05

To activate a node again, for example one that was drained while it was down.


docker swarm join-token worker

To add a worker to this swarm, run the following command:

    docker swarm join --token SWMTKN-1-4y2e0g4dfqtr8xwbxsajzorll9vnzyotkij1p97j07vqtor5rg-2shevw4iur544eauhs556rabf 10.156.0.2:2377

To join a node to the cluster as a worker. OBS! The docker swarm join --token ... command must be copied and run in the terminal of the worker node that you want to join.


docker swarm join-token manager

Same as joining a worker, but joins a new manager to the cluster.
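If you want to script the join instead of copy-pasting, the -q flag of docker swarm join-token prints only the token. A minimal sketch: join_command is a made-up helper name that just prints the command without running it, and 10.156.0.2:2377 is the manager address from the worker output above.

```shell
# Hypothetical helper: prints the join command for a token ($1) and a manager
# address ($2), without running it.
join_command() {
  printf 'docker swarm join --token %s %s\n' "$1" "$2"
}

# Usage on a manager:
#   join_command "$(docker swarm join-token -q worker)" 10.156.0.2:2377
#   join_command "$(docker swarm join-token -q manager)" 10.156.0.2:2377
```

The printed line still has to be run on the node that should join.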


docker node update --label-add production=true worker-05

worker-05

Adding a label to a newly joined node can be necessary if the services have placement constraints on which nodes they can be deployed to.
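To find out which labels a service expects, its placement constraints can be read from the service spec. A minimal sketch, assuming constraints written without spaces (node.labels.key==value) and the Spec.TaskTemplate.Placement.Constraints field of docker service inspect; required_labels is a made-up helper name.

```shell
# Hypothetical helper: reads a constraint list such as
# "[node.labels.production==true node.role==worker]" on stdin and prints the
# node labels it requires as key=value, one per line.
required_labels() {
  tr -d '[]' | tr ' ' '\n' | sed -n 's/^node\.labels\.\(.*\)==\(.*\)$/\1=\2/p'
}

# Usage on a manager:
#   docker service inspect --format '{{.Spec.TaskTemplate.Placement.Constraints}}' ping-pong | required_labels
```

Each printed key=value is in the form that docker node update --label-add takes.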


docker service update ping-pong --force

ping-pong
overall progress: 3 out of 3 tasks 
1/3: running   [==================================================>] 
2/3: running   [==================================================>] 
3/3: running   [==================================================>] 
verify: Service converged 

Rebalances a service across the cluster after a new node has been added to the swarm. The --force flag makes the service update even though nothing in its configuration changed.


From a worker node

docker ps -a

CONTAINER ID        IMAGE                                                           COMMAND                  CREATED             STATUS                           PORTS               NAMES
a998b75af5a3        registry.example.com:5/zoho-adaptor-uat:latest                  "npm start"              About an hour ago   Up About an hour                                     zoho-adaptor-uat.1.y0omz07xqf5k3sch384rjzg4s
3f6f7ce7f18c        registry.example.com:5/zoho-adaptor-staging:latest              "npm start"              About an hour ago   Up About an hour                                     zoho-adaptor-staging.1.ui3i86so1uxh8x6k2guq9rw01
92b61a1ab0a0        registry.example.com:5/zoho-adaptor:latest                      "npm start"              About an hour ago   Up About an hour                                     zoho-adaptor.1.p9q9oybrewa692zbs8z5fd4kb
90d2a771ef0e        mrsuperhero/wholesale-sit:latest                                "docker-php-entrypoi…"   About an hour ago   Up About an hour                 80/tcp              wholesale-sit.1.medddy0amddy61xjfyj9i9lqu
13fa4a01bf92        registry.example.com:5/wholesale-ftp-manager:latest             "docker-entrypoint.s…"   About an hour ago   Up About an hour                                     wholesale-ftp-manager.1.moryl65emlvwdoo2pa3p4b3m2
5dec02d416a3        mrsuperhero/wholesale-dev:latest                                "docker-php-entrypoi…"   About an hour ago   Exited (137) About an hour ago                       wholesale-dev.1.tdm04n6es078c514iye5fo5gs
c5a23521d1ec        mrsuperhero/wholesale:latest                                    "docker-php-entrypoi…"   About an hour ago   Up About an hour                 80/tcp              wholesale.1.k9ddmnl1utqyxni8dso40mk6k
bb2139a6f5ca        registry.example.com:5/ui-voip-numberpool-staging:latest        "docker-entrypoint.s…"   About an hour ago   Up About an hour                                     ui-voip-numberpool-staging.1.lgmne591oyvvbf6g2tpyoklpz
7aebe7b1e1a8        registry.example.com:5/ui-voip-numberpool:latest                "docker-entrypoint.s…"   About an hour ago   Up About an hour                                     ui-voip-numberpool.1.8c7idw7wyfshwbcmec7hbd7eb
0ee7cc0d839e        registry.example.com:5/sys02-rebuild-ui-uat:latest              "npm start"              About an hour ago   Up About an hour                                     sys02-rebuild-ui-uat.1.v29st7d0goddys6iyz57t2jjq
0938774e5430        registry.example.com:5/sys02-rebuild-ui-staging:latest          "npm start"              About an hour ago   Up About an hour                                     sys02-rebuild-ui-staging.1.p8cvlgms7723m4nnmaox5jrtw
69544e725437        mrsuperhero/sys02-rebuild-ui:latest                             "npm start"              About an hour ago   Up About an hour                                     sys02-rebuild-ui.1.lschjl7x7jqi5vd3haqr7jo2p
12f863d15f6d        registry.example.com:5/sys02-rebuild-uat:latest                 "npm start"              About an hour ago   Created                                              sys02-rebuild-uat.1.hd2bmecvj0fkye5twmey39flv
...

Lists all containers on the worker node. With the -a flag, the list includes exited containers.


docker stats

CONTAINER ID        NAME                                                                                CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
a998b75af5a3        zoho-adaptor-uat.1.y0omz07xqf5k3sch384rjzg4s                                        0.00%               38.83MiB / 23.55GiB   0.16%               2.27MB / 400kB      94.2kB / 16.4kB     23
063de48800d3        sms.1.o510t7owmi3s09skp1xiauls1                                                     0.00%               39.34MiB / 23.55GiB   0.16%               1.64MB / 15.6kB     0B / 16.4kB         23
2486af68cf3a        sim-delivery-ui.1.2jjkka517nlk43d0dhiagqphh                                         0.00%               40.25MiB / 23.55GiB   0.17%               1.63MB / 8.44kB     0B / 16.4kB         19
fff833b0d21d        sim-delivery-staging.1.occufvt6om4zrm817zqobl7k0                                    0.00%               37MiB / 23.55GiB      0.15%               212MB / 742kB       28.7kB / 16.4kB     23
9b7fc3d89708        sim-delivery.1.6x4hsoiisx8095lortwgkzkg1                                            12.56%              100.9MiB / 23.55GiB   0.42%               2.9GB / 1.18GB      0B / 16.4kB         23
1a61621ff4c5        redis-global.1.oypej8vmpzsa0vxs9tt0lpxtk                                            9.62%               10.82MiB / 23.55GiB   0.04%               4.06GB / 4.25GB     0B / 1.09MB         4
...

Prints a list of docker statistics for the containers running on the worker. Look for containers that utilize more than the average amount of resources. OBS! Too much RAM usage could be because of a memory leak.
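The "more than average" check can be done by a small filter instead of by eye. A minimal sketch, assuming the --no-stream flag and the .Name and .MemPerc placeholders of docker stats; above_average_memory is a made-up helper name.

```shell
# Hypothetical helper: reads "name mem%" lines on stdin and prints the
# containers whose memory percentage is above the average of all lines.
above_average_memory() {
  awk '
    { name[NR] = $1; raw[NR] = $2; val[NR] = $2 + 0; sum += val[NR] }
    END {
      if (NR == 0) exit
      avg = sum / NR
      for (i = 1; i <= NR; i++) if (val[i] > avg) print name[i], raw[i]
    }'
}

# Usage on a worker:
#   docker stats --no-stream --format '{{.Name}} {{.MemPerc}}' | above_average_memory
```

A single snapshot says little about a leak. Running it a few times and watching whether the same container keeps climbing is more telling.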


docker network inspect ${network-name}

Inspects a network to see if the IP of any container in the network mismatches with the container. ...or maybe you're looking for something else, idk!
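The full inspect output is long, so for the IP check it can be reduced to one name and address per line. A minimal sketch, assuming the Containers map of docker network inspect with its Name and IPv4Address fields; network_ips is a made-up helper name, and my-network and my-container are placeholders.

```shell
# Hypothetical helper: reads "name ip/prefix" lines on stdin and prints
# "name ip" with the CIDR suffix removed, skipping blank lines.
network_ips() {
  awk 'NF == 2 { sub(/\/.*/, "", $2); print $1, $2 }'
}

# Usage on the node:
#   docker network inspect --format '{{range .Containers}}{{println .Name .IPv4Address}}{{end}}' my-network | network_ips
# Compare with what the container itself reports:
#   docker inspect --format '{{range .NetworkSettings.Networks}}{{println .IPAddress}}{{end}}' my-container
```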


docker inspect ${container-name}

Inspecting a container to see details of the container.


docker swarm leave -f

Node left the swarm.

If the node does not leave cleanly, look for error output, for example:

Error response from daemon: context deadline exceeded

sudo service docker stop
sudo rm -rf /var/lib/docker/swarm
sudo service docker start

Removes the corrupted swarm data, which can help if your CLI commands throw errors or freeze. This is a drastic measure, so chill with this set of commands and only use it if necessary. It is good to follow up by removing the worker from the swarm (docker node rm from a manager) and restarting the managers, before re-adding the node to the swarm with a join token.
