@DrEsteban
Created January 27, 2017 20:17
DC/OS: Docker artifact cleanup -- What to do if your nodes run out of disk space?

DC/OS: Docker artifact cleanup

Introduction

During normal usage, your DC/OS cluster may accumulate Docker images and containers on the public and private agents, especially if you run CI workflows for Docker images in your cluster or your microservices frequently spin up and down with new container instances. By default, Docker caches each image version and container instance on the executing node so it can quickly rehydrate stopped containers or deploy new instances. Left unchecked, these cached artifacts can eventually fill the node's disk.

This Gist will walk you through the process of cleaning out these cached images/containers from your public and private agents, using a specially-configured Marathon app.

Prerequisites

  • You'll need the private SSH key corresponding to the public key that is authorized with your DC/OS cluster.
  • You must have the ability to SSH to your cluster.
    • If your environment is Azure Container Services, see this article for more details.
  • You must open an SSH tunnel to port 80 on the cluster. This tutorial will reference your locally-mapped port as [port #]. (An example tunnel command is sketched just after this list.)
  • You'll need to know the total number of nodes in your cluster. Instructions for finding this information are included below.
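
The following is a minimal sketch of such a tunnel. The host name, user name, SSH port, and local port are placeholders/assumptions, not values from this Gist; substitute your own:

# <master fqdn>  - your cluster's master address
# azureuser      - your admin user name (a common default for Azure Container Services)
# 2200           - the SSH port exposed by the ACS master load balancer (use 22 for a plain DC/OS master)
# 8080           - the local port this tutorial refers to as [port #]
ssh -i ~/.ssh/id_rsa -fNL 8080:localhost:80 -p 2200 azureuser@<master fqdn>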

Steps

1. Open a web browser to Marathon's management portal

Using the port you mapped in the Prerequisites section, open a web browser to http://localhost:[port #]/marathon. You will see all the applications currently deployed to your cluster here.
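
If you'd like to verify the tunnel from the command line first, Marathon's standard /v2/info endpoint returns its name and version. A quick check (assumes the tunnel above and that jq is installed; plain curl works too):

curl -s http://localhost:[port #]/marathon/v2/info | jq '{name, version}'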

2. Click the "Create Application" button in the upper-right of the portal

create application.png

3. Change to "JSON Mode"

json mode.png

4. Paste the following Marathon app definition into the text area

NOTE: Change the <number of nodes in cluster> to equal the total number of nodes. You can find this information by counting the number of nodes at http://localhost:[port #]/#/nodes/.
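
If you prefer the command line, the Mesos master's /master/slaves endpoint (proxied by the DC/OS admin router under /mesos) lists every agent node; counting them gives the value to use. A sketch, assuming the tunnel from the Prerequisites and that jq is installed:

curl -s http://localhost:[port #]/mesos/master/slaves | jq '.slaves | length'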

{
  "id": "/docker-cleanup",
  "cmd": "sudo docker rm $(sudo docker ps -a -q) & sudo docker rmi $(sudo docker images -a -q) & echo Done! & sleep 3600",
  "cpus": 0.01,
  "mem": 32,
  "disk": 0,
  "instances": <number of nodes in cluster>,
  "constraints": [
    [
      "hostname",
      "UNIQUE"
    ]
  ],
  "acceptedResourceRoles": [
    "*",
    "slave_public"
  ]
}

This application will run one instance on each node of the cluster (enforced by the hostname:UNIQUE constraint together with the instances value you set above). On each node, it will attempt to remove all stopped containers and any images not in use. Don't worry: running containers, and the images they depend on, are unaffected because the commands omit the -f (force) argument.
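
For readability, here is the same cmd expanded one command per line, with a note on what each part does (these are the same commands as above, not a different method):

sudo docker rm $(sudo docker ps -a -q)       # remove every container; running containers fail to remove (no -f) and are left alone
sudo docker rmi $(sudo docker images -a -q)  # remove every image not referenced by a remaining container
echo Done!                                   # marker to look for in the task's stdout
sleep 3600                                   # keep the task alive for an hour so Marathon doesn't restart it immediately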

After cleaning is complete, the application will output "Done!" to stdout and sleep for 3,600 seconds (60 minutes). After the 60-minute timeout, the application exits gracefully. However, because Marathon has been configured to keep '<number of nodes in cluster>' instances running, the application will immediately restart and the cleanup will run again. If you don't want this, make sure to follow the rest of this tutorial within 60 minutes of clicking "Create Application".

5. Click "Create Application", and wait for all instances to spin up

NOTE: 5/5 is only used here as an example. The actual value will be equal to the <number of nodes in cluster> variable you configured above.

all instances running.png

6. Click the application name to go to the details page, and verify that all instances report "Done!" in stdout

all instances details.png

instance stdout.png

Some nodes may take longer to clear out than others if they have more cached images and containers.
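
You can also watch progress from the command line via Marathon's standard /v2/apps endpoint (a sketch; assumes the tunnel from the Prerequisites and that jq is installed):

curl -s http://localhost:[port #]/marathon/v2/apps/docker-cleanup | jq '.app | {instances, tasksRunning, tasksStaged}'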

7. Suspend the application (optional)

At this point, the application will clean all nodes once an hour. If you don't want this, go to the application details page again, click the gear icon, and select "Suspend". This will scale the number of instances to zero, effectively disabling it. You can always scale back up to <number of nodes in cluster> at a future date to run the cleanup again.
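
The same can be done without the UI through Marathon's standard REST API. A sketch, assuming the tunnel from the Prerequisites:

# Suspend (scale to 0 instances):
curl -X PUT -H "Content-Type: application/json" -d '{"instances": 0}' http://localhost:[port #]/marathon/v2/apps/docker-cleanup

# Later, scale back up to run the cleanup again:
curl -X PUT -H "Content-Type: application/json" -d '{"instances": <number of nodes in cluster>}' http://localhost:[port #]/marathon/v2/apps/docker-cleanup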

8. You're done! Retry your failing deployment.

FAQ

  • The cleanup application is stuck with Status "Waiting" or "Delayed", with some number of instances not running!
    • Check the value you are using for <number of nodes in cluster> to ensure you aren't specifying a higher number of instances than nodes. Only one instance per node can be run, due to the hostname:UNIQUE constraint. Marathon will wait forever for another available node to execute on.
    • Check the Mesos logs at http://localhost:[port #]/mesos. In the "Completed Tasks" section, look for any instances of your cleanup application with status "KILLED" or "FAILED". You can go to the "Sandbox" for those instances to see their stderr and stdout to check for any obvious errors. Make sure to take note of the "Host" that attempted to run the application to identify if one node in particular is failing to execute it.
    • One or more of your nodes may not have 32MB of memory available to run the application. In this case, you'll need to identify the node, identify the running applications on that node, and scale down some of the running applications in Marathon until there is 32MB available. (All of this information can be found at http://localhost:[port #]/mesos.) Alternatively, you can SSH to the node from the master and execute the commands manually.
    • Your node's disk may be so full that Marathon cannot deploy the application. In this case, you'll need to SSH to the node from the master and execute the commands manually (see the sketch after this list).
  • I don't have access to the private SSH key used to create my cluster!
    • If your environment is Azure Container Services, you can reset SSH keys, create/delete users, etc., with the Azure VMAccess extension.
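
If you do need to run the cleanup by hand, the following sketch shows the same commands executed over SSH. It assumes you are already logged into a master, that <agent private IP> is the "Host" value shown for the node in the Mesos UI, and that your SSH key is available on the master (for example via agent forwarding with ssh -A):

ssh <agent private IP>
sudo docker rm $(sudo docker ps -a -q)
sudo docker rmi $(sudo docker images -a -q)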