HOWTO: Clean Up Nomad Data Directory

Issue

When attempting to remove all of the data in the Nomad data directory, several directories and files cannot be deleted, and messages like the following are logged to the console:

rm: cannot remove ‘alloc/736f61b9-d7dc-cb73-0dd1-76b1b2ba032d/nomad-ui/secrets’: Device or resource busy
rm: cannot remove ‘alloc/ddcf5a78-5497-f4a4-a101-221fc4e0180b/fabio/alloc’: Device or resource busy
rm: cannot remove ‘alloc/ddcf5a78-5497-f4a4-a101-221fc4e0180b/fabio/secrets’: Device or resource busy
rm: cannot remove ‘alloc/ddcf5a78-5497-f4a4-a101-221fc4e0180b/fabio/proc/7464/projid_map’: Read-only file system
rm: cannot remove ‘alloc/ddcf5a78-5497-f4a4-a101-221fc4e0180b/fabio/proc/7464/setgroups’: Read-only file system
rm: cannot remove ‘alloc/ddcf5a78-5497-f4a4-a101-221fc4e0180b/fabio/proc/7464/timers’: Read-only file system

Process

Run ps aux | grep nomad | grep executor | grep -v grep to verify that there are no running executor processes on the node. Any executors listed by the previous command should be stopped before deleting the allocation state.
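If any executors are still running, they must be stopped before continuing. A minimal sketch, assuming the executor processes match a `nomad ... executor` pattern on their command line (adjust the pattern to match your process list):

# List lingering Nomad executor PIDs with their full command lines
# (the pattern here is an assumption; verify against ps output first).
pgrep -af 'nomad.*executor'

# Stop them before deleting allocation state.
sudo pkill -f 'nomad.*executor'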

When Nomad creates the environment for the allocation, several folders are mounted into the allocation to provide state storage.

  • nomad/alloc/«task-alloc-id»/«task-name»/secrets
  • nomad/alloc/«task-alloc-id»/«task-name»/dev
  • nomad/alloc/«task-alloc-id»/«task-name»/proc
  • nomad/alloc/«task-alloc-id»/«task-name»/alloc

These folders remain mounted in stopped allocations by design; Nomad unmounts them during its garbage collection cleanup. This allows an operator to inspect the state of an allocation that has not yet been garbage collected, even after the Nomad process itself has been stopped.
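To see which of these mounts are still active on a node, you can check the kernel mount table. A sketch, assuming a Linux client and the same NOMAD_DATA_ROOT placeholder used later in this document:

# List allocation-related mounts still active under the data directory.
NOMAD_DATA_ROOT=«Path to your Nomad data_dir»
grep "$NOMAD_DATA_ROOT/alloc" /proc/mounts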

Ordinarily these mounts are cleaned up by Nomad's garbage collection process and cause no issues; however, an operator has two options for removing them ahead of Nomad's internally scheduled garbage collection:

  • Stimulate a garbage collection run using the system/gc API endpoint
  • Unmount the folders manually

Stimulate Garbage Collection

The HTTP API's system/gc endpoint can be used to tell Nomad to immediately perform a garbage collection pass. In the case of a completely drained client node, this will serve to remove any remaining allocation state data and will prevent you from encountering the read-only tmpfs mounts.

curl -XPUT http://127.0.0.1:4646/v1/system/gc

This process does require that the Nomad client process be up and available and that the node is drained of all running allocations. Draining the node is necessary to ensure that all of the allocations are stopped and eligible for garbage collection when you trigger it manually.
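For example, the drain can be enabled from the Nomad CLI before triggering the collection pass. A sketch; -self assumes you are running the command on the client node in question:

# Drain this client node so all allocations are stopped and GC-eligible.
nomad node drain -self -enable

# Then trigger the garbage collection pass.
curl -XPUT http://127.0.0.1:4646/v1/system/gc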

Monitor the alloc directory in the Nomad data directory to verify that all of the allocations have been garbage-collected before stopping Nomad.
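One way to watch for this, assuming the NOMAD_DATA_ROOT placeholder used below:

# Re-list the alloc directory every few seconds; an empty listing means
# all allocations have been garbage-collected.
watch -n 5 "ls $NOMAD_DATA_ROOT/alloc"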

Manual Unmounting

Since these read-only file systems are regular Linux mounts, you can use the umount command to unmount them. This can be done with a short loop similar to the following:

export NOMAD_DATA_ROOT=«Path to your Nomad data_dir»
for ALLOC in "$NOMAD_DATA_ROOT"/alloc/*; do
  # Skip the shared "alloc" directory; iterate over the task directories.
  for TASK in $(ls "$ALLOC" | grep -v alloc); do
    umount "$ALLOC/$TASK/secrets" "$ALLOC/$TASK/dev" "$ALLOC/$TASK/proc" "$ALLOC/$TASK/alloc"
  done
done

Once the directories are unmounted, the remaining files and directories can be deleted.
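For example, assuming the NOMAD_DATA_ROOT variable from the loop above, and that you intend to remove all allocation state:

# With all mounts released, the allocation state can finally be removed.
sudo rm -rf "$NOMAD_DATA_ROOT"/alloc/*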

@vikashpisces commented:

Thank you for this gist. Manual unmounting helped me overcome a couple of hours of struggle today.
