@bbeaudreault
Created June 21, 2013 21:13
Clean up Hadoop jobcache files
#!/bin/bash
# Remove jobcache directories that are over an hour old and contain no attempt_
# directories, i.e. jobs whose task attempts are gone but whose cache dirs remain.
for DIR in $(find /mnt/mapred/local/taskTracker/*/jobcache/* -maxdepth 0 -type d -mmin +60); do
  if ! find "$DIR" | grep -q attempt; then
    rm -rf "$DIR"
  fi
done
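
# Usage note (an addition, not from the original gist): the loop above could be run
# hourly via cron. The install path /usr/local/bin/cleanup-jobcache.sh below is
# hypothetical; adjust it to wherever this script actually lives.
#
#   # /etc/cron.d/cleanup-jobcache  (cron.d format: minute hour dom month dow user command)
#   0 * * * * root /usr/local/bin/cleanup-jobcache.sh >/dev/null 2>&1
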
# There is another bug that results in jobcache directories being duplicated
# within the attempt_ directories we preserve above. Those directories never go away, so jobs
# hitting this problem are never cleaned up. The command below handles this case; here's what it does:
#
# 1. Find all subdirectories directly below the attempt_ dirs that are over 7 days old,
#    e.g. /mnt/mapred/local/taskTracker/root/jobcache/job_201305091555_517165/attempt_201305091555_517165_m_000015_0/taskTracker
#    (We use a 7-day filter because with these we have no way of knowing whether the job is still running. This should be safe enough.)
# 2. Keep only the ones we care about, i.e. those with two taskTracker components in the path
# 3. Strip off everything after the jobId, so we are left with only the top-level job directory that we will delete,
#    e.g. /mnt/mapred/local/taskTracker/root/jobcache/job_201305091555_517165/
# 4. There are likely multiple attempt_ dirs per job directory, so dedupe with uniq
#    (find lists them consecutively, so a sort is not needed)
# 5. Recursively remove the results
find /mnt/mapred/local/taskTracker/*/jobcache/*/attempt*/ -maxdepth 1 -type d -mtime +7 -name "taskTracker" \
| sed -e 's/attempt.*//' \
| uniq \
| xargs rm -rf
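
# Dry-run sketch (an addition, not from the original gist): to preview which job
# directories the pipeline above would remove, drop the final "xargs rm -rf" and
# just print the deduplicated list, e.g.:
#
#   find /mnt/mapred/local/taskTracker/*/jobcache/*/attempt*/ -maxdepth 1 -type d -mtime +7 -name "taskTracker" \
#     | sed -e 's/attempt.*//' \
#     | uniq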