
Inspired by http://linuxnote.net/jianingy/en/linux/a-fast-way-to-remove-huge-number-of-files.html, this program removes all regular files within a directory, using multiple processes to work faster. Timings from my system are in a comment below; feel free to leave your own. You can easily create lots of empty files for testing with "seq 10000 | xargs touch".

purge-directory.c
/* expose scandir() and DT_REG on glibc (not needed on OS X) */
#define _DEFAULT_SOURCE

#include <dirent.h>
#include <stdio.h>
#include <unistd.h>

/* filter for regular files only */
static int dirent_select(const struct dirent* ent)
{
    return ent->d_type == DT_REG;
}

/* goes to the directory in argv[1] and removes all regular files within */
int main(int argc, char* argv[])
{
    if (argc != 2) {
        fprintf(stderr, "directory to delete from is required\n");
        return 1;
    }

    int res = chdir(argv[1]);
    if (res) {
        perror("chdir");
        return 1;
    }

    /* make the list of files to delete */
    struct dirent** list;
    int count = scandir(".", &list, dirent_select, NULL);
    if (count < 0) {
        perror("scandir");
        return 1;
    }

    /* fork twice to become four processes total */
    pid_t pid1 = fork();
    pid_t pid2 = fork();
    if (pid1 < 0 || pid2 < 0) {
        perror("fork");
        return 1;
    }

    /* figure out who is responsible for which files (one case per process) */
    int begin, end;
    if (pid1 == 0 && pid2 == 0) {
        begin = 0;
        end = count / 4;
    } else if (pid2 == 0) {
        begin = count / 4;
        end = count / 2;
    } else if (pid1 == 0) {
        begin = count / 2;
        end = count * 3 / 4;
    } else {
        begin = count * 3 / 4;
        end = count;
    }

    /* now delete the files this process is responsible for */
    int ii;
    for (ii = begin; ii < end; ++ii) {
        res = unlink(list[ii]->d_name);
        if (res) {
            perror("unlink");
            return 1;
        }
    }

    return 0;
}

On a 2011 MacBook Air (OS X 10.8.3, 128 GB SSD, 1.6 GHz), clearing a directory of 100k empty files takes 11.3 seconds with rsync --delete, vs. 4.3 seconds with this program. The relative gains shrink with smaller numbers of files, but are still measurable with 10k files (about 0.64 vs. 0.48 seconds). I tried 2-way and 8-way parallelism as well, but found 4-way to be the best fit, at least on this 2x2 (dual-core, hyper-threaded) system.

If you want to try 8-way parallelism, here's the conditional block you need (along with an extra pid3 = fork(), of course):

int tmp = (pid1 == 0 ? 0 : 4)
        + (pid2 == 0 ? 0 : 2)
        + (pid3 == 0 ? 0 : 1);
begin = count * tmp / 8;
end = count * (tmp + 1) / 8;

http://fpaste.org/16330/28909613/

Your program:

$ time ./dothis TestDir/

real    0m36.350s
user    0m0.057s
sys 0m3.831s

My program:

$ time ./killdir TestDir/
Total files: 1000000
Performing delete..
Done

real    0m16.713s
user    0m1.140s
sys 0m6.273s

However, this only works on Linux.
