We had a situation where we had to move around 53 million files (HTMLs) from a bunch of folders to some other folders. The files were spread across no more than 100 folders, with some folders holding up to 2.5 million files. One thing worth mentioning is that the storage holding the files is backed by spinning disks and is mounted over NFS, both of which make access really slow.

The reason we want to move all of these files is that we no longer want to keep them in a flat structure but rather in a nested one, based on the date when the item was created; the directory structure would look something like this: %Y/%m/%d/. So we basically want to run mv $BASE_DIR/item_name.html.gz $BASE_DIR/%Y/%m/%d/item_name.html.gz for each of those 53 million files.
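
For a single file, the move boils down to something like this (a minimal sketch; $BASE_DIR, the uid and the date components are placeholders for illustration):

uid="some-item-uid"                      # hypothetical item uid
dest="$BASE_DIR/2020/03/03"              # %Y/%m/%d of the item's creation date
mkdir -p "$dest"                         # the date folders may not exist yet
mv "$BASE_DIR/$uid.html.gz" "$dest/$uid.html.gz"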

In order to list the files as fast as possible, we don't use ls but a small C program that can fetch the file names much faster. Check out this blog post about it.

The C program I finally used is this (the blog post clarifies much of it):

#define _GNU_SOURCE
#include <dirent.h>     /* Defines DT_* constants */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/syscall.h>

#define handle_error(msg) \
       do { perror(msg); exit(EXIT_FAILURE); } while (0)

struct linux_dirent {
   long           d_ino;
   off_t          d_off;
   unsigned short d_reclen;
   char           d_name[];
};

#define BUF_SIZE 1024*1024*7

int
main(int argc, char *argv[])
{
   int fd, nread;
   static char buf[BUF_SIZE];   /* static: a 7 MB buffer would be risky on the stack */
   struct linux_dirent *d;
   int bpos;
   char d_type;

   fd = open(argc > 1 ? argv[1] : ".", O_RDONLY | O_DIRECTORY);
   if (fd == -1)
       handle_error("open");

   for ( ; ; ) {
       /* pull a batch of raw directory entries straight from the kernel */
       nread = syscall(SYS_getdents, fd, buf, BUF_SIZE);
       if (nread == -1)
           handle_error("getdents");

       if (nread == 0)
           break;

       for (bpos = 0; bpos < nread;) {
           d = (struct linux_dirent *) (buf + bpos);
           /* the entry type is stored in the last byte of each record */
           d_type = *(buf + bpos + d->d_reclen - 1);
           /* print regular entries only: skip directories, deleted entries, "." and ".." */
           if (d_type != DT_DIR && d->d_ino != 0 && strcmp(d->d_name, ".") != 0 && strcmp(d->d_name, "..") != 0) {
              printf("%s\n", d->d_name);
           }
           bpos += d->d_reclen;
       }
   }

   exit(EXIT_SUCCESS);
}

Compile: gcc -Wall listdir.c -o listdir
Use it: ./listdir my_dir

The files are named in the following way: "<item_uid>.html.gz". We can use <item_uid> to fetch, from Cassandra, the date at which the item was created. Because I already have the listdir command available from the terminal, I decided to do the whole move using shell commands.
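
Stripping the ".html.gz" suffix to get the uid can be done with a quick rev/cut trick (the same one used in the final pipeline further down); the file name here is hypothetical:

echo "3fa2b7c9.html.gz" | rev | cut -d. -f3- | rev     # prints: 3fa2b7c9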

Cassandra listener server:

import socket
import time
import sys
import os

from my_cassandra_client import CassandraClient
# NOTE: _ItemByUID is the cqlengine model queried below; "my_models" is a
# placeholder for wherever it is actually defined in the project.
from my_models import _ItemByUID


if len(sys.argv) < 2:
    print("you need to supply port")
    exit()

cli = CassandraClient(['host1', 'host2'], 'keyspace')


# open the result file before the try block so the finally clause can always close it
result_file = open('results/result_{}'.format(os.getpid()), 'w')

try:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(('localhost', int(sys.argv[1])))
        sock.listen(10)

        while True:
            conn, addr = sock.accept()
            with conn:
                uid = conn.recv(64).decode()
                start = time.time()
                try:
                    item = (
                        _ItemByUID.objects()
                        .using(keyspace=cli.keyspace)
                        .only(['created_at', 'spider_name'])
                        .get(uid=uid.strip())
                    )
                except _ItemByUID.DoesNotExist:
                    # conn.sendall('na'.encode('utf-8'))
                    result_file.write('no\n')
                    conn.close()
                    continue
                data = (
                    '%s/%s/%s/%s/%s.html.gz\n' % (
                        item.spider_name,
                        item.created_at.year,
                        item.created_at.month,
                        item.created_at.day,
                        uid.strip()
                    )
                )

                result_file.write(data)
                result_file.flush()
                conn.close()
finally:
    # the listening socket is closed by the `with` block above
    result_file.flush()
    result_file.close()
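
To sanity-check a single listener (assuming one has been started on, say, port 8061 as in the next step), a uid can be pushed through netcat; the matching path should then show up in that listener's results/result_<pid> file:

echo -n "some-item-uid" | netcat localhost 8061     # "some-item-uid" is a placeholder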

Balancer:

# define 60 listeners
python3 server.py 8061 &
python3 server.py 8062 &
python3 server.py 8063 &
python3 server.py 8064 &

....

python3 server.py 8214 &
python3 server.py 8215 &
python3 server.py 8216 &
python3 server.py 8217 &
python3 server.py 8218 &
python3 server.py 8219 &
python3 server.py 8220 &

# start balancer (apt-get install balance)
balance -f -b localhost 8060 \
    localhost:8061 localhost:8062 localhost:8063 localhost:8064 localhost:8065 \
    localhost:8066 localhost:8067 localhost:8068 localhost:8169 localhost:8170 \
    localhost:8171 localhost:8172 localhost:8173 localhost:8174 localhost:8175 \
    localhost:8176 localhost:8177 localhost:8178 localhost:8179 localhost:8180 \
    localhost:8081 localhost:8082 localhost:8083 localhost:8084 localhost:8085 \
    localhost:8086 localhost:8087 localhost:8088 localhost:8189 localhost:8190 \
    localhost:8191 localhost:8192 localhost:8193 localhost:8194 localhost:8195 \
    localhost:8196 localhost:8197 localhost:8198 localhost:8199 localhost:8200 \
    localhost:8201 localhost:8202 localhost:8203 localhost:8204 localhost:8205 \
    localhost:8206 localhost:8207 localhost:8208 localhost:8209 localhost:8210 \
    localhost:8211 localhost:8212 localhost:8213 localhost:8214 localhost:8215 \
    localhost:8216 localhost:8217 localhost:8218 localhost:8219 localhost:8220

I know all the folders that have my items, so for each folder I run the command:

./listdir my_folder | \
    # get the file identifier
    rev | cut -d. -f3- | rev | \
    # use parallel to run multiple jobs at once
    parallel -I% --progress  --max-args 1 --jobs 15 "echo -n % | netcat localhost 8060"

This produces a bunch of result files in which each line contains a path of the form "my_dir/%Y/%m/%d/my_item.html.gz".
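
The actual move itself is not shown above; a rough sketch of how the result files could be consumed, assuming the original flat files live in $BASE_DIR and the destinations follow the paths written by the listeners, is:

cat results/result_* | grep -v '^no$' | while read -r path; do
    file=$(basename "$path")                 # <item_uid>.html.gz
    mkdir -p "$BASE_DIR/$(dirname "$path")"  # create %Y/%m/%d folders on demand
    mv "$BASE_DIR/$file" "$BASE_DIR/$path"
done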
