@jeanlescure
Last active November 3, 2020 01:17
Find lines from a huge file which are not present in another even bigger file

Abstract

Say you have a CSV backup of a huge database from a week ago (week-old-backup.csv, 3.6 GB) and an up-to-date backup of the same database (up-to-date-backup.csv, 3.8 GB). You would like to generate a file containing only the rows that have been added since the week-old backup.

This document details the fastest way to generate this diff file of huge files on Linux.

What you'll need

  • A machine with enough RAM to load the largest of the files into memory, and enough storage to hold two copies of each file to compare plus one additional copy of the largest file
  • A Linux distro providing the sort, comm, sync, and rm commands
  • A /tmp (tmpfs) partition large enough to hold the contents of the largest file

Process

Before starting make sure you have enough RAM and tmpfs storage available to process the files.
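A quick way to check both before you begin (assuming `free` from the procps package and the standard `df` utility are installed):

```shell
# Show available RAM in human-readable units.
free -h

# Show the size and free space of the filesystem mounted at /tmp
# (on most distros this is a tmpfs sized to half of RAM by default).
df -h /tmp
```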

In order to clear the unused RAM run:

$ su -c "sync; echo 3 > /proc/sys/vm/drop_caches"

In order to clear the tmpfs run:

$ cd /tmp
$ sudo rm -r *

Now it is time to sort each file. Based on the example presented in the abstract, we would run:

$ LC_ALL=C sort --parallel=4 -o sorted-week-old-backup.csv week-old-backup.csv
$ LC_ALL=C sort --parallel=4 -o sorted-up-to-date-backup.csv up-to-date-backup.csv

Note: the number 4 passed to the --parallel argument should match the number of available CPU cores
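Rather than hard-coding the core count, you can let the shell query it (this assumes `nproc` from GNU coreutils is available):

```shell
# nproc reports the number of available processing units.
CORES=$(nproc)
echo "Sorting with ${CORES} threads"

# Same sort invocation as above, parameterized on the detected core count:
# LC_ALL=C sort --parallel="${CORES}" -o sorted-week-old-backup.csv week-old-backup.csv
```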

Note: depending on your system settings you might have to clear the RAM and tmpfs between each run

Troubleshooting: you might see Killed in the terminal and find that the sorted file is empty or incomplete. In that case, simply clear the RAM and tmpfs and try again.

Once you have your sorted files ready, clear the RAM and tmpfs one last time, then generate the "diff" file by running:

$ LC_ALL=C comm -13 sorted-week-old-backup.csv sorted-up-to-date-backup.csv > diff.csv
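The flag combination is what does the work here: comm prints three columns (lines unique to the first file, lines unique to the second file, and lines common to both), and -13 suppresses columns 1 and 3, leaving only the lines unique to the second file, i.e. the newly added rows. A minimal demonstration with two tiny sorted files (the filenames old.txt and new.txt are made up for illustration):

```shell
# Two sorted files; new.txt contains everything in old.txt plus two new lines.
printf 'alice\nbob\ncarol\n' > old.txt
printf 'alice\nbob\ncarol\ndave\nerin\n' > new.txt

# -1 drops lines unique to old.txt, -3 drops lines present in both,
# so only the lines added in new.txt are printed.
comm -13 old.txt new.txt
# → dave
# → erin
```

Note that comm requires both inputs to be sorted with the same collation order, which is why the sort and comm commands above both set LC_ALL=C.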

In our example with two ~4 GB files, on a laptop with an AMD 3200U 2-core CPU (exposing 4 virtual cores), 20 GB of RAM, and an 8 GB tmpfs, each sort and the final comm command took less than a minute to run.
