Say you have a CSV backup of a huge database from a week ago (week-old-backup.csv, 3.6GB) and an up-to-date backup of the same database (up-to-date-backup.csv, 3.8GB). You would like to generate a file containing only the rows that have been added since the week-old backup. This document details the fastest way to generate this diff file for huge files on Linux.
- A machine with enough RAM to load the largest of the files into memory, and enough free storage to hold two copies of each file to compare (the original and its sorted version) plus the size of the largest file
- A Linux distro that includes the commands sort, comm, sync and rm
- A /tmp (tmpfs) partition large enough to hold the contents of the largest of the files
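As a quick pre-flight check, the following commands report available memory and the capacity of the /tmp partition (free is part of procps, df is part of GNU coreutils; compare the numbers against your file sizes):

```shell
# Report total and available memory; check the "available" column
free -h
# Report the size and free space of the filesystem mounted at /tmp
df -h /tmp
```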
Before starting, make sure you have enough RAM and tmpfs storage available to process the files.
To clear unused RAM, run:
$ su -c "sync; echo 3 > /proc/sys/vm/drop_caches"
To clear the tmpfs, run:
$ cd /tmp
$ sudo rm -r *
Now it is time to sort each file. Based on the example presented in the abstract, we would run:
$ LC_ALL=C sort --parallel=4 -o sorted-week-old-backup.csv week-old-backup.csv
$ LC_ALL=C sort --parallel=4 -o sorted-up-to-date-backup.csv up-to-date-backup.csv
Note: the value 4 passed to the --parallel argument should reflect the number of available CPU cores.
Note: depending on your system settings, you might have to clear the RAM and tmpfs between each run.
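Rather than hardcoding the core count, you can query it with nproc (part of GNU coreutils) and pass the result to sort; a sketch using the example filenames from this guide:

```shell
# Detect the number of available processing units
CORES=$(nproc)
echo "sorting with $CORES threads"
# LC_ALL=C sort --parallel="$CORES" -o sorted-week-old-backup.csv week-old-backup.csv
```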
Troubleshooting: you might get a "Killed" response in the terminal and find that the sorted file is either empty or incomplete. In this case, simply clear the RAM and tmpfs and try again.
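Since sorting only reorders rows, a completed sort leaves the line count unchanged; a simple sanity check (using the example filenames from this guide) is to compare line counts before moving on:

```shell
# A successful sort produces exactly as many lines as its input
if [ "$(wc -l < week-old-backup.csv)" -eq "$(wc -l < sorted-week-old-backup.csv)" ]; then
    echo "sort completed"
else
    echo "sorted file incomplete: clear RAM and tmpfs, then re-run sort"
fi
```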
Once you have your sorted files ready, clear the RAM and tmpfs one last time, then generate the "diff" file by running:
$ LC_ALL=C comm -13 sorted-week-old-backup.csv sorted-up-to-date-backup.csv > diff.csv
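To see why these flags select the added rows: comm prints three columns, lines unique to the first file, lines unique to the second file, and lines common to both; -13 suppresses columns 1 and 3, leaving only the lines that appear in the second (newer) file. A toy demonstration with two small hypothetical files:

```shell
# Two tiny pre-sorted files standing in for the old and new backups
printf 'a\nb\nc\n' > old.txt
printf 'a\nb\nc\nd\ne\n' > new.txt
# Prints "d" and "e": the rows present only in new.txt
LC_ALL=C comm -13 old.txt new.txt
rm old.txt new.txt
```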
In our example with two ~4GB files, on a laptop with an AMD 3200U CPU (2 cores exposing 4 virtual cores), 20GB of RAM and an 8GB tmpfs, each sort command and the final comm command took less than a minute to run.