@jeanlescure
Last active November 3, 2020 01:17
Find lines from a huge file which are not present in another even bigger file

Abstract

Say you have a CSV backup of a huge database from a week ago (week-old-backup.csv, 3.6 GB) and an up-to-date backup of the same database (up-to-date-backup.csv, 3.8 GB). You would like to generate a file containing only the rows that have been added since the week-old backup.

This document details the fastest way to generate this diff file of huge files on Linux.

What you'll need

  • A machine with enough RAM to load the largest of the files into memory, and enough storage to hold two copies of each file to compare plus one additional copy of the largest file
  • A Linux distro providing the sort, comm, sync, and rm commands
  • A /tmp (tmpfs) partition large enough to hold the contents of the largest file

Process

Before starting make sure you have enough RAM and tmpfs storage available to process the files.
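A quick way to check both before you begin (assuming `free` from the procps package and the standard `df` utility are installed):

```shell
# Show available RAM in human-readable units.
free -h

# Show the size and free space of the filesystem mounted at /tmp
# (on most distros this is a tmpfs sized to half of RAM by default).
df -h /tmp
```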

In order to clear the unused RAM run:

$ su -c "sync; echo 3 > /proc/sys/vm/drop_caches"

In order to clear the tmpfs run:

$ cd /tmp
$ sudo rm -r *

Now it is time to sort each file. Based on the example presented in the abstract, we would run:

$ LC_ALL=C sort --parallel=4 -o sorted-week-old-backup.csv week-old-backup.csv
$ LC_ALL=C sort --parallel=4 -o sorted-up-to-date-backup.csv up-to-date-backup.csv

Note: the number 4 passed to the --parallel argument should match the number of available CPU cores
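Rather than hard-coding the core count, you can let the shell query it (this assumes `nproc` from GNU coreutils is available):

```shell
# nproc reports the number of available processing units.
CORES=$(nproc)
echo "Sorting with ${CORES} threads"

# Same sort invocation as above, parameterized on the detected core count:
# LC_ALL=C sort --parallel="${CORES}" -o sorted-week-old-backup.csv week-old-backup.csv
```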

Note: depending on your system settings you might have to clear the RAM and tmpfs between each run

Troubleshooting: you might see Killed in the terminal and find that the sorted file is empty or incomplete. In that case, simply clear the RAM and tmpfs and try again.

Once you have your sorted files ready, clear the RAM and tmpfs one last time, then generate the "diff" file by running:

$ LC_ALL=C comm -13 sorted-week-old-backup.csv sorted-up-to-date-backup.csv > diff.csv
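The flag combination is what does the work here: comm prints three columns (lines unique to the first file, lines unique to the second file, and lines common to both), and -13 suppresses columns 1 and 3, leaving only the lines unique to the second file, i.e. the newly added rows. A minimal demonstration with two tiny sorted files (the filenames old.txt and new.txt are made up for illustration):

```shell
# Two sorted files; new.txt contains everything in old.txt plus two new lines.
printf 'alice\nbob\ncarol\n' > old.txt
printf 'alice\nbob\ncarol\ndave\nerin\n' > new.txt

# -1 drops lines unique to old.txt, -3 drops lines present in both,
# so only the lines added in new.txt are printed.
comm -13 old.txt new.txt
# → dave
# → erin
```

Note that comm requires both inputs to be sorted with the same collation order, which is why the sort and comm commands above both set LC_ALL=C.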

In our example with two ~4 GB files, on a laptop with an AMD 3200U 2-core CPU (exposing 4 virtual cores), 20 GB of RAM, and an 8 GB tmpfs, each sort and the final comm command took less than a minute to run.
