Super-fast sort - uniq for ngram counting

The problem:

  • 1.3TB of data with 5B lines in a 72GB .gz file
  • Need to sort the lines and get a count for each unique line, basically a sort | uniq -c
  • Have a machine with 24 cores, 128GB of memory, but not 1.3TB of free disk space
  • Solution: sort | uniq -c with lots of non-standard options and pigz to take care of compression

Here's the sort part; the uniq -c counting step I ran as usual on the sorted output (a sketch of it is at the end).

#!/bin/bash
INPUT=$1
OUTPUT=${INPUT%.gz}.sorted.gz
# Byte-wise comparison is much faster than locale-aware collation
export LC_ALL=C
export LC_COLLATE=C
# decompress -> parallel sort with a large in-memory buffer and SSD temp space -> recompress
pigz -d -c "$INPUT" -p 4 | sort -S 50G --parallel 20 -T /mnt/ssd/tmp --compress-program "./pigz.sh" | pigz -b 2048 -p 20 > "$OUTPUT"

where pigz.sh is just

#!/bin/bash
pigz -b 2048 -p 20 "$@"
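
A detail that is implied but not stated above: GNU sort runs the --compress-program with no arguments to compress a temporary file and with -d to decompress it, both over stdin/stdout, which is why the wrapper simply forwards its arguments to pigz. The wrapper also has to be executable for sort to run it:

chmod +x pigz.sh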

The options are

  • -S gives a huge buffer for sort to work with (it only takes a few seconds to sort this amount of data!)
  • --parallel makes sort quite a bit faster if you have the cores to spare
  • -T places the temporary files sort produces onto an SSD drive we happen to have
  • --compress-program tells sort to compress these temporary files using pigz.sh, which is just a wrapper script around pigz
  • pigz -p 20 uses up to 20 cores to compress the data
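
For completeness, here is a minimal sketch of the counting step hinted at above. The exact command isn't in the gist, so the file names and core counts here are assumptions; the key point is that the input is already sorted, so uniq -c can stream through it in a single pass:

#!/bin/bash
INPUT=$1                               # e.g. the .sorted.gz file produced above (assumed name)
OUTPUT=${INPUT%.sorted.gz}.counts.gz   # hypothetical output name
export LC_ALL=C
# decompress -> count consecutive duplicate lines -> recompress
pigz -d -c "$INPUT" -p 4 | uniq -c | pigz -p 20 > "$OUTPUT"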