Super-fast sort - uniq for ngram counting

The problem:

  • 1.3TB of data with 5B lines in a 72GB .gz file
  • Need to sort the lines and get a count for each unique line, basically a sort | uniq -c
  • Have a machine with 24 cores, 128GB of memory, but not 1.3TB of free disk space
  • Solution: sort | uniq -c with lots of non-standard options and pigz to take care of compression

Here's the sort part; the uniq -c counting step I ran as usual on the sorted output (a sketch of it is at the end).

#!/bin/bash
INPUT=$1
OUTPUT=${INPUT%.gz}.sorted.gz
# Byte-wise comparison is much faster than locale-aware collation
export LC_ALL=C
export LC_COLLATE=C
# decompress -> parallel sort with a large in-memory buffer and SSD temp space -> recompress
pigz -d -c "$INPUT" -p 4 | sort -S 50G --parallel 20 -T /mnt/ssd/tmp --compress-program "./pigz.sh" | pigz -b 2048 -p 20 > "$OUTPUT"

where pigz.sh is just

#!/bin/bash
pigz -b 2048 -p 20 "$@"
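
A detail that is implied but not stated above: GNU sort runs the --compress-program with no arguments to compress a temporary file and with -d to decompress it, both over stdin/stdout, which is why the wrapper simply forwards its arguments to pigz. The wrapper also has to be executable for sort to run it:

chmod +x pigz.sh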

The options are

  • -S gives a huge buffer for sort to work with (it only takes a few seconds to sort this amount of data!)
  • --parallel makes sort quite a bit faster if you have the cores to spare
  • -T places the temporary files sort produces onto an SSD drive we happen to have
  • --compress-program tells sort to compress these temporary files using pigz.sh, which is just a wrapper script around pigz
  • pigz -p 20 uses up to 20 cores to compress the data
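
For completeness, here is a minimal sketch of the counting step hinted at above. The exact command isn't in the gist, so the file names and core counts here are assumptions; the key point is that the input is already sorted, so uniq -c can stream through it in a single pass:

#!/bin/bash
INPUT=$1                               # e.g. the .sorted.gz file produced above (assumed name)
OUTPUT=${INPUT%.sorted.gz}.counts.gz   # hypothetical output name
export LC_ALL=C
# decompress -> count consecutive duplicate lines -> recompress
pigz -d -c "$INPUT" -p 4 | uniq -c | pigz -p 20 > "$OUTPUT"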