The problem:
- 1.3TB of data, 5 billion lines, in a 72GB .gz file
- Need to sort the lines and get a count for each unique line, basically a `sort | uniq -c`
- Have a machine with 24 cores and 128GB of memory, but not 1.3TB of free disk space
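
To see why the disk space is the hard part: GNU `sort` handles inputs larger than memory by spilling sorted runs to disk as uncompressed temporary files, so the obvious one-liner would need roughly 1.3TB of scratch space. A sketch of that naive version, with placeholder filenames:

```sh
# Naive approach: correct, but sort's uncompressed temp files
# would need ~1.3TB of scratch space, which isn't available here.
pigz -dc data.gz | sort | uniq -c | pigz -c > counts.gz
```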
The solution: `sort | uniq -c` with lots of non-standard options, and `pigz` to take care of compression. Compressing `sort`'s temporary files is what makes it fit: the data compresses roughly 18:1 (1.3TB down to 72GB), so the spill to disk stays far below the available space. Here's the `sort` part; `uniq` I used as usual.
INPUT=$1
OUTPUT=${INPUT%.gz}.sorted.gz  # strip .gz, append .sorted.gz: data.gz -> data.sorted.gz
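
For reference, a minimal sketch of the shape the whole `sort` step can take, reusing `$INPUT` and `$OUTPUT` from above. `--parallel`, `-S`, `-T`, and `--compress-program` are all standard GNU `sort` options; the specific values (24 threads, a 100G buffer) and the temp directory path are my assumptions based on the hardware described, not necessarily the exact command used here:

```sh
# Hedged sketch, not the exact command from this post.
# LC_ALL=C forces fast byte-wise comparison instead of locale-aware collation.
# --compress-program=pigz runs sort's temp files through pigz, so the
# on-disk spill stays near the ~72GB compressed size instead of 1.3TB.
pigz -dc "$INPUT" |
  LC_ALL=C sort --parallel=24 -S 100G -T /path/to/tmp --compress-program=pigz |
  pigz -c > "$OUTPUT"
```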