sort -u and sort | uniq have almost identical performance: both are slow, and both can be replaced with a faster Python script.
Create a file named generate.py containing the following code:
# generate.py
# Write one million lines, each a random permutation of the same eight words.
from random import sample

population = "lorem ipsum dolor sit amet consectetur adipiscing elit".split()
number_of_lines = 1_000_000

with open("long-file.txt", "w") as f:
    for _ in range(number_of_lines):
        print(" ".join(sample(population, len(population))), file=f)
Then run the following command:
python3 generate.py
This will create a text file named long-file.txt with one million random lines,
about 53 MB in size (55 bytes per line: 47 letters, seven spaces, and a newline).
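To sanity-check the generated file, a small helper like this works (check_file.py is my name for it, not part of the original setup):

# check_file.py (hypothetical helper, not from the original post)
import os

size = os.path.getsize("long-file.txt")
with open("long-file.txt") as f:
    line_count = sum(1 for _ in f)
print(f"{size} bytes, {line_count} lines")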
Test environment:
- Ubuntu 20.04
- Intel Core i5-10400
- 32GB RAM
- SSD
I don't know why, but both ... | sort -u and ... | sort | uniq run extremely slowly.
time sh -c 'cat long-file.txt | sort -u > ./out-sort-u.txt'
real 0m4,184s
user 0m4,134s
sys 0m0,062s
time sh -c 'cat long-file.txt | sort | uniq > ./out-sort-uniq.txt'
real 0m4,081s
user 0m4,101s
sys 0m0,133s
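I have not re-run the benchmark above with it, but one common cause of slow sort is locale-aware collation; forcing the byte-wise C locale often speeds GNU sort up considerably and is worth trying:
time sh -c 'LC_ALL=C sort -u long-file.txt > ./out-sort-u-c.txt'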
This problem can be solved with a custom Python script.
# uniq_sort.py
# Deduplicate stdin with a set, then sort the unique lines.
import sys

unique_lines = set()
for line in sys.stdin:
    unique_lines.add(line.rstrip("\n"))

unique_lines = sorted(unique_lines)
for line in unique_lines:
    print(line)
time sh -c 'cat long-file.txt | python3 uniq_sort.py > ./out-uniq-sort-py.txt'
real 0m0,336s
user 0m0,307s
sys 0m0,057s
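For reference, the same approach fits in a few lines. This is just a sketch of mine, not from the original post; it should behave the same for newline-terminated input:

# uniq_sort_compact.py (a sketch, not part of the original post)
import sys

# read everything, deduplicate with a set, sort, and print
for line in sorted(set(sys.stdin.read().splitlines())):
    print(line)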
If the order of the lines is not important, the line unique_lines = sorted(unique_lines)
can be removed. In that case, the output will still contain all unique lines, but in arbitrary order.
# uniq.py
# Deduplicate stdin with a set; print unique lines in arbitrary order.
import sys

unique_lines = set()
for line in sys.stdin:
    unique_lines.add(line.rstrip("\n"))

for line in unique_lines:
    print(line)
time sh -c 'cat long-file.txt | python3 uniq.py > ./out-uniq-py.txt'
real 0m0,324s
user 0m0,298s
sys 0m0,054s
But as you can see, this version is not much faster.
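One plausible explanation, based on how generate.py builds its lines rather than on extra measurements: every line is a permutation of the same eight words, so there can be at most 8! = 40,320 distinct lines. Sorting a set that small is nearly free compared to reading a million lines from stdin, which both scripts have to do.

# distinct_bound.py (illustrative only)
import math

# the generator permutes 8 words, so at most 8! distinct lines exist
print(math.factorial(8))  # 40320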