sort -u and sort | uniq have almost identical performance: both are slow, and both can be replaced with a faster Python script.
Create a file named generate.py containing the following code:
# generate.py
# Write one million lines, each a random permutation of the same eight words.
from random import sample

population = "lorem ipsum dolor sit amet consectetur adipiscing elit".split()
number_of_lines = 1_000_000

with open("long-file.txt", "w") as f:
    for _ in range(number_of_lines):
        print(" ".join(sample(population, len(population))), file=f)
Then run the following command:
python3 generate.py
This will create a text file named long-file.txt with one million random lines,
about 53 MB in size (55 bytes per line: 47 letters, seven spaces, and a newline).
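To sanity-check the generated file, a small helper like this works (check_file.py is my name for it, not part of the original setup):

# check_file.py (hypothetical helper, not from the original post)
import os

size = os.path.getsize("long-file.txt")
with open("long-file.txt") as f:
    line_count = sum(1 for _ in f)
print(f"{size} bytes, {line_count} lines")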
Test environment:
- Ubuntu 20.04
- Intel Core i5-10400
- 32GB RAM
- SSD
I don't know why, but both ... | sort -u and ... | sort | uniq run extremely slowly.
time sh -c 'cat long-file.txt | sort -u > ./out-sort-u.txt'
real 0m4,184s
user 0m4,134s
sys 0m0,062s
time sh -c 'cat long-file.txt | sort | uniq > ./out-sort-uniq.txt'
real 0m4,081s
user 0m4,101s
sys 0m0,133s
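I have not re-run the benchmark above with it, but one common cause of slow sort is locale-aware collation; forcing the byte-wise C locale often speeds GNU sort up considerably and is worth trying:
time sh -c 'LC_ALL=C sort -u long-file.txt > ./out-sort-u-c.txt'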
This problem can be solved with a custom Python script.
# uniq_sort.py
# Deduplicate stdin with a set, then sort the unique lines.
import sys

unique_lines = set()
for line in sys.stdin:
    unique_lines.add(line.rstrip("\n"))

unique_lines = sorted(unique_lines)
for line in unique_lines:
    print(line)
time sh -c 'cat long-file.txt | python3 uniq_sort.py > ./out-uniq-sort-py.txt'
real 0m0,336s
user 0m0,307s
sys 0m0,057s
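For reference, the same approach fits in a few lines. This is just a sketch of mine, not from the original post; it should behave the same for newline-terminated input:

# uniq_sort_compact.py (a sketch, not part of the original post)
import sys

# read everything, deduplicate with a set, sort, and print
for line in sorted(set(sys.stdin.read().splitlines())):
    print(line)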
If the order of the lines is not important, the line unique_lines = sorted(unique_lines)
can be removed. In that case, the output will still contain all unique lines, but in arbitrary order.
# uniq.py
# Deduplicate stdin with a set; print unique lines in arbitrary order.
import sys

unique_lines = set()
for line in sys.stdin:
    unique_lines.add(line.rstrip("\n"))

for line in unique_lines:
    print(line)
time sh -c 'cat long-file.txt | python3 uniq.py > ./out-uniq-py.txt'
real 0m0,324s
user 0m0,298s
sys 0m0,054s
But as you can see, this version is not much faster.
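One plausible explanation, based on how generate.py builds its lines rather than on extra measurements: every line is a permutation of the same eight words, so there can be at most 8! = 40,320 distinct lines. Sorting a set that small is nearly free compared to reading a million lines from stdin, which both scripts have to do.

# distinct_bound.py (illustrative only)
import math

# the generator permutes 8 words, so at most 8! distinct lines exist
print(math.factorial(8))  # 40320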