@sda97ghb
Last active December 1, 2021 11:56

TL;DR

  1. sort -u and sort | uniq perform almost identically
  2. both are slow on this input and can be replaced with a faster Python script

Input data

Create a file named generate.py containing the following code:

# generate.py
from random import sample
population = "lorem ipsum dolor sit amet consectetur adipiscing elit".split()
number_of_lines = 1_000_000
with open("long-file.txt", "w") as f:
    for _ in range(number_of_lines):
        print(" ".join(sample(population, len(population))), file=f)

Then run the following command:

python3 generate.py

This creates a text file named long-file.txt, about 53MB in size, with one million lines. Each line is a random permutation of the same eight words.
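As a quick sanity check on this input, the sketch below works out how many distinct lines the generator can produce and how large the file should be. It only reuses the word list from generate.py; the variable names are illustrative.

```python
from math import factorial

# Same word list as generate.py.
population = "lorem ipsum dolor sit amet consectetur adipiscing elit".split()

# Every line is a permutation of all eight words, so the number of
# distinct lines cannot exceed 8!.
max_distinct_lines = factorial(len(population))

# Each line holds all the words, single spaces between them, and a newline.
bytes_per_line = sum(len(word) for word in population) + (len(population) - 1) + 1

print(max_distinct_lines)          # 40320
print(bytes_per_line * 1_000_000)  # 55000000
```

So the file is 55 million bytes (roughly 52.5 MiB), and almost all of its one million lines are duplicates of at most 40,320 distinct ones.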

Testing environment

  • Ubuntu 20.04
  • Intel Core i5-10400
  • 32GB RAM
  • SSD

Standard utilities

For reasons I have not investigated, both ... | sort -u and ... | sort | uniq are very slow on this file: each takes about 4 seconds.

sort -u

time sh -c 'cat long-file.txt | sort -u > ./out-sort-u.txt'
real    0m4,184s
user    0m4,134s
sys     0m0,062s

sort | uniq

time sh -c 'cat long-file.txt | sort | uniq > ./out-sort-uniq.txt'
real    0m4,081s
user    0m4,101s
sys     0m0,133s
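One plausible factor, which the measurements above do not test, is locale-aware collation: the comma decimal separator in the time output suggests a non-C locale, and sort compares lines according to the current locale unless LC_ALL=C is set. The sketch below is one way to check this assumption. The helper name timed_sort_u is hypothetical, and it assumes a sort binary is on the PATH.

```python
import os
import subprocess
import time


def timed_sort_u(path, lc_all=None):
    """Run `sort -u` on path and return (elapsed seconds, output lines).

    If lc_all is given, it is passed to sort as the LC_ALL environment
    variable; otherwise sort inherits the current environment unchanged.
    """
    env = dict(os.environ)
    if lc_all is not None:
        env["LC_ALL"] = lc_all  # LC_ALL overrides the other locale variables
    start = time.perf_counter()
    result = subprocess.run(
        ["sort", "-u", path],
        env=env,
        check=True,
        capture_output=True,
        text=True,
    )
    elapsed = time.perf_counter() - start
    return elapsed, result.stdout.splitlines()
```

Comparing timed_sort_u("long-file.txt") with timed_sort_u("long-file.txt", "C") on the same machine would show whether the locale accounts for a meaningful share of the 4 seconds. No result is claimed here.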

Python script

This problem can be solved with a custom Python script.

The version producing the same output as sort -u

# uniq_sort.py
import sys

# A set keeps one copy of each line, so duplicates are dropped on insert.
unique_lines = set()

for line in sys.stdin:
    unique_lines.add(line.rstrip("\n"))

# Sort the unique lines only.
unique_lines = sorted(unique_lines)

for line in unique_lines:
    print(line)

time sh -c 'cat long-file.txt | python3 uniq_sort.py > ./out-uniq-sort-py.txt'
real    0m0,336s
user    0m0,307s
sys     0m0,057s
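To confirm that the script output really matches sort -u, the two output files can be compared. The sketch below checks the set of lines and the line order separately. The order check matters because Python's sorted orders strings by code point, while sort follows the current locale, so the order is only guaranteed to agree under LC_ALL=C. The function names are illustrative.

```python
def read_lines(path):
    """Return the lines of a text file without their trailing newlines."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def compare_outputs(path_a, path_b):
    """Return (same_set, same_order) for two line-oriented files."""
    lines_a = read_lines(path_a)
    lines_b = read_lines(path_b)
    same_set = set(lines_a) == set(lines_b)
    same_order = lines_a == lines_b
    return same_set, same_order
```

For example, compare_outputs("out-sort-u.txt", "out-uniq-sort-py.txt") should return (True, True) if both tools agree on content and order. A result of (True, False) would point to a collation difference rather than a deduplication bug.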

Even faster version

If the order of the lines is not important, unique_lines = sorted(unique_lines) can be removed. The output will still contain every unique line, but in arbitrary order.

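# uniq.py
import sys

# A set keeps one copy of each line, so duplicates are dropped on insert.
unique_lines = set()

for line in sys.stdin:
    unique_lines.add(line.rstrip("\n"))

# Iterating over a set yields the lines in arbitrary order.
for line in unique_lines:
    print(line)

time sh -c 'cat long-file.txt | python3 uniq.py > ./out-uniq-py.txt'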
real    0m0,324s
user    0m0,298s
sys     0m0,054s

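As the timings show, this version is barely faster (0,324s versus 0,336s). The input contains at most 8! = 40,320 distinct lines, so sorting them takes very little time.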
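If neither sorted nor arbitrary order is wanted, a third option is to keep each line at the position where it first appeared. This is a minimal sketch, not part of the measurements above, and the function name is illustrative. It relies on dict preserving insertion order, which the language guarantees from Python 3.7.

```python
def unique_in_first_seen_order(lines):
    """Drop repeated lines, keeping the first occurrence of each."""
    # dict.fromkeys builds a dict whose keys are the distinct lines,
    # in the order they were first encountered.
    return list(dict.fromkeys(line.rstrip("\n") for line in lines))


print(unique_in_first_seen_order(["b\n", "a\n", "b\n", "c\n", "a\n"]))
# ['b', 'a', 'c']
```

Passing sys.stdin instead of the sample list and printing each returned line would turn this into a drop-in variant of uniq.py.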
