Benchmarking duplicate discovery with Elixir

Mix.install([
  {:benchee, "~> 1.0"}
])

# Test data: lists of lists of header strings, flattened inside each check.
no_duplicates = [Enum.map(1..10_000, &to_string/1)]
all_duplicates = [no_duplicates, no_duplicates]
some_duplicates = [no_duplicates, Enum.map(300..700, &to_string/1)]
one_duplicate = [no_duplicates, "700"]

# Approach 1: list subtraction. Removing one copy of each unique value leaves
# a non-empty list exactly when at least one value appears more than once.
minus_minus = fn headers ->
  flat_headers = List.flatten(headers)
  !Enum.empty?(flat_headers -- Enum.uniq(flat_headers))
end

# Approach 2: group values, then keep only groups with two or more members.
group_filter = fn headers ->
  headers
  |> List.flatten()
  |> Enum.group_by(& &1)
  |> Enum.filter(&match?({_, [_, _ | _]}, &1))
  |> Enum.empty?()
  |> Kernel.not()
end

# Approach 3: count occurrences, then look for any count greater than one.
frequencies = fn headers ->
  headers
  |> List.flatten()
  |> Enum.frequencies()
  |> Enum.any?(fn {_header, count} -> count > 1 end)
end

Benchee.run(
  %{
    "no_duplicates --" => fn -> minus_minus.(no_duplicates) end,
    "all_duplicates --" => fn -> minus_minus.(all_duplicates) end,
    "some_duplicates --" => fn -> minus_minus.(some_duplicates) end,
    "one_duplicate --" => fn -> minus_minus.(one_duplicate) end,
    "no_duplicates group_filter" => fn -> group_filter.(no_duplicates) end,
    "all_duplicates group_filter" => fn -> group_filter.(all_duplicates) end,
    "some_duplicates group_filter" => fn -> group_filter.(some_duplicates) end,
    "one_duplicate group_filter" => fn -> group_filter.(one_duplicate) end,
    "no_duplicates frequencies" => fn -> frequencies.(no_duplicates) end,
    "all_duplicates frequencies" => fn -> frequencies.(all_duplicates) end,
    "some_duplicates frequencies" => fn -> frequencies.(some_duplicates) end,
    "one_duplicate frequencies" => fn -> frequencies.(one_duplicate) end
  },
  time: 10,
  memory_time: 2
)

I needed to compare ways of finding duplicate entries across a set of lists. On the real data (~5–20 items across multiple lists, at most) the implementations would be nearly indistinguishable, so I inflated it into a much larger test dataset.
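
As a quick sanity check (a sketch reusing the anonymous functions from the script above; the small input is made up for illustration), all three approaches agree on whether a duplicate exists:

small = [["a", "b", "c"], ["c", "d"]]

minus_minus.(small)   #=> true
group_filter.(small)  #=> true
frequencies.(small)   #=> true

frequencies.([["a", "b"], ["c"]]) #=> false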

This second revision adds a test using Enum.frequencies/1.
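
For reference (an illustrative snippet, not part of the benchmark), Enum.frequencies/1 returns a map of each element to its count, and Enum.any?/2 succeeds as soon as it finds a count greater than one:

["a", "b", "c", "c", "d"]
|> Enum.frequencies()
#=> %{"a" => 1, "b" => 1, "c" => 2, "d" => 1}

["a", "b", "c", "c", "d"]
|> Enum.frequencies()
|> Enum.any?(fn {_header, count} -> count > 1 end)
#=> true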

$ elixir benchmark_duplicates.exs
Operating System: macOS
CPU Information: Apple M1 Max
Number of Available Cores: 10
Available memory: 64 GB
Elixir 1.14.0
Erlang 25.0.4

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 10 s
memory time: 2 s
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 2.80 min

Benchmarking all_duplicates -- ...
Benchmarking all_duplicates frequencies ...
Benchmarking all_duplicates group_filter ...
Benchmarking no_duplicates -- ...
Benchmarking no_duplicates frequencies ...
Benchmarking no_duplicates group_filter ...
Benchmarking one_duplicate -- ...
Benchmarking one_duplicate frequencies ...
Benchmarking one_duplicate group_filter ...
Benchmarking some_duplicates -- ...
Benchmarking some_duplicates frequencies ...
Benchmarking some_duplicates group_filter ...

Name                                   ips        average  deviation         median         99th %
one_duplicate group_filter          399.53        2.50 ms     ±8.23%        2.44 ms        3.10 ms
no_duplicates group_filter          392.28        2.55 ms     ±9.34%        2.48 ms        3.25 ms
some_duplicates group_filter        385.90        2.59 ms     ±8.39%        2.55 ms        3.26 ms
no_duplicates frequencies           374.41        2.67 ms    ±10.11%        2.65 ms        3.15 ms
some_duplicates frequencies         373.00        2.68 ms    ±12.77%        2.68 ms        3.32 ms
one_duplicate frequencies           370.32        2.70 ms     ±4.12%        2.71 ms        2.95 ms
all_duplicates frequencies          203.19        4.92 ms     ±4.51%        4.86 ms        5.60 ms
all_duplicates group_filter         191.07        5.23 ms     ±9.14%        5.12 ms        6.27 ms
no_duplicates --                    158.22        6.32 ms     ±5.55%        6.25 ms        7.05 ms
one_duplicate --                    156.01        6.41 ms     ±6.47%        6.36 ms        7.13 ms
some_duplicates --                  154.05        6.49 ms    ±12.32%        6.42 ms        8.03 ms
all_duplicates --                   149.23        6.70 ms     ±5.56%        6.62 ms        7.70 ms

Comparison:
one_duplicate group_filter          399.53
no_duplicates group_filter          392.28 - 1.02x slower +0.0463 ms
some_duplicates group_filter        385.90 - 1.04x slower +0.0884 ms
no_duplicates frequencies           374.41 - 1.07x slower +0.168 ms
some_duplicates frequencies         373.00 - 1.07x slower +0.178 ms
one_duplicate frequencies           370.32 - 1.08x slower +0.197 ms
all_duplicates frequencies          203.19 - 1.97x slower +2.42 ms
all_duplicates group_filter         191.07 - 2.09x slower +2.73 ms
no_duplicates --                    158.22 - 2.53x slower +3.82 ms
one_duplicate --                    156.01 - 2.56x slower +3.91 ms
some_duplicates --                  154.05 - 2.59x slower +3.99 ms
all_duplicates --                   149.23 - 2.68x slower +4.20 ms

Memory usage statistics:

Name                            Memory usage
one_duplicate group_filter           4.55 MB
no_duplicates group_filter           4.59 MB - 1.01x memory usage +0.0332 MB
some_duplicates group_filter         4.79 MB - 1.05x memory usage +0.24 MB
no_duplicates frequencies            3.08 MB - 0.68x memory usage -1.46903 MB
some_duplicates frequencies          2.38 MB - 0.52x memory usage -2.17207 MB
one_duplicate frequencies            3.08 MB - 0.68x memory usage -1.46825 MB
all_duplicates frequencies           5.24 MB - 1.15x memory usage +0.69 MB
all_duplicates group_filter          9.80 MB - 2.15x memory usage +5.25 MB
no_duplicates --                     4.11 MB - 0.90x memory usage -0.44268 MB
one_duplicate --                     4.10 MB - 0.90x memory usage -0.45168 MB
some_duplicates --                   4.11 MB - 0.90x memory usage -0.44685 MB
all_duplicates --                    4.28 MB - 0.94x memory usage -0.27480 MB