@chrisdone
Last active September 18, 2023 07:52
Rust vs Haskell: parse vector of ints

Comparing parsing of a list containing the number 12345678910, at list sizes of 1000, 2000 and 4000 elements.

Compares walking the string while building no structures ("check") against parsing it into a vector ("parse").

The Haskell version just builds a linked list and then converts that to an unboxed vector. I'm not sure what the nom crate does internally.
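For illustration, here is roughly what that check/parse split might look like written against attoparsec (the 10x-slower baseline mentioned below, not the library actually benchmarked here); the names and input construction are mine, but the shape, including the list-then-UV.fromList conversion, matches the description above.

```haskell
{-# LANGUAGE OverloadedStrings #-}
-- Hedged sketch, not the gist's code: "check" walks the numbers without
-- building anything; "parse" accumulates a linked list of Ints and then
-- converts it to an unboxed vector, as described above.
import Data.Attoparsec.ByteString.Char8
  (Parser, parseOnly, decimal, char, sepBy, skipMany)
import qualified Data.ByteString.Char8 as B
import qualified Data.Vector.Unboxed as UV

check :: Parser ()
check = int *> skipMany (char ',' *> int)
  where int = decimal :: Parser Int

parse :: Parser (UV.Vector Int)
parse = UV.fromList <$> decimal `sepBy` char ','

main :: IO ()
main = do
  let input = B.intercalate "," (replicate 1000 "12345678910")
  print (parseOnly check input)
  print (UV.length <$> parseOnly parse input)
```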

The parsers are slightly different, but there isn't a 2x difference between the two for a representative grammar, which suggests the libraries are of roughly equivalent performance.

You could sit down and optimize both further, but the difference compared to attoparsec is already 10x.

My laptop:

vendor_id	: GenuineIntel
cpu family	: 6
model		: 158
model name	: Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
stepping	: 9
microcode	: 0x8e
cpu MHz		: 1899.979
cache size	: 8192 KB
(cargo bench)
check valid 1000 time: [48.064 us 48.243 us 48.432 us]
Found 8 outliers among 100 measurements (8.00%)
4 (4.00%) high mild
4 (4.00%) high severe
parse valid 1000 time: [62.988 us 63.175 us 63.397 us]
Found 14 outliers among 100 measurements (14.00%)
2 (2.00%) high mild
12 (12.00%) high severe
check valid 2000 time: [96.119 us 96.440 us 96.799 us]
Found 10 outliers among 100 measurements (10.00%)
5 (5.00%) high mild
5 (5.00%) high severe
parse valid 2000 time: [126.01 us 126.42 us 126.91 us]
Found 14 outliers among 100 measurements (14.00%)
12 (12.00%) high mild
2 (2.00%) high severe
check valid 4000 time: [192.55 us 192.99 us 193.45 us]
Found 11 outliers among 100 measurements (11.00%)
5 (5.00%) high mild
6 (6.00%) high severe
parse valid 4000 time: [252.29 us 253.30 us 254.45 us]
Found 11 outliers among 100 measurements (11.00%)
5 (5.00%) high mild
6 (6.00%) high severe
(stack bench --ba 'integer_array')
benchmarked integer_array/1000/check
time 19.35 μs (18.98 μs .. 19.68 μs)
0.998 R² (0.998 R² .. 0.999 R²)
mean 18.89 μs (18.81 μs .. 19.00 μs)
std dev 358.2 ns (270.8 ns .. 463.1 ns)
benchmarked integer_array/1000/parse
time 20.82 μs (20.68 μs .. 21.09 μs)
0.999 R² (0.998 R² .. 1.000 R²)
mean 20.94 μs (20.85 μs .. 21.11 μs)
std dev 421.2 ns (307.0 ns .. 635.7 ns)
benchmarked integer_array/2000/check
time 37.40 μs (37.09 μs .. 37.79 μs)
0.999 R² (0.997 R² .. 1.000 R²)
mean 37.46 μs (37.29 μs .. 37.70 μs)
std dev 686.5 ns (497.1 ns .. 951.5 ns)
benchmarked integer_array/2000/parse
time 50.98 μs (50.63 μs .. 51.42 μs)
0.999 R² (0.999 R² .. 1.000 R²)
mean 51.04 μs (50.83 μs .. 51.31 μs)
std dev 756.9 ns (599.4 ns .. 988.2 ns)
benchmarked integer_array/4000/check
time 75.07 μs (74.55 μs .. 75.76 μs)
0.999 R² (0.998 R² .. 1.000 R²)
mean 75.16 μs (74.82 μs .. 75.69 μs)
std dev 1.443 μs (1.084 μs .. 1.900 μs)
benchmarked integer_array/4000/parse
time 106.2 μs (105.3 μs .. 107.0 μs)
0.999 R² (0.999 R² .. 1.000 R²)
mean 104.4 μs (104.0 μs .. 105.0 μs)
std dev 1.518 μs (1.193 μs .. 2.053 μs)
(stack bench --ba 'integer_array') # using llvm
benchmarked integer_array/1000/check
time 16.67 μs (16.48 μs .. 16.88 μs)
0.999 R² (0.999 R² .. 1.000 R²)
mean 16.52 μs (16.47 μs .. 16.58 μs)
std dev 179.5 ns (140.1 ns .. 232.3 ns)
benchmarked integer_array/1000/parse
time 18.23 μs (18.12 μs .. 18.39 μs)
1.000 R² (0.999 R² .. 1.000 R²)
mean 18.33 μs (18.27 μs .. 18.43 μs)
std dev 287.9 ns (205.1 ns .. 391.7 ns)
benchmarked integer_array/2000/check
time 32.96 μs (32.79 μs .. 33.11 μs)
1.000 R² (0.999 R² .. 1.000 R²)
mean 33.10 μs (32.96 μs .. 33.32 μs)
std dev 586.8 ns (386.3 ns .. 835.5 ns)
benchmarked integer_array/2000/parse
time 45.64 μs (45.25 μs .. 46.01 μs)
0.999 R² (0.999 R² .. 1.000 R²)
mean 45.99 μs (45.77 μs .. 46.28 μs)
std dev 887.7 ns (638.7 ns .. 1.194 μs)
benchmarked integer_array/4000/check
time 66.03 μs (65.59 μs .. 66.44 μs)
0.999 R² (0.998 R² .. 1.000 R²)
mean 66.46 μs (66.17 μs .. 66.97 μs)
std dev 1.257 μs (970.9 ns .. 1.638 μs)
benchmarked integer_array/4000/parse
time 93.87 μs (93.14 μs .. 94.71 μs)
1.000 R² (0.999 R² .. 1.000 R²)
mean 94.07 μs (93.76 μs .. 94.49 μs)
std dev 1.159 μs (899.0 ns .. 1.611 μs)
@liamzee commented Sep 17, 2023

The Haskell benchmarks are actually disturbing because they're not O(n); i.e., Haskell is mildly superlinear, implying that Rust pulls ahead starting somewhere around n = 30,000. Any idea what's going on?

@chrisdone (author) commented Sep 17, 2023 via email

@chrisdone (author) commented Sep 17, 2023 via email

@AndrasKovacs commented Sep 17, 2023

I do smell a bit of vector shenanigans. UV.fromList uses a growing buffer with doubling, plus the usual weird fusion that I didn't try to understand in the Core. I reran the bench here: https://github.com/AndrasKovacs/flatparse/blob/readVectorInt-example/bench/ReadVectorInt.hs

read Int vector/1000/check               mean 11.68 μs  ( +- 122.8 ns  )
read Int vector/1000/parse               mean 15.09 μs  ( +- 160.0 ns  )
read Int vector/1000/parse'              mean 18.97 μs  ( +- 305.5 ns  )
read Int vector/10000/check              mean 120.4 μs  ( +- 448.0 ns  )
read Int vector/10000/parse              mean 200.6 μs  ( +- 2.353 μs  )
read Int vector/10000/parse'             mean 195.6 μs  ( +- 1.420 μs  )
read Int vector/100000/check             mean 1.233 ms  ( +- 15.61 μs  )
read Int vector/100000/parse             mean 2.724 ms  ( +- 61.22 μs  )
read Int vector/100000/parse'            mean 2.002 ms  ( +- 27.97 μs  )
  • check is linear as before. Note that this already implies that the non-linear part must be in heap allocation or in vector.
  • parse is the old code.
  • parse' is a new version that I rewrote without UV.fromList, with direct copying into the target buffer (sketched below).

The direct-copying version is also close to perfectly linear.
Note that even if the workload "should" be linear, in general the heap-allocating benchmarks will be worse than perfectly linear, mainly because of cache effects.
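
To make the direct-copying idea concrete, here is a minimal sketch of it (my own illustration using bytestring's readInt, not the actual parse' from the linked flatparse bench): size the buffer from the separator count, then write each parsed Int straight into it.

```haskell
{-# LANGUAGE BangPatterns #-}
-- Hedged sketch of direct copying: no intermediate list, no doubling.
-- Assumes a non-empty, well-formed "n,n,n" input.
import Control.Monad.ST (runST)
import qualified Data.ByteString.Char8 as B
import qualified Data.Vector.Unboxed as UV
import qualified Data.Vector.Unboxed.Mutable as UM

parseDirect :: B.ByteString -> UV.Vector Int
parseDirect bs = runST $ do
  let n = B.count ',' bs + 1           -- elements = separators + 1
  mv <- UM.new n
  let go !i s = case B.readInt s of
        Just (x, rest) -> do
          UM.write mv i x
          go (i + 1) (B.drop 1 rest)   -- skip the ','; no-op at end of input
        Nothing -> pure ()
  go 0 bs
  UV.freeze mv
```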

@liamzee commented Sep 18, 2023

The fact that 4k takes 5 times the time of 1k implies roughly an n^1.16 algorithm, since 4^1.16 ≈ 5 (plus some constant overhead).

I'm pleasantly surprised that in the best case we can beat Rust by roughly 5x in parsing speed (especially important since both languages have a type-driven paradigm, Haskell significantly more so: "parse, don't validate"), but I'm mildly disappointed that we're around n^1.16 whereas Rust is n^(<1.01). Matching Rust's asymptotics (presumably via AndrasKovacs's rewrite) would make us decisively better performing and better able to resist criticism, and it would help justify Haskell when everyone and their dog is going "rewrite it in Rust", with the dog often producing better code :).
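
A quick way to sanity-check that exponent, assuming a pure power law t(n) ≈ c·n^a (a sketch of mine, not code from the thread):

```haskell
-- Fit the exponent a in t(n) ~ c * n^a from two measurements taken a
-- factor of 4 apart in n; the inputs below are the means quoted earlier.
exponentFit :: Double -> Double -> Double
exponentFit t1000 t4000 = logBase 4 (t4000 / t1000)

-- ghci> exponentFit 20.94 104.4   -- Haskell "parse" means
-- 1.1589...                       -- roughly n^1.16
-- ghci> exponentFit 18.89 75.16   -- Haskell "check" means
-- 0.9962...                       -- essentially linear
```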

@chrisdone (author) commented

> I do smell a bit of vector shenanigans. […] The direct-copying version is also close to perfectly linear. Note that even if the workload "should" be linear, in general the heap-allocating benchmarks will be worse than perfectly linear, mainly because of cache effects.

Yeah, that's what I meant by the allocation characteristics. I've looked at this previously: the vector fusion merges the parser into a loop that writes to a vector, but that loop allocates a large vector and then doubles it (or grows it by 1.5x, I don't remember) every time we get near the end of it. I don't have an environment set up to test it, but I wouldn't be surprised if this explained the numbers. If we graphed N from 1 to 100000 we'd probably see occasional spikes rather than steady growth; the large jumps in our N are probably hiding that.
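
As a toy model of that growth strategy (a sketch; vector's real fromList goes through fusion and differs in detail), here is a buffer that doubles whenever it fills. Copying is O(n) amortized, but each doubling copies everything written so far in one go, which is exactly the kind of thing that would show up as occasional spikes when N is swept finely.

```haskell
{-# LANGUAGE BangPatterns #-}
-- Toy doubling buffer: amortized O(n) writes, with a full copy at every
-- capacity boundary (1, 2, 4, 8, ...). Not vector's actual code.
import Control.Monad.ST (runST)
import qualified Data.Vector.Unboxed as UV
import qualified Data.Vector.Unboxed.Mutable as UM

fromListDoubling :: [Int] -> UV.Vector Int
fromListDoubling xs0 = runST $ do
  mv0 <- UM.new 1
  let go mv !used [] = UV.freeze (UM.take used mv)
      go mv !used (x:xs)
        | used == UM.length mv = do
            mv' <- UM.grow mv (UM.length mv)  -- double capacity, copying
            go mv' used (x:xs)
        | otherwise = do
            UM.write mv used x
            go mv (used + 1) xs
  go mv0 0 xs0
```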
