
@sebastienros
Last active August 29, 2021 14:06

|                    | Pipelined | Pipelined io_uring | Non-pipelined | Non-pipelined io_uring |
|--------------------|-----------|--------------------|---------------|------------------------|
| CPU (%)            | 99        | 50 (-50%)          | 97            | 48 (-50%)              |
| RPS                | 2,592,670 | 2,878,222 (+11%)   | 497,429       | 631,976 (+26%)         |
| Working set (MB)   | 79        | 81                 | 79            | 81                     |
| Latency, mean (ms) | 1.28      | 0.98               | 1.07          | 1.47                   |
| Latency, 99th (ms) | n/a       | 7.57               | 14.8          | 14.67                  |
@sitsofe commented Jun 9, 2020

The non-pipelined io_uring result looks unusual: average latency rose, yet the worst-case (99th percentile) latency fell compared to non-io_uring? Throughput also increased even though average latency was up?

@tkp1n commented Jun 9, 2020

Thank you so much, @sebastienros, for taking the time to run those benchmarks!
I had a look at your sebros/kernel branch, and it looks good to me.

I'm glad you were able to confirm my claims regarding throughput in the non-pipelined case. It is not a big surprise that the relative gain is smaller in the pipelined case (as predicted by @tmds here). My setup is apparently unable to give an accurate representation of the latency. I'll look into options to improve that.

The numbers do raise some questions, however:

  1. The decrease in CPU usage is significant. May I ask what the CPU usage of the load generators looked like during that time? I'm curious whether there was simply headroom for more connections/load on the server, or whether something was holding the server back from using its full potential. Remember that we've previously had a "usage error" leading to suboptimal CPU utilization.
  2. The increase in average latency in the non-pipelined case is also significant. The transport is currently optimized to maximize the number of I/O requests per syscall (io_uring_enter), which can increase latency. It could be that submitting to the kernel more often (earlier) would improve latency at the cost of throughput; see the sketch after this list. Let me know if you or someone else with access to Citrine is interested in exploring this.
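
To make the trade-off in point 2 concrete, here is a minimal, hypothetical sketch of the two submission strategies. The `ISubmissionQueue` interface and both method names are invented for illustration; they merely stand in for whatever wraps the submission queue and io_uring_enter in the transport.

```csharp
using System;

// Hypothetical interface standing in for whatever wraps the io_uring submission
// queue and io_uring_enter inside the transport; not part of IoUring.Transport.
public interface ISubmissionQueue
{
    void PrepareSend(int socketFd, ReadOnlyMemory<byte> buffer); // queue an SQE, no syscall
    void Submit();                                               // io_uring_enter: flush queued SQEs
}

public static class SubmissionStrategies
{
    // Throughput-oriented (the current behavior described above): queue every
    // request first, then enter the kernel once for the whole batch.
    public static void SubmitBatched(ISubmissionQueue sq, (int SocketFd, ReadOnlyMemory<byte> Buffer)[] work)
    {
        foreach (var (socketFd, buffer) in work)
            sq.PrepareSend(socketFd, buffer);
        sq.Submit(); // one syscall, but the first request waits until the last one is queued
    }

    // Latency-oriented alternative: enter the kernel after every request.
    public static void SubmitEagerly(ISubmissionQueue sq, (int SocketFd, ReadOnlyMemory<byte> Buffer)[] work)
    {
        foreach (var (socketFd, buffer) in work)
        {
            sq.PrepareSend(socketFd, buffer);
            sq.Submit(); // more io_uring_enter calls, but each request starts sooner
        }
    }
}
```

In practice the transport could also pick something in between, e.g. submit after every N prepared requests.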

Would you be able to re-run the benchmarks with traces enabled (and an eye on htop on the load generators) to help answer some of those questions and to spot potential low-hanging fruit for perf improvements?

Client scenarios:
IoUring.Transport also supports handling client (outbound) connections (via IConnectionFactory). This would allow microservice scenarios or reverse-proxies (such as YARP) to increase the number of I/O requests per syscall even further. For reverse-proxies, kernel v5.7 even added support for splice(2) via io_uring (although I'm not sure how we would best expose that via the existing connection abstractions). Are there any workloads in your set of benchmarks that include client connections via IConnectionFactory? It would be interesting to see how IoUring.Transport does for this kind of workload.
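
For readers unfamiliar with the client-side abstraction: IConnectionFactory (from Microsoft.AspNetCore.Connections) is what such outbound support plugs into. The sketch below only shows how a consumer such as a proxy might use it; `UpstreamClient` and `SendAsync` are illustrative names, and the DI registration of the transport's factory is assumed rather than shown.

```csharp
using System;
using System.Net;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Connections;

public class UpstreamClient
{
    private readonly IConnectionFactory _connectionFactory;

    // The io_uring transport would be registered as the IConnectionFactory
    // implementation in DI; that registration is not shown here.
    public UpstreamClient(IConnectionFactory connectionFactory)
        => _connectionFactory = connectionFactory;

    public async Task SendAsync(EndPoint upstream, ReadOnlyMemory<byte> payload)
    {
        // ConnectAsync drives the outbound connect through the transport, so
        // client-side I/O can share the same rings (and batching) as server-side I/O.
        ConnectionContext connection = await _connectionFactory.ConnectAsync(upstream);
        try
        {
            await connection.Transport.Output.WriteAsync(payload);
        }
        finally
        {
            await connection.DisposeAsync();
        }
    }
}
```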

@tmds commented Jun 9, 2020

These benchmarks were run on a 12-core machine, because the Citrine setup doesn't have the required kernel (yet).

The decrease in CPU usage is significant.

The transport defaults its ThreadCount to half the number of processors: https://github.com/tkp1n/IoUring.Transport/blob/bed647373487aac25a58de34598e2bc9251c903b/src/IoUring.Transport/IoUringOptions.cs#L9.

And ApplicationSchedulingMode is set to Inline: https://github.com/tkp1n/IoUring.Transport/blob/13e571a5d6d0e63937da2e8a0e18a9a589648bb8/tests/PlatformBenchmarks/BenchmarkConfigurationHelpers.cs#L59

This means the code runs on half of the processors, so 50% is the expected CPU load.
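
Putting those two defaults together, the effective configuration looks roughly like the sketch below. The property names (ThreadCount, ApplicationSchedulingMode) come from the linked files; the namespace import and the PipeScheduler.Inline type are assumptions on my part.

```csharp
using System;
using System.IO.Pipelines;
using IoUring.Transport;                        // assumed namespace of IoUringOptions
using Microsoft.AspNetCore.Hosting;
using Microsoft.Extensions.DependencyInjection;

public static class TransportConfiguration
{
    public static IWebHostBuilder ConfigureIoUringDefaults(this IWebHostBuilder builder) =>
        builder.ConfigureServices(services =>
            services.Configure<IoUringOptions>(options =>
            {
                // Default per the linked IoUringOptions.cs: one transport thread
                // per two logical processors (~ one per physical core).
                options.ThreadCount = Environment.ProcessorCount / 2;

                // Per the linked BenchmarkConfigurationHelpers.cs: run application
                // code inline on the transport threads instead of dispatching it
                // (assuming ApplicationSchedulingMode is a PipeScheduler).
                options.ApplicationSchedulingMode = PipeScheduler.Inline;
            }));
}
```

With half the logical processors driving the transport and application code inlined on those threads, the ~50% CPU numbers in the table above fall out directly.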

If you increase the ThreadCount option, CPU usage will go up, but RPS will probably go down (cf. the benchmarks run in tmds/Tmds.LinuxAsync#39 (comment)).

@tkp1n commented Jun 10, 2020

I assumed that ThreadCount would be controlled by this config in the Benchmarks repo. The results make more sense now, thanks 😅.
In fact, I set the default ThreadCount to half the logical threads (~ the number of physical cores) based on the findings in the comment you've linked.

When comparing the results from tmds/Tmds.LinuxAsync#39 (comment) with the results above, we see an increase in RPS from 518,186 -> 631,976 with the update to kernel v5.7 and the code changes needed to leverage IORING_FEAT_FAST_POLL (assuming, of course, that the infrastructure hasn't changed since then). That would be as close to a "free lunch" as it gets 🚀
