Subject: My GSoC contributions to WireGuard

Hi,

over the past few months, I was a GSoC[1] student in the WireGuard project
(nominally the Linux Foundation). Together with Thomas Gschwantner and Gauvain
Roussel-Tarbouriech (who were also GSoC students), and of course Jason
Donenfeld, I tried to implement some of the performance optimizations from the
TODO list[2].

TL;DR: Here are the commits that I contributed during that time and that
were merged (not everything I wrote made it in):
- https://git.zx2c4.com/WireGuard/commit/?h=b289d122d8fe935de7138f77a8495dca76555b0e
- https://git.zx2c4.com/WireGuard/commit/?h=590c41089fd77bf43efda375c0e6bd3913cb9350
- https://git.zx2c4.com/WireGuard/commit/?h=0f8718b4fcbb89abc8b8b25c09510e1f7fac9625
- https://git.zx2c4.com/WireGuard/commit/?h=18822b84f88422f11611c598a0c23c56652693d9
- https://git.zx2c4.com/WireGuard/commit/?h=6008eacbf2c7a5f31b0c9d5d0a629cbdfbb8f222

Our first task was to change the trie (prefix tree) used in allowedips.c[3]
to look up peers based on their IP(v6) address, so that it uses native
endianness (little endian on the x86 systems prevalent today) instead of big
endian, which is how the bytes of an IP(v6) address are stored in a packet.
This saves a few byte swaps per lookup in situations with many peers,
because the number is reduced from one byte swap per edge of the trie that
is walked to one byte swap per complete traversal of the trie.
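The effect of this change can be illustrated with a small standalone sketch
(this is not the actual allowedips.c code; the function names here are made
up): the big-endian address bytes are converted to a native-endian integer
once, and every edge decision afterwards is a plain shift and mask, with no
per-edge byte swap:

```c
#include <stdint.h>

/* Hypothetical sketch of the native-endianness idea. An IPv4 address
 * arrives as four big-endian bytes; we assemble them into one
 * native-endian integer once per lookup, most significant byte first. */
static uint32_t ip_to_native(const uint8_t ip[4])
{
	return ((uint32_t)ip[0] << 24) | ((uint32_t)ip[1] << 16) |
	       ((uint32_t)ip[2] << 8)  |  (uint32_t)ip[3];
}

/* Walking the trie then only needs a shift per edge: depth 0 selects
 * the most significant bit of the address. */
static int bit_at(uint32_t key, unsigned int depth)
{
	return (key >> (31 - depth)) & 1;
}
```

With the big-endian layout, the equivalent of ip_to_native() effectively had
to happen piecewise on each edge that was walked; now it happens once per
traversal.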

We implemented and discussed a few versions of this optimization, and
finally Jason committed his[4], and a selftest to help ensure that the new
implementation is correct[5]. I noticed that the debugging code that can
print the allowedips trie in Graphviz[6] format was now wrong, and fixed[7]
it.

The next goal was to replace the spinlock-based ring buffer
implementation[8] used in WireGuard to implement the receive/transmit and
encrypt/decrypt queues with a lock-free version. For this, we looked at
ConcurrencyKit[9] and reimplemented their algorithms. However, after seeing
no clear performance improvement, and in some cases serious performance
degradation, we gave up on this goal. The code, along with a selftest[10],
can be found on the jn/mpmc-wip branch[11].
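To give an idea of what a lock-free ring looks like, here is a
much-simplified single-producer/single-consumer sketch using C11 atomics
(the WireGuard queues are multi-producer/multi-consumer, which is
considerably harder; all names here are made up):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 8 /* must be a power of two */

struct spsc_ring {
	_Atomic size_t head; /* next slot to write; only the producer advances it */
	_Atomic size_t tail; /* next slot to read; only the consumer advances it */
	void *slots[RING_SIZE];
};

/* Producer side: fails instead of blocking when the ring is full. */
static bool ring_push(struct spsc_ring *r, void *p)
{
	size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
	size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

	if (head - tail == RING_SIZE)
		return false; /* full */
	r->slots[head & (RING_SIZE - 1)] = p;
	/* release: the slot write must be visible before the new head */
	atomic_store_explicit(&r->head, head + 1, memory_order_release);
	return true;
}

/* Consumer side: returns NULL when the ring is empty. */
static void *ring_pop(struct spsc_ring *r)
{
	size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
	size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
	void *p;

	if (tail == head)
		return NULL; /* empty */
	p = r->slots[tail & (RING_SIZE - 1)];
	atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
	return p;
}
```

The MPMC case needs more than this (multiple producers must agree on who
owns a slot, typically via compare-and-swap loops), which is where designs
like ck_ring come in.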

Next, I made use[12] of the NAPI[13] infrastructure of the Linux networking
subsystem to improve the performance of the receive path, with some
success[14], but also a performance regression on some systems[15], which
Jason appears to have solved by using GRO[16] before passing packets up
through the network stack[17].

Since my integration of NAPI relied on one NAPI context per WireGuard peer,
and NAPI contexts are stored in a fixed-size hashtable in the kernel[18],
Thomas disabled NAPI busy polling, thus disabling the use of this
hashtable[19]. This avoids overloading the hashtable when there are many
WireGuard peers.

My last task was to change WireGuard's hashtables[20] from fixed-size
hashtables to the resizable hashtable implementation (called "rhashtable")
that is already in the upstream kernel[21]. This will save some memory when
few items are stored, and avoid overloading, and thus slowing down, the
hashtables when many items are stored. The main difficulty here stems from
the fact that WireGuard's pubkey_hashtable uses the SipHash function to be
better protected against hash collision, or "hashtable poisoning",
attacks[22], in which the hashed data is chosen in such a way that many
items are stored in the same hash bucket, slowing down lookups. SipHash uses
a 128-bit random seed to make the hashes less predictable, while the
existing rhashtable API only allows a 32-bit seed. Thus, to properly
integrate SipHash into rhashtable, I needed to extend the rhashtable API to
use a longer random seed. A work-in-progress version of this change can be
found in the linux-dev repo on git.zx2c4.com[22].
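To make the "hashtable poisoning" problem concrete, here is a toy
standalone example (these are not the kernel's actual hash functions, and
the keyed mix below is in no way a cryptographic substitute for SipHash):
with an unseeded hash, an attacker who knows the function can choose keys
that all land in one bucket; mixing in a secret random seed makes the
bucket of a given key unpredictable to the attacker:

```c
#include <stdint.h>

#define NBUCKETS 64

/* Unseeded hash: anyone can see that all multiples of 64 collide into
 * bucket 0, so an attacker simply submits such keys. */
static unsigned int unseeded_bucket(uint32_t key)
{
	return key % NBUCKETS;
}

/* Toy keyed hash: xors in a secret seed and folds the higher bits
 * (which the attack above relied on being discarded) back into the
 * bucket index. A real implementation would use SipHash with its
 * 128-bit key, as WireGuard's pubkey_hashtable does. */
static unsigned int seeded_bucket(uint32_t key, uint32_t seed)
{
	uint32_t h = key ^ seed;

	h ^= h >> 6;
	return h % NBUCKETS;
}
```

Without knowing the seed, the attacker can no longer predict which keys
share a bucket, so they cannot force all items into one chain.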

Conclusion:

I have not reached all the goals that I initially set out to reach. In
particular, here are some of the items that were part of my GSoC project
proposal:
- Reducing bufferbloat with CoDel/DQL. Since WireGuard temporarily stores
packets in queues, and queues can add latency when they're full, it would be
interesting to try to limit the added latency with CoDel[23] or Linux's
dynamic queue limits library (DQL)[24].
- CPU auto-scaling and NUMA awareness. Currently, WireGuard splits the
encryption/decryption workload among all CPUs[25] that are currently online.
For low packet rates it might be a good idea to stick to one CPU, to avoid
polluting the other CPUs' caches, and possibly to avoid waking the other
CPUs from sleep states. In larger systems with two or more sets of CPUs,
each with their own L3 cache and DDR memory controller, i.e. in NUMA[26]
systems, it might also be a good idea to confine WireGuard to one NUMA node,
to save bus traffic.
- Performance profiling and benchmarking. I included performance profiling
and benchmarking in my project proposal, because serious performance
optimization has to rely on benchmarks to find out which approach is
*really* the fastest, and on profiling to figure out where the bottlenecks
are. I did, however, do very little in this direction, beyond running
iperf3.
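As a rough illustration of the CoDel idea from the first item above, here is
a standalone sketch (greatly simplified, not CoDel's full algorithm; the
names and structure are made up, though 5 ms / 100 ms are CoDel's default
target and interval): each packet's sojourn time in the queue is measured,
and the queue starts dropping once the sojourn time has stayed above a small
target for a full interval:

```c
#include <stdbool.h>
#include <stdint.h>

#define TARGET_US   5000   /* CoDel's default target: 5 ms */
#define INTERVAL_US 100000 /* CoDel's default interval: 100 ms */

struct codel_state {
	uint64_t above_since_us; /* 0 = sojourn time currently below target */
};

/* Called for each dequeued packet to decide whether to drop it;
 * sojourn_us is how long the packet sat in the queue. */
static bool codel_should_drop(struct codel_state *st,
			      uint64_t now_us, uint64_t sojourn_us)
{
	if (sojourn_us < TARGET_US) {
		st->above_since_us = 0; /* latency is fine again: reset */
		return false;
	}
	if (st->above_since_us == 0)
		st->above_since_us = now_us;
	/* only drop once latency has been bad for a whole interval */
	return now_us - st->above_since_us >= INTERVAL_US;
}
```

Real CoDel additionally tightens the time between successive drops (its
control law) until the queue drains; the sketch only shows the entry
condition.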

Even though I did not do everything I intended to do, it was nice to bring
WireGuard a bit further and learn something about optimization and the Linux
networking stack. Thanks to Jason Donenfeld, the Linux Foundation, and Google
for providing this opportunity.

Jonathan Neuschäfer

[1]: https://summerofcode.withgoogle.com/
[2]: https://www.wireguard.com/todo/
[3]: https://git.zx2c4.com/WireGuard/tree/src/allowedips.c?h=0.0.20180420
[4]: https://git.zx2c4.com/WireGuard/commit/?h=5e3532eb1895ccdd0a53d14b47d7fc85234f7866
[5]: https://git.zx2c4.com/WireGuard/commit/?h=856f10593d2fc8b87992d9c5c556ddf108d313e8
[6]: http://graphviz.org/
[7]: https://git.zx2c4.com/WireGuard/commit/?h=b289d122d8fe935de7138f77a8495dca76555b0e
[8]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/ptr_ring.h?h=v4.18-rc8
[9]: https://github.com/concurrencykit/ck/blob/0.6.0/include/ck_ring.h
[10]: https://git.zx2c4.com/WireGuard/tree/src/selftest/mpmc_ring.h?h=jn/mpmc-wip
[11]: https://git.zx2c4.com/WireGuard/log/?h=jn/mpmc-wip
[12]: https://git.zx2c4.com/WireGuard/commit/?h=6008eacbf2c7a5f31b0c9d5d0a629cbdfbb8f222
[13]: https://en.wikipedia.org/wiki/New_API
[14]: https://lists.zx2c4.com/pipermail/wireguard/2018-July/003115.html
[15]: https://lists.zx2c4.com/pipermail/wireguard/2018-July/003124.html
[16]: https://lwn.net/Articles/358910/
[17]: https://git.zx2c4.com/WireGuard/commit/?h=95951af7249912a4356b9a03cf3addc7e3f8f724
[18]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/core/dev.c?h=v4.18-rc8#n193
[19]: https://git.zx2c4.com/WireGuard/commit/?h=fe5f0f661797b41648eac64b40e5038b25175047
[20]: https://git.zx2c4.com/WireGuard/tree/src/hashtables.h?h=0.0.20180802
[21]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/rhashtable.h?h=v4.18-rc8
[22]: https://git.zx2c4.com/linux-dev/log/?h=jn/rhashtable-next&id=6f450b7f8f722dec835fe96ff5b99be98cd5622e
[23]: https://en.wikipedia.org/wiki/CoDel
[24]: https://lwn.net/Articles/454390/
[25]: https://git.zx2c4.com/WireGuard/tree/src/queueing.h?h=0.0.20180802#n119
[26]: https://en.wikipedia.org/wiki/Non-uniform_memory_access