WireGuard GSoC 2018 Report
Hi there, my name is Thomas Gschwantner, and this is my report on the work I've done on WireGuard for GSoC 2018.
Before GSoC even properly started, Jason asked us to work on a small fix regarding the endianness of the trie used to store IPs: they'd be stored in network byte order (big endian), meaning that on little-endian systems there would be unnecessary conversions. The main point of this was probably to get everyone up and running in terms of setting up and working with the WireGuard codebase.
The next task was developing and testing a lock-free replacement for the ring buffer implementation in the kernel (ptr_ring.h). For that we consulted a userspace implementation called Concurrency Kit. Since the task was rather hard but would result in relatively little code, Jason suggested that Jonathan Neuschäfer and I work on it separately at first and merge the code later, which we did. My own version, from before any merging happened, can be found here.
The task proved challenging because the Concurrency Kit implementation makes heavy use of macros and uses different semantics for atomic operations and memory barriers than the Linux kernel. Another problem we ran into was kernel deadlocks that only happened in specific situations while testing. We ultimately determined that they were caused by unfortunate scheduling of the kthreads, which left concurrent producers stuck waiting for each other for long periods. This was solved by disabling preemption.
While the final implementation beat the existing one in direct comparisons of raw produce/consume performance, when benchmarking WireGuard as a whole it performed the same or slightly worse. The reason for this is likely the way WireGuard processes packets, consuming items from the queue one by one. As a result, the final version has not been merged yet. This may change in the future, however, if the consumers can be made to run multithreaded.
Next we worked on making WireGuard take advantage of NAPI, a kernel-internal API designed to reduce the overhead of receiving packets. I worked on this for some time, but Jonathan ended up being faster than me with his solution. My (naturally unfinished) version can be found here. This task involved a lot of reading kernel code to figure out how NAPI works, since the available documentation is either outdated or simply nonexistent. I also ended up doing a lot of benchmarking, since many of the solutions we came up with had unfavourable performance characteristics.
While working on this we also bumped into an interesting problem with NAPI involving napi_hash, a hashtable used by busy polling. Because we used one napi_struct per WireGuard peer, there was concern that this hashtable would blow up in larger deployments. After much research, I ended up solving this problem in a deceptively simple commit.
Lastly, I worked on implementing a new socket option that would cause all sk_buffs to be securely zeroed out when freed. While I got a simple version of this working very quickly, I could only make it work for AF_INET, but not for AF_ALG and other netlink sockets, which I assumed would be the primary users of this option. The reason, as I found out after much debugging and digging in crypto/, was that the corresponding kernel code wasn't even using sk_buffs. Instead, the code would lock the socket directly and then use memcpy_to_msg to copy the data to userspace. The code for this can be found here; however, it is not quite finished yet, as I haven't been able to find a more general way of setting the zero_on_free bit on sk_buffs for all socket types.
Working on WireGuard was definitely a ton of fun, and I also learned a lot about kernel development. Thanks a lot to Jason A. Donenfeld for providing mentorship, and for his work on WireGuard in general, which made all of this possible.