This is a collection of random notes about FreeBSD's in-kernel TLS (KTLS) implementation, based on reading the source code so far. The focus is TLS for NFS/RPC.
- supports TLSv1.0 to TLSv1.3
- development time frames:
- first FreeBSD commit (b2e60773) 27-Aug-2019, this might be just the "upstreaming" work
- development ongoing (March 2021)
- TCP only
- no rekeying so far (as of April 2021)
KERN_TLS
The KERN_TLS define wraps the functionality.
The share/man/man4/ktls.4 man page outlines the basic concept: the initial handshake for a socket using TLS is performed in userland. Once the session keys are negotiated, they are provided to the kernel via the TCP_TXTLS_ENABLE and TCP_RXTLS_ENABLE socket options. Both socket options accept a struct tls_so_enable structure as their argument. The members of this structure describe the cipher suite used for the TLS session and provide the session keys used for the respective direction.
- the TX/RX split was done because of the Netflix data handling (https://youtu.be/la-ljVavd3c?t=943)
- mostly static content is sent to the client, the RX traffic is mostly TCP ACKs
- also:
A given socket may use different modes for transmit and receive, or a socket may only offload a single direction.
- modes meaning HW/SW.
- it also highlights the limitations:
only permits the session keys to be set once in each direction. As a result, applications must disable rekeying when using ktls.
- TODO: check session resumption
- the app can alter the behavior a bit:
Most data is transmitted in application layer TLS records, and the kernel chooses how to partition data among TLS records. Individual TLS records with a fixed length and record type can be sent by sendmsg(2) with the TLS record type set in a TLS_SET_RECORD_TYPE control message. The payload of this control message is a single byte holding the desired TLS record type. This can be used to send TLS records with a type other than application data (for example, handshake messages) or to send application data records with specific contents (for example, empty fragments).
- receive needs to be done using recvmsg(2)
Once TLS receive is enabled by a successful set of the TCP_RXTLS_ENABLE socket option, all data read from the socket is returned as decrypted TLS records. Each received TLS record must be read from the socket using recvmsg(2). Each received TLS record will contain a TLS_GET_RECORD control message along with the decrypted payload. The control message contains a struct tls_get_record which includes fields from the TLS record header. If an invalid or corrupted TLS record is received, recvmsg(2) will fail with one of the following errors...
- TODO: what happens if the app calls read(2) on the socket?
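The TLS_SET_RECORD_TYPE control message described above can be sketched in userland code. This is a hedged illustration: the cmsg layout follows the man page (a single byte holding the desired record type), but the numeric value used for TLS_SET_RECORD_TYPE below is a placeholder assumption - on FreeBSD the real constant comes from <netinet/tcp.h>:

```c
/* Sketch: building the one-byte TLS_SET_RECORD_TYPE control message that
 * ktls.4 describes for sendmsg(2).  Placeholder constants are used so the
 * cmsg layout can be shown portably; on FreeBSD take them from headers. */
#include <assert.h>
#include <string.h>
#include <sys/socket.h>

#ifndef IPPROTO_TCP
#define IPPROTO_TCP 6
#endif
#ifndef TLS_SET_RECORD_TYPE
#define TLS_SET_RECORD_TYPE 1   /* placeholder; see <netinet/tcp.h> on FreeBSD */
#endif

#define TLS_RLTYPE_HANDSHAKE 22 /* TLS ContentType "handshake" */

/* Fill in msghdr/cmsghdr so that sendmsg(sock, &msg, 0) would send `payload`
 * as a single TLS record of the given type. */
static void build_tls_record_msg(struct msghdr *msg, struct iovec *iov,
                                 void *payload, size_t len,
                                 char *cbuf, size_t cbuflen,
                                 unsigned char rec_type)
{
    memset(msg, 0, sizeof(*msg));
    iov->iov_base = payload;
    iov->iov_len = len;
    msg->msg_iov = iov;
    msg->msg_iovlen = 1;
    msg->msg_control = cbuf;
    msg->msg_controllen = cbuflen;

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(msg);
    cmsg->cmsg_level = IPPROTO_TCP;
    cmsg->cmsg_type = TLS_SET_RECORD_TYPE;
    cmsg->cmsg_len = CMSG_LEN(sizeof(unsigned char));
    *CMSG_DATA(cmsg) = rec_type;  /* payload is a single byte: the record type */
    msg->msg_controllen = CMSG_SPACE(sizeof(unsigned char));
}
```

This is how an application could send e.g. a handshake record through a KTLS socket instead of application data.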
- In case of SW mode the encryption/decryption is done using crypto(9):
The base system includes a software backend for the TCP_TLS_MODE_SW mode which uses crypto(9) to encrypt and decrypt TLS records. This backend can be enabled by loading the ktls_ocf.ko kernel module.
- the OCF glue for KTLS seems to be produced by Netflix (the respective source file has Netflix copyright)
- TLS encryption/decryption can be done in software (TCP_TLS_MODE_SW mode) or hardware (NICs with TLS HW support)
  - e.g. some Mellanox NICs have TLS (encryption/decryption/authentication) offload
  - sys/dev/mlx5/mlx5_en/mlx5_en_hw_tls.c (or rather the NIC supported by the driver) supports only TLSv1.2, TLSv1.3 and the AES GCM NIST cipher suite
  - the whole file is ifdef'd KERN_TLS
  - TODO: how is the encryption key set up? is it per connection?
  - the mbufs have TLS send tags that are passed to the NIC. The tag is probably associated with the key.
sys/opencrypto/ktls_ocf.c is the OCF glue for SW crypto operations
- TX/RX split:
- can enable offload for send, receive or both
- send side can use either sendfile(2) or write(2)
- receive side has to always use recvmsg(2)
- i.e. not transparent to the application
sys/kern/uipc_ktls.c: main KTLS file with functions for processing TLS records
- ktls_work_thread() - infinite loop to process items in a queue, calls ktls_encrypt() and then ktls_decrypt()
- ktls_encrypt()/ktls_decrypt() - encrypts/decrypts data, expects the TLS headers to be present
- ktls_frame() - adds the TLS header
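What the framing step produces can be pictured with the standard 5-byte TLS record header layout (content type, legacy protocol version, big-endian payload length). A minimal sketch of just that header layout, not the kernel's ktls_frame() code:

```c
/* Sketch: the 5-byte TLS record header that framing prepends to a payload.
 * Illustration of the on-the-wire layout only, not kernel code. */
#include <assert.h>
#include <stdint.h>

#define TLS_HEADER_LEN 5
#define TLS_RLTYPE_APP 23   /* TLS ContentType "application_data" */
#define TLS_V12 0x0303      /* legacy_record_version used by TLS 1.2/1.3 */

/* Write the record header: type (1 byte), version (2 bytes, big-endian),
 * payload length (2 bytes, big-endian). */
static void tls_frame_header(uint8_t hdr[TLS_HEADER_LEN],
                             uint8_t type, uint16_t version, uint16_t len)
{
    hdr[0] = type;
    hdr[1] = (uint8_t)(version >> 8);
    hdr[2] = (uint8_t)(version & 0xff);
    hdr[3] = (uint8_t)(len >> 8);
    hdr[4] = (uint8_t)(len & 0xff);
}
```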
- worker threads:
- TODO: per CPU
- the work is partly asynchronous: https://youtu.be/JdgOQi6lX5M?t=1438
- due to the worker threads
- if the app exits before the records in the socket buffer are completed/decrypted, the pending mbufs in the socket buffer get freed, which could be a problem for the worker threads that still want to process them.
The mbuf structure was modified to hold KTLS data. Notably, it holds the struct ktls_session *m_epg_tls pointer. This is used e.g. in tcp_m_copym() to check during the mbuf chain traversal to: "Avoid mixing TLS records with handshake data or TLS records from different sessions".
The KERN_TLS code seems to be spread across lots of places in the networking code. There is no framework.
- 2 separate mbuf chains: one for the unprocessed data (incomplete TLS records), another for TLS records to be processed
- done in cset 3c0e5685, explanation:
Initially I tried to make this work by marking incoming mbufs as
M_NOTREADY, but there didn't seemed to be a non-gross way to deal with
picking a portion of the mbuf chain and turning it into a new record
in the socket buffer after decrypting the TLS record it contained
(along with prepending a control message). Also, such mbufs would
also need to be "pinned" in some way while they are being decrypted
such that a concurrent sbcut() wouldn't free them out from under the
thread performing decryption.
As such, I settled on the following solution:
- Socket buffers now contain an additional chain of mbufs (sb_mtls,
sb_mtlstail, and sb_tlscc) containing encrypted mbufs appended by
the protocol layer. These mbufs are still marked M_NOTREADY, but
soreceive*() generally don't know about them (except that they will
block waiting for data to be decrypted for a blocking read).
- Each time a new mbuf is appended to this TLS mbuf chain, the socket
buffer peeks at the TLS record header at the head of the chain to
determine the encrypted record's length. If enough data is queued
for the TLS record, the socket is placed on a per-CPU TLS workqueue
(reusing the existing KTLS workqueues and worker threads).
- The worker thread loops over the TLS mbuf chain decrypting records
until it runs out of data. Each record is detached from the TLS
mbuf chain while it is being decrypted to keep the mbufs "pinned".
However, a new sb_dtlscc field tracks the character count of the
detached record and sbcut()/sbdrop() is updated to account for the
detached record. After the record is decrypted, the worker thread
first checks to see if sbcut() dropped the record. If so, it is
freed (can happen when a socket is closed with pending data).
Otherwise, the header and trailer are stripped from the original
mbufs, a control message is created holding the decrypted TLS
header, and the decrypted TLS record is appended to the "normal"
socket buffer chain.
(Side note: the SBCHECK() infrastucture was very useful as I was
able to add assertions there about the TLS chain that caught several
bugs during development.)
ktls_decrypt() peeks into the sb_tlscc socket queue to see if there is a TLS record to be processed
- if yes, calls ktls_detach_record() to extract the mbuf
- then decrypts
- appends a control mbuf (with version information, length)
- done in a loop while there are complete records to be processed
The comment above sb_mark_notready() explains how this works:
* To manage not-yet-decrypted data for KTLS RX, the following scheme
* is used:
*
* - A single chain of NOTREADY mbufs is hung off of sb_mtls.
*
* - ktls_check_rx checks this chain of mbufs reading the TLS header
* from the first mbuf. Once all of the data for that TLS record is
* queued, the socket is queued to a worker thread.
*
* - The worker thread calls ktls_decrypt to decrypt TLS records in
* the TLS chain. Each TLS record is detached from the TLS chain,
* decrypted, and inserted into the regular socket buffer chain as
* record starting with a control message holding the TLS header and
* a chain of mbufs holding the encrypted data.
The last sentence probably wanted to say "decrypted data".
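The "peeks at the TLS record header ... to determine the encrypted record's length" step maps directly onto the record header layout: byte 0 is the content type, bytes 1-2 the legacy version, bytes 3-4 the big-endian payload length. A minimal portable sketch of that check (illustration only, not the actual ktls_check_rx() code):

```c
/* Sketch: decide whether a complete TLS record has been queued, by peeking
 * at the 5-byte record header at the head of the buffered data. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TLS_HEADER_LEN 5  /* content type (1) + version (2) + length (2) */

/* Returns 1 when the whole record is available in `avail` buffered bytes;
 * stores the total on-the-wire record length in *total once the header is
 * complete. */
static int tls_record_complete(const uint8_t *buf, size_t avail, size_t *total)
{
    if (avail < TLS_HEADER_LEN)
        return 0;                                 /* header not complete yet */
    uint16_t payload = ((uint16_t)buf[3] << 8) | buf[4];  /* big-endian */
    *total = TLS_HEADER_LEN + (size_t)payload;    /* header + encrypted payload */
    return avail >= *total;
}
```

Once this returns 1, the socket would be queued to a worker thread for decryption.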
ktls_check_rx() is called e.g. from sbappend_ktls_rx() that is called from sbappendstream_locked(), which is called from various places in the TCP stack / drivers.
- KTLS has a notion of "crypto backends" (struct ktls_crypto_backend) so it does not have to go solely through OCF
  - OCF: the ktls_ocf kmod
  - can use other kmods (not plugging into OCF)
    - the kcf_isa-l kmod (Intel ISA-L library) is available as a port. That probably bypasses OCF.
    - used by Netflix in production
- there is also the AES-NI kernel driver
  - part of OCF: sys/crypto/aesni/aesni.c
  - plus SHA1/SHA256 Ryzen/Intel acceleration
- ChaCha20/Poly1305 (a SHOULD in the TLSv1.3 spec) support for KTLS OCF: https://reviews.freebsd.org/D27841
OpenSSL 3.x has a bunch of modifications to support KTLS. The code is mostly in ssl/ktls.c and include/internal/ktls.h.
Defines: OPENSSL_NO_KTLS, OPENSSL_KTLS_TLS13
Note: the FreeBSD main repository currently (February 2022) bundles OpenSSL 1.1.1 under crypto/openssl. There are patches merged there from OpenSSL upstream, particularly to ktls.c.
The KTLS in OpenSSL has 2 flavors - Linux and FreeBSD (ifdef'd). They mostly look the same, however on Linux the kernel is signaled twice - once that the socket uses KTLS, and again once the keys are determined.
Once the handshake arrives at the "master secret", it signals this together with the session keys to the kernel using the SOL_TLS setsockopt. This is done via the BIO_CTRL_SET_KTLS BIO control, which calls ktls_start(). This is called via BIO_set_ktls() (that passes a pointer to the ktls_crypto_info_t structure that was populated with the key info/material in ktls_configure_crypto() shortly before) e.g. at the end of tls13_change_cipher_state() in case of TLSv1.3.
The read side is done by wrapping recvmsg() via ktls_read_record() which is called from sock_read() or conn_read().
After the ktls_start() call is done, all the TLS processing should be done in the kernel. This actually includes messages sent/received during the rest of the handshake. For TLSv1.3, the "ChangeCipherSpec" message (which is not actually needed for TLSv1.3 and is only included for compatibility) is the last TLS message sent by the server that has its content type visible (the rest of the TLS messages go disguised as Application Data). OpenSSL expects that ktls_read_record() returns a buffer with a TLS header, however the rest is decrypted and the hidden content type byte removed. Specifically, ssl3_get_record() has the skip_decryption label for this purpose.
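The "hidden content type byte" is the TLSv1.3 TLSInnerPlaintext mechanism: the decrypted payload is content || real content type || zero padding, so the receiver strips trailing zeros and the last non-zero byte is the real type. A minimal sketch of that stripping (illustration only, not OpenSSL's ssl3_get_record() code):

```c
/* Sketch: recover the real content type from a decrypted TLSv1.3
 * TLSInnerPlaintext by removing zero padding and the trailing type byte. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns the real content type (or -1 if the record is all zeros, which
 * is malformed) and stores the length of the actual content. */
static int tls13_strip_inner_type(const uint8_t *pt, size_t len,
                                  size_t *content_len)
{
    while (len > 0 && pt[len - 1] == 0)
        len--;                        /* drop zero padding */
    if (len == 0)
        return -1;                    /* no content type byte: malformed */
    *content_len = len - 1;           /* everything before the type byte */
    return pt[len - 1];
}
```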
KTLS is supported by both s_client and s_server via SSL_sendfile(). SSL_sendfile() only works if KTLS is enabled/present - ktls_sendfile() calls the sendfile syscall.
It seems that KTLS was originally enabled/used automatically if the cipher suite (checked by ktls_check_supported_cipher()) and other parameters (no padding, fragment size equal to the maximum) allowed it. This changed in OpenSSL Git changeset a3a54179 and now KTLS has to be explicitly enabled using the SSL_OP_ENABLE_KTLS SSL_CTX flag. Relevant parts of the BIO layer use BIO_get_ktls_send()/BIO_get_ktls_recv(). These basically check the BIO flags set by BIO_set_ktls().
The exports(5) man page defines the format of the /etc/exports server side file that defines mount points to be used by NFS clients and their options. For NFSv4 it specifies 3 TLS related options:
- tls - requires that the client use TLS
- tlscert - requires that the client use TLS and provide a verifiable X.509 certificate during the TLS handshake
- tlscertuser - requires that the client use TLS and provide a verifiable X.509 certificate. The otherName component of the certificate's subjAltName must have an OID of 1.3.6.1.4.1.2238.1.1.1 and a UTF8 string of the form user@domain that will be translated to the credentials of the specified user in the same manner as nfsuserd
Also, by default TLS is optional:
If none of these three flags are specified, TLS mounts are permitted but not required.
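As an illustration, hypothetical /etc/exports entries using these options might look like the following (the paths, network, and host name are made up for the example):

```
# TLS required for this export
/export/media -tls -network 192.168.1.0 -mask 255.255.255.0
# TLS plus a verifiable client X.509 certificate required
/export/private -tlscert client.example.com
```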
The exports(5) man page mentions that NFSv4 does not use the mount protocol.
The TLS options correspond to the mountd flags: MNT_EXTLS, MNT_EXTLSCERT, MNT_EXTLSCERTUSER.
mountd parses the exports file and pushes the configuration to the kernel using the nmount syscall.
In the kernel the flags are converted to ND_EXTLS, ND_EXTLSCERT, ND_EXTLSCERTUSER (nfs.h). These flags are then used in the sys/fs/nfsserver/ code. The presence of the ND_TLS* flags implies the use of external mbufs (like for sendfile), e.g. nfsrvd_readdir() sets the ND_EXTPG flag in the nfsrv_descript structure (describing each request to the NFS server).
A server export can have a non-default certificate. This is done via the tlscertname NFS mount option.
For the client side it begins in the nfs_mount system call that eventually calls mountnfs(). The system call handling in nfs_mount() checks various options (specified in the nfs_opts array). The "tls" and "tlscertname" options are relevant for KTLS. These can be specified as mount options; the mount_nfs(8) man page documents them:
- tls - This option specifies that the connection to the server must use TLS
- tlscertname - This option specifies the name of an alternate certificate to be presented to the NFS server during the TLS handshake. The default certificate file names are cert.pem and certkey.pem. When this option is specified, name replaces cert in the above file names. For example, if the value of name is specified as other, the certificate file names to be used will be other.pem and otherkey.pem. These files are stored in /etc/rpc.tlsclntd by default. This option is only meaningful when used with the tls option and when rpc.tlsclntd(8) is running with the -m command line flag set.
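Putting the two options together, a hypothetical TLS NFS mount with an alternate certificate name could look like this (server name and paths are made up; assumes rpc.tlsclntd is running with -m):

```
# would use /etc/rpc.tlsclntd/other.pem and otherkey.pem on the client
mount -t nfs -o nfsv4,tls,tlscertname=other server.example.com:/export /mnt
```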
In the course of processing the nfs_mount system call, newnfs_connect() is called and it will set the CLIENT RPC specific options from the options inside struct nfsmount, specifically the certificate name:
if (NFSHASTLS(nmp)) {
CLNT_CONTROL(client, CLSET_TLS, &one);
if (nmp->nm_tlscertname != NULL)
CLNT_CONTROL(client, CLSET_TLSCERTNAME,
nmp->nm_tlscertname);
}
These options are handled in clnt_reconnect_control().
There is also clnt_vc_control() that handles various requests. For the CLSET_TLS request, it starts a kernel thread to handle upcalls.
Also, closing of the TLS session has special aspects - it should be done by the userland daemon. The RPCTLS_SYSC_CLSOCKET syscall handling has:
/*
* Set ssl refno so that clnt_vc_destroy() will
* not close the socket and will leave that for
* the daemon to do.
*/
This probably needs to be done because of close_notify etc.
rpc.tlsservd(8) man page:
program provides support for the server side of the kernel Sun RPC over TLS implementation. This daemon must be running to allow the kernel RPC to perform the TLS handshake after a TCP client has sent the STARTTLS Null RPC request to the server.
According to a source code comment, it originated as a copy of the gssd source code. It runs an infinite select loop, handling requests.
/*
* We provide an RPC service on a local-domain socket. The
* kernel rpctls code will upcall to this daemon to do the initial
* TLS handshake.
*/
The rpc.tlsservd daemon manages the TLS connections (a list of ssl_entry structures that contain the OpenSSL structures SSL and X509 as members). The socket is retrieved from the kernel using the rpctls_syscall first. Then the TLS connection is set up on this socket in rpctls_server() (calls SSL_accept(), verifies certificates, verifies that KTLS is working for the connection by calling BIO_get_ktls_send()).
The rpctls_syscall has multiple operation values.
The daemon also provides certificate management tasks. Upon SIGHUP it reloads the CRL and terminates any extant connections whose corresponding certificate was revoked.
sys/rpc/rpcsec_tls/ contains the kernel side code + XDR definitions for the RPC used for the upcalls.
There is a number of upcalls defined in rpctls_impl.c:
- rpctls_connect() - performs the "connect" upcall
  - called from clnt_reconnect_connect() - mainline RPC client code
  - performs the AUTH_TLS NULL RPC first (using the STARTTLS data, see below)
  - the upcalls are serialized
  - the upcall is sent to the rpc.tlsclntd daemon
    - the XDR names have the rpctlscd prefix and there are numerous source code references - perhaps the old/original name?
- rpctls_cl_handlerecord() - handle a non application-data record in the client daemon
- rpctls_srv_handlerecord() - ditto for the server daemon
- rpctls_srv_disconnect() - perform disconnect in the server daemon
- rpctls_server() - get a new server TLS socket
  - serialized like the connect
  - rpctls_server_s is a global variable. Perhaps there can only be one such request happening?
It seems there can be only one TLS connection setup (either client or server side) happening at given moment.
Life cycle for the server TLS socket:
- kernel sets the socket in a global variable
- kernel performs the upcall
- tlsservd gets the upcall, retrieves the socket using the rpctls_syscall
- tlsservd performs the TLS accept in rpctls_server(), adds the SSL structure to the list if successful, returns
- rpctls_server() in the kernel (same name as the function in tlsservd) sets the global variables to NULL
The sys_rpctls_syscall() provides file descriptor allocation for the userland daemons - RPCTLS_SYSC_CLSOCKET, RPCTLS_SYSC_SRVSOCKET. These call falloc() to allocate a new descriptor and associate it with the pre-existing socket - the socket is created in the kernel first for the RPC.
The overall design is described in the commit message of https://github.com/freebsd/freebsd-src/commit/ab0c29af0512df1e40c30f1b361da7803594336e#diff-2bbd30a05d28da3c007b59a10096aa5bbedd88ae9b3afcad52c659a7b448b9f7
Some interesting tidbits:
The upcalls to the daemons use three fields to uniquely identify the TCP connection. They are the time.tv_sec, time.tv_usec of the connection establshment, plus a 64bit sequence number. The time fields avoid problems with re-use of the sequence number after a daemon restart. For the server side, once a Null RPC with AUTH_TLS is received, kernel reception on the socket is blocked and an upcall to the rpctlssd(8) daemon is done to perform the TLS handshake. Upon completion, the completion status of the handshake is stored in xp_tls as flag bits and the reply to the Null RPC is sent. For the client, if CLSET_TLS has been set, a new TCP connection will send the Null RPC with AUTH_TLS to initiate the handshake. The client kernel RPC code will then block kernel I/O on the socket and do an upcall to the rpctlscd(8) daemon to perform the handshake. If the upcall is successful, ct_rcvstate will be maintained to indicate if/when an upcall is being done.
When the socket is being shut down, upcalls are done to the daemons, so that they can perform SSL_shutdown() calls to perform the "peer reset".
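The three identification fields from the commit message can be pictured as a small struct; this is purely an illustration of the idea (establishment time plus a 64-bit sequence number, so sequence numbers reused after a daemon restart cannot collide), not the kernel's actual data layout:

```c
/* Illustration only: a connection identity made of the establishment time
 * plus a 64-bit sequence number, as described in the commit message. */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct rpctls_conn_id {
    int64_t  sec;   /* time.tv_sec at connection establishment */
    int64_t  usec;  /* time.tv_usec at connection establishment */
    uint64_t seq;   /* 64-bit sequence number */
};

/* Two upcalls refer to the same TCP connection only if all three match. */
static bool conn_id_equal(const struct rpctls_conn_id *a,
                          const struct rpctls_conn_id *b)
{
    return a->sec == b->sec && a->usec == b->usec && a->seq == b->seq;
}
```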
Also, this change prohibits the unload of the krpc kernel module because there might be rpctls syscalls in progress.
Once the TLS support is discovered on the server side using NULL RPC call (specifically "the NULL RPC with authentication flavor of AUTH_TLS"), the same connection has to be reused as noted in the 4.1. Discovering Server-side TLS Support section of https://tools.ietf.org/html/draft-ietf-nfsv4-rpc-tls-11:
The RPC server signals its corresponding support for RPC-over-TLS by replying with a reply_stat of MSG_ACCEPTED and an AUTH_NONE verifier containing the "STARTTLS" token. The client SHOULD proceed with TLS session establishment, even if the Reply's accept_stat is not SUCCESS. If the AUTH_TLS probe was done via TCP, the RPC client MUST send the "ClientHello" message on the same connection. If the AUTH_TLS probe was done via UDP, the RPC client MUST send the "ClientHello" message to the same UDP destination port.
The client uses rpctls_impl.c#rpctls_connect() which performs the NULL RPC call (this is a special procedure that "takes null arguments and returns them").
_svcauth_rpcsec_tls()
is the server side:
- receives the NULL RPC request
- responds to the client with AUTH_TLS
- disable reception on the "krpc"
- performs an upcall to the rpc.tlsservd daemon
daemon - enable reception on the "krpc"
The STARTTLS identifier is used in sys/rpc/rpcsec_tls/auth_tls.c and defined as RPCTLS_START_STRING.
usr.sbin/rpc.tlsclntd/ is the "Sun RPC over TLS Client Daemon" - rpc.tlsclntd(8)
- performs connection and related handling (certificates for mutual auth)
- manages a list of SSL structures
usr.sbin/rpc.tlsservd/ is the "Sun RPC over TLS Server Daemon" - rpc.tlsservd(8)
- performs the initial TLS handshake when triggered by the upcall from the rpctls code in the kernel
- certificate handling
- uses a local domain socket
- manages a list of SSL structures