vladak/freebsd-ktls.md

## freebsd-ktls.md

      
            
          
This is a collection of random notes about FreeBSD's in-kernel TLS (KTLS) implementation, based on reading the source code (so far). The focus is TLS for NFS/RPC.

## FreeBSD in-kernel TLS


supports TLSv1.0 to TLSv1.3
development time frames:

first FreeBSD commit (b2e60773) 27-Aug-2019, this might be just the "upstreaming" work
development ongoing (March 2021)


TCP only
no rekeying so far (as of April 2021)
KERN_TLS define wraps the functionality
share/man/man4/ktls.4 man page outlines the basic concept:

the initial handshake for a socket using TLS is performed in userland.
Once the session keys are negotiated, they are provided to the kernel via the TCP_TXTLS_ENABLE
and TCP_RXTLS_ENABLE socket options. Both socket options accept a struct tls_so_enable structure as their argument.
The members of this structure describe the cipher suite used for the TLS session and provide the session keys
used for the respective direction.


the TX/RX split was done because of the Netflix data handling (https://youtu.be/la-ljVavd3c?t=943)

mostly static content is sent to the client, the RX traffic is mostly TCP ACKs


also:

A given socket may use different modes for transmit and receive, or a socket may only offload a single direction.


modes meaning HW/SW.


it also highlights the limitations:

only permits the session keys to be set once in each direction. As a result, applications must disable rekeying when using ktls.


TODO: check session resumption


the app can alter the behavior a bit:

Most data is transmitted in application layer TLS records, and the kernel chooses how to partition data among TLS records.
Individual TLS records with a fixed length and record type can be sent by sendmsg(2) with the TLS record type set in a TLS_SET_RECORD_TYPE control message.
The payload of this control message is a single byte holding the desired  TLS record type.
This can be used to send TLS records with a type other than application data (for example, handshake messages) or to send application data records with specific contents (for example, empty fragments).
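The control message described above could be built roughly as follows. This is only a sketch: the helper name `prepare_tls_record()` and the `record_type_cmsg` union are made up for illustration, and the fallback value of `TLS_SET_RECORD_TYPE` is an assumption that only exists so the snippet compiles on systems without the FreeBSD headers.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/* Provided by the FreeBSD system headers; the fallback value is an
 * assumption used only to keep the sketch compilable elsewhere. */
#ifndef TLS_SET_RECORD_TYPE
#define TLS_SET_RECORD_TYPE 1
#endif

/* Aligned storage for one control message with a single byte payload. */
union record_type_cmsg {
	struct cmsghdr hdr;
	char buf[CMSG_SPACE(sizeof(uint8_t))];
};

/*
 * Fill in msg so that a later sendmsg(2) sends data as one TLS record
 * of the given type (hypothetical helper, not part of any library).
 */
void
prepare_tls_record(struct msghdr *msg, struct iovec *iov,
    union record_type_cmsg *ctl, uint8_t record_type, void *data, size_t len)
{
	struct cmsghdr *cmsg;

	memset(msg, 0, sizeof(*msg));
	memset(ctl, 0, sizeof(*ctl));
	iov->iov_base = data;
	iov->iov_len = len;
	msg->msg_iov = iov;
	msg->msg_iovlen = 1;
	msg->msg_control = ctl->buf;
	msg->msg_controllen = sizeof(ctl->buf);

	cmsg = CMSG_FIRSTHDR(msg);
	assert(cmsg != NULL);
	cmsg->cmsg_level = IPPROTO_TCP;
	cmsg->cmsg_type = TLS_SET_RECORD_TYPE;
	cmsg->cmsg_len = CMSG_LEN(sizeof(uint8_t));
	/* The payload is the single byte holding the TLS record type. */
	*(uint8_t *)CMSG_DATA(cmsg) = record_type;
}
```

The prepared `msg` would then be passed to sendmsg(2) on a socket that already has TCP_TXTLS_ENABLE set, e.g. with record type 21 (alert) or 22 (handshake).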


receive needs to be done using recvmsg(2)

Once TLS receive is enabled by a successful set of the TCP_RXTLS_ENABLE socket option,
all data read from the socket is returned as decrypted TLS records.
Each received TLS record must be read from the socket using recvmsg(2).
Each received TLS record will contain a TLS_GET_RECORD control message along with the decrypted payload.
The control message contains a struct tls_get_record which includes fields from the TLS record header.
If an invalid or corrupted TLS record is received, recvmsg(2) will fail with one of the following errors...
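On the receive side, the application has to dig the TLS_GET_RECORD control message out of what recvmsg(2) returned. A minimal sketch of that lookup follows; `get_tls_record_header()` is a hypothetical helper, and the fallback definitions of `TLS_GET_RECORD` and `struct tls_get_record` are assumed to mirror the FreeBSD ones and are only used on other systems.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#ifdef __FreeBSD__
#include <sys/ktls.h>
#endif

#ifndef __FreeBSD__
/* Assumed stand-ins for the FreeBSD definitions (illustration only). */
#ifndef TLS_GET_RECORD
#define TLS_GET_RECORD 2
#endif
struct tls_get_record {
	uint8_t  tls_type;	/* TLS record type */
	uint8_t  tls_vmajor;	/* TLS version, major */
	uint8_t  tls_vminor;	/* TLS version, minor */
	uint16_t tls_length;	/* payload length (assumed network order) */
};
#endif

/*
 * Walk the control messages of a msghdr filled in by recvmsg(2) and
 * copy out the TLS record header. Returns false if none is present.
 */
bool
get_tls_record_header(struct msghdr *msg, struct tls_get_record *out)
{
	struct cmsghdr *cmsg;

	assert(msg != NULL && out != NULL);
	for (cmsg = CMSG_FIRSTHDR(msg); cmsg != NULL;
	    cmsg = CMSG_NXTHDR(msg, cmsg)) {
		if (cmsg->cmsg_level == IPPROTO_TCP &&
		    cmsg->cmsg_type == TLS_GET_RECORD &&
		    cmsg->cmsg_len >= CMSG_LEN(sizeof(*out))) {
			memcpy(out, CMSG_DATA(cmsg), sizeof(*out));
			return (true);
		}
	}
	return (false);
}
```

A caller would check `tls_type` to tell application data apart from alerts or handshake records before using the decrypted payload.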


TODO: what happens if the app calls read(2) on the socket ?


In case of SW mode the encryption/decryption is done using crypto(9):

The base system includes a software backend for the TCP_TLS_MODE_SW mode which uses
crypto(9) to encrypt and decrypt TLS records.
This backend can be enabled by loading the ktls_ocf.ko kernel module.


the OCF glue for KTLS seems to be produced by Netflix (the respective source file has Netflix copyright)


TLS encryption/decryption can be done in software (TCP_TLS_MODE_SW mode) or hardware (NICs with TLS HW support)

e.g. some Mellanox NICs have TLS (encryption/decryption/authentication) offload

sys/dev/mlx5/mlx5_en/mlx5_en_hw_tls.c (or rather the NIC supported by the driver) supports only TLSv1.2 and TLSv1.3 with the NIST AES-GCM cipher suites

the whole file is ifdef'd KERN_TLS


🔧 TODO: how is the encryption key set up? Is it per connection?

the mbufs have TLS send tags that are passed to the NIC. The tag is probably associated with the key.


sys/opencrypto/ktls_ocf.c is the OCF glue for SW crypto operations


TX/RX split:

can enable offload for send, receive or both
send side can use either sendfile(2) or write(2)
receive side has to always use recvmsg(2)

i.e. not transparent to the application


sys/kern/uipc_ktls.c: main KTLS file with functions for processing TLS records

ktls_work_thread() - infinite loop that processes the items in its work queue, calling ktls_encrypt() for the queued TX work and then ktls_decrypt() for the queued RX work
ktls_encrypt() / ktls_decrypt() - encrypts/decrypts data, expects the TLS headers are present
ktls_frame() - adds TLS header
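To make the framing step more concrete, here is a toy model of splitting a payload into TLS records and writing the 5-byte record header for each. It is not the kernel's ktls_frame(); `frame_records()` and `struct tls_hdr` are invented for this sketch, and the length field here counts only the plaintext chunk, whereas a real record length also includes cipher-specific overhead (nonce, tag, padding).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TLS_HEADER_LEN		5	/* type(1) + version(2) + length(2) */
#define TLS_MAX_PLAINTEXT	16384	/* 2^14 bytes per TLS record */

struct tls_hdr {
	uint8_t bytes[TLS_HEADER_LEN];
};

/*
 * Split data_len bytes into records of at most TLS_MAX_PLAINTEXT bytes
 * and fill in one header per record. Returns the number of headers
 * written (at most max_hdrs).
 */
size_t
frame_records(size_t data_len, uint8_t type, uint8_t vmajor, uint8_t vminor,
    struct tls_hdr *hdrs, size_t max_hdrs)
{
	size_t n = 0;

	assert(hdrs != NULL);
	while (data_len > 0 && n < max_hdrs) {
		size_t chunk = data_len < TLS_MAX_PLAINTEXT ?
		    data_len : TLS_MAX_PLAINTEXT;
		uint8_t *h = hdrs[n].bytes;

		h[0] = type;
		h[1] = vmajor;
		h[2] = vminor;
		h[3] = (uint8_t)(chunk >> 8);	/* length, big endian */
		h[4] = (uint8_t)(chunk & 0xff);
		data_len -= chunk;
		n++;
	}
	return (n);
}
```

For example, 40000 bytes of application data (type 23) end up as three records: two full ones and a final one carrying the remaining 7232 bytes.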


worker threads:

TODO: per CPU


the work is partly asynchronous: https://youtu.be/JdgOQi6lX5M?t=1438

due to the worker threads
if the app exits before the records in the socket buffer are completed/decrypted, the pending mbufs in the socket buffer get freed, which could be a problem for the worker threads that still want to process them.


## Network

The mbuf structure was modified to hold KTLS data. Notably, it holds the struct ktls_session *m_epg_tls.
This is used e.g. in tcp_m_copym(), which checks it while traversing the mbuf chain in order to:

Avoid mixing TLS records with handshake  data or TLS records from different sessions
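The idea behind that check can be shown with a toy model. `struct toy_mbuf` below is a made-up stand-in (not the kernel's struct mbuf), with a NULL session pointer standing for plain, non-TLS data such as handshake bytes. The walk stops as soon as the session changes:

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for an mbuf; field names are invented for the sketch. */
struct toy_mbuf {
	struct toy_mbuf	*next;
	const void	*tls_session;	/* NULL means no TLS session */
	size_t		 len;
};

/*
 * Return how many bytes at the start of the chain belong to the same
 * TLS session (or lack of one) as the first buffer.
 */
size_t
same_session_prefix(const struct toy_mbuf *m)
{
	const void *session;
	size_t total = 0;

	if (m == NULL)
		return (0);
	session = m->tls_session;
	for (; m != NULL && m->tls_session == session; m = m->next) {
		total += m->len;
		assert(total >= m->len);	/* no size_t overflow */
	}
	return (total);
}
```

A sender built on this would copy at most the returned number of bytes into one segment, so that data from two sessions never shares a segment.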

The KERN_TLS code seems to be spread across lots of places in the networking code. There is no framework.
RX work:


2 separate mbuf chains: one for the unprocessed data (incomplete TLS records), another for TLS records to be processed

done in cset 3c0e5685, explanation:


Initially I tried to make this work by marking incoming mbufs as
M_NOTREADY, but there didn't seemed to be a non-gross way to deal with
picking a portion of the mbuf chain and turning it into a new record
in the socket buffer after decrypting the TLS record it contained
(along with prepending a control message). Also, such mbufs would
also need to be "pinned" in some way while they are being decrypted
such that a concurrent sbcut() wouldn't free them out from under the
thread performing decryption.

As such, I settled on the following solution:

- Socket buffers now contain an additional chain of mbufs (sb_mtls,
sb_mtlstail, and sb_tlscc) containing encrypted mbufs appended by
the protocol layer. These mbufs are still marked M_NOTREADY, but
soreceive*() generally don't know about them (except that they will
block waiting for data to be decrypted for a blocking read).

- Each time a new mbuf is appended to this TLS mbuf chain, the socket
buffer peeks at the TLS record header at the head of the chain to
determine the encrypted record's length. If enough data is queued
for the TLS record, the socket is placed on a per-CPU TLS workqueue
(reusing the existing KTLS workqueues and worker threads).

- The worker thread loops over the TLS mbuf chain decrypting records
until it runs out of data. Each record is detached from the TLS
mbuf chain while it is being decrypted to keep the mbufs "pinned".
However, a new sb_dtlscc field tracks the character count of the
detached record and sbcut()/sbdrop() is updated to account for the
detached record. After the record is decrypted, the worker thread
first checks to see if sbcut() dropped the record. If so, it is
freed (can happen when a socket is closed with pending data).
Otherwise, the header and trailer are stripped from the original
mbufs, a control message is created holding the decrypted TLS
header, and the decrypted TLS record is appended to the "normal"
socket buffer chain.

(Side note: the SBCHECK() infrastucture was very useful as I was
able to add assertions there about the TLS chain that caught several
bugs during development.)
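The "peek at the TLS record header to determine the record's length" step from the commit message can be sketched on a flat byte buffer. The function name `tls_record_ready()` is hypothetical and the real code works on mbuf chains, but the 5-byte header layout (type, two version bytes, big endian length) is the standard TLS one:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TLS_HEADER_LEN 5	/* type(1) + version(2) + length(2) */

/*
 * Report whether the first TLS record in the queued bytes is complete.
 * *record_len is set to the full on-the-wire size (header + payload)
 * once the header is available, and to 0 otherwise.
 */
bool
tls_record_ready(const uint8_t *queued, size_t queued_len, size_t *record_len)
{
	assert(record_len != NULL);
	*record_len = 0;
	if (queued_len < TLS_HEADER_LEN)
		return (false);		/* header itself is incomplete */
	*record_len = TLS_HEADER_LEN +
	    (((size_t)queued[3] << 8) | queued[4]);
	return (queued_len >= *record_len);
}
```

Only when this returns true would the socket be handed to a worker thread for decryption; otherwise more data has to arrive first.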


ktls_decrypt() peeks into the TLS chain of the socket buffer (sb_mtls / sb_tlscc) to see if there is a TLS record to be processed

if yes, calls ktls_detach_record() to extract the mbufs
then decrypts the record
appends a control mbuf (with version information, length)
this is repeated in a loop as long as there are complete records to be processed


sb_mark_notready() explains how this works:
* To manage not-yet-decrypted data for KTLS RX, the following scheme
 * is used:
 *
 * - A single chain of NOTREADY mbufs is hung off of sb_mtls.
 *
 * - ktls_check_rx checks this chain of mbufs reading the TLS header
 *   from the first mbuf.  Once all of the data for that TLS record is
 *   queued, the socket is queued to a worker thread.
 *
 * - The worker thread calls ktls_decrypt to decrypt TLS records in
 *   the TLS chain.  Each TLS record is detached from the TLS chain,
 *   decrypted, and inserted into the regular socket buffer chain as
 *   record starting with a control message holding the TLS header and
 *   a chain of mbufs holding the encrypted data.

The last sentence probably wanted to say "decrypted data".
ktls_check_rx() is called e.g. from sbappend_ktls_rx(), which is called from sbappendstream_locked(), which in turn is called from various places in the TCP stack / drivers.

## Crypto


KTLS has a notion of "crypto backends" (struct ktls_crypto_backend) so it does not have to go solely through OCF

OCF: ktls_ocf kmod
can use other kmods (not plugging into OCF)

the kcf_isa-l (Intel ISA-L library) available as a port. That probably bypasses OCF.

used by Netflix in production


there is also the AES-NI kernel driver

part of OCF: sys/crypto/aesni/aesni.c


plus SHA1/SHA256 Ryzen/Intel acceleration


ChaCha20-Poly1305 (SHOULD in TLSv1.3 spec) support for KTLS OCF: https://reviews.freebsd.org/D27841

## OpenSSL modifications

OpenSSL 3.x has a bunch of modifications to support KTLS. The code is mostly in ssl/ktls.c and include/internal/ktls.h.
Defines: OPENSSL_NO_KTLS, OPENSSL_KTLS_TLS13
Note: FreeBSD main repository currently (February 2022) bundles OpenSSL 1.1.1 under crypto/openssl. There are patches merged there from OpenSSL upstream, particularly to ktls.c.
The KTLS support in OpenSSL has 2 flavors, Linux and FreeBSD (ifdef'd). They mostly look the same; however, Linux signals the kernel twice: first that KTLS is enabled on the socket and then again once the keys are determined.
Once the handshake arrives at the "master secret", it signals this together with the session keys to the kernel using the SOL_TLS setsockopt. This is done via the BIO_CTRL_SET_KTLS BIO control, which calls ktls_start(). The control is issued via BIO_set_ktls() (which passes a pointer to the ktls_crypto_info_t structure that was populated with the key info/material in ktls_configure_crypto() shortly before), e.g. at the end of tls13_change_cipher_state() in the case of TLSv1.3.
The read side is done by wrapping recvmsg() via ktls_read_record() that is called from sock_read() or conn_read().
After the ktls_start() call is done, all the TLS processing should be done in the kernel. This actually includes messages sent/received during the rest of the handshake. For TLSv1.3, the "ChangeCipherSpec" message (that is not actually needed for TLSv1.3 and is only included for compatibility) is the last TLS message sent by the server that has the content type visible (the rest of the TLS messages are disguised as application data). OpenSSL expects ktls_read_record() to return a buffer that still starts with a TLS header, but with the rest already decrypted and the hidden content type byte removed. Specifically, ssl3_get_record() has the skip_decryption label for this purpose.
KTLS is supported by both s_client and s_server - via SSL_sendfile(). The SSL_sendfile() only works if KTLS is enabled/present - ktls_sendfile() calls the sendfile syscall.
It seems that KTLS is enabled/used automatically if the cipher suite (ktls_check_supported_cipher()) and other parameters (no padding, fragment size equal to maximum padding) allow it. The relevant parts of the BIO layer use BIO_get_ktls_send()/BIO_get_ktls_recv(). These basically check the BIO flags set by BIO_set_ktls().
This changed in OpenSSL Git changeset a3a54179 and now KTLS has to be explicitly enabled using the SSL_OP_ENABLE_KTLS SSL_CTX flag.
## NFS

### server / mountd

The exports(5) man page defines the format of the /etc/exports server side file that defines mount points to be used by NFS clients and their options. For NFSv4 it specifies 3 TLS related options:

tls


requires that the client use TLS


tlscert


requires that the client use TLS and provide a verifiable X.509 certificate during TLS handshake.


tlscertuser


requires that the client use TLS and provide a verifiable X.509 certificate. The otherName component of the certificate's subjAltName must have an OID of 1.3.6.1.4.1.2238.1.1.1 and a UTF8 string of the form user@domain, which will be translated to the credentials of the specified user in the same manner as nfsuserd


Also, by default TLS is optional:

If none of these three flags are specified, TLS mounts are permitted but not required.

The exports(5) man page mentions that NFSv4 does not use the mount protocol.
The TLS options correspond to the mountd flags: MNT_EXTLS, MNT_EXTLSCERT, MNT_EXTLSCERTUSER
mountd parses the exports file and pushes the configuration to the kernel using the nmount syscall.
In the kernel the flags are converted to ND_EXTLS, ND_EXTLSCERT, ND_EXTLSCERTUSER (nfs.h). These flags are then used in the sys/fs/nfsserver/ code. The presence of the ND_TLS* flags implies the use of the external mbufs (like for sendfile), e.g. in nfsrvd_readdir() it sets the ND_EXTPG flag in the nfsrv_descript structure (describing each request to the NFS server)
A server export can have a non-default certificate. This is done via the tlscertname NFS mount option.

### NFS client

On the client side, it begins in the nfs_mount system call that eventually calls mountnfs(). The system call handling in nfs_mount() checks various options (specified in the nfs_opts array). The "tls" and "tlscertname" options are relevant for KTLS. These can be specified as mount options; the mount_nfs(8) man page documents them:

tls

This option specifies that the connection to the server must use TLS


tlscertname

This option specifies the name of an alternate certificate to be presented to the NFS server during TLS handshake.
The default certificate file names are cert.pem and certkey.pem.
When this option is specified, name replaces cert in the above file names.
For example, if the value of name is specified as other the certificate file names to be used will be
other.pem and otherkey.pem. These files are stored in /etc/rpc.tlsclntd by default.
This option is only meaningful when used with the tls option and the rpc.tlsclntd(8) daemon
is running with the -m command line flag set.
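The file naming rule quoted above is simple enough to write down. The sketch below only illustrates the rule from the man page; `tls_cert_paths()` is a hypothetical helper and not how rpc.tlsclntd itself is structured.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Build the certificate and key file paths for a given tlscertname.
 * A NULL or empty name selects the defaults (cert.pem / certkey.pem).
 * Returns 0 on success, -1 if a path does not fit into len bytes.
 */
int
tls_cert_paths(const char *dir, const char *name, char *cert, char *key,
    size_t len)
{
	int n1, n2;

	assert(dir != NULL && cert != NULL && key != NULL);
	if (name == NULL || strlen(name) == 0)
		name = "cert";
	n1 = snprintf(cert, len, "%s/%s.pem", dir, name);
	n2 = snprintf(key, len, "%s/%skey.pem", dir, name);
	if (n1 < 0 || n2 < 0 || (size_t)n1 >= len || (size_t)n2 >= len)
		return (-1);
	return (0);
}
```

With the directory /etc/rpc.tlsclntd and the name "other" this yields /etc/rpc.tlsclntd/other.pem and /etc/rpc.tlsclntd/otherkey.pem, matching the man page example.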


In the course of processing the nfs_mount system call, newnfs_connect() is called and it will set the CLIENT RPC specific options from the options inside struct nfsmount, specifically the certificate name:
		if (NFSHASTLS(nmp)) {
			CLNT_CONTROL(client, CLSET_TLS, &one);
			if (nmp->nm_tlscertname != NULL)
				CLNT_CONTROL(client, CLSET_TLSCERTNAME,
				    nmp->nm_tlscertname);
		}
These options are handled in clnt_reconnect_control().
There is also clnt_vc_control(), which handles various requests. For the CLSET_TLS request, it starts a kernel thread to handle upcalls.
Also, closing of the TLS session has special aspects - it should be done by the userland daemon. The RPCTLS_SYSC_CLSOCKET syscall handling has:
				/*
				 * Set ssl refno so that clnt_vc_destroy() will
				 * not close the socket and will leave that for
				 * the daemon to do.
				 */
This probably needs to be done because of close_notify etc.

## RPC

### rpc.tlsservd

rpc.tlsservd(8) man page:

program provides support for the server side of the kernel Sun RPC over TLS implementation.
This daemon must be running to allow the kernel RPC to perform the TLS
handshake after a TCP client has sent the STARTTLS Null RPC request to the server.

According to a source code comment, it originated as a copy of the gssd source code. It runs an infinite select loop, handling requests.
        /*
	 * We provide an RPC service on a local-domain socket. The
	 * kernel rpctls code will upcall to this daemon to do the initial
	 * TLS handshake.
	 */
The rpc.tlsservd daemon manages the TLS connections (a list of ssl_entry structures that contain the OpenSSL structures SSL and X509 as members). The socket is first retrieved from the kernel using rpctls_syscall. Then the TLS connection is set up on this socket in rpctls_server() (which calls SSL_accept(), verifies certificates, and verifies that KTLS is working for the connection by calling BIO_get_ktls_send()).
The rpctls_syscall has multiple operation values.
It also performs certificate management tasks. Upon SIGHUP it reloads the CRL and terminates any existing connections whose certificate has been revoked.

### kernel

sys/rpc/rpcsec_tls/ contains the kernel side code + XDR definitions for the RPC used for the upcalls.
There are a number of upcalls defined in rpctls_impl.c:

rpctls_connect() performs "connect" upcall

called from clnt_reconnect_connect() - mainline RPC client code
performs AUTH_TLS NULL RPC first (using the STARTTLS data, see below)
the upcalls are serialized
the upcall is sent to the rpc.tlsclntd daemon

the XDR names use rpctlscd and there are numerous references to it in the source code - perhaps the old/original name ?


rpctls_cl_handlerecord() - handle a non application-data record in the client daemon
rpctls_srv_handlerecord() - ditto for server daemon
rpctls_srv_disconnect() - perform disconnect in the server daemon
rpctls_server() - get new server TLS socket

serialized like the connect

rpctls_server_s is a global variable. Perhaps there can only be one such request happening ?


It seems there can be only one TLS connection setup (either client or server side) happening at any given moment.
Life cycle for the server TLS socket:

kernel stores the socket in a global variable
kernel performs upcall
tlsservd gets the upcall, retrieves the socket using the rpctls_syscall
tlsservd performs TLS accept in rpctls_server(), adds the SSL structure to the list if successful, returns
rpctls_server() (same name as the function in tlsservd) in the kernel sets the global variables to NULL

The sys_rpctls_syscall() provides file descriptor allocation for the userland daemons - RPCTLS_SYSC_CLSOCKET, RPCTLS_SYSC_SRVSOCKET. These call falloc() to allocate a new descriptor and associate it with the pre-existing socket - the socket is created in the kernel first for the RPC.
The overall design is described in the commit message of https://github.com/freebsd/freebsd-src/commit/ab0c29af0512df1e40c30f1b361da7803594336e#diff-2bbd30a05d28da3c007b59a10096aa5bbedd88ae9b3afcad52c659a7b448b9f7
Some interesting tidbits:

The upcalls to the daemons use three fields to uniquely identify the
TCP connection. They are the time.tv_sec, time.tv_usec of the connection
establshment, plus a 64bit sequence number. The time fields avoid problems
with re-use of the sequence number after a daemon restart.
For the server side, once a Null RPC with AUTH_TLS is received, kernel
reception on the socket is blocked and an upcall to the rpctlssd(8) daemon
is done to perform the TLS handshake.  Upon completion, the completion
status of the handshake is stored in xp_tls as flag bits and the reply to
the Null RPC is sent.
For the client, if CLSET_TLS has been set, a new TCP connection will
send the Null RPC with AUTH_TLS to initiate the handshake.  The client
kernel RPC code will then block kernel I/O on the socket and do an upcall
to the rpctlscd(8) daemon to perform the handshake.
If the upcall is successful, ct_rcvstate will be maintained to indicate
if/when an upcall is being done.


When the socket is being shut down, upcalls are done to the daemons, so
that they can perform SSL_shutdown() calls to perform the "peer reset".

Also, this change prohibits the unload of the krpc kernel module because there might be
rpctls syscalls in progress.
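The three-field connection identity from the commit message can be pictured like this. The struct and function names are hypothetical (the real definitions live in the XDR files); the point is only that a stale sequence number from before a daemon restart does not match, because the timestamps differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; the three values follow the commit message. */
struct conn_id {
	uint64_t sec;	/* time.tv_sec of connection establishment */
	uint64_t usec;	/* time.tv_usec of connection establishment */
	uint64_t seq;	/* 64-bit sequence number */
};

/* Two identities match only if all three fields agree. */
bool
conn_id_equal(const struct conn_id *a, const struct conn_id *b)
{
	assert(a != NULL && b != NULL);
	return (a->sec == b->sec && a->usec == b->usec && a->seq == b->seq);
}

/* Linear search of a daemon-side table; returns the index or -1. */
int
conn_lookup(const struct conn_id *table, size_t n, const struct conn_id *key)
{
	for (size_t i = 0; i < n; i++)
		if (conn_id_equal(&table[i], key))
			return ((int)i);
	return (-1);
}
```

A daemon keeping its SSL structures in such a table would use the lookup to find the right session when the kernel asks it to handle a record or to shut a connection down.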

### AUTH_TLS

Once TLS support is discovered on the server side using a NULL RPC call (specifically "the NULL RPC with authentication flavor of AUTH_TLS"), the same connection has to be reused, as noted in section 4.1, "Discovering Server-side TLS Support", of https://tools.ietf.org/html/draft-ietf-nfsv4-rpc-tls-11:

The RPC server signals its corresponding support for RPC-over-TLS by
replying with a reply_stat of MSG_ACCEPTED and an AUTH_NONE verifier
containing the "STARTTLS" token.  The client SHOULD proceed with TLS
session establishment, even if the Reply's accept_stat is not
SUCCESS.  If the AUTH_TLS probe was done via TCP, the RPC client MUST
send the "ClientHello" message on the same connection.  If the
AUTH_TLS probe was done via UDP, the RPC client MUST send the
"ClientHello" message to the same UDP destination port.

The client uses rpctls_impl.c#rpctls_connect(), which performs the NULL RPC call (this is a special procedure that "takes null arguments and returns them").
_svcauth_rpcsec_tls() is the server side:

receives the NULL RPC request
responds to the client (per the draft quoted above, with an AUTH_NONE verifier containing the "STARTTLS" token)
disables reception on the "krpc"
performs an upcall to the rpc.tlsservd daemon
re-enables reception on the "krpc"

The STARTTLS identifier is used in sys/rpc/rpcsec_tls/auth_tls.c and defined as RPCTLS_START_STRING.
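The client-side check of the probe reply described in the draft boils down to comparing the verifier against the token. A minimal sketch, with made-up names (`verf_signals_starttls()`, `AUTH_NONE_FLAVOR`, `STARTTLS_TOKEN`) rather than the kernel's own, assuming AUTH_NONE is flavor 0 as in ONC RPC:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define AUTH_NONE_FLAVOR	0		/* AUTH_NONE in ONC RPC */
#define STARTTLS_TOKEN		"STARTTLS"	/* 8 bytes, no terminator on the wire */

/*
 * Return true if the reply verifier signals RPC-over-TLS support:
 * flavor AUTH_NONE with a body that is exactly the STARTTLS token.
 */
bool
verf_signals_starttls(uint32_t flavor, const void *body, size_t body_len)
{
	assert(body != NULL || body_len == 0);
	return (flavor == AUTH_NONE_FLAVOR &&
	    body_len == strlen(STARTTLS_TOKEN) &&
	    memcmp(body, STARTTLS_TOKEN, body_len) == 0);
}
```

If the check succeeds, the client goes on to send the "ClientHello" on the same connection; an empty AUTH_NONE verifier means the server does not support TLS.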

### Userland daemons


usr.sbin/rpc.tlsclntd/ is the "Sun RPC over TLS Client Daemon"

rpc.tlsclntd(8)
performs connection and related handling (certificates for mutual auth)
manages list of SSL structures


usr.sbin/rpc.tlsservd/ is the "Sun RPC over TLS Server Daemon"

rpc.tlsservd(8)
performs initial TLS handshake when triggered by the upcall from the rpctls code in the kernel
certificate handling
uses local domain socket
manages list of SSL structures