## Another shot at the problem scope
Devices to support:
- GPUs
- FPGAs
- Network cards
- Random Number Generators
This directly leads to different ways in which these devices can be consumed:
- Path in file-system
- Device node
- PCI access
- Netdev
If we look further down the stack, we see that multiple namespaces are
affected:
- mount
- network
- cgroup
- ipc
For non-container runtimes this has yet other implications: no Linux
namespaces need to be modified; instead, the devices need to be passed to
the pod in different ways, e.g.:
- Device passthrough
- Network file-system
- Virtual devices
There is a lot to consider when providing devices to pods.
Aside from that, there is no mechanism that provides any _details_ about the
devices to the workload.
Thus a workload itself has a hard time understanding
- _what_ resources are exposed and
- _how_ the resources are exposed
## Another look at today's situation
In a rough sketch (just a stencil one), the chain a device needs to pass
through in order to be consumable by a pod looks like:
```
+----+      +---------+      +-----+      +-----+
| DP | ---- | Kubelet | ---- | CRI | ---- | Pod |
+----+      +---------+      +-----+      +-----+
```
However, considering that different CRI implementations have different isolation
mechanisms, the picture rather looks like:
```
Node context                    :  Runtime context
                                :
+----+      +---------+      +-----+      +-----+
| DP | ---- | Kubelet | ---- | CRI | ---- | Pod |
+----+      +---------+      +-----+      +-----+
                                :
                                :
```
The separator passing through the CRI isolates (as intended) the workload
from the node.
To highlight the implication of this: it cannot be assumed that there is a
shared environment between the pod and the device plugin.
## Status
### Path based devices
Today, paths can be passed down the DP gRPC API in order to get them mounted
into the pod. This works fine for container runtimes, but hypervisor runtimes
like kata have issues.
For kata a device needs to be passed into a virtual machine, which requires a
different "device form" than a path. In order to obtain this other device form,
suitable to be passed to a virtual machine, kata today looks up the device
behind the passed path and does some magic to free the device and ultimately
pass it via a passthrough mechanism to the hypervisor [REF].
Note: This is a rough description.
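To illustrate the kind of lookup involved, below is a minimal Go sketch that
resolves a device node path to the PCI address of its parent device via sysfs.
This is not kata's actual implementation; the assumption that the device node
has a PCI parent, and the sysfs-based resolution, are illustrative only.
```go
package main

import (
	"fmt"
	"os"
	"path/filepath"

	"golang.org/x/sys/unix"
)

// pciAddressForDevNode resolves a device node (e.g. /dev/vfio/12) to the PCI
// address of its parent device by following sysfs symlinks. Illustrative
// only: kata's actual lookup is more involved.
func pciAddressForDevNode(devNode string) (string, error) {
	var st unix.Stat_t
	if err := unix.Stat(devNode, &st); err != nil {
		return "", err
	}
	major := unix.Major(uint64(st.Rdev))
	minor := unix.Minor(uint64(st.Rdev))

	// Character devices are indexed under /sys/dev/char/<major>:<minor>;
	// their "device" symlink points at the underlying (e.g. PCI) device.
	sysPath := fmt.Sprintf("/sys/dev/char/%d:%d/device", major, minor)
	target, err := filepath.EvalSymlinks(sysPath)
	if err != nil {
		return "", err
	}
	// For a PCI device the last path element is its address, e.g. "0000:00:1f.2".
	return filepath.Base(target), nil
}

func main() {
	addr, err := pciAddressForDevNode(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(addr)
}
```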
### Non-path based devices and additional operations
Even container based runtimes have requirements when passing devices to
containers which are not met by the current DP gRPC API, e.g. when trying
to pass a network interface to a pod.
The issue is that network interfaces are not represented by paths and thus
cannot use the existing API to bring the device into a pod.
Currently workarounds are in use [REF] which allow the device plugin to
retrieve the pod uuid one way or another, in order to permit the device
plugin to operate on the pod context and the containers within.
E.g. the pod uuid can be used to move a network interface into a pod and to
perform additional actions like setting up iptables rules, as sketched below.
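As an illustration of what such a workaround does under the hood, the
following Go sketch moves a host network interface into a pod's network
namespace using the vishvananda/netlink library. The interface name and the
way the pod's PID is obtained are placeholder assumptions; real plugins
resolve the namespace from the pod uuid via runtime-specific means.
```go
package main

import (
	"log"
	"os"
	"strconv"

	"github.com/vishvananda/netlink"
)

// Sketch of the container-runtime workaround: given the PID of a process
// inside the pod (derived from the pod uuid via runtime-specific means),
// move a host network interface into the pod's network namespace. This
// only works when node and pod share a kernel, i.e. it breaks for
// hypervisor runtimes such as kata.
func main() {
	ifName := os.Args[1]                    // e.g. "eth1" (placeholder)
	podPid, err := strconv.Atoi(os.Args[2]) // PID of a process in the pod
	if err != nil {
		log.Fatal(err)
	}

	link, err := netlink.LinkByName(ifName)
	if err != nil {
		log.Fatal(err)
	}
	// Re-home the interface into the pod's network namespace.
	if err := netlink.LinkSetNsPid(link, podPid); err != nil {
		log.Fatal(err)
	}
	// Further setup (routes, iptables rules) would have to happen from
	// within that namespace, e.g. after entering it via setns.
}
```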
## Yet another option
tl;dr Extend the DP gRPC API in order to pass multiple device types, and
introduce an optional DP sidecar container concept to perform operations
within a pod.
### Declaring devices
Above it was discussed that today only paths can be passed via the DP gRPC API.
A simple solution to support hypervisor runtimes is to have the ability to
pass a type hint and device identifier via the API.
E.g.
- (path, "/some/path")
- (pci, "0011:2233")
As some logical devices expose more than one physical or virtual device, the
API should permit passing a list of (type, device) tuples; a sketch of such a
declaration follows below.
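Here is a minimal Go sketch of what such tuples could look like; all type and
field names (DeviceKind, DeviceRef, AllocateResponse) are made up for
illustration and are not part of the actual DP gRPC API.
```go
// Illustrative only: these types are not part of the actual DP gRPC API;
// they sketch the proposed (type, device) tuples as Go structs.
package deviceplugin

// DeviceKind is the type hint telling the runtime how to interpret the
// identifier that follows.
type DeviceKind string

const (
	DeviceKindPath   DeviceKind = "path"   // file-system path / device node
	DeviceKindPCI    DeviceKind = "pci"    // PCI address
	DeviceKindNetdev DeviceKind = "netdev" // network interface name
)

// DeviceRef is one (type, device) tuple.
type DeviceRef struct {
	Kind DeviceKind
	ID   string // e.g. "/some/path" or "0011:2233"
}

// AllocateResponse sketches the extended response: one logical device may
// expose several physical or virtual devices, hence a list of tuples.
type AllocateResponse struct {
	Devices []DeviceRef
}
```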
### Device Plugin Sidecar-Container
By declaring multiple devices, including their type, we can assume that we
get the required devices into a pod and container.
However, specifically for network devices - or certain network plugins - it is
required to perform additional pod level operations like setting up routes
or iptables rules.
As outlined above, the current workaround is to directly access the pod. The
issue is that this assumes a shared context between the node and the runtime -
but this assumption is not always met.
In order to permit the manipulation of the pod context, we introduce a sidecar
concept.
Whenever an operation is needed at the pod level, a sidecar can be introduced
which speaks to the DP via a gRPC mechanism, and this sidecar can then perform
all the necessary operations.
The main assumption here is that the DP and the sidecar can communicate over
the gRPC channel.
In a stencil sketch:
```
Node context                    :  Runtime context
                                :
+----+      +---------+      +-----+      +-----------+
| DP | ---- | Kubelet | ---- | CRI | ---- |    Pod    |
+----+      +---------+      +-----+      | [sidecar] |
  |                             :         +-----A-----+
  |                             :               |
  +------------------- gRPC --------------------+
                                :
```
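As a minimal Go sketch of the contract the DP and the sidecar could agree on
over that channel: the service shape and all names (PodOp, PodConfigurator)
are hypothetical and not part of any existing API.
```go
// Illustrative only: this contract between the device plugin (node context)
// and its sidecar (runtime context) is hypothetical; no such API exists in
// the device plugin framework today.
package sidecar

import "context"

// PodOp describes one operation the sidecar should perform inside the pod,
// e.g. adding a route or an iptables rule for a freshly attached netdev.
// The Kind values are placeholders.
type PodOp struct {
	Kind string            // e.g. "route", "iptables"
	Args map[string]string // operation-specific parameters
}

// PodConfigurator is what the sidecar would expose over the gRPC channel;
// the DP calls it once the device has been attached to the pod.
type PodConfigurator interface {
	// Configure applies the requested operations within the pod's
	// namespaces and reports success or failure back to the DP.
	Configure(ctx context.Context, ops []PodOp) error
}
```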