## Another shot at the problem scope
Devices to support:
- GPUs
- FPGAs
- Network cards
- Random Number Generators
This directly leads to different ways in which these devices can be consumed:
- Path in file-system
- Device node
- PCI access
- Netdev
If we look further down the stack, we see that multiple namespaces are
affected:
- mount
- network
- cgroup
- ipc
For non-container runtimes this has yet other implications: no Linux
namespaces need to be modified; instead, the devices need to be passed to
the pod in different ways, e.g.:
- Device passthrough
- Network file-system
- Virtual devices
There is a lot to consider when providing devices to pods.
Aside from that, there is no mechanism that provides any _details_ about the
devices to the workload.
Thus a workload itself has a hard time understanding
- _what_ resources are exposed and
- _how_ the resources are exposed
## Another look at today's situation
In a rough sketch (just a stencil one), the chain a device needs to pass
through in order to be consumable by a pod looks like:
```
+----+      +---------+      +-----+      +-----+
| DP | ---- | Kubelet | ---- | CRI | ---- | Pod |
+----+      +---------+      +-----+      +-----+
```
However, considering that different CRI implementations have different isolation
mechanisms, the picture rather looks like:
```
Node context                    :  Runtime context
                                :
+----+      +---------+      +-----+      +-----+
| DP | ---- | Kubelet | ---- | CRI | ---- | Pod |
+----+      +---------+      +-----+      +-----+
                                :
                                :
```
The separator passing through the CRI isolates (as intended) the workload
from the node.
To highlight the implication of this: it cannot be assumed that there is a
shared environment between the pod and the device plugin.
## Status
### Path based devices
Today, paths can be passed down the DP gRPC API in order to get them mounted
into the pod. This works fine for container runtimes, but hypervisor runtimes
like kata have issues.
For kata a device needs to be passed into a virtual machine, which requires a
different "device form" than a path. In order to obtain this other device form,
suitable to be passed to a virtual machine, kata today looks up the device
behind the passed path and does some magic to free the device and ultimately
pass it via a passthrough mechanism to the hypervisor [REF].
Note: This is a rough description.
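To illustrate the kind of lookup involved, below is a minimal Go sketch that
resolves a device node path to the PCI address of its parent device via sysfs.
This is not kata's actual implementation; the assumption that the device node
has a PCI parent, and the sysfs-based resolution, are illustrative only.
```go
package main

import (
	"fmt"
	"os"
	"path/filepath"

	"golang.org/x/sys/unix"
)

// pciAddressForDevNode resolves a device node (e.g. /dev/vfio/12) to the PCI
// address of its parent device by following sysfs symlinks. Illustrative
// only: kata's actual lookup is more involved.
func pciAddressForDevNode(devNode string) (string, error) {
	var st unix.Stat_t
	if err := unix.Stat(devNode, &st); err != nil {
		return "", err
	}
	major := unix.Major(uint64(st.Rdev))
	minor := unix.Minor(uint64(st.Rdev))

	// Character devices are indexed under /sys/dev/char/<major>:<minor>;
	// their "device" symlink points at the underlying (e.g. PCI) device.
	sysPath := fmt.Sprintf("/sys/dev/char/%d:%d/device", major, minor)
	target, err := filepath.EvalSymlinks(sysPath)
	if err != nil {
		return "", err
	}
	// For a PCI device the last path element is its address, e.g. "0000:00:1f.2".
	return filepath.Base(target), nil
}

func main() {
	addr, err := pciAddressForDevNode(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(addr)
}
```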
### Non-path based devices and additional operations
Even container based runtimes have requirements when passing devices to
containers which are not met by the current DP gRPC API, e.g. when trying
to pass a network interface to a pod.
The issue is that network interfaces are not represented by paths and thus
cannot use the existing API to bring the device into a pod.
Currently workarounds are in use [REF] which allow the device plugin to
retrieve the pod uuid one way or another, in order to permit the device
plugin to operate on the pod context and the containers within.
E.g. the pod uuid can be used to move a network interface into a pod and to
perform additional actions like setting up iptables rules, as sketched below.
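As an illustration of what such a workaround does under the hood, the
following Go sketch moves a host network interface into a pod's network
namespace using the vishvananda/netlink library. The interface name and the
way the pod's PID is obtained are placeholder assumptions; real plugins
resolve the namespace from the pod uuid via runtime-specific means.
```go
package main

import (
	"log"
	"os"
	"strconv"

	"github.com/vishvananda/netlink"
)

// Sketch of the container-runtime workaround: given the PID of a process
// inside the pod (derived from the pod uuid via runtime-specific means),
// move a host network interface into the pod's network namespace. This
// only works when node and pod share a kernel, i.e. it breaks for
// hypervisor runtimes such as kata.
func main() {
	ifName := os.Args[1]                    // e.g. "eth1" (placeholder)
	podPid, err := strconv.Atoi(os.Args[2]) // PID of a process in the pod
	if err != nil {
		log.Fatal(err)
	}

	link, err := netlink.LinkByName(ifName)
	if err != nil {
		log.Fatal(err)
	}
	// Re-home the interface into the pod's network namespace.
	if err := netlink.LinkSetNsPid(link, podPid); err != nil {
		log.Fatal(err)
	}
	// Further setup (routes, iptables rules) would have to happen from
	// within that namespace, e.g. after entering it via setns.
}
```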
## Yet another option
tl;dr Extend the DP gRPC API in order to pass multiple device types, and
introduce an optional DP sidecar container concept to perform operations
within a pod.
### Declaring devices
Above it was discussed that today only paths can be passed via the DP gRPC API.
A simple solution to support hypervisor runtimes is to have the ability to
pass a type hint and device identifier via the API.
E.g.
- (path, "/some/path")
- (pci, "0011:2233")
As some logical devices expose more than one physical or virtual device, the
API should permit passing a list of (type, device) tuples; a sketch of such a
declaration follows below.
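Here is a minimal Go sketch of what such tuples could look like; all type and
field names (DeviceKind, DeviceRef, AllocateResponse) are made up for
illustration and are not part of the actual DP gRPC API.
```go
// Illustrative only: these types are not part of the actual DP gRPC API;
// they sketch the proposed (type, device) tuples as Go structs.
package deviceplugin

// DeviceKind is the type hint telling the runtime how to interpret the
// identifier that follows.
type DeviceKind string

const (
	DeviceKindPath   DeviceKind = "path"   // file-system path / device node
	DeviceKindPCI    DeviceKind = "pci"    // PCI address
	DeviceKindNetdev DeviceKind = "netdev" // network interface name
)

// DeviceRef is one (type, device) tuple.
type DeviceRef struct {
	Kind DeviceKind
	ID   string // e.g. "/some/path" or "0011:2233"
}

// AllocateResponse sketches the extended response: one logical device may
// expose several physical or virtual devices, hence a list of tuples.
type AllocateResponse struct {
	Devices []DeviceRef
}
```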
### Device Plugin Sidecar-Container
By declaring multiple devices, including their type, we can assume that we
get the required devices into a pod and container.
However, specifically for network devices - or certain network plugins - it is
required to perform additional pod level operations like setting up routes
or iptables rules.
As outlined above, the current workaround is to directly access the pod. The
issue is that this assumes a shared context between the node and the runtime -
but this assumption is not always met.
In order to permit the manipulation of the pod context, we introduce a sidecar
concept.
Whenever an operation is needed at the pod level, a sidecar can be introduced
which speaks to the DP via a gRPC mechanism, and this sidecar can then perform
all the necessary operations.
The main assumption here is that the DP and the sidecar can communicate over
the gRPC channel.
In a stencil sketch:
```
Node context                    :  Runtime context
                                :
+----+      +---------+      +-----+      +-----------+
| DP | ---- | Kubelet | ---- | CRI | ---- |    Pod    |
+----+      +---------+      +-----+      | [sidecar] |
  |                             :         +-----A-----+
  |                             :               |
  +------------------- gRPC --------------------+
                                :
```
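As a minimal Go sketch of the contract the DP and the sidecar could agree on
over that channel: the service shape and all names (PodOp, PodConfigurator)
are hypothetical and not part of any existing API.
```go
// Illustrative only: this contract between the device plugin (node context)
// and its sidecar (runtime context) is hypothetical; no such API exists in
// the device plugin framework today.
package sidecar

import "context"

// PodOp describes one operation the sidecar should perform inside the pod,
// e.g. adding a route or an iptables rule for a freshly attached netdev.
// The Kind values are placeholders.
type PodOp struct {
	Kind string            // e.g. "route", "iptables"
	Args map[string]string // operation-specific parameters
}

// PodConfigurator is what the sidecar would expose over the gRPC channel;
// the DP calls it once the device has been attached to the pod.
type PodConfigurator interface {
	// Configure applies the requested operations within the pod's
	// namespaces and reports success or failure back to the DP.
	Configure(ctx context.Context, ops []PodOp) error
}
```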