Created
May 25, 2018 07:14
## Another shot at the problem scope

Devices to support:

- GPUs
- FPGAs
- Network cards
- Random Number Generators

This directly leads to different ways in which these devices can be consumed:

- Path in file-system
- Device node
- PCI access
- Netdev

If we look further down the stack, we see that multiple namespaces are
affected:

- mount
- network
- cgroup
- ipc

For non-container runtimes this has further implications: no Linux
namespaces need to be modified, but rather the devices need to be passed to
the pod in different ways, e.g.:

- Device passthrough
- Network file-system
- Virtual devices

There is a lot to consider when providing devices to pods.

Aside from that, there is no mechanism that provides any _details_ about
the devices to the workload.
Thus a workload itself has a hard time understanding

- _what_ resources are exposed and
- _how_ the resources are exposed
## Another look at today's situation

In a rough sketch (just a stencil one), the chain a device needs to pass in
order to be consumable by a pod looks like (DP being the device plugin):

    +----+      +---------+      +-----+      +-----+
    | DP | ---- | Kubelet | ---- | CRI | ---- | Pod |
    +----+      +---------+      +-----+      +-----+

However, considering that different CRI implementations have different isolation
mechanisms, the picture looks more like:

                       Node context : Runtime context
                                    :
    +----+      +---------+      +-----+      +-----+
    | DP | ---- | Kubelet | ---- | CRI | ---- | Pod |
    +----+      +---------+      +-----+      +-----+
                                    :
                                    :

The separator passing through the CRI isolates (as intended) the workload
from the node.
To highlight: the implication is that it cannot be assumed that there
is a shared environment between the pod and the device plugin.
## Status

## Path based devices

Today paths can be passed down the DP gRPC API in order to get them mounted into
the pod. This works fine for container runtimes, but hypervisor-based runtimes
such as Kata have issues.

For Kata, a device needs to be passed into a virtual machine, which requires a
different "device form" than a path. In order to get this other device form,
which is suitable to be passed to a virtual machine, Kata today looks up the
device behind the passed path, does some magic to free the device, and ultimately
passes it via a passthrough mechanism to the hypervisor [REF].

Note: This is a rough description.
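
To make "looking up the device behind the passed path" more concrete, here is a minimal sketch of one way this can be done on Linux: the device number of a character device node is mapped to its entry under `/sys/dev/char`. This is an illustration only and not necessarily what Kata does; the function names are hypothetical.

```python
import os
import stat


def sysfs_char_link(major: int, minor: int) -> str:
    """Return the sysfs entry Linux exposes for a character device number."""
    return f"/sys/dev/char/{major}:{minor}"


def device_behind_path(path: str) -> str:
    """Resolve a character device node to its backing device directory in sysfs.

    Hypothetical helper: a runtime could use the resolved directory to find
    out which physical device (for example a PCI function) backs the node.
    """
    st = os.stat(path)
    if not stat.S_ISCHR(st.st_mode):
        raise ValueError(f"{path} is not a character device node")
    link = sysfs_char_link(os.major(st.st_rdev), os.minor(st.st_rdev))
    # The sysfs entry is a symlink; following it yields the device directory.
    return os.path.realpath(link)
```

For example, `device_behind_path("/dev/urandom")` resolves the sysfs directory of that device on a typical Linux node. Any further step (freeing the device, handing it to the hypervisor) is runtime specific and is left out here.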
## Non-path based devices and additional operations

Even container-based runtimes have requirements when passing devices to
containers which are not met by the current DP gRPC API, e.g. when trying
to pass a network interface to a pod.
The issue is that network interfaces are not represented by paths, and
thus cannot use the existing API to bring the device into a pod.

Currently workarounds are in use [REF] which allow the device plugin to
retrieve the pod UUID in one way or another, in order to permit the device
plugin to operate on the pod context and the containers within.
E.g. the pod UUID can be used to move a network interface into a pod and perform
additional actions like setting up iptables rules.
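
As a rough illustration of such a workaround, the sketch below builds the commands a device plugin might run on the node once it knows a process ID inside the pod. How the pod UUID is mapped to that process ID depends on the workaround in use and is not shown; the function names and the example values are assumptions. The commands themselves use the standard `ip link set ... netns` and `nsenter` syntax.

```python
from typing import List


def move_interface_cmd(ifname: str, pod_pid: int) -> List[str]:
    """Build the iproute2 command moving `ifname` into the netns of `pod_pid`."""
    if not ifname or pod_pid <= 0:
        raise ValueError("need an interface name and a positive PID")
    return ["ip", "link", "set", "dev", ifname, "netns", str(pod_pid)]


def in_pod_netns_cmd(pod_pid: int, argv: List[str]) -> List[str]:
    """Wrap a command so that it runs inside the pod's network namespace."""
    return ["nsenter", "--target", str(pod_pid), "--net", "--"] + argv


# Hypothetical values: interface "eth1" and a pod process with PID 4242.
move = move_interface_cmd("eth1", 4242)
rule = in_pod_netns_cmd(4242, ["iptables", "-A", "INPUT", "-i", "eth1", "-j", "ACCEPT"])
```

Both commands only make sense when the device plugin and the pod share the node's kernel, which is exactly the assumption that does not hold for hypervisor-based runtimes.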
## Yet another option

tl;dr Extend the DP gRPC API in order to pass multiple device types, and
introduce an optional DP sidecar container concept to perform operations
within a pod.
### Declaring devices

Above it was discussed that today only paths can be passed via the DP gRPC API.
A simple solution to support hypervisor runtimes is to have the ability to
pass a type hint and device identifier via the API.
E.g.

- (path, "/some/path")
- (pci, "0011:2233")

As some logical devices expose more than one physical or virtual device, the
API should permit passing a list of (type, device) tuples.
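
The list of (type, device) tuples could be modelled roughly as follows. This is a sketch of the data shape only, not a proposal for the actual gRPC message definition; `DeviceType`, `DeviceSpec` and `declare` are hypothetical names, and only the two types from the example above are included.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple


class DeviceType(Enum):
    """Type hints a device plugin could attach to a device identifier."""
    PATH = "path"
    PCI = "pci"


@dataclass(frozen=True)
class DeviceSpec:
    """One (type, device) tuple as it would travel through the API."""
    type: DeviceType
    device: str


def declare(devices: List[Tuple[str, str]]) -> List[DeviceSpec]:
    """Convert raw (type, device) tuples into validated DeviceSpec entries.

    An unknown type string raises ValueError via the Enum lookup.
    """
    return [DeviceSpec(DeviceType(kind), ident) for kind, ident in devices]


# One logical device that exposes both a path and a PCI device.
logical_device = declare([("path", "/some/path"), ("pci", "0011:2233")])
```

With an explicit type, a container runtime can keep mounting `path` entries as it does today, while a hypervisor runtime can pick the entry whose form it can pass into a virtual machine.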
### Device Plugin Sidecar-Container

By declaring multiple devices including their type, we can assume that we
get the required device into a pod and container.

However, specifically for network devices (or certain network plugins) it is
required to perform additional pod level operations like setting up routes
or iptables rules.

As outlined above, the current workaround is to directly access the pod. The
issue is that this assumes a shared context between the node and the runtime, but
this assumption is not always met.

In order to permit the manipulation of the pod context, we introduce a sidecar
concept.
Whenever an operation is needed on the pod level, a sidecar can be
introduced which speaks to the DP via a gRPC mechanism, and this sidecar
can perform all the necessary operations.

The main assumption here is that the DP and the sidecar can communicate over the
gRPC channel.

In a stencil sketch:

                       Node context : Runtime context
                                    :
    +----+      +---------+      +-----+      +-----------+
    | DP | ---- | Kubelet | ---- | CRI | ---- |    Pod    |
    +----+      +---------+      +-----+      | [sidecar] |
       |                            :         +-----A-----+
       |                            :               |
       +------------------- gRPC -------------------+
                                    :
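
The interaction between the DP and the sidecar can be sketched as below. To keep the example self-contained, the gRPC channel is replaced by a plain in-process object, and the pod-level actions are only recorded instead of executed. All names (`PodOperation`, `DevicePluginStub`, `Sidecar`, the `add_route` action and the pod identifier) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class PodOperation:
    """One pod-level action requested by the device plugin (hypothetical)."""
    action: str    # e.g. "add_route" or "add_iptables_rule"
    argument: str  # action-specific payload


class DevicePluginStub:
    """Stand-in for the DP end of the channel; a real one would be a gRPC service."""

    def __init__(self, pending: Dict[str, List[PodOperation]]):
        self._pending = pending

    def operations_for(self, pod_uid: str) -> List[PodOperation]:
        return self._pending.get(pod_uid, [])


class Sidecar:
    """Runs inside the pod, asks the DP what to do and does it locally."""

    def __init__(self, pod_uid: str, plugin: DevicePluginStub,
                 handlers: Dict[str, Callable[[str], None]]):
        self.pod_uid = pod_uid
        self.plugin = plugin
        self.handlers = handlers

    def run_once(self) -> List[str]:
        done = []
        for op in self.plugin.operations_for(self.pod_uid):
            self.handlers[op.action](op.argument)  # KeyError on unknown action
            done.append(op.action)
        return done


applied: List[str] = []  # records what the sidecar would have configured
plugin = DevicePluginStub({"pod-1234": [PodOperation("add_route", "10.0.0.0/24 dev eth1")]})
sidecar = Sidecar("pod-1234", plugin, {"add_route": applied.append})
completed = sidecar.run_once()
```

The point of the design is that the operations are carried out from within the runtime context, so the DP never needs direct access to the pod's namespaces; only the channel crosses the separator.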