Gentoo with GPU-enabled Podman and MACVLAN

1 Introduction

1.1 Purpose

To show the steps taken to set up Podman with the NVIDIA container runtime and MACVLAN networking on a headless Gentoo installation. The use case is running containerized Plex/Jellyfin with NVENC/NVDEC; this may not be sufficient for other CUDA applications.

A lot of these topics have little documentation, so this is mostly to document them for my own benefit, but others might find it useful. These instructions may not follow all security guidelines or best practices, so use at your own risk.

Why Gentoo? Gentoo allows us to fine-tune our software so only the features and dependencies we want are pulled in, reducing the size of our install. This is perfect for container hosts, because all we really need is something to run containers on.

Installing Gentoo is out of this document's scope, and I suggest following the official Gentoo handbook.

1.2 Prerequisites

  • Working AMD64 Gentoo installation using glibc and OpenRC
  • NVIDIA graphics card (GeForce 900 or later)

2 Host Setup

2.1 NVIDIA Drivers

For the NVIDIA container toolkit to function, we must install the proprietary NVIDIA drivers.

Let's go ahead and make sure the open-source nouveau drivers aren't loaded into the kernel at boot. NVIDIA already blacklists the nouveau driver when its drivers are installed, so this may not be strictly necessary; this expands on that to cover other modules that can conflict with the proprietary driver.

/etc/modprobe.d/blacklist-nouveau.conf

blacklist nouveau
blacklist rivafb
blacklist nvidiafb
blacklist rivatv
blacklist nv
blacklist uvcvideo

If you haven't already, specify the video card variable in make.conf. It might be a good idea to run emerge -avuDU --with-bdeps=y @world after changing this.

/etc/portage/make.conf

VIDEO_CARDS="nvidia"

Add a line for the NVIDIA drivers to enable persistenced and prevent X11 dependencies from being pulled in. nvidia-persistenced handles kernel module activation and GPU initialization so we don't have to do it manually.

/etc/portage/package.use/nvidia

x11-drivers/nvidia-drivers -tools -X persistenced

To accept the NVIDIA proprietary licenses, add these lines to your portage configuration:

/etc/portage/package.license

x11-drivers/nvidia-drivers NVIDIA-r2
dev-util/nvidia-cuda-toolkit NVIDIA-CUDA

We should now be ready to install the NVIDIA drivers. Be sure to inspect all the dependencies and USE flags before installing. You may notice libX11 among the dependencies, which is fine, but stop if you see anything like GTK and friends.

# emerge -av nvidia-drivers

Enable persistenced.

# rc-update add nvidia-persistenced default

Assuming you are using a dist-kernel, regenerate initramfs. If you aren't using a dist-kernel, you probably already know how to do this 😉

# emerge --config gentoo-kernel

Reboot to make sure all the changes have taken effect, and take a look at dmesg to ensure there weren't any errors.
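
To sanity-check the driver from the console after the reboot, a couple of read-only commands should be enough (assuming the modules were built and loaded as expected):

# lsmod | grep nvidia
# dmesg | grep -i nvrm
# rc-service nvidia-persistenced status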

2.2 NVIDIA Container Toolkit

The next two pieces of software needed from NVIDIA are libnvidia-container and nvidia-container-toolkit. These hook into Podman to allow GPU access and can be obtained from the Gentoo GURU repository.

First, let's add the GURU repo. The easiest way is to enable it with eselect-repository. We'll also need Git to sync these repositories.

# emerge -av eselect-repository dev-vcs/git

Enable the GURU repo:

# eselect repository enable guru

Refresh Portage:

# emaint -a all
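
To double-check that the repository is active, eselect can list the enabled repositories; guru should show up in the output:

# eselect repository list -i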

Everything in the GURU repository is keyworded ~amd64, because this is a user-maintained repository (sorta kinda like the AUR) and we don't want to upgrade our existing packages or install from this repo by default. So we have to explicitly allow the two packages we need:

/etc/portage/package.accept_keywords/nvidia

sys-libs/libnvidia-container ~amd64
app-containers/nvidia-container-toolkit ~amd64

Install the software:

# emerge -av sys-libs/libnvidia-container app-containers/nvidia-container-toolkit

There is additional setup required to integrate this with Podman that will be covered later.

2.3 MACVLAN Networking

MACVLAN networking with containers allows us to assign routable IP addresses to our containers, exposing them to the network and eliminating the need for port forwarding on the host. Docker made this easy by even going as far as creating the parent VLAN network interface on the host for you. Podman has the same functionality but does not create the parent for us, so we have to create it ourselves. Fortunately this is easy on Gentoo.

For the purposes of this section, we will assume that you have configured a VLAN on your local network and your installation is configured with an existing ethernet interface.

On the router side, set up your VLAN on a different subnet and disable DHCP so we can manually assign IPs and they won't clash with any existing leases.

Add the following lines to your Gentoo network configuration, substituting enp9s0 and the VLAN ID with your own.

/etc/conf.d/net

# existing config for ethernet iface
config_enp9s0="10.9.27.8/24"
routes_enp9s0="default via 10.9.27.1"
dns_servers_enp9s0="1.1.1.1 1.0.0.1"

# lines to add for example VLAN ID 50
vlans_enp9s0="50"
config_enp9s0_50="10.19.27.1 netmask 255.255.254.0"
routes_enp9s0_50="default via 10.19.27.1"

After a reboot, we should now have an additional network interface called enp9s0.50 that we can attach a Podman network to.
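
If you'd rather not wait for a reboot, restarting the netifrc service for the parent interface should create the VLAN interface right away (this assumes /etc/init.d/net.enp9s0 already exists as a symlink to net.lo, as in a standard netifrc setup):

# rc-service net.enp9s0 restart
# ip -d link show enp9s0.50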

I can't remember if this is necessary, but I believe it is for this particular network configuration. Add the following conf file to your sysctl.conf.d directory:

/etc/sysctl.conf.d/forwarding.conf

net.ipv4.ip_forward = 1
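
To apply the setting immediately without rebooting:

# sysctl -w net.ipv4.ip_forward=1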

2.4 Podman

Finally, let's install Podman. There are a few additional packages we will be installing as well:

  • app-containers/netavark - New network stack for podman
  • app-containers/aardvark-dns - DNS server for containers written in Rust
  • app-misc/jq - command-line JSON processor so we can pipe podman output to configuration files

Netavark and Aardvark are the upstream defaults for podman, but they are currently keyworded ~amd64 in the main repo. Initially, I could not get them working on Gentoo, so as an alternative I was using app-containers/dnsname-cni-plugin from GURU. This is still a working option, but as a caveat I discovered that container-to-container name resolution was spotty.

/etc/portage/package.accept_keywords/podman

app-containers/netavark ~amd64
app-containers/aardvark-dns ~amd64

Install the packages:

# emerge -av app-containers/netavark app-containers/aardvark-dns app-containers/podman app-misc/jq

Create the following config file:

/etc/containers/containers.conf

[network]
# default value; set to "cni" if using dnsname-cni-plugin
network_backend = "netavark"
# this will bind port 53 on the host as well
dns_bind_port = 53

Note that if you specify --dns in the container run command, it will override /etc/resolv.conf and container-to-container name resolution will not work.

3 Podman Configuration

3.1 NVIDIA container runtime

Setting up GPU containers with Podman is not well documented, but fortunately NVIDIA's recent changes have made this a little easier.

First, let's make sure our runtime is configured properly. The runtime runs as a hook at container startup, and mounts all the binary utilities like nvidia-smi to /opt/bin inside the container. Review /etc/nvidia-container-runtime/config.toml:

Disclaimer: This is my personal config, I am providing it as a "works for me" example. I have indicated where I've uncommented or changed a value from the default. I don't actually remember why I've changed some of these, but I don't dare touch it at this point.

/etc/nvidia-container-runtime/config.toml

disable-require = false
#swarm-resource = "DOCKER_RESOURCE_GPU"

[nvidia-container-cli]
#root = "/run/nvidia/driver"
#### path is commented out in default config
path = "/usr/bin/nvidia-container-cli"
environment = []
#### debug is optional, commented out in default config.
debug = "/var/log/nvidia-container-toolkit.log"
#ldcache = "/etc/ld.so.cache"
load-kmods = true
#### no-cgroups is commented out in default, must be set to true to use rootless containers
no-cgroups = false
user = "root:video"
ldconfig = "@/sbin/ldconfig"
#alpha-merge-visible-devices-envvars = false

[nvidia-container-runtime]
debug = "~/.local/nvidia-container-runtime.log"

After we've reviewed the runtime config, let's generate the CDI specification so Podman can talk to our GPU. NVIDIA has a tool to generate this for us. Make sure nvidia-persistenced is running before doing this.

# nvidia-ctk cdi generate --output /etc/cdi/nvidia.yaml

You should see some warnings about certain missing OpenGL libraries, which is normal considering we did not install them.
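
To sanity-check what ended up in the spec, recent versions of the toolkit can list the generated device names; grepping the YAML works as a rougher fallback:

# nvidia-ctk cdi list
# grep 'name:' /etc/cdi/nvidia.yaml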

3.2 Container Networking

Bridge Network

Podman does not have built-in container-to-container name resolution like Docker does. If you recall from the installation stage, we installed a couple of extra packages to enable container-to-container DNS resolution and a JSON parser.

You can modify the default podman network to use the DNS resolver, but I prefer just creating a new one.

First, create the podman network:

# podman network create my-net

Podman stores network configurations in the CNI cache. If that is ever unavailable, it will look in /etc/containers/networks. Let's use jq to capture our config and put it there.

# podman network inspect my-net | jq .[] > ~/my-net.json

Move the file to /etc/containers/networks - piping the output directly results in an error from podman and an empty file.

When running a container, make sure you specify which network you'd like it to be attached to. If none is specified, it will be the default podman network.
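
As a quick test of the new network and its name resolution, something like the following should work (nginx and alpine here are just arbitrary example images):

# podman run -d --rm --name web --network my-net docker.io/library/nginx
# podman run --rm --network my-net docker.io/library/alpine ping -c 1 web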

MACVLAN Network

We set up MACVLAN on the host earlier, so this should be just as trivial as the bridge network. Keep in mind that containers on the bridge network are not addressable by their container names from the MACVLAN network. The containers on the MACVLAN network will inherit the host's /etc/resolv.conf.

Create a new podman network, replacing the parent interface and IP addresses with your own. The below is using our previous example from the installation stage.

# podman network create -d macvlan -o parent=enp9s0.50 --subnet 10.19.27.0/24 --gateway 10.19.27.1 macvlan50

Make it persistent:

# podman network inspect macvlan50 | jq .[] > ~/macvlan50.json

Copy the config to /etc/containers/networks.

4 Running Containers

4.1 General OpenRC init script recipe

Many folks that work with containers would recommend you use docker-compose to manage starting, stopping and general management of the container's lifecycle. They are correct, as this gives you the most control over all of the container's parameters and fine control of updates.

I, however, do not use docker-compose and instead use the run command in my init scripts. I realize I can start and stop container pods through init, but I do not need that fine of control over my containers and I would rather just run watchtower and deal with any potential consequences.

Personally, I have been running auto-updates on my containers for 7 years and have only been bit by it a couple of times. Both times were pretty inconsequential with these types of media services. I am not hosting mission-critical services after all. Your mileage may vary of course, and obviously this is not recommended for production use.

You can also run podman generate systemd, which produces a service file (referencing the container by its ID) that runs the container with the same parameters it was started with.
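
For completeness, that looks roughly like this; mycontainer is a placeholder for an existing container, --name puts the container name in the unit instead of the ID, and --new makes the unit create a fresh container on each start:

# podman generate systemd --new --name mycontainer > container-mycontainer.service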

The following is a general recipe I use for running containers with OpenRC. Supervise-daemon works nicely with podman and will restart the container if it fails. This is an example of cloudflared running a DNS tunnel:

/etc/init.d/cloudflare-dns

#!/sbin/openrc-run
name="Cloudflare DoH container"
supervisor="supervise-daemon"
command="/usr/bin/podman"
command_args="run \
        --rm \
        --name cloudflare-dns \
        --network=macvlan50 \
        --cap-add=cap_net_bind_service \
        --ip=10.19.27.14 \
        cloudflare/cloudflared \
        proxy-dns \
        --address 10.19.27.14"
depend() {
	need net
}

It's important we use --rm here so the container is removed every time it exits; otherwise a leftover container with the same name would prevent it from starting again.
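
Then enable and start it like any other OpenRC service:

# rc-update add cloudflare-dns default
# rc-service cloudflare-dns start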

4.2 Running GPU-accelerated containers

The arguments for running GPU-accelerated containers are a bit different with Podman than with Docker. The following flags will need to be added to the podman run command:

--device /dev/nvidiactl
--device /dev/nvidia-modeset
--device /dev/nvidia0
--device=nvidia.com/gpu=all

You can change --device=nvidia.com/gpu=all to point to a certain GPU. They are defined in the CDI configuration we generated earlier in /etc/cdi/nvidia.yaml.

The magic word that activates the container runtime is an environment variable set in the container. A lot of the GPU-aware containers you find in the wild already have this set. If not, add a flag for it in the podman run command.

-e NVIDIA_VISIBLE_DEVICES=all
-e NVIDIA_DRIVER_CAPABILITIES=video,utility,compute

This works with any container based on glibc, but NVIDIA provides their own images with the full CUDA SDK baked in on Docker Hub and the NGC repo.
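
Putting it all together, here's a hedged example using the Jellyfin image on the MACVLAN network from earlier; the IP address and volume paths are placeholders, so adjust them to your setup:

# podman run -d --rm \
        --name jellyfin \
        --network=macvlan50 \
        --ip=10.19.27.15 \
        --device /dev/nvidiactl \
        --device /dev/nvidia-modeset \
        --device /dev/nvidia0 \
        --device=nvidia.com/gpu=all \
        -e NVIDIA_VISIBLE_DEVICES=all \
        -e NVIDIA_DRIVER_CAPABILITIES=video,utility,compute \
        -v /srv/jellyfin/config:/config \
        -v /srv/media:/media \
        docker.io/jellyfin/jellyfin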

5 Miscellaneous

5.1 Additional Podman pain-points

5.1.1 Kernel modules and capabilities

The biggest difference for me as a user between Podman and Docker is Docker tends to give you a more "batteries included" experience. Some pain points for me were capabilities and kernel modules.

When running binhex/delugevpn under Podman for instance, I had to give it the additional capability of CAP_NET_RAW which it did not need under Docker. I also had to pass --device /dev/net/tun when running it, which again, Docker did not require. This behavior is probably a good thing on Podman's part (principle of least privilege), but was difficult to diagnose.

Docker also loaded relevant kernel modules for me while Podman does not. The same Deluge container needs the br_netfilter module, and Docker would silently load it in the background. Another tricky issue to diagnose, but probably a good thing like before.

Add it to /etc/modules-load.d:

# echo br_netfilter > /etc/modules-load.d/br_netfilter.conf
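
To load it right away without rebooting:

# modprobe br_netfilter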

5.1.2 Assigned IP addresses with dnsname-cni-plugin

Podman stores a list of its assigned IP addresses in /var/lib/cni/networks. When the containers are stopped, the entry for its IP address is supposed to be deleted. If for some reason the container is killed or not stopped properly, that entry may not be deleted. This is a problem for the macvlan network, where we are assigning the same IP every time. Just delete the entries if this prevents a container from starting.
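
For example, if a container pinned to 10.19.27.14 refuses to start because of a stale reservation, removing the entry looks like this (the per-IP file layout is how the host-local IPAM plugin stores leases; adjust the network name and address to yours):

# rm /var/lib/cni/networks/macvlan50/10.19.27.14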

In the same vein, I noticed the container short-names would stop resolving if the target container had restarted since the source had been running.

This behavior is fixed with Netavark.

5.1.3 Automatic container updates

Podman has built-in automatic updates, which is great, but it depends on systemd. They want you to start your container or pod, then run podman generate systemd, which makes a nifty little service file for you to run your container. You set a label on the containers you want to auto-update, then a systemd timer checks every so often for updates. If it finds any, it restarts the service it generated earlier.

Just use watchtower and let supervise-daemon restart the containers. The podman service does need to be running for this.
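
On Gentoo, app-containers/podman ships an OpenRC script for the API socket, so enabling that should cover it:

# rc-update add podman default
# rc-service podman start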

# podman run \
        --rm \
        -v /var/run/podman/podman.sock:/var/run/docker.sock \
        -v /etc/localtime:/etc/localtime:ro \
        --name watchtower \
        containrrr/watchtower \
        --no-restart

There is, however, a restart limit with supervise-daemon (I think it's 5?) where it will eventually mark the service as failed.

5.1.4 Leftover build cache

There were a few instances where leftover images from building a Dockerfile persisted even after running podman system prune -f -a --volumes, and podman itself could not see them. This prevented me from running podman system reset.

I had to install buildah and run buildah rm --all. Annoying.

5.1.5 ZFS on root

I did a fresh install with ZFS on root. OverlayFS is not compatible with ZFS, and interacting with podman on even a basic level was throwing all sorts of interesting errors. I highly recommend changing the filesystem backend before doing anything at all with podman. There isn't a USE flag for zfs on podman in Gentoo; it just works out of the box, I guess.

Even after changing this, I could not run podman system reset -f (which the documentation suggests you do) without errors.

Create the following config:

/etc/containers/storage.conf

[storage]
driver = "zfs"
runroot = "/run/containers/storage"
graphroot = "/var/lib/containers/storage"

[storage.options.zfs]
fsname = "your-pool/subvolume"
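
If the dataset named in fsname doesn't exist yet, create it first (the name here is just the placeholder from the config above):

# zfs create your-pool/subvolume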

5.1.6 NVIDIA CDI Spec

There have been a few instances where the generated CDI specification refers to a device that doesn't currently exist, and it prevents the container from starting. nvidia-persistenced is supposed to prevent this by loading all the drivers needed at the time we generate that file. For some reason there are always one or two /dev/nvidia- files that existed when we generated the CDI spec but not at runtime. So far they haven't been of any consequence to me, but inspect the error messages to ensure nothing important is omitted for your use case.

To allow the container to start without the devices, regenerate the CDI specification.

6 Revisions

08/26/2023 - 1.2 - Added information: Netavark, CDI, ZFS

08/02/2023 - 1.1 - Formatting, added additional info about kernel modules

08/02/2023 - 1.0 - Initial
