When running containers in parallel with the Kata runtime, we get many "failed to create shim task"
errors.
Environment: Ubuntu 20.04 VM with 16 CPUs and 64 GB of memory, on a Scaleway GP1-M instance (https://www.scaleway.com/en/virtual-instances/general-purpose/).
# uname -a
Linux kata-test 5.4.0-122-generic #138-Ubuntu SMP Wed Jun 22 15:00:31 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
# cat /etc/*-release | grep DISTRIB
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.4 LTS"
# cat /proc/cpuinfo | grep processor | wc -l
16
# cat /proc/meminfo | grep MemTotal
MemTotal: 65857924 kB
Prerequisites: containerd, runc, cni, nerdctl, kata.
CONTAINERD_VERSION=1.6.8
RUNC_VERSION=1.1.4
CNI_VERSION=1.1.1
NERDCTL_VERSION=0.23.0
KATA_VERSION=2.5.1
# containerd
wget https://github.com/containerd/containerd/releases/download/v$CONTAINERD_VERSION/containerd-$CONTAINERD_VERSION-linux-amd64.tar.gz
tar Cxzvf /usr/local containerd-$CONTAINERD_VERSION-linux-amd64.tar.gz
wget -P /usr/local/lib/systemd/system/ https://raw.githubusercontent.com/containerd/containerd/v$CONTAINERD_VERSION/containerd.service
systemctl daemon-reload
systemctl enable --now containerd
# runc
wget https://github.com/opencontainers/runc/releases/download/v$RUNC_VERSION/runc.amd64
chmod u+x runc.amd64
mv runc.amd64 /usr/local/sbin/runc
# cni
wget https://github.com/containernetworking/plugins/releases/download/v$CNI_VERSION/cni-plugins-linux-amd64-v$CNI_VERSION.tgz
mkdir -p /opt/cni/bin
tar Cxzvf /opt/cni/bin cni-plugins-linux-amd64-v$CNI_VERSION.tgz
# nerdctl
wget https://github.com/containerd/nerdctl/releases/download/v$NERDCTL_VERSION/nerdctl-$NERDCTL_VERSION-linux-amd64.tar.gz
tar Cxzvvf /usr/local/bin nerdctl-$NERDCTL_VERSION-linux-amd64.tar.gz
# kata
wget https://github.com/kata-containers/kata-containers/releases/download/$KATA_VERSION/kata-static-$KATA_VERSION-x86_64.tar.xz
xzcat kata-static-$KATA_VERSION-x86_64.tar.xz | sudo tar -xvf - -C /
rm -f /usr/local/bin/kata-runtime
rm -f /usr/local/bin/containerd-shim-kata-v2
ln -s /opt/kata/bin/kata-runtime /usr/local/bin
ln -s /opt/kata/bin/containerd-shim-kata-v2 /usr/local/bin
mkdir -p /etc/containerd
cat <<EOF > /etc/containerd/config.toml
[plugins]
[plugins."io.containerd.grpc.v1.cri"]
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.kata]
runtime_type = "io.containerd.kata.v2"
EOF
systemctl restart containerd
If we run containers:
- without the Kata runtime (one after the other or in parallel), all containers run correctly
- with the Kata runtime one after the other, all containers run correctly
- with the Kata runtime in parallel, we hit errors (details below)
nerdctl pull ubuntu
for i in `seq 5`
do
nerdctl run --rm --runtime io.containerd.run.kata.v2 ubuntu uname -r &
done
Sometimes it works, but most of the time we get errors.
Output:
5.15.63
5.15.63
FATA[0007] failed to create shim task: open /sys/fs/cgroup/systemd/vc/tasks: no such file or directory: not found
FATA[0007] failed to create shim task: open /sys/fs/cgroup/systemd/vc/tasks: no such file or directory: not found
FATA[0007] failed to create shim task: open /sys/fs/cgroup/systemd/vc/tasks: no such file or directory: not found
nerdctl pull ubuntu
for i in `seq 20`
do
nerdctl run --rm --runtime io.containerd.run.kata.v2 ubuntu uname -r &
done
We always get many errors (95% of the runs fail), including a new "no such file or directory"
error on a different path.
Extract of output:
5.15.63
FATA[0030] failed to create shim task: open /sys/fs/cgroup/systemd/vc/tasks: no such file or directory: not found
FATA[0031] failed to create shim task: open /sys/fs/cgroup/cpuset/vc/cpuset.mems: no such file or directory: not found
During that test, all CPUs reach 100% usage.
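To put a number on the failure rate instead of reading it off the terminal, the loop can be wrapped in a small helper. This is only a sketch: `run_parallel` is a hypothetical name, it discards the output of each run, and it assumes that nerdctl exits with a non-zero status whenever it prints a FATA line.

```shell
# Run a command COUNT times in parallel and print how many runs
# succeeded and how many failed (based on exit status only).
# Usage: run_parallel COUNT COMMAND [ARGS...]
# The body runs in a subshell so the helper variables do not leak.
run_parallel() (
    count=$1
    shift
    pids=""
    ok=0
    failed=0
    i=0
    # Start all runs in the background and remember their PIDs.
    while [ "$i" -lt "$count" ]; do
        "$@" >/dev/null 2>&1 &
        pids="$pids $!"
        i=$((i + 1))
    done
    # Collect the exit status of each run.
    for pid in $pids; do
        if wait "$pid"; then
            ok=$((ok + 1))
        else
            failed=$((failed + 1))
        fi
    done
    echo "ok=$ok failed=$failed"
)
```

For example, `run_parallel 20 nerdctl run --rm --runtime io.containerd.run.kata.v2 ubuntu uname -r` would print a single summary line for the 20-container test.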
nerdctl pull ubuntu
for i in `seq 30`
do
nerdctl run --rm --runtime io.containerd.run.kata.v2 ubuntu uname -r &
done
We always get many errors (95% of the runs fail), including a new timeout
error.
Extract of output:
5.15.63
FATA[0030] failed to create shim task: open /sys/fs/cgroup/systemd/vc/tasks: no such file or directory: not found
FATA[0036] failed to create shim task: Failed to Check if grpc server is working: rpc error: code = DeadlineExceeded desc = timed out connecting to vsock 1541455645:1024: unknown
Note that many "failed to create shim task: Failed to Check if grpc server is working: rpc error: code = DeadlineExceeded desc = timed out connecting to vsock 1541455645:1024: unknown" errors
appear at the end (around 40 seconds after running the command), but not only at the end.
During that test, all CPUs reach 100% usage.
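Since several different messages show up in these runs, it may help to tally them by kind. The filter below is a rough sketch, not part of any tool: `summarize_errors` is a hypothetical name, and it assumes that the number before the colon in the vsock address differs for each VM, so it is replaced with a placeholder to let those lines group together.

```shell
# Read nerdctl output on stdin and count each distinct
# "failed to create shim task" message, most frequent first.
summarize_errors() {
    grep 'failed to create shim task' \
        | sed -e 's/^FATA\[[0-9]*] //' \
              -e 's/vsock [0-9]*:/vsock CID:/' \
        | sort | uniq -c | sort -rn
}
```

nerdctl is expected to print the FATA lines on stderr, so the output of the loop has to be merged before piping, for example `{ for i in $(seq 30); do nerdctl run --rm --runtime io.containerd.run.kata.v2 ubuntu uname -r & done; wait; } 2>&1 | summarize_errors`.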