Kubernetes with swap on Fedora - Raw notes
- Upgraded to Kubernetes 1.19.4, tried to run it on the machines I recently upgraded to Fedora
33. I ran into an ongoing issue where /dev/zram0 swap keeps getting re-enabled; not sure
what causes it yet.
+ kubeadm really does not want to init a system with swap on. I know this has
been a topic of recent discussion.
[init] Using Kubernetes version: v1.19.4
[preflight] Running pre-flight checks
error execution phase preflight: [preflight] Some fatal errors occurred:
[ERROR Swap]: running with swap on is not supported. Please disable swap
+ I tried to brute force it, but it seems there is some subtlety happening
here. Basically, /dev/zram0 keeps reappearing even after I swapoff it to
death. So I tried with "--ignore-preflight-errors=Swap", but I vaguely
remember that did not work too well in the past.
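Roughly what the brute-force attempt looked like (a sketch; any other kubeadm init flags omitted):
swapoff -a    # /dev/zram0 comes right back anyway
kubeadm init --ignore-preflight-errors=Swap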
+ After that, I ran into a problem where the kubelet now wants cgroup v2:
failed to get the kubelet's cgroup: cpu and memory cgroup hierarchy not unified.
cpu: /system.slice, memory: /system.slice/kubelet.service.
Kubelet system container metrics may be missing.
This does not look fatal, but may be the root cause of my crash.
+ After fixing that, the kubelet still fails with:
F1124 16:31:10.962395 5796 server.go:265] failed to run Kubelet: running
with swap on is not supported, please disable swap! or set --fail-swap-on
flag to false. /proc/swaps contained: ...
Of course, /dev/zram0 is back up.
The friend that blocked me earlier is a service called
swap-create@zram0.service. I could disable that, but I want to see if I can
make the thing work with swap enabled.
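For the record, disabling it would presumably be something along these lines, but that defeats the point of keeping swap on:
systemctl mask swap-create@zram0.service
swapoff /dev/zram0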
+ Editing /etc/systemd/system/multi-user.target.wants/kubelet.service
to add under [Service]:
Environment="KUBELET_EXTRA_ARGS=--fail-swap-on=false"
+ After that, kubelet complains because it tries to talk to docker and not
crio. Something was broken during the F33 upgrade. Edited the same service
file and added:
Environment="KUBELET_EXTRA_ARGS=--fail-swap-on=false --container-runtime=remote --container-runtime-endpoint=unix:///var/run/crio/crio.sock"
Also added under [Unit]:
Wants=crio.service
as suggested by https://github.com/cri-o/cri-o/blob/master/tutorials/kubernetes.md,
which also suggests adding docker.socket, but I don't have that one.
I don't recall doing that earlier, but maybe I had.
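A quick sanity check that something actually answers on that socket, using crictl from cri-tools (if it's installed; just a sketch):
crictl --runtime-endpoint unix:///var/run/crio/crio.sock info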
+ After that, I ran into another message that only shows up with the full command
line:
cri-o configured with systemd cgroup manager, but did not receive slice as
parent: /kubepods/burstable/pode95d6f5518631c3f14475cf585810
and then plenty of connection failures to port 6443:
k8s.io/client-go/informers/factory.go:134: Failed to watch
*v1beta1.RuntimeClass: failed to list *v1beta1.RuntimeClass: Get
"https://192.168.77.55:6443/apis/node.k8s.io/v1beta1/runtimeclasses?limit=500&resourceVersion=0":
dial tcp 192.168.77.55:6443: connect: connection refused
which led to:
node "shuttle" not found
(presumably because of the above, 192.168.77.55 is shuttle)
The port 6443 is not open on my system according to nmap. It seems to be the
k8s API server.
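The check was roughly:
nmap -p 6443 192.168.77.55
or locally something like ss -tlnp | grep 6443 to see whether anything is listening there at all.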
+ So it turns out that setting Environment in kubelet.service is ignored,
because it ends up being overwritten by the contents of
/etc/sysconfig/kubelet. Moving the extra args there makes kubelet stable. Now
on to the crio "cgroup...did not receive slice as parent" message.
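For reference, what ended up in /etc/sysconfig/kubelet is roughly this single line (same flags as above):
KUBELET_EXTRA_ARGS=--fail-swap-on=false --container-runtime=remote --container-runtime-endpoint=unix:///var/run/crio/crio.sock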
+ The new way to do things seems to be to go through the configuration file
specified by --config. The problem is that what is passed here is
/var/lib/kubelet/config.yaml, which is written by someone else.
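If that file were mine to edit, the relevant KubeletConfiguration fields would look roughly like this (the field names are real; whether edits there survive kubeadm is another question):
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cgroupDriver: systemd
failSwapOn: false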
+ The next blocking error seems to be
E1125 17:39:28.760148 28021 remote_runtime.go:113] RunPodSandbox from
runtime service failed: rpc error: code = Unknown desc = error converting
cgroup memory value from string to int "max": strconv.ParseInt: parsing
"max": invalid syntax
Looking at my crio, it has a suspicious version:
2:1.17.4-1.module_f32+8729+8e6b62f2
Removing and reinstalling fails to find the package. Where did I get that
CRI-O package from? Apparently, it's part of modular Fedora, and there are
CRI-O module streams for multiple versions. Latest is 1.19.
dnf module enable cri-o:1.19
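Enabling the stream alone presumably does not replace the already-installed 1.17 package, so something like this on top of it:
dnf distro-sync cri-o    # or remove + reinstall, now that the 1.19 stream is enabled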
+ The following message is:
error execution phase preflight: docker is required for container runtime:
exec: "docker": executable file not found in $PATH
What? I just installed CRI-O!
Ah, need `systemctl start crio` :-( Can't really get used to package installs not
starting services.
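(For next time: systemctl enable --now crio does both the enable and the start.)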
+ Now it's creating some containers at last. But now I have:
Nov 25 17:55:39 shuttle crio[30802]: time="2020-11-25
17:55:39.338601792+01:00" level=error msg="Container creation error:
time=\"2020-11-25T17:55:39+01:00\" level=error msg=\"this version of runc
doesn't work on cgroups v2\"\n"
This machine has both crun and runc. Grmnbl, historical crap.
dnf remove runc
+ Next run gives me this:
Nov 25 17:59:04 shuttle crio[33090]: time="2020-11-25
17:59:04.208274806+01:00" level=fatal msg="Validating runtime config: runtime
validation: \"runc\" not found in $PATH: exec: \"runc\": executable file not
found in $PATH"
I love it when they complain about a runtime configuration file but don't
tell you where it is. I'm lucky to know that. OK, that points to runc.
Trying an experiment: removing the rpm, reinstalling. It reinstalls runc, and
the configuration file points to it. Uninstalling / reinstalling crun to see
if it patches the configuration files correctly. Nope.
Adding the following in /etc/crio/crio.conf
[crio.runtime.runtimes.crun]
runtime_path = "/usr/bin/crun"
runtime_type = "oci"
runtime_root = "/run/crun"
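After editing crio.conf, something like this to pick it up and check that the config validates:
systemctl restart crio
journalctl -u crio -e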
+ FINALLY, it starts. On the master node. Now need to repeat the operations on
the worker nodes.
+ After that, had to add the same "--ignore-preflight-errors=Swap" to the join
command, because, swap.
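The join ends up looking roughly like this (token and hash are placeholders, obviously):
kubeadm join 192.168.77.55:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash> --ignore-preflight-errors=Swap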
+ A little bit of extra manual twiddling on all the worker nodes, editing their
/etc/crio/crio.conf to configure it correctly, and then finally I have a VR
system that runs as a Kubernetes worker node, with Jenkins running in a
container. Yay!