
@mythi
Last active April 17, 2024 05:05
SGX EPC cgroups for Kubernetes
1. Prepare the kernel
git clone --depth 1 -b sgx_cg_upstream_v12 https://github.com/haitaohuang/linux.git linux-epc-cgroups
Added config:
CONFIG_CGROUP_SGX_EPC=y
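A minimal build flow for that kernel might look like this (a sketch, assuming a standard x86_64 build environment where the running kernel's config is available under /boot; adjust to your own config workflow):

$ cd linux-epc-cgroups
$ cp /boot/config-$(uname -r) .config      # start from the running kernel's config
$ scripts/config --enable CONFIG_X86_SGX --enable CONFIG_CGROUP_MISC --enable CONFIG_CGROUP_SGX_EPC
$ make olddefconfig
$ make -j$(nproc) bzImage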
2. Boot the VM and check SGX cgroups
host:$ qemu-system-x86_64 \
...
-object memory-backend-epc,id=mem1,size=64M,prealloc=on \
-M sgx-epc.0.memdev=mem1 \
-drive file=jammy.raw,if=virtio,aio=threads,format=raw,index=0,media=disk \
-kernel ./arch/x86_64/boot/bzImage \
...
guest:$ grep sgx_epc /sys/fs/cgroup/misc.capacity
sgx_epc 67108864
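For reference, the reported capacity is the EPC size given to the memory-backend-epc object above, in bytes:

guest:$ echo $((64 * 1024 * 1024))
67108864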
3. Set up a (single-node) K8s cluster with containerd 1.7 and the SGX EPC NRI plugin on Ubuntu 22.04
$ dpkg -l |grep containerd
ii containerd 1.7.2-0ubuntu1~22.04.1 amd64 daemon to control runC
# NB: config.toml: enable NRI (disable = false) and set SystemdCgroup = true
$ grep -A7 nri\.v1 /etc/containerd/config.toml
[plugins."io.containerd.nri.v1.nri"]
disable = false
disable_connections = false
plugin_config_path = "/etc/nri/conf.d"
plugin_path = "/opt/nri/plugins"
plugin_registration_timeout = "5s"
plugin_request_timeout = "2s"
socket_path = "/var/run/nri/nri.sock"
$ sudo ls /var/run/nri/
nri.sock
$ git clone -b PR-2023-050 https://github.com/mythi/intel-device-plugins-for-kubernetes.git
$ cd intel-device-plugins-for-kubernetes
$ make intel-deviceplugin-operator
$ docker save intel/intel-deviceplugin-operator:devel > op.tar
$ sudo ctr -n k8s.io i import op.tar
$ kubectl apply -k deployments/operator/default/
$ kubectl apply -f deployments/operator/samples/deviceplugin_v1_sgxdeviceplugin.yaml
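A couple of quick checks to confirm the deployment came up (pod and namespace names as seen later in this thread; exact names may differ in your cluster):

$ kubectl get pods -n inteldeviceplugins-system
$ kubectl describe node | grep sgx.intel.com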
4. Run
Use https://raw.githubusercontent.com/containers/nri-plugins/main/scripts/testing/kube-cgroups and run
watch -n 1 "./kube-cgroups -n 'sgxplugin-*' -f '(misc|memory).(max|current)' -p 'sgx-epc-*'"
(adjust the targeted namespace (-n) and pod name filter (-p) as needed)
Run a pod requesting sgx.intel.com/epc: "65536"
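A minimal pod spec for this could look like the following (a sketch; the pod name and image are placeholders, only the sgx.intel.com/epc request/limit matters here):

$ kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: epc-test              # placeholder name
spec:
  restartPolicy: Never
  containers:
  - name: epc-test
    image: busybox            # placeholder image
    command: ["sleep", "3600"]
    resources:
      limits:
        sgx.intel.com/epc: "65536"
      requests:
        sgx.intel.com/epc: "65536"
EOF

The admission webhook adds the sgx.intel.com/enclave resource, and the NRI plugin is what turns the EPC request into the container's misc.max (sgx_epc) limit, as seen later in this thread.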
5. e2e test framework
$ git clone -b PR-2023-050 https://github.com/mythi/intel-device-plugins-for-kubernetes.git
$ cd intel-device-plugins-for-kubernetes
$ make stress-ng-gramine intel-sgx-admissionwebhook
$ docker save intel/intel-sgx-admissionwebhook:devel > wh.tar
$ sudo ctr -n k8s.io i import wh.tar
$ docker save intel/stress-ng-gramine:devel > gr.tar
$ sudo ctr -n k8s.io i import gr.tar
$ go test -v ./test/e2e/... -ginkgo.v -ginkgo.focus "Device:sgx.*App:sgx-epc-cgroup"
NB: The e2e test framework expects cert-manager to be deployed in the cluster (see below)
NB: The e2e test framework deletes all namespaces except kube-system and cert-manager before running the tests, so do not run it in a cluster with anything important deployed!
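cert-manager can be installed from its static manifest, roughly like this (the version shown is only an example; check the cert-manager installation docs for the current one):

$ kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.2/cert-manager.yaml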

CyanDevs commented Nov 3, 2023

@mythi I'm getting the error qemu-system-x86_64: invalid object type: memory-backend-epc. I found intel/qemu-sgx#10, and my QEMU does not seem to have an -M option, which makes me wonder whether I'm supposed to be using a generic QEMU install or QEMU-SGX. My QEMU version: QEMU emulator version 6.2.0 (Debian 1:6.2+dfsg-2ubuntu6.15)

Command ran: $ qemu-system-x86_64 -object memory-backend-epc,id=mem1,size=64M,prealloc=on -M sgx-epc.0.memdev=mem1 -drive file=jammy.raw,if=virtio,aio=threads,format=raw,index=0,media=disk -kernel linux-epc-cgroups/arch/x86_64/boot/bzImage ubuntu-22.04.3-desktop-amd64.iso

Thanks


mythi commented Nov 3, 2023

@CyanDevs I have that exact same version. I believe the error is that you don't have sufficient permissions on /dev/sgx_vepc. It will likely work if you run QEMU with sudo, if that's possible for you.
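One way to check and work around this (a sketch; a chown on the device node will not persist across reboots):

host:$ ls -l /dev/sgx_vepc
host:$ sudo chown $USER /dev/sgx_vepc    # temporary workaround, or just run qemu-system-x86_64 with sudo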


CyanDevs commented Nov 3, 2023

@mythi Does this feature depend on /dev/sgx_vepc being exposed and accessible on the host? As the device is specific to KVM, is it fair to say that this is also a KVM-specific feature? Or is the device only necessary for testing the kernel with QEMU?

I am asking because I do not have /dev/sgx_vepc exposed on my VM, and I cannot switch to KVM.


mythi commented Nov 3, 2023

@CyanDevs I think you need to skip that qemu part on your side and figure out how to make the EPC-cgroups-enabled guest kernel available in your environment. I thought you had Ubuntu "baremetal" just like I do.


CyanDevs commented Nov 4, 2023

After some tinkering I was able to install and load the kernel directly onto the same VM I built the kernel with, and confirm that the cgroups feature is enabled. This was on an Azure VM -- unfortunately no access to a baremetal Ubuntu machine.

$ uname -a
Linux cgroups 6.6.0-rc7+ #1 SMP Sat Nov  4 02:55:13 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
$ grep sgx_epc /sys/fs/cgroup/misc.capacity
sgx_epc 117440512

Thanks for your clarification on this; otherwise I might have continued spinning my wheels trying to figure out the vEPC device.


mythi commented Nov 6, 2023

@CyanDevs good to hear! I hope it's not too difficult to get new kernel versions tested. I discovered an issue with sgx_cg_upstream_v6; the suggestion is to go with sgx_cg_upstream_v6_fixup instead.
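With a shallow clone like the one in step 1, the fixup branch can be picked up along these lines (a sketch):

$ cd linux-epc-cgroups
$ git fetch --depth 1 origin sgx_cg_upstream_v6_fixup
$ git checkout FETCH_HEAD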


CyanDevs commented Nov 6, 2023

@mythi Not difficult. I'll write something up to help others run the same setup as well. I've picked up the new branch without any issues.

When creating the cluster with kustomize I'm getting the following error. Applying again, as well as applying each subfolder in deployments/operator, did not resolve this.

$ kubectl apply -k deployments/operator
namespace/inteldeviceplugins-system unchanged
namespace/system unchanged
customresourcedefinition.apiextensions.k8s.io/acceleratorfunctions.fpga.intel.com unchanged
customresourcedefinition.apiextensions.k8s.io/dlbdeviceplugins.deviceplugin.intel.com unchanged
customresourcedefinition.apiextensions.k8s.io/dsadeviceplugins.deviceplugin.intel.com unchanged
customresourcedefinition.apiextensions.k8s.io/fpgadeviceplugins.deviceplugin.intel.com unchanged
customresourcedefinition.apiextensions.k8s.io/fpgaregions.fpga.intel.com unchanged
customresourcedefinition.apiextensions.k8s.io/gpudeviceplugins.deviceplugin.intel.com unchanged
customresourcedefinition.apiextensions.k8s.io/iaadeviceplugins.deviceplugin.intel.com unchanged
customresourcedefinition.apiextensions.k8s.io/qatdeviceplugins.deviceplugin.intel.com unchanged
customresourcedefinition.apiextensions.k8s.io/sgxdeviceplugins.deviceplugin.intel.com unchanged
role.rbac.authorization.k8s.io/inteldeviceplugins-leader-election-role unchanged
clusterrole.rbac.authorization.k8s.io/inteldeviceplugins-gpu-manager-role unchanged
clusterrole.rbac.authorization.k8s.io/inteldeviceplugins-manager-role unchanged
clusterrole.rbac.authorization.k8s.io/inteldeviceplugins-metrics-reader unchanged
clusterrole.rbac.authorization.k8s.io/inteldeviceplugins-proxy-role unchanged
rolebinding.rbac.authorization.k8s.io/inteldeviceplugins-leader-election-rolebinding unchanged
clusterrolebinding.rbac.authorization.k8s.io/inteldeviceplugins-manager-rolebinding unchanged
clusterrolebinding.rbac.authorization.k8s.io/inteldeviceplugins-proxy-rolebinding unchanged
service/inteldeviceplugins-controller-manager-metrics-service unchanged
service/inteldeviceplugins-webhook-service unchanged
service/webhook-service unchanged
deployment.apps/inteldeviceplugins-controller-manager unchanged
deployment.apps/controller-manager unchanged
mutatingwebhookconfiguration.admissionregistration.k8s.io/inteldeviceplugins-mutating-webhook-configuration configured
mutatingwebhookconfiguration.admissionregistration.k8s.io/mutating-webhook-configuration configured
validatingwebhookconfiguration.admissionregistration.k8s.io/inteldeviceplugins-validating-webhook-configuration configured
validatingwebhookconfiguration.admissionregistration.k8s.io/validating-webhook-configuration configured
resource mapping not found for name: "inteldeviceplugins-serving-cert" namespace: "inteldeviceplugins-system" from "deployments/operator": no matches for kind "Certificate" in version "cert-manager.io/v1"
ensure CRDs are installed first
resource mapping not found for name: "inteldeviceplugins-selfsigned-issuer" namespace: "inteldeviceplugins-system" from "deployments/operator": no matches for kind "Issuer" in version "cert-manager.io/v1"
ensure CRDs are installed first

Have you encountered this before?

My kubectl and minikube versions

Client Version: v1.28.3
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.27.4
minikube version: v1.31.2
commit: fd7ecd9c4599bef9f04c0986c4a0187f98a4396e

Edit: it seems I just had to install cert-manager as described here: https://cert-manager.io/docs/installation/

Edit 2: Not sure if installing cert-manager separately was the right move. Now running into the following

$ kubectl apply -f samples/deviceplugin_v1_sgxdeviceplugin.yaml
Error from server (InternalError): error when creating "samples/deviceplugin_v1_sgxdeviceplugin.yaml": Internal error occurred: failed calling webhook "msgxdeviceplugin.kb.io": failed to call webhook: Post "https://inteldeviceplugins-webhook-service.inteldeviceplugins-system.svc:443/mutate-deviceplugin-intel-com-v1-sgxdeviceplugin?timeout=10s": tls: failed to verify certificate: x509: certificate signed by unknown authority


mythi commented Nov 7, 2023

Edit 2: Not sure if installing cert-manager separately was the right move. Now running into the following

Apologies, I had forgotten the cert-manager part. The other issue is that to deploy the operator, you should use kubectl apply -k deployments/operator/default/ (i.e., add default).


CyanDevs commented Nov 7, 2023

Is Node Feature Discovery, or anything else, needed as well? My minikube node is not getting the EPC memory resource (nor the SGX device resources) even though the host VM has the in-kernel SGX drivers available as well as EPC memory.

Capacity:
  cpu:                      8
  ephemeral-storage:        259966896Ki
  hugepages-1Gi:            0
  hugepages-2Mi:            0
  memory:                   32826292Ki
  pods:                     110
  sgx.intel.com/enclave:    0
  sgx.intel.com/provision:  0
Allocatable:
  cpu:                      8
  ephemeral-storage:        259966896Ki
  hugepages-1Gi:            0
  hugepages-2Mi:            0
  memory:                   32826292Ki
  pods:                     110
  sgx.intel.com/enclave:    0
  sgx.intel.com/provision:  0
$ grep sgx_epc /sys/fs/cgroup/misc.capacity
sgx_epc 176160768
$ ls /dev/sgx*
/dev/sgx_enclave  /dev/sgx_provision


mythi commented Nov 7, 2023

Correct, my instructions assume this flow is followed (which I forgot to add): https://github.com/intel/intel-device-plugins-for-kubernetes/blob/main/cmd/operator/README.md

The default SgxDevicePlugin sample CR uses a nodeSelector label that it gets from NFD. Once NFD is deployed, the device plugin should come up (and the NRI plugin with it).
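For reference, NFD itself can be deployed with its default kustomization (version pinned here only as an example, matching the v0.14.1 seen in the logs below; the Intel NodeFeatureRules come with the flow in the README above):

$ kubectl apply -k "https://github.com/kubernetes-sigs/node-feature-discovery/deployment/overlays/default?ref=v0.14.1"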


CyanDevs commented Nov 8, 2023

Unfortunately, after installing NFD, the label intel.feature.node.kubernetes.io/sgx=true isn't being applied to the minikube node. If I manually label it, then I am able to get the SGX device resources registered -- but still nothing for EPC memory capacity. Any idea why the label isn't being applied and EPC memory isn't being registered? And how can I confirm the NRI plugin is running correctly?

Capacity:
  cpu:                      8
  ephemeral-storage:        259966896Ki
  hugepages-1Gi:            0
  hugepages-2Mi:            0
  memory:                   32826292Ki
  pods:                     110
  sgx.intel.com/enclave:    110
  sgx.intel.com/provision:  110
Allocatable:
  cpu:                      8
  ephemeral-storage:        259966896Ki
  hugepages-1Gi:            0
  hugepages-2Mi:            0
  memory:                   32826292Ki
  pods:                     110
  sgx.intel.com/enclave:    110
  sgx.intel.com/provision:  110

The nodes are present and running

$ kubectl get nodes
NAME           STATUS   ROLES           AGE   VERSION
minikube       Ready    control-plane   49m   v1.27.4
minikube-m02   Ready    <none>          46m   v1.27.4
cyan@cgroupsdev2204-5:~/intel-device-plugins-for-kubernetes$ kubectl get pods -n node-feature-discovery
NAME                          READY   STATUS    RESTARTS   AGE
nfd-master-7f6b649c74-6zcz6   1/1     Running   0          45m
nfd-worker-6gn8r              1/1     Running   0          45m
nfd-worker-bmjwb              1/1     Running   0          45m

The following logs appear in the NFD master pod:

$ kubectl logs nfd-master-7f6b649c74-6zcz6 -n node-feature-discovery
I1107 19:50:56.145734       1 nfd-master.go:213] "Node Feature Discovery Master" version="v0.14.1" nodeName="minikube" namespace="node-feature-discovery"
I1107 19:50:56.145802       1 nfd-master.go:1214] "configuration file parsed" path="/etc/kubernetes/node-feature-discovery/nfd-master.conf"
I1107 19:50:56.146006       1 nfd-master.go:1274] "configuration successfully updated" configuration=<
        DenyLabelNs: {}
        EnableTaints: false
        ExtraLabelNs: {}
        Klog: {}
        LabelWhiteList: {}
        LeaderElection:
          LeaseDuration:
            Duration: 15000000000
          RenewDeadline:
            Duration: 10000000000
          RetryPeriod:
            Duration: 2000000000
        NfdApiParallelism: 10
        NoPublish: false
        ResourceLabels: {}
        ResyncPeriod:
          Duration: 3600000000000
 >
I1107 19:50:56.146018       1 nfd-master.go:1338] "starting the nfd api controller"
I1107 19:50:56.146115       1 node-updater-pool.go:79] "starting the NFD master node updater pool" parallelism=10
I1107 19:50:56.158909       1 metrics.go:115] "metrics server starting" port=8081
I1107 19:50:56.159045       1 component.go:36] [core][Server #1] Server created
I1107 19:50:56.159060       1 nfd-master.go:347] "gRPC server serving" port=8080
I1107 19:50:56.159109       1 component.go:36] [core][Server #1 ListenSocket #2] ListenSocket created
I1107 19:50:57.159004       1 nfd-master.go:694] "will process all nodes in the cluster"
E1107 19:50:58.354440       1 nfd-master.go:1001] "failed to process rule" err="feature \"kernel.config\" not available" ruleName="intel.sgx" nodefeaturerule="intel-dp-devices" nodeName="minikube"
E1107 19:50:58.354452       1 nfd-master.go:1001] "failed to process rule" err="feature \"kernel.config\" not available" ruleName="intel.sgx" nodefeaturerule="intel-dp-devices" nodeName="minikube-m02"
I1107 19:50:58.369031       1 nfd-master.go:1086] "node updated" nodeName="minikube-m02"
I1107 19:50:58.369091       1 nfd-master.go:1086] "node updated" nodeName="minikube"
E1107 20:15:54.914367       1 nfd-master.go:1001] "failed to process rule" err="feature \"kernel.config\" not available" ruleName="intel.sgx" nodefeaturerule="intel-dp-devices" nodeName="minikube-m02"
E1107 20:15:54.914378       1 nfd-master.go:1001] "failed to process rule" err="feature \"kernel.config\" not available" ruleName="intel.sgx" nodefeaturerule="intel-dp-devices" nodeName="minikube"

The worker pod log says the kernel configs aren't present in those locations, but I'm not sure if this is related to the node not being labelled.

$ kubectl logs nfd-worker-bmjwb -n node-feature-discovery
I1107 19:50:54.286791       1 main.go:66] "-server is deprecated, will be removed in a future release along with the deprecated gRPC API"
I1107 19:50:54.287209       1 nfd-worker.go:219] "Node Feature Discovery Worker" version="v0.14.1" nodeName="minikube-m02" namespace="node-feature-discovery"
I1107 19:50:54.287394       1 nfd-worker.go:520] "configuration file parsed" path="/etc/kubernetes/node-feature-discovery/nfd-worker.conf"
I1107 19:50:54.287708       1 nfd-worker.go:552] "configuration successfully updated" configuration={"Core":{"Klog":{},"LabelWhiteList":{},"NoPublish":false,"FeatureSources":["all"],"Sources":null,"LabelSources":["local"],"SleepInterval":{"Duration":60000000000}},"Sources":{"cpu":{"cpuid":{"attributeBlacklist":["BMI1","BMI2","CLMUL","CMOV","CX16","ERMS","F16C","HTT","LZCNT","MMX","MMXEXT","NX","POPCNT","RDRAND","RDSEED","RDTSCP","SGX","SGXLC","SSE","SSE2","SSE3","SSE4","SSE42","SSSE3","TDX_GUEST"]}},"custom":[],"fake":{"labels":{"fakefeature1":"true","fakefeature2":"true","fakefeature3":"true"},"flagFeatures":["flag_1","flag_2","flag_3"],"attributeFeatures":{"attr_1":"true","attr_2":"false","attr_3":"10"},"instanceFeatures":[{"attr_1":"true","attr_2":"false","attr_3":"10","attr_4":"foobar","name":"instance_1"},{"attr_1":"true","attr_2":"true","attr_3":"100","name":"instance_2"},{"name":"instance_3"}]},"kernel":{"KconfigFile":"","configOpts":["NO_HZ","NO_HZ_IDLE","NO_HZ_FULL","PREEMPT"]},"local":{},"pci":{"deviceClassWhitelist":["03","0b40","12"],"deviceLabelFields":["class","vendor"]},"usb":{"deviceClassWhitelist":["0e","ef","fe","ff"],"deviceLabelFields":["class","vendor","device"]}}}
I1107 19:50:54.287914       1 metrics.go:70] "metrics server starting" port=8081
E1107 19:50:54.288491       1 kernel.go:134] "failed to read kconfig" err="failed to read kernel config from [ /proc/config.gz /host-usr/src/linux-6.6.0-rc7+/.config /host-usr/src/linux/.config /host-usr/lib/modules/6.6.0-rc7+/config /host-usr/lib/ostree-boot/config-6.6.0-rc7+ /host-usr/lib/kernel/config-6.6.0-rc7+ /host-usr/src/linux-headers-6.6.0-rc7+/.config /lib/modules/6.6.0-rc7+/build/.config /host-boot/config-6.6.0-rc7+]"
I1107 19:50:54.289570       1 nfd-worker.go:562] "starting feature discovery..."
I1107 19:50:54.289581       1 nfd-worker.go:577] "feature discovery completed"
I1107 19:50:55.058647       1 nfd-worker.go:698] "creating NodeFeature object" nodefeature=""

Checking current sgx_epc memory in the minikube node:

docker@minikube:/sys/fs/cgroup$ cat misc.current
sgx_epc 0

Perhaps it's the way I created the minikube cluster that is incorrect somehow.

minikube start --container-runtime=containerd -mount --mount-string /var/run/aesmd/:/var/run/aesmd/ 


mythi commented Nov 8, 2023

Any idea why the label isn't being applied and EPC memory isn't being registered?

The SGX labeling rule requires all of its conditions to match, and you already saw the reason why the label is not created: the kernel config also needs to be available for the label to be set.

This is a minikube thing and I've heard of people hitting this issue. One easy fix would be to drop that kernel.config match rule (see the sketch below). Is that minikube setup cgroup v2 enabled? My suggestion is not to use minikube but something almost as simple, like kubeadm. It gives you a cluster without any Docker layers.
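A sketch of that workaround, using the rule name from the NFD master logs above (review the rule before removing anything):

$ kubectl edit nodefeaturerule intel-dp-devices
# under the intel.sgx rule, remove the matchFeatures entry whose feature is kernel.config,
# then let the NFD master re-process the nodes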

And, how can I confirm the NRI plugin is running correctly?

It's deployed together with the SGX device plugin:

$ kubectl describe pod intel-sgx-plugin-rrxxn -n inteldeviceplugins-system
... 
Containers:
  intel-sgx-plugin:
    Container ID:  containerd://121c84e4671ce6376685d2870c030901d580716e9d38514191bde4bcc85df8a7
    Image:         intel/intel-sgx-plugin:0.28.0
    Image ID:      docker.io/intel/intel-sgx-plugin@sha256:51b768fb07611454d62b1833ecdbd09d41eeb7f257893193dab1f7e061f9c54c
    Port:          <none>
    Host Port:     <none>
    Args:
      -v
      4
      -enclave-limit
      110
      -provision-limit
      110
    State:          Running
      Started:      Mon, 06 Nov 2023 14:02:23 +0000
    Last State:     Terminated
      Reason:       Unknown
      Exit Code:    255
      Started:      Mon, 06 Nov 2023 14:00:16 +0000
      Finished:     Mon, 06 Nov 2023 14:01:53 +0000
    Ready:          True
    Restart Count:  2
    Environment:    <none>
    Mounts:
      /dev/sgx_enclave from sgx-enclave (ro)
      /dev/sgx_provision from sgx-provision (ro)
      /var/lib/kubelet/device-plugins from kubeletsockets (rw)
  nri-sgx-epc:
    Container ID:   containerd://6aa52f56bb96cfa63dbee7542598c0bf2850fea46775c03ea063b60108f75e83
    Image:          ghcr.io/containers/nri-plugins/nri-sgx-epc:unstable
    Image ID:       ghcr.io/containers/nri-plugins/nri-sgx-epc@sha256:f2b5fb6f70c6366494667b734a9399cfc4951b8680a83593109ef29677dc9128
    Port:           <none>
    Host Port:      <none>
    State:          Running
      Started:      Mon, 06 Nov 2023 14:02:23 +0000
    Last State:     Terminated
      Reason:       Unknown
      Exit Code:    255
      Started:      Mon, 06 Nov 2023 14:00:16 +0000
      Finished:     Mon, 06 Nov 2023 14:01:53 +0000
    Ready:          True
    Restart Count:  2
    Environment:    <none>
    Mounts:
      /var/run/nri from nrisockets (rw)


CyanDevs commented Nov 8, 2023

Awesome, removing the kernel.config labelling rule worked like magic and the sgx.intel.com/epc resource is now registered with the node. Not sure why the kernel config is missing. I'll give kubeadm a try later -- thanks for the suggestion. Also, yes, the minikube nodes are using cgroup v2.

My sgx_epc cgroup is not reporting the expected 65536 limit, though. It's probably because my NRI plugin isn't running. I'll investigate why tomorrow.

Thanks for all your help @mythi, I appreciate it.

docker@minikube:/sys/fs/cgroup/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podcfe87044_f090_498e_81e7_f8f032d9459c.slice/cri-containerd-3b11506602acad6132b5da715a016d0a166c4aae541af7be5ccfe74b0e777581.scope$ cat misc.max
sgx_epc max

The job spec

apiVersion: batch/v1
kind: Job
metadata:
  name: oe-helloworld
  namespace: default
spec:
  template:
    metadata:
      labels:
        app: oe-helloworld
    spec:
      containers:
      - name: oe-helloworld
        image: mcr.microsoft.com/acc/samples/oe-helloworld:1.1
        command: [ "sleep", "infinity" ]
        resources:
          limits:
            sgx.intel.com/epc: "65536"
          requests:
            sgx.intel.com/epc: "65536"
        volumeMounts:
        - name: var-run-aesmd
          mountPath: /var/run/aesmd
      restartPolicy: "Never"
      volumes:
      - name: var-run-aesmd
        hostPath:
          path: /var/run/aesmd
  backoffLimit: 0

pod

$ kubectl describe pod oe-helloworld-xpcvg
Name:             oe-helloworld-xpcvg
Namespace:        default
Priority:         0
Service Account:  default
Node:             minikube/192.168.49.2
Start Time:       Wed, 08 Nov 2023 06:48:31 +0000
Labels:           app=oe-helloworld
                  batch.kubernetes.io/controller-uid=00df9e9d-ba40-482f-92c8-68dc1808745f
                  batch.kubernetes.io/job-name=oe-helloworld
                  controller-uid=00df9e9d-ba40-482f-92c8-68dc1808745f
                  job-name=oe-helloworld
Annotations:      sgx.intel.com/epc: 64Ki
Status:           Running
IP:               10.244.0.11
IPs:
  IP:           10.244.0.11
Controlled By:  Job/oe-helloworld
Containers:
  oe-helloworld:
    Container ID:  containerd://3b11506602acad6132b5da715a016d0a166c4aae541af7be5ccfe74b0e777581
    Image:         mcr.microsoft.com/acc/samples/oe-helloworld:1.1
    Image ID:      mcr.microsoft.com/acc/samples/oe-helloworld@sha256:64033ee002d17d69790398e4c272a9c467334a931ca0fb087b98b96b9f3be3db
    Port:          <none>
    Host Port:     <none>
    Command:
      sleep
      infinity
    State:          Running
      Started:      Wed, 08 Nov 2023 06:48:31 +0000
    Ready:          True
    Restart Count:  0
    Limits:
      sgx.intel.com/enclave:  1
      sgx.intel.com/epc:      65536
    Requests:
      sgx.intel.com/enclave:  1
      sgx.intel.com/epc:      65536
    Environment:              <none>
    Mounts:
      /var/run/aesmd from var-run-aesmd (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xjdmk (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  var-run-aesmd:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/aesmd
    HostPathType:
  kube-api-access-xjdmk:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age    From               Message
  ----    ------     ----   ----               -------
  Normal  Scheduled  4m55s  default-scheduler  Successfully assigned default/oe-helloworld-xpcvg to minikube
  Normal  Pulled     4m55s  kubelet            Container image "mcr.microsoft.com/acc/samples/oe-helloworld:1.1" already present on machine
  Normal  Created    4m55s  kubelet            Created container oe-helloworld
  Normal  Started    4m55s  kubelet            Started container oe-helloworld

The intel-sgx-plugin pod doesn't have a section for the nri-sgx-epc container:

$ kubectl describe pod intel-sgx-plugin-42txt -n inteldeviceplugins-system
Name:             intel-sgx-plugin-42txt
Namespace:        inteldeviceplugins-system
Priority:         0
Service Account:  default
Node:             minikube/192.168.49.2
Start Time:       Tue, 07 Nov 2023 23:32:26 +0000
Labels:           app=intel-sgx-plugin
                  controller-revision-hash=868bb58f4b
                  pod-template-generation=1
Annotations:      <none>
Status:           Running
IP:               10.244.0.2
IPs:
  IP:           10.244.0.2
Controlled By:  DaemonSet/intel-sgx-plugin
Containers:
  intel-sgx-plugin:
    Container ID:  containerd://dacad378e7115d25edc2fe9a67e799ee3a542e5d42caf920fe6087e119eac345
    Image:         intel/intel-sgx-plugin:0.28.0
    Image ID:      docker.io/intel/intel-sgx-plugin@sha256:51b768fb07611454d62b1833ecdbd09d41eeb7f257893193dab1f7e061f9c54c
    Port:          <none>
    Host Port:     <none>
    Args:
      -v
      4
      -enclave-limit
      110
      -provision-limit
      110
    State:          Running
      Started:      Wed, 08 Nov 2023 00:36:11 +0000
    Last State:     Terminated
      Reason:       Unknown
      Exit Code:    255
      Started:      Tue, 07 Nov 2023 23:32:28 +0000
      Finished:     Wed, 08 Nov 2023 00:35:44 +0000
    Ready:          True
    Restart Count:  1
    Environment:    <none>
    Mounts:
      /dev/sgx_enclave from sgx-enclave (ro)
      /dev/sgx_provision from sgx-provision (ro)
      /var/lib/kubelet/device-plugins from kubeletsockets (rw)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  kubeletsockets:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/device-plugins
    HostPathType:
  sgx-enclave:
    Type:          HostPath (bare host directory volume)
    Path:          /dev/sgx_enclave
    HostPathType:  CharDevice
  sgx-provision:
    Type:          HostPath (bare host directory volume)
    Path:          /dev/sgx_provision
    HostPathType:  CharDevice
QoS Class:         BestEffort
Node-Selectors:    intel.feature.node.kubernetes.io/sgx=true
Tolerations:       node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                   node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                   node.kubernetes.io/not-ready:NoExecute op=Exists
                   node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                   node.kubernetes.io/unreachable:NoExecute op=Exists
                   node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:            <none>


mythi commented Nov 8, 2023

@CyanDevs You are most likely missing

$ make intel-deviceplugin-operator
$ docker save intel/intel-deviceplugin-operator:devel > op.tar
$ sudo ctr -n k8s.io i import op.tar

That is: make sure the operator deployment does not pull the image from Docker Hub but uses the custom image built from my devel branch.
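One way to double-check which image the operator deployment actually runs (deployment and namespace names as shown in the kubectl apply output earlier in this thread):

$ kubectl -n inteldeviceplugins-system get deployment inteldeviceplugins-controller-manager \
    -o jsonpath='{.spec.template.spec.containers[*].image}{"\n"}'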


CyanDevs commented Nov 9, 2023

Hi @mythi, I've successfully validated this on my end using an Azure VM. The issue was with using minikube, which inherently had problems running the NRI plugin (as well as the missing kernel.config for NFD). Once I switched to kubeadm, these issues disappeared and everything ran as expected.

$ cd ./kubepods-besteffort-pode845916d_a5eb_4abf_8c5c_e6d3a2d4f5b6.slice/cri-containerd-ac823861137eed2323214cefdc27b7295bbbaf4d55e4ee919e772fef133d02c3.scope
$ cat misc.max
sgx_epc 65536

I'm grateful for your guidance and prompt responses here. Thank you!


mythi commented Nov 9, 2023

@CyanDevs Great to hear! Any suggestions on where I could improve the documentation here, other than clearly mentioning that minikube is known not to work? I'm also about to add the steps to get cAdvisor set up for the telemetry piece.

Go ahead with more (stress) testing and let me and Haitao know if there are issues.

@CyanDevs

@mythi I sent you the notes I wrote as I went through the steps. This guide is great. Some improvements I can think of are including notes on installing cert-manager and NFD -- I did not know about these, as I had never used the Intel device plugins before this. Thanks!
