
@mvazquezc
Last active February 13, 2023 21:18
Container Security, an introduction to capabilities and seccomp profiles demos

Capabilities on Containers demos

Demo 1 - Run a container and get its thread capabilities

  1. Let's run a test container. This container has an application that listens on a given port, but that's not important for now:

    podman run -d --rm --name reversewords-test quay.io/mavazque/reversewords:latest
  2. We can always get capabilities for a process by querying the /proc filesystem:

    # Get container's PID
    CONTAINER_PID=$(podman inspect reversewords-test --format {{.State.Pid}})
    # Get caps for a given PID
    grep Cap /proc/${CONTAINER_PID}/status
  3. We get the capability sets in hex format; we can decode them with the capsh tool:

    capsh --decode=00000000800405fb
  4. We can use podman inspect as well:

    podman inspect reversewords-test --format {{.EffectiveCaps}}
  5. Stop the container:

    podman stop reversewords-test
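
Under the hood, capsh simply tests each bit of the hex mask against the capability numbers defined in `<linux/capability.h>`. A minimal pure-bash sketch of the same decoding (only a few bit/name pairs shown; the full table lives in the header):

```shell
# Decode a capability mask with shell arithmetic instead of capsh.
# Bit numbers come from <linux/capability.h>, e.g. CAP_NET_BIND_SERVICE=10.
mask=0x00000000800405fb   # CapEff value captured in the demo above
for entry in 0:cap_chown 1:cap_dac_override 10:cap_net_bind_service 18:cap_sys_chroot; do
  bit=${entry%%:*} name=${entry#*:}
  if (( (mask >> bit) & 1 )); then
    echo "${name} is set"
  fi
done
```

capsh --decode performs essentially this lookup across every capability name the tool knows about.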

Demo 2 - Container running with UID 0 vs container running with nonroot UID

  1. Run our test container with a root UID and get its capabilities:

    podman run --rm -it --user 0 --entrypoint /bin/bash --name reversewords-test quay.io/mavazque/reversewords:ubi8
    grep Cap /proc/1/status
  2. We can see the thread's permitted and effective capability sets populated; let's decode them:

    capsh --decode=00000000800405fb
  3. Exit the container:

    exit
  4. Same test but running the container with a nonroot UID:

    podman run --rm -it --user 1024 --entrypoint /bin/bash --name reversewords-test quay.io/mavazque/reversewords:ubi8 
    grep Cap /proc/1/status
  5. We can see the thread's permitted and effective capability sets cleared. We can exit our container now:

    exit
  6. We can request extra capabilities, and those will be assigned to the corresponding sets:

    podman run --rm -it --user 1024 --cap-add=cap_net_bind_service --entrypoint /bin/bash --name reversewords-test quay.io/mavazque/reversewords:ubi8
    grep Cap /proc/1/status
  7. Since Podman supports ambient capabilities, you can see how the NET_BIND_SERVICE cap got into the ambient, permitted and effective sets.

  8. We can exit the container now:

    exit
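
The Cap* lines we grep inside the containers exist for every Linux process, so the same check can be tried against the current shell (a sketch; on an unprivileged shell CapPrm and CapEff show up as all zeroes, just like the --user 1024 container above):

```shell
# Show the five capability sets (CapInh, CapPrm, CapEff, CapBnd, CapAmb)
# for the current shell process.
grep Cap /proc/self/status

# Extract just the effective mask, ready to feed to capsh --decode:
eff=$(awk '/^CapEff/ {print $2}' /proc/self/status)
echo "effective mask: 0x${eff}"
```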

Demo 3 - Real world scenario

Using thread capabilities

  1. We can control which port our application listens on by using the APP_PORT environment variable. Let's try to run our application on a non-privileged port with a non-privileged user:

    podman run --rm --user 1024 -e APP_PORT=8080 --name reversewords-test quay.io/mavazque/reversewords:ubi8
  2. Stop the container with Ctrl+C and try to bind to port 80 this time:

    podman run --rm --user 1024 -e APP_PORT=80 --name reversewords-test quay.io/mavazque/reversewords:ubi8
  3. This time it fails. Remember that since we're running as nonroot, the permitted and effective capability sets were cleared (so the NET_BIND_SERVICE cap present in podman's default capability set is not available).

  4. We know that the NET_BIND_SERVICE capability allows unprivileged processes to bind to ports under 1024. Let's assign this capability to the container and see what happens:

    podman run --rm --user 1024 -e APP_PORT=80 --cap-add=cap_net_bind_service --name reversewords-test quay.io/mavazque/reversewords:ubi8
  5. This time it worked because the NET_BIND_SERVICE cap was added to the ambient, permitted and effective sets.

  6. You can stop the container using Ctrl+C.

Using file capabilities

  1. We added the NET_BIND_SERVICE capability to our binary when we built the image:

    setcap 'cap_net_bind_service+ep' /usr/bin/reverse-words
  2. Let's take a look inside the container:

    podman run --rm -it --entrypoint /bin/bash --user 1024 -e APP_PORT=80 --name reversewords-test quay.io/mavazque/reversewords-captest:latest
    getcap /usr/bin/reverse-words
  3. The capability is added to the effective and permitted file capability sets.

  4. Let's review the thread capabilities:

    grep Cap /proc/1/status 
  5. As you can see, the effective and permitted sets are cleared, but the inheritable and bounding sets do have NET_BIND_SERVICE.

  6. Let's run our app:

    /usr/bin/reverse-words &
  7. We were able to bind to port 80: the binary had the file capability required to do that, and since it was present in the inheritable and bounding sets, the thread acquired the capability in its effective set. We can check the effective and permitted sets:

    grep Cap /proc/<app_pid>/status
  8. We can exit the container now.

    exit
  9. Does this mean that we can bypass thread capabilities? - Let's see:

    podman run --rm -it --entrypoint /bin/bash --user 1024 --cap-drop=all -e APP_PORT=80 --name reversewords-test quay.io/mavazque/reversewords-captest:latest
  10. Check the container thread capabilities:

    grep Cap /proc/1/status
  11. All sets are zeroed, let's try to run our app:

    /usr/bin/reverse-words
  12. The kernel blocked the execution, since NET_BIND_SERVICE capability cannot be acquired.

  13. That answers the question: no. Now we can exit the container:

    exit
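
Both outcomes above follow the exec-time capability transformation rules documented in capabilities(7). A minimal bash sketch of the two relevant formulas, with illustrative masks (the function and values here are ours, not part of the demo image):

```shell
# capabilities(7) exec-time rules, modeled with bit masks:
#   P'(permitted) = (P(inheritable) & F(inheritable))
#                 | (F(permitted) & P(bounding)) | P'(ambient)
#   P'(effective) = F(effective) ? P'(permitted) : P'(ambient)
NET_BIND=$(( 1 << 10 ))   # CAP_NET_BIND_SERVICE

exec_caps() {  # args: p_inh p_bnd p_amb f_inh f_prm f_eff_flag
  local p_prm=$(( ($1 & $4) | ($5 & $2) | $3 ))
  local p_eff=$(( $6 ? p_prm : $3 ))
  echo "$p_prm $p_eff"
}

# First run: the bounding set still holds NET_BIND_SERVICE and the binary
# carries cap_net_bind_service+ep -> permitted/effective gain the cap.
exec_caps "$NET_BIND" "$NET_BIND" 0 0 "$NET_BIND" 1   # -> "1024 1024"

# Second run (--cap-drop=all): every thread set is zero, so nothing can be
# acquired; the kernel refuses to exec a +ep binary it cannot satisfy.
exec_caps 0 0 0 0 "$NET_BIND" 1                       # -> "0 0"
```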

Seccomp on Containers demos

Demo 1 - Create your own seccomp profile

  1. We will use the OCI Hook project to generate the seccomp profile for our app.

  2. Create a container with the OCI Hook which runs our application:

    sudo podman run --rm --annotation io.containers.trace-syscall="of:/tmp/ls.json" fedora:32 ls / > /dev/null
  3. The hook wrote the seccomp profile to /tmp/ls.json; let's review it:

    jq < /tmp/ls.json
  4. We can now run our app with this profile:

    podman run --rm --security-opt seccomp=/tmp/ls.json fedora:32 ls /
  5. What happens if we change the command?

    podman run --rm --security-opt seccomp=/tmp/ls.json fedora:32 ls -l /
  6. The required syscalls are not allowed, so it fails. Let's use the hook to append the ones we're missing:

    sudo podman run --rm --annotation io.containers.trace-syscall="if:/tmp/ls.json;of:/tmp/lsl.json" fedora:32 ls -l / > /dev/null
  7. We have an updated seccomp profile now, let's diff them:

    diff <(jq -S . /tmp/ls.json) <(jq -S . /tmp/lsl.json)
  8. We can use this new profile to run our app:

    podman run --rm --security-opt seccomp=/tmp/lsl.json fedora:32 ls -l /
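
The `diff <(...) <(...)` form used in step 7 relies on bash process substitution to compare two command outputs without temporary files. A tiny standalone illustration, with two inline syscall lists standing in for the jq output of the real profiles:

```shell
# Lines prefixed with '>' exist only in the second (larger) allow list,
# the same way extra syscalls show up when diffing ls.json vs lsl.json.
diff <(printf 'read\nwrite\n') <(printf 'lstat\nread\nwrite\n') || true
```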

Capabilities on Kubernetes demos

Demo 1 - Pod running with UID 0 vs pod running with nonroot UID

The cluster was created with the following command:

    kcli create kube generic -P masters=1 -P workers=1 -P master_memory=4096 -P numcpus=2 -P worker_memory=4096 -P sdn=calico -P version=1.24 -P ingress=true -P ingress_method=nginx -P metallb=true -P engine=crio -P domain=linuxera.org caps-cluster

  1. Create a namespace

    NAMESPACE=test-capabilities
    kubectl create ns ${NAMESPACE}
  2. Create a pod running our application with UID 0:

    cat <<EOF | kubectl -n ${NAMESPACE} create -f -
    apiVersion: v1
    kind: Pod
    metadata:
      name: reversewords-app-captest-root
    spec:
      containers:
      - image: quay.io/mavazque/reversewords:ubi8
        name: reversewords
        securityContext:
          runAsUser: 0
      dnsPolicy: ClusterFirst
      restartPolicy: Never
    status: {}
    EOF
  3. Let's review the thread capability sets:

    kubectl -n ${NAMESPACE} exec -ti reversewords-app-captest-root -- grep Cap /proc/1/status
  4. We can see that the permitted and effective sets have some capabilities; if we decode them, this is what we get:

    capsh --decode=00000000000005fb
  5. Now, let's run the same application pod but with a nonroot UID:

    cat <<EOF | kubectl -n ${NAMESPACE} create -f -
    apiVersion: v1
    kind: Pod
    metadata:
      name: reversewords-app-captest-nonroot
    spec:
      containers:
      - image: quay.io/mavazque/reversewords:ubi8
        name: reversewords
        securityContext:
          runAsUser: 1024
      dnsPolicy: ClusterFirst
      restartPolicy: Never
    status: {}
    EOF
  6. If we review the thread capability sets this is what we get:

    kubectl -n ${NAMESPACE} exec -ti reversewords-app-captest-nonroot -- grep Cap /proc/1/status
  7. The permitted and effective sets got cleared which, if you remember, is expected. The problem on Kubernetes is that it doesn't support ambient capabilities; as you can see, the ambient set is cleared. That leaves us with only two options: file capabilities or capability-aware applications.
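
As an aside, the mask decoded in step 4 (00000000000005fb, the runtime default here) is not the same as podman's 00000000800405fb from the earlier demos. A quick sketch to see which capability bits two masks disagree on, using shell arithmetic (bit-to-name mapping per `<linux/capability.h>`):

```shell
a=0x00000000800405fb   # effective set seen with podman earlier
b=0x00000000000005fb   # effective set decoded in this demo
for bit in $(seq 0 39); do
  if (( ((a ^ b) >> bit) & 1 )); then
    # bit 18 = CAP_SYS_CHROOT, bit 31 = CAP_SETFCAP
    echo "capability bit $bit differs"
  fi
done
```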

Demo 2 - Application with NET_BIND_SERVICE

  1. In this first deployment we are going to run our app with a root UID and drop every runtime capability but NET_BIND_SERVICE:

    cat <<EOF | kubectl -n ${NAMESPACE} create -f -
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      creationTimestamp: null
      labels:
        app: reversewords-app-rootuid
      name: reversewords-app-rootuid
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: reversewords-app-rootuid
      strategy: {}
      template:
        metadata:
          creationTimestamp: null
          labels:
            app: reversewords-app-rootuid
        spec:
          containers:
          - image: quay.io/mavazque/reversewords:ubi8
            name: reversewords
            resources: {}
            env:
            - name: APP_PORT
              value: "80"
            securityContext:
              runAsUser: 0
              capabilities:
                drop:
                - all
                add:
                - NET_BIND_SERVICE
    status: {}
    EOF
  2. If we get the application logs, we can see that it started properly:

    kubectl -n ${NAMESPACE} logs deployment/reversewords-app-rootuid
  3. If we look at the capability sets this is what we get:

    kubectl -n ${NAMESPACE} exec -ti deployment/reversewords-app-rootuid -- grep Cap /proc/1/status
  4. We have NET_BIND_SERVICE available, so it worked as expected.

  5. Now we drop all of the runtime's default capabilities, add the NET_BIND_SERVICE capability on top, and request that the app run with a non-root UID. In the environment variables we configure our app to listen on port 80.

    cat <<EOF | kubectl -n ${NAMESPACE} create -f -
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      creationTimestamp: null
      labels:
        app: reversewords-app-nonrootuid
      name: reversewords-app-nonrootuid
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: reversewords-app-nonrootuid
      strategy: {}
      template:
        metadata:
          creationTimestamp: null
          labels:
            app: reversewords-app-nonrootuid
        spec:
          containers:
          - image: quay.io/mavazque/reversewords:ubi8
            name: reversewords
            resources: {}
            env:
            - name: APP_PORT
              value: "80"
            securityContext:
              runAsUser: 1024
              capabilities:
                drop:
                - all
                add:
                - NET_BIND_SERVICE
    status: {}
    EOF
  6. Let's check the logs:

    kubectl -n ${NAMESPACE} logs deployment/reversewords-app-nonrootuid
  7. The application failed to bind to port 80. Let's update the configuration so we can access the pod and check the capability sets:

    # Patch the app so it binds to port 8080
    kubectl -n ${NAMESPACE} patch deployment reversewords-app-nonrootuid -p '{"spec":{"template":{"spec":{"$setElementOrder/containers":[{"name":"reversewords"}],"containers":[{"$setElementOrder/env":[{"name":"APP_PORT"}],"env":[{"name":"APP_PORT","value":"8080"}],"name":"reversewords"}]}}}}'
    # Get capability sets
    kubectl -n ${NAMESPACE} exec -ti deployment/reversewords-app-nonrootuid -- grep Cap /proc/1/status
  8. We don't have NET_BIND_SERVICE in the effective and permitted sets. That means that for this to work the capability would need to be in the ambient set, which is not supported yet on Kubernetes, so we will need to make use of file capabilities.

  9. We have an image with the file capabilities configured. Let's update the deployment to use port 80 and this new image:

    kubectl -n ${NAMESPACE} patch deployment reversewords-app-nonrootuid -p '{"spec":{"template":{"spec":{"$setElementOrder/containers":[{"name":"reversewords"}],"containers":[{"$setElementOrder/env":[{"name":"APP_PORT"}],"env":[{"name":"APP_PORT","value":"80"}],"image":"quay.io/mavazque/reversewords-captest:latest","name":"reversewords"}]}}}}'
  10. Let's check the logs for the app:

    kubectl -n ${NAMESPACE} logs deployment/reversewords-app-nonrootuid
  11. If we check the capabilities now this is what we get:

    kubectl -n ${NAMESPACE} exec -ti deployment/reversewords-app-nonrootuid -- grep Cap /proc/1/status
  12. We can check the file capabilities configured in our binary as well:

    kubectl -n ${NAMESPACE} exec -ti deployment/reversewords-app-nonrootuid -- getcap /usr/bin/reverse-words
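
The inline JSON patches in steps 7 and 9 are hard to read. An equivalent, arguably clearer approach (sketched here; the file name is illustrative) is to keep the change in a strategic-merge patch file and apply it with `kubectl patch --patch-file`:

```yaml
# port-patch.yaml (hypothetical name): strategic merge patch that sets
# APP_PORT on the reversewords container; containers are merged by name.
spec:
  template:
    spec:
      containers:
      - name: reversewords
        env:
        - name: APP_PORT
          value: "8080"
```

Applied with `kubectl -n ${NAMESPACE} patch deployment reversewords-app-nonrootuid --patch-file port-patch.yaml`.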

Seccomp Profiles on Kubernetes demos

Demo 1 - Running a workload with a custom seccomp profile

  1. Add the seccomp profile below to your Kubernetes nodes under /var/lib/kubelet/seccomp/centos8-ls.json:

    {
      "defaultAction": "SCMP_ACT_ERRNO",
      "architectures": [
        "SCMP_ARCH_X86_64"
      ],
      "syscalls": [
        {
          "names": [
            "access",
            "arch_prctl",
            "brk",
            "capget",
            "capset",
            "chdir",
            "close",
            "epoll_ctl",
            "epoll_pwait",
            "execve",
            "exit_group",
            "fchown",
            "fcntl",
            "fstat",
            "fstatfs",
            "futex",
            "getdents64",
            "getpid",
            "getppid",
            "ioctl",
            "mmap",
            "mprotect",
            "munmap",
            "nanosleep",
            "newfstatat",
            "openat",
            "prctl",
            "pread64",
            "prlimit64",
            "read",
            "rt_sigaction",
            "rt_sigprocmask",
            "rt_sigreturn",
            "sched_yield",
            "seccomp",
            "set_robust_list",
            "set_tid_address",
            "setgid",
            "setgroups",
            "setuid",
            "stat",
            "statfs",
            "tgkill",
            "write"
          ],
          "action": "SCMP_ACT_ALLOW",
          "args": [],
          "comment": "",
          "includes": {},
          "excludes": {}
        }
      ]
    }
  2. Create a namespace for our workload:

    NAMESPACE=test-seccomp
    kubectl create ns ${NAMESPACE}
  3. We can configure seccomp profiles at the pod or container level; this time we're going to configure it at the pod level:

    cat <<EOF | kubectl -n ${NAMESPACE} create -f -
    apiVersion: v1
    kind: Pod
    metadata:
      name: seccomp-ls-test
    spec:
      securityContext:
        seccompProfile:
          type: Localhost
          localhostProfile: centos8-ls.json
      containers:
      - image: registry.centos.org/centos:8
        name: seccomp-ls-test
        command: ["ls", "/"]
      dnsPolicy: ClusterFirst
      restartPolicy: Never
    status: {}
    EOF
  4. We can check pod logs:

    kubectl -n ${NAMESPACE} logs seccomp-ls-test
  5. Let's try to modify the container command; this time let's run 'ls -l /':

    cat <<EOF | kubectl -n ${NAMESPACE} create -f -
    apiVersion: v1
    kind: Pod
    metadata:
      name: seccomp-lsl-test
    spec:
      containers:
      - image: registry.centos.org/centos:8
        name: seccomp-lsl-test
        command: ["ls", "-l", "/"]
        securityContext:
          seccompProfile:
            type: Localhost
            localhostProfile: centos8-ls.json
      dnsPolicy: ClusterFirst
      restartPolicy: Never
    status: {}
    EOF
  6. This time the pod failed, since the seccomp profile doesn't allow the syscalls required for ls -l / to run:

    kubectl -n ${NAMESPACE} logs seccomp-lsl-test