Kubernetes

Overview

  • Automatically deploying and managing containers is called container orchestration
  • K8s is a container orchestration tool/technology
  • Other alternatives to K8s are Docker Swarm and Mesos

Cluster architecture

  • A K8s cluster is a set of machines (or nodes) running in sync
  • One of the nodes is the master node, responsible for the actual orchestration
  • kube-scheduler schedules pods on nodes based on node capacity, load on the node and other policies. This runs in the kube-system namespace
  • kubelet runs on worker nodes, listens for instructions from kube-apiserver and manages containers
  • kube-proxy enables communication between services within the cluster
  • The kubectl tool is used to deploy and manage applications on k8s clusters
  • As k8s is a container orchestration tool, we also need a container runtime engine like docker

ETCD

  • ETCD is a distributed, reliable key-value store that is simple, secure and fast
  • K8s uses an etcd cluster on the master node to store information like which node is master, nodes, pods, configs, secrets, accounts, roles, bindings and other cluster state
  • etcdctl is a command line tool which comes with the ETCD server
./etcdctl set key1 value1
./etcdctl get key1
  • All the kubectl get command output comes from etcd server
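  • A hedged example of listing the keys k8s stores in etcd, assuming a kubeadm cluster where the etcd pod is named etcd-master in kube-system and certificates are at the default kubeadm paths
kubectl exec etcd-master -n kube-system -- etcdctl get / --prefix --keys-only --limit=10 \
  --cacert /etc/kubernetes/pki/etcd/ca.crt \
  --cert /etc/kubernetes/pki/etcd/server.crt \
  --key /etc/kubernetes/pki/etcd/server.key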

Kube-API server

  • kube-apiserver is used for all management related communication in a cluster. It runs on the master node
  • When we run a command from kubectl, it reaches the kube-apiserver, which authenticates and validates the request, then interacts with the etcd server and returns back the response
  • We don't necessarily need to use kubectl, we can directly make requests (e.g. a POST request using curl) to create a pod, as sketched below
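  • A minimal curl sketch, assuming client certificates have already been generated (file names are placeholders) and pod-definition.yaml contains a pod spec
curl -X POST https://kube-apiserver:6443/api/v1/namespaces/default/pods \
  -H "Content-Type: application/yaml" \
  --data-binary @pod-definition.yaml \
  --key admin.key --cert admin.crt --cacert ca.crt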

Kube controller manager

  • A controller continuously monitors the state of the system and takes necessary action to bring the system back to the desired state in case of problems
  • The kube controller manager interacts with kube-apiserver to get cluster info. This also runs on the master node and is the brain of the cluster
  • There are many controllers in k8s, listed below are 2 of those
    • Node controller: Monitors node health and, when a node goes down, takes necessary action so that its pods are brought up elsewhere
    • Replication controller: Ensures the desired number of pods are running at all times within a set

Kube scheduler

  • Decides which pod goes to which node so that the right container ends up on the right node. It decides on the basis of the CPU/memory available on each node and required by the container, and finds the best fit
  • This also runs on master node

Kubelet

  • This runs on worker nodes and is responsible for receiving instructions from the kube-apiserver and sending back status reports for that worker node
  • kubelet needs to be installed manually on worker nodes, it is not installed automatically by kubeadm like the other components

kube proxy

  • This runs as a daemonset on each node
  • Manages networking in the k8s cluster so that each pod in the cluster is able to communicate with every other pod

Pods

  • A pod is a single instance of an application
  • The recommendation is to run a single container per pod, but a pod can have multiple containers in some cases, e.g. an application container may have helper containers which go in the same pod. Containers running in the same pod can communicate using localhost itself (see the commands below)
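  • A minimal usage sketch of creating and inspecting a pod imperatively
kubectl run nginx --image=nginx  # Create a pod named nginx with the nginx image
kubectl get pods                 # List pods
kubectl describe pod nginx       # Detailed info about the pod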

Setup

  • Install kubectl utility first to interact with k8s cluster

Minikube

  • minikube is the easiest way to install a k8s cluster, it installs all components (etcd, container runtime, ...) on a single machine/node
minikube start                          # Start minikube
minikube stop                           # Stop minikube
minikube service appname-service --url  # Get external URL of appname

Kubeadm

  • kubeadm is a more advanced tool to create a multi-node k8s cluster
  • We can use a tool like vagrant to create VMs on a machine to have multiple nodes in k8s - master and worker(s)

YAML file

  • K8s works on yaml definition files, each expects 4 top level fields
    • apiVersion
    • kind
    • metadata
    • spec
  • Below is a sample yaml file to deploy a pod with an nginx container
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    app: nginx
    tier: frontend
spec:
  containers:
  - name: nginx
    image: nginx
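  • To create the pod from this file and verify it (a minimal usage sketch, assuming the file is saved as pod.yaml)
kubectl apply -f pod.yaml
kubectl get pods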

Replication controllers

  • Controllers are the brain behind the k8s cluster, they are processes which monitor k8s objects and take the desired action
  • A replication controller ensures a specified number of pods are running at all times. It also helps with load balancing across pods - scaling
  • kind = ReplicationController; pod and replicas info is present in the spec section of the yaml file
apiVersion: v1
kind: ReplicationController
metadata:
  name: ...
  labels:
    ...
spec:
  template:
      <pod-definition>
  replicas: ...

Replica sets

  • This serves the same purpose as the replication controller; ReplicationController is the older technology and ReplicaSet is the recommended way
  • apiVersion = apps/v1 and kind = ReplicaSet. The spec section remains the same as above, with one more param called selector used to select which pods to monitor. It is possible that pods with the given labels already exist (or some exist); in that case the replica set won't create those pods but will just monitor them to maintain the desired number of pods
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: ...
  labels:
    ...
spec:
  template:
      <pod-definition>
  replicas: ...
  selector:
      matchLabels:
        ...
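  • To scale a replica set we can either update replicas in the definition file and re-apply, or use the scale command (file name below is a placeholder)
k replace -f replicaset-definition.yaml            # After updating replicas in the file
k scale --replicas=6 -f replicaset-definition.yaml
k scale --replicas=6 replicaset myapp-replicaset   # Using type and object name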

Deployments

  • Provides capability for rolling updates, rollback, pausing and resume changes
  • Deployments come higher in the hierarchy than replica sets (deployment > replica set > pods)
  • The yaml file is almost the same as for replica sets, but kind = Deployment for a deployment object
  • On deploying, it creates a new replica set which in turn creates pods
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ...
  labels:
    ...
spec:
  template:
      <pod-definition>
  replicas: ...
  selector:
      matchLabels:
        ...
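  • Useful imperative commands for deployments
k create deployment nginx --image=nginx --replicas=3  # Create a deployment
k get deployments                                     # List deployments
k get all                                             # Deployments, replica sets, pods, services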

Namespaces(ns)

  • A default ns is automatically created when a cluster is set up
  • kube-system (for system components like DNS and networking) and kube-public (for keeping public resources) are other ns created at cluster startup
  • Each ns has
    • Isolation: Each ns is isolated from the others, e.g. we can have a cluster with 2 ns, dev and prod, isolated from each other. We can access resources/services deployed in another ns by qualifying the service name with the ns, like web-app.dev.svc...
    • Policies: Each ns can have different policies
    • Resource limits: We can define a different quota of resources in each ns
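  • Common namespace commands
k create namespace dev                          # Create a namespace
k get pods -n dev                               # List pods in the dev namespace
k get pods --all-namespaces                     # List pods across all namespaces
k config set-context --current --namespace=dev  # Switch the default namespace for kubectl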

Services

  • A service is a k8s virtual object which enables communication between various internal and external components, like access from a browser or between frontend and backend services
  • Enables loose coupling between the microservices in our application

We have 3 types of services in k8s

NodePort

  • This service is used to make an internal service (like a webserver) accessible to users (the outside world) on a port. It exposes the application on a port on all nodes
  • Node ports can range from 30000 to 32767
apiVersion: v1
kind: Service
metadata:
    ...
spec:
  type: NodePort
  ports:
    - targetPort: 80
      port: 80
      nodePort: 30008
  selector:
    ...
  • This file has 3 ports - all are with respect to the service
    • targetPort: Pod port, the actual port the application listens on inside the pod
    • port: Service port
    • nodePort: Port exposed on the node to the external world
  • If multiple pods are running with the given selector, the service acts as a load balancer and distributes traffic to the pods randomly
  • For multiple nodes in a cluster, we can access the application using any node's IP and the node port; the NodePort service spans across all nodes in the cluster

ClusterIP

  • Creates an IP (and name) to communicate between sets of services, e.g. from frontend services to backend services. This is for internal access only (within the cluster, not bound to a specific node); different microservices communicate using the ClusterIP service
  • This is the default service type
  • K8s creates one ClusterIP service by default named kubernetes
apiVersion: v1
kind: Service
metadata:
    name: backend
...
spec:
  type: ClusterIP
  ports:
    - targetPort: 80
      port: 80
  selector:
    ...
  • Imperative way to create a service
# Expose pod `messaging` on port 6379, with service port 6379
# We can use a deployment, rc or rs instead of a pod
# We can also specify `--target-port` if the pod listens on a different port
k expose pod messaging --name messaging-service --port=6379

LoadBalancer

  • Used to create a single endpoint like http://some-domain.com to access the application. The application may be running on multiple nodes; this helps us create a common name to access it. Without this we would have to access the app using a specific nodeIP:port, which is hard to remember and will change when a node restarts (it may get a new IP)
apiVersion: v1
kind: Service
...
spec:
  type: LoadBalancer
  ports:
    - targetPort: 80
      port: 80
      nodePort: 30008
  selector:
    ...
  • When using LoadBalancer on cloud providers like AWS or GCP, k8s sends a request to the cloud provider to provision a load balancer which can be used to access the application

Imperative vs declarative approach

Imperative

  • Providing step-by-step instructions, written out, on what to do and how to do it
  • In k8s, anything done using a kubectl command except apply is the imperative approach, like kubectl run, edit, expose, create, scale, replace, delete, ...
  • This is faster, we just have to run the right command - a yaml file is not always required. Use this in the certification exam to save time

Declarative

  • Using tools like terraform, chef, puppet, ansible. These do a lot of error handling and maintain the state of the steps done so far

  • In k8s, this is done using the kubectl apply command, which checks the current state of the system and performs only the relevant actions

  • It is recommended not to mix the imperative and declarative approaches; a small comparison is sketched below
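# Imperative: tell k8s exactly what to do
k run nginx --image=nginx
k scale deployment nginx --replicas=3

# Declarative: describe the desired state in a file and let k8s converge to it
# (nginx-deployment.yaml is a placeholder file name)
k apply -f nginx-deployment.yaml  # Works for both create and update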

Networking

  • Each pod is assigned an IP
  • K8s does not handle pod-to-pod networking itself, so in a multi node cluster we have to set up the networking on our own using other networking software like vmware nsx, etc.

Scheduling

  • The scheduler assigns a node to a pod: when we deploy a pod, a property called nodeName (in the spec section) is set on the pod with the name of the node where the pod has to run
  • If a pod doesn't get a nodeName assigned to it, the pod remains in Pending state
  • We can also assign nodeName manually - by setting this nodeName property in our deployment/pod yaml file, as shown below
  • Note that we can't change the nodeName of a running pod; to mimic this behaviour we create a Binding object and send it in a POST request for the pod
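  • A minimal sketch of manually placing a pod on a node (node02 is a placeholder node name)
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  nodeName: node02  # Schedule this pod on node02, bypassing the scheduler
  containers:
  - name: nginx
    image: nginx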

Taint and tolerations

  • We can taint certain nodes so that only specific pods can be scheduled on those nodes. This is useful when we want to reserve some nodes for a specific use case
  • For the specific pods which should be scheduled on tainted nodes, we add tolerations; these make the pods tolerant to the taint so they can get scheduled on the tainted nodes
  • The below command can be used to taint a node
k taint nodes node-name key=value:taint-effect  # Sample

k taint nodes node1 app=blue:NoSchedule  # Example
  • <taint-effect> specifies what happens to pods which do not tolerate this taint, it can have 3 values

    • NoSchedule: Don't schedule those pods on this node
    • PreferNoSchedule: The system will try to avoid scheduling on this node but that's not guaranteed
    • NoExecute: Don't schedule new pods, and existing pods which don't tolerate the taint will be evicted. This can happen if some pods got scheduled on the node before it was tainted
  • We can add tolerations to pods in yaml definition file in spec section

...
spec:
  ...
  tolerations:
  - key: app
    operator: Equal
    value: blue
    effect: NoSchedule
  • When we create a cluster, a taint is applied on the master node so that no pod (workload) is scheduled on master nodes. This can be checked using the below command
k describe node <master-node-name> | grep Taint
  • Untaint node
kubectl taint nodes <nodeName> node-role.kubernetes.io/master:NoSchedule-
  • Tainting a node only restricts which pods that node will accept; it doesn't guarantee that a specific pod gets scheduled on a specific node. A tolerant pod can still be scheduled on any node in the cluster. If we have a requirement to schedule some pods on specific nodes, this can be achieved using node affinity

Node selector

  • To schedule a pod on a specific node we can use nodeSelector in the spec section of the pod definition yaml file
...
spec:
  ...
  nodeSelector:
    size: Large
  • The size: Large given above is a label that we have to add on nodes using the below command
k label nodes node-1 size=Large
  • Node selectors have limitations, e.g. they don't support complex selection filters like "schedule the pod on medium or large nodes" or "don't schedule on small nodes"; for these use cases we can use node affinity

Node affinity

  • We can add affinity in spec section of pod yaml definition file, to select specific nodes for scheduling pods
...
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: size
            operator: In
            values:
            - Large
            - Medium
  • Other operators we can use are Exists (doesn't need a value), NotIn, etc
  • Other node affinity types are preferredDuringSchedulingIgnoredDuringExecution and (planned) requiredDuringSchedulingRequiredDuringExecution
    • Scheduling: Starting the pod - assigning a node to the pod
    • Execution: The pod is already scheduled and in running state. This part matters when a pod is already running on a node and someone changes the node labels

Resource requirements and limits

  • When the scheduler tries to schedule a pod, k8s checks the pod's resource requirements and places it on a node which has sufficient resources
  • By default a container is assumed to request 0.5 CPU and 256 Mi of RAM for scheduling; this can be modified by adding a resources section under the container in the pod yaml definition
...
spec:
  containers:
  - name: ...
    image: ...
    resources:
      requests:
        memory: "1Gi"
        cpu: 1
  • 1 CPU = 1000m = 1 vCPU = 1 AWS vCPU = 1 GCP core = 1 Azure core = 1 hyperthread. m stands for millicore
  • It can be as low as 0.1 CPU, which is 100m
  • For memory
    • 1 K (kilobyte) = 1,000 bytes
    • 1 M = 1,000,000 bytes
    • 1 G = 1,000,000,000 bytes
    • 1 Ki (kibibyte) = 1,024 bytes
    • 1 Mi = 1,048,576 bytes
    • ...
  • While a container is running its resource usage can grow, so by default k8s sets a limit of 1 vCPU and 512 Mi on containers; this can also be changed by adding a limits section under the resources section
...
spec:
  containers:
  - name: ...
    image: ...
    resources:
      requests:
        ...
      limits:
        memory: "2Gi"
        cpu: 2
  • If a container tries to use more CPU than its limit, it is throttled; if it exceeds its memory limit, the container is terminated (OOMKilled)

Daemon sets

  • A daemon set ensures that one copy of a pod is always running on all nodes in the cluster; when a new node is added to the cluster the daemon set pod starts running on that node too
  • Some applications of daemon sets
    • Monitoring agents
    • Log collectors/viewers
  • kube-proxy runs as a daemon set
  • The YAML definition of a daemon set is similar to a replica set; the change is in the kind only, other params are the same
apiVersion: apps/v1
kind: DaemonSet
...
spec:
  ...
  template:
    ...
  • K8s (v1.12 onwards) uses nodeAffinity and the default scheduler to place daemon set pods on each node

Static pods

  • Suppose we don't have a master node in the cluster (which has the kube-apiserver, etcd server and other components), we only have worker nodes with kubelet; now we can't create resources because there is no kube-apiserver to give instructions to kubelet
  • In this scenario we place pod definition yaml files at the pod manifest path, which is by default /etc/kubernetes/manifests; kubelet watches this path and creates a pod for any definition file it finds there, and if the definition file is later deleted the pod also gets deleted. Pods created this way are called static pods
  • kubelet only understands pods, so we can only create pods this way, not deployments or replica sets
  • The pod manifest path (staticPodPath) can be configured when running the kubelet service. To find the current path, check the --config option passed to the running kubelet binary (ps -eaf | grep kubelet); --config points to the kubelet config file which contains staticPodPath and other settings
  • Once a static pod is created we can't use kubectl get pods to check it, because kubectl talks to the kube-apiserver which is not running; in this case we can use container runtime commands such as docker ps
  • This is used to deploy control plane components while bootstrapping a k8s cluster - etcd, api-server, controller-manager, scheduler (these pods have the node name - master or controlplane - appended to their name)

Multiple schedulers

  • Apart from the default k8s scheduler running on the master node, we can also deploy our own scheduler
  • A custom scheduler can be deployed just like any other pod with some name - the image should be k8s.gcr.io/kube-scheduler:v1.20.0, with a command section which contains various options like leader-elect, port, ...
  • While deploying another service pod we can select our custom scheduler using the schedulerName option in the spec section, as sketched below
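  • A minimal sketch of a pod that asks for a custom scheduler (my-custom-scheduler is a placeholder name)
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  schedulerName: my-custom-scheduler  # Use the custom scheduler instead of the default one
  containers:
  - name: nginx
    image: nginx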

Monitoring and logging

  • K8s has a monitoring component called metrics server which keeps all cluster metrics; this is an in-memory solution so we won't get historical data
  • kubelet running on each node has another component called cAdvisor (container advisor) which retrieves performance metrics from pods and exposes them to the metrics server through the kubelet APIs
  • We can install metric server using
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
  • To view metrics, we can use below top commands
k top node  # See CPU and memory usage of all nodes
k top pod   # See CPU and memory usage of pods in current namespace
  • To view pod logs
k logs <podName>     # Get all logs from beginning to now
k logs -f <podName>  # Stream logs

# If a pod has multiple containers in it, specify the container name also
k logs -f <podName> <containerName>

# Check logs from all containers in a pod
k logs -f <podName> --all-containers

# Previous pod logs
k logs -f <podName> --previous

# If core components are down then kubectl commands won't work; we can use journalctl to see logs of components like kubelet
journalctl -u kubelet  # Get kubelet logs

# We can also use docker commands to get logs if kubectl is not working
docker logs <docker-id>

Application lifecycle management

Rolling updates and rollback

  • When a deployment is created, a rollout is triggered and creates a revision of the deployment. On every subsequent deployment this revision is updated, which helps us roll back to a previous version if necessary
  • To see rollouts, use below commands
k rollout status deployment/myapp-deployment   # Get status of rollout
k rollout history deployment/myapp-deployment  # Get rollout history
  • K8s supports 2 deployment strategies
    • Recreate: Delete all existing pods and then create new ones. This causes some downtime while the new version comes up. This is NOT the default deployment strategy
    • Rolling update: This takes down one object (or pod) at a time and deploys the newer version one by one. This way the application never goes down and the upgrade is seamless. This is the default deployment strategy in k8s
  • The strategy in use can also be seen by describing a deployment - we get the strategy name and how pods got updated, rolling or recreate
  • Under the hood, a deployment creates a replica set which creates the required number of pods. When the deployment is updated, a new replica set is created and the new pods are started in it while pods in the existing replica set are scaled down. To see this, use k get replicasets
  • If we notice a problem after upgrading our application/deployment, we can undo the deployment and roll back to the previous revision; it will destroy the pods in the new replica set and bring the old ones up in the old replica set
k rollout undo deployment/myapp-deployment  # Rollback last deployment
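  • To trigger a new rollout we can edit the image in the definition file and re-apply, or update it imperatively (nginx-container is a placeholder container name)
k set image deployment/myapp-deployment nginx-container=nginx:1.9.1
k apply -f deployment-definition.yaml  # After editing the image version in the file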

Commands and arguments

  • In a Dockerfile we have 2 fields
    • ENTRYPOINT: Specifies which command to run
    • CMD: Provides default arguments which go with the command given in ENTRYPOINT
  • In a pod definition file, we can override both of the above using the command and args options respectively
apiVersion: v1
kind: Pod
metadata:
  name: ubuntu-sleeper
spec:
  containers:
  - name: ubuntu-sleeper
    image: ubuntu-sleeper
    command: ["my-sleep"]  # Run this command (overwritten `ENTRYPOINT`)
    args: ["10"]  # Pass argument 10 with above command (overwritten `CMD`)

Configure environment variables

We can specify environment variables for a pod in its definition file using the env or envFrom parameter. There are 3 ways to provide values for env vars

Directly specifying name and value
env:
  - name: APP_COLOR
    value: pink
ConfigMap

ConfigMaps are used to keep all configuration required by an application in a central place; a ConfigMap can be referenced in the pod definition file and all its name/value pairs become available

  • ConfigMaps can be created using imperative way or declarative way
# Imperative approach
k create configmap <cm-name> --from-literal=<key1>=<value1> --from-literal=<key2>=<value2>
k create configmap <cm-name> --from-file=<path-to-file>  # Can use file with all key/val also

k get configmaps
k describe configmaps
  • Declarative approach
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  APP_COLOR: blue
  APP_MODE: prod
  • To use above configMap in pod definition file use envFrom
envFrom:
  - configMapRef:
      name: app-config  # ConfigMap name
  • The above config injects all env vars from the configMap into the pod; we can also pick selected keys
env:
  - name: APP_COLOR
    valueFrom:
      configMapKeyRef:
        name: app-config
        key: APP_COLOR
  • ConfigMaps can also be mounted as volumes
volumes:
  - name: app-config-volume
    configMap:
      name: app-config
Secrets

Secrets can be used to store any sensitive information. Same as configMaps, but the data is kept base64 encoded (encoded, not encrypted)

  • Imperative way to create secret
k create secret generic <secret-name> --from-literal=<key1>=<value1> --from-literal=<key2>=<value2>
k create secret generic <secret-name> --from-file=<path-to-file>  # Can use file with all key/val also

k get secrets
k describe secrets
  • Declarative way: To use the declarative way, values should be base64 encoded; if we don't want to encode to base64 we can use the stringData field instead of data
apiVersion: v1
kind: Secret
metadata:
  name: app-secret
data:
  DB_Host: bXlzcWwK
  DB_User: cm9vdAo=
  DB_Password: YWJjMTIzCg==
  • To use above secret in pod definition file use envFrom
envFrom:
  - secretRef:
      name: app-secret  # secret name
  • The above config injects all env vars from the secret into the pod; we can also pick selected keys
env:
  - name: APP_COLOR
    valueFrom:
      secretKeyRef:
        name: app-secret
        key: DB_Password
  • Secrets can also be mounted as volumes. When mounted, one file is created inside the container for each key in the secret
volumes:
  - name: app-secret-volume
    secret:
      name: app-secret
# 3 files are created corresponding to the 3 keys in the secret
ls /opt/app-secret-volumes
DB_Host  DB_Password  DB_User

Multi container pods

  • There can be cases when we need 2 services to work together - scale up/down together, share the same network (accessible via localhost), share the same volumes. An example would be a web server and a logging service
  • Use 2 containers defined in containers section of spec
...
spec:
  containers:
  - name: sample-app
    image: sample-app:1.1
  - name: logger
    image: log-agent:1.5
  • 3 multi-container pod design patterns [discussed in CKAD course]
    • sidecar: For example using logging service with app container
    • adapter
    • ambassador

Init containers

  • Init containers are used for doing some task before the actual container starts, like waiting for another service to be up or checking out source code from a repository. They execute only once, at the beginning
  • Similar to containers but defined under the initContainers section in the spec section - it is a list, so a pod can have multiple init containers and they execute in sequence as defined
  • If an init container fails, the whole pod is restarted
spec:
  containers:
  - name: myapp-container
    image: busybox:1.28
    command: ['sh', '-c', 'echo The app is running! && sleep 3600']
  initContainers:
  - name: init-myservice
    image: busybox:1.28
    command: ['sh', '-c', 'until nslookup myservice; do echo waiting for myservice; sleep 2; done;']
  - name: init-mydb
    image: busybox:1.28
    command: ['sh', '-c', 'until nslookup mydb; do echo waiting for mydb; sleep 2; done;']

Cluster maintenance

OS upgrades

  • If a node goes down, k8s waits for the pod eviction timeout (default 5 mins) before rescheduling that node's pods onto other nodes. If the node comes back up before that, pods are started on the same node
  • For maintenance purposes we can drain a node so that all its pods get scheduled on other nodes. Pods which are not managed by a deployment or replicaSet are lost and not scheduled elsewhere (kubectl warns about this; they can still be deleted using the --force option)
k drain <node-name>
k drain <node-name> --ignore-daemonsets
  • When the node is back again, we can uncordon it to make the node available for scheduling new pods
k uncordon <node-name>
  • There is another command, cordon, which makes a node unschedulable for new pods; existing pods keep running on that node
k cordon <node-name>

Kubernetes versions

# This gives client and server versions
# client = kubectl version
# server = kubernetes version
k version
k version --short

k get nodes  # Also gives kubelet version running
  • Version = x.y.z, where x = major version, y = minor version, z = patch version

Cluster upgrade process

  • kube-apiserver is the main component in a k8s cluster; if it is at (minor) version X then
    • controller-manager and kube-scheduler can be at most one version lower than X
    • kubelet and kube-proxy can be at most 2 versions lower than X
    • None of them can be at a higher version than X
    • However kubectl can be at any version between X - 1 and X + 1
  • At any given point in time, the k8s community supports the latest 3 minor versions
  • 2 steps in upgrading a cluster
    • First upgrade master nodes: While the master node upgrade is in progress, workloads on worker nodes continue to work but management functions won't, e.g. we can't create or delete a pod, and if a pod crashes it won't be rescheduled
    • Then worker nodes: We have 3 strategies for this
      • Upgrade all worker nodes at the same time - requires downtime
      • Upgrade one node at a time - a kind of rolling upgrade
      • Add new nodes with the upgraded version, then remove the existing nodes
  • The recommended approach is to upgrade one minor version at a time - do not skip versions
kubeadm upgrade plan

# This command does not upgrade kubelet; we have to upgrade kubelet by going (ssh) to each node and upgrading it there
kubeadm upgrade apply <version>
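  • A hedged sketch of upgrading one worker node at a time, assuming a kubeadm cluster on Debian-based nodes (package versions are placeholders)
k drain node01 --ignore-daemonsets                       # Run from master: move workloads off the node
# On node01:
apt-get install -y kubeadm=1.28.x-00 kubelet=1.28.x-00   # Install upgraded packages
kubeadm upgrade node                                     # Upgrade the node's kubelet config
systemctl restart kubelet
# Back on master:
k uncordon node01                                        # Make the node schedulable again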

Backups and restore methods

We need to back up the below components in a cluster

  • Resource configs: Take a backup of all resources deployed (either using the imperative or declarative way)
    • We can use the below command to get all deployed resources
    k get all --all-namespaces -o yaml > all-resources.yaml
    • There are other solutions already built for taking backups of all resources, like Velero (by Heptio)
  • ETCD cluster: Stores the state of the cluster; etcd also runs as a static pod on master nodes
    • Taking a backup of ETCD also captures all resource information
    • We can take a backup of ETCD using the etcdctl command
    # trusted-ca-file, cert-file and key-file can be obtained from the description of the etcd Pod
    ETCDCTL_API=3 etcdctl --endpoints=https://<IP>:2379 \
      --cacert=<trusted-ca-file> --cert=<cert-file> --key=<key-file> \
      snapshot save <backup-file-location>
    
    # To restore this snapshot
    # 1 - we can first stop kube api server
    service kube-apiserver stop
    
    # 2 -  then restore from backup
    ETCDCTL_API=3 etcdctl restore <snapshot-name>.db --data-dir=/var/lib/etcd-from-backup
    
    # 3 - update etcd with new path, etcd is static pod so update manifests file (by default here - /etc/kubernetes/manifests/etcd.yaml)
    
    # 4 - reload service daemon and restart etcd service
    systemctl daemon-reload
    service etcd restart
    
    # 5 - start kube apiserver
    service kube-apiserver start
  • Volumes

Security

All communication between the various k8s components is TLS based

Authentication

  • kube-apiserver serves all requests to the cluster so it is responsible for authenticating those requests. A user can send requests using the kubectl command or curl
  • Authentication can be done using the methods below; it is configured while starting kube-apiserver
  • Basic authentication
    • Static password file: Username and password are kept in a csv file. When requesting with curl we can use the -u option to specify username/password
    • Static token file: Instead of a password, a token is kept in the file. This token can be sent in the header of the HTTP request
    • Note: Both of the above methods are not recommended as they are not secure, so we use certificate based authentication

TLS

  • Symmetric encryption: The same key is used for encryption and decryption. The problem is sharing that key between client and server securely
  • Asymmetric encryption: Uses 2 keys
    • Public key: For encryption
    • Private key: For decryption
  • ssh also uses asymmetric encryption - ssh-keygen generates a public and private key pair. The private key is used to log in to the server and the public key is used to lock access to the server
HTTPS flow
  • Key exchange - PKI (Public key infrastructure)
    • The server shares its public key (certificate) with the client
    • The client generates an encryption key and sends it back to the server - this key is encrypted using the public key and can only be decrypted by the server using its private key
    • Now both client and server have exchanged the encryption key securely, and it can be used to encrypt further messages
  • Domain authorization
    • Along with the public key (from server to client), a digital certificate is also sent which is signed/approved/authorized by a certificate authority (CA) to confirm that the domain actually is what it says - like xyz.com is actually xyz.com and not someone else with a fraudulent identity. Some popular CAs are symantec, digicert, globalsign, ...
    • The domain owner has to generate a certificate signing request (CSR) and send it to the CA; the CA then verifies all details and sends back the signed certificate
    • How are CAs validated? Each CA also has a pair of public and private keys (these are called root certificates); they sign certificates using the private key, and their public keys are stored in each client (like browsers), so the client can verify that a certificate was signed by an authorized CA
    • For internal use, we can host our own CA and sign certificates ourselves
  • Naming conventions
    • Public key(certificate): *.crt, *.pem
    • Private key: *.key, *-key.pem
  • Note: A private key can also be used to encrypt data which can then be decrypted with the public key, but this is never done for secrecy because anyone having the public key would be able to decrypt it
  • Everything mentioned above verifies that we are communicating with the right server using its certificate; there can also be cases when the server needs to verify that it is communicating with the correct client and can ask the client for a client certificate

TLS in k8s

Based on interaction, we can have server and client components in k8s. Each component will have its own certificate

  • Server
    • kube-apiserver: apiserver.crt, apiserver.key
    • etcd: etcdserver.crt, etcdserver.key
    • kubelet: kubelet.crt, kubelet.key
  • Client: All the below clients talk to kube-apiserver
    • User(admin): admin.crt, admin.key
    • kube-scheduler: scheduler.crt, scheduler.key
    • kube-controller-manager: controller-manager.crt, controller-manager.key
    • kube-proxy: kube-proxy.crt, kube-proxy.key
  • We also need at least one CA to sign certificates for all the above components; the CA itself also has a certificate pair - ca.crt, ca.key

TLS in k8s - certificate creation

  • Generate CA self signed certificates - root certificates
# 1. Generate keys
openssl genrsa -out ca.key 2048

# 2. Certificate signing request
openssl req -new -key ca.key -subj "/CN=KUBERNETES-CA" -out ca.csr

# 3. Sign certificates
openssl x509 -req -in ca.csr -signkey ca.key -out ca.crt

# Now for all other certificates, we will use this key pair to sign them
  • Generate certificates for other components and sign using above CA - like admin user certificate
# 1. Generate keys
openssl genrsa -out admin.key 2048

# 2. Certificate signing request
openssl req -new -key admin.key -subj "/CN=kube-admin" -out admin.csr
openssl req -new -key admin.key -subj "/CN=kube-admin/O=system:masters" -out admin.csr  # Admin user

# 3. Sign certificates - using CA key pair
openssl x509 -req -in admin.csr -CA ca.crt -CAkey ca.key -out admin.crt
  • Now that we have the admin user certificate, we can use it in 3 ways
    • curl command
    curl https://kube-apiserver:6443/api/v1/pods \
      --key admin.key --cert admin.crt \
      --cacert ca.crt
    • kubectl command
    kubectl get pods \
      --server kube-apiserver:6443 \
      --client-key admin.key \
      --client-certificate admin.crt \
      --certificate-authority ca.crt
    • Specifying certificates with each command is not very handy, so we add this information to a kubeconfig file and then specify that file with the command
    kubectl get pods --kubeconfig ~/.kube/<config-name>
    
    # By default kubeconfig file used is ~/.kube/config
  • Note: Each component should have the root certificate file (ca.crt) present with it

View certificate details

  • We should know how the cluster was set up, e.g. if the cluster was set up using kubeadm then all certificates are placed at /etc/kubernetes/pki/
  • If we want to know the details of a component's certificate, we can use the below command - it will print details like expiry, issuer, alternate names, ...
openssl x509 -in /etc/kubernetes/pki/apiserver.crt -text -noout

# Decode CSR file
openssl req -in filename.csr -noout -text

Certificates API

  • The kubeadm tool creates a pair of CA keys (public and private) and places them on the master node, so the master becomes our CA server. All new CSRs go to the master for signing
  • When a new user wants to access the cluster, they create a CSR and send it to the admin; the admin then creates a CSR object using a yaml manifest file - kind: CertificateSigningRequest
...
kind: CertificateSigningRequest
...
spec:
  ...
  request:
    <base64 encoded CSR>
  • Now admin can use kubectl commands to view/approve CSRs
k get csr                     # Get list of all CSRs
k certificate approve <name>  # Approve CSR
k get csr <name> -o yaml      # Gives user certificate in base64 format
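  • A minimal sketch of extracting the approved certificate from the CSR object (jane is a placeholder CSR name)
k get csr jane -o jsonpath='{.status.certificate}' | base64 -d > jane.crt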
  • On the master node all certificate related operations are taken care of by the controller-manager - it has csr-approving and csr-signing controllers
  • To sign CSRs, the controller manager should have the root certificates (CA key pair) - while starting, the controller manager accepts them via --cluster-signing-cert-file and --cluster-signing-key-file

Kubeconfig

  • The kubeconfig file has 3 sections
    • Clusters: List of clusters (dev, prod) with the CA root certificate - ca.crt
    • Users: List of users (admin, readonly) with their certificate key pairs (crt and key)
    • Contexts: Combination of the above 2 - which cluster to use with which user, like readonly@prod, admin@dev, ... At the top level of the config file we also have a default (current) context to use if we don't explicitly choose one
k config view                       # See current kubeconfig file
k config use-context prod@readonly  # Change current context. This command updates the `current-context` field in kubeconfig file

# Use some other kubeconfig (default is ~/.kube/config)
export KUBECONFIG=/path/my-kube-config

# Set default context of given kubeconfig to context-1
k config --kubeconfig=/path/my-custom-config use-context context-1
  • We can also set the namespace in the context section of the kubeconfig file to point to a specific namespace; by default it points to the default ns
# Set the context namespace to dev; for subsequent commands we don't have to specify the ns name
k config set-context --current --namespace=dev
  • To debug problems with kubeconfig file, we can use cluster-info command
# Use current kubeconfig
k cluster-info

# Use custom kubeconfig
k cluster-info --kubeconfig=/path/to/kubeconfig
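  • For reference, a minimal sketch of the kubeconfig file structure (names and paths are placeholders)
apiVersion: v1
kind: Config
current-context: admin@dev
clusters:
- name: dev
  cluster:
    certificate-authority: /etc/kubernetes/pki/ca.crt
    server: https://kube-apiserver:6443
contexts:
- name: admin@dev
  context:
    cluster: dev
    user: admin
    namespace: default
users:
- name: admin
  user:
    client-certificate: /etc/kubernetes/pki/users/admin.crt
    client-key: /etc/kubernetes/pki/users/admin.key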

API groups

  • Objects in k8s are categorised in different API groups
    • /metrics: Getting metrics
    • /healthz: Get health information
    • /version: Get cluster version
    • /api: Interact with various core resources like pods, configMaps, namespace, etc.
    • /apis: Named APIs, further categorized into below API groups
      • /apps: /v1/deployments, /v1/replicasets, /v1/statefulsets
      • /extensions
      • /networking.k8s.io: /v1/networkpolicies
      • /storage.k8s.io
      • /authentication.k8s.io
      • /certificates.k8s.io: /v1/certificatesigningrequests
    • /logs: For fetching logs
  • Verbs are operations on the resources of an API group, like get, list, update, ...
  • To list all API groups we can do a curl on cluster domain name
curl http://<api-server>:6443

# The above command will fail because we haven't specified certificates, so we can use `kubectl` to start a proxy client which takes the certs from `kubeconfig` and runs on localhost
kubectl proxy
Starting to serve on 127.0.0.1:8001

# Now we can access the cluster with curl via this proxy - it uses credentials from kubeconfig and forwards requests to the api server
curl http://localhost:8001  # List all API groups
curl http://localhost:8001/version
curl http://localhost:8001/api/v1/pods

Authorization

  • Once a user/machine gains access to the cluster, what it can do is defined by authorization
  • Authorization mechanisms
    • Node: Used by agents inside the cluster like kubelet; these requests are authorized by the Node authorizer. If the name in a certificate has a system prefix like system:node, it is a system component and is authorized using the node authorizer
    • ABAC: Attribute based access control, for external access
      • This associates user(s) with a set of permissions
      • We can create these policies using kind: Policy
      • Management is harder because we have to update the policy for each user whenever permissions need to change
    • RBAC: Role based
      • Instead of a user(s) <> permission mapping, we create a role like developer or security-team with a set of permissions, then associate users with the role
    • Webhook: Outsource authorization to other tools like open policy agent
  • We can set authorization-mode in kube-apiserver (by default it is AlwaysAllow); it can have multiple values like Node,RBAC,Webhook - a request is checked against each module in order until access is granted or the chain ends

RBAC

  • To create a role, we create a Role object. In the rules section we add various access permissions. A Role is namespace scoped
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: developer
  namespace: testing
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["list", "get", "create", "update", "delete"]
- apiGroups: [""]
  resources: ["ConfigMaps"]
  verbs: ["create"]
  • Link user(s) to role - using RoleBinding object
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: devuser-developer-binding
subjects:
- kind: User
  name: dev-user
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: developer
  apiGroup: rbac.authorization.k8s.io
  • We can also check whether a user (self or another) has access to perform some operation
k auth can-i create deployments
k auth can-i delete nodes
k auth can-i create pods --as dev-user

# Does dev-user have permission to create pods in the test namespace?
k auth can-i create pods --as dev-user --namespace test
  • We can also restrict access to specific resource instances, using the resourceNames field in rules
...
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "create", "delete"]
  resourceNames: ["blue", "green"]
  • Imperative ways
k create role pod-reader --verb=get --verb=list --verb=watch --resource=pods
k create rolebinding pod-reader-binding --role=pod-reader --user=bob --namespace=acme

Cluster role and role bindings

  • Resources in k8s can be namespaced (pods, rs, cm, roles) or cluster scoped (nodes, clusterroles) - we can get the whole list using
k api-resources --namespaced=true   # Get all namespaced resources
k api-resources --namespaced=false
  • clusterrole and clusterrolebinding have cluster scope (remember a role has ns scope) - a role created this way has cluster level access
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cluster-administrator
rules:
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["list", "get", "create", "delete"]
  • Link user(s) to cluster role - using ClusterRoleBinding object
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cluster-admin-role-binding
subjects:
- kind: User
  name: cluster-user
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: cluster-administrator
  apiGroup: rbac.authorization.k8s.io
  • Imperative ways
k create clusterrole pod-reader --verb=get,list,watch --resource=pods
k create clusterrolebinding pod-reader-binding --clusterrole=pod-reader --user=root
  • Although a clusterRole is cluster scoped, we can create it for namespaced resources also - it then grants access to that resource across all namespaces. For example, if we create a clusterRole for pods, the role will have access to pods across all namespaces

Service accounts

  • 2 types of accounts
    • User: used by humans like Admin, developer
    • Service: used by machines like build tools, prometheus
    k create serviceaccount dashboard-sa
    k get serviceaccounts
    k describe serviceaccounts dashboard-sa
  • Imperative way
# Grant read-only permission within "my-namespace" to the "my-sa" service account
k create rolebinding my-sa-view \
  --clusterrole=view \
  --serviceaccount=my-namespace:my-sa \
  --namespace=my-namespace
  • A service account has a token which is used by a third party service to access the cluster (the kube-apiserver, e.g. using curl); this token is kept as a secret. The sa can then be associated with a role using RBAC for specific access
  • If the third party service is running in the cluster itself, as a pod, then we can mount this secret as a volume and the pod can access it directly - use the serviceAccountName field in the spec section, as sketched below
  • A default service account is also created in each ns and is mounted into each pod if we don't specify any other
  • automountServiceAccountToken: false - don't mount the service account token into the pod
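  • A minimal sketch of a pod using the dashboard-sa service account created above (image name is a placeholder)
apiVersion: v1
kind: Pod
metadata:
  name: my-kubernetes-dashboard
spec:
  serviceAccountName: dashboard-sa       # Use this sa instead of the default one
  # automountServiceAccountToken: false  # Uncomment to skip mounting the sa token
  containers:
  - name: my-kubernetes-dashboard
    image: my-kubernetes-dashboard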

Image security

  • When we specify an image in a pod definition file, it follows the docker naming convention - image: nginx actually becomes image: docker.io/library/nginx where
    • docker.io is the default registry to look for the image
    • library is the default user/account
    • nginx is the repository name of the image
  • gcr.io is another public registry where all k8s related images are stored, e.g. for end to end testing gcr.io/kubernetes-e2e-test-image/dnsutils
  • Public cloud providers also have container registry services, like ECR from AWS
  • Private registry: Stores images which are not public and requires credentials to access - using docker login
docker login private-registry.io
docker run private-registry.io/apps/internal-app
  • To use a private registry in a pod definition file, we have to create a secret of type docker-registry and specify its name in the pod definition
k create secret docker-registry regcred \
    --docker-server=private-registry.io \
    --docker-username=registry-user \
    --docker-password=registry-password \
    --docker-email=registry-user@org.com
...
kind: Pod
spec:
  containers:
  - name: internal-app
    image: private-registry.io/apps/internal-app
  imagePullSecrets:
  - name: regcred
...

Security context

  • Security context can be set at the pod and/or container level
    • Pod level: Applies to all containers defined in this pod definition
    ...
    spec:
      securityContext:
        runAsUser: 1000  # Default is root, skip this if want to run as root
      containers:
        ...
    • Container level: Applies to a specific container. Note: if set at both pod and container level, the container level settings take precedence
    ...
    spec:
      containers:
      - name: ...
        image: ...
        securityContext:
          runAsUser: 1000
          capabilities:
            add: ["MAC_ADMIN"]
  • We can also add container capabilities, which can only be set at the container level (as in the above example)

Network policy

  • 2 types of traffic - Ingress and Egress
  • Ingress: Traffic coming into the server/network
  • Egress: Going out of server/network
  • Replying back to the client does not require egress configuration; responses are allowed by default
  • Ingress or egress is always looked at from that specific server's perspective - e.g. for a DB we only require ingress traffic
  • K8s is by default configured with an "all allow" policy, meaning any pod can communicate with any other pod/service within the cluster - using pod IP, name, etc.
  • To restrict traffic we apply a network policy to a pod; this is done by selecting pods via labels in a NetworkPolicy object. The below example applies a network policy on db so that only api-pods can connect to db on port 3306 - this restricts others, like web server pods, from accessing the db
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: db-policy
spec:
  podSelector:
    matchLabels:
      role: db
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          name: api-pod
    ports:
    - protocol: TCP
      port: 3306
  • Network policies are enforced by the networking solution implemented on the k8s cluster, and not all networking solutions support network policies. Solutions that support them include kube-router, calico, romana, weave-net
  • We can further filter down whom to allow with a namespace selector
...
  ingress:
  - from:
    - podSelector:
        matchLabels:
          name: api-pod
      namespaceSelector:
        matchLabels:
          name: prod
...
  • For situations like allowing a backup server which is not deployed as a pod in the cluster, we can also allow a specific IP address
...
  ingress:
  - from:
    - podSelector:
        matchLabels:
          name: api-pod
    - ipBlock:
        cidr: 192.168.5.10/32
...
  • For configuring egress, from in the ingress rule becomes to and the rest remains the same, as sketched below
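  • A minimal egress sketch, assuming the db pods need to push backups to an external server at 192.168.5.10 on port 80
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: db-egress-policy
spec:
  podSelector:
    matchLabels:
      role: db
  policyTypes:
  - Egress
  egress:
  - to:
    - ipBlock:
        cidr: 192.168.5.10/32
    ports:
    - protocol: TCP
      port: 80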

Storage

Storage in docker

Container storage interface

  • Initially k8s only worked with the docker runtime and its code was embedded into k8s, but with other container runtimes coming in (like rkt, cri-o), docker support was moved out of the k8s core and the container runtime interface (CRI) was developed
  • CRI governs the interface so that when a new runtime is developed, it knows how to communicate with k8s and k8s doesn't have to change to support it
  • Similar to CRI, the container networking interface (CNI) and container storage interface (CSI) were developed. CSI is a standard followed by storage drivers to work with any orchestration tool; some storage drivers are portworx, Amazon EBS, etc.
  • CSI defines a set of RPCs (like CreateVolume, DeleteVolume) which are called by the orchestrator and must be implemented by the storage drivers

Volumes

  • Like in containers, pod data also gets deleted when the pod is deleted, so to persist the data we use volumes and mounts
  • We attach a volume to a pod using volumeMounts, referring to one of the volumes defined in the pod
apiVersion: v1
kind: Pod
metadata:
  name: random-num
spec:
  containers:
  - image: alpine
    name: alpine
    command: ["/bin/sh", "-c"]
    args: ["shuf -i 0-100 -n 1 >> /opt/number.out;]
    volumeMounts:
    - mountPath: /opt
      name: data-volume
  volumes:
  - name: data-volume
    hostPath:
      path: /data
      type: Directory
  • Now the pod's /opt maps to the host's /data directory; whatever the pod writes to /opt will be present in the host's /data directory even if the pod dies
  • This approach is not recommended if we have a multi node cluster because the directory is specific to a node, so we use external storage solutions like NFS, AWS EBS, etc. and the corresponding option instead of hostPath; for example for AWS EBS we use awsElasticBlockStore
...
volumes:
- name: data-volume
  awsElasticBlockStore:
    volumeID: <volume-id>
    fsType: ext4

Persistent volumes(PV)

  • In the above section we saw how volumes can be created; the problem is that the volume is defined inside each pod definition. If we have a lot of pods, it is hard to add/manage volumes in each pod, so we create a PersistentVolume centrally and pods use it via a PersistentVolumeClaim to claim persistent storage
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-vol1
spec:
  accessModes:
    - ReadWriteOnce
  capacity:
    storage: 1Gi
  awsElasticBlockStore:
    volumeID: <volume-id>
    fsType: ext4

Persistent volume claims(PVC)

  • The admin creates PVs and the user creates a PVC to use the storage
  • When a PVC is created it gets bound to one of the PVs which matches the claim criteria. If the user wants to bind to a specific PV, additional filters (e.g. labels/selectors) can also be provided
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: myclaim
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Mi
  • We can now use the PVC in a pod (or replicaset, deployment) definition
apiVersion: v1
kind: Pod
metadata:
  name: frontend
spec:
  containers:
    - name: myfrontend
      image: nginx
      volumeMounts:
      - mountPath: "/var/www/html"
        name: mypd
  volumes:
    - name: mypd
      persistentVolumeClaim:
        claimName: myclaim
  • We cannot delete a PVC if it is being used by a pod - if we try, it stays in Terminating state until the pod is deleted

Storage classes

  • Before creating a PV, we must create the volume with the provider we are using, e.g. with AWS we must provision the EBS volume before the PV - this is called static provisioning
  • To remove this manual step we use storage classes, which take the provisioner name and create the PV automatically for us, e.g. on AWS or GCP - this is dynamic provisioning
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: google-storage
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-ssd
  replication-type: regional-pd
  • Then in the PVC we can refer to this storage class using storageClassName: google-storage in the spec section; the rest of the PVC definition remains the same, as sketched below
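  • A minimal PVC sketch referring to the storage class above
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: myclaim
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: google-storage  # Use dynamic provisioning via this class
  resources:
    requests:
      storage: 500Mi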

Networking

Switching routing

  • To connect 2 hosts, we connect both hosts to a switch via each host's network interface. Using the ip link command we can check the interface(s) on a host
  • A router connects 2 switches (networks). The router IP is conventionally the first address in the network
  • We can have several routers, so a host should know, when sending a packet to a host in another network, which router to use - for this we use gateways (if the host is a room then the gateway is the door). To configure a gateway (route) on a host, we can use the below command
# To reach any IP in network 192.168.2.0/24 use gateway(router) address 192.168.1.1
# Route should be added on all hosts to send packets to hosts on other n/w
ip route add 192.168.2.0/24 via 192.168.1.1

# We can add default route for all other IPs/NW which we don't know
# Any IP for which explicit route is not added, use 192.168.1.1
ip route add default via 192.168.1.1

# See routes added on host
ip route show
route
  • Using a host as a router: Linux by default doesn't forward packets received on one interface to another; this is disabled for security reasons. We can enable it using
echo 1 > /proc/sys/net/ipv4/ip_forward
  • The above setting is not retained across reboots; to persist it, set net.ipv4.ip_forward=1 in the /etc/sysctl.conf file

DNS

  • We can add custom IP to hostname mappings in the /etc/hosts file. This translation of hostname to IP is known as name resolution
  • Managing host/IP mappings like this is hard when the number of hosts increases (and IPs of hosts can also change), so we use a DNS server and configure each host to point to this DNS server for hostname to IP lookups
  • The IP of the DNS server is added in the /etc/resolv.conf file with the nameserver field, so when a host doesn't know the IP for a hostname it asks this DNS server
  • If an entry for the same hostname is present in both /etc/hosts and the nameserver (DNS server), the host first checks the local /etc/hosts file and only goes to the configured DNS server if it is not found there. This ordering can be changed using the /etc/nsswitch.conf file
  • For public internet hosts (google.com, fb.com, ...) we can configure global DNS servers like 8.8.8.8 (by google) in /etc/resolv.conf, or configure our local DNS server to forward to 8.8.8.8 if a name is not found
  • We can add another entry called search in the /etc/resolv.conf file which appends a domain name to the host we want to look up, like
...
search mycompany.com
...

# If we ping `gitlab`, it will change the domain name to `gitlab.mycompany.com` automatically if it exists
# We can have list in search to have multiple items
  • Record types
    • A: Maps hostnames to IPv4 addresses
    • AAAA: Maps hostnames to IPv6 addresses
    • CNAME: Maps one name to another name (like fb.com is same as facebook.com)
  • Tools
    • ping: Simple, gives IP in ping traces
    • nslookup: Resolves using DNS server, it doesn't take into account local /etc/hosts mappings
    • dig: More detailed
  • Using a host as a DNS server: We have various tools for this, coreDNS is one of them. It runs on port 53, which is the default port for a DNS server

Docker networking

Refer this section: https://gist.github.com/hansrajdas/d950ffd99c3ae817b08fd11592dc82eb#docker-networking

Cluster networking

  • In a k8s cluster we can have multiple nodes - master and workers with unique IPs and mac addresses; below are some ports required to be open for each component in a cluster
  • ETCD (on master node): Port 2379, all control plane components connect to it
  • ETCD (on master node): Port 2380, only for etcd peer-to-peer connectivity
  • kube-apiserver (on master node): Port 6443
  • kubelet (on master and worker nodes): Port 10250
  • kube-scheduler (on master node): Port 10251
  • kube-controller-manager (on master node): Port 10252
  • services / NodePort (on worker nodes): Ports 30000-32767
  • NOTE: If things are not working, these ports are one of the first things to verify

Pod networking and CNI

  • K8s doesn't ship with a networking solution, but it requires that each pod gets a unique IP address and that every pod is reachable from every other pod in the cluster (across nodes as well) without having to configure any NAT rules. In small clusters with a couple of nodes we can configure networking/routing using scripts, but for large clusters it becomes hard to manage, so we use the available networking solutions (plugins) that do this, like weaveworks, flannel, cilium, vmware nsx
  • We can specify the CNI/network-plugin options in kubelet component using below args
...
    --network-plugin=cni \
    --cni-bin-dir=/opt/cni/bin \
    --cni-conf-dir=/etc/cni/net.d \
...

CNI weaveworks

  • A weave agent runs on each node and the agents communicate with each other regarding nodes, networks and pods. Each agent stores the topology of the entire setup and knows the pods and their IPs on other nodes
  • weave creates its own bridge on each node, names it weave and assigns an IP address to each network
  • Deployed as a daemon set to run on each node

IP address management - weave

  • CNI plugin (like weave) assigns IPs to pods. In CNI config file /etc/cni/net.d/net-script.conf we specify IPAM configuration, subnets, routes, etc.
  • Weave creates interface on each host with name weave, use ifconfig command to check
  • Weave default subnet is 10.32.0.0/12 which is 10.32.0.1 to 10.47.255.254, around 1,048,574 IPs for pods

Service networking

  • For services themselves refer to the Services section above; this section discusses service networking
  • kube-proxy runs on each node and listens for changes from the kube-apiserver; every time a new service is created kube-proxy gets into action and sets up forwarding for the service IP. Unlike a pod, a service spans across the cluster
  • kube-proxy creates routing rules corresponding to each service created; in these rules the port is also present, e.g. if a packet comes to IP:PORT, forward it to POD-IP. These rules can be created in 3 ways - userspace, iptables (default), ipvs; this can be configured by setting --proxy-mode in the kube-proxy config
  • Service IP range is configured in kube-api-server
kube-api-server --service-cluster-ip-range ipNet  # Default 10.0.0.0/24

# We can see the rules from NAT tables using iptables
iptables -L -t nat | grep <service-name>

# Check kube-proxy logs for routing created and mode/proxier used
cat /var/log/kube-proxy.log

DNS in kubernetes

  • k8s deploys a built-in DNS server by default when we set up a cluster
  • All pods and services are reachable using their IP addresses within the cluster
  • For each service, k8s creates a DNS record by default which maps the service name to the service IP. Within the same namespace we can access the service using just the service name; from another namespace we have to specify the namespace as well
  • Each service name is a subdomain of its namespace name
  • Each namespace is a subdomain of svc
  • svc is a subdomain of the root domain, called cluster.local by default
service name: web-service
namespace: apps

# Within same namespace
curl http://web-service

# From other namespaces, we can use any
curl http://web-service.apps
curl http://web-service.apps.svc
curl http://web-service.apps.svc.cluster.local  # FQDN
  • DNS records for pods are not created by default but can be enabled; once enabled, the record name is the pod IP with dots replaced by dashes (not the pod name). If a pod's IP is 1.2.3.4, then the record 1-2-3-4 maps to 1.2.3.4
curl http://1-2-3-4.apps.pod.cluster.local
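
To check name resolution from inside the cluster, we can run a throwaway busybox pod (web-service and apps are the example names used above):

# Resolves via the search entries kubelet puts in the pod's /etc/resolv.conf
kubectl run --rm -it dns-test --image=busybox:1.28 --restart=Never -- nslookup web-service.apps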

CoreDNS in kubernetes

  • The original k8s DNS component was kube-dns, but from v1.12 onwards CoreDNS is the recommended DNS server
  • CoreDNS is deployed as a deployment (a replicaSet, typically 2 pods) in the cluster and takes its config from a configMap; inside the CoreDNS pod the config is placed at /etc/coredns/Corefile (see the default Corefile below). CoreDNS watches for any new service (or pod, if enabled in the Corefile) and adds an entry to its database
  • To access CoreDNS, a service named kube-dns is also created. Pods are configured (by kubelet) with the kube-dns service IP in the nameserver field of /etc/resolv.conf. This file also has search entries so a FQDN can be built from just service-name or service-name.namespace
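
The default Corefile looks roughly like this (stored in the coredns configMap in kube-system; exact plugins vary by version). The kubernetes plugin block is what creates the cluster.local records:

# View it with: kubectl describe configmap coredns -n kube-system
.:53 {
    errors
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
    }
    prometheus :9153
    forward . /etc/resolv.conf
    cache 30
    loop
    reload
    loadbalance
}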

Ingress

  • K8s object which acts as an application load balancer (Layer 7) - directs requests to different services based on URL path or host
  • It becomes a single place where SSL can be implemented, independent of all the services behind it
  • Ingress deployment - we need two things
    • Ingress controller: One of the third-party solutions like nginx, HAProxy, etc. K8s doesn't come with a default ingress controller, so one has to be installed. We will use nginx as an example and see what objects are required to deploy the nginx ingress controller
      • Deployment: The image used is a modified version of nginx: quay.io/kubernetes-ingress-controller/nginx_ingress_controller
      • Service: Of type NodePort with a selector matching the above ingress controller
      • ConfigMap: To store nginx config data
      • ServiceAccount: With the roles and bindings needed to access the required objects (Role, ClusterRoleBinding, RoleBinding)
    • Ingress resources: Configuration rules on the ingress controller to route traffic to a specific service based on URL, like p1.domain.com should go to the p1 service, p2... to p2, or domain.com/p1 to p1, and so on. This resource is created using the definition file below (a newer networking.k8s.io/v1 example is shown at the end of this section)
    apiVersion: extensions/v1beta1
    kind: Ingress
    metadata:
      name: ingress-wear
    spec:
      backend:
        serviceName: wear-service  # Route all traffic to wear service
        servicePort: 80
    • We can define rules (with paths) in ingress resources to map traffic from different URLs to specific service
    ...
    spec:
      rules:
      - http:
          paths:
          - path: /wear
            backend:
              serviceName: wear-service
              servicePort: 80
          - path: /watch
            backend:
              serviceName: watch-service
              servicePort: 80
    • We can define rules (with host) in ingress resources to map traffic from different subdomains to specific service
    ...
    spec:
      rules:
      - host: wear.my-online-store.com
        http:
          paths:
          - backend:
              serviceName: wear-service
              servicePort: 80
      - host: watch.my-online-store.com
        http:
          paths:
          - backend:
              serviceName: watch-service
              servicePort: 80
    • Imperative way of creating ingress resources
    kubectl create ingress <ingress-name> --rule="host/path=service:port"
    
    # Example
    kubectl create ingress ingress-test --rule="wear.my-online-store.com/wear*=wear-service:80"
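
Note: the extensions/v1beta1 API used above has been removed in newer k8s versions; with the stable networking.k8s.io/v1 API the same path based rule looks roughly like this:

    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: ingress-wear
    spec:
      rules:
      - http:
          paths:
          - path: /wear
            pathType: Prefix
            backend:
              service:
                name: wear-service
                port:
                  number: 80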

Designing a cluster

  • Cluster design depends on the purpose of the cluster; based on that, it can be built in different ways
    • minikube: Used to deploy a single node cluster very easily. It provisions a VM and then runs k8s on it
    • kubeadm: Used to deploy a multi node cluster. It expects the VMs to already be provisioned
  • There is no native way to run k8s on Windows; we have to provision a Linux based VM on Windows to use k8s
  • For an HA cluster, we use multiple master nodes fronted by a load balancer which directs requests to one of the master nodes. Master nodes have the below components running
    • API server: Active on all master nodes
    • Controller manager (replication & node): Only one is active, the others are on standby; the active one is chosen using leader election
    • Scheduler: Only one is active, the others are on standby; the active one is chosen using leader election
    • ETCD: It is a distributed system, so the API server can reach any of the running ETCD instances for reads or writes
  • ETCD generally runs on the master nodes, but for complex (and HA) clusters we can run ETCD on separate nodes and point the master nodes at them
  • We can run the cluster on-prem or in the cloud. In the cloud, we have the option to self-manage the cluster or use managed solutions like EKS (AWS), GKE (GCP), ...

ETCD in HA

  • ETCD is a distributed, reliable key-value store that is simple, secure and fast
  • A client can connect to any ETCD instance in the cluster and perform read/write operations. Writes go through the leader: if 2 writes come in at the same time on 2 different instances, they are forwarded to the leader, and a write is complete only once the leader gets consent from the other instances in the cluster
  • The leader is elected using the RAFT algorithm - a voting/election kind of mechanism
  • A write is considered successful once a quorum = floor(N/2) + 1 of instances has the write propagated; if fewer than quorum (a majority of) instances are available, the cluster cannot process writes
  • It is recommended to have an odd number of instances for better fault tolerance (see the table below)
  • For installation, we can download the latest binary from GitHub. The etcdctl utility can be used to access the ETCD cluster
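
Quorum vs. fault tolerance for different cluster sizes, which is why an even number of instances adds no extra fault tolerance:

Instances (N)   Quorum (floor(N/2) + 1)   Tolerable failures
1               1                         0
2               2                         0
3               2                         1
5               3                         2
7               4                         3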

Install kubernetes the "kubeadm" way

Steps to setup cluster using kubeadm tool

  • Have multiple hosts and designate one or more as master nodes - we can also use Vagrant to provision virtual machines; this Vagrantfile provisions one master and 2 worker nodes
  • Install a container runtime like docker on each host (master & worker)
  • Install kubeadm, kubelet and kubectl on all hosts (master & worker)
  • Initialize the master node - this sets up all master node components (see the command sketch after this list)
  • Set up a pod networking solution like calico, weave net, etc. on all nodes so that all pods can communicate with each other
  • Join worker nodes to the master node - the join command is printed when kubeadm init runs on the master; run it on each worker node
  • Launch applications - create pods
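
A minimal sketch of the commands involved; the pod network CIDR is an assumption that depends on the CNI chosen, and the token/hash placeholders come from the kubeadm init output:

# On the master node
kubeadm init --pod-network-cidr=10.244.0.0/16

# Configure kubectl for the current user (these commands are printed by kubeadm init)
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

# On each worker node, using the join command printed by kubeadm init
kubeadm join <master-ip>:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>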

Debugging failures

JSON PATH

  • When dealing with a cluster with a large number of nodes and objects, it becomes hard to query each node/object and check for relevant information. Using the jsonpath option of kubectl we can print only the relevant fields, and filter and sort on specific fields
k get nodes -ojsonpath='{.items[*].metadata.name}'        # Prints only node name
k get nodes -ojsonpath='{.items[*].status.capacity.cpu}'  # Prints cpu
...

# Print node name and cpu info
k get nodes -ojsonpath='{.items[*].metadata.name}{"\n"}{.items[*].status.capacity.cpu}'

# We can format output using loops
k get nodes -ojsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.capacity.cpu}{"\n"}{end}'

# Using custom columns is another way to printing required information - as above
k get nodes -ocustom-columns=<COLUMN NAME>:<JSON PATH>

# Print node name and CPU
k get nodes -ocustom-columns=NODE:.metadata.name,CPU:.status.capacity.cpu

# We can also use sort-by option to sort according to some value (using json path)
k get nodes --sort-by=.metadata.name

# Filter based on a specific condition - get the context name for user `aws-user`
kubectl config view --kubeconfig=/root/my-kube-config -ojsonpath='{.contexts[?(@.context.user=="aws-user")].name}'

Other stuff

Delete resource stuck in Terminating state

# Example to delete a namespace
kubectl get namespace "ns1" -o json | tr -d "\n" | sed "s/\"finalizers\": \[[^]]\+\]/\"finalizers\": []/" | kubectl replace --raw /api/v1/namespaces/ns1/finalize -f -

Last applied configuration

  • The last applied configuration is kept alongside the live yaml configuration. It helps k8s figure out when something has been removed so it can be deleted from the deployed version. For example, if a label is removed in the new file being applied, k8s checks whether it was present in the last applied config and, if so, deletes it from the deployed version.
  • The last applied configuration is only stored when we use the kubectl apply command; with kubectl create/replace this info is not stored.
  • So 3 things are compared when using kubectl apply command
    • New yaml file
    • Deployed yaml version
    • Last applied configuration
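
The last applied configuration is stored on the live object itself, in the kubectl.kubernetes.io/last-applied-configuration annotation. Two ways to view it (myapp is a placeholder deployment name):

kubectl apply view-last-applied deployment/myapp
kubectl get deployment myapp -o jsonpath='{.metadata.annotations.kubectl\.kubernetes\.io/last-applied-configuration}'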

Labels, selectors and annotations

  • Labels can be applied to k8s objects and used as selectors for filtering the required objects
  • Like labels, we can also have annotations, which hold metadata info like build version, contact details, etc.
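
A small example of labels and annotations on a pod, plus filtering by label (all names and values here are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: web-pod
  labels:
    app: App1
    tier: frontend
  annotations:
    buildversion: "1.34"
spec:
  containers:
  - name: nginx
    image: nginx

# Filter objects using the label as a selector
kubectl get pods --selector app=App1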

Deployments vs stateful sets

  • Deployment - You specify a PersistentVolumeClaim that is shared by all pod replicas. In other words, shared volume. The backing storage obviously must have ReadWriteMany or ReadOnlyMany accessMode if you have more than one replica pod.
  • StatefulSet - You specify a volumeClaimTemplates so that each replica pod gets a unique PersistentVolumeClaim associated with it. In other words, no shared volume. Here, the backing storage can have ReadWriteOnce accessMode. StatefulSet is useful for running things in cluster e.g Hadoop cluster, MySQL cluster, where each node has its own storage.
  • Read more here: https://stackoverflow.com/questions/41583672/kubernetes-deployments-vs-statefulsets
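
A minimal StatefulSet sketch showing volumeClaimTemplates (image, names and sizes are illustrative); each replica gets its own PVC, e.g. data-mysql-0, data-mysql-1:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mysql
spec:
  serviceName: mysql            # a headless service with this name is expected to exist
  replicas: 3
  selector:
    matchLabels:
      app: mysql
  template:
    metadata:
      labels:
        app: mysql
    spec:
      containers:
      - name: mysql
        image: mysql:8.0
        env:
        - name: MYSQL_ROOT_PASSWORD
          value: example        # illustrative only, use a secret in practice
        volumeMounts:
        - name: data
          mountPath: /var/lib/mysql
  volumeClaimTemplates:         # one PVC per replica, no shared volume
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 1Gi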

Commands

Note: We have used nginx as the pod name in all commands; replace it with your specific pod name. We have aliased the kubectl command to k

alias k=kubectl

Create/run

# Create a pod
k run nginx --image=nginx

# Create pod with label
k run nginx --image=nginx -l tier=msg

# Create pod and expose port
kubectl run httpd --image=httpd:alpine --port=80 --expose

k create deployment httpd-frontend --image=httpd:2.4-alpine
k create namespace dev  # Create 'dev' namespace

# Doesn't create the object, only gives the yaml file
k run nginx --image=nginx --dry-run=client -o yaml > pod-definition.yaml
k create deployment nginx --image=nginx --dry-run=client -o yaml > nginx-deployment.yaml
k create service clusterip redis --tcp=6379:6379 --dry-run=client -o yaml > service-definition.yaml

# Run a pod to debug or run some command like checking nslook from a pod for a service - we can use busybox image
# --rm will delete pod once command is completed or we exit from shell prompt
kubectl run --rm -it debug1 --image=<image>  --restart=Never -- <command>
kubectl run --rm -it debug1 --image=busybox:1.28  --restart=Never -- sh  # Attach with shell

Deploy yaml file

k apply -f filename.yaml
k create -f filename.yaml

# Deploy this in given namespace. This ns info can also be added in yaml definition itself
# to avoid giving in command always, like when creating a pod, it can be added in metadata section
k create -f filename.yaml -n my-namespace

Get

k get all                       # Get all k8s objects deployed
k get pods                      # Get list of all pods in current namespace like default
k get pods -n kube-system       # Get list of all pods in 'kube-system' namespace
k get pods --all-namespaces     # Get pods in all ns
k get pods -o wide              # Gives more info like IP, node, etc.
k get pods nginx                # Get specific pod info
k get pods --show-labels        # Get labels column also
k get pods --no-headers         # Don't print header
k get pods --selector app=App1  # Get pods having "app=App1" label
k get pods -l app=App1          # -l is short for --selector

# Pods running on a node
k get pods -A --field-selector spec.nodeName=<nodeName>

# Using jq - this general command can be used to filter any other parameter
k get pods -A --field-selector spec.nodeName=<nodeName> -o json | jq -r '.items[] | [.metadata.namespace, .metadata.name] | @tsv'

k get replicationcontrollers  # Get list of replica controllers

k get replicaset
k get deployments

k get services

k get daemonsets

k get events

Describe

k describe pod
k describe pod nginx

k describe replicaset myapp-replicaset

k describe deployments
k describe services

k describe daemonsets <name>

Edit

k edit pod nginx  # Opens this pod's yaml in an editor so we can make changes

k edit replicaset myapp-replicaset

Delete

k delete pod nginx

k delete replicaset myapp-replicaset

Scale replicaSets

k replace -f replicaset-definition.yml  # Update the number of replicas in the yaml file and redeploy it

k scale --replicas=6 -f replicaset-definition.yml
k scale --replicas=6 replicaset myapp-replicaset

k scale deployment --replicas=3 httpd-frontend

Others

  • Update the image in a deployment (but take care: the deployment's definition file will still have the originally specified image version)
k set image deployment/myapp-deployment nginx=nginx:1.9.1
  • See all options available for a resource
k explain <kind>           # Format
k explain pod              # See top level options
k explain pod --recursive  # See all options

# See all tolerations options
k explain pod --recursive  | grep -A5 tolerations

# Get node summary like free persistent volume(pv) space, which we can't find with other commands
kubectl get --raw /api/v1/nodes/ip-10-3-9-207.us-west-2.compute.internal/proxy/stats/summary

Certification tip

References
