@HarshadRanganathan
Last active July 25, 2022 21:05
EKS Best Practices
  • Access Control
    • Create the cluster with a dedicated IAM role (the creating principal is automatically granted system:masters permissions, and this mapping cannot be removed)
    • Use IAM Roles when multiple users need identical access to the cluster
    • Employ least privileged access
    • IRSA (IAM Roles for Service Accounts)
      • Update the aws-node daemonset to use IRSA
    • Restrict Access to IMDS v1
    • Use dedicated service accounts for each application
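As a sketch of the IRSA pattern above: a service account annotated with an IAM role ARN lets pods assume that role via the cluster's OIDC provider. The names and ARN here are placeholders, not values from the source.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-app                 # hypothetical application service account
  namespace: my-namespace      # hypothetical namespace
  annotations:
    # Placeholder role ARN; the role's trust policy must allow the
    # cluster's OIDC provider to assume it for this service account.
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/my-app-role
```

Pods that reference this service account receive temporary credentials for the role, so each application gets only the permissions its own role grants.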
  • Use PAC (Policy As Code) or PSS (Pod Security Standards)
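For the PSS option, the built-in Pod Security Admission controller can enforce a standard per namespace via labels; a minimal sketch (namespace name is hypothetical):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: my-namespace   # hypothetical namespace
  labels:
    # Enforce the "restricted" Pod Security Standard; audit and warn
    # surface violations without blocking, useful before tightening.
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
```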
  • To mitigate the risks from hostPath volumes, configure spec.containers.volumeMounts as readOnly
  • securityContext
    • Set allowPrivilegeEscalation to false
    • Set readOnlyRootFilesystem to true
    • Run the application as a non-root user (e.g. spec.securityContext.runAsUser)
  • seccompProfile.type: RuntimeDefault
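The securityContext and seccomp settings above combine into a pod spec like the following sketch (pod name, image, and UID are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hardened-pod          # hypothetical pod
spec:
  securityContext:
    runAsUser: 1000           # run as a non-root UID
    runAsNonRoot: true        # reject the pod if the image defaults to root
    seccompProfile:
      type: RuntimeDefault    # apply the container runtime's default seccomp filter
  containers:
    - name: app
      image: my-app:1.0       # placeholder image
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
```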
  • Audit
    • user.extra.sessionName.0 in the audit log records the session name of the actual user assuming the role (cross-reference with CloudTrail)
  • Set requests and limits
    • It is strongly recommended that container resource usage (a.k.a. Resource Footprints) be data-driven and accurate, based on load testing
    • resources.limits.memory can be padded 20-30% above the observed maximum, to account for potential memory measurement inaccuracies.
    • Set ResourceQuotas on namespaces, and use LimitRanges to apply default requests and limits to containers
    • For critical applications, consider defining requests=limits for the container in the Pod.
    • Do not specify resource limits on CPU. In the absence of limits, the request acts as a weight on how much relative CPU time containers get.
    • For non-CPU resources, configuring requests=limits provides the most predictable behavior.
    • For non-CPU resources, do not specify a limit that is much larger than the request
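Putting those rules together, a container's resources stanza would look like this sketch (values are illustrative, not derived from any load test):

```yaml
    resources:
      requests:
        cpu: "500m"        # acts as a relative weight for CPU time; no CPU limit is set
        memory: "640Mi"
      limits:
        memory: "640Mi"    # requests=limits for memory gives the most predictable behavior
```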
  • Multi-Tenancy
    • Soft Multi-Tenancy
      • Quotas with limit ranges
      • Node affinity
      • Taints & Tolerations
      • Mutating Admission Webhook (CI/CD)
    • Hard Multi-Tenancy
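For soft multi-tenancy, a per-tenant ResourceQuota caps a namespace's aggregate consumption; the tenant name and values below are hypothetical. A LimitRange in the same namespace would supply per-container defaults so pods without explicit requests still count against the quota.

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-a-quota     # hypothetical quota for one tenant's namespace
  namespace: tenant-a      # hypothetical tenant namespace
spec:
  hard:
    requests.cpu: "10"     # aggregate CPU requests across all pods
    requests.memory: 20Gi
    limits.memory: 20Gi
    pods: "50"             # cap on pod count for the namespace
```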
  • Network Policies
    • Create a default deny policy
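A default-deny policy selects every pod in a namespace and allows no traffic, so each permitted flow must be opened by a more specific policy. A minimal sketch (namespace is a placeholder; apply one per namespace):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: my-namespace   # hypothetical namespace
spec:
  podSelector: {}           # empty selector matches every pod in the namespace
  policyTypes:
    - Ingress
    - Egress
```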
  • Service Mesh
    • Encryption In Transit (mTLS)
  • Storage
    • EFS
      • Access Points
      • Alternatively, using EFS can simplify cluster autoscaling when running applications that need persistent storage, since EFS volumes are accessible from every AZ
    • EBS
      • Create Auto Scaling Group for each AZ with enough capacity to ensure that the cluster always has capacity to schedule pods in the same AZ as the EBS volumes they need.
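Complementing the per-AZ capacity advice, a StorageClass with WaitForFirstConsumer delays EBS volume creation until the pod is scheduled, so the volume is provisioned in the pod's AZ rather than the reverse. A sketch (class name and parameters are assumptions):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-gp3                            # hypothetical class name
provisioner: ebs.csi.aws.com               # EBS CSI driver
volumeBindingMode: WaitForFirstConsumer    # bind after pod scheduling, in the pod's AZ
parameters:
  type: gp3
```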
  • Use volume mounts instead of environment variables for sensitive data (secret volumes are instantiated as tmpfs volumes, a RAM-backed file system)
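A sketch of mounting a Secret as a read-only volume instead of exposing it via env (pod, image, and Secret names are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: secret-volume-demo        # hypothetical pod
spec:
  containers:
    - name: app
      image: my-app:1.0           # placeholder image
      volumeMounts:
        - name: app-secrets
          mountPath: /etc/secrets
          readOnly: true
  volumes:
    - name: app-secrets
      secret:
        secretName: app-secrets   # hypothetical Secret; mounted on tmpfs, not the env
```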
  • Use SSM Session Manager for Worker Node Access
    • Managed Node Groups
      • Use LaunchTemplates or SSM Agent Daemonset
  • Images
    • Create minimal images
      • Remove all binaries with the SETUID and SETGID bits as they can be used to escalate privilege (find / -perm /6000 -type f -exec ls -ld {} \;)
      • Remove special permissions from these files (RUN find / -xdev -perm /6000 -type f -exec chmod a-s {} \; || true)
    • Use multi-stage builds
    • Create a set of base images from which developers can create their own Dockerfiles
    • Add the USER directive to your Dockerfiles to run as a non-root user
    • Build images from scratch (the empty scratch base image) where possible
    • Update the packages in your container image
  • Autoscaling
    • Karpenter
      • Exclude instance types that do not fit your workload
      • Install the AWS Node Termination Handler when using Spot
      • Avoid overly constraining the Instance Types that Karpenter can provision, especially when utilizing Spot
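Karpenter's configuration API has changed across releases (Provisioner in early versions, NodePool later), so the following is only a sketch of the "constrain, but not too tightly" advice in the NodePool form; all names and values are assumptions:

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default                 # hypothetical pool name
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]   # allow Spot with On-Demand fallback
        # Constrain by broad category rather than pinning exact instance
        # types, leaving Karpenter room to pick from many Spot pools.
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
      nodeClassRef:
        name: default           # hypothetical EC2NodeClass
```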
    • Cluster Autoscaler
      • Set the --skip-nodes-with-local-storage flag to false to allow Cluster Autoscaler to scale down nodes with local storage
      • You should enable the --balance-similar-node-groups feature in Cluster Autoscaler
  • Avoid running singleton Pods
  • Run multiple replicas
  • Schedule replicas across nodes
    • Topology Spread Constraints for Pods
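A topology spread constraint that keeps replicas balanced across AZs might look like the following fragment of a pod spec (the app label is a placeholder):

```yaml
spec:
  topologySpreadConstraints:
    - maxSkew: 1                                  # replica counts per zone may differ by at most 1
      topologyKey: topology.kubernetes.io/zone    # spread across availability zones
      whenUnsatisfiable: ScheduleAnyway           # prefer spreading, but don't block scheduling
      labelSelector:
        matchLabels:
          app: my-app                             # hypothetical app label
```

Using DoNotSchedule instead of ScheduleAnyway makes the spread a hard requirement, at the cost of pods staying Pending when it cannot be met.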
  • HPA
  • VPA
  • Consider adjusting max unavailable to ensure that a rollout doesn’t disrupt your customers.
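On a Deployment, max unavailable is set in the rolling-update strategy; the sketch below (names and replica counts are illustrative) keeps full capacity during a rollout by surging instead of removing pods:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app              # hypothetical deployment
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0     # never drop below the desired replica count
      maxSurge: 1           # roll by adding one new pod at a time
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: app
          image: my-app:1.0   # placeholder image
```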
  • Probes
    • If you choose an exec-based probe, which runs a shell script inside a container, ensure that the shell command exits before the timeoutSeconds value expires
    • Use Liveness Probe to remove unhealthy pod
    • Use Startup Probe for applications that take longer to start
    • Use Readiness Probe to detect partial unavailability
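The three probes above can coexist on one container, each with a distinct job; a sketch of the container fragment (paths, port, and timings are assumptions):

```yaml
    startupProbe:             # holds off the liveness probe while the app starts
      httpGet:
        path: /healthz        # hypothetical health endpoint
        port: 8080
      failureThreshold: 30
      periodSeconds: 10       # allows up to 300s of startup time
    livenessProbe:            # restarts the container when it is unhealthy
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
      timeoutSeconds: 1
    readinessProbe:           # removes the pod from Service endpoints when partially unavailable
      httpGet:
        path: /ready          # hypothetical readiness endpoint
        port: 8080
      periodSeconds: 5
```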
  • Monitoring
    • Monitor IP Address Inventory - CNI Metrics Helper
  • Enable external SNAT for private IP address communication (e.g. VPC Peering)