@cgswong
Created January 27, 2020 21:09
Kubernetes Well Architected Review (WAR)

Customer: Midas Touch

Date: February 02, 2020

tags: meeting kubernetes EKS WAR

Intent:

  • Increase awareness of architectural best practices
  • Address foundational areas that are often neglected
  • Provide consistent approach to evaluating architectures
  • Influence future architectures

Areas:

  • Application checklist for Kubernetes
  • Cluster ready checklist for Kubernetes
  • Operational consideration for Kubernetes

Standard WAR pillars:

  • Security
  • Cost Optimization
  • Operational Excellence
  • Performance
  • Reliability

Application Checklist for Kubernetes

  • Pod readiness checks
  • Liveness checks
  • Metric instrumentation (e.g. Prometheus, New Relic, Datadog)
  • Dashboards - standard Kubernetes Dashboard, Grafana or alternatives
  • Playbooks and Runbooks
  • Limits and Requests
  • Labels and annotations
  • Pod placement
    • How many pods per application?
    • Taints and Tolerations
    • Pod affinity / anti-affinity
    • Node selectors
  • Alerting
  • Structured logging output (ELK stack or commercial options)
  • Tracing (X-Ray, Zipkin, Lightstep, Appdash, Jaeger)
  • Graceful shutdowns (i.e. how does the app respond to SIGTERM?)
  • Graceful dependencies (apps should not assume dependencies are available)
  • ConfigMaps (apps should use them for dependency injection)
  • Images tagged with the commit SHA (do not use the "latest" tag)
  • Locked down runtime context (e.g. no root user)
    • Consider using Pod Security Policy (PSP)
    • Consider using AppArmor or SELinux security context
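
The readiness, liveness and graceful-shutdown items above interact: on SIGTERM the app should fail its readiness check first so traffic drains before the process exits. The following is a minimal sketch of that pattern, not a reference implementation. The `/healthz` and `/readyz` paths, port 8080, the drain period and all function names are assumptions chosen for illustration.

```python
import signal
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Shared state: set once startup work is done, set when shutdown begins.
ready = threading.Event()
shutting_down = threading.Event()


def probe_status(path):
    """Return the HTTP status for a probe path (paths are assumptions)."""
    if path == "/healthz":
        # Liveness: the process is up and able to answer requests.
        return 200
    if path == "/readyz":
        # Readiness: accept traffic only after startup and before shutdown.
        return 200 if ready.is_set() and not shutting_down.is_set() else 503
    return 404


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(probe_status(self.path))
        self.end_headers()

    def log_message(self, format, *args):
        pass  # keep probe traffic out of the application logs


def handle_sigterm(signum, frame):
    # Kubernetes sends SIGTERM before SIGKILL; failing readiness here lets
    # the pod be removed from Service endpoints while it is still running.
    shutting_down.set()


def main(port=8080, drain_seconds=5.0):
    signal.signal(signal.SIGTERM, handle_sigterm)
    server = ThreadingHTTPServer(("0.0.0.0", port), ProbeHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    ready.set()  # real startup work would finish before this line
    while not shutting_down.wait(timeout=0.5):
        pass  # idle until SIGTERM arrives
    time.sleep(drain_seconds)  # give in-flight requests time to finish
    server.shutdown()
```

Calling `main()` starts the server and blocks until SIGTERM. The drain period should stay below the pod's termination grace period, otherwise the kubelet kills the process before it finishes draining.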

Cluster Ready Checklist for Kubernetes

  • Build Pipeline - CI portion (Jenkins, Travis, CircleCI, CodeBuild)
  • Deployment Pipeline - CD portion (GitOps using Weave Cloud and Flux)
  • Image Registry (DockerHub, JFrog or ECR)
    • Private Repos require credential storage
  • Monitoring infrastructure by collecting and storing metrics (Prometheus or CloudWatch)
  • Databases or Stateful Apps
  • Storage
    • CSI drivers for block storage
    • CSI drivers for shared file storage (EFS)
    • OpenEBS
    • Portworx
    • Rook / Ceph
  • Secrets Management (Bitnami Sealed Secrets, HashiCorp Vault, etc.)
  • Ingress Controller (ALB, nginx, Kong, Solo Gloo, Traefik, HAProxy, Ambassador, etc.)
  • Service Mesh (AppMesh, Istio, Linkerd)
  • Service Catalog functionality
    • service catalog
    • AWS Operator
  • User and Pod Authorization
    • IAM users or roles
    • SAML Federation
    • IRSA for Pods, KIAM or Kube2IAM
  • Network Policies (Tigera Calico)
  • Static or Dynamic Image/Runtime Scanning
    • ECR (static only)
    • Twistlock
    • Aqua Security
    • StackRox
    • Sysdig
  • Log Aggregation
    • CloudWatch / Container Insights (e.g. Fluent Bit or Fluentd forwarder)
    • Splunk
    • Others
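
For the "Private Repos require credential storage" item, one common form of storage is an image pull Secret of type `kubernetes.io/dockerconfigjson`. The sketch below builds such a manifest as a plain dictionary to show the two layers of base64 encoding involved. The registry host, account name and function names are hypothetical, and on EKS, pulls from ECR are typically authorized through the node IAM role rather than a Secret like this.

```python
import base64
import json


def docker_config_json(registry, username, password):
    """Build the JSON document that the Secret carries."""
    # The "auth" field is base64("username:password").
    auth = base64.b64encode(f"{username}:{password}".encode()).decode()
    entry = {"username": username, "password": password, "auth": auth}
    return json.dumps({"auths": {registry: entry}})


def pull_secret_manifest(name, registry, username, password):
    """Return a Secret manifest (as a dict) usable via imagePullSecrets."""
    payload = docker_config_json(registry, username, password)
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name},
        "type": "kubernetes.io/dockerconfigjson",
        # Values under "data" must themselves be base64 encoded.
        "data": {".dockerconfigjson": base64.b64encode(payload.encode()).decode()},
    }


# Hypothetical values, for illustration only.
manifest = pull_secret_manifest("regcred", "registry.example.com", "ci-bot", "not-a-real-password")
```

In practice the credential would come from a secrets manager rather than source code, and the resulting Secret would be referenced from the pod spec or the service account.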

Operational Considerations for Kubernetes

  • Horizontal Pod Autoscaling
    • Metrics Server
    • Use AWS CloudWatch or external metrics
  • Vertical Pod Autoscaling
  • Cluster autoscaling or AWS native ASG
    • cluster autoscaler is not AZ aware
    • cluster autoscaler will dynamically move pods around and terminate instances
    • cluster autoscaler is reactive
    • cluster autoscaler assumes nodegroups are homogeneous
  • How do you bootstrap a new cluster?
  • Utilizing Namespaces for team/developer isolation
  • How to create clusters using IaC
    • eksctl
    • CloudFormation / CDK
    • Terraform
    • Pulumi
    • Other
  • How do you upgrade your node groups?
    • eksctl
    • AWS managed node groups
    • manually
  • Do you have SSH access into your worker nodes?
    • SSH keys
    • SSM agent
  • AMI choice
    • Amazon Linux 2 / EKS Optimized Linux
    • Ubuntu
    • Custom AMI
  • VPC design
    • AWS CNI requires large number of IP addresses
    • Overlay CIDR block possible
    • 6x /20 subnets recommended (3 public and 3 private) over 3x Availability Zones
    • Public or Private DNS settings
    • Private endpoints require additional work
    • Hybrid design
  • Worker Nodes
    • Instance sizing impacts number of pods/IPs
    • Fargate for EKS
  • DNS (CoreDNS)
    • EKS only uses 2 DNS pods by default
    • Daemon set may be better
    • Daemon set with local access only
    • External DNS
  • Control Plane logging - delivered to CloudWatch Logs (off by default); CloudTrail covers EKS API calls
  • Disaster Recovery
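
Several sizing points in the list above (Horizontal Pod Autoscaling, the IP appetite of the AWS CNI, /20 subnets, and instance sizing versus pod count) come down to simple arithmetic. The sketch below works through them under stated assumptions: the function names are hypothetical, the HPA calculation ignores the controller's tolerance band, the pod limit applies to the VPC CNI in its default secondary-IP mode, and the m5.large limits (3 ENIs, 10 IPv4 addresses each) are taken from AWS's published per-instance limits.

```python
import ipaddress
import math


def hpa_desired_replicas(current_replicas, current_metric, target_metric):
    # Scaling rule from the Kubernetes HPA documentation:
    # desired = ceil(currentReplicas * currentMetric / targetMetric)
    return math.ceil(current_replicas * current_metric / target_metric)


def max_pods_per_node(enis, ipv4_per_eni):
    # Each ENI keeps its primary address for itself; the +2 accounts for
    # host-network pods that do not consume a VPC address.
    return enis * (ipv4_per_eni - 1) + 2


def usable_subnet_ips(cidr):
    # AWS reserves 5 addresses in every subnet.
    return ipaddress.ip_network(cidr).num_addresses - 5


def worst_case_nodes_per_subnet(cidr, enis, ipv4_per_eni):
    # Upper bound on addresses one node can hold if every ENI is attached
    # and fully populated.
    return usable_subnet_ips(cidr) // (enis * ipv4_per_eni)


print(hpa_desired_replicas(3, 80, 50))                 # 5
print(max_pods_per_node(3, 10))                        # 29 (m5.large)
print(usable_subnet_ips("10.0.0.0/20"))                # 4091
print(worst_case_nodes_per_subnet("10.0.0.0/20", 3, 10))  # 136
```

The last figure is a pessimistic bound rather than a prediction, since the CNI only attaches additional ENIs as pods are scheduled. It still shows why small subnets run out of addresses well before nodes run out of CPU or memory.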

Security

Includes the ability to protect information, systems, and assets while delivering business value through risk assessments and mitigation strategies

  1. Apply security at all layers

  2. Enable traceability

  3. Implement a principle of least privilege 

  4. Focus on securing your system

  5. AWS Shared Responsibility Model

  6. Automate security best practices

    • Detective Controls
    • Infrastructure Protection
    • Data Protection
    • Incident Response 
  • IAM
    • Root account
      • MFA
      • Not used
    • Key rotation
    • IAM role
    • Federation
  • Encryption
    • At rest
    • In transit
    • Key storage
      • KMS
      • CloudHSM
      • Other
  • Network / VPC
    • Security Groups
    • NACLs
  • Pen tests
  • Host based firewalls
  • WAF
  • Monitoring and Logging
    • CloudTrail
    • CloudWatch logs
    • VPC flow logs
    • Third Party Systems (Splunk, AppMonitoring)

Reliability

The ability of a system to recover from infrastructure or service failures, dynamically acquire computing resources to meet demand, and mitigate disruptions such as misconfigurations or transient network issues

  1. Test recovery procedures
  2. Automatically recover from failure
  3. Scale horizontally to increase aggregate system availability
  4. Stop guessing capacity
  5. Manage change using automation
  • Limits monitoring
  • HA/Failover
  • Autoscaling
  • Monitoring
  • Change management
  • GitOps or Infrastructure as Code
  • Chaos Testing
  • Backup and recovery
  • Planning for DR
  • Do you have Enterprise Support?

Performance

The ability to use computing resources efficiently to meet system requirements, and to maintain that efficiency as demand changes and technologies evolve

  1. Democratize advanced technologies
  2. Go global in minutes
  3. Use serverless architectures
  4. Experiment more often
  5. Mechanical sympathy
  • Instance selection
  • Instance monitoring
  • Autoscaling
  • Database selection
  • Load testing

Cost Optimization

The ability to avoid or eliminate unneeded cost or suboptimal resources while meeting your functional requirements

  1. Cost-effective resources 
  2. Matching supply with demand 
  3. Expenditure awareness 
  4. Optimizing over time
  • Governance
  • Spend monitoring
  • Usage to spend monitoring
  • Storage usage
  • CDN
  • Reserved Instances (RIs)
  • Spot
  • Use higher level services
    • SQS
    • DynamoDB (DDB)
    • SNS
    • etc.
  • Cleanup/decommissioning

Operational Excellence

The ability to run and monitor systems to deliver business value and to continually improve supporting processes and procedures

  1. Preparation
  2. Operation
  3. Response 
  • What best practices for cloud operations are you using?
  • How are you doing configuration management for your workload?
  • How are you evolving your workload while minimizing the impact of change?
  • How do you monitor your workload to ensure it is operating as expected?
  • How do you respond to unplanned operational events?
  • How is escalation managed when responding to unplanned operational events?