evilensky/kubernetes-war.md

## kubernetes-war.md

Kubernetes Well-Architected Review (WAR)

Customer: Midas Touch

Date: February 02, 2020

tags: meeting kubernetes EKS WAR


Intent:

Increase awareness of architectural best practices
Addresses foundational areas that are often neglected
Provide consistent approach to evaluating architectures
Influence future architectures

Areas:

Application checklist for Kubernetes
Cluster ready checklist for Kubernetes
Operational consideration for Kubernetes

Standard WAR pillars:

Security
Cost Optimization
Operational Excellence
Performance
Reliability


Application Checklist for Kubernetes


 Pod readiness checks
 Liveness checks
 Metric instrumentation (e.g. Prometheus, New Relic, Datadog)
 Dashboards - standard Kubernetes dashboard, Grafana or alternatives
 Playbooks and Runbooks
 Limits and Requests
 Labels and annotations
Pod placement

 How many pods per application?
 Taints and Tolerations
 Pod affinity / anti-affinity
 Node selectors


 Alerting
 Structured logging output (ELK stack or commercial options)
 Tracing (X-Ray, Zipkin, Lightstep, Appdash, Jaeger)
 Graceful shutdowns (e.g. how does the app respond to SIGTERM?)
 Graceful dependencies (apps should not assume dependencies are available)
 ConfigMaps (apps should use them for dependency injection)
 Labeled images using commit SHA (do not use the "latest" tag)
 Locked down runtime context (e.g. no root user)

 Consider using Pod Security Policy (PSP)
 Consider using AppArmor or SELinux security context
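
For the "Graceful shutdowns" item above, a minimal sketch of how an app might respond to SIGTERM. Kubernetes sends SIGTERM when a pod is terminated and follows up with SIGKILL once the grace period (30 seconds by default) runs out. The names `GracefulShutdown`, `run_worker` and `process_one` are hypothetical and only illustrate the pattern; a real service would also stop accepting new requests and fail its readiness check while draining.

```python
import signal
import threading


class GracefulShutdown:
    """Flip an event when SIGTERM or SIGINT arrives so the main loop can drain."""

    def __init__(self):
        self.requested = threading.Event()

    def install(self):
        # signal.signal() may only be called from the main thread.
        signal.signal(signal.SIGTERM, self._handle)
        signal.signal(signal.SIGINT, self._handle)

    def _handle(self, signum, frame):
        # Keep the handler tiny: only record that shutdown was requested.
        self.requested.set()


def run_worker(shutdown, process_one, poll_seconds=0.1):
    """Process work until shutdown is requested; return the units completed.

    process_one is a hypothetical callable that handles one unit of work and
    returns True if it did something, False if the queue was empty.
    """
    done = 0
    while not shutdown.requested.is_set():
        if process_one():
            done += 1
        else:
            # Idle wait that wakes up immediately when shutdown is requested.
            shutdown.requested.wait(poll_seconds)
    return done
```

The in-flight unit of work always finishes before the loop exits, so the app should keep each unit well under the pod's termination grace period.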


Cluster Ready Checklist for Kubernetes


 Build Pipeline - CI portion (Jenkins, Travis, CircleCI, CodeBuild)
 Deployment Pipeline - CD portion (GitOps using Weave Cloud and Flux)
 Image Registry (DockerHub, JFrog or ECR)

 Private Repos require credential storage


 Monitoring infrastructure by collecting and storing metrics (Prometheus or CloudWatch)
 Databases or Stateful Apps
Storage

 CSI drivers for block storage
 CSI drivers for shared file storage (EFS)
 OpenEBS
 Portworx
 Rook / Ceph


 Secrets Management (Bitnami Sealed Secrets, HashiCorp Vault, etc.)

 Bitnami Sealed Secrets
 GoDaddy External Secrets
 SOPS
 kubesec
 Jeremy's prototype
 HashiCorp Vault


 Ingress Controller (ALB, nginx, Kong, Solo Gloo, Traefik, HAProxy, Ambassador, etc.)
 Service Mesh (AppMesh, Istio, Linkerd)
 Service Catalog functionality

 service catalog
 AWS Operator


 User and Pod Authorization

 IAM users or roles
 SAML Federation
 IRSA for Pods, KIAM or Kube2IAM


 Network Policies (Tigera Calico)
 Static or Dynamic Image/Runtime Scanning

 ECR (static only)
 Twistlock
 Aqua Security
 StackRox
 Sysdig


 Log Aggregation

 CloudWatch / Container Insights (e.g. Fluent Bit or Fluentd forwarder)
 Splunk
 Others


Operational Considerations for Kubernetes


Horizontal Pod Autoscaling

 Metrics Server
 Use AWS CloudWatch or external metrics
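
As a rough illustration of what the Horizontal Pod Autoscaler does with those metrics, the sketch below applies the scaling rule from the Kubernetes documentation: desired replicas = ceil(current replicas x current metric / target metric), skipped when the ratio is within a tolerance of 1.0 (0.1 by default). The function name and the `min_replicas` / `max_replicas` defaults are assumptions for this example; the real controller also handles missing metrics, pod readiness and stabilization windows.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Approximate the HPA replica calculation for a single metric."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")

    ratio = current_metric / target_metric
    # Within tolerance of the target: leave the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas

    desired = math.ceil(current_replicas * ratio)
    # Clamp to the bounds configured on the autoscaler.
    return max(min_replicas, min(max_replicas, desired))
```

For example, 4 replicas averaging 90% CPU against a 60% target would scale to 6, while 4 replicas at 63% would stay at 4 because the ratio is inside the tolerance.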


Vertical Pod Autoscaling
Cluster autoscaling or AWS native ASG

 cluster autoscaler is not AZ aware
 cluster autoscaler will dynamically move pods around and terminate instances
 cluster autoscaler is reactive
 cluster autoscaler assumes nodegroups are homogeneous


How do you bootstrap a new cluster?

 helm files
 GitOps
 scripts


 Utilizing Namespaces for team/developer isolation
How do you create clusters using IaC?

 eksctl
 CloudFormation / CDK
 Terraform
 Pulumi
 Other


How do you upgrade your node groups?

 eksctl
 AWS managed node groups
 manually


Do you have SSH access into your worker nodes?

 SSH keys
 SSM agent


AMI choice

 Amazon Linux 2 / EKS-optimized AMI
 Ubuntu
 Custom AMI


VPC design

 AWS VPC CNI requires a large number of IP addresses (each pod gets its own VPC IP)
 Overlay CIDR block possible
 6x /20 subnets recommended (3 public and 3 private) over 3x Availability Zones
 Public or Private DNS settings
 Private endpoints require additional work
 Hybrid design
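
To make the subnet recommendation above concrete, here is a small sketch that carves six /20 subnets out of a VPC range and reports the usable addresses in each. The VPC CIDR `10.0.0.0/16` and the function names are assumptions for illustration only. AWS reserves five addresses in every subnet, which is why a /20 yields 4091 usable IPs rather than 4096.

```python
import ipaddress

# AWS reserves the first four addresses and the last address of each subnet.
AWS_RESERVED_PER_SUBNET = 5


def plan_subnets(vpc_cidr, new_prefix=20, count=6):
    """Return the first `count` subnets of size /new_prefix inside the VPC."""
    vpc = ipaddress.ip_network(vpc_cidr)
    subnets = list(vpc.subnets(new_prefix=new_prefix))
    if len(subnets) < count:
        raise ValueError(f"{vpc_cidr} cannot hold {count} /{new_prefix} subnets")
    return subnets[:count]


def usable_ips(subnet):
    """Addresses available to nodes and pods after AWS reservations."""
    return subnet.num_addresses - AWS_RESERVED_PER_SUBNET


if __name__ == "__main__":
    # Hypothetical layout: 3 public then 3 private subnets, one per AZ each.
    roles = ["public"] * 3 + ["private"] * 3
    for role, subnet in zip(roles, plan_subnets("10.0.0.0/16")):
        print(role, subnet, usable_ips(subnet))
```

A /16 VPC holds sixteen /20 subnets, so this layout leaves ten spare blocks for later growth or additional subnet tiers.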


Worker Nodes

 Instance sizing impacts number of pods/IPs
 Fargate for EKS
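
The link between instance size and pod count follows from how the AWS VPC CNI assigns addresses: each pod takes a secondary IP on one of the node's network interfaces (ENIs). A commonly cited formula for the default max pods per node is ENIs x (IPv4 addresses per ENI - 1) + 2. The sketch below applies it; the ENI limits in the table are quoted from memory of the AWS instance documentation and should be verified against the current docs, and the formula does not apply when prefix delegation or a different CNI is used.

```python
# (max ENIs, IPv4 addresses per ENI) for a few instance types.
# Assumed values for illustration; check the current AWS documentation.
ENI_LIMITS = {
    "t3.medium": (3, 6),
    "m5.large": (3, 10),
    "m5.xlarge": (4, 15),
}


def max_pods(enis, ipv4_per_eni):
    """Default pod capacity of a node under the AWS VPC CNI.

    One address per ENI is the interface's own primary IP, and the +2
    accounts for pods that run on the host network.
    """
    return enis * (ipv4_per_eni - 1) + 2


def max_pods_for(instance_type):
    """Look up an instance type in the (assumed) table and apply the formula."""
    enis, ipv4_per_eni = ENI_LIMITS[instance_type]
    return max_pods(enis, ipv4_per_eni)
```

Under these assumptions an m5.large tops out at 29 pods, so a node can run out of pod IPs long before it runs out of CPU or memory.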


DNS (CoreDNS)

 EKS only uses 2 DNS pods by default
 Daemon set may be better
 Daemon set with local access only
 External DNS


Control Plane logging - CloudWatch Logs (off by default)
Disaster Recovery

 Federation v2
 Velero
 GitOps


Security

Includes the ability to protect information, systems, and assets while delivering business value through risk assessments and mitigation strategies


Apply security at all layers


Enable traceability


Implement a principle of least privilege 


Focus on securing your system


AWS Shared Responsibility Model


Automate security best practices

Detective Controls
Infrastructure Protection
Data Protection
Incident Response 


IAM
Root account

 MFA
 Not used


 Key rotation
 IAM role
 Federation
 Encryption
 At rest
 In transit
Key storage

 KMS
 CloudHSM
 Other


Network / VPC

 Security Groups
 NACLs


 Pen tests
 Host based firewalls
 WAF
Monitoring and Logging

 CloudTrail
 CloudWatch logs
 VPC flow logs
 Third Party Systems (Splunk, AppMonitoring)


Reliability

The ability of a system to recover from infrastructure or service failures,
dynamically acquire computing resources to meet demand, and mitigate disruptions such as misconfigurations or transient network issues

Test recovery procedures
Automatically recover from failure
Scale horizontally to increase aggregate system availability
Stop guessing capacity
Manage change using automation


 Limits monitoring
 HA/Failover
 Autoscaling
 Monitoring
 Change management
 GitOps or Infrastructure as Code
 Chaos Testing
 Backup and recovery
 Planning for DR
 Do you have Enterprise Support?

Performance

The ability to use computing resources efficiently to meet system requirements,
and to maintain that efficiency as demand changes and technologies evolve

Democratize advanced technologies
Go global in minutes
Use serverless architectures
Experiment more often
Mechanical sympathy


 Instance selection
 Instance monitoring
 Autoscaling
 Database selection
 Load testing

Cost Optimization

The ability to avoid or eliminate unneeded cost or suboptimal resources while meeting your functional requirements

Cost-effective resources 
Matching supply with demand 
Expenditure awareness 
Optimizing over time


 Governance
 Spend monitoring
 Usage to spend monitoring
 Storage usage
 CDN
 RIs (Reserved Instances)
 Spot
 Use higher-level services (SQS, DynamoDB, SNS, etc.)
 Cleanup/decommissioning

Operational Excellence

The ability to run and monitor systems to deliver business value and to continually improve supporting processes and procedures

Preparation
Operation
Response 


What best practices for cloud operations are you using?
How are you doing configuration management for your workload?
How are you evolving your workload while minimizing the impact of change?
How do you monitor your workload to ensure it is operating as expected?
How do you respond to unplanned operational events?
How is escalation managed when responding to unplanned operational events?