Task | Description | Example Tools |
---|---|---|
Install | Install the software binaries and all dependencies. | Bash, Chef, Ansible, Puppet |
Configure | Configure the software at runtime: e.g. configure port settings, file paths, users, leaders, followers, replication, etc. | Bash, Chef, Ansible, Puppet |
Provision | Provision the infrastructure: e.g. EC2 instances, load balancers, network topology, security groups, IAM permissions, etc. | Terraform, CloudFormation |
Deploy | Deploy the service on top of the infrastructure. Roll out updates with no downtime: e.g. blue-green, rolling, canary deployments. | Scripts, Orchestration tools (ECS, K8S, Nomad) |
Security | Encryption in transit (TLS) and on disk, authentication, authorization, secrets management, server hardening. | ACM, EBS Volumes, Cognito, Vault, CIS |
Monitoring | Availability metrics, business metrics, app metrics, server metrics, events, observability, tracing, alerting. | CloudWatch, DataDog, New Relic, Honeycomb |
Logs | Rotate logs on disk. Aggregate log data to a central location. | CloudWatch Logs, ELK, Sumo Logic, Papertrail |
Backup and restore | Make backups of DBs, caches, and other data on a scheduled basis. Replicate to separate region/account. | RDS, ElastiCache, ec2-snapper, Lambda |
Networking | VPCs, subnets, static and dynamic IPs, service discovery, service mesh, firewalls, DNS, SSH access, VPN access. | EIPs, ENIs, VPCs, NACLs, SGs, Route 53, OpenVPN |
High availability | Withstand outages of individual processes, EC2 Instances, services, Availability Zones, and regions. | Multi AZ, multi-region, replication, ASGs, ELBs |
Scalability | Scale up and down in response to load. Scale horizontally (more servers) and/or vertically (bigger servers). | ASGs, replication, sharding, caching, divide and conquer |
Performance | Optimize CPU, memory, disk, network, and GPU usage. Query tuning. Benchmarking, load testing, profiling. | Dynatrace, Valgrind, VisualVM, ab, JMeter |
Cost optimization | Pick proper instance types, use spot and reserved instances, use auto scaling, nuke unused resources. | ASGs, spot instances, reserved instances |
Documentation | Document your code, architecture and practices. Create playbooks to respond to incidents. | READMEs, wikis, Slack |
Tests | Write automated tests for your infrastructure code. Run tests after every commit and nightly. | Terratest |
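
One way to put the checklist to use is to track, per service, which tasks have been addressed. The following is a minimal Python sketch, not part of the original checklist: the task names are copied from the "Task" column above, while `ServiceReview`, `mark_done`, `remaining`, `is_complete`, and the service name `example-service` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, field

# Task names copied from the "Task" column of the checklist table.
CHECKLIST = (
    "Install",
    "Configure",
    "Provision",
    "Deploy",
    "Security",
    "Monitoring",
    "Logs",
    "Backup and restore",
    "Networking",
    "High availability",
    "Scalability",
    "Performance",
    "Cost optimization",
    "Documentation",
    "Tests",
)


@dataclass
class ServiceReview:
    """Hypothetical helper: records which checklist tasks a service has covered."""

    name: str
    done: set = field(default_factory=set)

    def mark_done(self, task: str) -> None:
        # Reject anything that is not a row in the checklist table.
        if task not in CHECKLIST:
            raise ValueError(f"unknown checklist task: {task!r}")
        self.done.add(task)

    def remaining(self) -> list:
        # Preserve the table's ordering so reports read top to bottom.
        return [task for task in CHECKLIST if task not in self.done]

    def is_complete(self) -> bool:
        return not self.remaining()


if __name__ == "__main__":
    review = ServiceReview("example-service")  # hypothetical service name
    for task in ("Install", "Configure", "Provision"):
        review.mark_done(task)
    print(f"{review.name}: {len(review.remaining())} tasks remaining")
    for task in review.remaining():
        print(f"  [ ] {task}")
```

Keeping the task list in one tuple means a typo in a task name fails loudly instead of silently creating a new item.
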
Last active
July 27, 2021 07:42
production-grade-checklist.md