samloh84/production-grade-checklist.md

## production-grade-checklist.md

      
| Task | Description | Example Tools |
| --- | --- | --- |
| Install | Install the software binaries and all dependencies. | Bash, Chef, Ansible, Puppet |
| Configure | Configure the software at runtime: e.g. port settings, file paths, users, leaders, followers, replication, etc. | Bash, Chef, Ansible, Puppet |
| Provision | Provision the infrastructure: e.g. EC2 instances, load balancers, network topology, security groups, IAM permissions, etc. | Terraform, CloudFormation |
| Deploy | Deploy the service on top of the infrastructure. Roll out updates with no downtime: e.g. blue-green, rolling, canary deployments. | Scripts, orchestration tools (ECS, Kubernetes, Nomad) |
| Security | Encryption in transit (TLS) and on disk, authentication, authorization, secrets management, server hardening. | ACM, EBS volumes, Cognito, Vault, CIS |
| Monitoring | Availability metrics, business metrics, app metrics, server metrics, events, observability, tracing, alerting. | CloudWatch, Datadog, New Relic, Honeycomb |
| Logs | Rotate logs on disk. Aggregate log data to a central location. | CloudWatch Logs, ELK, Sumo Logic, Papertrail |
| Backup and restore | Make backups of DBs, caches, and other data on a scheduled basis. Replicate to a separate region/account. | RDS, ElastiCache, ec2-snapper, Lambda |
| Networking | VPCs, subnets, static and dynamic IPs, service discovery, service mesh, firewalls, DNS, SSH access, VPN access. | EIPs, ENIs, VPCs, NACLs, SGs, Route 53, OpenVPN |
| High availability | Withstand outages of individual processes, EC2 instances, services, Availability Zones, and regions. | Multi-AZ, multi-region, replication, ASGs, ELBs |
| Scalability | Scale up and down in response to load. Scale horizontally (more servers) and/or vertically (bigger servers). | ASGs, replication, sharding, caching, divide and conquer |
| Performance | Optimize CPU, memory, disk, network, and GPU usage. Query tuning. Benchmarking, load testing, profiling. | Dynatrace, valgrind, VisualVM, ab, JMeter |
| Cost optimization | Pick proper instance types, use spot and reserved instances, use auto scaling, nuke unused resources. | ASGs, spot instances, reserved instances |
| Documentation | Document your code, architecture, and practices. Create playbooks to respond to incidents. | READMEs, wikis, Slack |
| Tests | Write automated tests for your infrastructure code. Run tests after every commit and nightly. | Terratest |

Source: https://www.slideshare.net/brikis98/lessons-learned-from-writing-over-300000-lines-of-infrastructure-code-120597849
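
One way to put the checklist to work is to record, per service, which tasks have been addressed and list what is still open. The following is a minimal sketch, not part of the original checklist: the names `CHECKLIST`, `missing_tasks`, and `coverage` are hypothetical, and only the task names come from the table above.

```python
# Task names taken from the checklist table; everything else is illustrative.
CHECKLIST = [
    "Install", "Configure", "Provision", "Deploy", "Security",
    "Monitoring", "Logs", "Backup and restore", "Networking",
    "High availability", "Scalability", "Performance",
    "Cost optimization", "Documentation", "Tests",
]


def missing_tasks(done):
    """Return the checklist tasks not yet covered, in checklist order.

    Matching ignores case and surrounding whitespace.
    """
    done_normalized = {task.strip().lower() for task in done}
    return [task for task in CHECKLIST if task.lower() not in done_normalized]


def coverage(done):
    """Return the fraction of checklist tasks covered (0.0 to 1.0)."""
    return 1 - len(missing_tasks(done)) / len(CHECKLIST)


if __name__ == "__main__":
    # Hypothetical service that has only handled the first four tasks.
    done = ["Install", "Configure", "Provision", "Deploy"]
    print(f"coverage: {coverage(done):.0%}")
    for task in missing_tasks(done):
        print(f"TODO: {task}")
```

Keeping the task list in one place makes it easy to run the same check for every service, for example as part of a release review.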
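
The Logs task asks for log rotation on disk. For a service written in Python, the standard library's `logging.handlers.RotatingFileHandler` is one way to rotate in-process. The sketch below is illustrative only: the function name `build_rotating_logger`, the logger name, the file name, the size limit, and the backup count are all assumptions, and aggregation to a central location is not covered.

```python
import logging
import os
import tempfile
from logging.handlers import RotatingFileHandler


def build_rotating_logger(path, name="app", max_bytes=1024, backup_count=3):
    """Create a logger whose file rolls over before it would exceed max_bytes.

    At most `backup_count` old files are kept (path.1, path.2, ...).
    Call this once per logger name; repeated calls add duplicate handlers.
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep demo output out of the root logger
    handler = RotatingFileHandler(path, maxBytes=max_bytes, backupCount=backup_count)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    # Write to a throwaway directory so the demo leaves real logs untouched.
    log_dir = tempfile.mkdtemp()
    log_path = os.path.join(log_dir, "app.log")
    logger = build_rotating_logger(log_path)
    for i in range(200):
        logger.info("request %d handled", i)
    logging.shutdown()
    # Lists the current log file plus the rotated backups that were kept.
    print(sorted(os.listdir(log_dir)))
```

The small `max_bytes` value is chosen only to make rotation visible in a short run; a real service would use a much larger limit or rotate with an external tool instead.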