@samloh84
Last active July 27, 2021 07:42
production-grade-checklist.md
| Task | Description | Example Tools |
| --- | --- | --- |
| Install | Install the software binaries and all dependencies. | Bash, Chef, Ansible, Puppet |
| Configure | Configure the software at runtime: e.g. port settings, file paths, users, leaders, followers, replication, etc. | Bash, Chef, Ansible, Puppet |
| Provision | Provision the infrastructure: e.g. EC2 instances, load balancers, network topology, security groups, IAM permissions, etc. | Terraform, CloudFormation |
| Deploy | Deploy the service on top of the infrastructure. Roll out updates with no downtime: e.g. blue-green, rolling, canary deployments. | Scripts, orchestration tools (ECS, K8S, Nomad) |
| Security | Encryption in transit (TLS) and on disk, authentication, authorization, secrets management, server hardening. | ACM, EBS Volumes, Cognito, Vault, CIS |
| Monitoring | Availability metrics, business metrics, app metrics, server metrics, events, observability, tracing, alerting. | CloudWatch, DataDog, New Relic, Honeycomb |
| Logs | Rotate logs on disk. Aggregate log data to a central location. | CloudWatch Logs, ELK, Sumo Logic, Papertrail |
| Backup and restore | Make backups of DBs, caches, and other data on a scheduled basis. Replicate to a separate region/account. | RDS, ElastiCache, ec2-snapper, Lambda |
| Networking | VPCs, subnets, static and dynamic IPs, service discovery, service mesh, firewalls, DNS, SSH access, VPN access. | EIPs, ENIs, VPCs, NACLs, SGs, Route 53, OpenVPN |
| High availability | Withstand outages of individual processes, EC2 instances, services, Availability Zones, and regions. | Multi-AZ, multi-region, replication, ASGs, ELBs |
| Scalability | Scale up and down in response to load. Scale horizontally (more servers) and/or vertically (bigger servers). | ASGs, replication, sharding, caching, divide and conquer |
| Performance | Optimize CPU, memory, disk, network, and GPU usage. Query tuning. Benchmarking, load testing, profiling. | Dynatrace, valgrind, VisualVM, ab, JMeter |
| Cost optimization | Pick proper instance types, use spot and reserved instances, use auto scaling, nuke unused resources. | ASGs, spot instances, reserved instances |
| Documentation | Document your code, architecture, and practices. Create playbooks to respond to incidents. | READMEs, wikis, Slack |
| Tests | Write automated tests for your infrastructure code. Run tests after every commit and nightly. | Terratest |
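
As one illustration of the "Logs" item (rotate logs on disk), here is a minimal Python sketch using the standard library's `RotatingFileHandler`. The helper name `build_rotating_logger`, the file name `app.log`, and the size and backup limits are assumptions chosen for the example, not part of the checklist; shipping logs to a central location would still need one of the tools listed in the table.

```python
import logging
import os
import tempfile
from logging.handlers import RotatingFileHandler


def build_rotating_logger(path, max_bytes=1024, backup_count=3):
    """Return a logger that rolls `path` over once it nears `max_bytes`.

    Hypothetical helper for illustration: at most `backup_count` old files
    (path.1, path.2, ...) are kept, and older ones are deleted.
    """
    logger = logging.getLogger(f"rotating-demo-{path}")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep demo output out of the root logger
    handler = RotatingFileHandler(path, maxBytes=max_bytes, backupCount=backup_count)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log_dir = tempfile.mkdtemp()
    logger = build_rotating_logger(os.path.join(log_dir, "app.log"))
    for i in range(200):
        logger.info("request %d handled", i)
    # Lists the active log file plus the retained backups.
    print(sorted(os.listdir(log_dir)))
```

The small `max_bytes` value only makes rotation easy to observe; a real service would likely pick a much larger limit or delegate rotation to a tool such as `logrotate`.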

https://www.slideshare.net/brikis98/lessons-learned-from-writing-over-300000-lines-of-infrastructure-code-120597849
