@steinbrueckri
Forked from adyrcz/OpsReadiness.md
Created June 19, 2019 14:17
Operational Readiness Checklist

Types of Operability Requirements

  • Project Requirements
  • Configuration Management
  • CI/CD
  • Service Level Management Requirements
  • Monitoring
  • Logging
  • Alerting
  • SLAs
  • Runbooks/Documentation
  • Robustness and Resilience/DR Requirements

Project Requirements

  • Software Stack
  • Hardware requirements
  • Environment Count
  • Instance Count
  • [ ]

Configuration Management

Chef

Infrastructure and tool install/configure

  • server is launched through Chef
  • base is installed
  • application is installed through Chef
  • app configuration is in a CM recipe
  • config for each environment is managed in Chef
  • project hosts run chef-client on schedule
  • [ ]

CI/CD

Jenkins

Deploy code and apps

  • staging deploy built
  • prod deploy built
  • testing configured
  • testing gated
  • [ ]

Service Management Requirements

Rundeck

Central job scheduling and elevated access for devs

  • Rundeck keys are on the box
  • cron jobs are stored on the Rundeck server
  • server application jobs are configured
  • [ ]

Monitoring

Graphite

Monitoring Requirements

  • hosts are sending system metrics (CPU, Memory, Disk, Network)
  • server apps are sending metrics (server: nginx, IIS; middleware: redis, rabbitmq...)
  • apps are sending specific metrics (process: cpu, memory; garbage collection; process forking...)
  • response times (every node, external)
  • [ ]
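
As a rough illustration of what "sending metrics" can look like, the sketch below formats data points in Graphite's plaintext protocol (one `path value timestamp` line per metric) and writes them over TCP. The metric path, the hostname `graphite.example.internal`, and the use of port 2003 (the usual Carbon plaintext port) are assumptions for the example, not details from this checklist; in practice an agent such as collectd or StatsD often does this instead.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Return one line of the Graphite plaintext protocol: 'path value timestamp\\n'."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_metrics(lines, host="graphite.example.internal", port=2003, timeout=5.0):
    """Send pre-formatted metric lines to a Carbon plaintext listener.

    The host and port are placeholders; adjust them to your environment.
    """
    payload = "".join(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload)


# Example: build a system metric and an app metric (hypothetical paths).
example_lines = [
    format_metric("servers.web01.cpu.user", 12.5, 1560953820),
    format_metric("apps.checkout.response_ms", 87, 1560953820),
]
```

Calling `send_metrics(example_lines)` would push both points in one connection; the checklist item is satisfied once every host and app does something equivalent on a schedule.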

Logging

Logstash

  • system logs are forwarding
  • app logs are forwarding
  • grok patterns are configured
  • filters built
  • [ ]
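
Grok patterns are, in essence, named regular expressions that turn a raw log line into structured fields. The sketch below shows the same idea in plain Python for a common access-log layout, which can be handy for testing a pattern before putting it into a Logstash filter. The log format, field names, and sample line are assumptions for illustration only.

```python
import re

# Named groups play the role of grok field names (all names here are hypothetical).
ACCESS_LOG_RE = re.compile(
    r'(?P<client>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)


def parse_access_line(line):
    """Return a dict of fields for a matching line, or None if it does not match."""
    match = ACCESS_LOG_RE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])  # numeric status is easier to filter on
    return fields


sample = '127.0.0.1 - - [19/Jun/2019:14:17:00 +0000] "GET /health HTTP/1.1" 200 512'
parsed = parse_access_line(sample)
```

Lines that return `None` correspond to what Logstash would tag as a grok parse failure, so counting them is a quick way to see whether a pattern covers the real log traffic.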

Alerting

Sensu

  • system alerts configured
  • app alerts configured
  • app specific alerts configured
  • health check for each node in cluster
  • [ ]
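
A per-node health check can be sketched as a small script that follows the Nagios-style exit codes Sensu checks use (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The thresholds, the status-code mapping, and the idea of probing an HTTP endpoint on each node are assumptions for this example; the checklist does not prescribe how the check is implemented.

```python
import time
import urllib.error
import urllib.request

# Exit codes understood by Sensu (Nagios plugin convention).
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(status_code, elapsed_s, warn_s=0.5, crit_s=2.0):
    """Map an HTTP status and response time to a check exit code.

    status_code is None when the node could not be reached at all.
    The thresholds are illustrative defaults, not recommendations.
    """
    if status_code is None or status_code >= 500:
        return CRITICAL
    if status_code >= 400:
        return WARNING
    if elapsed_s >= crit_s:
        return CRITICAL
    if elapsed_s >= warn_s:
        return WARNING
    return OK


def check_node(url, timeout=5.0):
    """Probe one node's health URL and return the exit code for it."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as exc:
        status = exc.code  # server answered, but with an error status
    except (urllib.error.URLError, OSError):
        status = None  # connection refused, DNS failure, timeout, ...
    return classify(status, time.monotonic() - start)
```

A wrapper script would call `sys.exit(check_node(url))` with the node's own health URL, so that each node in the cluster is checked individually rather than only through the load balancer.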

Pingdom

  • app is added to Pingdom
  • external service checks added
  • checks are set to update HipChat
  • checks are set to call PagerDuty
  • [ ]

SLAs

PagerDuty

  • escalation process defined
  • groups configured in PagerDuty
  • [ ]

StatusPage

  • application is added as a component
  • critical alerts are set to update StatusPage
  • [ ]

Runbooks/Documentation

Confluence

CMDB

  • systems added to CMDB
  • runbook created in Confluence
  • hosts added to asset management
  • troubleshooting steps defined
  • [ ]

Robustness and Resilience/DR Requirements

Redundant Hardware / Cluster Performance/Scalability

  • Load balanced?
  • load tested
  • redundant hosts
  • config management
  • auto-scale
  • time to restore
  • [ ]