steinbrueckri/OpsReadiness.md

## OpsReadiness.md

Operational Readiness Checklist

Types of Operability Requirements


Project Requirements
Configuration Management
CI/CD
Service Level Management Requirements
Monitoring
Logging
Alerting
SLAs
Runbooks/Documentation
Robustness and Resilience/DR Requirements

Project Requirements


 Software Stack
 Hardware requirements
 Environment Count
 Instance Count
[ ]

Configuration Management

Chef

Infrastructure and tool install/configure

 Server is launched through Chef
 Base is installed
 Application is installed through Chef
 App configuration is in a CM recipe
 Config for each environment is managed in Chef
 Project hosts run Chef on a schedule
[ ]

CI/CD

Jenkins

Deploy code and apps

 staging deploy built
 prod deploy built
 testing configured
 testing gated
[ ]

Service Management Requirements

Rundeck

Central job scheduling
Elevated access for devs

 Rundeck keys are on the box
 cron jobs are stored on the Rundeck server
 server application jobs are configured
[ ]

Monitoring

Graphite

Monitoring Requirements

 hosts are sending system metrics (CPU, Memory, Disk, Network)
 server apps are sending metrics (server: nginx, IIS; middleware: redis, rabbitmq...)
 apps are sending specific metrics (process: cpu, memory; garbage collection; process forking...)
 response times are recorded (every node, external)
[ ]
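
As a rough illustration of the "sending metrics" items above: Carbon, Graphite's ingestion daemon, accepts a plaintext protocol of one `path value timestamp` line per metric, by default on TCP port 2003. The sketch below is a minimal sender under that assumption. The host name `graphite.example.internal` and the metric path are hypothetical, and a real deployment would more likely use an existing collector than hand-written code.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Build one plaintext-protocol line: '<path> <value> <timestamp>\\n'."""
    if timestamp is None:
        timestamp = time.time()
    return f"{path} {value} {int(timestamp)}\n"


def send_metrics(lines, host="graphite.example.internal", port=2003, timeout=5.0):
    """Send pre-formatted metric lines to a Carbon plaintext listener over TCP.

    The default host is a placeholder (assumption); 2003 is Carbon's usual
    plaintext port but may differ in a given installation.
    """
    payload = "".join(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload)


# Example usage (not executed here because it needs a reachable Carbon host):
#   line = format_metric("servers.web01.cpu.user", 12.5)
#   send_metrics([line])
```

Keeping formatting separate from sending makes the line format easy to verify without a network connection.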

Logging

Logstash


 system logs are forwarding
 app logs are forwarding
 grok patterns are configured for logs
 filters built
[ ]
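
A grok pattern is, in essence, a named regular expression that turns a raw log line into structured fields. The Python sketch below shows the same idea for a web access log line. It is not Logstash configuration; the field names and the sample line are assumptions chosen for illustration.

```python
import re

# Named groups play the role that named captures play in a grok pattern.
ACCESS_LOG = re.compile(
    r'(?P<client_ip>\S+) \S+ (?P<user>\S+) \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)


def parse_access_line(line):
    """Return a dict of fields for a matching access log line, else None."""
    match = ACCESS_LOG.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    # A '-' in the size column means no body was sent.
    fields["bytes"] = 0 if fields["bytes"] == "-" else int(fields["bytes"])
    return fields


# Hypothetical sample line in the common access log layout.
SAMPLE = '127.0.0.1 - - [10/Oct/2023:13:55:36 +0000] "GET /health HTTP/1.1" 200 612'
```

Lines that do not match return `None`, which is roughly what a grok parse failure signals: the pattern, not the log, needs attention.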

Alerting

Sensu


 system alerts configured
 app alerts configured
 app specific alerts configured
 health check for each node in cluster
[ ]
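
Sensu checks follow the Nagios plugin convention: the check prints a one-line status message and exits with 0 (OK), 1 (WARNING), 2 (CRITICAL), or 3 (UNKNOWN). The sketch below applies that convention to a disk usage check. The thresholds and the checked path are assumptions, not values from this checklist.

```python
import shutil

# Exit codes used by Nagios-style checks, which Sensu also understands.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_disk(used_percent, warn=80.0, crit=90.0):
    """Map a disk usage percentage to (exit_code, message)."""
    if used_percent >= crit:
        return CRITICAL, f"CRITICAL: disk {used_percent:.1f}% used"
    if used_percent >= warn:
        return WARNING, f"WARNING: disk {used_percent:.1f}% used"
    return OK, f"OK: disk {used_percent:.1f}% used"


def run_check(path="/"):
    """Measure usage of the filesystem at `path` and return the exit code."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        print(f"UNKNOWN: cannot read {path}: {exc}")
        return UNKNOWN
    status, message = evaluate_disk(usage.used / usage.total * 100)
    print(message)
    return status


# In a real check script the process exit code carries the result:
#   import sys
#   sys.exit(run_check("/"))
```

Separating the threshold logic from the measurement keeps the alert rules testable without touching a real filesystem.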

Pingdom


 App is added to Pingdom
 external service checks added
 checks are set to update HipChat
 checks are set to call PagerDuty
[ ]

SLAs

PagerDuty


 escalation process defined
 groups configured in PagerDuty
[ ]
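
Defining an escalation process usually means writing down who is paged first and how long each level has to acknowledge before the next one is paged. The sketch below models that as plain data. It does not call the PagerDuty API, and the responder names and timeouts are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationLevel:
    responder: str      # hypothetical team or person name
    wait_minutes: int   # time allowed to acknowledge before escalating


def current_level(levels, minutes_unacknowledged):
    """Return the level that should be paged after the given unacknowledged time."""
    if not levels:
        raise ValueError("an escalation process needs at least one level")
    elapsed = 0
    for level in levels:
        elapsed += level.wait_minutes
        if minutes_unacknowledged < elapsed:
            return level
    # Every timeout has passed: stay on the final level.
    return levels[-1]


# Example policy (assumed values, for illustration only).
POLICY = [
    EscalationLevel("primary-on-call", 15),
    EscalationLevel("secondary-on-call", 15),
    EscalationLevel("engineering-manager", 30),
]
```

Writing the policy down in this form makes gaps visible, such as a level with no backup or an overall time to reach a human that exceeds the SLA.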

StatusPage


 Application is added as a component
 Critical alerts are set to update StatusPage
[ ]

Runbooks/Documentation

Confluence

CMDB


 systems added to CMDB
 runbook created in Confluence
 hosts added to asset management
 troubleshooting steps defined
[ ]

Robustness and Resilience/DR Requirements

Redundant Hardware / Cluster
Performance/Scalability

 load balanced
 load tested
 redundant hosts
 config management
 auto-scale
 time to restore
[ ]