ljfranklin/sre-book-club-3.md

## sre-book-club-3.md

      
Chapters 10-15 of Site Reliability Engineering

Possible discussion topics:


Thoughts on this recommendation: Being on-call should strike a balance between quantity (percent of time spent on on-call activities) and quality (number of incidents that occur while on-call).

Quantity: Spend at least 50% of time on engineering; of the remainder, no more than 25% should be on-call.
Quality: If too many incidents occur during a given on-call shift, the SRE will not have time to properly carry out incident response responsibilities such as root-cause analysis, remediation, and follow-up activities like writing a postmortem and fixing bugs. Google found these activities take ~6 hours on average, so the maximum is 2 incidents per 12-hour on-call shift.
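
To make the incident-cap arithmetic concrete, here is a minimal Python sketch. The function name `max_incidents_per_shift` and its parameters are hypothetical, chosen only for illustration; the 6-hour and 12-hour figures are the ones quoted above. It may be handy for discussing what the cap would look like with a different shift length.

```python
def max_incidents_per_shift(shift_hours: float, hours_per_incident: float) -> int:
    """Return how many incidents fit in a shift if each one needs
    `hours_per_incident` hours of response and follow-up work.

    Hypothetical helper for discussion; not taken from the book.
    """
    if shift_hours < 0 or hours_per_incident <= 0:
        raise ValueError("shift_hours must be >= 0 and hours_per_incident > 0")
    # Only whole incidents count: a partially handled incident does not fit.
    return int(shift_hours // hours_per_incident)


# Figures from the notes above: ~6 hours per incident, 12-hour shift.
print(max_incidents_per_shift(shift_hours=12, hours_per_incident=6))
```

With the book's figures this gives the cap of 2 incidents per shift; an 8-hour shift under the same assumption would leave room for only 1.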


One Emergency Response recommendation was to intentionally break your systems to see if they fail in the way you expect.

Anyone want to share experiences of doing this at Pivotal?


How can we build a stronger postmortem culture?

One suggestion: "In a monthly newsletter, an interesting and well-written postmortem is shared with the entire organization."
Does CF keep a canonical list of past incidents?

Should engineering teams be required to review this list?