@ljfranklin
Last active April 21, 2017 15:35
Chapters 10-15 of Site Reliability Engineering

Possible discussion topics:

  • Thoughts on this recommendation: being on-call should strike a balance between quantity (percent of time spent on on-call activities) and quality (number of incidents that occur while on-call).

    • Quantity: Spend at least 50% of time doing engineering; no more than 25% of the remainder should be on-call.
    • Quality: If too many incidents occur during a given on-call shift, the SRE will not have time to properly carry out incident response work such as root-cause analysis, remediation, and follow-up activities like writing a postmortem and fixing bugs. Google found these activities take ~6 hours per incident on average, which implies a maximum of 2 incidents per 12-hour on-call shift.
  • One Emergency Response recommendation was to intentionally break your systems to see if they fail in the way you expect.

    • Anyone want to share experiences of doing this at Pivotal?
  • How can we build a stronger postmortem culture?

    • One suggestion: "In a monthly newsletter, an interesting and well-written postmortem is shared with the entire organization."
    • Does CF keep a canonical list of past incidents?
      • Should engineering teams be required to review this list?
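
For the "Quality" topic above, the 2-incidents-per-shift ceiling is just the shift length divided by the average handling time. A minimal sketch of that arithmetic, assuming the ~6 hour figure quoted above; the function name and parameters are hypothetical, chosen only for illustration:

```python
import math


def max_incidents_per_shift(shift_hours: float,
                            hours_per_incident: float = 6.0) -> int:
    """Upper bound on incidents that can be fully handled in one shift.

    "Fully handled" here means root-cause analysis, remediation, and
    follow-up (postmortem, bug fixes). The 6-hour default is the average
    quoted in the notes above, not a measured value for any other team.
    """
    if shift_hours < 0 or hours_per_incident <= 0:
        raise ValueError("shift_hours must be >= 0 and hours_per_incident > 0")
    # Round down: a partially handled incident does not count as handled.
    return math.floor(shift_hours / hours_per_incident)


if __name__ == "__main__":
    for hours in (8, 12, 24):
        print(f"{hours}h shift: at most {max_incidents_per_shift(hours)} incident(s)")
```

A team with a different average handling time could pass its own `hours_per_incident` to see how the ceiling shifts.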