@nathenharvey
Last active October 4, 2016 21:56
Incident documentation for the yum_repository resource escaped defect.

14-Sep-2016 - Escaped Defect - yum_repository resource

Meeting

Start every PM stating the following

  1. This is a blameless Post Mortem.
  2. We will not focus on the past events as they pertain to "could've", "should've", etc.
  3. All follow up action items will be assigned to a team/individual before the end of the meeting. If the item is not going to be top priority leaving the meeting, don't make it a follow up item.

Incident Leader: Jeremy Werkau

Post Mortem Leader: Nathen Harvey

Description

The yum_repository resource was added to core and released in chef-client version 12.14.60. The new core resource was not fully compatible with the custom resource of the same name shipped as part of the yum cookbook.

The 12.14.60 release of Chef Client included a number of other regressions as well. We will use the regressions around the yum_repository resource as a proxy for the release as a whole. We will not dig into the specifics of the other regressions, though they are captured in this incident report.

Timeline

All times listed in UTC.

Time to Detect and Resolve

  • Time to detect - 50 minutes - 18:19 Chef Client 12.14.60 released, 19:09 GitHub issue 5317 opened.
  • Time to resolve - 6 days, 5 hours, 1 minute
    • 6 hours, 52 minutes - 14-Sep-2016 18:19 Chef Client 12.14.60 released, 15-Sep-2016 01:11 current build of chef-client released that includes the fixes.
    • 5 days, 5 hours, 27 minutes - 14-Sep-2016 18:19 Chef Client 12.14.60 released, 19-Sep-2016 23:46 Chef Client 12.14.77 released
    • 6 days, 5 hours, 1 minute - 14-Sep-2016 18:19 Chef Client 12.14.60 released, 20-Sep-2016 23:20 Doc site includes yum_repository resource
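
The resolve intervals above can be re-derived from the listed timestamps. The following is a minimal sketch in plain Ruby, not part of any Chef tooling; the constant and method names (`RELEASED`, `MILESTONES`, `elapsed`, `describe`) are illustrative assumptions, and the timestamps are copied from the timeline as UTC.

```ruby
# Timestamps copied from the timeline above (all UTC).
RELEASED = Time.utc(2016, 9, 14, 18, 19) # Chef Client 12.14.60 released

MILESTONES = {
  'Current build with fixes released' => Time.utc(2016, 9, 15, 1, 11),
  'Chef Client 12.14.77 released'     => Time.utc(2016, 9, 19, 23, 46),
  'Doc site includes yum_repository'  => Time.utc(2016, 9, 20, 23, 20)
}.freeze

# Returns [days, hours, minutes] elapsed between two Time values.
def elapsed(from, to)
  total_minutes = ((to - from) / 60).round
  days, remainder = total_minutes.divmod(24 * 60)
  hours, minutes = remainder.divmod(60)
  [days, hours, minutes]
end

# Formats an elapsed triple as a human-readable string.
def describe(days, hours, minutes)
  "#{days}d #{hours}h #{minutes}m"
end

MILESTONES.each do |label, at|
  puts "#{label}: #{describe(*elapsed(RELEASED, at))}"
end
```

Keeping the raw timestamps as the source of truth and deriving the durations from them avoids hand-arithmetic slips when a timeline is edited later.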

Contributing Factor(s)

  • GitHub issue 5282, "yum_repository action :delete doesn't seem to work", was still open at the time of release.
    • Not recognized as a regression.
      • Expected: The cookbook's custom resource overrides the core provider. Actual: The core provider won out.
      • The core provider's provides method will always win out on systems with yum.
    • No clear communication between release & community engineering
  • Moving custom resources from cookbooks into core chef reduces our test coverage on the resources.
  • CHANGELOG for 12.14.60 was unclear about the scope of the change at time of release.
  • Chef documentation site did not include release notes until 20-Sep-2016, six days after release.
  • Other regressions and broken Travis and Jenkins builds:
    • UID and GID collisions in Jenkins clogged up the build pipeline
    • Resolved late Thursday before a "no-release" Friday
  • Engineering was not aware of the Zendesk / Customer Support issues being opened.
  • Bumping Ruby in the same release - that's a big change!
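
The precedence problem described in the first contributing factor can be illustrated with a toy model. This is a sketch only and is not Chef's actual provider resolver: the class `ToyResolver`, its methods, and the `yum_available` node fact are all hypothetical names, and the rule "a supported core implementation beats a cookbook one" is an assumption chosen to mirror the behaviour reported above.

```ruby
# Toy model only: this is NOT Chef's provider resolver. Every name here
# (ToyResolver, register, resolve, :yum_available) is hypothetical.
class ToyResolver
  Candidate = Struct.new(:name, :source, :supported)

  def initialize
    @candidates = Hash.new { |hash, key| hash[key] = [] }
  end

  # Register an implementation for a resource name. `supported` is a
  # callable that receives node facts and returns true if it applies.
  def register(resource, name:, source:, supported: ->(_node) { true })
    @candidates[resource] << Candidate.new(name, source, supported)
  end

  # Assumed rule for this sketch: among the candidates that support the
  # node, one from :core beats one from :cookbook.
  def resolve(resource, node)
    usable = @candidates[resource].select { |c| c.supported.call(node) }
    usable.find { |c| c.source == :core } || usable.first
  end
end

resolver = ToyResolver.new
resolver.register(:yum_repository,
                  name: 'yum cookbook custom resource', source: :cookbook)
resolver.register(:yum_repository,
                  name: 'core chef-client resource', source: :core,
                  supported: ->(node) { node[:yum_available] })

# On a node with yum, the core implementation is selected even though the
# cookbook also provides yum_repository.
puts resolver.resolve(:yum_repository, { yum_available: true }).name
```

Under a rule like this, any property or action that exists only in the cookbook version is unavailable on yum systems as soon as the core version ships, which is why tests need to migrate along with the resource.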

Stabilization Steps

  • Release Notes and CHANGELOG on GitHub updated to reflect the changes.
  • Updates made to master branch of source code to address the issues.
  • Nightly release of chef-client with the required changes.
  • Released a version of the yum cookbook that added deprecation warnings.

Impact

  • Failed chef-client runs for anyone using a yum_repository resource with a url parameter or a delete action and chef-client version 12.14.60.

Other regressions found in Chef Client 12.14.60 release:

GitHub Issues

Build Failures

Corrective Actions

  • Decide and document a process for recommending our customers hold off on upgrading. Owner: COOL team and Product Management.
    • Suggest: Announce all regressions that are going to trigger a bug fix release.
  • Discuss moving release target days of the week: pre-release announcement moves to Wednesday, target release moves to Monday. Owner: COOL team and Product Management.
  • Migrate tests from cookbooks with custom providers when migrating providers to core chef-client. Owner: Community Engineering / Tim Smith & Lamont Granquist.
  • Document how the provider resolver works and provide guidance when migrating providers to core chef-client. Owner: Lamont Granquist.
  • Research other projects' documentation practices. How can we get better release notes at time of PR? Owner: Ryan Hass.
  • Additional automation for creating docs from source code. Owner: David Wrede.
    • Ideas, suggestions, and possibilities:
      • Autogenerate resource documentation from code in a similar fashion to InSpec team
      • Move docs for chef-client into the code base

Link to Blog Post

A post outlining this incident is available on the Chef blog. TODO: Update this with the link.
