Dependency API Postmortem

2014-08-12 - Dependency API Issues - Community

Start every PM stating the following

  1. This is a blameless Postmortem.
  2. We will not focus on past events as they pertain to "could've", "should've"...
  3. All follow up action items will be assigned to a team/individual before the end of the meeting. If the item is not going to be top priority leaving the meeting, don't make it a follow up item.

Incident Leader: Christopher Webber

Description

Users got an error when trying to use the /universe endpoint.

Timeline

All Times UTC

  • 2014-08-07 - @cwebber releases 2.7.1 of the supermarket cookbook and updates prod and staging to use it
  • 2014-08-12 15:04 - @cwebber begins deploy of supermarket
  • 2014-08-12 15:10 - Mark Harrison (@mh) receives pages that 500s are elevated and that Sidekiq is unreachable
  • 2014-08-12 15:12 - deploy is completed
  • 2014-08-12 15:13 - @mh and @cwebber discuss alerts during deploy being normal and that the fix to unicorn is being worked on
  • 2014-08-12 15:25 - @svanharmelen makes mention of /universe returning an error.
  • 2014-08-12 15:26 - Incident begins
  • 2014-08-12 15:30 - Jeremiah Snapp (@JHS) updates http://status.getchef.com to reflect that we are working on an issue
  • 2014-08-12 15:30 - In incident room, we discuss that chef-boneyard/supermarket#50 exists because of what we believe to be a similar issue on staging.
  • 2014-08-12 15:38 - We discuss rolling back the cookbook to a previous version.
  • 2014-08-12 15:42 - We determine that the way forward is to manually update the .env.production symlink.
  • 2014-08-12 15:43 - @mh begins deploying fix.
  • 2014-08-12 15:48 - @mh confirms that the fix is good on one node.
  • 2014-08-12 15:53 - Seth Vargo (@sethvargo) and @JHS discuss using Dependency API instead of Berkshelf API to disambiguate what we are talking about.
  • 2014-08-12 15:55 - @mh confirms fix is complete.
  • 2014-08-12 15:57 - @sethvargo verifies that fix is complete.
  • 2014-08-12 15:57 - @JHS updates http://status.getchef.com to note that incident is resolved.

Contributing Factors

The 2.7.1 version of the supermarket cookbook broke the symlinking of .env. This wasn't initially discovered when the update was done for two reasons:

  • Staging uses a Redis instance in the default location; prod does not.
  • The symlink code only gets executed upon a deploy_revision, and this was the first time a revision was deployed to prod since the cookbook update.

Stabilization Steps

Manually added a symlink from /srv/supermarket/current/.env.production to /srv/supermarket/shared/.env.production.
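
The manual step above could be sketched as a small idempotent helper. This is only an illustration of the fix, not code from the supermarket cookbook: the function name `ensure_symlink` is hypothetical, and the paths are passed in by the caller rather than taken from any real deploy configuration.

```python
import os
import tempfile


def ensure_symlink(link_path, target_path):
    """Make link_path a symlink pointing at target_path.

    Returns True if the link was created or repointed, False if it was
    already correct. Hypothetical helper, not part of the cookbook.
    """
    if os.path.islink(link_path):
        if os.readlink(link_path) == target_path:
            return False  # already points at the right place
        os.remove(link_path)  # stale link: remove it and recreate below
    elif os.path.exists(link_path):
        # Refuse to clobber a real file or directory.
        raise FileExistsError(f"{link_path} exists and is not a symlink")
    os.symlink(target_path, link_path)
    return True


if __name__ == "__main__":
    # Demo in a throwaway directory that mimics the shared/current layout.
    root = tempfile.mkdtemp()
    os.makedirs(os.path.join(root, "shared"))
    os.makedirs(os.path.join(root, "current"))
    target = os.path.join(root, "shared", ".env.production")
    with open(target, "w") as f:
        f.write("EXAMPLE_SETTING=1\n")  # placeholder content, assumed
    link = os.path.join(root, "current", ".env.production")
    print(ensure_symlink(link, target))
    print(ensure_symlink(link, target))
```

On the affected nodes, the equivalent call would use /srv/supermarket/current/.env.production as the link and /srv/supermarket/shared/.env.production as the target. Because the helper does nothing when the link is already correct, it would be safe to run on every node, including the one already fixed by hand.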

Impact

Users were unable to resolve dependencies using Berkshelf. Total Time: ~50 mins.

Corrective Actions
