- This is a blameless Postmortem.
- We will not focus on past events as they pertain to "could've" or "should've".
- All follow-up action items will be assigned to a team/individual before the end of the meeting. If an item is not going to be top priority leaving the meeting, don't make it a follow-up item.
Users got an error when trying to use the /universe endpoint.
All Times UTC
- 2014-08-07 - @cwebber releases 2.7.1 of the supermarket cookbook and updates prod and staging to use it
- 2014-08-12 15:04 - @cwebber begins deploy of supermarket
- 2014-08-12 15:10 - Mark Harrison (@mh) receives pages that 500s are elevated and that Sidekiq is unreachable
- 2014-08-12 15:12 - deploy is completed
- 2014-08-12 15:13 - @mh and @cwebber discuss that alerts during a deploy are normal and that a fix to unicorn is being worked on
- 2014-08-12 15:25 - @svanharmelen makes mention of /universe returning an error.
- 2014-08-12 15:26 - Incident begins
- 2014-08-12 15:30 - Jeremiah Snapp (@JHS) updates http://status.getchef.com to reflect that we are working on an issue
- 2014-08-12 15:30 - In incident room, we discuss that chef-boneyard/supermarket#50 exists because of what we believe to be a similar issue on staging.
- 2014-08-12 15:38 - We discuss rolling back the cookbook to a previous version.
- 2014-08-12 15:42 - We determine that the way forward is to manually update the .env.production symlink.
- 2014-08-12 15:43 - @mh begins deploying fix.
- 2014-08-12 15:48 - @mh confirms that the fix is good on one node.
- 2014-08-12 15:53 - Seth Vargo (@sethvargo) and @JHS discuss using Dependency API instead of Berkshelf API to disambiguate what we are talking about.
- 2014-08-12 15:55 - @mh confirms fix is complete.
- 2014-08-12 15:57 - @sethvargo verifies that fix is complete.
- 2014-08-12 15:57 - @JHS updates http://status.getchef.com to note that incident is resolved.
The 2.7.1 version of the supermarket cookbook broke the symlinking of .env. This wasn't initially discovered when the update was done for two reasons:
- Staging uses a redis instance in the default location, prod does not.
- The symlink code only gets executed upon a deploy_revision and this was the first time the revision was deployed to prod since the cookbook update.
Manually added a symlink from /srv/supermarket/current/.env.production to /srv/supermarket/shared/.env.production.
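The manual fix can be expressed as a small idempotent check, which could also run after a deploy to confirm the link exists. This is only a sketch in Python, not the cookbook's actual Chef (Ruby) code: the function name `ensure_env_symlink` and its parameters are hypothetical, and only the two paths come from the incident notes.

```python
import os
from pathlib import Path


def ensure_env_symlink(current_dir, shared_dir, name=".env.production"):
    """Make current_dir/name a symlink pointing at shared_dir/name.

    Returns True if the link was created or repaired, False if it was
    already correct. Raises FileNotFoundError if the shared file is missing.
    (Hypothetical helper for illustration; not part of the cookbook.)
    """
    target = Path(shared_dir) / name
    link = Path(current_dir) / name

    # Refuse to create a dangling link: the shared file must exist first.
    if not target.is_file():
        raise FileNotFoundError(f"shared env file missing: {target}")

    # Already pointing at the right place: nothing to do.
    if link.is_symlink() and os.readlink(link) == str(target):
        return False

    # Remove a stale link or a stray regular file before relinking.
    if link.is_symlink() or link.exists():
        link.unlink()

    link.symlink_to(target)
    return True
```

With the paths from this incident, the call would be `ensure_env_symlink("/srv/supermarket/current", "/srv/supermarket/shared")`. Running it a second time is a no-op, so it is safe to repeat on every node.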
Users were unable to resolve dependencies using Berkshelf. Total Time: ~50 mins.
- Correct the issue with the symlinking - Released as part of v2.7.2 of the supermarket cookbook
- Make deploys not noisy - https://trello.com/c/ddg4h6NL
- Create a playbook on using the cookbook straight from GitHub. - Chef Ops
- Throw an exception if Redis is inaccessible - Fullstack (https://trello.com/c/bDXIjdVT)
- /status should return not ok status if a service is unreachable. - Fullstack (https://trello.com/c/Ho4fHdXy)
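The last two action items describe the same idea: a dependency that cannot be reached should surface as a failure instead of passing silently. A minimal sketch of that behavior follows, written in Python for illustration only. Supermarket's real /status endpoint is not shown in these notes, so the function names, the JSON shape, and the choice of HTTP 503 are all assumptions.

```python
import json
import socket


def tcp_reachable(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def build_status(checks):
    """Run each named check and summarize the result.

    `checks` maps a service name to a zero-argument callable that returns
    True when the service is reachable. A check that raises is treated as
    unreachable. Returns (http_status_code, json_body).
    (Hypothetical helper; the response format is an assumption.)
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "reachable" if check() else "unreachable"
        except Exception:
            results[name] = "unreachable"

    # Any single unreachable service makes the overall status "not ok".
    ok = all(state == "reachable" for state in results.values())
    body = {"status": "ok" if ok else "not ok", "services": results}
    return (200 if ok else 503), json.dumps(body, sort_keys=True)
```

A caller might wire in a Redis check as `build_status({"redis": lambda: tcp_reachable(redis_host, redis_port)})`, where the host and port come from the same configuration the application reads. Returning a non-200 code lets monitoring page on the status check itself, without waiting for user reports.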