2015-08-13 - Supermarket and berkshelf outage - CUSTOMER
The post mortem meeting was held at 3:30PM EDT on Friday, August 14, 2015. The meeting was held via a Google Hangout that was live streamed to YouTube. Internally, we used #_postmortem_20150814 to discuss the post mortem.
- This is a blameless Post Mortem.
- We will not focus on the past events as they pertain to "could've", "should've", etc.
- All follow up action items will be assigned to a team/individual before the end of the meeting. If the item is not going to be top priority leaving the meeting, don't make it a follow up item.
Incident Leader: Nathen Harvey
- Berkshelf unable to download cookbook dependencies.
- Unable to log in to the Supermarket.
- OpsWorks failing during provisioning and any other stage that utilizes Berkshelf.
- Sporadic page rendering issues on the Supermarket.
This incident began at 5:22AM UTC on Thursday, August 13, 2015. It was resolved at 11:49AM the same day.
Time to detect: 248 minutes, 5:22AM - 9:30AM
Time to resolve: 387 minutes, 5:22AM - 11:49AM
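The two durations above can be re-derived from the timeline timestamps. The following is a minimal sketch of that arithmetic; the helper name `minutes_between` is illustrative and not part of any tooling mentioned in this report.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

# Timestamps (UTC) taken from this report.
incident_start = datetime.strptime("2015-08-13 05:22", FMT)
first_internal_discussion = datetime.strptime("2015-08-13 09:30", FMT)
incident_resolved = datetime.strptime("2015-08-13 11:49", FMT)


def minutes_between(start, end):
    """Whole minutes elapsed between two datetimes (hypothetical helper)."""
    return int((end - start).total_seconds() // 60)


time_to_detect = minutes_between(incident_start, first_internal_discussion)
time_to_resolve = minutes_between(incident_start, incident_resolved)

print(f"Time to detect:  {time_to_detect} minutes")
print(f"Time to resolve: {time_to_resolve} minutes")
```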
- 05:14 UTC - Page rendering issues reported in #supermarket channel https://chefio.slack.com/archives/supermarket/p1439443183000327
- 05:22 UTC - ELB fully-cut over to new instances, INCIDENT begins
- 06:10 UTC - reported in #chef IRC
- 07:18 UTC - Zendesk #5639 opened
- 07:20 UTC - Incident reported on AWS forum - https://forums.aws.amazon.com/thread.jspa?messageID=667893
- 09:03 UTC - Amazon sends an email to some Chef employees asking about the issue.
- 09:20 UTC - Zendesk #5641 opened
- 09:30 UTC - Kimball Johnson & Thom May discussing the issue, unsure how to escalate.
- 09:46 UTC - Zendesk #5642 opened
- 09:50 UTC - Zendesk #5643 opened
- 10:19 UTC - Zendesk #5644 opened
- 10:30 UTC - Thom May and Eric Alwais report the issue in #supermarket, aren't sure how to escalate
- 10:43 UTC - Nathen called INCIDENT in #customer-support, took over as Incident Commander
- 10:44 UTC - Eric Alwais paged Paul Mooring
- 10:47 UTC - Nathen called INCIDENT in #operations and claimed IC
- 10:49 UTC - Nathen provides #incident with information about a recent code deploy that might be causing the issue
- 10:52 UTC - https://twitter.com/opscode_status/status/631780765668372480
- 11:06 UTC - Zendesk #5645 opened
- 11:08 UTC - Paul confirms that all Supermarket servers behind the ELB are new
- 11:23 UTC - https://twitter.com/opscode_status/status/631788292506300416
- 11:23 UTC - Nell Shamrell, Supermarket Engineer, joined the incident call and helped provide context and troubleshooting.
- 11:24 UTC - Nell confirmed that no monitors were reporting trouble.
- 11:27 UTC - Paul confirms that all of the Supermarket servers are running the new code (
- 11:29 UTC - Zoom launched for synchronous troubleshooting and screen sharing. The recording is available in drive.
- 11:30 UTC - One old instance was added to the ELB and one new instance was removed in an effort to resolve the issue. The removed instance was the one being displayed in the /universe response.
- 11:45 UTC - Old supermarket instances have been placed behind the ELB, new ones taken out of service (Paul Mooring)
- 11:46 UTC - redis cache for /universe cleared by uploading a new cookbook to the Supermarket and making a GET request to /universe (Chris Webber)
- 11:49 UTC - Pagerduty incident resolved (Paul Mooring)
- 11:49 UTC - Incident resolved
- 11:49 UTC - https://twitter.com/opscode_status/status/631795218350784512
Supermarket was switched to using the new omnibus build for production (supermarket 1.12.1-alpha.0+git.46.6db4a91). The newly deployed servers were missing a variable for the HOST key in supermarket.json and fell back to the hostname of the server. As a result, the download_url keys gave the real hostname of the server (app-supermarket-prod-i-c663c400.opscode.us:443) instead of the Supermarket hostname (https://supermarket.chef.io).
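The fallback behavior described above can be illustrated with a short sketch. This is not Supermarket's actual code (Supermarket is not written in Python); the function names `resolve_host` and `build_download_url` and the exact URL path are assumptions made for illustration only.

```python
import socket


def resolve_host(config, fallback=socket.getfqdn):
    """Return the configured public host, or the machine's own name if HOST is unset.

    Hypothetical helper: it mimics the failure mode in this report, where a
    missing HOST value silently fell back to the server's hostname.
    """
    host = config.get("HOST")
    return host if host else fallback()


def build_download_url(host, cookbook, version):
    """Build a download URL for a cookbook version (path shape is an assumption)."""
    return f"https://{host}/api/v1/cookbooks/{cookbook}/versions/{version}/download"


# Old servers: HOST is set, so clients receive the public hostname.
old_config = {"HOST": "supermarket.chef.io"}
old_url = build_download_url(resolve_host(old_config), "example", "1.0.0")

# New servers: HOST is missing, so the internal hostname leaks into the URL.
new_config = {}
internal_name = "app-supermarket-prod-i-c663c400.opscode.us:443"
new_url = build_download_url(
    resolve_host(new_config, fallback=lambda: internal_name), "example", "1.0.0"
)

print(old_url)
print(new_url)
```

A silent fallback like this returns a well-formed response either way, which is consistent with a 200-only check not noticing the problem. Failing loudly when HOST is missing would be one way to surface the misconfiguration at deploy time.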
The sporadic page rendering issues on the Supermarket can be attributed to a period in time when both the new and old Supermarket servers were behind the ELB. Two different versions of the Supermarket were running at the same time.
We were working under a very tight deadline.
From Amazon regarding Berkshelf:
OpsWorks has an optional feature where customers can enable a berkshelf run before OpsWorks executes the actual Chef run. If enabled, the berkshelf run would be done before each Chef run, e.g. setup or application deployment.
From Allan Webb on YouTube
The choice to use Berkshelf on a stack is monolithic for the user. That's the "optional" part. You either have it turned on or turned off. There is no user choice as to whether Berkshelf runs in a given lifecycle event; that's all controlled by the OpsWorks cookbooks. While it makes sense to me that Berkshelf would run during the Setup lifecycle, it makes less sense that it runs for a Deploy lifecycle (I assume that the cookbooks were already pulled in when Berkshelf ran under Setup). But as an end user, I have no control over that.
Monitoring validates that a 200 response is being returned by the /universe API.
Monitoring system is inside AWS behind the firewall, so the servers' FQDN works there but, as this outage shows, does not work for the outside world.
No functional test to confirm that Berkshelf works with the /universe endpoint.
The first people to notice the issue at Chef did not know how to respond to or report the issue.
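The monitoring gaps listed above suggest a check that inspects the content of the /universe response rather than only its status code. The sketch below assumes the response is a JSON object mapping cookbook name to version to a metadata object that contains a `download_url` key; the function name `bad_download_urls` and the sample data are hypothetical. Fetching the response is omitted, and such a check would need to run from outside the firewall to be meaningful.

```python
from urllib.parse import urlparse

EXPECTED_HOST = "supermarket.chef.io"


def bad_download_urls(universe, expected_host=EXPECTED_HOST):
    """Return (cookbook, version, url) for every entry not served from expected_host.

    `universe` is the already-parsed /universe JSON (assumed shape:
    {cookbook: {version: {"download_url": ...}}}).
    """
    bad = []
    for cookbook, versions in universe.items():
        for version, meta in versions.items():
            url = meta.get("download_url", "")
            if urlparse(url).hostname != expected_host:
                bad.append((cookbook, version, url))
    return bad


# Hypothetical sample data shaped like the assumed /universe response.
sample_universe = {
    "example_good": {
        "1.0.0": {
            "download_url": "https://supermarket.chef.io/api/v1/cookbooks/example_good/versions/1.0.0/download"
        }
    },
    "example_bad": {
        "1.0.0": {
            "download_url": "https://app-supermarket-prod-i-c663c400.opscode.us:443/api/v1/cookbooks/example_bad/versions/1.0.0/download"
        }
    },
}

problems = bad_download_urls(sample_universe)
for cookbook, version, url in problems:
    print(f"{cookbook} {version}: unexpected download host in {url}")
```

An empty result means every download URL points at the public hostname; any entry in the result would be grounds for an alert even when the endpoint itself returns 200.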
The old Supermarket instances were put in service behind the ELB. The new Supermarket instances were marked out-of-service in the ELB. A new cookbook was uploaded to the Supermarket and a GET request was made against the /universe endpoint. This cleared the redis cache of the
Berkshelf unable to download cookbook dependencies.
Unable to log in to the Supermarket.
OpsWorks failing during provisioning and any other stage that utilizes Berkshelf.
Sporadic page rendering issues on the Supermarket.
The time to resolve was 387 minutes.
Add external monitor that validates berkshelf functions properly (Nell)
Add a functional test to the Supermarket deployment process that validates berkshelf is functioning properly. (Nell)
Refactor the Supermarket codebase so that only HOST needs to be set in the
Work with Employee Experience team to get Incident Response as part of on-boarding and on-going education (Ben)
Plan and coordinate Incident response drills (Ben)
Schedule follow-up meeting for release and change management process for Supermarket (Nathen)