nathenharvey/supermarket-and-berkshelf-outage.md

2015-08-13 - Supermarket and Berkshelf outage - CUSTOMER

Meeting

The post mortem meeting was held at 3:30PM EDT on Friday, August 14, 2015.  The meeting was held via a Google Hangout that was live streamed to YouTube.  Internally, we used #_postmortem_20150814 to discuss the post mortem.

This is a blameless Post Mortem.
We will not focus on the past events as they pertain to "could've", "should've", etc.
All follow-up action items will be assigned to a team or individual before the end of the meeting. If an item is not going to be a top priority leaving the meeting, don't make it a follow-up item.

Incident Leader: Nathen Harvey

Description


Berkshelf unable to download cookbook dependencies.
Unable to log in to the Supermarket.
OpsWorks failing during provisioning and any other stage that utilizes Berkshelf.
Sporadic page rendering issues on the Supermarket.

Timeline

This incident began at 5:22AM UTC on Thursday, August 13, 2015.  It was resolved at 11:49AM UTC the same day.
Time to detect:  248 minutes, 5:22AM - 9:30AM UTC
Time to resolve:  387 minutes, 5:22AM - 11:49AM UTC
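
The two durations above follow directly from the timeline timestamps. A minimal Python sketch of the arithmetic, assuming all three times are UTC on 2015-08-13 (the helper name `minutes_between` is illustrative, not part of any tooling used during the incident):

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"


def minutes_between(start, end):
    """Return the whole minutes elapsed between two 'YYYY-MM-DD HH:MM' strings."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)


# Timestamps taken from the timeline (UTC).
incident_start = "2015-08-13 05:22"
detected = "2015-08-13 09:30"
resolved = "2015-08-13 11:49"

time_to_detect = minutes_between(incident_start, detected)
time_to_resolve = minutes_between(incident_start, resolved)

print(f"Time to detect:  {time_to_detect} minutes")
print(f"Time to resolve: {time_to_resolve} minutes")
```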

05:14 UTC - Page rendering issues reported in #supermarket channel https://chefio.slack.com/archives/supermarket/p1439443183000327
05:22 UTC - ELB fully cut over to new instances; INCIDENT begins
06:10 UTC - Issue reported in #chef IRC
07:18 UTC - Zendesk #5639 opened
07:20 UTC - Incident reported on AWS forum - https://forums.aws.amazon.com/thread.jspa?messageID=667893
09:03 UTC - Amazon sends an email to some Chef employees asking about the issue.
09:20 UTC - Zendesk #5641 opened
09:30 UTC - Kimball Johnson & Thom May discussing the issue, unsure how to escalate.
09:46 UTC - Zendesk #5642 opened
09:50 UTC - Zendesk #5643 opened
10:19 UTC - Zendesk #5644 opened
10:30 UTC - Thom May and Eric Alwais report the issue in #supermarket, aren't sure how to escalate
10:43 UTC - Nathen called INCIDENT in #customer-support, took over as Incident Commander
10:44 UTC - Eric Alwais paged Paul Mooring
10:47 UTC - Nathen called INCIDENT in #operations and claimed IC
10:49 UTC - Nathen provides #incident with information about a recent code deploy that might be causing the issue
10:52 UTC - https://twitter.com/opscode_status/status/631780765668372480
11:06 UTC - Zendesk #5645 opened
11:08 UTC - Paul confirms that all Supermarket servers behind the ELB are new
11:23 UTC - https://twitter.com/opscode_status/status/631788292506300416
11:23 UTC - Nell Shamrell, Supermarket Engineer, joined the incident call and helped provide context and troubleshooting.
11:24 UTC - Nell confirmed that no monitors were reporting trouble.
11:27 UTC - Paul confirms that all of the Supermarket servers are running the new code (supermarket 1.12.1-alpha.0+git.46.6db4a91)
11:29 UTC - Zoom launched for synchronous troubleshooting and screen sharing.  The recording is available in drive.
11:30 UTC - One old instance was added to the ELB and one new instance was removed in an effort to resolve the issue.  The instance removed was the one that was being displayed in the /universe response.
11:45 UTC - Old supermarket instances have been placed behind the ELB, new ones taken out of service (Paul Mooring)
11:46 UTC - Redis cache for /universe cleared by uploading a new cookbook to the Supermarket and making a GET request to /universe  (Chris Webber)
11:49 UTC - PagerDuty incident resolved (Paul Mooring)
11:49 UTC - Incident resolved
11:49 UTC - https://twitter.com/opscode_status/status/631795218350784512

Contributing Factor(s)

Supermarket was switched to the new omnibus build for production (supermarket 1.12.1-alpha.0+git.46.6db4a91).  The newly deployed servers were missing a value for the HOST key in supermarket.json and fell back to the hostname of the server.  As a result, the location_path and download_url keys in the /universe response gave the real hostname of the server (app-supermarket-prod-i-c663c400.opscode.us:443) instead of the Supermarket hostname (https://supermarket.chef.io).
The sporadic page rendering issues on the Supermarket can be attributed to a period of time when both the new and old Supermarket servers were behind the ELB.  Two different versions of the Supermarket were running at the same time.
We were working under a very tight deadline.
From Amazon regarding Berkshelf:
OpsWorks has an optional feature where customers can enable a berkshelf run before OpsWorks executes the actual Chef run.

If enabled, the berkshelf run would be done before each Chef run, e.g. setup or application deployment.

From Allan Webb on YouTube:
The choice to use Berkshelf on a stack is monolithic for the user. That's the "optional" part. You either have it turned on or turned off.

There is no user choice as to whether Berkshelf runs in a given lifecycle event; that's all controlled by the OpsWorks cookbooks. While it makes sense to me that Berkshelf would run during the Setup lifecycle, it makes less sense that it runs for a Deploy lifecycle (I assume that the cookbooks were already pulled in when Berkshelf ran under Setup). But as an end user, I have no control over that.

Monitoring validates that the /universe API returns a 200 response.
The monitoring system is inside AWS, behind the firewall, so the FQDN works there but, as this outage shows, the FQDN does not work for the outside world.
There was no functional test to confirm that Berkshelf works with the /universe endpoint.
The first people to notice the issue at Chef did not know how to respond to or report the issue.
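
The monitoring gap described above (a 200 response from /universe while the URLs inside it pointed at an internal hostname) could be caught by inspecting the response body. The following is a minimal Python sketch, not the monitor that Chef runs. It assumes the /universe payload is a JSON object keyed by cookbook name, then by version, with each entry carrying the location_path and download_url keys named above. The function name, the sample cookbook, and the sample URLs are illustrative.

```python
from urllib.parse import urlsplit

# Keys in each /universe entry that carry URLs (names taken from the text above).
URL_KEYS = ("location_path", "download_url")


def find_unexpected_hosts(universe, expected_host):
    """Return (cookbook, version, key, url) for every URL not on expected_host.

    A missing or empty key is also reported, because its hostname is None.
    """
    problems = []
    for cookbook, versions in universe.items():
        for version, entry in versions.items():
            for key in URL_KEYS:
                url = entry.get(key, "")
                if urlsplit(url).hostname != expected_host:
                    problems.append((cookbook, version, key, url))
    return problems


# Illustrative payload only: the cookbook name, version, and paths are made up.
sample_universe = {
    "example_cookbook": {
        "1.0.0": {
            "location_path": "https://app-supermarket-prod-i-c663c400.opscode.us:443/api/v1",
            "download_url": "https://supermarket.chef.io/api/v1/cookbooks/example_cookbook/versions/1.0.0/download",
        }
    }
}

problems = find_unexpected_hosts(sample_universe, "supermarket.chef.io")
for cookbook, version, key, url in problems:
    print(f"{cookbook} {version}: {key} points at {url}")
```

For a check like this to be meaningful, the payload would need to be fetched from outside the firewall, since the internal hostname resolves for anything running inside AWS.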

Stabilization Steps

The old Supermarket instances were put in service behind the ELB.  The new Supermarket instances were marked out-of-service in the ELB.  A new cookbook was uploaded to the Supermarket and a GET request was made against the /universe endpoint.  This cleared the Redis cache for the /universe endpoint.

Impact


Berkshelf unable to download cookbook dependencies.
Unable to log in to the Supermarket.
OpsWorks failing during provisioning and any other stage that utilizes Berkshelf.
Sporadic page rendering issues on the Supermarket.
The time to resolve was 387 minutes.


Corrective Actions


Add an external monitor that validates Berkshelf functions properly (Nell)
Add a functional test to the Supermarket deployment process that validates Berkshelf is functioning properly (Nell)
Refactor the Supermarket codebase so that only FQDN or HOST needs to be set in the supermarket.json (Nell)
Work with Employee Experience team to get Incident Response as part of onboarding and ongoing education (Ben)
Plan and coordinate Incident Response drills (Ben)
Schedule follow-up meeting for release and change management process for Supermarket (Nathen)
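
One way to picture the FQDN/HOST refactor listed above: resolve the public hostname from whichever of the two settings is present, and fail loudly when neither is set instead of falling back to the machine's hostname. This is a minimal Python sketch of that idea only. Supermarket itself is not written in Python, and the lowercase key names `fqdn` and `host` and the function name are assumptions made for illustration.

```python
import json


def resolve_public_host(config):
    """Return the public hostname from either setting, or raise if neither is set."""
    # Hypothetical key names; the real supermarket.json layout may differ.
    for key in ("fqdn", "host"):
        value = config.get(key)
        if value:
            return value
    raise ValueError(
        "supermarket.json must set 'fqdn' or 'host'; "
        "refusing to fall back to the machine hostname"
    )


# Illustrative configuration containing only one of the two settings.
config = json.loads('{"fqdn": "supermarket.chef.io"}')
public_host = resolve_public_host(config)
print(public_host)
```

Raising at startup turns a silently wrong /universe response into a failed deploy, which is easier to detect before traffic is cut over.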