@cwebberOps

cwebberOps/postmortem.md

Last active Aug 29, 2015
2014-07-08 - Berkshelf v2 outage - Community

Start every PM stating the following

  1. This is a blameless Post Mortem.
  2. We will not focus on the past events as they pertain to "could've" "should've"...
  3. All follow-up action items will be assigned to a team/individual before the end of the meeting. If the item is not going to be top priority leaving the meeting, don't make it a follow-up item.

Incident Leader: Christopher Webber

Description

Berkshelf v2.x was unable to follow the redirect from http to https. As a result, its users were unable to use cookbooks.opscode.com and could not do their work.

Timeline

All Times UTC

  • 2014-07-07 19:15:00 - DNS is updated in Dyn making community.opscode.com and cookbooks.opscode.com be served by Supermarket.
  • 2014-07-07 19:40:00 - Launch is declared complete.
  • 2014-07-07 19:51:24 - pcorliss reports in #berkshelf that there are issues with Berkshelf v2: "Chef response did not contain a JSON body"
  • 2014-07-07 20:17:48 - pcorliss reports the same issue in #chef
  • 2014-07-07 20:20:20 - davidordave confirms the issue pcorliss is seeing
  • 2014-07-07 20:31:35 - cubed confirms issue as well.
  • 2014-07-07 20:38:39 - Adam reaches out to cwebber and jtimberman to look at issue in #chef.
  • 2014-07-07 20:40:37 - cwebber starts discussion with pcorliss about the issue in #berkshelf.
  • 2014-07-07 20:42:54 - pcorliss provides a gist of the error output.
  • 2014-07-07 20:44:13 - cwebber makes note of the http -> https uplift issue we had seen in testing.
  • 2014-07-07 20:47:09 - cwebber reaches out to Adam to confirm escalation path
  • 2014-07-07 20:48:00 - cwebber notifies the Ops, Dev and Community rooms in HipChat that an incident is being started
  • 2014-07-07 20:48:00 - Ops On-Call is notified via HipChat that we are starting an incident for a Berkshelf v2 Outage
  • 2014-07-07 20:50:00 - cwebber recaps issue in Incident room in HipChat
  • 2014-07-07 20:51:00 - sethvargo updates http://status.opscode.com
  • 2014-07-07 20:52:00 - sethvargo dives into the code to verify the issue
  • 2014-07-07 20:53:00 - jtimberman begins work on allowing cookbooks.opscode.com to pass-thru without the 301 to https.
  • 2014-07-07 20:59:00 - jtimberman explains fix prior to implementation
  • 2014-07-07 21:03:00 - @jgoldschrafe reports that he is seeing similar issues with an on-prem berkshelf-api server https://twitter.com/jgoldschrafe/status/486254022890778624
  • 2014-07-07 21:03:00 - @sethvargo responds with a fix to use https://supermarket.getchef.com/api/v1 instead of http://cookbooks.opscode.com/api/v1
  • 2014-07-07 21:04:00 - jtimberman posts diff of changes for review
  • 2014-07-07 21:09:00 - cwebber updates the attributes for the staging environment
  • 2014-07-07 21:09:00 - jtimberman uploads v2.4.2 of the supermarket cookbook to Chef Server
  • 2014-07-07 21:09:00 - jtimberman uploads changes to the supermarket-app role
  • 2014-07-07 21:10:00 - CCR (chef-client run) on supermarket-app in prod
  • 2014-07-07 21:14:00 - jtimberman begins process of attempting to install berkshelf 2.0.17 for testing
  • 2014-07-07 21:16:00 - cwebber confirms calls to http://cookbooks.opscode.com pass through as http
  • 2014-07-07 21:19:05 - cwebber reaches out to pcorliss, cubed, davidordave to confirm the fix
  • 2014-07-07 21:21:16 - pcorliss reports that things are working again
  • 2014-07-07 21:22:48 - davidordave responds that he is still seeing errors: http://pastebin.com/NmLF00a4
  • 2014-07-07 21:26:09 - icarus reports that https://supermarket.getchef.com/cookbooks/ant is white paging
  • 2014-07-07 21:26:00 - Focus switches away from this incident to Supermarket being unresponsive
  • 2014-07-07 21:50:00 - status.opscode.com updated to reflect the ongoing issue http://status.opscode.com/post/91084396786/berkshelf-2-outage-ongoing
  • 2014-07-07 22:06:00 - status.opscode.com updated with resolved status. http://status.opscode.com/post/91085603896/berkshelf-2-outage-resolved
  • 2014-07-07 23:58:00 - Josh Glass reports in Incident room that he is still seeing issues with berkshelf
  • 2014-07-08 00:00:00 - Ryan Cragun notes that the issue is with https://github.com/ruby/ruby/blob/v1_9_3_547/lib/open-uri.rb#L235-244
  • 2014-07-08 00:08:00 - sethvargo begins work on correcting the bug with Berkshelf 2.
  • 2014-07-08 00:09:00 - jtimberman points to http://mislav.uniqpath.com/2011/07/faraday-advanced-http/ for more info.
  • 2014-07-08 00:11:00 - sethvargo makes note that this is actually a bug in ruby. http://stackoverflow.com/questions/10013293/open-uri-is-not-redirecing-http-to-https
  • 2014-07-08 00:20:00 - sethvargo opens https://github.com/berkshelf/berkshelf/pull/1251
  • 2014-07-08 00:41:00 - reset releases Berkshelf 2.0.18

Root Cause

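Ruby has a bug in the open-uri library: it does not follow a redirect from http to https. Berkshelf v2 requests to http://cookbooks.opscode.com therefore failed once Supermarket began answering them with a 301 to https.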
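
To make the failure mode concrete, here is a minimal sketch of the kind of scheme check involved. It is modeled on our reading of the open-uri lines linked in the timeline, where a redirect is followed only if both URLs share a scheme or both are http/ftp. The helper name `legacy_redirect_allowed?` and the constant are hypothetical and are not part of open-uri.

```ruby
require "uri"

# Schemes the legacy check treats as interchangeable with each other.
# "https" is absent, which is what blocks an http -> https hop.
# (Assumption: this mirrors the Ruby 1.9.3 behaviour; it is not open-uri code.)
LEGACY_INTERCHANGEABLE_SCHEMES = %w[http ftp].freeze

# Hypothetical helper: true when a redirect from `from` to `to`
# would be followed under the rule described above.
def legacy_redirect_allowed?(from, to)
  from_scheme = URI(from).scheme.downcase
  to_scheme = URI(to).scheme.downcase
  return true if from_scheme == to_scheme

  LEGACY_INTERCHANGEABLE_SCHEMES.include?(from_scheme) &&
    LEGACY_INTERCHANGEABLE_SCHEMES.include?(to_scheme)
end

# http -> http is followed.
puts legacy_redirect_allowed?("http://cookbooks.opscode.com/api/v1",
                              "http://cookbooks.opscode.com/api/v1/cookbooks")
# http -> https is refused, which is the case Berkshelf v2 hit.
puts legacy_redirect_allowed?("http://cookbooks.opscode.com/api/v1",
                              "https://supermarket.getchef.com/api/v1")
```

Under this rule the only ways out are to stop issuing the redirect on the server or to have the client request https directly, which matches the two fixes recorded in the timeline.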

Stabilization Steps

  • Allowed cookbooks.opscode.com to be served via http
  • Released Berkshelf v2.0.18
  • Adam advised users in IRC to update their sources for Berkshelf
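
The first stabilization step amounts to a host-based exception to the https redirect. The sketch below only illustrates that rule; the actual change was made in the supermarket cookbook and role and is not reproduced here. The helper `route_request`, the constant, and the assumption that the redirect keeps the same host and path are all hypothetical.

```ruby
# Hosts allowed to stay on plain http (illustrative list, not the real config).
HTTP_PASSTHROUGH_HOSTS = ["cookbooks.opscode.com"].freeze

# Hypothetical helper: decides whether a request is served as-is or
# answered with a redirect to https.
# Returns [:pass_through, nil] or [:redirect, target_url].
def route_request(scheme, host, path)
  return [:pass_through, nil] if scheme == "https"
  return [:pass_through, nil] if HTTP_PASSTHROUGH_HOSTS.include?(host.downcase)

  # Assumption: the redirect target keeps the same host and path.
  [:redirect, "https://#{host}#{path}"]
end

# Berkshelf v2 traffic stays on http and never sees the 301.
p route_request("http", "cookbooks.opscode.com", "/api/v1/cookbooks")
# Other http traffic is still uplifted to https.
p route_request("http", "community.opscode.com", "/cookbooks")
```

Keeping the exception to a single host limits the change to the legacy API endpoint while the rest of the site continues to redirect.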

Impact

Users of Berkshelf v2.x were unable to use cookbooks.opscode.com until a new version was released.

Duration: ~ 5.5 hrs
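
The duration can be checked against the timeline. This sketch assumes the incident is measured from the DNS update (2014-07-07 19:15:00 UTC) to the release of Berkshelf 2.0.18 (2014-07-08 00:41:00 UTC); the post-mortem does not state which endpoints were used, so that choice is an assumption.

```ruby
# Assumed start: DNS updated in Dyn. Assumed end: Berkshelf 2.0.18 released.
incident_start = Time.utc(2014, 7, 7, 19, 15, 0)
incident_end = Time.utc(2014, 7, 8, 0, 41, 0)

# Subtracting two Time objects yields seconds as a Float.
duration_hours = (incident_end - incident_start) / 3600.0

puts format("%.2f hours", duration_hours)
```

That works out to 5 hours 26 minutes, in line with the rounded figure above.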

Corrective Actions

  • Add a note to the README for the supermarket cookbook to test against Berkshelf 2 (cwebber)
  • Post status.getchef.com updates to IRC (cwebber)