@nellshamrell
Created October 23, 2015 21:01
# 2015-10-05
## Meeting
1. This is a blameless Post Mortem.
2. We will not focus on the past events as they pertain to "could've", "should've", etc.
3. All follow up action items will be assigned to a team/individual before the end of the meeting. If the item is not going to be top priority leaving the meeting, don't make it a follow up item.
### Incident Leader: Nell Shamrell-Harrington
## Description
A module within Ridley that Berkshelf API Client depended on was removed from Ridley (the module was not part of Ridley's public API). Unfortunately, the released version of Berkshelf API Client was still looking for this module. A fix was available on the Berkshelf API Client's master branch, but it had not yet been released as a Ruby gem.
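The failure mode described above can be sketched in plain Ruby. This is a simplified stand-in, not Faraday's, Ridley's, or Berkshelf API Client's actual code: `MiddlewareRegistry`, `GzipMiddleware`, `load_library`, and `build_client_stack` are hypothetical names, and the `"Faraday::Response"` string is only a label chosen so the message resembles the one reported in the timeline. The point it illustrates is that a client looking up a middleware by symbol silently depends on some other library having registered that symbol.

```ruby
# Simplified stand-in for a symbol-keyed middleware registry.
# NOT the real Faraday implementation; all names here are hypothetical.
class MiddlewareRegistry
  class NotRegistered < StandardError; end

  def initialize(label)
    @label = label
    @entries = {}
  end

  def register(key, klass)
    @entries[key] = klass
  end

  # Raises if nothing was ever registered under +key+.
  def lookup(key)
    @entries.fetch(key) do
      raise NotRegistered, "#{key.inspect} is not registered on #{@label}"
    end
  end
end

# Hypothetical middleware class standing in for the removed internal module.
class GzipMiddleware; end

# Stand-in for loading the upstream library. Older releases register :gzip
# as a side effect of being loaded; the newer release no longer does.
def load_library(registry, registers_gzip:)
  registry.register(:gzip, GzipMiddleware) if registers_gzip
end

# Stand-in for the client, which builds its stack by symbol and therefore
# relies on the registration side effect above.
def build_client_stack(registry)
  [registry.lookup(:gzip)]
end

# With the older library release, the lookup succeeds.
old_registry = MiddlewareRegistry.new("Faraday::Response")
load_library(old_registry, registers_gzip: true)
old_stack = build_client_stack(old_registry)

# With the newer release, the same client code fails at lookup time.
new_registry = MiddlewareRegistry.new("Faraday::Response")
load_library(new_registry, registers_gzip: false)
error_message =
  begin
    build_client_stack(new_registry)
    nil
  rescue MiddlewareRegistry::NotRegistered => e
    e.message
  end

puts error_message
```

Nothing in the client changed between the two runs, which is consistent with the observation in the timeline that no Supermarket deployment or Berkshelf release had happened when the errors started.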
## Timeline
Timeline of events, including exact duration of downtime.
The timeline should be in chronological order, showing what happened when, but
it should also explain what the team knew at the time.
For example, someone deploys a bad build that triggers an alert, but no one
initially realizes this is what happened. The timeline should list first that the
bad build was deployed, but that the on-call person was not aware of this at the
time it occurred. Later the timeline might list an event where the on-call person
becomes aware this is the case.
This incident began at 00:12UTC on Tuesday, October 6, 2015. It was resolved at 00:44UTC the same day.
**Time to detect**: 54 minutes, 23:18UTC Monday, October 5, 2015 - 00:12UTC on Tuesday, October 6, 2015
**Time to resolve**: 32 minutes, 00:12UTC Tuesday, October 6, 2015 - 00:44UTC Tuesday, October 6, 2015
* **00:11UTC** - Noah Kantrowitz (community member) tweeted at Nell Shamrell-Harrington (Supermarket engineer) that "Someone just changed the config on Supermarket and broke Berks. https://travis-ci.org/poise/poise/jobs/83801720" (https://twitter.com/kantrn/status/651187991688286208)
* **00:12UTC** - Nell Shamrell-Harrington declared an incident and assumed the role of Incident Commander
* **00:14UTC** - Nell Shamrell-Harrington replied to Noah's tweet, indicating that she was investigating (https://twitter.com/nellshamrell/status/651188830339383296)
* **00:15UTC** - Mark Anderson ran berks on one of his repos and noticed he was getting this error: "gzip is not registered on Faraday::Response;"
* **00:16UTC** - Nell Shamrell-Harrington tried running berks update on one of her repos and did not see the error
* **00:20UTC** - Dan DeLeo pointed out that Noah may be installing Berkshelf from a gem, and could have newer deps than ChefDK and be hitting a bug in a newly released version of faraday
* **00:22UTC** - Mark Anderson reported he was seeing the same error when running bundle exec berks update, but not berks update with ChefDK
* **00:25UTC** - Dan DeLeo posted repro steps in the #incident Slack room: 1) git clone poise, 2) bundle install, 3) bundle exec berks install -d. Following these steps produced the same error, ":gzip is not registered on Faraday::Response"
* **00:31UTC** - Dan DeLeo began looking into whether a recently released library might be breaking things. He checked these three and did not see any recent releases: https://rubygems.org/gems/berkshelf-api-client, https://rubygems.org/gems/faraday_middleware, https://rubygems.org/gems/faraday
* **00:35UTC** - Nell Shamrell-Harrington confirmed that no deployments or changes to Supermarket's config had occurred for over a week
* **00:36UTC** - Dan DeLeo found the lines in the Berkshelf API Client that were likely raising the error: https://github.com/berkshelf/berkshelf-api-client/blob/6793f473a8031df28767109220f7f0e59dfd0ead/lib/berkshelf/api_client/connection.rb#L35-L45
* **00:36UTC** - Steven Danna discovered a commit in ridley which removed the internal middleware for managing gzip (https://github.com/reset/ridley/commit/385bfd9a0c58024b8e1824810151662d226e05a1) and noticed that it had been released an hour previously as v4.2.1
* **00:38UTC** - Chris Webber brought Jamie Windsor, maintainer of the ridley gem, into the #incident chat room
* **00:44UTC** - Jamie Windsor yanked the ridley 4.2.1 gem
* **00:47UTC** - Nell Shamrell-Harrington tweeted the result of the investigation to Noah (https://twitter.com/nellshamrell/status/651197135631724546)
* **01:19UTC** - Jamie Windsor released ridley 4.3.0
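The detection and resolution figures quoted at the top of the timeline can be checked with a few lines of Ruby. This is only an arithmetic sketch: the variable names are illustrative, and the meaning attached to 23:18UTC (the start of the detection window) is an assumption, since the timeline does not list an event at that time.

```ruby
# Timestamps taken from the timeline summary, all in UTC.
# Assumption: 23:18UTC marks when the breaking change became visible to users;
# the source gives the time but does not describe the event.
detect_window_start = Time.utc(2015, 10, 5, 23, 18)
incident_declared   = Time.utc(2015, 10, 6, 0, 12)
incident_resolved   = Time.utc(2015, 10, 6, 0, 44)

# Time subtraction yields seconds as a Float; convert to whole minutes.
def minutes_between(from, to)
  ((to - from) / 60).round
end

time_to_detect  = minutes_between(detect_window_start, incident_declared)
time_to_resolve = minutes_between(incident_declared, incident_resolved)

puts "Time to detect:  #{time_to_detect} minutes"   # 54
puts "Time to resolve: #{time_to_resolve} minutes"  # 32
```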
## Stabilization Steps
Jamie Windsor yanked the ridley 4.2.1 gem
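Yanking removes the release for everyone at once. As a complementary, client-side illustration (not something the team did during this incident), the sketch below uses RubyGems' own `Gem::Requirement` to show how a loose constraint picks up a new patch release automatically and how an explicit exclusion would skip it. The `"~> 4.2"` constraint is hypothetical; the document does not say what constraint Berkshelf or Berkshelf API Client actually declared on ridley.

```ruby
require "rubygems"

# Versions named in the timeline.
yanked_release = Gem::Version.new("4.2.1")
fixed_release  = Gem::Version.new("4.3.0")

# Hypothetical loose constraint: any 4.x release at or above 4.2.
loose = Gem::Requirement.new("~> 4.2")

# Same constraint with the problematic release explicitly excluded.
guarded = Gem::Requirement.new("~> 4.2", "!= 4.2.1")

loose_allows_yanked   = loose.satisfied_by?(yanked_release)
guarded_allows_yanked = guarded.satisfied_by?(yanked_release)
guarded_allows_fixed  = guarded.satisfied_by?(fixed_release)

puts "loose allows 4.2.1:   #{loose_allows_yanked}"    # true
puts "guarded allows 4.2.1: #{guarded_allows_yanked}"  # false
puts "guarded allows 4.3.0: #{guarded_allows_fixed}"   # true
```

In a Gemfile, the equivalent guard would be written as `gem "ridley", "~> 4.2", "!= 4.2.1"`, again assuming that base constraint.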
## Impact
All users of the gem-installed version of Berkshelf saw a ":gzip is not registered on Faraday::Response" error when using Berkshelf. Users running Berkshelf from ChefDK did not see the error.
## Corrective Actions
Action items going forward to fix the issue and reduce chance of contributing factors being an issue.
This **MUST** include owners/teams assigned to these actions to see them through, and have an issue tracked in this repository (or otherwise linked to external team kanban/issue tracker).
* RFC: how to test using multiple versions of Chef. How are people doing it? What problems are there? (Owner: Noah)