2014-07-08 - Supermarket Unresponsive - Community

Start every PM by stating the following:

  1. This is a blameless Post Mortem.
  2. We will not focus on past events in terms of "could've" or "should've".
  3. All follow-up action items will be assigned to a team or individual before the end of the meeting. If an item is not going to be a top priority leaving the meeting, don't make it a follow-up item.

Incident Leader: Christopher Webber

Description

Supermarket was intermittently unresponsive starting approximately 2 hours after launch. This affected both Supermarket and the Berkshelf API server.

Timeline

All Times UTC

  • 2014-07-07 19:15:00 - community.opscode.com and cookbooks.opscode.com updated to point at Supermarket
  • 2014-07-07 19:18:00 - api.berkshelf.com pointed at Supermarket
  • 2014-07-07 19:40:00 - Deploy declared complete
  • 2014-07-07 21:26:09 - In #chef, icarus reports that https://supermarket.getchef.com/cookbooks/ant is returning a white page and that they are also getting 502s
  • 2014-07-07 21:27:00 - Seth Chisamore reports issues with the Berkshelf API returning errors
  • 2014-07-07 21:35:00 - Open communication moved to Sococo to allow for more real-time conversation
  • 2014-07-07 21:36:00 - Discussion around the community site sync worker causing a fair amount of traffic because it was connecting back to itself
  • 2014-07-07 21:36:00 - Chef client run (CCR) on supermarket-prod to pull in code that removed the background sync worker
  • 2014-07-07 21:38:00 - jtimberman notes that all three servers are out of the ELB
  • 2014-07-07 21:42:00 - Sean Horn posts a status update to status.opscode.com
  • 2014-07-07 21:49:00 - Ian Garrison notes that the health check is on port 443 and not 80
  • 2014-07-07 21:49:00 - This is confirmed as normal by cwebber and jtimberman
  • 2014-07-07 21:50:00 - jtimberman spins up two additional instances (m3.medium)
  • 2014-07-07 21:59:00 - Paul Mooring suggests backing off the timeout to 15s temporarily
  • 2014-07-07 22:00:00 - Adam makes note that we need to change the status to reflect that this is a second issue
  • 2014-07-07 22:01:00 - Paul Mooring makes the change and bumps the timeout to 25 seconds (see the health check sketch after this timeline)
  • 2014-07-07 22:12:00 - Status confirmed as updated
  • 2014-07-07 22:14:00 - Latency starts to decrease: https://s3.amazonaws.com/uploads.hipchat.com/7557/78724/Bw7QEAgGO1rQzvS/AWS_Management_Console.png
  • 2014-07-07 22:20:00 - cwebber makes note of decreased postgres connections http://i.cwebber.net/RDS__AWS_Console_2014-07-07_15-19-54_2014-07-07_15-20-05.jpg
  • 2014-07-07 22:25:00 - Discussion in Ops around the fact that unicorn was configured with 3 workers and m3.mediums only have one core
  • 2014-07-07 22:36:00 - Discussion in the Ops Sococo room about bumping the size of the instances; the decision was to proceed with two parallel options:
    • Start spinning up two additional instances at m3.xlarge
    • Cycle through the existing instances one at a time, resizing each to m3.xlarge
  • 2014-07-07 22:50:00 - reset pings cwebber on IRC to discuss possible rollback
  • 2014-07-07 22:52:00 - cwebber and reset agree to wait until 23:15 to evaluate the need for that.
  • 2014-07-07 22:55:00 - Running with five instances seems to be alleviating the pressure on the backends. http://i.cwebber.net/AWS_Management_Console_2014-07-07_15-54-59_2014-07-07_15-55-01.jpg
  • 2014-07-07 22:56:00 - Mark Harrison works to get sethvargo access to the backend to verify we are not seeing issues related to the app itself.
  • 2014-07-07 23:00:00 - All nodes in pool are now m3.xlarge
  • 2014-07-07 23:06:00 - Incident resolved
  • 2014-07-07 23:27:00 - jtimberman deploys a change to increment the number of unicorn workers
  • 2014-07-07 23:39:00 - jtimberman determines that the change didn't work
  • 2014-07-07 23:58:00 - jtimberman deploys a change that actually corrects the number of unicorn workers
  • 2014-07-08 01:42:00 - Reflecting back on traffic... http://i.cwebber.net/AWS_Management_Console_2014-07-07_18-40-29_2014-07-07_18-41-35.jpg
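
Several entries above touch the classic ELB health check: the check targets port 443 (noted at 21:49) and the timeout was bumped to 25 seconds (22:01). The sketch below shows what that kind of adjustment looks like, written with boto3 (the current AWS SDK for Python) rather than the tooling in use at the time; the load balancer name, health check path, and threshold values are assumptions, not the production settings.

```python
# Hypothetical sketch: relax the classic ELB health check so slow-but-alive
# backends are not ejected as aggressively. Names and values are illustrative.
import boto3

elb = boto3.client("elb", region_name="us-east-1")  # classic ELB API

elb.configure_health_check(
    LoadBalancerName="supermarket-prod",   # assumed load balancer name
    HealthCheck={
        "Target": "HTTPS:443/status",      # check on 443, as confirmed in the timeline; path is assumed
        "Interval": 30,                    # seconds between checks
        "Timeout": 25,                     # the 25 second value noted at 22:01
        "UnhealthyThreshold": 5,           # assumed: tolerate more failures before ejecting a node
        "HealthyThreshold": 2,
    },
)
```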

Root Cause

  • A mismatch between the number of unicorn workers and the number of cores caused traffic to back up (this may be a red herring; see the sizing sketch after this list)
  • The nodes were not sized to meet the demand
  • App nodes were being taken out of the ELB too quickly, which increased load on the remaining nodes
  • No guidelines from the development team on how to configure the application
  • Load planning was done before the addition of the /universe endpoint
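
The first item flags the ratio of unicorn workers to CPU cores (three workers on a single-core m3.medium), while noting it may be a red herring. Below is a minimal sketch of the sizing heuristic that observation implies, assuming the common rule of thumb of roughly one worker per core plus a little headroom for I/O wait; the helper name and the 1.5x factor are illustrative, not values from this incident.

```python
# Illustrative sizing heuristic only: derive the unicorn worker count from the
# core count of the instance instead of hard-coding it. The 1.5x headroom
# factor is an assumption, not a value from the post mortem.
import multiprocessing


def suggested_unicorn_workers(io_headroom=1.5):
    cores = multiprocessing.cpu_count()
    return max(2, int(cores * io_headroom))


if __name__ == "__main__":
    # An m3.medium (1 vCPU) suggests 2 workers; an m3.xlarge (4 vCPUs)
    # suggests 6, rather than a fixed worker count.
    print(suggested_unicorn_workers())
```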

Stabilization Steps

  • Background job that syncs to the old site was disabled
  • Resized the three existing instances to m3.xlarge (see the resize sketch after this list)
  • Added two additional nodes
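
The resize step above was done one node at a time (per the timeline) so the pool never lost more than one backend. Below is a minimal sketch of that procedure with boto3, rather than the tooling of the day; the instance ID is a placeholder.

```python
# Hypothetical sketch: resize one app node to m3.xlarge. The instance must be
# stopped to change its type; doing nodes one at a time keeps the rest serving.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
instance_id = "i-0123456789abcdef0"  # placeholder, not a real node

ec2.stop_instances(InstanceIds=[instance_id])
ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

ec2.modify_instance_attribute(
    InstanceId=instance_id,
    InstanceType={"Value": "m3.xlarge"},
)

ec2.start_instances(InstanceIds=[instance_id])
ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
```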

Impact

Total impact: ~2 hours (graphs indicate that increased maximum latency started at ~21:10). The incident made api.berkshelf.com and Supermarket unusable during that time.

Corrective Actions

  • The berkshelf-api server will use the /universe endpoint on Supermarket (reset/sethvargo)
  • Evaluate the number of unicorn workers so the service can meet that demand (jtimberman)
  • Berkshelf team to provide traffic data for api.berkshelf.com
  • Investigate caching of download URLs, possibly with Varnish (cwebber/fullstack)
    • Make sure the metric counter is non-blocking (cwebber/fullstack)
  • Updates to status.getchef.com will propagate to the #chef IRC channel (cwebber)
  • Add CloudWatch alarms for when nodes drop out of the ELB (Chef Ops); a sketch follows below
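
For the last item, a minimal sketch of the kind of CloudWatch alarm intended, again with boto3; the alarm name, threshold, and SNS topic ARN are placeholders, not values chosen by Chef Ops.

```python
# Hypothetical sketch: page when any instance behind the ELB is marked
# unhealthy. Alarm name, threshold, and SNS topic are placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="supermarket-elb-unhealthy-hosts",
    Namespace="AWS/ELB",
    MetricName="UnHealthyHostCount",
    Dimensions=[{"Name": "LoadBalancerName", "Value": "supermarket-prod"}],
    Statistic="Average",
    Period=60,
    EvaluationPeriods=2,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:chef-ops-pager"],  # placeholder ARN
)
```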