2014-07-08 - Supermarket Unresponsive - Community

Start every PM by stating the following:

  1. This is a blameless Post Mortem.
  2. We will not focus on past events as they pertain to "could've" or "should've"...
  3. All follow up action items will be assigned to a team/individual before the end of the meeting. If the item is not going to be top priority leaving the meeting, don't make it a follow up item.

Incident Leader: Christopher Webber

Description

Supermarket was intermittently unresponsive approximately 2 hours after launch. This affected both Supermarket and the Berkshelf API server.

Timeline

All Times UTC

  • 2014-07-07 19:15:00 - community.opscode.com and cookbooks.opscode.com updated to point at Supermarket
  • 2014-07-07 19:18:00 - api.berkshelf.com pointed at Supermarket
  • 2014-07-07 19:40:00 - Deploy declared complete
  • 2014-07-07 21:26:09 - In #chef, icarus reports that https://supermarket.getchef.com/cookbooks/ant is returning a white page, and also reports getting a 502
  • 2014-07-07 21:27:00 - Seth Chisamore reports issues with berkshelf api returning errors.
  • 2014-07-07 21:35:00 - Open communication moved to Sococo to help with more realtime conversation
  • 2014-07-07 21:36:00 - Discussion around the community site sync worker generating a fair amount of traffic because it was connecting back to Supermarket itself
  • 2014-07-07 21:36:00 - chef-client run (CCR) on supermarket-prod to pull in code that removed the background sync worker
  • 2014-07-07 21:38:00 - jtimberman notes that all three servers are out of the ELB
  • 2014-07-07 21:42:00 - Sean Horn updates status.opscode.com with status
  • 2014-07-07 21:49:00 - Ian Garrison notes that the health check is on port 443 and not 80
  • 2014-07-07 21:49:00 - This is confirmed as normal by cwebber and jtimberman
  • 2014-07-07 21:50:00 - jtimberman spins two additional instances (m3.medium)
  • 2014-07-07 21:59:00 - Paul Mooring suggests backing off the timeout to 15s temporarily
  • 2014-07-07 22:00:00 - Adam makes note that we need to change the status to reflect that this is a second issue
  • 2014-07-07 22:01:00 - Paul Mooring makes the change and bumps the timeout to 25 seconds.
  • 2014-07-07 22:12:00 - Status confirmed as updated
  • 2014-07-07 22:14:00 - Latency starts to decrease: https://s3.amazonaws.com/uploads.hipchat.com/7557/78724/Bw7QEAgGO1rQzvS/AWS_Management_Console.png
  • 2014-07-07 22:20:00 - cwebber makes note of decreased postgres connections http://i.cwebber.net/RDS__AWS_Console_2014-07-07_15-19-54_2014-07-07_15-20-05.jpg
  • 2014-07-07 22:25:00 - Discussion in Ops around the fact that unicorn was configured with 3 workers and m3.mediums only have one core
  • 2014-07-07 22:36:00 - We discussed bumping the size of the instances in the Ops Sococo room. The decision was to proceed with two options in parallel:
    • Start spinning two additional instances at m3.xlarge
    • Cycle through the existing instances one at a time, resizing each to m3.xlarge
  • 2014-07-07 22:50:00 - reset pings cwebber on IRC to discuss possible rollback
  • 2014-07-07 22:52:00 - cwebber and reset agree to wait until 23:15 to evaluate the need for a rollback.
  • 2014-07-07 22:55:00 - Running with five instances seems to be alleviating the pressure on the backends. http://i.cwebber.net/AWS_Management_Console_2014-07-07_15-54-59_2014-07-07_15-55-01.jpg
  • 2014-07-07 22:56:00 - Mark Harrison works to get sethvargo access to the backend to verify we are not seeing issues related to the app itself.
  • 2014-07-07 23:00:00 - All nodes in pool are now m3.xlarge
  • 2014-07-07 23:06:00 - Incident resolved
  • 2014-07-07 23:27:00 - jtimberman deploys a change to increment the number of unicorn workers
  • 2014-07-07 23:39:00 - jtimberman determines that the change didn't work
  • 2014-07-07 23:58:00 - jtimberman deploys a change that actually corrects the number of unicorn workers
  • 2014-07-08 01:42:00 - Reflecting back on traffic... http://i.cwebber.net/AWS_Management_Console_2014-07-07_18-40-29_2014-07-07_18-41-35.jpg

Root Cause

  • A mismatch between the number of unicorn workers and the number of cores caused traffic to back up (this may be a red herring); see the unicorn configuration sketch after this list
  • The nodes were not sized to meet the demand
  • App nodes were being taken out of the ELB too quickly, increasing load on the remaining nodes
  • No guidelines from dev on how to configure the application (e.g., how many unicorn workers to run)
  • Load planning was done before the addition of the /universe endpoint
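
As a concrete illustration of the worker/core mismatch above, here is a minimal unicorn configuration sketch. This is not the actual Supermarket config: the UNICORN_WORKERS override is a hypothetical knob, and the only point is to tie the worker count to the host's core count instead of hard-coding 3.

```ruby
# config/unicorn.rb -- illustrative sketch only, not the actual Supermarket config.
require 'etc'

# Tie the worker pool to the host's core count instead of hard-coding 3, so an
# m3.medium (1 core) and an m3.xlarge (4 cores) each get a sensible default.
# UNICORN_WORKERS is a hypothetical override; Etc.nprocessors needs Ruby >= 2.2.
worker_processes Integer(ENV.fetch('UNICORN_WORKERS', Etc.nprocessors))

# Load the app before forking so workers share memory via copy-on-write.
preload_app true
```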

Stabilization Steps

  • Background job that syncs to the old site was disabled
  • Resized three existing instances to m3.xlarge
  • Added two additional nodes

Impact

Total impact: ~2 hours (graphs indicate that increased maximum latency started at ~21:10). The incident made api.berkshelf.com and Supermarket unusable during that time.

Corrective Actions

  • berkshelf-api server will use /universe on Supermarket (reset/sethvargo); see the /universe sketch after this list
  • Evaluate the number of unicorn workers so the service can meet demand (jtimberman)
  • Berkshelf team to provide traffic numbers for api.berkshelf.com
  • Investigate caching of download URLs (maybe Varnish) (cwebber/fullstack)
    • Make sure the metric counter is non-blocking (cwebber/fullstack)
  • Updates to status.getchef.com will propagate to #chef IRC (cwebber)
  • Add CloudWatch alarms for when nodes drop out of the ELB (Chef Ops); see the alarm sketch after this list
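
For reference, here is a minimal sketch of pulling the dependency index from Supermarket's /universe endpoint, which is what the berkshelf-api action item above refers to. The hostname comes from this document; the response layout noted in the comments is an assumption based on how Supermarket's /universe behaves today, not something verified against the 2014 deployment.

```ruby
#!/usr/bin/env ruby
# Illustrative sketch: fetch the Berkshelf-style dependency index from /universe.
require 'json'
require 'net/http'
require 'uri'

uri = URI('https://supermarket.getchef.com/universe')
response = Net::HTTP.get_response(uri)
abort "unexpected status #{response.code}" unless response.is_a?(Net::HTTPSuccess)

# The body is assumed to map cookbook name => version => metadata
# (location_type, location_path, download_url, dependencies).
universe = JSON.parse(response.body)
puts "cookbooks in universe: #{universe.size}"
```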
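
And a minimal sketch of the CloudWatch alarm action item, using the aws-sdk CloudWatch client (an SDK version that postdates this incident). The load balancer name and SNS topic ARN are placeholders, not real resource names; the intent is simply to alert whenever the ELB reports any unhealthy hosts.

```ruby
# Illustrative sketch: alarm when any node drops out of the ELB.
# Requires the aws-sdk-cloudwatch gem; names and ARNs below are placeholders.
require 'aws-sdk-cloudwatch'

cloudwatch = Aws::CloudWatch::Client.new(region: 'us-east-1')

cloudwatch.put_metric_alarm(
  alarm_name:          'supermarket-elb-unhealthy-hosts',
  namespace:           'AWS/ELB',
  metric_name:         'UnHealthyHostCount',
  dimensions:          [{ name: 'LoadBalancerName', value: 'supermarket-prod-elb' }],
  statistic:           'Maximum',
  period:              60,                 # seconds per datapoint
  evaluation_periods:  2,                  # require two bad periods before alarming
  threshold:           0,
  comparison_operator: 'GreaterThanThreshold',
  alarm_actions:       ['arn:aws:sns:us-east-1:123456789012:chef-ops-alerts']
)
```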