- This is a blameless Post Mortem.
- We will not dwell on past events in terms of "could've" or "should've".
- All follow-up action items will be assigned to a team or individual before the end of the meeting. If an item will not be a top priority leaving the meeting, do not make it a follow-up item.
Supermarket became intermittently unresponsive approximately 2 hours after launch. This affected both Supermarket and the Berkshelf API server.
All Times UTC
- 2014-07-07 19:15:00 - community.opscode.com and cookbooks.opscode.com updated to point at Supermarket
- 2014-07-07 19:18:00 - api.berkshelf.com pointed at Supermarket
- 2014-07-07 19:40:00 - Deploy declared complete
- 2014-07-07 21:26:09 - In #chef, icarus reports that https://supermarket.getchef.com/cookbooks/ant is returning a blank white page and intermittent 502 errors
- 2014-07-07 21:27:00 - Seth Chisamore reports the Berkshelf API returning errors
- 2014-07-07 21:35:00 - Open communication moved to Sococo for more real-time conversation
- 2014-07-07 21:36:00 - Discussion around the community-site sync worker generating a fair amount of traffic because it was connecting back to Supermarket itself
- 2014-07-07 21:36:00 - Chef client run (CCR) on supermarket-prod to pull in code that removes the background sync worker
- 2014-07-07 21:38:00 - jtimberman notes that all three servers are out of the ELB
- 2014-07-07 21:42:00 - Sean Horn posts a status update to status.opscode.com
- 2014-07-07 21:49:00 - Ian Garrison notes that the ELB health check is on port 443, not 80
- 2014-07-07 21:49:00 - cwebber and jtimberman confirm this is expected
- 2014-07-07 21:50:00 - jtimberman spins up two additional instances (m3.medium)
- 2014-07-07 21:59:00 - Paul Mooring suggests backing off the timeout to 15s temporarily
- 2014-07-07 22:00:00 - Adam makes note that we need to change the status to reflect that this is a second issue
- 2014-07-07 22:01:00 - Paul Mooring makes the change and bumps the timeout to 25 seconds.
- 2014-07-07 22:12:00 - Status confirmed as updated
- 2014-07-07 22:14:00 - Latency starts to decrease: https://s3.amazonaws.com/uploads.hipchat.com/7557/78724/Bw7QEAgGO1rQzvS/AWS_Management_Console.png
- 2014-07-07 22:20:00 - cwebber makes note of decreased postgres connections http://i.cwebber.net/RDS__AWS_Console_2014-07-07_15-19-54_2014-07-07_15-20-05.jpg
- 2014-07-07 22:25:00 - Discussion in Ops around the fact that Unicorn was configured with 3 workers while m3.medium instances have only one core (see the configuration sketch after the timeline)
- 2014-07-07 22:36:00 - Discussion in the Ops Sococo room about increasing instance size. The decision was to proceed with two options in parallel:
  - Spin up two additional instances as m3.xlarge
  - Cycle through the existing instances one at a time, resizing each to m3.xlarge
- 2014-07-07 22:50:00 - reset pings cwebber on IRC to discuss possible rollback
- 2014-07-07 22:52:00 - cwebber and reset agree to wait until 23:15 to evaluate the need for that.
- 2014-07-07 22:55:00 - Running with five instances seems to be alleviating the pressure on the backends. http://i.cwebber.net/AWS_Management_Console_2014-07-07_15-54-59_2014-07-07_15-55-01.jpg
- 2014-07-07 22:56:00 - Mark Harrison works to get sethvargo access to the backend to verify we are not seeing issues related to the app itself.
- 2014-07-07 23:00:00 - All nodes in pool are now m3.xlarge
- 2014-07-07 23:06:00 - Incident resolved
- 2014-07-07 23:27:00 - jtimberman deploys a change to increase the number of Unicorn workers
- 2014-07-07 23:39:00 - jtimberman determines that the change did not take effect
- 2014-07-07 23:58:00 - jtimberman deploys a change that correctly sets the number of Unicorn workers
- 2014-07-08 01:42:00 - Reflecting back on traffic... http://i.cwebber.net/AWS_Management_Console_2014-07-07_18-40-29_2014-07-07_18-41-35.jpg
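For context on the worker-to-core mismatch discussed at 22:25, below is a minimal sketch of how a Unicorn configuration ties the worker pool to an instance's CPU count. This is a hypothetical example, not the actual supermarket-prod configuration; the file path, worker count, and timeout values are assumptions.

```ruby
# config/unicorn.rb -- hypothetical sketch, not the supermarket-prod config.
# Unicorn forks one worker process per `worker_processes`. Three workers on a
# single-core m3.medium contend for one CPU, so requests queue behind slow
# responses; an m3.xlarge exposes 4 vCPUs and can host a matching worker pool.

# Size the worker pool to the instance's vCPU count (4 on an m3.xlarge).
worker_processes 4

# Recycle any worker that spends more than 30 seconds on a single request.
timeout 30

# Load the app once in the master so forked workers share memory copy-on-write.
preload_app true
```

On Chef-managed nodes the worker count would presumably be driven from a node attribute rather than hardcoded, so resizing an instance also adjusts the pool.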
Contributing factors:
- A mismatch between the number of Unicorn workers and the number of cores caused traffic to back up (possibly a red herring)
- The nodes were not sized to meet the demand
- App nodes were being taken out of the ELB too quickly, which increased load on the remaining nodes
- No guidelines from dev on how to configure the application for production
- Load planning was done before the addition of the /universe endpoint
Stabilization steps:
- Background job that syncs to the old community site was disabled
- Resized the three existing instances to m3.xlarge
- Added two additional nodes
Total impact: ~2 hours (graphs indicate that increased maximum latency started at ~21:10). Supermarket and api.berkshelf.com were unusable during that time.
Follow-up action items:
- berkshelf-api server will use the /universe endpoint on Supermarket (reset/sethvargo); see the /universe sketch after this list
- Evaluate the number of Unicorn workers so the service can meet demand (jtimberman)
- Berkshelf team to provide traffic numbers for api.berkshelf.com
- Investigate caching of download URLs, possibly with Varnish (cwebber/fullstack)
- Make sure the metric counter is non-blocking (cwebber/fullstack)
- Updates to status.getchef.com will propagate to the #chef IRC channel (cwebber)
- Add CloudWatch alarms for when nodes drop out of the ELB (Chef Ops); see the alarm sketch after this list
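For the /universe action item, here is a minimal sketch of what consuming the endpoint from Ruby could look like. The JSON shape noted in the comment and the example cookbook name are assumptions for illustration, not a verified contract.

```ruby
# Hypothetical smoke test of the /universe endpoint that berkshelf-api (and
# Berkshelf clients) would consume; "apt" is just an example cookbook name.
require "net/http"
require "json"
require "uri"

uri = URI("https://supermarket.getchef.com/universe")
universe = JSON.parse(Net::HTTP.get(uri))

# Assumed shape: { "cookbook" => { "1.0.0" => { "download_url" => "...",
#                                               "dependencies" => { ... } } } }
versions = universe.fetch("apt", {})
puts "apt has #{versions.size} published versions"
```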
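For the CloudWatch action item, one possible shape of the alarm is sketched below using the aws-sdk Ruby gem, keyed on the ELB's HealthyHostCount metric. The load balancer name, region, threshold, and SNS topic ARN are placeholders, not the production values.

```ruby
# Hypothetical alarm definition; names, ARNs, and thresholds are placeholders.
require "aws-sdk"

cloudwatch = Aws::CloudWatch::Client.new(region: "us-east-1")

# Notify Ops when the ELB reports fewer healthy app nodes than expected.
cloudwatch.put_metric_alarm(
  alarm_name:          "supermarket-elb-healthy-hosts",
  namespace:           "AWS/ELB",
  metric_name:         "HealthyHostCount",
  dimensions:          [{ name: "LoadBalancerName", value: "supermarket-prod" }],
  statistic:           "Minimum",
  period:              60,   # one-minute samples
  evaluation_periods:  1,
  threshold:           3,    # alarm if fewer than 3 nodes are healthy
  comparison_operator: "LessThanThreshold",
  alarm_actions:       ["arn:aws:sns:us-east-1:123456789012:ops-alerts"]
)
```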