RogerHub Monday 12/7/2015 Outage
The Final Grade Calculator was broken from approximately 2:21AM PST to 11:22AM PST (9 hours 1 minute) on December 7th, 2015.
Problem 1: At around 2AM PST, I began seeing increased ping delays and dropped packets to the Dallas, TX Linode datacenter where RogerHub.com is hosted. At 2:21AM PST, I decided to invoke the failover mechanism and transfer the live site to a standby server running in the Fremont, CA Linode datacenter. HTTP traffic from the Dallas, TX server was routed to the Fremont, CA server transparently, while the DNS records were updated on Route53.
I verified that the site worked and that the administration backend was consistent. After the Dallas, TX server became available again, I set it up in standby mode so that another failover could be performed if needed.
Problem 4: I did not try to use the Final Grade Calculator after the failover operation. Had I done so, I would have noticed that it did not work.
It was good to have tested the failover mechanism during a low-traffic period (early in the morning for the United States). However, several additional steps would help catch problems like these sooner:
- Add monitoring for network latency or dropped packets to RogerHub.com.
- Add monitoring that actually tries to use the Final Grade Calculator.
- Add monitoring for 404s and other HTTP errors on RogerHub.
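
The three checks above could be sketched roughly as follows. This is only an illustration, not the monitoring that RogerHub actually runs: the function names, the thresholds (5% packet loss, 250 ms median latency), and the `submit` callable that stands in for "use the calculator" are all hypothetical. Latency samples are assumed to come from some external probe, with `None` marking a dropped packet.

```python
import urllib.error
import urllib.request
from typing import Any, Callable, Optional, Sequence


def packet_loss(samples: Sequence[Optional[float]]) -> float:
    """Fraction of probes that got no reply (None marks a dropped packet)."""
    if not samples:
        raise ValueError("need at least one probe sample")
    return sum(1 for s in samples if s is None) / len(samples)


def median_latency_ms(samples: Sequence[Optional[float]]) -> Optional[float]:
    """Median round-trip time over the probes that did get a reply."""
    replies = sorted(s for s in samples if s is not None)
    if not replies:
        return None
    mid = len(replies) // 2
    if len(replies) % 2:
        return replies[mid]
    return (replies[mid - 1] + replies[mid]) / 2


def network_alert(samples: Sequence[Optional[float]],
                  max_loss: float = 0.05,            # assumed threshold
                  max_latency_ms: float = 250.0) -> bool:  # assumed threshold
    """True if packet loss or median latency exceeds its threshold."""
    latency = median_latency_ms(samples)
    return (packet_loss(samples) > max_loss
            or latency is None
            or latency > max_latency_ms)


def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for a GET request to url."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx; the code is still what we want to record.
        return err.code


def http_alert(status: int) -> bool:
    """True for 404s and any other client or server error."""
    return status >= 400


def calculator_alert(submit: Callable[[Any], float],
                     known_input: Any,
                     expected: float,
                     tolerance: float = 1e-6) -> bool:
    """End-to-end check: run a known input through the calculator.

    `submit` is a hypothetical callable that performs one real request
    against the calculator and returns its numeric answer.
    """
    try:
        return abs(submit(known_input) - expected) > tolerance
    except Exception:
        # Any failure to get an answer at all counts as broken.
        return True
```

The end-to-end check is the one that would have caught this outage: the site responded normally after the failover, so only a probe that submits a known input and compares the answer would have noticed that the calculator itself was broken.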