- Zeus-level monitoring of outbound response codes
- our fragile database setup (single writer with cross-colo comms)
- personastatus.org - religious use builds confidence
- our "playbook"
- Scaling work - currently at 2.5M ADU
- Email vendor capacity analysis
- [easy] Instead of 1 master/5 slaves, make one PHX slave the master of the other 2 PHX slaves.
- if we lose the main datacenter, we can switch to PHX without having to manually synchronize & reconfigure PHX.
- [easy] Get Sheeri access & she will tune the boxen this week
- [easy] Add the Nagios monitors the systems team has already written for their MySQL fleet
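The PHX topology change above (one slave relaying to the other two) might be sketched as follows. Hostname and binlog coordinates are placeholders, and the promoted intermediate master would also need `log-bin` and `log-slave-updates` enabled in its my.cnf so it re-logs replicated events:

```sql
-- Run on each of the other two PHX slaves to repoint them at the
-- promoted PHX intermediate master. The hostname and binlog
-- coordinates below are placeholders; use positions captured while
-- replication is stopped and the slaves are in sync.
STOP SLAVE;
CHANGE MASTER TO
  MASTER_HOST     = 'phx-db1.example.com',
  MASTER_LOG_FILE = 'mysql-bin.000001',
  MASTER_LOG_POS  = 4;
START SLAVE;
```

The payoff is the one noted above: on a failover to PHX, the two downstream slaves follow their local master automatically instead of needing manual resync.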
- AWS as third datacenter
- could start with a DB slave, move up to a full stack
- Jared is getting this up to date today.
- Add CPU by adding EC2 instances to the LB pool (either webhead or keysigner instances)
- Need provisioning & monitoring for a box outside our colo; lots of ops unknown-unknowns here
- [easy] S-labs claims (unconfirmed) we could get at most hundreds of thousands of emails/hour through them
- Given 1 day's notice, their server team will prepare our existing account for high throughput by mapping a dozen IPs to it.
- This isn't great news. But they won't charge us extra.
- mmayo opened a ticket to connect me to the Amazon SES team; going to get a second opinion
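To put "hundreds of thousands of emails/hour" in perspective against our 2.5M ADU, a quick back-of-envelope check (the 300k/hour rate is an assumed midpoint standing in for the vendor's vague claim, not a quoted number):

```python
# Rough capacity check: how long to email every active daily user?
# ADU comes from the scaling notes above; the hourly rate is an
# assumption standing in for "hundreds of thousands of emails/hour".
adu = 2_500_000
emails_per_hour = 300_000  # assumed; not a confirmed vendor number

hours_needed = adu / emails_per_hour
print(f"~{hours_needed:.1f} hours to reach all ADU")
```

Even at the optimistic end of "hundreds of thousands," a full-ADU send runs the better part of a working day, which is why the SES second opinion is worth getting.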
- Zeus outage prevention/mitigation
- Do we have redundant LBs? Cost to add?
- Steps to take if Zeus goes down? Monitoring in place?
- If we want near-realtime metrics, we could consider taking one slave out of rotation for use as a reporting machine during our spike
- AWS as third datacenter for DB, compute, or full backup stack
- This is a great failover option if we have the time to build it.
- Possibly start with DB slaves & build towards a full stack
- Playbook
- Pre-existing diagnostic advice for individual alerts here: https://intranet.mozilla.org/Services/Ops/BrowserID/Alerts