- Zeus-level monitoring of outbound response codes
- our fragile database setup (single writer with cross-colo comms)
- personastatus.org - religious use builds confidence
- our "playbook"
- Scaling work - currently at 2.5M ADU
- Email vendor capacity analysis
- [easy] Instead of 1 master/5 slaves, make one PHX slave the master of the other 2 PHX slaves.
- if we lose the main datacenter, we can switch to PHX without having to manually synchronize & reconfigure PHX.
- [easy] Get Sheeri access & she will tune the boxen this week
- [easy] Add the Nagios monitors the systems team has already written for their MySQL fleet
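The PHX topology change above (one slave relaying to the other two) might be sketched as follows. Hostname and binlog coordinates are placeholders, and the promoted intermediate master would also need `log-bin` and `log-slave-updates` enabled in its my.cnf so it re-logs replicated events:

```sql
-- Run on each of the other two PHX slaves to repoint them at the
-- promoted PHX intermediate master. The hostname and binlog
-- coordinates below are placeholders; use positions captured while
-- replication is stopped and the slaves are in sync.
STOP SLAVE;
CHANGE MASTER TO
  MASTER_HOST     = 'phx-db1.example.com',
  MASTER_LOG_FILE = 'mysql-bin.000001',
  MASTER_LOG_POS  = 4;
START SLAVE;
```

The payoff is the one noted above: on a failover to PHX, the two downstream slaves follow their local master automatically instead of needing manual resync.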
- AWS as third datacenter
- could start with a DB slave, move up to a full stack
- Jared is getting this up to date today.
- Add CPU by adding EC2 instances to the LB pool (either webhead or keysigner instances)
- Need provisioning & monitoring for a box outside our colo; lots of ops unknown-unknowns here
- [easy] S-labs claims (unconfirmed) we could get at most hundreds of thousands of emails/hour through them
- Given 1 day's notice, their server team will prepare our existing account for high throughput by mapping a dozen IPs to it.
- This isn't great news. But they won't charge us extra.
- mmayo opened a ticket to connect me to the Amazon SES team; going to get a second opinion
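To put "hundreds of thousands of emails/hour" in perspective against our 2.5M ADU, a quick back-of-envelope check (the 300k/hour rate is an assumed midpoint standing in for the vendor's vague claim, not a quoted number):

```python
# Rough capacity check: how long to email every active daily user?
# ADU comes from the scaling notes above; the hourly rate is an
# assumption standing in for "hundreds of thousands of emails/hour".
adu = 2_500_000
emails_per_hour = 300_000  # assumed; not a confirmed vendor number

hours_needed = adu / emails_per_hour
print(f"~{hours_needed:.1f} hours to reach all ADU")
```

Even at the optimistic end of "hundreds of thousands," a full-ADU send runs the better part of a working day, which is why the SES second opinion is worth getting.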
- Zeus outage prevention/mitigation
- Do we have redundant LBs? Cost to add?
- Steps to take if Zeus goes down? Monitoring in place?
- If we want near-realtime metrics, we could consider taking one slave out of rotation for use as a reporting machine during our spike
- AWS as third datacenter for DB, compute, or full backup stack
- This is a great failover option if we have the time to build it.
- Possibly start with DB slaves & build towards a full stack
- Playbook
- Pre-existing diagnostic advice for individual alerts here: https://intranet.mozilla.org/Services/Ops/BrowserID/Alerts