@jaredhirsch
Created December 10, 2012 04:55
stumbling towards high availability

High-level items from the first meeting:

  1. Zeus-level monitoring of outbound response codes
  2. our fragile database setup (single writer with cross-colo comms)
  3. personastatus.org - using it religiously builds confidence
  4. our "playbook"
  5. Scaling work - currently at 2.5M ADU
  6. Email vendor capacity analysis
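
For item 1, a minimal sketch of what monitoring outbound response codes could look like once the codes are collected from the load balancer. How Zeus exposes them is not covered in these notes, so the input is just a list of integers; the function names and the 1% threshold are illustrative assumptions.

```python
from collections import Counter


def error_rate(status_codes):
    """Return the fraction of responses that are 5xx server errors."""
    # Bucket each status code by its class (2 for 2xx, 5 for 5xx, ...).
    classes = Counter(code // 100 for code in status_codes)
    total = sum(classes.values())
    return classes[5] / total if total else 0.0


def should_alert(status_codes, threshold=0.01):
    """True when the 5xx share exceeds the (hypothetical) alert threshold."""
    return error_rate(status_codes) > threshold


if __name__ == "__main__":
    window = [200] * 98 + [503] * 2   # one sampling window of outbound codes
    print(error_rate(window), should_alert(window))
```

Alerting on the share of 5xx responses rather than a raw count keeps the check meaningful as traffic grows during a spike.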

2. Database:

  • (easy) Instead of 1 master/5 slaves, make one slave in PHX the master of the other 2 PHX slaves.
    • If we lose the main datacenter, we can switch to PHX without having to manually synchronize & reconfigure PHX.
  • (easy) Get Sheeri access; she will tune the boxen this week.
  • (easy) Add the Nagios monitors that the systems team has already written for their MySQL fleet.
  • AWS as third datacenter
    • could start with a DB slave, move up to a full stack
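
The topology change above can be sketched as data, to show why chaining the PHX slaves helps. The host names are hypothetical, and the split of the five slaves (two in the main datacenter, three in PHX) is an assumption based on "other 2 PHX slaves". The sketch only counts which surviving hosts would need to be repointed after losing a site; real failover also has to deal with replication positions.

```python
# Replication source of each host (None marks the single writer).
# Host names and the 2/3 split between sites are assumptions.
CURRENT = {
    "main-master": None,
    "main-slave1": "main-master",
    "main-slave2": "main-master",
    "phx-slave1": "main-master",
    "phx-slave2": "main-master",
    "phx-slave3": "main-master",
}

# Proposed: phx-slave1 becomes the master of the other two PHX slaves.
PROPOSED = dict(CURRENT, **{"phx-slave2": "phx-slave1", "phx-slave3": "phx-slave1"})


def site(host):
    """Site prefix of a host name, e.g. 'phx' for 'phx-slave1'."""
    return host.split("-")[0]


def promote_after_loss(topology, lost_site):
    """Return (new_master, hosts_to_repoint) after losing a whole site."""
    survivors = {h: src for h, src in topology.items() if site(h) != lost_site}
    if not survivors:
        raise ValueError("no hosts survive the loss of " + lost_site)
    # Survivors whose replication source disappeared with the lost site.
    orphans = sorted(h for h, src in survivors.items()
                     if src is not None and src not in survivors)
    masters = [h for h, src in survivors.items() if src is None]
    if masters:
        return masters[0], orphans
    # The writer is gone: promote one orphan, repoint the rest to it.
    return orphans[0], orphans[1:]


if __name__ == "__main__":
    print(promote_after_loss(CURRENT, "main"))
    print(promote_after_loss(PROPOSED, "main"))
```

Under these assumptions, losing the main datacenter with the current flat layout leaves two PHX slaves to repoint by hand, while the proposed layout leaves none.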

3. Personastatus.org

  • Jared is getting this up to date today.

5. Scaling-related:

  • Add CPU by adding EC2 instances to the LB pool, either webhead or keysigner instances.
    • Need provisioning and monitoring for a box outside our colo. Lots of ops unknown unknowns here.

6. Email-related:

  • (easy) S-labs claims (?) we could send at most hundreds of thousands of emails/hour through them.
    • Given 1 day notice, their server team will prepare our existing account for high throughput by mapping a dozen IPs to it.
    • This isn't great news. But they won't charge us extra.
  • mmayo opened a ticket to connect me to the Amazon SES team; going to get a second opinion.
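
A back-of-the-envelope sketch for the capacity question. The 300,000 emails/hour rate is only a placeholder for "hundreds of thousands", and sending one email per active daily user (2.5M) is an assumption used to make the arithmetic concrete, not a measured requirement.

```python
import math


def hours_to_send(total_emails, emails_per_hour):
    """Hours needed to drain a batch at a fixed vendor throughput."""
    if emails_per_hour <= 0:
        raise ValueError("emails_per_hour must be positive")
    return total_emails / emails_per_hour


def required_rate(total_emails, window_hours):
    """Emails/hour needed to finish a batch within the given window."""
    if window_hours <= 0:
        raise ValueError("window_hours must be positive")
    return math.ceil(total_emails / window_hours)


if __name__ == "__main__":
    # Placeholder inputs: 2.5M emails at an assumed 300,000 emails/hour.
    print(round(hours_to_send(2_500_000, 300_000), 1), "hours")
    print(required_rate(2_500_000, 1), "emails/hour to finish in one hour")
```

Running the same numbers against whatever Amazon SES quotes would give a like-for-like comparison for the second opinion.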

New questions:

  • Zeus outage prevention/mitigation
    • Do we have redundant LBs? Cost to add?
    • Steps to take if Zeus goes down? Monitoring in place?
  • If we want near-realtime metrics, we could consider taking one slave out of rotation for use as a reporting machine during our spike
  • AWS as third datacenter for DB, compute, or full backup stack
    • This is a great failover option, if we have the time to build it.
    • Possibly start with DB slaves & build towards a full stack
  • Playbook