@djmitche
Last active August 29, 2015 14:11
Releng Web Cluster in AWS

Problem Summary

Here's what is currently running on the releng web cluster and needs to be migrated:

Connecting to LDAP

  • if there's a route back to internal LDAP (via VPC/IPSec), we can use that
  • there's a public ldap.mozilla.org:636? (IP whitelisted, but that's problematic with AWS)
  • and a second port without the IP whitelist, but with client cert auth -- hard with Apache, but functional otherwise
  • Q1/Q2 IT will be overhauling identity stuff; may use Amazon AD or set up nodes in Amazon

Disabling

  • just run a crontask against the LDAP server

MozLDAP

  • not going to be deployed

Okta

  • Does SAML
  • There's an Apache module (mod_saml) for it, maybe?
  • Jabba can set up adding "box" (an endpoint) to Okta (and is tech contact in general)
  • Pay per user - talk to dtorre re allowing non-employees - seems OK
  • jbraddock is project lead

Overall plan:

Architecture

  • General

    • Everything is within the Releng VPC; this gets us access to the hosts via "normal" methods (admin hosts, VPN) and a flat IP space
    • Prefer many, smaller pieces (stacks) over one "monolithic" stack
  • S3

    • each bucket syncs from one region to Glacier in the other
    • assume proxxy eliminates the need to have this per-region
    • use SSL to access all of these, using https://bucketname.s3.amazonaws.com
    • buckets:
      • mockbuild-repos (exists)
      • tooltool
      • pypi
      • runtime-binaries (hopefully without tooltool - bug 882712)
      • talos-bundles
  • Stacks

    • All deployed with CloudFormation and OpsWorks
    • Because DBs are shared, serve them from a dedicated stack (how can this be resilient to a failed region?)
    • Stacks with a '*.pvt.build.mozilla.org' equivalent will have an internal ELB located in the buildbot VPCs with a proper SG
    • Stacks:
      • RelengAPI
        • One App per relengapi component
        • Frontend Layer
        • Celery Layer
      • Crontasks (to run all of the old-school crontasks)
        • One App per crontask suite:
          • Jacuzzi Allocator
          • Builddata
          • Etc.
      • Slavealloc
        • Single App
        • Frontend Layer
      • BuildAPI
        • Single App
        • Frontend Layer
        • Memcached Layer
        • Uses SQS
        • Note: talks to buildbot, buildbot_scheduler DBs in scl3
      • Tooltool Uploads
        • Single App
        • Frontend Layer
        • Worker Layer (validate, copy to S3)
        • Uses SQS
      • Static (redirects / proxies https://secure.pub.build.mozilla.org)
        • Single App
        • Frontend Layer
      • DB
        • DBs: buildapi, relengapi, clobberer, slavealloc, mobile-imaging
        • Implement as different DBs on the same instances, but in such a way that we can easily split them out later
        • Extra -- timed? -- layer to do snapshots to S3
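The "same instances, but easily split later" design for the DB stack could be sketched as follows. This is a sketch only, assuming MySQL and using the database names from the list above (with mobile-imaging's hyphen swapped for an underscore, since unquoted MySQL identifiers can't contain hyphens); the function and password placeholder are ours.

```python
# Sketch: give each service its own database and its own credentials on
# the shared instance, so moving one database to a dedicated instance
# later only requires a DSN change in the application.
SERVICES = ["buildapi", "relengapi", "clobberer", "slavealloc", "mobile_imaging"]

def provisioning_sql(service, password_placeholder="{password}"):
    """Return the statements that isolate one service on the shared host."""
    return [
        f"CREATE DATABASE IF NOT EXISTS {service};",
        f"CREATE USER '{service}'@'%' IDENTIFIED BY '{password_placeholder}';",
        f"GRANT ALL PRIVILEGES ON {service}.* TO '{service}'@'%';",
    ]
```

Because no application is granted privileges outside its own database, splitting a database out later is invisible to the other services.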

Random Points

  • what was one "cluster" in scl3 will be several services in AWS
    • dynamic stuff -> OpsWorks stacks
    • pile-of-files stuff -> S3 buckets
    • totally independent stuff: separate project entirely
  • simplify pile-of-files stuff by assuming proxxy will limit inter-region traffic
  • get rid of secure.pub as much as possible; replace with specific sites and certs

Resources Required

(work in progress)

  • AWS::ElastiCache::CacheCluster

    • buildapi -- backend for buildapi
  • AWS::SQS::Queue

    • buildapi-request, buildapi-response -- channel for buildapi-to-master communication
      • +AWS::SQS::QueuePolicy
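A minimal sketch of that resource list in CloudFormation template form, built as a Python dict so the JSON can be inspected before it becomes a real template. The logical resource names, cache node settings, and empty policy statement are assumptions, not decisions.

```python
import json

# Sketch of the buildapi resources above as CloudFormation JSON.
resources = {
    "BuildapiCache": {
        "Type": "AWS::ElastiCache::CacheCluster",
        "Properties": {
            "Engine": "memcached",          # matches the memcached layer
            "CacheNodeType": "cache.t2.micro",  # placeholder sizing
            "NumCacheNodes": 1,
        },
    },
    "BuildapiRequestQueue": {"Type": "AWS::SQS::Queue"},
    "BuildapiResponseQueue": {"Type": "AWS::SQS::Queue"},
    "BuildapiQueuePolicy": {
        "Type": "AWS::SQS::QueuePolicy",
        "Properties": {
            "Queues": [{"Ref": "BuildapiRequestQueue"},
                       {"Ref": "BuildapiResponseQueue"}],
            # statements granting the buildbot masters access go here
            "PolicyDocument": {"Version": "2012-10-17", "Statement": []},
        },
    },
}

template = {"AWSTemplateFormatVersion": "2010-09-09", "Resources": resources}
print(json.dumps(template, indent=2))
```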

SSL

We want SSL for everything, but for S3 buckets that requires using the *.s3.amazonaws.com hostname - is that OK? Does that work with region failover (where the URL will change)? That limits us to buckets without dots -- is that OK?
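The no-dots constraint comes from the *.s3.amazonaws.com wildcard certificate only matching a single DNS label: a dot in the bucket name creates extra labels the wildcard won't cover. A quick sketch of a check for it (the naming rules encoded here are the basic S3 ones; the function name is ours):

```python
import re

def ssl_safe_bucket_name(name):
    """True if https://<name>.s3.amazonaws.com validates against *.s3.amazonaws.com."""
    if "." in name:
        # dots create additional DNS labels the wildcard cert won't match
        return False
    # lowercase letters, digits, hyphens; 3-63 chars; no leading/trailing hyphen
    return len(name) >= 3 and \
        re.fullmatch(r"[a-z0-9](?:[a-z0-9-]{1,61}[a-z0-9])?", name) is not None
```

All of the bucket names listed above (mockbuild-repos, tooltool, pypi, runtime-binaries, talos-bundles) pass this check.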

Failover

When we fail from region to region, we need to be sure we don't run into instance starvation in the new region. Do reserved instances help with this?

Could we incorporate some kind of inter-region proxy so that when region A becomes the standby region, its endpoints keep working but proxying to region B?

DB Changes

None of the deployment tools seem to handle this. How is it typically handled?

Authentication

We have two requirements:

  • simple "yes/no" authentication for most stuff
  • RelengAPI needs to integrate group membership, disabled accounts, etc.

We may be able to use Okta for both, but I'm verifying that.

  • How can we authenticate private pile-of-files stuff like tooltool and talos-bundles? Can tooltool grow the ability to do AWS auth? Or can we just make tooltool public?

  • How can we provide access to stuff that's typically been authenticated merely by IP address (*.pvt.build.mozilla.org)?
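One possible answer for the tooltool question is short-lived signed URLs, which is roughly what S3's presigned URLs provide (and boto can generate the real thing). A toy sketch of the idea using a generic HMAC token -- this is not the actual S3 signing algorithm, and the secret and paths are placeholders:

```python
import hashlib
import hmac
import time

SECRET = b"example-shared-secret"  # placeholder, never hard-code for real

def sign_url(path, expires_at, secret=SECRET):
    """Attach an expiry and an HMAC over (path, expiry) to a URL path."""
    msg = f"{path}:{expires_at}".encode()
    sig = hmac.new(secret, msg, hashlib.sha256).hexdigest()
    return f"{path}?expires={expires_at}&sig={sig}"

def verify(path, expires_at, sig, secret=SECRET, now=None):
    """Reject expired or tampered URLs; constant-time signature compare."""
    now = time.time() if now is None else now
    if now > expires_at:
        return False
    expected = hmac.new(secret, f"{path}:{expires_at}".encode(),
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, sig)
```

A tooltool client could then fetch the signed URL from an authenticated endpoint and download the blob directly from S3.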

Unsolved

  • How can we implement the tooltool uploader? Maybe not by rsync? HTTP App? Who will write that?
  • How do we manage configuration on the hosts, especially the bastion host (which needs its SSH keys managed)
    • Hosts speak directly to puppet?
    • AMI generation from puppet?
    • Something else?

We can use S3 buckets for redirects, but redirects are configured per bucket, not per object.

Disaster Recovery

S3 Bucket DR

  • there's a "cross-region copy" operation
  • not sure if AWS can schedule those, but we can do it in a periodic task of some sort
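A sketch of what that periodic task might do: compare listings of the two buckets and issue a server-side cross-region copy for anything missing or changed. The planning logic is shown; the actual boto call is not, and bucket names in the usage are illustrative.

```python
# Sketch: plan which objects need a cross-region copy, given listings
# of source and destination keyed by ETag.
def plan_copies(src_listing, dst_listing):
    """src_listing/dst_listing map key -> etag; return keys needing a copy."""
    return sorted(
        key for key, etag in src_listing.items()
        if dst_listing.get(key) != etag
    )

def copy_params(src_bucket, dst_bucket, key):
    """Shape of an S3 server-side copy request (no download/re-upload)."""
    return {
        "Bucket": dst_bucket,
        "Key": key,
        "CopySource": {"Bucket": src_bucket, "Key": key},
    }
```

The cron task would call plan_copies on fresh listings and feed each result through copy_params to the S3 API.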

Service Discovery and Failover

Think of regions as entirely separate, and use Route 53 to find the endpoint in the right region for public and private access. Consider delegating subdomains of mozilla.org to Route 53.

Within a region, use ELBs as service endpoints, with either a fixed IP or Amazon's dynamic DNS names.
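On the Route 53 side, the per-region endpoint lookup could use failover record sets: a PRIMARY/SECONDARY pair pointing at each region's ELB. A sketch of the change batch (the hostname and ELB names in the usage are made up, and a real PRIMARY record would also carry a HealthCheckId, omitted here):

```python
# Sketch: build a Route 53 ChangeResourceRecordSets change batch for a
# PRIMARY/SECONDARY failover pair.
def failover_change_batch(name, primary_elb, secondary_elb, ttl=60):
    def record(role, target):
        return {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": name,
                "Type": "CNAME",
                "TTL": ttl,
                "SetIdentifier": role.lower(),
                "Failover": role,  # "PRIMARY" or "SECONDARY"
                "ResourceRecords": [{"Value": target}],
            },
        }
    return {"Changes": [record("PRIMARY", primary_elb),
                        record("SECONDARY", secondary_elb)]}
```

With a health check attached to the primary, Route 53 answers with the secondary region's ELB automatically when the primary fails its checks.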

Best practice is not to automatically spin up an entire region - region failover should be human-triggered (but automated)

It's OK for a recovery (but not operational) dependency to be geographically non-redundant. In particular, as long as the git repositories that may be required for recovery are not in the same geographic area that failed, we're OK. Also, mirroring git repositories would be pretty easy now, and may be much easier with the new AWS CodeCommit.

Deployment

  • CloudFormation + CodeDeploy is most versatile
    • CF for resources
      • nice separation of cfg from parameters
      • not much UI, but probably want to do some kind of automation, and it's built for that
    • CD does server config after that (details hazy)

Best Practice: Keep VPCs and subnets and sec groups in a separate template

  • they don't change too often
  • they're cheap/free, so pre-allocate them in each region
  • reference with ARNs from the other templates (apparently there's a way to pick an ARN by region?)
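The "pick by region" mechanism referred to above is CloudFormation's Mappings section plus Fn::FindInMap. A pure-Python mirror of that lookup, with placeholder IDs standing in for the pre-allocated VPC/SG resources:

```python
# Sketch: CloudFormation Mappings + Fn::FindInMap, as plain Python.
# The IDs are placeholders for the pre-allocated per-region resources.
MAPPINGS = {
    "NetworkIds": {
        "us-east-1": {"VpcId": "vpc-aaaa1111", "WebSG": "sg-aaaa1111"},
        "us-west-2": {"VpcId": "vpc-bbbb2222", "WebSG": "sg-bbbb2222"},
    }
}

def find_in_map(mappings, map_name, top_key, second_key):
    """Mirror of Fn::FindInMap: Mappings[map][region][key]."""
    return mappings[map_name][top_key][second_key]
```

In a template the same lookup is written as {"Fn::FindInMap": ["NetworkIds", {"Ref": "AWS::Region"}, "VpcId"]}, so the service templates stay region-agnostic.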

Databases

Richard prefers RDS over MySQL on EC2

  • very mature
  • lots of features
  • cross-region copy (async)
  • can do sync r/o within region (different AZ) with auto-promotion

Aurora is still in preview, but looks much nicer. In particular, it offers better (immediate) failover.

Model

  • Application -- contains
    • Deployment Group -- contains
      • EC2 Instances
    • Deployments -- each has
      • Revision

If we set this up the natural way, with each relengweb application being its own CodeDeploy application, then we need to re-define the deployment group for each application. That kinda sucks, but since it will for the most part just specify autoscale groups, it's not too bad. Note that it appears to detect files from one app overwriting files from another app, although I suspect that's pretty rudimentary.

  • Can deploy either from tarballs in S3 or from github. Deploying from github requires a "one-time" manual connect with github via the UI, which seems buggy.

  • Deployment is just copying files and running scripts. Given that all of the source is in other repos already, we probably want to do deployment of small "revisions" that just reference the software to be installed. In which case, a tarball is just as easy as GitHub and requires less clicking.
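For reference, a CodeDeploy revision is driven by an appspec.yml at the root of the tarball; a minimal sketch of one for a small "revision" like that described above (the install path and script names are hypothetical):

```yaml
version: 0.0
os: linux
files:
  - source: /
    destination: /opt/relengapi      # hypothetical install location
hooks:
  AfterInstall:
    - location: scripts/install.sh   # e.g. pip-install the referenced release
      timeout: 300
  ApplicationStart:
    - location: scripts/start.sh
```

The "copying files and running scripts" model is exactly these two sections: files declares what lands where, and hooks runs scripts at lifecycle events.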

PROBLEM: CodeDeploy apps cannot be configured by CloudFormation. OpsWorks can. Hmm.

I set up an OpsWorks stack (the PHP demo) with CloudFormation.

Not everything in OpsWorks is configurable with CloudFormation, from the simple (stack color) to the more important (IAM roles and SSH keys).

But, OpsWorks has a nice way of associating SSH keys to IAM users and then dynamically adding those keys to instances (or entire stacks) to allow that user to login as a non-shared account. That's pretty cool.

All in all, we'd need to use the OpsWorks UI and API quite a bit, which is probably OK -- that's better than re-implementing a UI, after all.

We'd also need to use all custom layers, and thus a lot of custom chef recipes. That may be less OK - we could probably build a better deployment strategy using our own tooling at about the same cost in code written.

Inter-region Copies

Inter-region copies are supported without download/re-upload, but are per-object (and so synchronizing will be tricky - maybe there's a tool for that already?)

http://docs.aws.amazon.com/cli/latest/reference/s3/sync.html can do s3-to-s3 copies, and appears to use the copy operation! It doesn't say it supports the equivalent of rsync's --delete, though
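(For what it's worth, later awscli releases did add a --delete flag to s3 sync.) Even without it, the pass is easy to layer on top: list both buckets and remove destination-only keys. A sketch of that computation:

```python
# Sketch: the rsync --delete equivalent for an S3-to-S3 sync -- keys
# present at the destination but gone from the source.
def keys_to_delete(src_keys, dst_keys):
    """Destination-only keys that a --delete pass would remove."""
    return sorted(set(dst_keys) - set(src_keys))
```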

Static Sites

A bucket can have attached redirects, with flexible rules for what redirects where

But you can't do this with HTTPS and a mozilla hostname (so, https://secure.pub.build.mozilla.org is out)
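Those attached redirects live in the bucket's website configuration as RoutingRules. A sketch of a configuration with one rule redirecting a key prefix to another host (the hostnames and prefixes are illustrative):

```python
# Sketch: S3 static-website configuration with a prefix-based redirect
# rule, as passed to the put-bucket-website API.
website_config = {
    "IndexDocument": {"Suffix": "index.html"},
    "RoutingRules": [
        {
            "Condition": {"KeyPrefixEquals": "old/"},
            "Redirect": {
                "HostName": "new.example.org",
                "ReplaceKeyPrefixWith": "moved/",
                "HttpRedirectCode": "301",
            },
        }
    ],
}
```

A request for old/foo would then be answered with a 301 to http://new.example.org/moved/foo -- which illustrates the HTTPS limitation above, since the website endpoint itself only speaks HTTP.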

How does Etsy do it? (See https://cloudnative.io/docs/blue-green-deployment/)

Etsy has a thing called "Schema Change Thursday": they perform all database schema changes only once a week, yet deploy to production 50 times per day. How do they do this? A few things:

  • Any schema change must be an addition with a default. No field is removed during a normal migration, and nothing is renamed.
  • Features are deployed "dark", meaning that while the code for the feature may be on production, it is disabled. They use "feature flags" to enable the feature first for internal employees, then 1% of the user base, then 10%, and so on up to 100%.

This is what is considered the Best Practice for deploying new features and handling DB changes.
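The gradual rollout can be implemented with stable hash bucketing, so the same users stay enabled as the percentage grows rather than being re-randomized on each request. A sketch (the function, flag names, and allow-list mechanism are ours, not Etsy's actual implementation):

```python
import hashlib

# Sketch: percentage-based feature flag with a stable per-user bucket
# and an allow-list for the "internal employees first" stage.
def flag_enabled(flag_name, user_id, percent, allow_list=frozenset()):
    if user_id in allow_list:        # e.g. internal employees go first
        return True
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100   # stable 0-99 bucket per (flag, user)
    return bucket < percent
```

Because the bucket depends only on the flag and user, raising percent from 10 to 50 keeps every already-enabled user enabled.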

@djmitche (author):

Need feature flags - they are critical to schema migrations:

  • add new DB tables
  • write to both
  • use a feature flag to switch reads to the new tables
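Those three steps can be sketched as a store that dual-writes while a flag controls which side is read (dicts stand in for the old and new DB tables; names are illustrative):

```python
# Sketch: dual-write migration. Writes always hit both tables; a
# feature flag flips reads from the old table to the new one.
class DualWriteStore:
    def __init__(self, read_from_new=False):
        self.old, self.new = {}, {}
        self.read_from_new = read_from_new  # the feature flag

    def write(self, key, value):
        self.old[key] = value   # step 2: write to both
        self.new[key] = value

    def read(self, key):
        table = self.new if self.read_from_new else self.old
        return table.get(key)   # step 3: the flag flips reads over
```

Once the flag has been at 100% for a while and nothing reads the old table, it can be dropped in a later Schema Change Thursday.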
