@djmitche
Last active August 29, 2015 14:11
Releng Web Cluster in AWS

Problem Summary

Here's what is currently running on the releng web cluster and needs to be migrated:

Connecting to LDAP

  • if there's a route back to internal LDAP (via VPC/IPSec), we can use that
  • there's a public ldap.mozilla.org:636? (IP whitelisted, but that's problematic with AWS)
  • and a second port without the IP whitelist, but with client cert auth -- hard with Apache, but functional otherwise
  • Q1/Q2 IT will be overhauling identity stuff; may use Amazon AD or set up nodes in Amazon

Disabling

  • just run a crontask against the LDAP server

MozLDAP

  • not going to be deployed

Okta

  • Does SAML
  • There's an Apache module (mod_saml) for it, maybe?
  • Jabba can set up adding "box" (an endpoint) to Okta (and is tech contact in general)
  • Pay per user - talk to dtorre re allowing non-employees - seems OK
  • jbraddock is project lead

Overall plan:

Architecture

  • General

    • Everything is within the Releng VPC; this gets us access to the hosts via "normal" methods (admin hosts, VPN) and a flat IP space
    • Prefer many, smaller pieces (stacks) over one "monolithic" stack
  • S3

    • each bucket syncs from one region to Glacier in the other
    • assume proxxy eliminates the need to have this per-region
    • use SSL to access all of these, using https://bucketname.s3.amazonaws.com
    • buckets:
      • mockbuild-repos (exists)
      • tooltool
      • pypi
      • runtime-binaries (hopefully without tooltool - bug 882712)
      • talos-bundles
  • Stacks

    • All deployed with CloudFormation and OpsWorks
    • Because DBs are shared, serve them from a dedicated stack (how can this be resilient to a failed region?)
    • Stacks with a '*.pvt.build.mozilla.org' equivalent will have an internal ELB located in the buildbot VPCs with a proper SG
    • Stacks:
      • RelengAPI
        • One App per relengapi component
        • Frontend Layer
        • Celery Layer
      • Crontasks (to run all of the old-school crontasks)
        • One App per crontask suite:
          • Jacuzzi Allocator
          • Builddata
          • Etc.
      • Slavealloc
        • Single App
        • Frontend Layer
      • BuildAPI
        • Single App
        • Frontend Layer
        • Memcached Layer
        • Uses SQS
        • Note: talks to buildbot, buildbot_scheduler DBs in scl3
      • Tooltool Uploads
        • Single App
        • Frontend Layer
        • Worker Layer (validate, copy to S3)
        • Uses SQS
      • Static (redirects / proxies https://secure.pub.build.mozilla.org)
        • Single App
        • Frontend Layer
      • DB
        • DBs: buildapi, relengapi, clobberer, slavealloc, mobile-imaging
        • Implement as different DBs on the same instances, but in such a way that we can easily split them out later
        • Extra -- timed? -- layer to do snapshots to S3
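The "same instances, but easily split later" design for the DB stack could be sketched as follows. This is a sketch only, assuming MySQL and using the database names from the list above (with mobile-imaging's hyphen swapped for an underscore, since unquoted MySQL identifiers can't contain hyphens); the function and password placeholder are ours.

```python
# Sketch: give each service its own database and its own credentials on
# the shared instance, so moving one database to a dedicated instance
# later only requires a DSN change in the application.
SERVICES = ["buildapi", "relengapi", "clobberer", "slavealloc", "mobile_imaging"]

def provisioning_sql(service, password_placeholder="{password}"):
    """Return the statements that isolate one service on the shared host."""
    return [
        f"CREATE DATABASE IF NOT EXISTS {service};",
        f"CREATE USER '{service}'@'%' IDENTIFIED BY '{password_placeholder}';",
        f"GRANT ALL PRIVILEGES ON {service}.* TO '{service}'@'%';",
    ]
```

Because no application is granted privileges outside its own database, splitting a database out later is invisible to the other services.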

Random Points

  • what was one "cluster" in scl3 will be several services in AWS
    • dynamic stuff -> OpsWorks stacks
    • pile-of-files stuff -> S3 buckets
    • totally independent stuff: separate project entirely
  • simplify pile-of-files stuff by assuming proxxy will limit inter-region traffic
  • get rid of secure.pub as much as possible; replace with specific sites and certs

Resources Required

(work in progress)

  • AWS::ElastiCache::CacheCluster

    • buildapi -- backend for buildapi
  • AWS::SQS::Queue

    • buildapi-request, buildapi-response -- channel for buildapi-to-master communication
      • +AWS::SQS::QueuePolicy
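A minimal sketch of that resource list in CloudFormation template form, built as a Python dict so the JSON can be inspected before it becomes a real template. The logical resource names, cache node settings, and empty policy statement are assumptions, not decisions.

```python
import json

# Sketch of the buildapi resources above as CloudFormation JSON.
resources = {
    "BuildapiCache": {
        "Type": "AWS::ElastiCache::CacheCluster",
        "Properties": {
            "Engine": "memcached",          # matches the memcached layer
            "CacheNodeType": "cache.t2.micro",  # placeholder sizing
            "NumCacheNodes": 1,
        },
    },
    "BuildapiRequestQueue": {"Type": "AWS::SQS::Queue"},
    "BuildapiResponseQueue": {"Type": "AWS::SQS::Queue"},
    "BuildapiQueuePolicy": {
        "Type": "AWS::SQS::QueuePolicy",
        "Properties": {
            "Queues": [{"Ref": "BuildapiRequestQueue"},
                       {"Ref": "BuildapiResponseQueue"}],
            # statements granting the buildbot masters access go here
            "PolicyDocument": {"Version": "2012-10-17", "Statement": []},
        },
    },
}

template = {"AWSTemplateFormatVersion": "2010-09-09", "Resources": resources}
print(json.dumps(template, indent=2))
```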

SSL

We want SSL for everything, but for S3 buckets that requires using the *.s3.amazonaws.com hostname - is that OK? Does that work with region failover (where the URL will change)? That limits us to buckets without dots -- is that OK?
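The no-dots constraint comes from the *.s3.amazonaws.com wildcard certificate only matching a single DNS label: a dot in the bucket name creates extra labels the wildcard won't cover. A quick sketch of a check for it (the naming rules encoded here are the basic S3 ones; the function name is ours):

```python
import re

def ssl_safe_bucket_name(name):
    """True if https://<name>.s3.amazonaws.com validates against *.s3.amazonaws.com."""
    if "." in name:
        # dots create additional DNS labels the wildcard cert won't match
        return False
    # lowercase letters, digits, hyphens; 3-63 chars; no leading/trailing hyphen
    return len(name) >= 3 and \
        re.fullmatch(r"[a-z0-9](?:[a-z0-9-]{1,61}[a-z0-9])?", name) is not None
```

All of the bucket names listed above (mockbuild-repos, tooltool, pypi, runtime-binaries, talos-bundles) pass this check.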

Failover

When we fail from region to region, we need to be sure we don't run into instance starvation in the new region. Do reserved instances help with this?

Could we incorporate some kind of inter-region proxy so that when region A becomes the standby region, its endpoints keep working but proxying to region B?

DB Changes

None of the deployment tools seem to handle this. How is it typically handled?

Authentication

We have two requirements:

  • simple "yes/no" authentication for most stuff
  • RelengAPI needs to integrate group membership, disabled accounts, etc.

We may be able to use Okta for both, but I'm verifying that.

  • How can we authenticate private pile-of-files stuff like tooltool and talos-bundles? Can tooltool grow the ability to do AWS auth? Or can we just make tooltool public?

  • How can we provide access to stuff that's typically been authenticated merely by IP address (*.pvt.build.mozilla.org)?
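One possible answer for the tooltool question is short-lived signed URLs, which is roughly what S3's presigned URLs provide (and boto can generate the real thing). A toy sketch of the idea using a generic HMAC token -- this is not the actual S3 signing algorithm, and the secret and paths are placeholders:

```python
import hashlib
import hmac
import time

SECRET = b"example-shared-secret"  # placeholder, never hard-code for real

def sign_url(path, expires_at, secret=SECRET):
    """Attach an expiry and an HMAC over (path, expiry) to a URL path."""
    msg = f"{path}:{expires_at}".encode()
    sig = hmac.new(secret, msg, hashlib.sha256).hexdigest()
    return f"{path}?expires={expires_at}&sig={sig}"

def verify(path, expires_at, sig, secret=SECRET, now=None):
    """Reject expired or tampered URLs; constant-time signature compare."""
    now = time.time() if now is None else now
    if now > expires_at:
        return False
    expected = hmac.new(secret, f"{path}:{expires_at}".encode(),
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, sig)
```

A tooltool client could then fetch the signed URL from an authenticated endpoint and download the blob directly from S3.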

Unsolved

  • How can we implement the tooltool uploader? Maybe not by rsync? HTTP App? Who will write that?
  • How do we manage configuration on the hosts, especially the bastion host (which needs its SSH keys managed)
    • Hosts speak directly to puppet?
    • AMI generation from puppet?
    • Something else?

We can use S3 buckets for redirects, but redirects are configured per bucket, not per object.

Disaster Recovery

S3 Bucket DR

  • there's a "cross-region copy" operation
  • not sure if AWS can schedule those, but we can do it in a periodic task of some sort
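A sketch of what that periodic task might do: compare listings of the two buckets and issue a server-side cross-region copy for anything missing or changed. The planning logic is shown; the actual boto call is not, and bucket names in the usage are illustrative.

```python
# Sketch: plan which objects need a cross-region copy, given listings
# of source and destination keyed by ETag.
def plan_copies(src_listing, dst_listing):
    """src_listing/dst_listing map key -> etag; return keys needing a copy."""
    return sorted(
        key for key, etag in src_listing.items()
        if dst_listing.get(key) != etag
    )

def copy_params(src_bucket, dst_bucket, key):
    """Shape of an S3 server-side copy request (no download/re-upload)."""
    return {
        "Bucket": dst_bucket,
        "Key": key,
        "CopySource": {"Bucket": src_bucket, "Key": key},
    }
```

The cron task would call plan_copies on fresh listings and feed each result through copy_params to the S3 API.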

Service Discovery and Failover

Think of regions as entirely separate, and use Route 53 to find the endpoint in the right region for public and private access. Consider delegating subdomains of mozilla.org to Route 53.

Within a region, use ELBs as service endpoints, with either a fixed IP or Amazon's dynamic DNS names.
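On the Route 53 side, the per-region endpoint lookup could use failover record sets: a PRIMARY/SECONDARY pair pointing at each region's ELB. A sketch of the change batch (the hostname and ELB names in the usage are made up, and a real PRIMARY record would also carry a HealthCheckId, omitted here):

```python
# Sketch: build a Route 53 ChangeResourceRecordSets change batch for a
# PRIMARY/SECONDARY failover pair.
def failover_change_batch(name, primary_elb, secondary_elb, ttl=60):
    def record(role, target):
        return {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": name,
                "Type": "CNAME",
                "TTL": ttl,
                "SetIdentifier": role.lower(),
                "Failover": role,  # "PRIMARY" or "SECONDARY"
                "ResourceRecords": [{"Value": target}],
            },
        }
    return {"Changes": [record("PRIMARY", primary_elb),
                        record("SECONDARY", secondary_elb)]}
```

With a health check attached to the primary, Route 53 answers with the secondary region's ELB automatically when the primary fails its checks.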

Best practice is not to automatically spin up an entire region - region failover should be human-triggered (but automated)

It's OK for a recovery (but not operational) dependency to be geographically non-redundant. In particular, as long as the git repositories that may be required for recovery are not in the same geographic area that failed, we're OK. Also, mirroring git repositories would be pretty easy now, and may be much easier with the new AWS CodeCommit.

Deployment

  • CloudFormation + CodeDeploy is most versatile
    • CF for resources
      • nice separation of cfg from parameters
      • not much UI, but probably want to do some kind of automation, and it's built for that
    • CD does server config after that (details hazy)

Best Practice: Keep VPCs and subnets and sec groups in a separate template

  • they don't change too often
  • they're cheap/free, so pre-allocate them in each region
  • reference with ARNs from the other templates (apparently there's a way to pick an ARN by region?)
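The "pick by region" mechanism referred to above is CloudFormation's Mappings section plus Fn::FindInMap. A pure-Python mirror of that lookup, with placeholder IDs standing in for the pre-allocated VPC/SG resources:

```python
# Sketch: CloudFormation Mappings + Fn::FindInMap, as plain Python.
# The IDs are placeholders for the pre-allocated per-region resources.
MAPPINGS = {
    "NetworkIds": {
        "us-east-1": {"VpcId": "vpc-aaaa1111", "WebSG": "sg-aaaa1111"},
        "us-west-2": {"VpcId": "vpc-bbbb2222", "WebSG": "sg-bbbb2222"},
    }
}

def find_in_map(mappings, map_name, top_key, second_key):
    """Mirror of Fn::FindInMap: Mappings[map][region][key]."""
    return mappings[map_name][top_key][second_key]
```

In a template the same lookup is written as {"Fn::FindInMap": ["NetworkIds", {"Ref": "AWS::Region"}, "VpcId"]}, so the service templates stay region-agnostic.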

Databases

Richard prefers RDS over MySQL on EC2

  • very mature
  • lots of features
  • cross-region copy (async)
  • can do sync r/o within region (different AZ) with auto-promotion

Aurora is still in preview, but looks much nicer. In particular, it offers better (immediate) failover.

Model

  • Application -- contains
    • Deployment Group -- contains
      • EC2 Instances
    • Deployments -- each has
      • Revision

If we set this up the natural way, with each relengweb application being its own CodeDeploy application, then we need to re-define the deployment group for each application. That kinda sucks, but since it will for the most part just specify autoscale groups, it's not too bad. Note that it appears to detect files from one app overwriting files from another app, although I suspect that's pretty rudimentary.

  • Can deploy either from tarballs in S3 or from github. Deploying from github requires a "one-time" manual connect with github via the UI, which seems buggy.

  • Deployment is just copying files and running scripts. Given that all of the source is in other repos already, we probably want to do deployment of small "revisions" that just reference the software to be installed. In which case, a tarball is just as easy as GitHub and requires less clicking.
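For reference, a CodeDeploy revision is driven by an appspec.yml at the root of the tarball; a minimal sketch of one for a small "revision" like that described above (the install path and script names are hypothetical):

```yaml
version: 0.0
os: linux
files:
  - source: /
    destination: /opt/relengapi      # hypothetical install location
hooks:
  AfterInstall:
    - location: scripts/install.sh   # e.g. pip-install the referenced release
      timeout: 300
  ApplicationStart:
    - location: scripts/start.sh
```

The "copying files and running scripts" model is exactly these two sections: files declares what lands where, and hooks runs scripts at lifecycle events.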

PROBLEM: CodeDeploy apps cannot be configured by CloudFormation. OpsWorks can. Hmm.

I set up an OpsWorks stack (the PHP demo) with CloudFormation.

Not everything in OpsWorks is configurable with CloudFormation, from the simple (stack color) to the more important (IAM roles and SSH keys).

But, OpsWorks has a nice way of associating SSH keys to IAM users and then dynamically adding those keys to instances (or entire stacks) to allow that user to login as a non-shared account. That's pretty cool.

All in all, we'd need to use the OpsWorks UI and API quite a bit, which is probably OK -- that's better than re-implementing a UI, after all.

We'd also need to use all custom layers, and thus a lot of custom chef recipes. That may be less OK - we could probably build a better deployment strategy using our own tooling at about the same cost in code written.

Inter-region Copies

Inter-region copies are supported without download/re-upload, but are per-object (and so synchronizing will be tricky - maybe there's a tool for that already?)

http://docs.aws.amazon.com/cli/latest/reference/s3/sync.html can do s3-to-s3 copies, and appears to use the copy operation! It doesn't say it supports the equivalent of rsync's --delete, though
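(For what it's worth, later awscli releases did add a --delete flag to s3 sync.) Even without it, the pass is easy to layer on top: list both buckets and remove destination-only keys. A sketch of that computation:

```python
# Sketch: the rsync --delete equivalent for an S3-to-S3 sync -- keys
# present at the destination but gone from the source.
def keys_to_delete(src_keys, dst_keys):
    """Destination-only keys that a --delete pass would remove."""
    return sorted(set(dst_keys) - set(src_keys))
```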

Static Sites

A bucket can have attached redirects, with flexible rules for what redirects where

But you can't do this with HTTPS and a mozilla hostname (so, https://secure.pub.build.mozilla.org is out)
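Those attached redirects live in the bucket's website configuration as RoutingRules. A sketch of a configuration with one rule redirecting a key prefix to another host (the hostnames and prefixes are illustrative):

```python
# Sketch: S3 static-website configuration with a prefix-based redirect
# rule, as passed to the put-bucket-website API.
website_config = {
    "IndexDocument": {"Suffix": "index.html"},
    "RoutingRules": [
        {
            "Condition": {"KeyPrefixEquals": "old/"},
            "Redirect": {
                "HostName": "new.example.org",
                "ReplaceKeyPrefixWith": "moved/",
                "HttpRedirectCode": "301",
            },
        }
    ],
}
```

A request for old/foo would then be answered with a 301 to http://new.example.org/moved/foo -- which illustrates the HTTPS limitation above, since the website endpoint itself only speaks HTTP.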

How does Etsy do it? (See https://cloudnative.io/docs/blue-green-deployment/)

Etsy has a thing called "Schema Change Thursday": they perform all database schema changes only once a week, yet deploy to production 50 times per day. How do they do this? A few things:

  • Any schema change must be an addition with a default. No field is removed during a normal migration, and nothing is renamed.
  • Features are deployed "dark", meaning that while the code for the feature may be on production, it is disabled. They use "feature flags" to enable the feature first for internal employees, then 1% of the user base, then 10%, and so on up to 100%.

This is what is considered the Best Practice for deploying new features and handling DB changes.
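The gradual rollout can be implemented with stable hash bucketing, so the same users stay enabled as the percentage grows rather than being re-randomized on each request. A sketch (the function, flag names, and allow-list mechanism are ours, not Etsy's actual implementation):

```python
import hashlib

# Sketch: percentage-based feature flag with a stable per-user bucket
# and an allow-list for the "internal employees first" stage.
def flag_enabled(flag_name, user_id, percent, allow_list=frozenset()):
    if user_id in allow_list:        # e.g. internal employees go first
        return True
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100   # stable 0-99 bucket per (flag, user)
    return bucket < percent
```

Because the bucket depends only on the flag and user, raising percent from 10 to 50 keeps every already-enabled user enabled.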

@djmitche (author):

Need feature flags - they are critical to schema migrations:

  • add new DB tables
  • write to both
  • use a feature flag to switch reads to the new tables
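Those three steps can be sketched as a store that dual-writes while a flag controls which side is read (dicts stand in for the old and new DB tables; names are illustrative):

```python
# Sketch: dual-write migration. Writes always hit both tables; a
# feature flag flips reads from the old table to the new one.
class DualWriteStore:
    def __init__(self, read_from_new=False):
        self.old, self.new = {}, {}
        self.read_from_new = read_from_new  # the feature flag

    def write(self, key, value):
        self.old[key] = value   # step 2: write to both
        self.new[key] = value

    def read(self, key):
        table = self.new if self.read_from_new else self.old
        return table.get(key)   # step 3: the flag flips reads over
```

Once the flag has been at 100% for a while and nothing reads the old table, it can be dropped in a later Schema Change Thursday.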
