jwhitlock/mdn-cloud-infra-2015.txt

## mdn-cloud-infra-2015.txt
Very first discussion about potentially moving MDN to Cloud Services infrastructure in 2015

Context via cyliang:

"I'm part of the IT WebOps team, which is separate from the infra team spearheaded by Corey (cshields).  jthomas and oremj are part of a completely different operations team (Service Operations is part of Cloud Services) and reports to a different VP.  Right now, MDN is part of Cloud Services, but their infrastructure and operations is currently managed by the IT WebOps group."

Feb 4

Repository + stable parameter
a tag, commit, or branch that should go to prod

 * What is the value of Continuous Delivery?
 * What is the risk of something breaking?
 * Decisions based on the above ...
   * Automated QA tests should block production pushes
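
One possible reading of the "tests should block production pushes" decision, as a minimal sketch only. The `QAResult` structure and the `can_push_to_production` name are hypothetical and were not discussed in the meeting; the assumption is that a push is allowed only when at least one suite ran and all suites passed.

```python
from dataclasses import dataclass


@dataclass
class QAResult:
    """Outcome of one automated QA suite (hypothetical structure)."""
    suite: str
    passed: bool


def can_push_to_production(results):
    """Allow a push only if at least one suite ran and every suite passed."""
    return bool(results) and all(result.passed for result in results)
```

Treating "no results" as a block is a design choice here: a missing test run is handled the same way as a failing one.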

Config -> environment variables
Apache -> gunicorn & nginx
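
A rough sketch of what "Config -> environment variables" could look like in a Django-style settings module. The helper names (`env`, `env_bool`), the variable names, and the default values are all assumptions for illustration; none of them come from the MDN codebase.

```python
import os


def env(name, default=None, cast=str):
    """Read a setting from the environment; fail loudly if it is required."""
    raw = os.environ.get(name)
    if raw is None:
        if default is None:
            raise KeyError(f"missing required environment variable: {name}")
        return default
    return cast(raw)


def env_bool(name, default=False):
    """Interpret common truthy strings ("1", "true", "yes", "on") as True."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


# Hypothetical setting names and defaults, for illustration only.
DEBUG = env_bool("DEBUG", default=False)
DATABASE_URL = env("DATABASE_URL", default="mysql://localhost/example")
CACHE_URL = env("CACHE_URL", default="redis://localhost:6379/0")
```

With this shape, the same code can run in Dev, QA, and Production, with only the environment differing between them.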

Self-service PaaS?
Deis: probably never run
AWS Elastic Beanstalk?
AWS Lambda?
AWS ECS?

ACTION:
 * Luke email mdn-dev re: values & risks of Continuous Delivery
 * Luke email Stuart re: Intern testing

Jan 14
Cleaning up backend work; affecting deployments; want to make everyone aware of ideas for future MDN platform and how we'd like it to run on AWS

jezdez:
Mozilla's deployments in IT/WebOps have traditionally been clusters of web servers, etc.

The current deployment process, built around chief, is more of a MacGyver/band-aid solution

Big fan of stateless web apps; independent components & resources; the 12-factor app (http://12factor.net/) and Heroku; biggest advantage: clear separation of concerns for better maintainability

Technically, would entail many refactorings in both code & deployment

Maris & Jannis: we are at the decision point now before we go down any given technology path

Travis:

Differences
 * How much control over the environment each person has

Use Jenkins for automation and to build environments for Dev, QA, and Production. E.g., go to the Jenkins interface and push a button to deploy a hash to AWS

CloudOps always pushes the production environment

Stateful storage for app in external resources (Redis, S3, etc.)

Cloud spins up new nodes and changes DNS over

Want soft-launch techniques

How to get branch to production?

0. Config update needed; alert CloudOps
1. Some trigger of nightly build - tag, branch, something
2. Spins up environment exactly the same as production
3. Go to Jenkins - parameterized submission form: branch name or revision & puppet/ansible config repo
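
The parameterized submission in step 3 might be sketched as follows. This only builds the request URL and sends nothing. The job name and the `REF` and `CONFIG_REPO` parameter names are hypothetical. The `buildWithParameters` path is how Jenkins exposes parameterized builds remotely, but the real job layout was not specified in these notes.

```python
from urllib.parse import quote, urlencode


def build_deploy_url(jenkins_base, job, ref, config_repo):
    """Build the URL for a parameterized Jenkins build of one branch or revision."""
    if not ref.strip():
        raise ValueError("a branch name or revision is required")
    # Parameter names below are assumptions, not an agreed interface.
    params = urlencode({"REF": ref, "CONFIG_REPO": config_repo})
    base = jenkins_base.rstrip("/")
    return f"{base}/job/{quote(job)}/buildWithParameters?{params}"


# Example with placeholder values.
url = build_deploy_url(
    "https://jenkins.example.com/",
    "mdn-deploy",
    "master",
    "https://example.com/config.git",
)
```

An actual trigger would also need authentication and a POST request, which are left out of this sketch.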


Dec 3
 * [DECISION] Target end of Q2 move
   * Complete move
     * (groovecoder) file meta bug, NetApp bug, DNS ... blocked by kuma-lib
     * NetApp -> S3
       * (Travis) create stage & prod S3 buckets for MDN
       * (cturra) move MDN files to S3 buckets
       * (mdndev) change demo & wiki code to use S3
       * (mdndev) send static assets to S3
     * DNS -> AWS Route53 :)
     * Django web-heads -> AWS EC2
       * (Travis) spin up stage & prod EC2 instances
     * KumaScript nodes -> AWS EC2
       * (Travis) spin up stage & prod EC2 instances
     * RabbitMQ -> AWS Redis
       * (Travis) spin up stage & prod instances
     * celery node -> AWS EC2
       * (Travis) spin up stage & prod instances
       * (mdndev) change code from RabbitMQ to Redis
     * ElasticSearch -> AWS EC2 instances (custom ES cluster)
       * (travis) create ES cluster
       * (mdndev) re-build index on AWS ES cluster
       * (mdndev) change code to use AWS ES cluster
     * MySQL Database -> AWS RDS
       * (mdndev) test read-only
       * (IT, WebOps, mdndev, CloudOps) dump & import
       * (mdndev) code testing
     * Memcache -> AWS Redis
   * [todo] NetOps + WebOps + CloudOps: Monitoring
     * Nagios Monitoring -> drop from MOC?
   * Sentry Monitoring -> Sentry node in CloudOps
   * No Logging -> heka (https://mana.mozilla.org/wiki/display/CLOUDSERVICES/Logging+Standard)
   * (groovecoder) schedule meeting w/ mdndev & CloudOps to discuss deployment
   * (groovecoder) file bug - Product: Mozilla Services; Component: Operations assign to ckolos for AWS dev access to mdn-dev@mozilla.com


Nov 25
 * IT/WebOps still planning for 2015: questions about the benefits of shifting how Ops does deployments, and the benefit of doing this work now, or at a later time when the 2015 plan is clearer.
   * WebOps is leading the discussion on this
 * Q2 is preferred timing for MDN, so that we can wrap up our current project of site stability fixes (and not intermix infra changes with major code landings)
   * Q2 is good for Service Ops, too
 * Travis - AWS & ElasticSearch - hosted AWS or custom?
   * 2 custom/raw 6-node ES clusters
   * MDN currently runs 5 nodes (prod)
 * (groovecoder) send summary email including Sean & Stephanie
   * (groovecoder) schedule meeting for Mozlandia
 * What happens to self-service deployments, CI infrastructure?
   * should be no problem sticking with Travis for CI
   * need to examine options for self-service deploys, MDN currently uses chief
     * MDN OK with losing chief, wants to keep self-service part
     * needs discussion at Mozlandia

Nov 11

 * Who could work on the project?
   * Sean Rich, C Liang, Stephanie Chan
 * When could we do it?
   * IT/WebOps: planning 2015 quarters Nov 18th-19th
   * MDN: Q3 services
   * CloudOps: Prefer Q1-Q2
 * (groovecoder) owns the project
 * (cyliang) will get a list of infra + 3rd party services together
 * (groovecoder) get DB backup info from sheeri (see below)
 * (groovecoder) will schedule a follow-up meeting for week of Nov 25th

Optionally discuss:

 * List of infrastructure (web-heads, db servers, search nodes, celery nodes, rabbitmq, etc.)
   * https://mana.mozilla.org/wiki/display/websites/developer.mozilla.org+Cluster
   * (only update to above is that MDN ES is now on a separate set of clusters)
   * [ no metrics ingestion of logs ]
   * Any other IT integration points (e.g., storage)
     * /mnt/netapp for demo & wiki uploads
       * change to S3 (will need dev work)
 * List of 3rd-party services (socketlabs, etc.)
   * SocketLabs
   * NewRelic
   * Recaptcha?
   * Bitly?
 * Monitoring
   * Nagios
   * collectd / graphite (move to statsd/heka/graphite)
   * errormill (sentry)
 * Database Backup Retention
   * Sheeri
   * daily backups kept for a month
   * monthly backups (1st of the month) kept since 7/1/2014
 * CI/CD infrastructure
   * Switch to AWS + Jenkins?
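
The database backup retention rule listed above (daily backups kept for a month, first-of-month backups kept since 7/1/2014) can be expressed as a small check. This is a sketch under two assumptions: "7/1/2014" means July 1, 2014, and "a month" means 30 days. The function name is hypothetical.

```python
from datetime import date, timedelta

MONTHLY_KEEP_SINCE = date(2014, 7, 1)  # assumes "7/1/2014" is July 1, 2014
DAILY_RETENTION = timedelta(days=30)   # assumes "a month" means 30 days


def should_keep(backup_day, today):
    """Return True if a backup taken on backup_day is still retained on today."""
    if today - backup_day <= DAILY_RETENTION:
        return True  # recent daily backup
    # Older backups survive only if taken on the 1st of a month in range.
    return backup_day.day == 1 and backup_day >= MONTHLY_KEEP_SINCE
```

A check like this could help verify that any AWS RDS backup configuration matches the retention Sheeri described before the move.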

Notes:
 * MDN will(?) be standing up new services in 2015, but they don't depend on the wiki
   * new services can go to CSO, MDN can stay with IT
   * probably start standing these up for public consumption in Q3
 * travis has some directive to assist with MDN Ops management
   * if MDN moves, best to move the infra as well - just moving management + processes won't be any gain