The service dispatcher within the App Engine app needs to log accounting info
for the current request, e.g. `log.Infof("T:%s:queries:%d", user, n)`. This
data will later be parsed out and used for quota and billing purposes.
-------------------------------------------------------------------------------
In order to enable offline log processing, we need to implement a `/_logs`
handler on the App Engine app. This should validate against a shared secret
`key` and expose the [Log Query API] to callers.
[Log Query API]: https://developers.google.com/appengine/docs/go/log/reference
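A minimal sketch of what this handler might look like using the old `appengine/log` package; the `sharedKey` constant is a placeholder (the real value would come from config), and only the default query fields are exposed:
```go
package logs

import (
	"encoding/json"
	"net/http"

	"appengine"
	"appengine/log"
)

// sharedKey is a placeholder; the real value would come from config.
const sharedKey = "..."

func init() {
	http.HandleFunc("/_logs", logsHandler)
}

// logsHandler validates the shared secret and streams log records back
// as JSON, one record per line.
func logsHandler(w http.ResponseWriter, r *http.Request) {
	if r.FormValue("key") != sharedKey {
		http.Error(w, "forbidden", http.StatusForbidden)
		return
	}
	c := appengine.NewContext(r)
	q := &log.Query{AppLogs: true} // other Query fields could be exposed as params
	enc := json.NewEncoder(w)
	res := q.Run(c)
	for {
		rec, err := res.Next()
		if err == log.Done {
			break
		}
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		enc.Encode(rec)
	}
}
```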
-------------------------------------------------------------------------------
For error tracing on GAE we will be relying upon App Engine's built-in logging
support. To facilitate this, we need to raise the logs retention limit to
something suitably large. A petabyte should be plenty :)
-------------------------------------------------------------------------------
An `aws2stat` daemon needs to be written which enables [Amazon CloudWatch] for
all of our EC2 instances, Elastic Load Balancers and DynamoDB. The daemon
should also download all of the CloudWatch metrics and upload them to TempoDB
at regular intervals.
[Amazon CloudWatch]: http://aws.amazon.com/cloudwatch/
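As a very rough sketch of the daemon's shape only — the CloudWatch download and TempoDB upload are hidden behind hypothetical interfaces here, and a real implementation would sign `GetMetricStatistics` requests against the CloudWatch Query API:
```go
package aws2stat

import "time"

type Datapoint struct {
	Timestamp time.Time
	Value     float64
}

// MetricsSource abstracts "download all the CloudWatch metrics"; an
// implementation would wrap the signed CloudWatch Query API.
type MetricsSource interface {
	Fetch(since time.Time) (map[string][]Datapoint, error)
}

// MetricsSink abstracts the TempoDB upload.
type MetricsSink interface {
	Write(series string, points []Datapoint) error
}

// run polls the source on the given interval and forwards everything
// new to the sink.
func run(src MetricsSource, sink MetricsSink, interval time.Duration) {
	last := time.Now().Add(-interval)
	for now := range time.Tick(interval) {
		metrics, err := src.Fetch(last)
		if err != nil {
			continue // log and retry on the next tick
		}
		for series, points := range metrics {
			sink.Write(series, points)
		}
		last = now
	}
}
```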
-------------------------------------------------------------------------------
A generic mechanism should be added to `amp/runtime` that will allow our
daemons to reload their config files when sent a `SIGUSR1` signal.
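A minimal sketch of such a hook for `amp/runtime`, assuming daemons register a callback:
```go
package runtime

import (
	"os"
	"os/signal"
	"syscall"
)

// OnReload invokes reload (e.g. a function that re-parses the config
// file) every time the daemon is sent SIGUSR1.
func OnReload(reload func()) {
	ch := make(chan os.Signal, 1)
	signal.Notify(ch, syscall.SIGUSR1)
	go func() {
		for range ch {
			reload()
		}
	}()
}
```
A daemon would call `runtime.OnReload(loadConfig)` at startup, after which `kill -USR1 <pid>` triggers a re-read of its config.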
-------------------------------------------------------------------------------
A `tweet2stat` daemon needs to be written which will take a config file
specifying queries, e.g.
```yaml
espra: #espra OR espra.com
all: espra OR espians OR https://alpha.espra.com
```
The daemon should then regularly query Twitter for tweets matching the
specified queries and update TempoDB with counts of any new tweets.
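A sketch of the polling step against the (then-current) v1 Twitter search API — the endpoint shape and response fields here should be double-checked against the API docs:
```go
package tweet2stat

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

type searchResult struct {
	Results []struct {
		ID int64 `json:"id"`
	} `json:"results"`
	MaxID int64 `json:"max_id"`
}

// countNewTweets returns the number of tweets matching query that are
// newer than sinceID, plus the new high-water mark for the next poll.
func countNewTweets(query string, sinceID int64) (int, int64, error) {
	u := fmt.Sprintf("http://search.twitter.com/search.json?q=%s&since_id=%d",
		url.QueryEscape(query), sinceID)
	resp, err := http.Get(u)
	if err != nil {
		return 0, sinceID, err
	}
	defer resp.Body.Close()
	var res searchResult
	if err := json.NewDecoder(resp.Body).Decode(&res); err != nil {
		return 0, sinceID, err
	}
	if res.MaxID > sinceID {
		sinceID = res.MaxID
	}
	return len(res.Results), sinceID, nil
}
```
The daemon would call this on a ticker for each configured query and push the counts to TempoDB.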
-------------------------------------------------------------------------------
Daemons like `doozerd` and `statsy` should have fixed addresses within
clusters. Unfortunately, communicating with Elastic IPs from within EC2 incurs
charges. But it seems that we might be able to [get the internal ip address]
from the Public DNS name of an Elastic IP.
If this works, write a `get-host-for-elastic-ip` script which, when given an
Elastic IP address:
* Assigns it to a temporary EC2 instance
* Uses that to discover the Public DNS name for the IP
* Resolves that name from within EC2 to obtain the internal IP address
[get the internal ip address]: http://alestic.com/2009/06/ec2-elastic-ip-internal
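The final resolution step is trivial; a sketch (the name derivation below assumes the us-east-1 convention and may not hold everywhere, which is why the script discovers the real Public DNS name via a temporary instance):
```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// publicDNSName derives the conventional us-east-1 Public DNS name for
// an Elastic IP, e.g. 1.2.3.4 -> ec2-1-2-3-4.compute-1.amazonaws.com.
func publicDNSName(elasticIP string) string {
	return "ec2-" + strings.Replace(elasticIP, ".", "-", -1) +
		".compute-1.amazonaws.com"
}

func main() {
	name := publicDNSName("1.2.3.4")
	// When run from within EC2, this should resolve to the internal
	// 10.x address rather than the public one.
	addrs, err := net.LookupHost(name)
	if err != nil {
		panic(err)
	}
	fmt.Println(addrs)
}
```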
-------------------------------------------------------------------------------
An `amp/tempodb` package needs to be written to support the [TempoDB API].
[TempoDB API]: http://tempo-db.com/docs/api/
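A sketch of the write path, based on the v1 REST endpoints in the TempoDB docs; the details should be verified against the reference above:
```go
package tempodb

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

type Client struct {
	Key    string // API key, used as the basic-auth username
	Secret string // API secret, used as the basic-auth password
}

type DataPoint struct {
	T time.Time `json:"t"` // marshals to RFC3339, as the API expects
	V float64   `json:"v"`
}

// WriteKey posts datapoints to the series identified by its key.
func (c *Client) WriteKey(seriesKey string, data []DataPoint) error {
	body, err := json.Marshal(data)
	if err != nil {
		return err
	}
	url := fmt.Sprintf("https://api.tempo-db.com/v1/series/key/%s/data/", seriesKey)
	req, err := http.NewRequest("POST", url, bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.SetBasicAuth(c.Key, c.Secret)
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != 200 {
		return fmt.Errorf("tempodb: write failed: %s", resp.Status)
	}
	return nil
}
```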
-------------------------------------------------------------------------------
Write an `amp/dynamodb` package that supports reading and writing data to [DynamoDB].
[DynamoDB]: http://aws.amazon.com/dynamodb/
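A sketch of the package's surface only; the wire protocol (request signing and the `DynamoDB_20111205.*` JSON targets) is deliberately elided:
```go
package dynamodb

// Item models DynamoDB's typed attribute maps, e.g.
// {"user": {"S": "tav"}, "queries": {"N": "42"}}.
type Item map[string]map[string]string

// Client holds credentials for signing requests to a region's endpoint.
type Client struct {
	AccessKey, SecretKey, Region string
}

// Table binds a client to a named DynamoDB table.
type Table struct {
	Name   string
	client *Client
}

func (c *Client) Table(name string) *Table {
	return &Table{Name: name, client: c}
}

// PutItem writes an item (elided: sign and POST a PutItem request).
func (t *Table) PutItem(item Item) error { return nil }

// GetItem reads an item back by its key (elided: GetItem request).
func (t *Table) GetItem(key Item) (Item, error) { return nil, nil }
```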
-------------------------------------------------------------------------------
Implement an `amp/statsy` package that provides a clean API for sending
metrics to a `statsy` daemon over UDP. In addition, a `statsy.ProcInfo()`
function should be provided which sends info about the current process's
resource usage, e.g. cpu, resident memory, &c.
Since sending a message for everything could get overwhelming, an API should
be provided to sample the data, e.g.
```go
NewTimer("upload").Sample(100).Every(5 * time.Second)
```
The sampling rate should then adapt in real-time to reflect changes in load.
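One possible reading of this API, where `Sample(100)` keeps roughly 1 in 100 events and `Every` re-tunes that rate on an interval; all names here sketch the proposal above, nothing more:
```go
package statsy

import (
	"math/rand"
	"sync/atomic"
	"time"
)

type Timer struct {
	name string
	rate int64 // keep 1 in rate events
}

func NewTimer(name string) *Timer {
	return &Timer{name: name, rate: 1}
}

func (t *Timer) Sample(rate int) *Timer {
	atomic.StoreInt64(&t.rate, int64(rate))
	return t
}

// Every adapts the sampling rate on the given interval based on the
// observed event volume (the adaptation logic itself is elided).
func (t *Timer) Every(interval time.Duration) *Timer {
	go func() {
		for range time.Tick(interval) {
			// inspect recent volume, then atomically adjust t.rate
		}
	}()
	return t
}

// Record sends the timing over UDP when the event is sampled; the
// daemon scales values back up by the sampling rate.
func (t *Timer) Record(elapsed time.Duration) {
	if rand.Int63n(atomic.LoadInt64(&t.rate)) == 0 {
		// send a "name:elapsed|@rate" datagram to the statsy daemon
	}
}
```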
-------------------------------------------------------------------------------
Implement a `statsy` daemon which receives metrics data over UDP within a
cluster and uploads the data to TempoDB. When sending the data, it should
aggregate certain classes of metrics and account for sampling rates. And it
should signal its own resource usage by sending `statsy.RawProcInfo()`.
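A sketch of the receive-and-aggregate path, assuming the `name:value|@rate` datagram format from the client sketch above (rate being the 1-in-N sampling factor; the port is illustrative):
```go
package main

import (
	"net"
	"strconv"
	"strings"
	"sync"
	"time"
)

type aggregator struct {
	mu     sync.Mutex
	counts map[string]float64
}

// add scales sampled values back up by their 1-in-N sampling rate.
func (a *aggregator) add(name string, value, rate float64) {
	a.mu.Lock()
	a.counts[name] += value * rate
	a.mu.Unlock()
}

// flushLoop periodically swaps out the aggregates; each batch would be
// handed to the amp/tempodb client for upload.
func flushLoop(a *aggregator, interval time.Duration) {
	for range time.Tick(interval) {
		a.mu.Lock()
		batch := a.counts
		a.counts = map[string]float64{}
		a.mu.Unlock()
		_ = batch // upload to TempoDB
	}
}

// parse decodes "name:value|@rate"; rate defaults to 1 (unsampled).
func parse(msg string) (name string, value, rate float64) {
	rate = 1
	parts := strings.SplitN(msg, ":", 2)
	if len(parts) != 2 {
		return
	}
	name = parts[0]
	rest := strings.SplitN(parts[1], "|@", 2)
	value, _ = strconv.ParseFloat(rest[0], 64)
	if len(rest) == 2 {
		rate, _ = strconv.ParseFloat(rest[1], 64)
	}
	return
}

func main() {
	agg := &aggregator{counts: map[string]float64{}}
	go flushLoop(agg, time.Minute)
	conn, err := net.ListenPacket("udp", ":8125")
	if err != nil {
		panic(err)
	}
	buf := make([]byte, 1500)
	for {
		n, _, err := conn.ReadFrom(buf)
		if err != nil {
			continue
		}
		name, value, rate := parse(string(buf[:n]))
		if name != "" {
			agg.add(name, value, rate)
		}
	}
}
```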
-------------------------------------------------------------------------------
Add a `DynaLog` network logging option to the `amp/log` package so that it can
persist log entries to [DynamoDB]. It should automatically buffer unsent log
items to disk, so that they can be resent once DynamoDB is responsive again.
Also, add an option to automatically nuke standard file logs within
`amp/runtime`.
[DynamoDB]: http://aws.amazon.com/dynamodb/
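A sketch of the fallback path; the `LogTable` interface stands in for the eventual `amp/dynamodb` wrapper, and the background drainer that replays the spool is elided:
```go
package log

import (
	"os"
	"sync"
)

// LogTable stands in for the amp/dynamodb wrapper used to persist logs.
type LogTable interface {
	PutEntry(entry []byte) error
}

type DynaLog struct {
	mu    sync.Mutex
	table LogTable
	spool *os.File // append-only file of buffered, unsent entries
}

func (d *DynaLog) Write(entry []byte) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if err := d.table.PutEntry(entry); err != nil {
		// DynamoDB unresponsive: buffer to disk for later resend.
		d.spool.Write(append(entry, '\n'))
		d.spool.Sync()
		return
	}
	// On success, a background drainer would replay spooled entries.
}
```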
-------------------------------------------------------------------------------
Set up `statsy` within our EC2 clusters. Each should be given a perma-hostname
corresponding to its cluster name, e.g. `st-us1.espra.com`.
-------------------------------------------------------------------------------
Set up `doozerd` within our EC2 clusters. Each cluster should be given a
perma-hostname corresponding to its name, e.g. `dz-us1.espra.com`, and run 3
instances with distinct Elastic IPs attached.
-------------------------------------------------------------------------------
We need 4 elastic IPs in each cluster:
* doozerd (3)
* statsy (1)
We therefore need to ask Amazon to [increase the address limit] to 8 for us,
i.e. enough for two clusters.
[increase the address limit]: http://aws.amazon.com/contact-us/eip_limit_request/
-------------------------------------------------------------------------------
Write a `remonit` daemon that can be deployed at multiple locations outside of
our core infrastructure in order to monitor uptime, latency and response
times. It should support:
* DNS lookups
* HTTPS requests
The results should be collated and uploaded to TempoDB. Any network failure
should result in a timestamped error file being written with additional info
like traceroutes, name servers and their ips, host ip, certificates, times
taken, partial contents, &c.
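A sketch of a single probe covering both checks, capturing timings and the certificate details needed for the error file (a production probe would also set explicit timeouts):
```go
package main

import (
	"net"
	"net/http"
	"time"
)

type ProbeResult struct {
	Host        string
	Addrs       []string
	DNSTime     time.Duration
	HTTPTime    time.Duration
	StatusCode  int
	CertExpires time.Time
	Err         error
}

// probe times a DNS lookup and an HTTPS GET against the host.
func probe(host string) ProbeResult {
	res := ProbeResult{Host: host}
	start := time.Now()
	res.Addrs, res.Err = net.LookupHost(host)
	res.DNSTime = time.Since(start)
	if res.Err != nil {
		return res // write a timestamped error file with traceroute &c.
	}
	start = time.Now()
	resp, err := http.Get("https://" + host + "/")
	res.HTTPTime = time.Since(start)
	if err != nil {
		res.Err = err
		return res
	}
	defer resp.Body.Close()
	res.StatusCode = resp.StatusCode
	if resp.TLS != nil && len(resp.TLS.PeerCertificates) > 0 {
		res.CertExpires = resp.TLS.PeerCertificates[0].NotAfter
	}
	return res
}
```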
-------------------------------------------------------------------------------
We need to decide between [harvestd] and [Diamond] for monitoring core server
metrics on our EC2 instances. In either case, a standard configuration needs
to be put together and a handler needs to be written that sends the metrics to
our `statsy` daemons.
[Diamond]: https://github.com/BrightcoveOS/Diamond
[harvestd]: https://github.com/mk-fg/graphite-metrics
-------------------------------------------------------------------------------
An `espra/statsy` package needs to be written which exposes something similar
to the `amp/statsy` interface for capturing metrics. But since this will be
running inside of App Engine, instead of sending the metrics to a `statsy`
daemon over UDP, we need to capture the info in `memcache` which then gets
aggregated and uploaded to TempoDB using a `/_statsy` task queue handler.
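A sketch of the capture side using per-window memcache counters; the key scheme is an assumption, with the `/_statsy` handler expected to sweep completed windows:
```go
package statsy

import (
	"fmt"
	"time"

	"appengine"
	"appengine/memcache"
)

// windowKey buckets a metric into the current one-minute window so the
// /_statsy flush handler can sweep completed windows.
func windowKey(name string) string {
	return fmt.Sprintf("statsy:%s:%d", name, time.Now().Unix()/60)
}

// Incr counts one occurrence of the named metric. Failures are logged
// and ignored: metrics must never block or break a request.
func Incr(c appengine.Context, name string) {
	if _, err := memcache.Increment(c, windowKey(name), 1, 0); err != nil {
		c.Errorf("statsy: %v", err)
	}
}
```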
-------------------------------------------------------------------------------
Key business metrics like the following must be captured within the app:
* Sign ups
* Upgrades/Downgrades/Cancellations
* Logins
* Successful Payments and Overdues
Metrics must support additional flags specifying any content optimisation
factors, e.g. "blue-button", "orange-button.upgrade-now-text", &c.
-------------------------------------------------------------------------------
There needs to be a `/_timings` handler on an App Engine backend instance
which receives info from browsers relating to Navigation Timing. This should
aggregate the data elements and send the data to TempoDB.
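A sketch of the handler on a backend instance, accumulating sums and counts in memory for a periodic TempoDB flush (the flush itself is elided and the posted field names are assumptions):
```go
package timings

import (
	"net/http"
	"strconv"
	"sync"
)

var (
	mu     sync.Mutex
	counts = map[string]int64{}
	totals = map[string]int64{}
)

func init() {
	http.HandleFunc("/_timings", handler)
}

// handler folds posted millisecond timings into per-field aggregates.
func handler(w http.ResponseWriter, r *http.Request) {
	for _, field := range []string{"dns", "connect", "response", "domload"} {
		ms, err := strconv.ParseInt(r.FormValue(field), 10, 64)
		if err != nil || ms < 0 {
			continue // untrusted input: skip anything malformed
		}
		mu.Lock()
		counts[field]++
		totals[field] += ms
		mu.Unlock()
	}
	w.WriteHeader(http.StatusNoContent)
}
```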
-------------------------------------------------------------------------------
We can implement `timings.coffee` which uses the Navigation Timing API in
browsers like Firefox and Chrome to send back network/page load times to the
App Engine app. Whilst this data can't be fully trusted, it gives us enough
data to work with for now.
-------------------------------------------------------------------------------
Write an `espra-mission-control` app that aggregates all of our devops
metrics, logging and analytics in one central place.
Care should be taken to host this on infrastructure independent of the rest of
our infrastructure, i.e. not on Route53, DNSMadeEasy, EC2 or GAE. Even the
domain used should be different, i.e. not `espra.com`. Perhaps Hetzner and
Linode DNS would be viable hosts.
Mission control will grab the time-series data stored in TempoDB and display
them as sexy graphs and aggregated counts on the browser. It should be
possible to transform the displayed data with custom aggregate and map
functions as well as correlate metrics against one another. It might be useful
to have an option that discards outliers above the 90th percentile for
timing-related metrics.
Since we will be looking at this all day, Mission Control must look pretty. At
least as pretty as [Librato Metrics] and [Geckoboard]. For all of the time-
series graphing, the sexy [Cubism.js] library along with [Rickshaw] should
help in this regard.
Metrics could be associated with additional metadata for display purposes,
including custom icons for triggers (e.g. a flurry of tweets, a
performance-killing deploy, &c.). It should be possible to save custom views
with a given name. All of this info should be persisted to config for use on
reload.
The info/error logs stored to both DynamoDB and within App Engine should be
viewable directly from Mission Control. Though, given that it's possible for
App Engine frontends to be down whilst their admin dashboard is still up, a
direct link to the dashboard for error logs wouldn't hurt as a backup.
For request logs, Mission Control should cache and serve from BigQuery, with
the ability to drill down into standard request analytics around user, ip,
service, resource, geo-location, &c. through both batch queries and ad-hoc
interactive queries.
The relative number of requests to our origin and CDN servers can be used to
spit out a CDN hit/miss metric. And metrics from `aws2stat` could be used to
suggest adding extra instances within a given cluster.
The `/_get_all_states` handler on the App Engine app can be used to report
how recently services like `aws2stat`, `logs2stat`, `remonit`, `tweet2stat`,
&c. last ran and how backlogged each of them is.
It should be possible to configure alerts via e-mail (Mandrill), SMS (Nexmo)
or web hooks when certain conditions are met within a certain time period:
* No metrics for a given service.
* No metrics for a given service from at least N subs.
* Metrics above or below a given threshold.
The Mandrill and Nexmo accounts used must be independent of the accounts used
on our main App Engine app. And, finally, mission control should also ensure
that all of our SSL certificates are checked for expiry, sending out a daily
alert during the 15 days before a certificate expires (sketched below).
[Cubism.js]: http://square.github.com/cubism/
[Geckoboard]: http://www.geckoboard.com/
[Librato Metrics]: https://metrics.librato.com/
[Rickshaw]: http://code.shutterstock.com/rickshaw/
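The certificate expiry check mentioned above is straightforward; a sketch:
```go
package main

import (
	"crypto/tls"
	"time"
)

// certExpiry dials the host over TLS and returns when its leaf
// certificate expires.
func certExpiry(host string) (time.Time, error) {
	conn, err := tls.Dial("tcp", host+":443", nil)
	if err != nil {
		return time.Time{}, err
	}
	defer conn.Close()
	return conn.ConnectionState().PeerCertificates[0].NotAfter, nil
}

// shouldAlert reports whether we are within the 15-day warning window.
func shouldAlert(expires time.Time) bool {
	return time.Until(expires) < 15*24*time.Hour
}
```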
-------------------------------------------------------------------------------
There needs to be an `/_accounting` handler on the App Engine app which our
services outside of GAE can call to report resource usage by users. This
needs to be accompanied by a taskqueue handler which then updates the billing
and invoices for those users.
-------------------------------------------------------------------------------
Implement `logs2stat` which:
* Routinely grabs the App Engine logs exposed via the `/_logs` handler, parses
out the requests and accounting info, then aggregates them before uploading
metrics to TempoDB (e.g. req/s, browser, &c.) and structured data for
analytics to BigQuery.
* Does the same by grabbing request logs from DynamoDB in our various
clusters and then uploading to TempoDB and BigQuery.
* Does the same again by grabbing request logs for our CDN.
* Aggregates accounting info and calls the `/_accounting` handler on our App
Engine app with data relating to user accounts.
It should be possible to provide sharding factors to `logs2stat` so that if
our log data becomes too much for a single server to sync, it can be done on
multiple machines at the same time.
-------------------------------------------------------------------------------
Implement a set of handlers on our App Engine app for storing and retrieving
state/config data for our external daemons/apps like `logs2stat` and mission
control in case they go down and have to resume from a certain point (see the
sketch after this list):
* `/_init_state` should take a `key` and `secret` and initialise a given
state. An HTML form should be presented if no parameters are set and the
handler should only be callable by admins, i.e. `user.IsAdmin()`.
* `/_set_state` should take a `key`, `secret` and `value` and store the
key/value with a timestamp.
* `/_get_state` should take a `key`, `secret` and return the stored value and
timestamp.
* `/_get_all_states` should return a list of the keys, values and timestamps
for all stored states using a special master secret.
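A sketch of `/_set_state` using the datastore; the secret check against the value stored by `/_init_state` is stubbed out:
```go
package state

import (
	"net/http"
	"time"

	"appengine"
	"appengine/datastore"
)

type State struct {
	Value     string
	Timestamp time.Time
}

func init() {
	http.HandleFunc("/_set_state", setState)
}

// validSecret would load the secret stored by /_init_state and compare.
func validSecret(c appengine.Context, key, secret string) bool {
	// elided: datastore.Get on the entity initialised for this key
	return secret != ""
}

func setState(w http.ResponseWriter, r *http.Request) {
	c := appengine.NewContext(r)
	key, secret, value := r.FormValue("key"), r.FormValue("secret"), r.FormValue("value")
	if !validSecret(c, key, secret) {
		http.Error(w, "forbidden", http.StatusForbidden)
		return
	}
	k := datastore.NewKey(c, "State", key, 0, nil)
	if _, err := datastore.Put(c, k, &State{Value: value, Timestamp: time.Now()}); err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	w.Write([]byte("OK"))
}
```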
-------------------------------------------------------------------------------
Write a minimal `dme-route53` tool that lets us CRUD records to both DNS Made
Easy and Route53 simultaneously.
-------------------------------------------------------------------------------
Our production `bolt` deployment script on the deployment server should
automatically add deployment metrics to TempoDB so that overall system
performance can be correlated to:
* App Engine Deploys
* DNS Updates
* EC2 Cluster Deploys
* Provisioning of EC2 Instances
-------------------------------------------------------------------------------
Auditing should be enabled on the deployment server.
-------------------------------------------------------------------------------
Both the deployment server and the mission control app server should be
firewalled off from the wider internet and only be available over secure
channels.