The service dispatcher within the App Engine app needs to log accounting info
for the current request, e.g. `log.Infof("T:%s:queries:%d", user, n)`. This
data will later be parsed out and used for quota and billing purposes.
-------------------------------------------------------------------------------
In order to enable offline log processing, we need to implement a `/_logs`
handler on the App Engine app. This should validate against a shared secret
`key` and expose the [Log Query API] to callers.
[Log Query API]: https://developers.google.com/appengine/docs/go/log/reference
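A minimal sketch of what this handler might look like using the old `appengine/log` package; the `sharedKey` constant is a placeholder (the real value would come from config), and only the default query fields are exposed:
```go
package logs

import (
	"encoding/json"
	"net/http"

	"appengine"
	"appengine/log"
)

// sharedKey is a placeholder; the real value would come from config.
const sharedKey = "..."

func init() {
	http.HandleFunc("/_logs", logsHandler)
}

// logsHandler validates the shared secret and streams log records back
// as JSON, one record per line.
func logsHandler(w http.ResponseWriter, r *http.Request) {
	if r.FormValue("key") != sharedKey {
		http.Error(w, "forbidden", http.StatusForbidden)
		return
	}
	c := appengine.NewContext(r)
	q := &log.Query{AppLogs: true} // other Query fields could be exposed as params
	enc := json.NewEncoder(w)
	res := q.Run(c)
	for {
		rec, err := res.Next()
		if err == log.Done {
			break
		}
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		enc.Encode(rec)
	}
}
```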
-------------------------------------------------------------------------------
For error tracing on GAE we will be relying upon App Engine's built-in logging
support. To facilitate this, we need to raise the logs retention limit to
something suitably large. A petabyte should be plenty :)
-------------------------------------------------------------------------------
An `aws2stat` daemon needs to be written which enables [Amazon CloudWatch] for
all of our EC2 instances, Elastic Load Balancers and DynamoDB. The daemon
should also download all of the CloudWatch metrics and upload them to TempoDB
at regular intervals.
[Amazon CloudWatch]: http://aws.amazon.com/cloudwatch/
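As a very rough sketch of the daemon's shape only — the CloudWatch download and TempoDB upload are hidden behind hypothetical interfaces here, and a real implementation would sign `GetMetricStatistics` requests against the CloudWatch Query API:
```go
package aws2stat

import "time"

type Datapoint struct {
	Timestamp time.Time
	Value     float64
}

// MetricsSource abstracts "download all the CloudWatch metrics"; an
// implementation would wrap the signed CloudWatch Query API.
type MetricsSource interface {
	Fetch(since time.Time) (map[string][]Datapoint, error)
}

// MetricsSink abstracts the TempoDB upload.
type MetricsSink interface {
	Write(series string, points []Datapoint) error
}

// run polls the source on the given interval and forwards everything
// new to the sink.
func run(src MetricsSource, sink MetricsSink, interval time.Duration) {
	last := time.Now().Add(-interval)
	for now := range time.Tick(interval) {
		metrics, err := src.Fetch(last)
		if err != nil {
			continue // log and retry on the next tick
		}
		for series, points := range metrics {
			sink.Write(series, points)
		}
		last = now
	}
}
```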
-------------------------------------------------------------------------------
A generic mechanism should be added to `amp/runtime` that will allow our
daemons to reload their config files when sent a `SIGUSR1` signal.
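A minimal sketch of such a hook for `amp/runtime`, assuming daemons register a callback:
```go
package runtime

import (
	"os"
	"os/signal"
	"syscall"
)

// OnReload invokes reload (e.g. a function that re-parses the config
// file) every time the daemon is sent SIGUSR1.
func OnReload(reload func()) {
	ch := make(chan os.Signal, 1)
	signal.Notify(ch, syscall.SIGUSR1)
	go func() {
		for range ch {
			reload()
		}
	}()
}
```
A daemon would call `runtime.OnReload(loadConfig)` at startup, after which `kill -USR1 <pid>` triggers a re-read of its config.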
-------------------------------------------------------------------------------
A `tweet2stat` daemon needs to be written which will take a config file
specifying queries, e.g.
```yaml
espra: #espra OR espra.com
all: espra OR espians OR https://alpha.espra.com
```
The daemon should then regularly query Twitter for tweets matching the
specified queries and update TempoDB with counts of any new tweets.
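A sketch of the polling step against the (then-current) v1 Twitter search API — the endpoint shape and response fields here should be double-checked against the API docs:
```go
package tweet2stat

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

type searchResult struct {
	Results []struct {
		ID int64 `json:"id"`
	} `json:"results"`
	MaxID int64 `json:"max_id"`
}

// countNewTweets returns the number of tweets matching query that are
// newer than sinceID, plus the new high-water mark for the next poll.
func countNewTweets(query string, sinceID int64) (int, int64, error) {
	u := fmt.Sprintf("http://search.twitter.com/search.json?q=%s&since_id=%d",
		url.QueryEscape(query), sinceID)
	resp, err := http.Get(u)
	if err != nil {
		return 0, sinceID, err
	}
	defer resp.Body.Close()
	var res searchResult
	if err := json.NewDecoder(resp.Body).Decode(&res); err != nil {
		return 0, sinceID, err
	}
	if res.MaxID > sinceID {
		sinceID = res.MaxID
	}
	return len(res.Results), sinceID, nil
}
```
The daemon would call this on a ticker for each configured query and push the counts to TempoDB.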
-------------------------------------------------------------------------------
Daemons like `doozerd` and `statsy` should have fixed addresses within
clusters. Unfortunately, communicating with Elastic IPs from within EC2 incurs
charges. But it seems that we might be able to [get the internal ip address]
from the Public DNS name of an Elastic IP.
If this works, write a `get-host-for-elastic-ip` script which, when given an
Elastic IP address:
* Assigns it to a temporary EC2 instance
* Uses that to discover the Public DNS name for the IP
* Resolves that name from within EC2 to obtain the internal IP address
[get the internal ip address]: http://alestic.com/2009/06/ec2-elastic-ip-internal
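The final resolution step is trivial; a sketch (the name derivation below assumes the us-east-1 convention and may not hold everywhere, which is why the script discovers the real Public DNS name via a temporary instance):
```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// publicDNSName derives the conventional us-east-1 Public DNS name for
// an Elastic IP, e.g. 1.2.3.4 -> ec2-1-2-3-4.compute-1.amazonaws.com.
func publicDNSName(elasticIP string) string {
	return "ec2-" + strings.Replace(elasticIP, ".", "-", -1) +
		".compute-1.amazonaws.com"
}

func main() {
	name := publicDNSName("1.2.3.4")
	// When run from within EC2, this should resolve to the internal
	// 10.x address rather than the public one.
	addrs, err := net.LookupHost(name)
	if err != nil {
		panic(err)
	}
	fmt.Println(addrs)
}
```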
-------------------------------------------------------------------------------
An `amp/tempodb` package needs to be written to support the [TempoDB API].
[TempoDB API]: http://tempo-db.com/docs/api/
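A sketch of the write path, based on the v1 REST endpoints in the TempoDB docs; the details should be verified against the reference above:
```go
package tempodb

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

type Client struct {
	Key    string // API key, used as the basic-auth username
	Secret string // API secret, used as the basic-auth password
}

type DataPoint struct {
	T time.Time `json:"t"` // marshals to RFC3339, as the API expects
	V float64   `json:"v"`
}

// WriteKey posts datapoints to the series identified by its key.
func (c *Client) WriteKey(seriesKey string, data []DataPoint) error {
	body, err := json.Marshal(data)
	if err != nil {
		return err
	}
	url := fmt.Sprintf("https://api.tempo-db.com/v1/series/key/%s/data/", seriesKey)
	req, err := http.NewRequest("POST", url, bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.SetBasicAuth(c.Key, c.Secret)
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != 200 {
		return fmt.Errorf("tempodb: write failed: %s", resp.Status)
	}
	return nil
}
```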
-------------------------------------------------------------------------------
Write an `amp/dynamodb` package that supports reading and writing data to [DynamoDB].
[DynamoDB]: http://aws.amazon.com/dynamodb/
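A sketch of the package's surface only; the wire protocol (request signing and the `DynamoDB_20111205.*` JSON targets) is deliberately elided:
```go
package dynamodb

// Item models DynamoDB's typed attribute maps, e.g.
// {"user": {"S": "tav"}, "queries": {"N": "42"}}.
type Item map[string]map[string]string

// Client holds credentials for signing requests to a region's endpoint.
type Client struct {
	AccessKey, SecretKey, Region string
}

// Table binds a client to a named DynamoDB table.
type Table struct {
	Name   string
	client *Client
}

func (c *Client) Table(name string) *Table {
	return &Table{Name: name, client: c}
}

// PutItem writes an item (elided: sign and POST a PutItem request).
func (t *Table) PutItem(item Item) error { return nil }

// GetItem reads an item back by its key (elided: GetItem request).
func (t *Table) GetItem(key Item) (Item, error) { return nil, nil }
```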
-------------------------------------------------------------------------------
Implement an `amp/statsy` package that provides a clean API for sending
metrics to a `statsy` daemon over UDP. In addition, a `statsy.ProcInfo()`
function should be provided which sends info about the current process's
resource usage, e.g. cpu, resident memory, &c.
Since sending a message for everything could get overwhelming, an API should
be provided to sample the data, e.g.
```go
NewTimer("upload").Sample(100).Every(5 * time.Second)
```
The sampling rate should then adapt in real-time to reflect changes in load.
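One possible reading of this API, where `Sample(100)` keeps roughly 1 in 100 events and `Every` re-tunes that rate on an interval; all names here sketch the proposal above, nothing more:
```go
package statsy

import (
	"math/rand"
	"sync/atomic"
	"time"
)

type Timer struct {
	name string
	rate int64 // keep 1 in rate events
}

func NewTimer(name string) *Timer {
	return &Timer{name: name, rate: 1}
}

func (t *Timer) Sample(rate int) *Timer {
	atomic.StoreInt64(&t.rate, int64(rate))
	return t
}

// Every adapts the sampling rate on the given interval based on the
// observed event volume (the adaptation logic itself is elided).
func (t *Timer) Every(interval time.Duration) *Timer {
	go func() {
		for range time.Tick(interval) {
			// inspect recent volume, then atomically adjust t.rate
		}
	}()
	return t
}

// Record sends the timing over UDP when the event is sampled; the
// daemon scales values back up by the sampling rate.
func (t *Timer) Record(elapsed time.Duration) {
	if rand.Int63n(atomic.LoadInt64(&t.rate)) == 0 {
		// send a "name:elapsed|@rate" datagram to the statsy daemon
	}
}
```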
-------------------------------------------------------------------------------
Implement a `statsy` daemon which receives metrics data over UDP within a
cluster and uploads the data to TempoDB. When sending the data, it should
aggregate certain classes of metrics and account for sampling rates. And it
should signal its own resource usage by sending `statsy.RawProcInfo()`.
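A sketch of the receive-and-aggregate path, assuming the `name:value|@rate` datagram format from the client sketch above (rate being the 1-in-N sampling factor; the port is illustrative):
```go
package main

import (
	"net"
	"strconv"
	"strings"
	"sync"
	"time"
)

type aggregator struct {
	mu     sync.Mutex
	counts map[string]float64
}

// add scales sampled values back up by their 1-in-N sampling rate.
func (a *aggregator) add(name string, value, rate float64) {
	a.mu.Lock()
	a.counts[name] += value * rate
	a.mu.Unlock()
}

// flushLoop periodically swaps out the aggregates; each batch would be
// handed to the amp/tempodb client for upload.
func flushLoop(a *aggregator, interval time.Duration) {
	for range time.Tick(interval) {
		a.mu.Lock()
		batch := a.counts
		a.counts = map[string]float64{}
		a.mu.Unlock()
		_ = batch // upload to TempoDB
	}
}

// parse decodes "name:value|@rate"; rate defaults to 1 (unsampled).
func parse(msg string) (name string, value, rate float64) {
	rate = 1
	parts := strings.SplitN(msg, ":", 2)
	if len(parts) != 2 {
		return
	}
	name = parts[0]
	rest := strings.SplitN(parts[1], "|@", 2)
	value, _ = strconv.ParseFloat(rest[0], 64)
	if len(rest) == 2 {
		rate, _ = strconv.ParseFloat(rest[1], 64)
	}
	return
}

func main() {
	agg := &aggregator{counts: map[string]float64{}}
	go flushLoop(agg, time.Minute)
	conn, err := net.ListenPacket("udp", ":8125")
	if err != nil {
		panic(err)
	}
	buf := make([]byte, 1500)
	for {
		n, _, err := conn.ReadFrom(buf)
		if err != nil {
			continue
		}
		name, value, rate := parse(string(buf[:n]))
		if name != "" {
			agg.add(name, value, rate)
		}
	}
}
```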
-------------------------------------------------------------------------------
Add a `DynaLog` network logging option to the `amp/log` package so that it can
persist log entries to [DynamoDB]. It should automatically buffer unsent log
items to disk, so that they can be resent once DynamoDB is responsive again.
Also, add an option to automatically nuke standard file logs within
`amp/runtime`.
[DynamoDB]: http://aws.amazon.com/dynamodb/
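A sketch of the fallback path; the `LogTable` interface stands in for the eventual `amp/dynamodb` wrapper, and the background drainer that replays the spool is elided:
```go
package log

import (
	"os"
	"sync"
)

// LogTable stands in for the amp/dynamodb wrapper used to persist logs.
type LogTable interface {
	PutEntry(entry []byte) error
}

type DynaLog struct {
	mu    sync.Mutex
	table LogTable
	spool *os.File // append-only file of buffered, unsent entries
}

func (d *DynaLog) Write(entry []byte) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if err := d.table.PutEntry(entry); err != nil {
		// DynamoDB unresponsive: buffer to disk for later resend.
		d.spool.Write(append(entry, '\n'))
		d.spool.Sync()
		return
	}
	// On success, a background drainer would replay spooled entries.
}
```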
-------------------------------------------------------------------------------
Set up `statsy` within our EC2 clusters. Each should be given a perma-hostname
corresponding to its cluster name, e.g. `st-us1.espra.com`.
-------------------------------------------------------------------------------
Set up `doozerd` within our EC2 clusters. Each cluster should be given a
perma-hostname corresponding to its name, e.g. `dz-us1.espra.com`, and run 3
instances with distinct Elastic IPs attached.
-------------------------------------------------------------------------------
We need 4 elastic IPs in each cluster:
* doozerd (3)
* statsy (1)
We therefore need to ask Amazon to [increase the address limit] to 8 for us,
i.e. enough for two clusters.
[increase the address limit]: http://aws.amazon.com/contact-us/eip_limit_request/
-------------------------------------------------------------------------------
Write a `remonit` daemon that can be deployed at multiple locations outside of
our core infrastructure in order to monitor uptime, latency and response
times. It should support:
* DNS lookups
* HTTPS requests
The results should be collated and uploaded to TempoDB. Any network failure
should result in a timestamped error file being written with additional info
like traceroutes, name servers and their ips, host ip, certificates, times
taken, partial contents, &c.
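A sketch of a single probe covering both checks, capturing timings and the certificate details needed for the error file (a production probe would also set explicit timeouts):
```go
package main

import (
	"net"
	"net/http"
	"time"
)

type ProbeResult struct {
	Host        string
	Addrs       []string
	DNSTime     time.Duration
	HTTPTime    time.Duration
	StatusCode  int
	CertExpires time.Time
	Err         error
}

// probe times a DNS lookup and an HTTPS GET against the host.
func probe(host string) ProbeResult {
	res := ProbeResult{Host: host}
	start := time.Now()
	res.Addrs, res.Err = net.LookupHost(host)
	res.DNSTime = time.Since(start)
	if res.Err != nil {
		return res // write a timestamped error file with traceroute &c.
	}
	start = time.Now()
	resp, err := http.Get("https://" + host + "/")
	res.HTTPTime = time.Since(start)
	if err != nil {
		res.Err = err
		return res
	}
	defer resp.Body.Close()
	res.StatusCode = resp.StatusCode
	if resp.TLS != nil && len(resp.TLS.PeerCertificates) > 0 {
		res.CertExpires = resp.TLS.PeerCertificates[0].NotAfter
	}
	return res
}
```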
-------------------------------------------------------------------------------
We need to decide between [harvestd] and [Diamond] for monitoring core server
metrics on our EC2 instances. In either case, a standard configuration needs
to be put together and a handler needs to be written that sends the metrics to
our `statsy` daemons.
[Diamond]: https://github.com/BrightcoveOS/Diamond
[harvestd]: https://github.com/mk-fg/graphite-metrics
-------------------------------------------------------------------------------
An `espra/statsy` package needs to be written which exposes something similar
to the `amp/statsy` interface for capturing metrics. But since this will be
running inside of App Engine, instead of sending the metrics to a `statsy`
daemon over UDP, we need to capture the info in `memcache` which then gets
aggregated and uploaded to TempoDB using a `/_statsy` task queue handler.
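A sketch of the capture side using per-window memcache counters; the key scheme is an assumption, with the `/_statsy` handler expected to sweep completed windows:
```go
package statsy

import (
	"fmt"
	"time"

	"appengine"
	"appengine/memcache"
)

// windowKey buckets a metric into the current one-minute window so the
// /_statsy flush handler can sweep completed windows.
func windowKey(name string) string {
	return fmt.Sprintf("statsy:%s:%d", name, time.Now().Unix()/60)
}

// Incr counts one occurrence of the named metric. Failures are logged
// and ignored: metrics must never block or break a request.
func Incr(c appengine.Context, name string) {
	if _, err := memcache.Increment(c, windowKey(name), 1, 0); err != nil {
		c.Errorf("statsy: %v", err)
	}
}
```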
-------------------------------------------------------------------------------
Key business metrics like the following must be captured within the app:
* Sign ups
* Upgrades/Downgrades/Cancellations
* Logins
* Successful Payments and Overdues
Metrics must support additional flags specifying any content optimisation
factors, e.g. "blue-button", "orange-button.upgrade-now-text", &c.
-------------------------------------------------------------------------------
There needs to be a `/_timings` handler on an App Engine backend instance
which receives info from browsers relating to Navigation Timing. This should
aggregate the data elements and send the data to TempoDB.
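A sketch of the handler on a backend instance, accumulating sums and counts in memory for a periodic TempoDB flush (the flush itself is elided and the posted field names are assumptions):
```go
package timings

import (
	"net/http"
	"strconv"
	"sync"
)

var (
	mu     sync.Mutex
	counts = map[string]int64{}
	totals = map[string]int64{}
)

func init() {
	http.HandleFunc("/_timings", handler)
}

// handler folds posted millisecond timings into per-field aggregates.
func handler(w http.ResponseWriter, r *http.Request) {
	for _, field := range []string{"dns", "connect", "response", "domload"} {
		ms, err := strconv.ParseInt(r.FormValue(field), 10, 64)
		if err != nil || ms < 0 {
			continue // untrusted input: skip anything malformed
		}
		mu.Lock()
		counts[field]++
		totals[field] += ms
		mu.Unlock()
	}
	w.WriteHeader(http.StatusNoContent)
}
```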
-------------------------------------------------------------------------------
We can implement `timings.coffee` which uses the Navigation Timing API in
browsers like Firefox and Chrome to send back network/page load times to the
App Engine app. Whilst this data can't be fully trusted, it gives us enough
data to work with for now.
-------------------------------------------------------------------------------
Write an `espra-mission-control` app that aggregates all of our devops
metrics, logging and analytics in one central place.
Care should be taken to host this on infrastructure independent of the rest of
our infrastructure, i.e. not on Route53, DNSMadeEasy, EC2 or GAE. Even the
domain used should be different, i.e. not `espra.com`. Perhaps Hetzner and
Linode DNS would be viable hosts.
Mission control will grab the time-series data stored in TempoDB and display
them as sexy graphs and aggregated counts on the browser. It should be
possible to transform the displayed data with custom aggregate and map
functions as well as correlate metrics against one another. It might be useful
to have an option that discards outliers above the 90th percentile for
timing-related metrics.
Since we will be looking at this all day, Mission Control must look pretty. At
least as pretty as [Librato Metrics] and [Geckoboard]. For all of the time-
series graphing, the sexy [Cubism.js] library along with [Rickshaw] should
help in this regard.
Metrics could be associated with additional metadata for display purposes,
including custom icons for triggers (e.g. a flurry of tweets, a
performance-killing deploy, &c.). It should be possible to save custom views
with a given name. All of this info should be persisted to config for use on
reload.
The info/error logs stored to both DynamoDB and within App Engine should be
viewable directly from Mission Control. Though, given that it's possible for
App Engine frontends to be down whilst their admin dashboard is still up, a
direct link to the dashboard for error logs wouldn't hurt as a backup.
For request logs, Mission Control should cache and serve from BigQuery, with
the ability to drill down into standard request analytics around user, ip,
service, resource, geo-location, &c. through both batch queries and ad-hoc
interactive queries.
The relative number of requests to our origin and CDN servers can be used to
spit out a CDN hit/miss metric. And metrics from `aws2stat` could be used to
suggest adding extra instances within a given cluster.
The `/_get_all_states` handler on the App Engine app can be used to report
how recently services like `aws2stat`, `logs2stat`, `remonit`, `tweet2stat`,
&c. last ran and how backlogged each of them is.
It should be possible to configure alerts via e-mail (Mandrill), SMS (Nexmo)
or web hooks when certain conditions are met within a certain time period:
* No metrics for a given service.
* No metrics for a given service from at least N subs.
* Metrics above or below a given threshold.
The Mandrill and Nexmo accounts used must be independent of the accounts used
on our main App Engine app. And, finally, mission control should also ensure
that all of our SSL certificates are checked for expiry, sending out a daily
alert during the 15 days before a certificate expires (sketched below).
[Cubism.js]: http://square.github.com/cubism/
[Geckoboard]: http://www.geckoboard.com/
[Librato Metrics]: https://metrics.librato.com/
[Rickshaw]: http://code.shutterstock.com/rickshaw/
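The certificate expiry check mentioned above is straightforward; a sketch:
```go
package main

import (
	"crypto/tls"
	"time"
)

// certExpiry dials the host over TLS and returns when its leaf
// certificate expires.
func certExpiry(host string) (time.Time, error) {
	conn, err := tls.Dial("tcp", host+":443", nil)
	if err != nil {
		return time.Time{}, err
	}
	defer conn.Close()
	return conn.ConnectionState().PeerCertificates[0].NotAfter, nil
}

// shouldAlert reports whether we are within the 15-day warning window.
func shouldAlert(expires time.Time) bool {
	return time.Until(expires) < 15*24*time.Hour
}
```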
-------------------------------------------------------------------------------
There needs to be an `/_accounting` handler on the App Engine app which our
services outside of GAE can call to report resource usage by users. This
needs to be accompanied by a taskqueue handler which then updates the billing
and invoices for those users.
-------------------------------------------------------------------------------
Implement `logs2stat` which:
* Routinely grabs the App Engine logs exposed via the `/_logs` handler, parses
out the requests and accounting info, then aggregates them before uploading
metrics to TempoDB (e.g. req/s, browser, &c.) and structured data for
analytics to BigQuery.
* Does the same by grabbing request logs from DynamoDB in our various
clusters and then uploading to TempoDB and BigQuery.
* Does the same again by grabbing request logs for our CDN.
* Aggregates accounting info and calls the `/_accounting` handler on our App
Engine app with data relating to user accounts.
It should be possible to provide sharding factors to `logs2stat` so that if
our log data becomes too much for a single server to sync, it can be done on
multiple machines at the same time.
-------------------------------------------------------------------------------
Implement a set of handlers on our App Engine app for storing and retrieving
state/config data for our external daemons/apps like `logs2stat` and mission
control in case they go down and have to resume from a certain point (see the
sketch after this list):
* `/_init_state` should take a `key` and `secret` and initialise a given
state. An HTML form should be presented if no parameters are set and the
handler should only be callable by admins, i.e. `user.IsAdmin()`.
* `/_set_state` should take a `key`, `secret` and `value` and store the
key/value with a timestamp.
* `/_get_state` should take a `key`, `secret` and return the stored value and
timestamp.
* `/_get_all_states` should return a list of the keys, values and timestamps
for all stored states using a special master secret.
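A sketch of `/_set_state` using the datastore; the secret check against the value stored by `/_init_state` is stubbed out:
```go
package state

import (
	"net/http"
	"time"

	"appengine"
	"appengine/datastore"
)

type State struct {
	Value     string
	Timestamp time.Time
}

func init() {
	http.HandleFunc("/_set_state", setState)
}

// validSecret would load the secret stored by /_init_state and compare.
func validSecret(c appengine.Context, key, secret string) bool {
	// elided: datastore.Get on the entity initialised for this key
	return secret != ""
}

func setState(w http.ResponseWriter, r *http.Request) {
	c := appengine.NewContext(r)
	key, secret, value := r.FormValue("key"), r.FormValue("secret"), r.FormValue("value")
	if !validSecret(c, key, secret) {
		http.Error(w, "forbidden", http.StatusForbidden)
		return
	}
	k := datastore.NewKey(c, "State", key, 0, nil)
	if _, err := datastore.Put(c, k, &State{Value: value, Timestamp: time.Now()}); err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	w.Write([]byte("OK"))
}
```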
-------------------------------------------------------------------------------
Write a minimal `dme-route53` tool that lets us CRUD records to both DNS Made
Easy and Route53 simultaneously.
-------------------------------------------------------------------------------
Our production `bolt` deployment script on the deployment server should
automatically add deployment metrics to TempoDB so that overall system
performance can be correlated to:
* App Engine Deploys
* DNS Updates
* EC2 Cluster Deploys
* Provisioning of EC2 Instances
-------------------------------------------------------------------------------
Auditing should be enabled on the deployment server.
-------------------------------------------------------------------------------
Both the deployment server and the mission control app server should be
firewalled off from the wider internet and only be available over secure
channels.