@bergerx
Last active December 2, 2016 13:37
DC/OS metrics system
1. cluster-level metrics and health (mesos-master, mesos-slave,
marathon, marathon-lb, mesos-dns, kafka ...)
Metrics for cluster components like mesos-master, mesos-slave,
frameworks (DC/OS services like zookeeper, marathon, marathon-lb,
mesos-dns, kafka,...).
These will be used to troubleshoot any problems at cluster-level.
Having each component's version as a metric label could help with
troubleshooting, for example seeing a modified marathon-lb's impact on
the cluster (a graph showing both the old and updated releases).
Cluster-level metric collection, storage and representation
should not have a hard dependency on any DC/OS cluster component
(marathon, marathon-lb, mesos-dns, zookeeper), since they would also be
used to troubleshoot cluster-outage problems. E.g. if zookeeper is
down, the mesos control plane stops, and so do mesos-dns and marathon.
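A minimal sketch of what version-labeled cluster-component samples could look like, using plain dicts; the field names, metric name and version strings here are illustrative assumptions, not an existing schema:

```python
# Sketch: cluster-component health metrics as labeled samples.
# Component names come from the notes; everything else is hypothetical.

def make_sample(name, value, **labels):
    """Build a labeled metric sample as a plain dict."""
    return {"name": name, "value": value, "labels": labels}

# Including the component version as a label lets a dashboard overlay
# the old and updated releases of e.g. marathon-lb on one graph.
samples = [
    make_sample("component_up", 1, component="marathon-lb", version="1.4.2"),
    make_sample("component_up", 1, component="marathon-lb", version="1.4.3"),
    make_sample("component_up", 0, component="mesos-dns", version="0.5.2"),
]

# Group by (component, version) to compare releases side by side.
by_release = {}
for s in samples:
    key = (s["labels"]["component"], s["labels"]["version"])
    by_release.setdefault(key, []).append(s["value"])

print(by_release[("marathon-lb", "1.4.2")])  # [1]
```

Keeping the version in a label rather than in the metric name means the two releases stay aggregatable as one series while still being separable for the comparison graph.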
2. node-level metrics (node resources)
Classic old-style host-based metrics.
* metric labels: nodes could have related labels and other
metadata (host/slave-id/ip, node attributes, ...)
* metric values: usual resource utilisation (cpu/mem utilisation,
net bandwidth, ...).
These should be properly labeled (mesos node-id, ip, node
attributes, ...) so that one can generate aggregated metrics like:
* "This single app is assigned 30% of the total CPU resource in the cluster and is utilising 43% in reality"
* "This application is using 90% of the network bandwidth on all nodes where it has instances"
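The aggregations above could be computed from properly labeled samples roughly like this; the sample data, label names (`app`, `node`) and the exact ratio definitions are assumptions for illustration:

```python
# Sketch: aggregating labeled per-task samples into cluster-wide ratios,
# e.g. "this app is assigned 30% of total cluster CPU". Hypothetical data.

tasks = [
    {"app": "web",   "node": "n1", "cpu_assigned": 2.0, "cpu_used": 1.2},
    {"app": "web",   "node": "n2", "cpu_assigned": 1.0, "cpu_used": 0.9},
    {"app": "batch", "node": "n1", "cpu_assigned": 7.0, "cpu_used": 2.1},
]

def cluster_cpu_share(tasks, app):
    """Return (share of total cluster CPU assigned to the app,
    fraction of its assignment the app actually uses)."""
    total = sum(t["cpu_assigned"] for t in tasks)
    assigned = sum(t["cpu_assigned"] for t in tasks if t["app"] == app)
    used = sum(t["cpu_used"] for t in tasks if t["app"] == app)
    return assigned / total, used / assigned

share, utilisation = cluster_cpu_share(tasks, "web")
print(round(share, 2), round(utilisation, 2))  # 0.3 0.7
```

The point is that the `app` and `node` labels are enough to slice the same raw samples both cluster-wide and per-node.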
3. application-level metrics by containerizer (resource usage
collected from outside of the application's context)
Each executor configures the app to run and limits the
resources for each task it runs:
* metric labels: tasks could have related labels and other
metadata by their executors (host/slave-id/ip, marathon
id/labels, docker id/image/labels, specific ENV values...)
* metric values: the resource utilisation (cpu/mem utilisation,
net bandwidth), plus other framework/executor metrics at
containerizer level (marathon cpu/mem limit)
Labels should be in sync with node-level metric labels so that
they can be used to aggregate different metrics.
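One way to picture why the labels must stay in sync: a shared key (here `slave_id`) lets task-level and node-level samples be joined. The label names and values below are illustrative, not a fixed schema:

```python
# Sketch: joining containerizer-level task metrics with node-level
# metrics on a shared label (slave_id). All names/values hypothetical.

node_metrics = {
    # slave_id -> node-level totals
    "S1": {"ip": "10.0.0.1", "net_mbps": 80.0},
    "S2": {"ip": "10.0.0.2", "net_mbps": 40.0},
}

task_metrics = [
    {"slave_id": "S1", "marathon_id": "/web", "net_mbps": 60.0},
    {"slave_id": "S2", "marathon_id": "/web", "net_mbps": 38.0},
]

def app_net_share_per_node(tasks, nodes, marathon_id):
    """Fraction of each node's bandwidth used by one app,
    joined on the slave_id label shared by both metric levels."""
    out = {}
    for t in tasks:
        if t["marathon_id"] == marathon_id:
            node = nodes[t["slave_id"]]
            out[t["slave_id"]] = t["net_mbps"] / node["net_mbps"]
    return out

print(app_net_share_per_node(task_metrics, node_metrics, "/web"))
# {'S1': 0.75, 'S2': 0.95}
```

If the two levels used different identifiers for the same node, this join (and any aggregation built on it) would be impossible.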
4. application-level metrics from application (components to let
application push their metrics or let them expose metrics and
get them collected)
Metrics generated by the application should also be properly labeled
with mesos/framework/node metadata so that they can be used to
generate different levels of aggregation.
The solution should allow metrics to be pushed, or exposed via an
endpoint to be pulled by another component.
Applications should not be expected to be aware of their
orchestrator/containerizer-level metadata; it should be
auto-populated during collection.
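A minimal sketch of that auto-population step: the collector merges orchestrator metadata it already knows into a metric the app pushed, so the app itself stays orchestrator-unaware. All metric, label and metadata names here are assumptions:

```python
# Sketch: a collector enriching an app-pushed metric with
# orchestrator/containerizer labels. Names are illustrative.

def collect(app_metric, task_metadata):
    """Merge app-provided labels with auto-discovered task metadata.
    App-provided labels win on key conflicts."""
    labels = dict(task_metadata)          # e.g. slave_id, marathon_id
    labels.update(app_metric.get("labels", {}))
    return {"name": app_metric["name"],
            "value": app_metric["value"],
            "labels": labels}

# The app pushes only its own metric, with no orchestrator awareness:
pushed = {"name": "http_requests_total", "value": 1234,
          "labels": {"endpoint": "/login"}}

# The collector knows where the task runs and enriches the sample:
meta = {"slave_id": "S1", "marathon_id": "/web", "node_ip": "10.0.0.1"}

sample = collect(pushed, meta)
print(sorted(sample["labels"]))
# ['endpoint', 'marathon_id', 'node_ip', 'slave_id']
```

The same enrichment works whether the metric arrives by push or is scraped from an exposed endpoint, since it happens on the collector side either way.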