@bergerx
Last active December 2, 2016 13:37
DC/OS metrics system
1. cluster-level metrics and health (mesos-master, mesos-slave,
marathon, marathon-lb, mesos-dns, kafka ...)
Metrics for cluster components like mesos-master, mesos-slave,
frameworks (DC/OS services like zookeeper, marathon, marathon-lb,
mesos-dns, kafka,...).
These will be used to troubleshoot any problems at cluster-level.
Having each component's version as a metric label could help with
troubleshooting, for example seeing a modified marathon-lb's impact on
the cluster (a graph showing both the old and updated releases).
Cluster-level metric collection, storage and representation
should not have a hard dependency on any DC/OS cluster component
(marathon, marathon-lb, mesos-dns, zookeeper), since they would also be
used to troubleshoot cluster-outage problems. E.g. if zookeeper is
down, the mesos control plane stops, and so do mesos-dns and marathon.
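A minimal sketch of what version-labeled cluster-component samples could look like, using plain dicts; the field names, metric name and version strings here are illustrative assumptions, not an existing schema:

```python
# Sketch: cluster-component health metrics as labeled samples.
# Component names come from the notes; everything else is hypothetical.

def make_sample(name, value, **labels):
    """Build a labeled metric sample as a plain dict."""
    return {"name": name, "value": value, "labels": labels}

# Including the component version as a label lets a dashboard overlay
# the old and updated releases of e.g. marathon-lb on one graph.
samples = [
    make_sample("component_up", 1, component="marathon-lb", version="1.4.2"),
    make_sample("component_up", 1, component="marathon-lb", version="1.4.3"),
    make_sample("component_up", 0, component="mesos-dns", version="0.5.2"),
]

# Group by (component, version) to compare releases side by side.
by_release = {}
for s in samples:
    key = (s["labels"]["component"], s["labels"]["version"])
    by_release.setdefault(key, []).append(s["value"])

print(by_release[("marathon-lb", "1.4.2")])  # [1]
```

Keeping the version in a label rather than in the metric name means the two releases stay aggregatable as one series while still being separable for the comparison graph.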
2. node-level metrics (node resources)
Classic old-style host-based metrics.
* metric labels: nodes could have related labels and other
metadata (host/slave-id/ip, node attributes, ...)
* metric values: usual resource utilisation (cpu/mem utilisation,
net bandwidth, ...).
These should be properly labeled (mesos node-id, ip, node
attributes, ...) so that one can generate aggregated metrics like:
* "This single app is assigned 30% of the total CPU resource in the cluster and is utilising 43% in reality"
* "This application is using 90% of the network bandwidth on all nodes where it has instances"
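The aggregations above could be computed from properly labeled samples roughly like this; the sample data, label names (`app`, `node`) and the exact ratio definitions are assumptions for illustration:

```python
# Sketch: aggregating labeled per-task samples into cluster-wide ratios,
# e.g. "this app is assigned 30% of total cluster CPU". Hypothetical data.

tasks = [
    {"app": "web",   "node": "n1", "cpu_assigned": 2.0, "cpu_used": 1.2},
    {"app": "web",   "node": "n2", "cpu_assigned": 1.0, "cpu_used": 0.9},
    {"app": "batch", "node": "n1", "cpu_assigned": 7.0, "cpu_used": 2.1},
]

def cluster_cpu_share(tasks, app):
    """Return (share of total cluster CPU assigned to the app,
    fraction of its assignment the app actually uses)."""
    total = sum(t["cpu_assigned"] for t in tasks)
    assigned = sum(t["cpu_assigned"] for t in tasks if t["app"] == app)
    used = sum(t["cpu_used"] for t in tasks if t["app"] == app)
    return assigned / total, used / assigned

share, utilisation = cluster_cpu_share(tasks, "web")
print(round(share, 2), round(utilisation, 2))  # 0.3 0.7
```

The point is that the `app` and `node` labels are enough to slice the same raw samples both cluster-wide and per-node.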
3. application-level metrics by containerizer (resource usage
collected from outside of the application's context)
Each executor configures the app to run and limits the
resources for each task it runs:
* metric labels: tasks could have related labels and other
metadata by their executors (host/slave-id/ip, marathon
id/labels, docker id/image/labels, specific ENV values...)
* metric values: the resource utilisation (cpu/mem utilisation,
net bandwidth), plus other framework/executor metrics at
containerizer level (marathon cpu/mem limit)
Labels should be in sync with node-level metric labels so that
they can be used to aggregate different metrics.
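One way to picture why the labels must stay in sync: a shared key (here `slave_id`) lets task-level and node-level samples be joined. The label names and values below are illustrative, not a fixed schema:

```python
# Sketch: joining containerizer-level task metrics with node-level
# metrics on a shared label (slave_id). All names/values hypothetical.

node_metrics = {
    # slave_id -> node-level totals
    "S1": {"ip": "10.0.0.1", "net_mbps": 80.0},
    "S2": {"ip": "10.0.0.2", "net_mbps": 40.0},
}

task_metrics = [
    {"slave_id": "S1", "marathon_id": "/web", "net_mbps": 60.0},
    {"slave_id": "S2", "marathon_id": "/web", "net_mbps": 38.0},
]

def app_net_share_per_node(tasks, nodes, marathon_id):
    """Fraction of each node's bandwidth used by one app,
    joined on the slave_id label shared by both metric levels."""
    out = {}
    for t in tasks:
        if t["marathon_id"] == marathon_id:
            node = nodes[t["slave_id"]]
            out[t["slave_id"]] = t["net_mbps"] / node["net_mbps"]
    return out

print(app_net_share_per_node(task_metrics, node_metrics, "/web"))
# {'S1': 0.75, 'S2': 0.95}
```

If the two levels used different identifiers for the same node, this join (and any aggregation built on it) would be impossible.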
4. application-level metrics from application (components to let
application push their metrics or let them expose metrics and
get them collected)
Metrics generated by the application should also be properly labeled
with mesos/framework/node metadata so that they can be used to
generate different levels of aggregation.
The solution should allow metrics to be pushed, or exposed via an
endpoint to be pulled by another component.
Applications should not be expected to be aware of their
orchestrator/containerizer-level metadata; it should be
auto-populated during collection.
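A minimal sketch of that auto-population step: the collector merges orchestrator metadata it already knows into a metric the app pushed, so the app itself stays orchestrator-unaware. All metric, label and metadata names here are assumptions:

```python
# Sketch: a collector enriching an app-pushed metric with
# orchestrator/containerizer labels. Names are illustrative.

def collect(app_metric, task_metadata):
    """Merge app-provided labels with auto-discovered task metadata.
    App-provided labels win on key conflicts."""
    labels = dict(task_metadata)          # e.g. slave_id, marathon_id
    labels.update(app_metric.get("labels", {}))
    return {"name": app_metric["name"],
            "value": app_metric["value"],
            "labels": labels}

# The app pushes only its own metric, with no orchestrator awareness:
pushed = {"name": "http_requests_total", "value": 1234,
          "labels": {"endpoint": "/login"}}

# The collector knows where the task runs and enriches the sample:
meta = {"slave_id": "S1", "marathon_id": "/web", "node_ip": "10.0.0.1"}

sample = collect(pushed, meta)
print(sorted(sample["labels"]))
# ['endpoint', 'marathon_id', 'node_ip', 'slave_id']
```

The same enrichment works whether the metric arrives by push or is scraped from an exposed endpoint, since it happens on the collector side either way.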