Ryan Frantz defined monitoring at http://www.ryanfrantz.com/posts/solving-monitoring/ as

Monitoring is the aggregation of health and performance data, events, and relationships delivered via an interface that provides an holistic view of a system's state to better understand and address failure scenarios.


Here's my alternative definition

  • Monitoring is the detection of variation outside the steady state of the business.
  • Monitoring may result in alerting, which may result in an incident response process to reduce variation.
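
As a rough illustration of "variation outside the steady state", here is a minimal sketch. It assumes the steady state can be summarized by the mean and standard deviation of recent samples. The function name `outside_steady_state`, the `tolerance` parameter, and the sample signup numbers are all hypothetical and exist only for this example.

```python
from statistics import mean, stdev


def outside_steady_state(history, value, tolerance=3.0):
    """Return True if `value` deviates from the baseline of `history`
    by more than `tolerance` standard deviations.

    Assumption: the steady state is described well enough by the mean
    and sample standard deviation of recent observations.
    """
    if len(history) < 2:
        # Not enough data to establish a steady state yet.
        return False
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # A perfectly flat history: any change counts as variation.
        return value != baseline
    return abs(value - baseline) > tolerance * spread


# Hypothetical daily signup counts forming the steady state.
signups = [100, 104, 98, 101, 97, 102, 99]

print(outside_steady_state(signups, 103))  # within normal variation
print(outside_steady_state(signups, 40))   # far outside the baseline
```

Whether a detected variation leads to an alert, and whether the alert leads to an incident response, stays a separate decision made outside this function.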

FlorianHeigl commented Sep 29, 2015

...lost some text here...
Please keep in mind that the term "monitoring" is a lot older than any blog or gist. So none of us (you, Ryan, or me) can define it; at best we can interpret it.
Sorry for the nitpicking, but I think it is important to differentiate. For the same reason I like that you said it may result in alerting, and so on.
That is very true.

A holistic view is mostly a goal, and a good monitoring setup may provide it. But that doesn't mean it is necessary in all cases, or interesting to, e.g., the alerted parties.

I'd stay away from the term "business". Monitoring can track the speed of a conveyor belt, the number of signups per day, or the radiation at the next leaking nuclear site. (We've seen that monitoring sometimes doesn't provide any view at all in that case, when the values are outside the measurable range, which was a major problem.) It can concern production items or business KPIs, OR a holistic view of the state of the business.

I'm also not sure that monitoring means the detection of variations/deviations. At its most basic it just means tracking the current state at a given time.
An example is a check many people used to run:
tracking the number of ssh logins.
Outside of industrial or other turn-key setups with a defined set of users, this check is horribly useless. Yet many people ran it, and tracking this number (let's assume I'm just too stupid to get why they did it) IS monitoring.
It doesn't matter whether they alert on it. Then again, it does give them a means of detecting a variation outside the steady state. So maybe this is where I meet your phrasing.

Still, be careful about being definite:

  • Some industries use event handlers (e.g. the many that run crap delivered as Java web applications). For them, automatic restarts are a good thing to do.
  • Other industries might be completely forbidden from mixing monitor and actor, e.g. the chemical industry in my country.
    IMO it is a good tactic to keep this separation. Just because I need an automatic restart doesn't mean it should be triggered from the monitoring. Lose the monitoring, and there is no restart. Lose WAN connections, and your gearman queue soon holds restarts for all remote instances.
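
The separation of monitor and actor described above could be sketched roughly as follows. This is only an illustration under assumptions: the class names `Monitor` and `LocalSupervisor`, the `check` and `restart` callables, and the in-memory `state` dictionary are all hypothetical, and no real monitoring system's API is implied.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Observation:
    service: str
    healthy: bool


class Monitor:
    """Only observes and records state. It never performs remediation."""

    def __init__(self) -> None:
        self.observations: List[Observation] = []

    def record(self, service: str, healthy: bool) -> Observation:
        obs = Observation(service, healthy)
        self.observations.append(obs)
        return obs


class LocalSupervisor:
    """Runs next to the service and restarts it based on its own check.

    It holds no reference to Monitor, so losing the monitoring (or the
    WAN link to it) neither blocks nor queues restarts.
    """

    def __init__(self, check: Callable[[], bool], restart: Callable[[], None]) -> None:
        self.check = check
        self.restart = restart
        self.restarts = 0

    def tick(self) -> bool:
        ok = self.check()
        if not ok:
            self.restart()
            self.restarts += 1
        return ok


# Hypothetical service state standing in for a real process.
state = {"up": False}

supervisor = LocalSupervisor(
    check=lambda: state["up"],
    restart=lambda: state.update(up=True),
)
monitor = Monitor()

monitor.record("webapp", supervisor.tick())  # service was down, gets restarted locally
monitor.record("webapp", supervisor.tick())  # service is up again

print(supervisor.restarts, [o.healthy for o in monitor.observations])
```

The monitor still sees the outage and the recovery, but the restart itself does not depend on it.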

About the interface: you already dropped that point, which is a good thing to do. Just as some systems run with only a dashboard and no email, pager, or other alerts, others require no one to ever look at the monitoring interface. Long ago I went with that by saying, "if you find yourself needing to open the Nagios UI, I (doing the config) have to improve the monitoring config."

I hope this helps. I'm sorry it's not a plain "patch".
