
I've given a few talks about Flapjack over the last few months, and some common questions have popped up afterwards.

In this post I'll go a bit deeper into some of the thinking and motivations behind Flapjack, and why we solve the alerting problem the way we do.

How does Flapjack decide how to send alerts?

Flapjack depends on a constant stream of events to do its failure detection and alert routing.

An event in Flapjack looks like this:

{
  "entity":  ENTITY,     # Name of the relevant entity
  "check":   CHECK,      # The check name ('service decription' in Nagios lingo)
  "type":    EVENT_TYPE, # One of 'service' or 'action'
  "state":   STATE,      # %w(ok warning critical unknown acknowledgement)
  "time":    TIMESTAMP,  # UNIX timestamp of the event's creation
  "summary": SUMMARY     # Check output, or acknowledgement message
}
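
For example, here's a minimal sketch of submitting one of these events from Ruby. It assumes Flapjack's usual transport of JSON-encoded events pushed onto a Redis list named 'events' - check your configuration for the actual queue name, and treat the entity and summary values as hypothetical.

require 'redis'
require 'json'

event = {
  'entity'  => 'app-01.example.com',  # hypothetical entity name
  'check'   => 'HTTP',
  'type'    => 'service',
  'state'   => 'critical',
  'time'    => Time.now.to_i,
  'summary' => 'HTTP CRITICAL: connection refused'
}

# Flapjack's processor pops events off the other end of this list
Redis.new.lpush('events', event.to_json)

In practice an event producer runs something like this on every check execution, emitting the current state each time - even when that state is OK.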

Internally, Flapjack maintains a bunch of counters that keep track of various states associated with the event data it's receiving. These counters only update when Flapjack pops an event off the queue to process it.

This means Flapjack will only decide to send an alert when processing events - there are no background processes watching state changes or event staleness that trigger alerts.

It also means there is no such thing as a one-off event. You can't just dispatch an event about a failure state when you detect it in your app - you need something sending constant updates.

This might seem a little limited, however it's crucial to Flapjack's design.

Flapjack asks one question when determining whether to send an alert - "How long has a check been failing?". This is quite a departure from the question Nagios asks - "How many times has this check failed?", or more accurately "How many times have we observed a failure over an indeterminate time period?"
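
To make the difference concrete, here's a hedged sketch of the duration-based decision - illustrative, not Flapjack's actual internals. The failing_since bookkeeping and the initial_failure_delay threshold are names I've made up for the sketch.

FAILING_STATES = %w(warning critical unknown)

# Returns true once this check has been failing for longer than the
# threshold, regardless of how many samples happen to have arrived.
# `check` is a mutable hash of per-check state; `event` is the event
# currently being processed.
def alert?(check, event, initial_failure_delay = 120)
  unless FAILING_STATES.include?(event['state'])
    check.delete(:failing_since)          # recovery resets the clock
    return false
  end
  check[:failing_since] ||= event['time'] # first failing sample we saw
  event['time'] - check[:failing_since] >= initial_failure_delay
end

Note that check latency doesn't distort this decision: whether two samples arrive or twenty, the elapsed failing time is the same.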

I've blogged about this at length before, but check latency has a significant effect on the reliability of your alerts when you're using a soft/hard state model.

In a dynamical system with variable sampling latency (like a monitoring environment), you have no guarantees about when the next sample (check result) will arrive. This renders soft/hard states near useless as an alerting mechanism if you care about detecting faults in a consistent and timely manner.

There is an entire branch of engineering and mathematics that deals with modelling these behaviours: control theory.

Flapjack attempts to route around this problem by eschewing the soft/hard state model entirely, and looking at how long a particular check has been failing before deciding to send notifications.

This introduces a new problem however - how do you know if Flapjack has stopped receiving events from an event producer like Nagios?

oobetet

To solve this problem we ship a Flapjack component called oobetet, the "Out Of Band, End To End Test".

oobetet expects a constant stream of events for a check that oscillates from OK to CRITICAL and back - if the events stop oscillating, an out-of-band alert is sent. You run one oobetet per Nagios instance to verify the currency of the stream of events from that Nagios.
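
A hedged sketch of the idea (not oobetet's actual implementation): watch the events for the deliberately flapping check, record when its state last changed, and flag a stalled stream if no change is seen within some window. The max_latency name is illustrative.

class Oobetet
  def initialize(max_latency = 300)
    @max_latency    = max_latency
    @last_state     = nil
    @last_change_at = Time.now.to_i
  end

  # Feed every event for the oscillating check through here.
  def observe(event)
    return if event['state'] == @last_state
    @last_state     = event['state']
    @last_change_at = event['time']
  end

  # If the check has stopped oscillating, the event stream has stalled
  # and an out-of-band alert should go out.
  def stream_stalled?
    Time.now.to_i - @last_change_at > @max_latency
  end
end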

There is an edge case where we only stop receiving events for a subset of checks, and this is definitely a problem we'll need to address in the future. In our experience, an entire Nagios instance hanging or disappearing is the more common operational issue, so we built the oobetet to address that particular problem.

We have seen cases in the wild where Nagios only ends up executing a percentage of its checks, so we know this will need to be addressed by Flapjack in the future.

How does rollup work?

Rollup is an increasingly prevalent topic in the monitoring space. Nobody likes being woken up by a flood of alerts, and as our infrastructures increase in size, the problem only gets worse.

While Flapjack has alert rollup functionality, we actually refer to it as "alert summarisation", not rollup. This is because we don't cap the number of notifications sent in a time window - we summarise the alerts going out to the operator once the initial summarisation threshold is tripped, and we'll continue to push summaries if we detect further failures.

Alert summarisation is made possible in Flapjack by inducing a time delay on sending out alerts and asking the question "how long has a check been failing?". This essentially turns Flapjack into a broadcast delay system that does smart aggregation of alerts before sending them.
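
As a hedged sketch of the behaviour (illustrative, not Flapjack's code): once the number of failing checks for a contact crosses a threshold, collapse per-check alerts into a single summary, and keep sending updated summaries as further failures arrive. The rollup_threshold name echoes Flapjack's per-contact setting, but treat the logic here as an assumption.

# failing_checks is an array of { name:, state: } hashes for one contact.
def messages_for(failing_checks, rollup_threshold = 5)
  if failing_checks.size >= rollup_threshold
    # One summary message instead of a flood of individual alerts
    summary = failing_checks.map { |c| "#{c[:name]} (#{c[:state]})" }.join(', ')
    ["#{failing_checks.size} checks failing: #{summary}"]
  else
    failing_checks.map { |c| "#{c[:name]} is #{c[:state]}" }
  end
end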

Time delayed alerts and alert summarisation are core to Flapjack being an effective alert umbrella.

What do you mean by "bi-directional" PagerDuty integration?

When combining Nagios with PagerDuty, there is double handling of acknowledgements by the operator when responding to an incident:

  1. Ack the alert in PagerDuty
  2. Ack the alert in Nagios

This is an extra burden for the operator - paperwork they certainly don't want to be dealing with when a large-scale outage is unfolding.

When using Flapjack's PagerDuty gateway, alerts that are ack'd in PagerDuty are also ack'd in Flapjack.

Flapjack polls the PagerDuty API to see if alerts have been ack'd by an operator via phone, SMS, the web interface, etc. If the alert has been ack'd in PagerDuty, we ack it in Flapjack.
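
A hedged sketch of the polling side, against the v1-era PagerDuty REST API - the endpoint, query parameter, and auth scheme here are assumptions about that API, not Flapjack's actual gateway code.

require 'net/http'
require 'json'
require 'uri'

# Fetch incidents an operator has acknowledged in PagerDuty.
def acknowledged_incidents(subdomain, api_token)
  uri = URI("https://#{subdomain}.pagerduty.com/api/v1/incidents?status=acknowledged")
  request = Net::HTTP::Get.new(uri)
  request['Authorization'] = "Token token=#{api_token}"
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: true) do |http|
    http.request(request)
  end
  JSON.parse(response.body)['incidents']
end

Each acknowledged incident is then matched back to its failing check, and an 'acknowledgement' event is injected into Flapjack's event queue just like any other event.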

This saves the operator crucial time and reduces the number of procedural tasks they need to juggle in their head when responding to an incident.

Can I scale Flapjack horizontally?

You can run multiple instances of the Flapjack components (processor, notifier, gateways) - in fact, we encourage you to for high availability.

The processor and notifier will easily run as separate instances across multiple machines, provided they can all talk to the same Redis instance.

Scaling the web + API gateways is just like scaling any web app - just throw up more instances and sit them behind a reverse proxy.

Flapjack makes this easy by allowing you to specify what components you want to run within the Flapjack config. You can run all the components on a single machine, one component per host, or a mix.
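
For example, here's a hedged sketch of a per-host YAML config that runs only the processor and notifier, leaving the web gateway to another machine. The section names are from memory of Flapjack's config file and may differ between versions; the Redis host and port are hypothetical.

production:
  redis:
    host: redis.example.com  # the shared Redis all components talk to
    port: 6380
  processor:
    enabled: yes
  notifier:
    enabled: yes
  gateways:
    web:
      enabled: no            # served from a separate host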
