I've given a few talks about Flapjack in the last few months, and some common questions have come up after each one.

In this post I'll go a bit deeper into some of the thinking and motivations behind Flapjack, and why we solve the alerting problem the way we do.

How does Flapjack decide how to send alerts?

Flapjack depends on a constant stream of events to do its failure detection and alert routing.

An event in Flapjack looks like this:
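As a rough illustration only, an event can be thought of as a small JSON document that names the thing being checked, the check itself, its current state, and a timestamp. The field names (`entity`, `check`, `type`, `state`, `summary`, `time`) and the `build_event` helper in this Ruby sketch are assumptions made for the example, not taken from this post:

```ruby
require 'json'

# Hypothetical helper: builds an event as a plain hash.
# All field names here are assumptions for illustration.
def build_event(entity:, check:, state:, summary:, time: Time.now)
  {
    'entity'  => entity,    # the host or service being monitored
    'check'   => check,     # the name of the check that ran
    'type'    => 'service',
    'state'   => state,     # e.g. 'ok', 'warning', 'critical'
    'summary' => summary,   # human-readable check output
    'time'    => time.to_i  # seconds since the Unix epoch
  }
end

event = build_event(
  entity:  'app-01.example.com',
  check:   'HTTP',
  state:   'critical',
  summary: 'Connection refused',
  time:    Time.at(1_400_000_000)
)

# Serialise the event so it could be pushed onto a queue.
puts JSON.generate(event)
```

A constant stream of documents like this, one per check execution, is what lets a consumer notice state changes over time rather than reacting to a single result.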

We originally held the meetups at Atlassian, but we decided pretty early on to move them to a sponsor-agnostic location.

We did this for two reasons:

  1. As cool as offices can be (and Atlassian have fantastic digs), they have a particular atmosphere about them that can encourage (and discourage) particular attitudes when talking about problems. We want the venue to be neutral and a safe place to talk about tricky issues.
  2. Access to offices can be tricky. Getting people in and out is often marred by building security requirements, and this extra hurdle can make the meetup less inviting for newcomers.

We still think these reasons are valid, and we'll continue to host the Sydney DevOps meetup in pubs and bars for the foreseeable future.

The tutorial will be interactive. You will be:

  • Running up a Vagrant box locally
  • Installing Flapjack, and
  • Configuring a local Nagios to talk to Flapjack

Preparation

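require 'ipaddress'

# Expand each argument: CIDR ranges become their individual host addresses.
addresses = ARGV
addresses.each do |addr|
  if addr =~ /\//
    IPAddress(addr).each_host do |host|
      print host.to_s + ' '
    end
  else
    # Assumed else branch: print a bare address unchanged.
    print addr + ' '
  end
end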

The devops field guide to cognitive biases (second edition)

Cognitive biases can deeply affect our behaviour towards others and our ability to process information. They herd us towards mental shortcuts that are optimised for timeliness over accuracy, and lead us to rationalise irrational behaviour.

In the first edition of this talk we looked at cognitive biases in the context of teamwork - how they affect our ability to interact with other people, and limit the effectiveness of teams that collaboratively solve problems.

In this second edition, we turn our focus to cognitive biases during high stress situations - outages and incidents.

You know that running drills is important, but what's stopping you from doing them? What is it about a stress-filled outage that convinces you relationships between systems exist, where in hindsight it's obvious they don't? What develops common narratives that lead you to the same contributing factors at every incident retrospective?

< Set-Cookie: rack.session=BAh7CEkiD3Nlc3Npb25faWQGOgZFVEkiRWYxZDcxZWI3NzFjZDYwMDVlYTYw%0AMjNmNTg1Y2JjYzg3NjU2MjA2NzQ3YzFkZDliYTMxZTgwN2Y3ZmE3Zjg0ZjkG%0AOwBGSSIJY3NyZgY7AEZJIiVjYTAyMzdjYjYyY2ZjZjY3N2YzOTViZWVmY2Jl%0ANTVkOQY7AEZJIg10cmFja2luZwY7AEZ7B0kiFEhUVFBfVVNFUl9BR0VOVAY7%0AAFRJIi0zZDcyNmZlZjg0MmU1YzgwMDYwMDExYzU5M2E5OTBlMDJhMWM1MGFj%0ABjsARkkiGUhUVFBfQUNDRVBUX0xBTkdVQUdFBjsAVEkiLWRhMzlhM2VlNWU2%0AYjRiMGQzMjU1YmZlZjk1NjAxODkwYWZkODA3MDkGOwBG%0A--f0e0ef490ef8beddc696392813b2687354b8241b; path=/; HttpOnly

One big success was the “traffic light system” invented by the lab branch in Norilsk, a Russian mining town north of the Arctic Circle.

In response to the CEO’s call for reduced customer waiting times and better sales and service, the branch experimented with “varying standard procedures depending on how many customers were waiting in line.” They started with a paper mock-up and later developed green, yellow, and red lights that appear on tellers’ computer screens.

Branch managers activate the green light when lines are short—at such times, tellers are expected to explain things carefully, answer questions completely, and cross-sell services. A yellow light means things are getting busy and tellers should hurry customers a bit and do less cross-selling. A red light means “all hell has broken loose.”

Bugrov explained: “The standard time to serve a customer is greatly reduced. All customers with ‘long’ transactions get transferred to a dedicated teller. The tellers are forbidden to cross-sell and discourag

Leveling up your Flapjack stack

Too many alerts. Too many dashboards. Too much noise - and the alert fatigue isn't receding.

If you're frequently on the end of a pager (or pager-like device) and working with systems running in the cloud, you've probably noticed an increase in the volume of alerts over the last few years.

This is a problem that's not going away. In fact, with the current proliferation of monitoring tools driven by a renaissance in Open Source monitoring, coupled with the ever-expanding sprawl of systems that make up modern businesses on the web, the problem is only getting worse.

Flapjack is an alert umbrella for people on-call that intelligently routes and rolls up alerts, integrates with check execution engines like Sensu & Nagios, and ships a well documented API for restart-less configuration.
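To make "rolling up" concrete, here is a minimal Ruby sketch of the general idea: once a contact has more than a few failing checks, send one summary instead of one message per check. The `roll_up` function, the alert hash keys, and the threshold of 3 are all hypothetical and chosen for illustration; this is not Flapjack's implementation or API.

```ruby
# Hypothetical threshold: at or above this many alerts, summarise.
ROLLUP_THRESHOLD = 3

# alerts: array of hashes with :contact, :check and :state keys (assumed shape).
# Returns the messages that would be sent, grouped by contact.
def roll_up(alerts, threshold: ROLLUP_THRESHOLD)
  alerts.group_by { |a| a[:contact] }.flat_map do |contact, group|
    if group.size >= threshold
      # Too many alerts for one person: collapse them into a single summary.
      ["#{contact}: #{group.size} checks failing"]
    else
      # Few enough alerts to send individually.
      group.map { |a| "#{contact}: #{a[:check]} is #{a[:state]}" }
    end
  end
end

alerts = [
  { contact: 'ada', check: 'HTTP', state: 'critical' },
  { contact: 'ada', check: 'SSH',  state: 'critical' },
  { contact: 'ada', check: 'PING', state: 'critical' },
  { contact: 'bob', check: 'DB',   state: 'warning' }
]

roll_up(alerts).each { |message| puts message }
```

With the sample data, the three alerts for `ada` collapse into a single summary line while `bob` still gets an individual message.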

<!DOCTYPE html>
<html>
<head>
<meta http-equiv='content-type' value='text/html;charset=utf8'>
<meta name='generator' value='Ronn/v0.7.3 (http://github.com/rtomayko/ronn/tree/0.7.3)'>
<title>visage-api(5) - Reference documentation for the Visage JSON API</title>
<style type='text/css' media='all'>
/* style: man */
body#manpage {margin:0}
.mp {max-width:100ex;padding:0 9ex 1ex 4ex}