@alzabo
Created September 23, 2014 21:11
Sensu
-----
monitoring is not only an ops problem; devs need to be on board and looped in
poor monitoring coverage
rather than making one large change, integrate sensu iteratively, gradually replacing nagios
exploring what success looks like; observations on whether or not it was better than the status quo
standalone checks?
checks have runbook baked in
sensu::check puppet type, wrapped with in-house code
handlers for irc/issue tracker; arbitrary api/tools (awsprune)
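
A rough sketch of what a handler for IRC or an issue tracker might look like. Sensu pipe handlers receive the event as JSON on stdin; everything else here is an assumption for illustration: the `runbook` custom check attribute, the message layout, and the function names are hypothetical and not taken from the talk.

```python
import json
import sys

# Conventional plugin exit statuses; anything else is reported as UNKNOWN.
STATUS_NAMES = {0: "OK", 1: "WARNING", 2: "CRITICAL"}


def format_event(event):
    """Turn a Sensu event dict into a one-line message for IRC or a ticket."""
    check = event["check"]
    status = STATUS_NAMES.get(check["status"], "UNKNOWN")
    line = "%s %s/%s: %s" % (
        status,
        event["client"]["name"],
        check["name"],
        check["output"].strip(),
    )
    # Hypothetical custom attribute carrying the runbook link for the check.
    runbook = check.get("runbook")
    if runbook:
        line += " (runbook: %s)" % runbook
    return line


def main(stream=sys.stdin):
    """Read one event from the stream and print the formatted message."""
    event = json.load(stream)
    print(format_event(event))
```

A real handler would call `main()` when run and post the line to the IRC bot or ticket API instead of printing it.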
sensu plugins are worth looking at for use with nagios
upon login, active alerts are indicated, along with machine stats and error conditions
dns is the canonical source of truth in the yelp deploy
stale cron, cron spam
- solved by writing a "staleness" file from the cron job and monitoring its age
- stale crons fail a check and open a ticket
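
The staleness-file idea above can be sketched as follows. This is a minimal illustration, not the code from the talk: the function names, the file path, and the age threshold are all assumptions. The cron job touches the file on success, and a separate check goes critical when the file is missing or too old.

```python
import os
import time

# Conventional Nagios/Sensu plugin exit statuses.
OK, WARNING, CRITICAL = 0, 1, 2


def touch_staleness_file(path):
    """Call at the end of a successful cron run to record completion."""
    with open(path, "a"):
        os.utime(path, None)  # set mtime to the current time


def check_staleness(path, max_age_seconds, now=None):
    """Return (status, message) based on the age of the staleness file."""
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(path)
    except OSError:
        return CRITICAL, "CRITICAL: %s is missing" % path
    if age > max_age_seconds:
        return CRITICAL, "CRITICAL: %s is %d seconds old (limit %d)" % (
            path, age, max_age_seconds)
    return OK, "OK: %s is %d seconds old" % (path, age)
```

A check script would print the message and exit with the status, so a stale cron fails the check and the handler can open a ticket.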
python sensu client, open sourced; pumps test results into the sensu client socket
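
For reference, a minimal sketch of pushing a result into the Sensu client socket. This is not the open-sourced Yelp client; it only assumes the standard local client socket (TCP, port 3030 by default), which accepts a JSON document with `name`, `output`, and `status`. The function names and the example check name are hypothetical.

```python
import json
import socket


def encode_result(name, status, output):
    """Serialize a check result as the JSON payload the client socket expects."""
    return json.dumps(
        {"name": name, "status": status, "output": output}
    ).encode("utf-8")


def send_result(name, status, output, host="127.0.0.1", port=3030):
    """Send one check result to the local Sensu client socket over TCP."""
    payload = encode_result(name, status, output)
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload)


# Example (requires a running sensu client on this host):
# send_result("nightly_batch", 2, "CRITICAL: batch failed")
```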
services contain a yaml file, deployed with the app, that explains how to monitor them
the parsed yaml configures sensu checks; no operators required
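
A sketch of the yaml-to-checks step, under stated assumptions: the talk notes do not give the yaml schema, so the `checks`, `command`, `interval`, `handlers`, and `runbook` keys below are hypothetical, and the input is shown as the dict a yaml parser would return. The output follows the usual Sensu check definition shape (`command`, `interval`, `standalone`, `handlers`), with the runbook carried as a custom attribute.

```python
import json


def checks_from_monitoring_spec(service, spec, default_handlers=("default",)):
    """Build a Sensu check config dict from a service's parsed monitoring spec."""
    checks = {}
    for check_name, params in sorted(spec.get("checks", {}).items()):
        checks["%s_%s" % (service, check_name)] = {
            "command": params["command"],
            "interval": params.get("interval", 60),
            "standalone": True,  # scheduled by the client, no subscription needed
            "handlers": list(params.get("handlers", default_handlers)),
            "runbook": params.get("runbook", ""),  # hypothetical custom attribute
        }
    return {"checks": checks}


# Hypothetical spec, as it might look after parsing the service's yaml file.
spec = {
    "checks": {
        "http": {"command": "check_http -H localhost", "runbook": "y/rb-http"},
    }
}
config = checks_from_monitoring_spec("myservice", spec)
print(json.dumps(config, indent=2, sort_keys=True))
```

Writing the resulting JSON into the client's configuration directory is left out; in the setup described in these notes that part is presumably handled through puppet.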
cases
- alert misbehaving devs that they are killing a machine; include example troubleshooting commands
slideshare/bobtfish