Monitorama 2014 notes

http://monitorama.com/

Best talks day 1:

  • Please, no More Minutes, Milliseconds, Monoliths... or Monitoring Tools! - Adrian Cockcroft
    • gave 5 good rules for monitoring systems, showed what cloud / microservices monitoring looks like @ Netflix
  • Simple math to get some signal out of your noisy sea of data - Toufic Boubez
    • explains why static alert thresholds don't work and gave 3 techniques to use instead
  • Car Alarms and Smoke Alarms - Dan Slimmon
    • how to use sensitivity and specificity in monitoring, some good math
  • Metrics 2.0 - Dieter Plaetinck
    • metrics20.org = redesign of graphite that fixes a bunch of stuff, keep an eye on this project
  • StatsG at New York Times - Eric Buth
    • the first half of the talk on ops philosophy was really interesting, second half about statsg is not so useful

Best talks day 2:

  • "Auditing all the things": The future of smarter monitoring and detection - Jen Andre
    • really awesome security talk, lots of good practical steps for us
  • Is There An Echo In Here?: Applying Audio DSP algorithms to monitoring - Noah Kantrowitz
    • shows how to use audio processing techniques on monitoring data, good math, very interesting
  • The Lifecycle of an Outage - Scott Sanders
    • github's tools & procedures & culture around resolving outages
  • A whirlwind tour of Etsy's monitoring stack - Daniel Schauenberg
    • practical walkthrough of Etsy's (extensive) monitoring system
  • Web performance observability - Mike McLane & Joseph Crim
    • not sure we can directly use the tool they made, but this is a good idea of what a web performance benchmark suite looks like, also see canary.io lightning talk

Good lightning talks:

  • serverspec + sensu: interesting approach to testing & monitoring; if you write serverspecs for testing / CI, you can also run them on your production servers and get even better coverage
  • monitoring & inadvertent spam traps: anecdote from a developer on how developers can use monitoring to solve problems
  • Expanding Context to Facilitate Correlation: showed 3 open source tools that improve on graphite/nagios web interfaces
  • canary.io: project from github ops for doing web performance testing, still in the early stages, but looks promising
  • Distributed Operational Responsibility: some tips from spotify on why ops responsibilities (like monitoring) should be shared with developers

Semi-interesting sponsor plugs:

  • VividCortex: MySQL performance analysis tool (SaaS) from ex-percona guys
  • Pagerduty: we should start using multi-user alerting (new feature, they gave 2 good use-cases)
  • Elastic Search: ~70% of the people attending were using ElasticSearch
  • Big Panda: building a smarter "inbox" for ops (to replace email + jira)

Recurring themes / big takeaways:

  • monitoring must scale ahead of the underlying system
  • you need high frequency monitoring: it's not OK to wait minutes for a check result or alert
  • collect data on everything with graphite
  • data collection should be a default on everything from the beginning, it should not be a time-consuming / reactive / after-the-fact process
  • only alert when work isn't getting done, RAM / swap / CPU / etc. are not something you should directly alert on
  • manually watching graphs & dashboards doesn't scale
  • start using anomaly detection
  • static thresholds do not work for data from the data center, moving averages are only slightly better, you need to use better math
  • do more analysis, understand your data (scatterplots, histograms, find distributions, correlations, probability & stats, etc.)
  • ops should provide self-service data collection / monitoring / alerting for developers

welcome

Jason Dixon:

  • this monitorama is 2x the size of last year & berlin
  • conference buddies, if you see someone with a heart sticker introduce yourself to them
  • everyone give a high five or free hug
  • why do this? this isn't a ruby conference
  • empathy and culture is important, especially for ops
  • between engineers, ops, and management
  • and for the community here
  • share the love
  • sponsors are great bla bla
  • breaks and lunch bla bla

Please, no More Minutes, Milliseconds, Monoliths... or Monitoring Tools! - Adrian Cockcroft

http://www.slideshare.net/adriancockcroft/monitorama-please-no-more

  • keynote

  • formerly of netflix

  • graph of enterprise IT cloud adoption

  • from left to right: ignore, ignore, ignore, no, no, I said No dammit, oh no, oh fuck

  • rest of world = half way through cloud adoption

  • you are here = trying to play catch up

20 years exp:

  • 94 "SE Toolkit"
  • 98 Sun Perf. Tuning
  • 99 Resource Mgmt.
  • 00 Capacity Planning for Web Services
  • 07 Outstanding Contrib. to Computer Metrics
  • 04-08 Capacity Planning Workshops
  • 14 Monitorama!

state of the art in 2008:

  • cacti, ganglia, nagios, zenoss, mrtg, Wireshark
  • low number of machines
  • it was subversive to think that open source could replace expensive enterprise tools
  • created "SE", a C interpeter which could extract solaris performance information and output it all in a standard format
  • created "virtual adrian", a simple rule based system for automated monitoring of disk, memory, etc. in solaris (to watch systems while he was on vacation)

why no more monitoring tools?

  • we have too many
  • we need more analysis tools, can we get an analysorama conference?
  • rule #1: we spend too much time collecting, storing, and displaying metrics
  • if you spend 50% of your time on this it's too much
  • we need more automation, more analysis
  • monitoring should not be tacked on, it should be a default

what's wrong with minutes?

  • not enough resolution to catch problems

  • it takes 5-8 minutes before you start seeing alerts

  • if you had second resolution, you can see the difference in 5 seconds

  • if your rollbacks are quick, you can revert a bad change in 5 seconds

  • compare a 10 second outage to a 10 minute outage

  • from continuous delivery we know that small incremental changes are best

  • so we need the same from monitoring

  • instant detection and rollback within seconds should be a goal

  • SaaS tools that do this: VividCortex, boundary

  • how does netflix do it? hystrix and turbine, websockets, streaming metrics, 1 second resolution & 15 seconds of history, circuit breakers, pages go to who is directly responsible for a specific component or change

  • rule #2: metric collection -> display latency should be < human attention span (10s)

what's wrong with milliseconds?

  • in a lot of JVM instrumentation, ms is the standard

  • the problem with ms is that a lot of datacenter and hardware communication needs nanosecond resolution

  • rule #3: validate your measurement system has enough accuracy and precision

  • if there's a difference between something taking X and Y nanoseconds in your system, and all you have are a bunch of 1ms data points, you can't identify the problem

what's wrong with monoliths?

  • monolithic monitoring tools are easy to deploy, but when they go down, you then have no monitoring

  • there needs to be a pool of aggregators, displayers, etc.

  • easier to do upgrades, more resilient to downtime

  • anything monolithic has performance problems, scalability problems, SPOFs, can't tell the difference between monitoring system going down vs. actual system going down

  • in-band monitoring: running monitoring on the same process, server, data center, etc. as the system itself

  • SaaS monitoring: send to a third party

  • both: an outage can't take out both monitoring systems, HA monitoring

  • they might not be monitoring exactly the same stuff, but they should have some overlap

  • rule #4: monitoring needs to be as available & scalable as (or more than) the underlying system

continuous delivery:

  • high rate of change

  • new machines being spun up and shut down all the time (in netflix's case)

  • short baselines for alert threshold analysis

  • ephemeral configuration

  • short lifetimes make it hard to aggregate historical data

  • hand tweaked solutions do not work, it would take too much effort

microservices:

  • complex flow of requests

  • how do you monitor end-to-end when the dependencies and flow of requests is so complex and dynamic?

  • Gilt Groupe: went from a handful of services to 450 services over the course of a year

  • "death star" microservice pattern: everything is calling everything else in one big tangled graph of dependencies

  • how do you visualize this? we need more hierarchy & grouping

closed loop control systems:

  • how did netflix do autoscaling?
  • on every deploy during peak time, double the number of servers
  • using load average, which is not the best metric to use
  • lots of overshoots
  • new solution: scryer
  • predictive autoscaler, FFT based algorithm, builds a forward predicted model to set the autoscale level
  • scales ahead of time, then corrects as necessary
  • using the old method it was hard to do this analysis, because the data was so chunky (from the doubling)
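
A toy sketch of the predictive-autoscaling idea above, not Netflix's actual Scryer: keep the dominant frequency components of past traffic with an FFT, treat the periodic reconstruction as tomorrow's forecast, and size the fleet ahead of the predicted peak. The capacity-per-server and headroom numbers are made up for illustration.

import numpy as np

def predict_next_day(rps_history, samples_per_day=1440, keep=10):
    """Keep the `keep` strongest frequency components of the historical
    requests-per-second series; the periodic reconstruction of the most
    recent day serves as the forecast for the next day."""
    x = np.asarray(rps_history, dtype=float)
    mean = x.mean()
    spectrum = np.fft.rfft(x - mean)
    weakest = np.argsort(np.abs(spectrum))[:-keep]
    spectrum[weakest] = 0                      # drop everything but the dominant cycles
    smooth = np.fft.irfft(spectrum, n=len(x)) + mean
    return smooth[-samples_per_day:]

def servers_needed(predicted_rps, rps_per_server=500.0, headroom=1.3):
    # scale ahead of the predicted peak, then let reactive correction fix errors
    return int(np.ceil(predicted_rps.max() * headroom / rps_per_server))

# usage: two weeks of per-minute request rates (synthetic daily cycle + noise)
minutes = 14 * 1440
history = 1000 + 400 * np.sin(np.linspace(0, 14 * 2 * np.pi, minutes))
history += np.random.normal(0, 50, size=minutes)
print(servers_needed(predict_next_day(history)))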

code canaries:

  • ramp up of deployment, looks for errors, if there are problems it emails the responsible team and stops rolling out the code

monitoring tools for developers:

  • most monitoring tools are built for ops / sysadmin (DBA vs. network admin vs. sysadmin vs. storage admin)
  • fiefdoms of different teams and tools, different levels of access, hard to collaborate, hard to integrate and extend
  • state of the art is to move towards APM, analytics, integrated tools for all teams
  • deep linking & embedding, extensible tools
  • business transactions, response time, runtime (e.g. JVM) metrics

challenges with dynamic ephemeral cloud apps:

  • dedicated hardware: arrives infrequently, disappears infrequently, sticks around for years, unique IPs and MAC addresses
  • cloud assets: arrive in bursts, stick around for a few hours, recycle the IPs and MACs of machines that were just shut down!
  • in the cloud model, you need to have a historical record of everything that ever happened in your infrastructure (Netflix Edda)

traditional arch:

  • business logic
  • DB master & slave
  • some fabric in between
  • storage

new cloud systems:

  • business logic

  • NoSQL nodes

  • cloud object store

  • not all hosted cloud services have detailed monitoring / metrics exposed

  • you depend on web services to integrate with cloud services

  • span zones & regions, monitoring now needs to span zones & regions too

  • NoSQL introduces new failure modes

5 rules:

    1. analysis > collection
    2. key business metric monitoring should be second resolution
    3. precision and accuracy -> more confidence
    4. monitoring must be more scalable than the underlying system
    5. start building distributed, ephemeral cloud native applications

Q&A:

  • Q: you mentioned better visualization for microservices, like what?

    • A: a user hits the homepage -> what services are hit? there is no architecture diagram anymore. part of the visualization involves seeing which zones and regions are hit, plus manual tagging & hierarchy of components, owners, etc. it's useful to, for instance, limit the view to just the services my team owns or depends on (an aspect-oriented view). it's not a solved problem; most OSS monitoring tools have good backends but less good UIs. cloudweaver looks interesting
  • Q: canary system, what types of checks are you running?

    • A: error rate, CPU time, response time, jmeter functional tests, business metrics, and you need to do the comparison on freshly spun up nodes (e.g. 3 old vs. 3 new copies of the code on freshly spun up machines)

Computers are a Sadness, I am the Cure - James Mickens

  • (this talk was just entertainment, no practical information)

  • i'm here to take you on a quest

  • everything i'm going to tell you is 100% true

  • bla bla

  • distributed systems send messages back and forth

  • most messages fail because god hates us

  • so we send more

  • 10 years ago the MapReduce paper was like alien technology

  • it was so simple and seductive, you just specified a map and reducer function, ran it on commodity machines, it was amazing

  • that was 10 years ago

  • let's stop talking about MapReduce

  • say "word count" one more time

  • let's also stop talking about "the cloud"

  • the problem with all this social cloud stuff is that i hate most people

  • there are two kinds of people: people who have actually built cloud software and others

  • others: cloud is great!, 99.9999999%!, everyone is happy, everything is a solved problem!

  • real cloud people: it's a nightmare, hardware fails, SLAs are misleading, IO is queued up, packets get sent to a black hole, it's madness

  • why does anything happen at all in the cloud?

  • it's like an old timey map with dragons in the middle

  • this is why we need monitoring & analysis

  • a message of hope: give up

  • look at the CAP theorem, you can't have it all

  • if your email goes down, then your reaction should be to want to use email less, go do something else

  • can't take your test at your MOOC? take it later, your MOOC degree will be just as worthless

  • let's be serious though

  • some things we do need to care about

  • (nosql rant i didn't fully write down, nosql = bane from batman, throw out all the rules and laws, chaos)

  • conventional wisdom: america needs more programmers

  • reality: we need fewer programmers

  • technology is not the future, no more stupid apps, painting is the future, go do that, leave me alone

  • if you are a VC who funds this kind of stuff, i hope you become poor

  • let's be serious about security

  • threat model: mossad or not-mossad

  • either you are being attacked by mossad or you're not

  • "not attacked by mossad" = where you want to be, just keep using strong passwords and don't click on weird links

  • "you are being attacked by mossad" = no defenses, you're going to die

  • america's mental model of the CIA, FBI, etc. is that they are a bunch of boy scouts

  • in reality: drones, exoskeletons, cable splicing submarines

  • they're not going to send boy scouts, they're not going to fight close range musket battles, they're going to use their advantage of having access to all the infrastructure you depend on

  • how do you defend against that with rocks and pencils and leaves?

  • easy attacks are easy

  • "Mary" from "Central University" working as a "Rectuier" with an attactive profile picture wants to be my friend on Facebook

  • obviously i don't know mary

  • BUT WHAT IF I DO KNOW MARY

  • most important goal in security: eliminate men as a gender

  • possible solution: dude overflow detected -> trigger bear trap and the guy from the SAW movie

summary:

  • ozzy osbourne crazy train = cloud computing
  • bane = nosql
  • bla bla

Q&A:

  • Q: can i be your friend on facebook?
    • A: there is a background check, and i will wait 2-3 days to show i'm not desperate, i encourage you to submit an application though, i love judging people

Simple math to get some signal out of your noisy sea of data - Toufic Boubez

  • i lied! there are no simple tricks

  • too good to be true = it probably is

  • background:

  • CTO Metafor Software

  • CTO Layer 7 Technologies

  • CTO Saffron Technologies

  • let's start with the "Wall of Charts"

    • hire a new guy: shove him in front of the wall of charts
    • we collect 1000s of metrics, pick 10, and put them in a dashboard
    • this is meaningless
    • WoC leads to alert fatigue
    • alert fatigue is one of the largest problems in ops
    • watching WoCs cannot scale
    • at some point, you will need a person or a team dedicated to watching the WoCs
    • so we need to turn this work over to the machines
  • to the rescue: anomaly detection

    • definition: detect events or patterns which do not match expectation
    • definition for devops: alert when one of our graphs starts looking wonky
  • who else is doing anomaly detection?

    • manufacturing QC has been doing this for a long time
    • measure the diameter, weight, etc. of the flux capacitors and throw the outliers away
    • assumptions: normal, gaussian distribution; data is "stationary", it doesn't change much over time
    • the "three-sigma rule": 68% of the values lie within 1 std dev of mean, 95% lie within 2, 99.7% lie within 3
    • mark those percentages as the "red lines" on the graphs and take action when a value falls outside of a red line
  • if you implement 3-sigma rule alerts in the data center:

    • a. you get alerted all the time, or
    • b. you don't get alerted when there's a real problem
  • the assumptions from manufacturing (gaussian, stationary) don't apply to the data center

  • static thresholds are ineffective

  • if data is moving, we need a moving threshold, that's a smart idea

  • the "big idea" of moving averages: the next value should be consistent with the recent trend

    • finite window of past values, ignore the whole history
    • calculate a predicted value
    • "smoothed" version of time series
    • compare squared error rates between smooth vs. raw data
    • now you can compute the 3-sigma values based on that smoothed data
  • what about spikes, outliers, etc.? windows can be skewed

  • ok, now we use a weighted moving average, less weight on data that is further away

    • still not good enough (doesn't handle trends), so: exponential smoothing
    • double exponential smoothing (DES)
    • triple exponential smoothing (TES)
    • Holt-Winters (seasonal effects)
  • result:

    • a. you are woken up a lot less, but still woken up
    • b. it still doesn't catch some problems
  • are we doomed?

  • no

  • smoothing works on certain kinds of data

  • smoothing works when deviations are normally distributed

  • there are lots of non-gaussian techniques, we're only going to scratch the surface in this talk

  • trick #1: histograms

    • (better: kernel densities, but histograms work and are simple)
    • if you have a bunch of different time series of the same metric, build a histogram for each series
    • start by looking at the distribution of your data, understand what it looks like before you start your analysis
  • trick #2: kolmogorov-smirnov test

    • it sounds cool and it works
    • compares two probability distributions
    • requires no assumptions about the underlying distribution
    • measures max dist. between two cumulative dists.
    • good for comparing day-to-day, week-to-week, seasonal effects
    • "are these two series similar or not?"
    • KS with windowing
      • example: KS for week 1 vs. week 2 and week 2 vs. week 3 (where week 3 is during christmas and we experienced a problem)
      • 1 vs. 2: small distance
      • 2 vs. 3: huge distance
    • the case where 3-sigma static threshold failed is now extremely clear with KS
  • trick #3: diffing / derivatives

    • often when your data is not stationary, the derivative is
    • e.g. random walks
    • most frequently, the first difference is sufficient: dS(t) <- S(t+1) - S(t)
    • once you have the stationary data set, gaussian techniques work better
    • real example: CPU time
    • the distribution is totally non-gaussian, very noisy and random looking
    • but.. first difference, it totally is gaussian!
  • you're not doomed if you know your data

  • understand the statistical properties of your data

  • data center data is typically non gaussian

  • so don't use smoothing

  • use histograms, KD, and derivatives instead
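
A minimal sketch of tricks #2 and #3 in Python (assuming numpy and scipy are available; the window size and alpha are arbitrary example values, not numbers from the talk):

import numpy as np
from scipy import stats

def ks_window_alert(series, window=1440, alpha=0.01):
    """Trick #2: two-sample Kolmogorov-Smirnov test between the most recent
    window and the window before it; no assumption about the underlying
    distribution is required."""
    recent = series[-window:]
    previous = series[-2 * window:-window]
    statistic, p_value = stats.ks_2samp(previous, recent)
    # a tiny p-value means the two windows look like different distributions,
    # i.e. "this week looks nothing like last week"
    return p_value < alpha, statistic

def first_difference(series):
    """Trick #3: the raw series is often non-stationary but its first
    difference is, so gaussian-style techniques work on the diff."""
    return np.diff(np.asarray(series, dtype=float))   # dS(t) = S(t+1) - S(t)

# usage with synthetic data: a random walk whose behaviour changes halfway through
rng = np.random.default_rng(0)
baseline = np.cumsum(rng.normal(0, 1, 1440))
incident = np.cumsum(rng.normal(0.5, 3, 1440))        # drift + extra noise during the problem
series = np.concatenate([baseline, incident])
alerted, distance = ks_window_alert(first_difference(series))
print(alerted, round(distance, 3))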

Q&A:

  • Q: is your point to make everything gaussian?
    • A: no! sorry if i conveyed this message, KS does not involve gaussian assumptions, there are lots of good non-gaussian techniques

The Care and Feeding of Monitoring - Katherine Daniels

  • a story

    • pagerduty tells us our site is down
    • so we checked, and it was down
    • then... a minute later, it's back
    • hmm. ok.
    • then.. a few minutes later
    • down again
    • and up again
  • this is.. The Blip, a randomly occurring outage that fixes itself

  • so what's happening?

    • 500 rate.. nothing
    • API errors.. nothing
    • error rate... nothing
  • what are we missing from our monitoring?

  • monitor all the things!

    • we're missing something, just start randomly adding metrics until we find it
    • then you get.. this..
    • zenoss screenshot that's all red from down checks
  • we're trying to find a needle in a haystack and just added more hay

  • this is why you don't do a full body diagnostic scan for medical patients, the more you look for, the more you might find, and they might not all be actual issues

  • so, we need to monitor only some of the things..

  • first looked at the load balancers, because everything dropped out of the LB at once

  • tried provisioning a new ELB, switching availability zones

  • looked at access logs

  • everything worked the same, still getting the blip

  • how about the healthcheck?

    • the healthcheck was hitting something called "healthD", a healthcheck service that failed when one or both of two important backend components went down
    • and there weren't any logs or monitoring for healthD itself
  • looking inside healthD showed that one of the two services, api2, had a problem

    • it seems a certain misbehaving user was triggering bad requests
    • so we went into api2 and added metrics per response type
    • found the response type that stood out
    • decreased timeouts from 60 seconds to 5 seconds
    • optimized some slow queries
    • deleted some old slow / unused API methods
  • now the site was back to normal

why didn't we have monitoring for this?

    1. black boxes, mysteries
       • any X-as-a-Service that you depend on (e.g. ELBs) is a black box and needs some special care for monitoring
    2. technical debt / bad technical decisions
       • why did the healthcheck require both services to be up?
       • why did we even have two separate APIs?
       • long ago someone decided to do a rewrite, but the old system remained
       • we can only move forward at this point, we can't shut down either system, so we need to monitor both
  • what to monitor:

    • monitor all services
    • monitor responsiveness (network, API, web server)
    • system metrics (memory used, CPU used, disk space)
    • application metrics (read lock time, write lock time, error rate, API response time)
  • don't get into a situation where you have to say "oh yeah that check is red but it's OK, don't worry"

  • as someone mentioned earlier, your monitoring needs to scale above your application

    • load test your monitoring, make sure it can keep up and responds properly with increased load
  • monitoring should not be a silo, it shouldn't be an ops problem

    • monitoring should be built in to the application from the beginning
    • work with developers
    • ask: "what does it mean for this application to work properly? what does it look like when it breaks?"
  • monitoring shouldn't be a reactive last minute thing

Car Alarms and Smoke Alarms - Dan Slimmon

  • Sr. Plat Engineer at Exosite, which does internet of things

    • we recently made a better mousetrap that texts you when it goes off, so if you have a building full of mouse traps you only need to check the one that was tripped
  • we wear many hats in ops

  • but data science is becoming a very important hat

  • people believe you when you have graphs

  • signal to noise ratio

  • example: plagiarism detection

    • let's say we make a system that has a 90% chance of positive plagiarism detection
    • 20% false positive rate (i.e. a 20% chance of a positive result on a paper that wasn't plagiarized)
    • and 30% of kids currently plagiarize

some questions:

    1. given a random paper, what's the prob you get a negative result?
       • 59%
    2. what's the probability that the system will catch a plagiarized answer?
       • 90%, duh, we already knew that, why'd i ask you that?
    3. if you get a positive result, what's the probability the paper really is plagiarized?
       • 65.8%
  • this is an unintuitively terrible result

  • we originally heard 90% chance

  • but now in the real world it's down to 65.8%, that's pretty useless

  • sensitivity and specificity

    • sensitivity: % of actual positives that are identified as such
    • specificity: % of actual negatives that are identified as such
    • high sensitivity: freaks the fuck out when anything might be considered slightly bad
    • high specificity: if it says you cheated, sorry, you definitely cheated
  • here's the graph if you want to look at it again: http://imgur.com/LkxcxLt.png

  • how does this relate to ops?

    • positive predictive value (PPV) is the probability that, when you get paged, something is actually wrong
    • consider your service has 99.9% uptime, and your check is 99% accurate
    • that sounds pretty good right?
    • P(TP) ≈ 0.1% (service is down 0.1% of the time × 99% chance the check catches it)
    • P(FP) ≈ 1% (service is up 99.9% of the time × 1% chance the check fires anyway)
    • PPV = P(TP) / (P(TP) + P(FP)) ≈ 0.1 / (0.1 + 1) ≈ 9.1% (see the sketch after this list)
    • if you get paged, you only have a 1 in 10 chance that something is actually wrong
    • that's horrible
  • car alarms

    • when you hear a car alarm, is your immediate reaction to run and check to make sure everything is ok?
    • the majority of car alarms sounding don't indicate a problem, they go off all the time for no reason
    • they have low specificity, high sensitivity
  • smoke alarms

    • when you hear a smoke alarm in a building, you don't have the same reaction
    • you don't sit around and say "do you guys smell smoke? i think i'm just gonna wait here"
    • you get out of the building and wait for the fire department to give the OK
  • why do we have such noisy checks?

    • undetected outages are embarrassing, so we focus on sensitivity
    • that's a normal, good reaction to have
    • but understand the relation between the alert threshold and PPV
    • looser threshold = less alerting, higher PPV, more uninterrupted sleep (but a chance you'll miss a real problem)
    • strict threshold = more alerting, lower PPV, more false positives
  • sensitivity / specificity don't need to be competing concerns

  • instead of a line, you need a surface

  • hysteresis is a great way to get these additional degrees of freedom

  • state machines

  • time series analysis (like mentioned earlier, smoothing, histograms, derivatives, etc.)

  • as your data changes (e.g. your service becomes more or less reliable) or your checks become more reliable

  • your sensitivity & specificity will change too, sometimes wildly, so you can't just set it once and forget about it

  • a lot of nagios configs conflate the detection vs. identification of a problem

  • for example, say you have these 4 checks for your website:

      1. apache process count
      2. swap usage
      3. site responding to HTTP
      4. requests per second
  • "your alerting should only tell you whether work is getting done"

  • if your site is still up, but apache isn't running, that's great news! (haha)

  • so cross off #1 and #2

  • and #3 and #4 can be combined into one check, if your RPS is good, then it must be responding

  • here's a tool that i want: something like nagios that monitors services instead of hosts

  • when a service is down, only then do you kick off a bunch of host level diagnostics

  • if the tool was aware of these SNR concepts (specificity, etc.), and had some built in knobs to tune, that would be even better

  • other useful stuff:

    • bischeck
    • see links in slides
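
Two small sketches of the ideas above. First, the PPV arithmetic, using the same 99.9% uptime / 99% accurate check as the example:

def positive_predictive_value(p_down, sensitivity, specificity):
    """P(something is actually wrong | the check alerts)."""
    p_tp = p_down * sensitivity              # real problem, check fires
    p_fp = (1 - p_down) * (1 - specificity)  # no problem, check fires anyway
    return p_tp / (p_tp + p_fp)

# 99.9% uptime service, 99% "accurate" check -> roughly 9%, only ~1 page in 11 is real
print(positive_predictive_value(p_down=0.001, sensitivity=0.99, specificity=0.99))

And a two-threshold hysteresis check, one simple way to get the extra degree of freedom mentioned above (the thresholds here are invented for illustration):

class HysteresisAlert:
    """Two thresholds instead of one: alert when the metric crosses the
    high-water mark, clear only when it drops below the low-water mark.
    This avoids flapping when the metric hovers around a single line."""

    def __init__(self, raise_at, clear_at):
        assert clear_at < raise_at
        self.raise_at, self.clear_at = raise_at, clear_at
        self.alerting = False

    def update(self, value):
        if not self.alerting and value >= self.raise_at:
            self.alerting = True
        elif self.alerting and value <= self.clear_at:
            self.alerting = False
        return self.alerting

# usage: an error rate hovering near a single 5% line would flap; this doesn't
check = HysteresisAlert(raise_at=0.08, clear_at=0.03)
for err_rate in [0.04, 0.06, 0.09, 0.07, 0.05, 0.02]:
    print(err_rate, check.update(err_rate))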

Q&A:

  • Q: is it foolish to tweak these knobs manually? shouldn't this be automated?
    • A: i haven't found anything to automate this yet, manually tweaking is the only way i've found so far

Metrics 2.0 - Dieter Plaetinck

  • works at vimeo

  • video transcoding & storage

  • lots of metrics, lots of graphite

  • when a user uploads, it first runs a few checks to determine which data center to route your upload to

  • graphite is used to make a feedback loop to make sure that kind of automated system is working properly

  • but this talk is going to be about problems, mostly with graphite

  • a timeseries looks like this: (unixtime, value)

  • timeseries are labelled like "mysql.database1.queries_per_second"

  • it is difficult to navigate the hierarchies

  • it is difficult to find how and why a metric is being generated

  • timeseries don't have units, they don't describe their behavior (e.g. semantics like which time period they cover)

  • unclear, inconsistent formats

  • metrics are tightly coupled to the source and lack context

  • one metric name can have multiple meanings

  • complexity = lots of sources * lots of people * multiple aggregators

  • it's a time sink

    • everything has to be done explicitly, even when this data could be determined implicitly (units, legend, axes, titles, etc.)
    • in graphite, different subtrees may contain the same types of data, so this makes it hard to compare across the hierarchy
    • as you gather more metrics, these problems get worse
  • metrics 2.0 tries to solve these problems

  • metrics have a self describing format

compare graphite:

stats.timers.dfs5.proxy_server.object.GET.200.timing.upper_90

to metrics2.0:

{
    server: dfvimeodfsproxy5,
    http_method: GET,
    http_code: 200,
    unit: ms,
    metric_type: gauge,
    stat: upper_90,
    swift_type: object
}
  • metrics20 allows you to use more characters to label your metrics (e.g. "/" for "Req/s")

  • metrics20 allows you to add extra metadata to your metrics

    • for example, src/from parameters, so you can track where a metric is being submitted from
  • conceptual model -> wire protocol (compatible with graphite/statsd/carbon) -> storage

  • metrics20.org

  • units are extremely useful:

    • MB/s, Err/d, Req/h, ...
    • B Err Warn Conn Job File Req ...
    • we allow you to use SI + IEEE standard units
  • easier to learn, more flexible
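
A rough illustration, not the actual metrics 2.0 tooling, of why self-describing tags make group-by, units, and legends automatic, compared with digging meaning out of positions in a dotted graphite name:

from collections import defaultdict

# dotted names: the meaning lives in the *position* of each segment and has
# to be known out-of-band for every subtree
name = "stats.timers.dfs5.proxy_server.object.GET.200.timing.upper_90"
parts = name.split(".")
print(parts[2], parts[5])          # server, http_method -- by convention only

# tag dictionaries: the metric describes itself, so a dashboard can group,
# pick units, and label axes without per-metric configuration
metrics = [
    ({"server": "dfs5", "http_method": "GET", "unit": "ms", "stat": "upper_90"}, 212.0),
    ({"server": "dfs5", "http_method": "PUT", "unit": "ms", "stat": "upper_90"}, 340.0),
    ({"server": "dfs6", "http_method": "GET", "unit": "ms", "stat": "upper_90"}, 198.0),
]

def group_by(metrics, key):
    groups = defaultdict(list)
    for tags, value in metrics:
        groups[tags[key]].append(value)
    return dict(groups)

print(group_by(metrics, "http_method"))   # {'GET': [212.0, 198.0], 'PUT': [340.0]}
print(group_by(metrics, "server"))        # {'dfs5': [212.0, 340.0], 'dfs6': [198.0]}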

Carbon-tagger:

  • middleware between old graphite instance and new metrics20 instance
  • adapts old format to new format (adding metadata, units, etc.)

Statsdaemon:

  • similar to etsy statsd, drop-in compatible
  • if you send a bunch of bytes B over time, it automatically figures out this is B/s
  • if you send a bunch of milliseconds ms over time, it automatically calculates percentiles/min/max/mean/etc.

Graph-Explorer:

  • dashboard system with a new query syntax

New query syntax:

  • proxy-server swift server:regex unit=ms

  • automatically does group-by based on metadata

  • automatic legends, axes, tagging (these are all manual in graphite)

    stat=upper_90 from datetime to datetime avg over (5M, 1h, 1d, ...)

Some examples:

Which is slower, PUT or GET?

stack ...
http_method:(PUT|GET)
swift_type=object

Show http performance per server:

http_method:(PUT|GET)
group by unit, server

grab all job stats (note how no timeseries names are explicitly given, this finds all timeseries that have a unit of "Jobs/second"):

transcode unit=Job/s
avg over <time>
from <datetime> to <datetime>

another example:

...didn't catch it...

another example, but now grouped by zone:

...
group by zone

network bandwidth by server:

unit=MB/s network dfvimeorpc sum by server[]

cumulative total of bandwidth over time

(automatic integration)

rate of change:

(automatic derivatives)

bonus features:

  • graphs are interactive (inspect, zoom)
  • set up rules & alerts
    • imagine a disk space check which can alert you on both individual machines and cluster-wide
  • email alerts (with embedded graphs)
  • emit events (see anthracite), add notes / events to graphs, events have full text search
  • better dashboards: allow you to dynamically append a fragment of a query to every query in the dashboard (e.g. switching between different group-by clauses)
  • easier to define colors

future work:

  • these three features are all about condensing series into smaller sets of data:
    • aggregation rules
    • graphite API functions like summarize, etc.
    • consolidateBy & graph renderers (i.e. at the pixel level to generate images)
  • a lot of mistakes show up from these operations
  • with metrics20 we shouldn't need to do this anymore, the graphs themselves should know how to do this
  • maybe we can automatically display mean/lower/upper/upper90/lower90 on graphs
  • facet based suggestions
  • imagine if you consistently emitted metrics with "unit=Err/s" across your entire stack, i.e. this was a standard in every piece of infrastructure / system / application, if you did this, you could have complete visibility into errors across your entire infrastructure, plus super easy drill-down

Q&A:

  • Q: openstack has a technology called "cata"(?), used by ceilometer, it's a standard, has 5 W's metadata, etc. have you looked at that?

    • A: i haven't, i tried searching for something like this but didn't find anything, sounds interesting, definitely will look at it
  • Q: does carbon-tagger cause performance problems?

    • A: we have 170k metrics at vimeo and it's performed fine. both tools i mentioned are written in go

Our Most Wicked Problem - Ashe Dryden

  • lack of diversity in tech is a wicked problem

  • http://en.wikipedia.org/wiki/Wicked_problem

  • it's like playing tetris with only one piece

  • whites and asians are overrepresented in tech vs. the general population

  • women, black, and hispanic are underrepresented

  • 56% of women leave tech after entering, twice the attrition rate of men, and we don't have stats on other groups

  • why is it a wickedly hard problem?

  • incomplete or contradictory knowledge

  • not enough research

  • people & opinions involved

  • people have different opinions on this subject

  • economic problems

  • not all schools can get computers & internet access & teachers for tech

  • there is a pay difference between certain groups

  • there is no solution

  • just like poverty, the problem can never be totally solved

  • there's no right or wrong solution

  • we don't even know what the solution is yet

  • the solvers of this problem can also be the creators of the problems

  • what contributes? society, class, family & community, education, industry

  • what can i do?

    • if you're a parent, raise your children to be respectful of others
    • get involved in education
    • listen to the people who are affected
    • have empathy
    • collaborate
    • change your behavior
    • use your power & influence to change things, talk to your boss, talk to your colleagues, talk to strangers, reach out, speak out on behalf of others

Q&A:

  • Q: i'm a pro-feminist man, and i understand why you can't depend on the repressed group to solve the problem, but if i use my voice then i'm going to be speaking for women and reinforce the problem, what can i do?

    • A: instead of speaking on behalf of others, speak for yourself to create space for others
  • Q: what is low hanging fruit in this problem?

    • A: talk to your friends, if someone says something that doesn't sound right to you, that sounds harmful, say something to them, and explain to them instead of criticize them
  • Q: is it difficult because success has no definition for this problem?

    • A: yes

StatsG at New York Times - Eric Buth

  • works at the New York Times in the interactive news department

  • what does our department do?

  • i sometimes can't do a good job of explaining it, maybe some examples would be better

  • "The Guantanamo Docket"

    • interactive timeline showing what has happened to the gitmo detainees from 2002 to 2014
    • click on detainee's name to bring up their bio, documents, articles, etc.
  • "Watching Syria's War"

    • timeline of video clips & articles
  • Sochi 2014

    • neat tables and graphs of olympic results (medal counts, etc.)
  • haiku.nytimes.com

    • finds accidental haikus written in articles
  • Blackout Poetry

    • article starts off completely redacted, then you click on words to reveal them and create a poem
  • and lots more...

what's in common?

  • i don't know actually, we're kind of responsible for whatever we say yes to doing

  • we're separate from the larger NYTimes organization

  • we have our own infrastructure, we don't have to deal with the larger more "corporate" parts like the CMS, mobile app, etc.

  • we don't have as many traditional releases, milestones, etc.

  • heterogeneity

  • over 100 active apps

  • short turnarounds

  • collaborations with other departments

  • everything is different, for a good reason

  • another example: the Dialect Quiz

    • someone threw together a node.js app last minute
    • ended up being their highest traffic feature ever
  • if you work in systems, this might lead you to become an embittered jerk

    • everyone tells you their project is the most important thing ever and then it launches and you're stuck maintaining it forever
    • if you are in the position to say "no", you start to say "no" all the time
    • no new technologies, no new languages, more conservative choices
    • ops is vaguely managerial, you are partially in charge of leading technology projects, to make sure projects succeed, to give technical advice, to help organize the systems and keep them running
    • so if you have a bad run, if you have some bad experiences, you tend to start saying no to everything
    • a year ago i tried to make a change in this behavior
  • what if your relationship was the opposite?

    • what if you tried to say "yes" to everything?
    • this is actually the reason behind having an interactive news dept., to do this kind of stuff
    • even though it can be a pain in the ass
  • if someone's enthusiastic about something, and you shut them down, that's not good for either side

  • wasted enthusiasm is a very bad thing

  • if you don't embrace that enthusiasm, they will go elsewhere

so how do you handle so many heterogeneous systems?

  • have preferences and offer alternatives (e.g. nginx instead of apache)

  • pick technologies that are widely applicable (e.g. varnish works in front of everything)

  • what are you logging? how are you logging?

  • can you set this up without my help?

  • everything needs to be self-serve

  • including metrics gathering

  • old way: boilerplate / sample code / examples

  • new way: be reasonable, follow a few guidelines, and you're free to run whatever you want

  • we had an old log aggregation system, which was unmaintained

  • statsd replaced that system

  • because statsd is:

    • self reporting, zero config
    • get what you asked for
    • easy to integrate with everything
    • easy to explain
    • doesn't over-solve the problem
  • well.. we did decide to over-solve the problem a bit.. and wrote statsG

    • easier to run
    • automate data retention
    • eliminate flushing
    • safely expose self-serve data retrieval
  • go is a good choice for this kind of application

    • running binaries is a big advantage
    • (gave a few other reasons i missed)
  • redis also sounded like a good fit

    • redis is good at sets, this sounds like a set management problem
    • redis has automatic expiration
  • lua for scripting redis

    • having a scripting language inside the DB allows you to do aggregation inside the DB itself, which is very easy and super fast (see the sketch after this list)
  • result:

    • consumes JSON data
    • interactive graphs with 10 second resolution
    • dashboards are totally driven by developers
    • Winter Olympics was a big success story, the developers wrote all their own monitoring by themselves
  • problems:

    • UDP is awesome ("free" message sending), but is incredibly difficult to debug, filling up buffers/queues and dropping messages is always a worry
    • redis is very powerful, but redundancy and scaling are a problem
  • rolling your own solution is OK, but it's not for everyone

  • if you feel enthusiastic about something, and you want to put the time into it, then you can roll your own

  • this allows you to get to the root of the problem and you might learn something really valuable

  • for us, it was having the ability to make metrics completely driven by developers

  • cool bonus:

    • nytlabs.github.io/streamtools/
    • this project is going back to using log data and building up subscribe-able streams of log events
    • using a visual interface
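
A minimal sketch of the redis + lua idea mentioned above (aggregation and expiry handled inside redis itself, one atomic round trip per data point). This uses redis-py for illustration, assumes a local redis is running, and is not the actual statsG code:

import time
import redis

r = redis.Redis()

# the lua script runs inside redis: increment the bucket and refresh its TTL
# atomically, so data retention is automatic
INCR_WITH_TTL = """
local total = redis.call('INCRBYFLOAT', KEYS[1], ARGV[1])
redis.call('EXPIRE', KEYS[1], ARGV[2])
return total
"""

def record(metric, value, bucket_seconds=10, retention_seconds=3600):
    # one key per 10-second bucket, e.g. "olympics.pageviews:139123456"
    bucket = int(time.time()) // bucket_seconds
    key = "%s:%d" % (metric, bucket)
    return r.eval(INCR_WITH_TTL, 1, key, value, retention_seconds)

record("olympics.pageviews", 1)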

Q&A:

  • Q: for that streamtools project, once you consume the data, what can you do with it?
    • A: you can do anything, different plugins for sending to redis, sending to console, forwarding the message along to another service

The cost and complexity of reactive monitoring - Chris Baker

  • (this talk was mostly just a war story, not much real info to take away)

  • data guy @ Dyn

  • how many people have ever been in the situation where they were staring at a pile of data wondering "how did this problem happen?"

  • how did we get there?

  • scale 1: how much money do we have? (money to buy infrastructure & tools vs. extremely strapped)

  • scale 2: cutting edge vs. classic (new and shiny vs. nagios)

  • scale 3: neckbeard vs. handwaver (refusal to work with new tools vs. oh please new tools save me)

  • scale 4: time (lots of time budgeted vs. project manager hovering over you)

  • scale 5: legacy (totes cloud brah vs. you down with PDP & ancient pyramids?)

  • cost = price & manhours

  • probability of user churn (customer leaves) vs. problem duration vs. problem severity

    • time to identify
    • time to mitigate
    • time to resolve
    • impact vs. identification vs. diagnosis vs. resolution
    • if you fix a problem before it occurs, there is no customer impact, this is where you want to be
  • make more metrics to track this

  • metrics all the way down!

  • have metrics to track your metrics

  • but the end goal is to solve problems in CI / testing instead of production

  • time to identify: time motion study (cool industrial study, makes us feel good to compare ourselves to industry)

    • first you have to realize there is an issue
    • you should notice before your customer does
    • where do you look first?
  • example: customer reports that API is unavailable

    • so, the customer knew about this before we did
    • when did the problem really start?
    • here's where the complexity begins
    • when you're under pressure, your problem solving ability changes
    • humans are fallible; under pressure you're likely to latch onto some idea, then start investigating or building evidence for that idea
    • if you started using some brand new database monitoring software, and then something breaks, you're going to start being suspicious of that new monitoring software... even though in this case it's not the cause
    • all the while time is still ticking
    • vendor plug / shout out to VividCortex, this actually solved the problem! it highlighted the problem for us!
    • we found the problem! or did we???
    • (i guess this is turning into a war story now?)
    • well, vividcortex showed us problems, but it didn't fix the customer's problem
    • so.. back to square one
  • reactive monitoring is the result of a bigger problem

  • humans are not good at this kind of problem solving

  • the crunch to provide an answer often leads you to the wrong answer

  • part 2

    • i work in DNS
    • and we know there's a certain traffic pattern during the holidays, traffic increases, we run into new problems every year because of this
    • but this year.. hmm.. everything is green, no pages, all graphs look amazing, everyone is relaxed & off-guard because things are going so well
    • we're handling huge spikes of traffic with no problem
    • when everything looks this good then something is probably wrong
    • you need someone on your team to be the pessimist, to think that everything is broken all the time...
    • who is driving these spikes? CDNs? marketing campaigns? botnets? round up the usual suspects
    • how are we collecting this data? how does this data go from the real world into our monitoring system?
  • your dashboard is the sausage produced by the sum of your monitoring

  • if there's sawdust and rats in the input, it's going to show up in the output

  • interesting aspects of DNS traffic:

    • recursive resolution (series of misses & lookups, terminating at the root)
    • TTL = time to live
    • RCODE = response codes, 0 = good, 1 = format error, 2 = server failure, 3 = name error, 4 = not impl., 5 = refused, 6-15 = bla bla
    • if you're not monitoring RCODEs, you don't know whether there's rat bits in your sausage
    • certain RCODEs don't use TTL/caching
    • TTLs are a rule people, and we have rules for a reason!
    • why monitor RCODE 5? it tells you all kinds of useful stuff
    • well.. we weren't monitoring RCODE 5
    • pretty obvious in retrospect

(i'm not quite sure what the main point of this talk is, it was more of a fun war story i guess)

Q&A:

  • Q: is it difficult carrying all this weight as a devops thought leader on your shoulders? (some kind of in-joke in the DevOps twitter community?)
    • A: when i think about it.. atlas shrugged

From Zero To Visibility - Bridget Kromhout

  • having aspect ratio problems

  • yes, definitely aspect ratio problems

  • I work at 8thbridge

    • small dev team, one person ops team (me)
  • joined the startup in progress

  • twisty maze of shell scripts

  • time consuming

  • easy to break

  • cron jobs which rewrote the crontab

  • in portland we have bespoke artisanal everything

  • we also used new relic

  • pros:

    • nice graphs
    • application level view
    • good error analysis
  • cons:

    • slow to update
    • many false-positive alerts (not totally their fault)
    • we couldn't afford it (has changed some since then)
  • so those were our motivating reasons to change

  • but the main motivator was not getting enough sleep

  • so i changed our monitoring to nagios

    • nagios: every bit as hideous as you remember
    • yes it's hideous, but everything is right where you left it in 1912
    • the new shinies are great, e.g. sensu
    • but if we started using sensu it would have been the most complicated thing in our stack
  • hating on nagios: the middle years

    • this is when nagios starts getting chatty
    • as soon as you see a problem, you write a new check and ratchet up the chattiness
    • everyone hates you when you write spammy checks
  • how do i monitor something like HBase / hadoop?

    • best way to monitor HBase: hbck, the hbase consistency checker
    • nagios -> hbck bash script -> parse output
    • the most awesome tool in the world won't be able to monitor stuff like this out of the box
    • the only way you get that is by writing a custom check, which is the same no matter what technology you use
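
A sketch of what such a custom check might look like as a nagios plugin, assuming hbck's summary contains a line like "Status: OK" or "Status: INCONSISTENT" (adjust the parsing for your HBase version):

#!/usr/bin/env python
"""Nagios-style wrapper around `hbase hbck`, roughly as described above."""
import subprocess
import sys

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def main():
    try:
        out = subprocess.check_output(
            ["hbase", "hbck"], stderr=subprocess.STDOUT, universal_newlines=True)
    except subprocess.CalledProcessError as exc:
        out = exc.output or ""                 # hbck may exit non-zero when inconsistent
    except OSError as exc:
        print("UNKNOWN - could not run hbck: %s" % exc)
        return UNKNOWN

    status_lines = [l for l in out.splitlines() if l.startswith("Status:")]
    if not status_lines:
        print("UNKNOWN - no Status line in hbck output")
        return UNKNOWN
    if "OK" in status_lines[-1]:
        print("OK - hbck reports the cluster is consistent")
        return OK
    print("CRITICAL - %s" % status_lines[-1])
    return CRITICAL

if __name__ == "__main__":
    sys.exit(main())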

mongoDB:

  • much like stumbling upon a robbery, i walked into a mongoDB in progress, with zero monitoring

  • found nagios-plugin-mongodb

  • worked pretty well, made a few fixes & improvements

  • and they accepted my pull request!

  • but.. mongoDB gave us trouble on cybermonday

  • our traffic spiked and our response time went to crap

  • "a single write operation holds the lock exclusively, and no other read or write operations may share the lock"

  • the write lock always seemed sketchy, but it couldn't be that big of a problem, right? it was

  • so.. next step.. we need to measure everything

    • we had an old unused, unmaintained graphite install
    • running something inside screen does not make it a daemon!
    • so, get that into shape
    • statsd chef cookbook worked great
    • graphite cookbook.. not so good, chef 11 only (we're dragging our feet on chef 10) and we run nginx, not apache
    • had to use tcpdump to debug why statsd/graphite didn't work
    • but got it working eventually
  • shout out to carbonate

    • whisper-fill.py: backfills data between whisper files
    • very useful for the cutover
  • how to detect real outages vs. deliberate drop-offs in traffic?

    • we provide a third party cookie
    • some people enable/disable our cookie on purpose (e.g. because they think it's causing a problem)
    • and some people disable it accidentally (pushing bad code)
    • this is difficult to catch without constantly looking at the graphs
  • we didn't have money for new relic so we used sentry (open source error reporting system)

  • this was really helpful in catching API errors from third parties trying to integrate with us

  • showed a diagram of all their monitoring tools and the way the data flows

  • when we explain this to non-ops people, they usually ask "why do you guys use so many tools? can't you use just one?"

  • no! there is no one tool, there is some overlap, but you can't survive with just one monitoring tool

  • what's next? wishlist for what i want to do next

    • logstash, kibana, elasticsearch
    • etsy/skyline - anomaly detection
    • etsy/oculus - metric correlation for etsy's "kale" system
    • zorkian/nagios-api - REST-like JSON interface to nagios
    • grafana - better graphite interface
    • hubot - want to use this to interact with nagios via chat
  • what is the ideal monitoring system?

    • finds real problems
    • actionable alerts
    • usable by everyone

Q&A:

  • Q: why did you choose nagios if everyone hates it?

    • A: i've done sysadmin before, quite a few years ago, i've never set it up from scratch, but i had a feeling it would work, it wasn't too bad to set it up manually, we needed a solution ASAP, and it worked
  • Q: have you looked at check_mk?

    • A: i'm aware of it but i haven't looked closely at it, right now a lot of our nagios checks are alerting on data in graphite, what would you suggest using it for?
  • Q: uhhhh monitoring (?)

  • Q: what do you want to get out of the nagios API?

    • A: scheduling downtime and acknowledging alerts via hubot

Conclusion of Day 1

Jason Dixon:

"Auditing all the things": The future of smarter monitoring and detection - Jen Andre

  • founder & programmer at Threatstack

  • premise:

      1. are you keeping a record of all processes running on your network?
      2. are you keeping a record of all hosts those processes are talking to?
    • if not, you are not secure
  • why do you want to know this information?

  • because you're a tinfoil hat security person

  • is there a reason to be this paranoid? yes, if you ever get hacked

  • even if you think you are secure, people are the weak links

  • should you care if you are hacked?

  • snapchat for pets: maybe not

  • big pharmaceutical company: yes

  • rest of us: it depends, but probably yes

  • do a risk assessment process to figure out how important this is to you

  • whenever a company is hacked

  • they all post the same message

  • "we got hacked but we found no evidence of really bad stuff. please reset your password as a precaution."

  • really?

  • did you look for evidence? or is that wishful thinking

  • do you even have any evidence?

  • we don't know what goes on internally

  • but I do know that forensics after the fact is really hard and really expensive

  • if you log everything ahead of time by default, this is much easier

  • the cloud

    • for security people the cloud limits visibility
    • old school networking: defined perimeter, harden the outside of your network, DMZs, firewalls, etc.
    • in the cloud this doesn't apply, there is no well defined perimeter
    • so you need to do continuous security monitoring
    • audit everything, instrument everything, keep historical records of everything (sent to a secure place)
    • continually improve monitoring & detection

what to monitor:

  • systems: authentications, processes, network traffic, kernel modules, file system access

  • apps: authentications, DB requests, http logs

  • services: API calls to SaaS or cloud providers

  • intrusion detection

  • "active defense"

  • incident response

  • do you know who is accessing your S3 buckets? do you have logs of that?

monitoring your systems:

  • start at the host level
  • process auditing - linux audit
  • network flow - libnetfilter_conntrack
  • login - wtmp/audit/pam_loginuid
  • keep everything in one 'big data' DB (e.g. elasticsearch)
  • write scripts to analyze this data

The Linux Audit System

pros:

  • powerful
  • built in to the kernel
  • relatively low overhead
  • apt-get install audit
  • it audits all the things, sort of
  • syscalls, syscalls by user, logins, etc.
  • doesn't include network data

how does it work?

kernel threads doing things
-> audit messages ->
kernel thread queue
-> netlink socket ->
userland audit daemon & tools (redhat's auditd, auditctl, etc.)
-> /var/log/audit/audit.log

configuration:

files (watch all modifications to /etc/shadow):
    -w /etc/shadow -p wa

syscalls (watch all kernel module changes):
    -a always,exit -F arch=ARCH -S init_module -S delete_module -k modules

follow executable:
    -w /sbin/insmod -p x

cons:

  • the logging is very obtuse

    • logged values are a mishmash of strings, decimal integers, hex, etc.
    • lots of manual matching up of cryptic names and values to other log lines for context
  • it can crash your box

    • if the auditor is slower than the rate of incoming messages, buffers fill up and stuff starts crashing
    • enable rate limiting to help prevent this
  • performance...

  • one alternative is to connect directly to the auditing socket and write your own listener

    • for example, we wrote a listener that emits JSON instead of the obtuse text logs
    • we also wrote a luajit listener that can do super fast filtering, transformation, and alerts
  • libevent + filtering + state machine parser

  • reduced CPU usage from 120% to 10%, greatly increased throughput

logins:

  • wtmp / "last" command

  • fairly easy to parse and turn into json

  • auditd also records login info

  • you can configure SSH to emit login events to audit

  • what about tracking "sudo su -"? how do I track commands that are run once someone becomes root?

    • use pam_loginuid
    • this adds a session ID to every audit event so you can track everything from the user login -> running commands as root

network traffic:

  • src/dst ips
  • src/dst ports & protocol type
  • use the netfilter & conntrack systems
  • netfilter = used by iptables
  • conntrack = tracks connections
  • turn this on: sysctl nf_conntrack_acct
  • the conntrack tool will show you raw packets and byte counts, very ugly
  • use libnetfilter_conntrack to emit JSON
  • it's hard to directly tie a process to conntrack data
  • but you can correlate using port numbers

putting it all together:

  • someone logs in
  • you can view all the commands they run (as their user or as root)
  • you can view all their network connections
  • all this information is stored in a database that can be queried or accessed through a web interface

bonus: detection

  • so i am collecting all this information now, how can i use it for detection?
  • most attacks typically aren't very sophisticated
  • many attacks use valid credentials (obtained through weak human targets, social engineering, malware)

what to look for:

  • "is this user running commands they shouldn't be?"
  • "why is a user running gcc?"
  • "why is a user account running a command that only root or system user should run?"
  • "where are my users connecting from?" (china? eastern europe?)
  • "what are my users connecting to?" (again, any outlying places like china, eastern europe)
  • you can create simple rules for these
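
A toy sketch of such rules, assuming the audit pipeline already emits one JSON event per line with fields like user, exe, and src_country (those field names, paths, and countries are invented for illustration):

import json
import sys

SYSTEM_ONLY = {"/sbin/insmod", "/sbin/iptables", "/usr/sbin/tcpdump"}
SUSPICIOUS_FOR_APP_USERS = {"/usr/bin/gcc", "/usr/bin/wget", "/usr/bin/nc"}
EXPECTED_COUNTRIES = {"US", "CA"}

def check(event):
    alerts = []
    if event.get("exe") in SYSTEM_ONLY and event.get("user") != "root":
        alerts.append("non-root user ran %s" % event["exe"])
    if event.get("exe") in SUSPICIOUS_FOR_APP_USERS and event.get("user", "").startswith("app"):
        alerts.append("app account ran %s" % event["exe"])
    if event.get("src_country") and event["src_country"] not in EXPECTED_COUNTRIES:
        alerts.append("login from %s" % event["src_country"])
    return alerts

# usage: pipe JSON-per-line audit events through this filter
for line in sys.stdin:
    event = json.loads(line)
    for alert in check(event):
        print("ALERT [%s]: %s" % (event.get("user", "?"), alert))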

Q&A:

  • Q: something about conntrack

    • A: capturing raw data is very large, you need to filter, another option is to have a NAT box / router that all machines connect through and track everything there
  • Q: are you saying it's ever OK to be hacked?

    • A: no, but your response is different depending on what industry you're in, e.g. the medical industry you must respond within a certain number of days and disclose the information in a certain way according to the law, hacking is only going to be more common, everyone will eventually be hacked
  • Q: something about standards, are there any tools to help achieve standard compliance?

    • A: (she lost her voice and couldn't continue)

Is There An Echo In Here?: Applying Audio DSP algorithms to monitoring - Noah Kantrowitz

  • math ahead!

  • metrics have value @ a certain time

  • we can put them into graphs, we look at them all day every day

  • but you can also put this data into a .wav file

  • have you ever seen a visualizer / EQ?

  • it looks kinda like our graphs

  • but they have a frequency domain

  • value over time vs. value over frequency

  • x axis frequency: 0Hz -> 20Hz

  • y axis decibel value: +0dB -> +50dB

  • you can use the fourier transform to turn (time, value) data into frequency data

  • (gave the formal definition)

  • sine wave

  • add multiple sine waves together

  • add some noise

  • and this starts to look like one of our graphs in systems land

  • you can convert this graph to frequency space to get the underlying components

  • this reveals new information

  • instead of the mathy formal definition of FT (with integrals and infinity signs, which computers are bad at)

  • we use DFT and DTFT, discrete fourier transforms

  • one problem with this is that we have to do an O(N^2) calculation on the entire data set

  • there is an algorithm called Fast Fourier Transform

  • which is O(NlogN) instead of O(N^2)

  • an IFT does the opposite process, it turns frequency data into time series data

low-pass filter:

  • say we have a series with a threshold

  • and it's constantly flapping in nagios terms

  • use FFT to convert to frequency, run a low-pass filter, use IFT to get back to time series (NumPy sketch after this list)

  • then apply your threshold

  • this gets rid of the noise

  • e.g. it allows you to catch longer term rampups instead of short term blips

  • there are also high-pass filters (delete high values) and band-pass filters (delete outside of range)
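
(not from the talk: a minimal NumPy sketch of that FFT -> low-pass -> IFT -> threshold workflow; the cutoff, sample rate, and synthetic signal are all arbitrary)

    # sketch: low-pass filter a noisy metric via FFT/inverse FFT, then apply a threshold
    import numpy as np

    fs = 1.0                                  # one sample per second
    t = np.arange(0, 600, 1 / fs)             # 10 minutes of data
    signal = 50 + 0.05 * t + 5 * np.random.randn(t.size)   # slow ramp-up + noise

    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(signal.size, d=1 / fs)

    cutoff_hz = 0.01                          # keep only the slow components
    spectrum[freqs > cutoff_hz] = 0
    smoothed = np.fft.irfft(spectrum, n=signal.size)

    threshold = 75
    print("raw samples over threshold:     ", int(np.sum(signal > threshold)))
    print("smoothed samples over threshold:", int(np.sum(smoothed > threshold)))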

windowing:

  • chops off data that you aren't concerned with
  • rectangular window function - very simple to implement
  • need to be careful of spectral leakage when using a small window size
  • which gives you "mushy" peaks, less clear signal
  • triangular window function - better, but not perfect, also easy to implement
  • blackman-harris window function - best result (sketch below)
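
(not from the talk: a tiny NumPy sketch of windowing a chunk before the FFT; np.bartlett is the triangular window, and a blackman-harris window is available in scipy.signal if you want that one)

    # sketch: multiply a chunk of samples by a window function before the FFT
    # to reduce spectral leakage
    import numpy as np

    chunk = np.random.randn(256)      # pretend this is one window of metric data

    rectangular = np.fft.rfft(chunk)                         # implicit rectangular window
    triangular = np.fft.rfft(chunk * np.bartlett(chunk.size))
    blackman = np.fft.rfft(chunk * np.blackman(chunk.size))

    print(np.abs(rectangular[:5]))
    print(np.abs(triangular[:5]))
    print(np.abs(blackman[:5]))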

how do you do this?

  • NumPy is the one-stop shop, all of these functions are built-in

  • FFTW for C

  • go-dsp for Go

  • nothing in ruby, there isn't much scientific / numeric software for ruby

  • go forth and find the signals!

bonus content:

  • discrete cosine transform (DCT)

    • how most audio/video compression works
    • this is why MP3 files are smaller than WAV files
    • WAV stores the raw (time domain) samples
    • MP3 stores DCT coefficients, much smaller to store, then runs the inverse transform to decompress
    • someone, please write a metrics database that uses DCT!
  • wavelets

    • next generation compression systems (e.g. H264)
    • someone should build something using this too
  • ???

    • (something i missed)
  • hysteresis

    • use input to predict output
  • control theory

    • goes hand in hand with signal analysis
    • signal analysis gives you tools to analyze data, but control theory gives you tools to act on the data
    • for example autoscaling
    • PID control loops

Q&A:

  • Q: can you demo some of the numpy code?

    • A: sorry, no, it's too much to get into right now
  • Q: any monitoring tools using these techniques?

    • A: no! I don't know of any, nagios flap detection is a poor reinvention of the most basic form of signal analysis, but it sucks, there's a thousand years of research on this subject and nobody is reading it or implementing it!
  • Q: is our data amenable to this approach? is our data really all built out of sine waves?

    • A: most of the data we look at has periodic components, at the very least you have a daily cycle; and there are a lot more cycles e.g. timeouts, response times, user activity, etc. all contribute to periodic rhythms
  • Q: is your code on github?

    • A: no it's all homegrown hacky python code, not releaseable yet
  • Q: if we added FFT to graphite would that solve a bunch of problems?

    • A: yea that'd be helpful, but would be better in a streaming system like riemann
  • Q: something about high frequency data

    • A: it's the same problem as audio, audio needs to be sampled, you might need to do the same thing with your data, sample it
  • Q: how do you deal with noise in data? what about the colored noises?

    • A: haven't run into this much, i'm using data i know to be periodic

A Melange of Methods for Manipulating Monitored Data - Dr Neil J. Gunther

  • http://en.wikipedia.org/wiki/Neil_J._Gunther

  • author of many books, teaches classes, workshops

  • The Practical Performance Analyst

  • no more plane crash analogies? (monitorama berlin joke)

    • too bad, it's a useful analogy
    • asiana flight 214
    • report found that asiana pilots are too focused on instrumentation
    • they didn't do basics like... look out the window
  • monitoring is not about pretty pictures / graphs / tools / fancy math

    • it's all about the data
    • what story is the data trying to tell you?
    • you need to have a consistent interpretation of data, across all the data
  • how do we converge on consistency? i'll show some examples

The Greatest Scatter Plot

  • (shows strip charts of metric1 and metric2)

  • if we were good at looking at data the stock market would be a solved problem

  • is there a relation between metric1 and metric2?

  • put both sets of data into a scatter plot

  • does it show anything interesting? a trend in any direction?

  • linear regression

  • Least Squares Fit

  • LSQ fit and R^2 value (what percent of the data matches up with the model?)

  • are we done now? no, this is just the beginning

  • is linear fit the best choice?

  • what is the meaning of the slope?

  • are you comfortable extrapolating this model into the future?

  • the most important scatter plot in history

  • 1929

  • Edwin Hubble's plot of distance of stars from us & their velocity

  • what does the slope mean? v/r, Hubble's constant

  • from this slope we can calculate the age of the universe!

  • one small problem, Hubble's calculation of the age of the universe (2B years) was lower than the age of the earth (3-5B years)

  • how did the earth get here before the universe?

  • what could he do?

  • (answers from the crowd: "look out the window", "fudge the data")

  • well, the earth is not stationary, so he compensated for earth's velocity

  • and... the data got worse!

  • nonetheless, he published the data

  • some thought he was crazy, it's obvious something is not right

  • 70 years later, Hubble is now vindicated

  • Hubble's plot was a tiny area of what we can now see

  • telescopes weren't good enough in Hubble's time

  • the data was wrong, but his model was correct

  • lesson: treating data as divine is a sin

  • i am fond of saying that all data is wrong

irregular time series:

  • regular samples: like a metronome, every time has a value
  • irregular samples: missing data
  • you use the arithmetic mean on regular series
  • you use the harmonic mean on irregular series (small sketch after this list)
  • with unequal intervals you need to scale the mean based on how long the intervals are between data points
  • use HM on aggregate monitored data when the following apply:
  • R - rate metric (y axis)
  • A - something i didn't catch
  • T - something i didn't catch
  • E - something i didn't catch
  • this doesn't come up too often in our systems
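
(not from the talk: a tiny sketch of why the choice of mean matters for rate metrics; the numbers are made up, and the RATE criteria above were only partially captured, so treat this as the general idea only)

    # sketch: arithmetic vs. harmonic mean of rate samples; when each sample
    # covers the same amount of work (not the same amount of time), the
    # harmonic mean gives the true overall rate
    rates = [100.0, 50.0, 10.0]       # e.g. requests/sec measured per batch

    arithmetic = sum(rates) / len(rates)
    harmonic = len(rates) / sum(1.0 / r for r in rates)

    print("arithmetic mean: %.1f req/s" % arithmetic)   # 53.3 -- overstates throughput
    print("harmonic mean:   %.1f req/s" % harmonic)     # 23.1 -- actual overall rate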

Power Laws and the Law of Words:

  • Zipf's law

  • plot the frequency of words in the english language

  • words like "the" occur many orders of magnitude more often than more exotic words

  • what function describes this data? it's hard to say from looking at the graph

  • the trick is to use logarithmic axes

  • check if a linear regression works on the data with logarithmic axes

  • power laws imply persistent correlations that need to be explained

  • what is the explanation in Zipf's case?

  • the rules of english grammar require certain words to be more frequent than others

  • example: DB query times

  • rank by time (histogram)

  • put on loglog axes

  • hmm this data looks weird now, it's not linear

  • it has three different behaviors

  • 1st part: power law decay

  • 2nd part: exponential decay

  • 3rd part: exponential decay

  • is that enough?

  • no, we must determine why each of those correlations fit

  • example: in Australia all businesses were required to register an ABN number for tax purposes, with a hard deadline

    • very similar to the healthcare.gov problems
    • at the 11th hour, people rushed to finish, and the system crashed
    • could that peak have been predicted?
    • yes, it's complicated, but a power law can do this
  • lesson: rank data by frequency (histogram) and try using log / loglog axes (quick sketch at the end of this section)

    • you can use this technique to predict spikes in noisy data
    • this allows you to see a strong correlation, the explanation is more difficult
  • conclusion: aim for consistency

  • learn to listen to your data
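
(not from the talk: a quick NumPy sketch of the "rank it, go log-log, check a linear fit" recipe; the Zipf-ish data is synthetic)

    # sketch: test whether ranked data follows a power law by fitting a straight
    # line in log-log space; a good R^2 only shows the correlation, the
    # explanation still has to come from you
    import numpy as np

    ranks = np.arange(1, 1001)
    counts = 1e6 / ranks**1.1 * np.random.lognormal(0, 0.1, ranks.size)   # fake Zipf-like counts

    log_r, log_c = np.log10(ranks), np.log10(counts)
    slope, intercept = np.polyfit(log_r, log_c, 1)

    predicted = slope * log_r + intercept
    ss_res = np.sum((log_c - predicted) ** 2)
    ss_tot = np.sum((log_c - log_c.mean()) ** 2)
    r_squared = 1 - ss_res / ss_tot

    print("power-law exponent ~ %.2f, R^2 = %.3f" % (slope, r_squared))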

Q&A:

  • Q: have you seen people fudging data in the operations world?

    • A: physicists are notorious for this, i haven't seen it as much in the operations world, i have been guilty of ignoring or overlooking strange noises or inconsistencies, also, be careful of making really complicated models (unless you know what you're doing), at some point you may feel a conviction about your model like Hubble did, and Hubble was correct in the end, important question for science: "how do I convince myself this model is true?", use this approach when making your models, look at Einstein's first 5 papers, everything is written in a way that anyone can understand, using very broad statements, then gradually narrows down and paints you into a corner of accepting his claim, and these were outrageous claims at the time, as simple as possible but no simpler, and this is now a rambling answer but it was fun to give
  • Q: Hubble's estimate was wrong because his data wasn't accurate, it seems in our world that our measurements are very accurate, does that change our approach?

    • A: so, do we need to do something differently from Hubble? i'm fond of saying that all measurements are wrong, you don't have his exact problem, but you should never trust the data, you can have completely accurate measurement of the wrong thing, (relays an anecdote about LHC measurements that were accurate to 6-sigma, but a 50 cent connector was not attached properly, so the data was super accurate garbage that was misleading people)
  • Q: a comment - we can measure time accurately in computing, but most data in operations is very inaccurate and noisy

  • Q: another comment - i'm struggling with the eventual consistency of the cloud; you have to deal with that eventual consistency in your monitoring too

    • A: sure, that's a different concept, but yes if you're using a distributed system, the "consistency" of your models will have to take these distributed computing problems into account
  • Q: in your last example with the power laws, you found the peak after the fact, does it work ahead of time?

    • A: yes, you can construct a power law prediction, it's not always correct, but it's another tool, requires more math
  • Q: would human behavior play into your prediction? i.e. you're counting on people to wait to the last minute?

    • A: no, i might point to human behavior as the explanation, but the prediction does not depend on that fact

The Final Crontab - Selena Deckelmann

  • works at Mozilla on the Socorro team

  • Socorro is a crash reporting system

  • about:crashes

  • click on a crash there and it takes you to socorro's web interface

  • crash reports from users are fun to read (shows some funny quotes and http://lqbs.fr/suchcomments/)

  • (showed some diagrams of the system architecture)

  • postgres is central to the system

  • it's the main architectural element

  • background tasks are also important

so, what is the final crontab?

*/5 * * * * socorro /usr/bin/crontabber
  • our old cron jobs had no tests

  • but they were so critical to our systems

  • everything was special shell scripts

  • jobs would kick off postgres stored procedures that would break if run twice and are very hard to debug

  • email from cron

    • everyone has this problem
    • worst month: 22k emails sent from cron
  • crontabber saved us from a lot of these problems

  • cron emails are a security blanket that we no longer need

  • use nagios/sentry instead

  • what's cron good for? it runs jobs on a predictable schedule

how socorro uses cron:

  • reports

  • postgres materialized views

  • status logging

  • jobs that don't fit into a queue system because of dependencies, complexity, etc.

  • github.com/mozilla/crontabber

  • pip install crontabber

here's what our jobs look like:

socorro.cron.jobs.matviews.ProductionVersionsCronApp|1d|02:00
...dozens of lines like this...
  • everything is a python class with a run method (illustrative sketch after this list)

  • common code (e.g. transactions, setup, teardown) is shared across jobs using decorators

  • jobs have a frequency ("1d") and start time ("02:00"), and the job code contains metadata like dependencies

  • uses configman (github.com/mozilla/configman) for parsing command line args vs. config files

  • github.com/mozilla/socorro/blob/master/config/crontabber.ini-dist
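
(not from the talk: an invented illustration of the shape described above, a python class with a run method plus frequency/dependency metadata; this is NOT crontabber's real API, see github.com/mozilla/crontabber for that)

    # illustrative only -- the class and attribute names here are made up to
    # mirror the description, not copied from crontabber
    class ReportsCronApp:
        app_name = "reports"
        depends_on = ("raw-crashes",)   # only runs after this job has succeeded
        frequency = "1d"                # matches the "1d|02:00" style config line
        start_time = "02:00"

        def run(self):
            # the real jobs kick off things like postgres materialized view refreshes
            print("refreshing report tables...")

    if __name__ == "__main__":
        ReportsCronApp().run()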

what do i like about this system?

  • no more shell scripts, that's the main thing, huge improvement
  • easier to write & test
  • automatic retries on failure
  • jobs wait on their dependencies to run (including when a dependency fails)
  • dependencies are documented in the code, automatically builds a visualization of job flow
  • automated nagios alerts, including sending triggered exceptions to IRC, no more email alerts
  • configurable number of failures before CRITICAL
  • unit test framework for jobs

problems:

  • configs are a bit complex
  • one-off runs aren't simple (stored procedures are designed to only run once per day)
  • no parallel execution yet, jobs are run linearly in dependency order, one possible solution:
    */5 * * * * crontabber --conf=/etc/cron1.ini
    */5 * * * * crontabber --conf=/etc/cron2.ini
    */5 * * * * crontabber --conf=/etc/cron3.ini
  • yea... we're not going there again :)

  • depends on python 2.6 or higher and postgres 9.2 or higher

Q&A:

  • Q: no question but just want to say that it looks awesome

    • A: thanks!
  • Q: have you had problems with circular dependencies?

    • A: not sure, we only have 4 levels of dependencies, so i don't think we've run into that yet
  • Q: how is the JSON postgres performance?

    • A: awesome, document size per row is tiny, main write DB is 1.5TB, half of that is probably JSON, way faster than hadoop, 1 hour for hadoop query -> 10 minutes for same query in postgres
  • Q: you're trying to get rid of shell scripts, did you rewrite in python or wrap them in python?

    • A: rewrite in python, bash is OK to start, but gets too crufty
  • Q: did you look at pgAgent? (job scheduling agent for postgres)

    • A: no we didn't look at that
  • Q: can it do cross-node dependencies?

    • A: what do you mean
  • Q: like if a job on machineA depends on a job on machineB?

    • A: no... right now it only runs on one machine
  • Q: is there a reason you didn't look into marathon or cronos for distributed cron?

    • A: we didn't need a distributed tool, crontabber is more about the framework for jobs, and all these jobs seemed pretty critical to the product so we wrote our own system to handle them
  • Q: do you handle timeouts & stuck jobs?

    • A: timeouts are built into the jobs themselves when necessary
  • Q: how do you determine what jobs are currently running? any visualization?

    • A: no visualization, but that info is in the crontabber logs

This One Weird Time-Series Math Trick - Baron Schwartz

  • more math...

  • this was going to be about math, but other people already covered it!

  • works at VividCortex - New Relic for the database

  • formerly worked at Percona

  • author of: High Performance MySQL & Web Operations

  • "anomalies" vs. "typical data"

  • anomaly = not typical

my worldview:

  • monitoring tools are not enough

  • monitoring = healthchecks, metrics, graphs

  • we need performance management

  • work-getting-done is top priority

  • we need more than recipes or functions to grab and apply, we need to know the right techniques to use

  • fault detection = work is not getting done, true/false

  • anomaly detection = something is not normal, uses probability & statistics

  • just because something is anomalous doesn't mean it's bad

what is the holy grail?

  • determine normal behavior

  • predict how metrics "should" behave

  • quantify deviations from prediction

  • do useful stuff with that data

  • at 1 second resolution, your systems are anomalous all the time

  • that holy grail is very practical, too practical for this talk

  • sometimes i want to do something fun

  • like use fun math

  • high level math is difficult to do at scale, it's better suited to academic papers

  • timeseries metrics are not always best displayed in strip charts

  • how many of you know these statistical / probability methods? (shows big list of methods)

  • how many of you have used the kolmogorov-smirnov test? (mentioned in Toufic's talk)

  • how many of you know these descriptive statistics methods? (wikipedia page on descriptive stats)

  • i don't know any of these

  • but basic statistics is good for quite a bit

  • learn the simplest, most effective approaches first

  • advanced stuff is there if you need it

  • you don't need a PhD to do this

  • spectrum of metrics analysis:

    turd polishing <-------- sweet spot --------> lily gilding

  • anomaly detection

  • anomaly -> deviation -> forecast/prediction -> central tendency/trend -> characterization of historical data

  • these are all separate problems with different techniques

  • dumb systems don't produce good results

  • if a system is getting work done, it's not faulty, no matter what a fancy technique says

control charts

  • draw lines for 3 sigmas
  • is the process within normal limits?
  • control charts assume a stationary mean
  • most data is not normally distributed
  • lots of problems at smaller time scales

first idea: moving averages

  • gives us a moving control chart
  • somewhat expensive to compute
  • current values are influenced by values in the past
  • a spike in data causes an inverse spike in the sigma values once that spike drops out of the window

exponential moving averages

  • more biased to recent history
  • cheaper to compute, only need to remember one value at each step and apply a decay factor
  • EWMA is a form of a low-pass filter
  • we can do the same thing we did earlier and make EWMA-based control charts (sketch after this list)
  • which is a little better than moving average control charts or plain control charts
  • one place where EWMA falls down is trends
  • the EWMA lags behind the actual trend
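
(not from the talk: a rough sketch of an EWMA-based control band; the decay factor and the 3-sigma band width are conventional choices, not his)

    # sketch: exponentially weighted mean + variance, flag points outside
    # mean +/- 3 * ewm standard deviation (alpha is the decay/tuning knob)
    import numpy as np

    def ewma_control_chart(values, alpha=0.1, nsigmas=3.0):
        mean, var = values[0], 0.0
        flags = []
        for x in values[1:]:
            band = nsigmas * np.sqrt(var)
            flags.append(band > 0 and abs(x - mean) > band)
            # update the exponentially weighted mean and variance with this point
            diff = x - mean
            mean += alpha * diff
            var = (1 - alpha) * (var + alpha * diff * diff)
        return flags

    data = np.concatenate([np.random.normal(100, 5, 200), [160], np.random.normal(100, 5, 50)])
    print("flagged points:", int(np.sum(ewma_control_chart(data))))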

double exponential smoothing

  • tries to solve the lagging by adding a prediction

  • once you do this, the alpha and beta factors become very sensitive

  • it's easy to way undershoot or overshoot the trend

  • holt-winters forecasting

  • DES plus seasonal indexes

  • more complex, slow to train, previous anomalies start getting built into the predictions

  • MACD - moving average convergence-divergence

  • comes from the finance world

  • finance is probably the most advanced application of these techniques, look there for inspiration

  • seems to be the most accurate

Q&A:

  • Q: what happens when you subtract current timeseries data from previous week's data?
    • A: yea i've tried that sort of thing, this is similar to holt-winters, what happens if you had an outage last week? then you will be predicting an outage next week, also, is week the right period? should you combine weekly/daily/hourly? should you use multiple "seasons" (i.e. if using weekly data, use 3 weeks in the past)?

The Lifecycle of an Outage - Scott Sanders

  • operations at github

  • tools + process = confidence

  • take any business metric and multiply it by your downtime

  • while you have downtime, you have no registrations, no revenue, etc.

  • human error is not random, it is systematically connected to people, tools, tasks, and operating environment

triggers:

  • detection & notification of a problem, get a human involved
  • alert fatigue is real
  • people tune out notifications
  • human fatigue is also a problem
  • if you are paged in the middle of the night
  • keep shifts as short as possible, right now github has 24 hour shifts
  • simplify overrides and give them out freely
  • be persistent, don't page every 15 minutes, page every 60 seconds until a problem is ack'ed
  • escalate quickly, don't let a dead battery cause your downtime to go on longer
  • be loud
  • create handoff reports for every on-call shift, spot trends
    • github has a chat command called "handoff" which generates a report & graphs of all incidents during an on-call shift

initial response:

  • establish command & identify severity, quickly
  • graphs are a great way to determine severity
  • chat bots are a great way to signal to both systems & teammates what is happening during an incident

github's monitoring stack:

  • graphite, 175k updates/sec

  • collectd (system level metrics), 1200 metrics per host

  • statsd (app level metrics), 4 million events/sec

  • and.. sFlow, SNMP, HTTP, etc.

  • logging: scrolls, splunk, syslog-ng

  • 1TB of logs indexed per day

  • special purpose monitoring directly covers business concerns

  • we don't consider a tool production ready until we can interact with it via chat

    • because that interface fits our culture
    • you should do the same for your culture
    • accept the processes that emerge and adapt your tools to augment those processes
    • don't force your team into processes

corrective action

  • collective knowledge & feedback loops
  • real example: last year, github was hit by a string of DDOS attacks
    hubot: nagios critical - ddos detected via splunk search
        (this also generates a github issue
        with the check result and a link
        to DDoS-mitigation.md playbook)
    tmm1: oh?
    tmm1: /arbor graph -1h @application
    hubot: <graph of incoming traffic>
    tmm1: /pager me incoming ddos
    tmm1: ...more steps to determine what's happening...
    other people join in
    jssjr: going to enable protection now
    jssjr: /shields enable w.x.y.z/24
    hubot: please respond with the magic word, today's word is knight
    jssjr: /shields enable w.x.y.z/24 knight
    jssjr: /graph me -1h @network.border.cp1.in
    hubot: <graph of incoming traffic at the router to verify the change>
  • playbooks are awesome
  • they allow you to distribute knowledge
  • as you come across a new problem or missing knowledge, add more to your documentation
  • tools make software less horrible
  • nobody should have to know everything about your entire infrastructure
  • make things safe for your less experienced engineers

create issues for postmortems

  • dedicate a repository for postmortems, for github this private repo is: github/availability

  • identify problems

  • involve many people

  • propose solutions

  • some incidents require a public postmortem to be released the same day

  • but the private postmortem can be open for weeks, to make sure we got it right and are completely satisfied the issue is fixed

  • this is how we close the loop on outages and make progress towards prevention

  • for example, some improvements for DDoS are: automatic mitigation, better monitoring, etc.

  • study the lifecycle of your outages

  • tools are complementary to your process, not the other way around

  • communication is the cornerstone of incident management

  • tools & process enable confidence

  • never stop iterating

Q&A:

  • Q: do you have problems with availability of your tools during outages?

    • A: absolutely, for example we keep the playbooks off-site and on-site to make sure they're always available
  • Q: you mentioned a huge graphite instance, what backend are you using? i don't think whisper would work?

    • A: we are using whisper
  • Q: tell us about the "shields up" command, what does it do? does it get logged somewhere?

    • A: well, our chat is logged, that gives us the timeline
  • Q: if you're fixing an outage and you need to clone something from github, what do you do?

    • A: ha ha well we work very hard to make sure that doesn't happen

A whirlwind tour of Etsy's monitoring stack - Daniel Schauenberg

  • software engineer on infrastructure team @ etsy

  • 25 million members

  • 18 million items listed

  • 60 million monthly visitors

  • 1.5 billion page views per month

  • all with a single monolithic PHP app

  • master-master mysql

  • we have some smaller services in java

  • and image service is not in PHP

  • we deploy a lot

  • the actual number doesn't matter much

  • what matters is how comfortable are you deploying a change right now?

  • when you start at etsy the first thing you do is deploy the site (adding yourself to the team section)

  • and then you watch the graphs

  • what's in the graphs?

ganglia:

  • system level metrics, everything specific to a node (requests per second, jobs queued, CPU, memory, etc.)
  • one instance per DC/environment
  • 220k RRD files
  • fully configured through chef roles
  • automatically runs all files in a certain directory to generate these stats

StatsD:

  • single instance, one server
  • traffic mostly comes from 70 web servers & 24 API servers
  • heavily sampled (10%), see the sketch after this list
  • graphite as backend
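
(not from the talk: a tiny sketch of what 10% sampling looks like on the wire; the metric name and statsd host are made up, but the ":1|c|@0.1" counter format is the standard statsd line protocol)

    # sketch: send a sampled counter to statsd over UDP; only ~10% of events
    # are actually sent, and the @0.1 sample rate tells statsd to scale back up
    import random
    import socket

    STATSD_ADDR = ("statsd.example.com", 8125)
    SAMPLE_RATE = 0.1

    def incr(metric, sample_rate=SAMPLE_RATE):
        if random.random() > sample_rate:
            return
        msg = "%s:1|c|@%s" % (metric, sample_rate)
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.sendto(msg.encode(), STATSD_ADDR)

    incr("pageviews")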

graphite:

  • application level metrics (not system level)

  • 2 machines: 96G RAM, 20 cores, 7.3T SSD RAID 10

  • 500k metrics per minute

  • mirrored master/master setup

  • sharded setup, 7 relays running per box, replicating data to the other server

  • the sharded setup also helps isolate problems (when something blows up, only one of the two servers is affected)

  • things to monitor when running graphite:

    • disk writes, disk reads, # of keys being written, # of values being written, cache vs. relay stats
  • don't monitor graphite with graphite

  • we monitor graphite with ganglia

syslog-ng:

  • web, search, gearman, photos, nagios, network, vpn

  • 1.2GB of logs written / minute

  • fully configured via chef roles (to determine which log files to send for a node)

  • rule ordering is important

  • syslog boxes also run a web frontend called supergrep which is a node.js app that basically runs "tail -f *.log | grep ..." over the web

  • syslog boxes also run etsy/logster

  • extracts metrics from log files

  • written in python

  • runs once per minute via cron

splunk:

  • supergrep only shows the last ~1 minute of data, how about longer?
  • splunk indexes all your log files
  • easy & powerful search syntax
  • saved searches
  • glorified grep

logstash:

  • experiment to replace splunk
  • easier to integrate with
  • easy to set up in dev environment (can't do this with splunk)
  • can logstash give our developers more insight while they are developing?

eventinator:

  • tracks all events in the infrastructure
  • chef runs & changes
  • DNS changes
  • network changes
  • deploys
  • server provisioning and decommissioning (we use dedicated hardware, no cloud)
  • 12 million events in the last 2 years
  • originally stored in one mysql table, now using elasticsearch (free search)

chef:

  • everything is configured with chef

  • same cookbooks in dev & prod

  • every node runs chef every 10 minutes

  • tons of custom knife plugins & handlers

  • we use spork for our workflow, which notifies IRC of changes / promotions, also kicks off a CI build

  • mentioned git repo vs. chef server being out of sync

  • "knife node lastrun web0200.ny4.etsy.com"

  • 120 recipes successfully run in 20 seconds

  • there's also a handler for failures, chef failures are automatically sent to a pastebin and posted in chat

nagios:

  • raise your hand if you have a strong feeling about nagios (everyone raised their hand)
  • raise your other hand if that feeling is love (only a few people)
  • well, too bad for most of you, computers don't care about your emotions
  • nagios works really well for us
  • 2 instances per DC/environment
  • we use nagdash to aggregate results across all instances, our main view of the world
  • interact via IRC, set downtime, see check results
  • used to have a manual deploy process (ssh into box, etc.)
  • why do that? we have a good way to test & deploy software
  • now they have a real deployment process, real CI process
  • feels just like working on the web app, that's a good thing

nagios herald:

  • adds context to nagios alerts
  • what are the first 5 things you do when you get paged?
  • you already have your phone in your hand, wouldn't it be great to get this information in the alert?
  • now our alert emails contain graphs, tables, output of shell commands, alert thresholds, alert frequency (# of times alert has been triggered in the past 7 days)
  • this is awesome, on-call is so much better now

ops weekly:

  • we have weekly rotations
  • at the end of your shift, you are given a survey
  • you have to specify which alerts were actionable, which were ignorable
  • # of pages during sleep vs. awake time
  • amount of time kept awake by alerts
  • can also scrape data from fitbit to get actual sleep times
  • and these results are discussed at the weekly ops meeting

summary:

  • use a set of trusted tools
  • enhance tools when they come up short
  • keep trying new things
  • write your own tools where applicable

See our blog, github, and other talks for more detail.

Q&A:

  • Q: how do you feel about kale?

    • A: kale is our anomaly detection stack, it's still an experiment, we're trying to figure out how and where to use it, it was recently broken by a graphite upgrade
  • Q: how self-service is your nagios setup? do you provide tools for devs to build monitoring?

    • A: not very self-service, still need to write your own checks & configs, but every team has an ops person, and all those people are excited about writing checks that make developers' lives better
  • Q: elaborate on logstash & elasticsearch?

    • A: right now it's an experiment, also using kibana, side-by-side with splunk, what parts of splunk work better in logstash? how useful is it for developers in their dev environment? those are the main points
  • Q: how many syslog servers? do you split the logs between multiple hosts for performance reasons?

    • A: two, and I think they both get the same data for redundancy purposes

Wiff: The Wayfair Network Sniffer - Dan Rowe

  • wayfair.com

  • leads the infrastructure tools team at Wayfair

  • two sub-teams: internal tools (customers are employees) and dev tools (customers are engineers)

  • wayfair is an online retailer

  • 7 million products

  • 16 million visitors per month

  • in a lot of these kinds of presentations someone presents a homegrown tool and everyone is like

  • "why did you do it that way? why didn't you use X?"

  • i'm going to try to cover those questions ahead of time

our setup:

  • active/active DC setup
  • main sites -> loadbalancer -> PHP web server farm
  • java / ASP.net for other stuff

logging overview:

  • syslog, app log, network traffic, commits
  • logstash
  • elasticsearch
  • kibana, dashboards, graphite, zabbix, ad hoc querying & alerting

what is wiff?

  • out of band traffic sniffer and analyzer

  • wireshark as a service

  • packet processing pipeline

  • feed in packets -> process -> output -> report / analyze -> profit

how do you feed in the packets?

  • wireshark / NIC level

  • pcap files (ring buffer or tcpdump files)

  • rabbit mq

  • once you feed in the packets, configure which protocols, ports, etc. you are interested in

  • currently HTTP, HTTPS (needs private keys to decrypt, take care not to log the request/response bodies anywhere..), and TCP are supported

  • showed a typical HTTP processing workflow (big diagram)

  • reporters output the data somewhere

  • JSON, elasticsearch, rabbitmq

  • wiff is the beginning of the pipeline

  • we have some example kibana queries to get started with

  • once it's in elasticsearch it's up to you to do the analysis

  • alerting: doesn't exist yet, want to build an alerting system for ES

pessimism:

  • if we already have web server logs and application logs, why do we need this?

  • this is just another vantage point to gather this data

  • it's a companion tool

  • where does it fit?

  • you tell me, it can track both inbound & outbound traffic

  • it can spot problems before the request hits a given layer

  • what if your LB or webserver is misconfigured?

  • what if the request never reaches where you expect it to reach?

  • what if your server segfaults?

  • can spot problems that don't show up in logs

  • real world example: Set-Cookie was being specified multiple times per response, but their logging was only showing it as set once

  • because it's out of band, it doesn't matter if it crashes, it doesn't matter if it goes down

  • it doesn't require you to make changes to your application

  • very little performance overhead

  • (i think all of these arguments apply to using plain old tcpdump?)

  • MOAWSL: mother of all web server logs

  • we have this layer that aggregates all web requests in a single log file, standard format

  • but if you didn't have this layer, wiff could be used to do that

other benefits:

  • runs on windows
  • can be used to watch network traffic of proprietary / third party software
  • packet RTT
  • obtain network timing information
  • call frequency (how often is this web API getting called?)
  • showed screenshots of command line tool & kibana dashboard

todo:

  • improve SSL decryption performance (do it in the background)
  • better reporting

notes:

  • needs some monitoring
  • watch for dropped packets, un-stitchable requests
  • no support for SPDY or websockets
  • YMMV, it works for us, not used by anyone else yet

github.com/wayfair/wiff

Q&A:

  • Q: do you instrument wiff before & after the load balancer? to track requests through the system?

    • A: uhh we can see the source/destination and track them that way, but that isn't done automatically
  • Q: anything on the roadmap for SIP traffic?

    • A: no, but we have a big call center, i can see it being useful there
  • Q: what is the throughput?

    • A: we have 10G NICs, it's only using ~1G in testing, depends on tcpdump buffer settings and how much your NIC can handle

Web performance observability - Mike McLane & Joseph Crim

  • work at Godaddy

  • we went full prezi, so bring some dramamine

  • measure performance

  • is it good enough?

  • if not, look for bottlenecks

  • how are people using our hosting?

  • setting up blogs, PHP apps

  • what are the common use cases?

  • know your customer

  • so... lots of PHP benchmarks

  • wordpress, joomla, drupal

  • response time is very important for your customers and their customers

  • people leave and/or complain when things are slow

  • imagine loading screens in video games, nobody likes loading screens

  • google has shown that page load time has a direct impact on how likely a person is to make a purchase

  • google ranks your site based on the load time

webrockit:

  • webrockit is our performance testing stack

  • how long does page load time take in a real browser?

  • data collected has to be real, match up with real users' experience

  • it needs to be understandable by our internal users

  • webrockit uses headless browsers to calculate page load time

  • time to first byte

  • number of assets

  • time to complete loading assets

  • 100 different stats related to page load time

why not use a commercial offering?

  • too expensive for the amount of traffic we want to pump through
  • data resolution wasn't good enough
  • didn't include all the stats we wanted
  • we wanted to feed data into graphite
  • no commercial offering gave us all the features we wanted

how about open source?

  • similar to commercial offerings

  • we looked at: casperjs, selenium, watir, ghost.py

  • none of them had all the parts we wanted

  • so we decided to build our own and open source it

  • working prototype in 3 days

  • using phantomjs, wraps headless webkit with an API

  • and it was spot on with how real browsers work, gave accurate measurements

  • the API lets you do some cool stuff like overriding which IP to use for host

  • and exposes all the internal timing / metrics in the browser

example:

  • let's say we want to benchmark performance across changes in our app
  • let's use a standard LAMP stack, running wordpress, using stock versions of everything
  • no optimization ahead of time
  • let's point webrockit at it
  • start by focusing on time to first byte
  • test #1: enable compression
    • this made time to first byte slightly worse
    • that's useful to know
  • test #2: switch from mod_php to fastcgi + php-fpm
    • no speed change, but more stable looking graphs
  • test #3: enable APC
    • APC is an opcode cache for PHP, so source doesn't need to be compiled for each request
    • gave a great improvement in response time
  • test #4: upgrade package versions
    • php 5.3 to 5.5, apache 2.2 to 2.4, fastcgi -> mod_proxy_fcgi
    • another good improvement

The end result is that we had a nice workflow for testing and iterating on performance changes.

how does webrockit work?

  • we decided to use sensu

  • which is normally used for monitoring

  • but had all the basic pieces we needed for building a performance testing suite

  • we wanted the design to be API-first, REST API

  • written in jruby & sinatra (jruby = easier deployment)

  • uses Riak as the main source of truth for storing results

    • the data structures used are really simple, would be easy to port to other data stores
  • checksync API, webrockit API -> write checks to disk for sensu

  • all metrics go into graphite

web UI:

  • uses rails
  • set up a poller, e.g.: AWS east & west, digital ocean, internal network, etc.
  • then set up a check: name, run interval, which poller to use, URL, ip address override (to skip DNS lookup)
  • you can view a queue of all the jobs, each job has some debugging info in case there's a problem
  • wait for the job to run for a while then you can view results
  • graphite dashboards (high level overview of a few metrics)
  • cubism graphs (condensed strip charts, very easy to spotcheck)
  • explorer view (drill down into those 100 different finegrained metrics, add multiple targets to a graph to visualize better)

future:

  • virtualization
  • introduce packet loss / traffic shaping / bandwidth limits / TCP level network tweaks
  • better analysis (see all the previous talks on math & anomaly detection)
  • heatmaps
  • events & errors (200 expected and now it's 404 or 301, page size drastically changed, etc.)
  • better dashboards, what is the state of the art? can we use or feed into those systems
  • better debian support (we're a RH/centos/fedora shop)
  • real configuration management (we are both a puppet & chef shop, which drew applause from the crowd, they are using bash scripts to install everything right now)

sound interesting?

@M_richo, when testing and monitoring collide:

  • serverspec + sensu

  • serverspec = rspec testing framework for server configurations, platform agnostic, 26 resource types

  • very fast, example: 266 tests in 2.78 seconds

  • when do you want to write serverspecs? when you're writing infrastructure as code to validate your code

  • you can also run your serverspecs on your live servers, why? because it's quick and a cheap way to verify everything is working

  • great addition to your monitoring system

  • let's put this data into sensu

  • first attempt: wow we have a lot of failures, and i have no idea what's broken

    1. use rspec's json output format
    2. sensu has a feature to send check results over a socket
  • these two features allow you to split the checks up; instead of one huge summary check for all servers you now have a bunch of separate checks, easy to see failures (sketch at the end of this section)

  • summary:

    • write tests for your systems / infrastructure code
    • don't duplicate your effort, run your serverspecs on production
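
(not from the talk: a rough sketch of pushing one result into the local sensu client socket as its own check; sensu's client socket listens on localhost:3030 and takes JSON check results, the check name and output here are invented)

    # sketch: report one serverspec example's result to the local sensu client
    # socket so it shows up as a separate check
    import json
    import socket

    def send_check_result(name, status, output):
        # status follows the nagios/sensu convention: 0 = OK, 2 = CRITICAL
        payload = {"name": name, "status": status, "output": output}
        with socket.create_connection(("localhost", 3030)) as sock:
            sock.sendall(json.dumps(payload).encode())

    send_check_result("serverspec-nginx-running", 0, "Service nginx should be running")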

@laprice, monitoring postgres performance:

  • hardware determines: memory, random_page_cost, tablespaces

  • workload determines: query_planner, autovacuum, stats_collector

  • what is autovacuum?

  • cleans out dead tuples

  • reorders pages on disk

  • thresholds can be set per table

  • one of the primary culprits for "my database is slow and i don't know why"

  • highly tunable: workers, nap time, duration, timeout, max age, cost delay, cost limit, etc.

  • focus on the tables that need it most (the largest tables)

  • track dead tuple count & percentage (>5%)

  • main question to answer: are my tables being vacuumed when they should be?

  • you can get this info by querying pg_stat_all_tables, see the docs


@petecheslock, 17th century shipbuilding and your failed software project:

  • aka - why your project managment sucks

  • the Vassa

  • grandest ship built by the royal swedish navy

  • the most expensive project ever undertaken by the country at the time

  • after sailing less than one mile a gust of wind hit the ship, it tipped over, and it sank to the bottom of the sea

  • 50 years later they recovered the ship and analyzed what went wrong

  • the captain who survived was thrown into jail, he was asked if the crew was drunk, they were not, he was later released

  • it tipped because it didn't have enough ballast

  • why? it started off as a 108 foot ship

  • then was changed to 111 feet (originally wanted 120 feet)

  • then they wanted to add another gundeck

  • sure, ok, then they needed to scale it up to 135 feet

  • (nobody in sweden had even built a ship with two gundecks yet)

  • they kept revising the number of guns, size of guns

  • rush job

  • the king also needed to have a bunch of ornate carvings added, making it more top heavy

  • most of the design came from the king's head

  • they did a lurch test (30 men running back and forth on the deck, believe it or not), and they had to stop because the ship was about to tip over

  • the design changed so many times, they needed to add ballast, but there was no place to add it

  • if they did add ballast, the lower gun deck would have been underwater

  • so you may be thinking..

  • why did they launch if all the tests failed??!

  • if they didn't launch on time, the people involved would have been subjected to "the King's disgrace" (execution?)

  • to recap:

    • schedule pressure
    • changing needs
    • no specs
    • lack of project plan
    • excessive innovation
    • secondary innovations
    • requirement creep
    • lack of scientific methods
    • ignoring the obvious: launched after failed tests
  • the lesson: those who ignore history are doomed to repeat it!


@hypertextranch, monitoring & inadvertent spam traps:

  • i work at wordpress.com as a developer

  • i've never actually seen nagios

  • but i've infiltrated your ranks

  • we see a lot of spam

  • any developer can make their own stats

  • memorization < (intuition + investigation)

  • how i found a random spammer

  • i deployed elasticsearch and checked our monitoring to see if it made things better or worse

  • i saw queries stacking up

  • only 3 nodes pegged CPU, all other nodes were fine

  • if this were a problem in my code, it would have caused a problem on all nodes

  • every blog has a main instance and is replicated to two extra machines

  • so it seems like this is a problem with a single blog

  • some user scripted their blog to pull in articles from the washington post, splice in some affiliate links, and repeat every 30 seconds

  • every time a site gets marked as spam by our filter, it causes the articles to be reindexed

  • lesson: your devs should look at monitoring because they probably have more intuition about problems

  • automated monitoring might not have caught these three bad nodes

  • an ops dude would have noticed that three nodes were pegged

  • but i as a dev was able to intuitively pick up on the problem right away


Chess - a reflection of life:

  • "Chess is everything: art, science, and sport"

  • tournament players lose 10-15 pounds after a tournament, physical and mental stress for 8 hours a day burns calories

  • you are the winner even if you lose, you can learn from every match

  • the game is egalitarian, the only thing that matters is the moves

  • it doesn't matter what your age or gender or race is

  • ego is the enemy of learning & growth

  • ego is an anchor

  • accept that there is more for you to learn, and you will

  • chess exemplifies the power of cause and effect

  • your moves at the start are directly related to the moves at the end

  • time & timing are everything

  • a good position fades quickly

  • the game is all about patterns

  • our brain is built to detect patterns

  • control the center applies to chess and to life and business

  • ran out of time


@isaacfinnegan, Expanding Context to Facilitate Correlation:

  • basically i want to show off some cool stuff

  • "we've got great tools"

  • really?

  • i have to use 5 different tools to get stuff done, they all have different, crappy interfaces

  • github.com/evernote/graphite-web

  • templates for graphite

  • NagUI: federated nagios interface

  • very fast (especially compared to the classic interface)

  • bulk viewing, bulk actions

  • drag & drop custom views, saved views, share views with your team

  • graphite integration

  • acknowledge + send to jira

  • mobile interface too

  • CMDB: pull data from different tools into one view

  • nagui + jira + graphite

  • i think this is the next step for monitoring tools

  • instead of monolithic rewrites, integrate existing tools


Feature Knobs & Deploy Knobs:

  • feature flags, feature toggles, config flags

  • they're awesome!

  • doing 100 deploys a day is awesome!

  • deploy dark and turn up slowly for everything

  • this leads to a problem though

  • over time, we have a million feature flags and it's not clear which ones can be safely turned off/on

  • you need a promotion process, cleanup process, which is tough

  • use feature knobs wisely...

  • what about deploy knobs?

  • with a deploy knob, once you turn it up, you can't go back

  • this makes them self-cleaning


some dude running linux tried to present but couldn't get the display to work


@michaelgorsuch, github ops, canary.io:

  • scratching an itch via small, composable tools

  • measure URL performance & availability

  • at high resolution (sub-second)

  • multiple vantage points

  • based on libcurl (ubiquitous and provides good stats)

  • sensord: gets a blob of JSON with a list of URLs

  • it measures them with libcurl and spits out JSON, that's cool

  • now i have all these sensord instances running around the globe

  • what do i do with this json?

  • i need to aggregate

  • new tool: canaryD

  • siphon off the useful info, store it in redis for the past 5 minutes (starting small...)

  • exposes the stats via REST API

  • even with 5 minutes, that's 1200 measurements

  • compare that to nagios's check_http, that would be like 1 measurement per 5 minutes in nagios

  • so why not feed this high resolution data into a nagios check?

  • what if i want to share this data?

  • i want to make this open source, infrastructure independent

  • open measuring for an open web

  • it "launched" 3 days ago, by that i mean i tweeted a gist

  • it's running in DO, but rackspace offered a bunch of servers

  • someone already built a dashboard

  • github.com/canaryio

  • i'm learning go, don't be scared by the code


Sergey Fedorov, netflix, Stateful monitoring:

  • couldn't present due to technical difficulties

Martin Parm, spotify, Distributed Operational Responsibility:

  • first person to present using linux!

  • give ops responsibility back to developers

  • capacity planning

  • monitoring

  • config mgmt

  • instead of doing this for them, we give them the tools to do this

  • why do this? doesn't this seem like a bad idea?

  • we have so many changes and engineers we can't do it all with an ops team

  • so why not get the right people in front of a project the first time?

  • if you break something, you need to fix it, better accountability

  • we want the teams to work with technologies

  • how about monitoring?

    • devs need training, but not a whole new education, just enough to solve their problems
    • devs need autonomy, and will do stupid things (ops does stupid stuff too)
  • alerting: metrics & events -> magic monitoring pipeline & alerting rules -> pagerduty alerts

    • our alerting stack: ffwd (homegrown stat forwarder), apache kafka, riemann, even more stuff
    • we don't need them to learn or touch the internals of that alerting stack
  • different abstraction levels

  • script hooks, drop a script in a folder

  • write your own python script with riemann library

  • write your own rules, provide tools for that

  • impact on monitoring?

    • more monitoring, better monitoring
    • monitoring platform
    • more teaching, less babysitting / hand-writing monitoring code

Charlie, cofounder of Hosted Graphite, protecting your lizard brain while on-call:

  • failures are very stressful at Hosted Graphite, people depend on us for their monitoring

  • feedback loop: failures -> more checks -> more alerting -> more docs

  • things are getting better, but...

  • but failures start training you on a primitive level, that certain things are bad

  • you start to learn that your phone is a source of pain and fear

  • things were alright until they weren't

  • panic, jumpy, stressful

  • why is that the reaction? you need to be calm to solve the technical problem

  • and most outages aren't that serious

  • i have to remind myself "it's not that bad"

  • but my lizard brain is fucking terrified no matter what

  • if you hear an incoming text, and it isn't even your phone, and you jump, then that's not right

  • just let people know that you're down, that can relieve some stress

  • is that stress symbolic of something else? are you afraid of failing? your company failing?

  • what are other on-call people thinking?

  • i've heard the same stuff from everyone.. big or small company, big or small team, one person or multiple people on-call

  • having someone else on-call in front of you is helpful

  • turn off all other notifications on your phone

  • what can we do better? i want to talk to people about this

  • what can companies do to improve mental health of those on-call?

  • i'm gonna stand by the door back there and i want to talk to you

Sponsor Plug: New Relic - Chase

New Relic browser / front end:

  • how fast your pages load
  • how fast are your ajax calls?
  • JS error tracking

interesting stuff we found:

  • error messages get translated, "Syntax error" vs. "Erreur de syntaxe", they get reported differently
  • his site had no ajax, but there were a ton of AJAX errors
    • what is this stuff?
    • the majority are toolbars, malware, etc.
    • browser extensions, google translate, etc.
    • some are pretty nasty, "Skype click-to-call" got into an infinite loop and triggered tens of thousands of errors

Sponsor plug: Elastic Search - Rashid

  • who uses ES? show of hands

  • 70% use it vs. 30% don't (hmm... interesting..)

  • i'm going to give a workshop on wednesday, so i'll demo a lot more then

  • but if anyone has any questions, feel free to ask me now

  • Q: why do we need log searching? why elasticsearch?

    • A: a graph shows you when something might be wrong, but logs allow you to go back to the original event and see what exactly happened
  • Q: what did you have for breakfast?

    • A: yogurt, granola, melon
  • Q: do you want to buy a musket?

    • A: yes, to defend myself from the government
  • Q: did you know you can 3d print a musket?

    • A: yes, i'm terrified of this
  • Q: does ZK cluster discovery work?

    • A: not used it, zen (?) discovery works
  • Q: can you talk about jepsen and ES?

    • A: there's a recent blog post about it, it's a tough subject, distributed is hard, we don't have an answer for everything but we're doing pretty good
  • Q: roadmap?

    • A: for what?
  • Q: kibana?

    • A: will talk more on wed, better aggregations / facets, which are useful for turning logs into charts, "top N query" reduced from N queries to 1
  • Q: when is ES going to learn how to reindex something something without something?

    • A: push harder if you want this feature

Sponsor plug: Librato - Joe

  • CTO of librato

  • librato is a platform for storing, monitoring, and alerting on custom metrics

  • composable monitoring system tailored to you

  • in the past that meant building your own solution from scratch with a bunch of OSS

  • librato lets you correlate arbitrary time series with each other

  • marking events like deploys & config changes

  • no proprietary agent, everything works over HTTP

  • 80-100 products (middleware, web servers, databases, etc.) know how to speak to librato via opensource plugins

  • if you can write to stdout, you can capture that log output and send to librato as metrics

  • new features:

    • more integrations
    • better alerts - tune the sensitivity of alerts using historical data
    • better on-call information - associate URLs / documentation with alerts, find all previous occurrences of an alert
    • "composite metrics" - custom query language to manipulate raw data, calculate ratios, aggregates (looks like graphite's URL/function interface)

Sponsor plug: Pagerduty

  • pagerduty sits between your monitoring systems and your on-call people
  • we integrate with everyone
  • we send SMS/email to the right person
  • we take reliability seriously, full end-to-end tests
    • we have 4 android phones in our lab constantly receiving texts to ensure deliverability!

new stuff:

  • multi-user alerting
  • on-call handoff notifications
  • SSO
  • outbound webhooks

multi-user alerting:

  • we found this is a great way to do onboarding for new ops people
  • put the new guy on-call alongside a veteran so they can get trained up in being on-call
  • multi-user alerting is also good for higher levels of escalation
  • for example if two people sleep through the alert, then set up your third escalation level to alert everyone instead of continuing to retry people one-by-one

handoff notifications:

  • notify by email, sms, and push when you go on or off call

outbound webhooks:

  • now has integration with slack, hipchat, flowdock, etc.
  • live demo of webhook FAILED, kinda awkward... lolz
  • oh wait he just yelled from the crowd that it worked (sure it did)

Sponsor plug: Dataloop.io - David

  • lots of teams spend a lot of time building monitoring solutions using OSS

  • but as soon as you try to get developers or QA to use it, you run into problems

  • high learning curve, confusing documentation, difficult interfaces

  • we want to un-silo the monitoring tools

  • as we move to microservices, traditional monitoring gets more difficult

  • we are building the monitoring tool for microservices

  • easy to use

  • flexibility of nagios / graphite, but with drag & drop

  • easy to create alerts

  • use existing nagios check scripts

  • speaks graphite/statsd/carbon protocol

  • create hierarchies with drag & drop

  • use tags

  • write plugins in any language

  • another thing we do besides config is visualization

    • nagios, collectd, and statsd all in one place
    • create dashboards via drag & drop, resize
    • send dashboard reports via email (good for weekly / monthly reports to management teams)
    • embeddable widgets
  • next, alerting:

    • big feature is multiple triggers for alerts
    • build context for your alerts
    • condition A and condition B and condition C
    • e.g. both web performance & service up/down check must trigger before alert goes off
    • this decreases alert spam
  • actions:

    • email / SMS / phone
    • send to jira
    • trigger event handlers (any language)
  • driven by API, command line tool, or github

  • launching later this year, beta testing now
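
Since the pitch leans on protocol compatibility, here's a quick sketch of the two wire formats that implies: the carbon plaintext protocol (TCP, port 2003 by default) and the statsd line format (UDP, port 8125 by default). Hostnames and metric names are placeholders, and this illustrates the generic protocols, not Dataloop's own implementation:

```python
# Minimal sketch of the carbon plaintext and statsd line protocols.
# carbon: "name value timestamp\n" over TCP; statsd: "name:value|type" over UDP.
import socket
import time

CARBON_HOST, CARBON_PORT = "graphite.example.com", 2003   # placeholder host
STATSD_HOST, STATSD_PORT = "statsd.example.com", 8125     # placeholder host

def send_carbon(name, value):
    """Send one datapoint using the carbon plaintext protocol (TCP)."""
    line = "%s %f %d\n" % (name, value, int(time.time()))
    with socket.create_connection((CARBON_HOST, CARBON_PORT), timeout=5) as s:
        s.sendall(line.encode("ascii"))

def send_statsd_counter(name, value=1):
    """Increment a counter using the statsd line format (UDP, fire-and-forget)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(("%s:%d|c" % (name, value)).encode("ascii"),
                (STATSD_HOST, STATSD_PORT))

send_carbon("web01.nginx.requests_per_sec", 123.0)
send_statsd_counter("deploys.completed")
```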

Sponsor plug: Salesforce

no-show

Sponsor plug: Puppet

  • who doesn't know what puppet is?

  • we have commercial & open source offerings

  • who's coming to the puppet party tonight?

  • it's really hard to get there: left, then right

  • we're hiring, a lot

  • (scrolls through dozens of job listings)

  • can everyone from puppet labs stand up?

  • (like 20 people stood up)

  • come to puppetconf in SF, september 20-24

  • all kinds of presenters, lots of topics

  • early bird pricing ends this month

Sponsor plug: pingdom

interesting numbers from our customers:

  • 14 billion checks per month

  • 9.4 million detected outages per month

  • 8 million alerts sent per month

  • total downtime of 500 million minutes, across 450k customers

  • what can we do at pingdom to help with this?

  • #1 most requested feature: alert management

new feature: BeepManager

  • pingdom.com/beepmanager

  • team members can customize their method of contact

  • automated escalations

  • integrate with other systems (nagios, new relic, rackspace cloud monitoring)

  • alert flood protection

  • access levels

  • alert templates

  • most important feature of monitoring system is that it works for your team

  • we are committed to making our tool work for your team

Sponsor plug: Grok - Jared

  • numenta.com/grok

  • we do anomaly detection

  • we've heard all about it these two days

  • how do we solve it? science

  • years of research, we've made some breakthroughs

  • automatic & unsupervised machine learning on timeseries data

  • open source at numenta.org

first product: grok

  • mobile app

  • automated model creation & monitoring for AWS instances

  • showed some examples

  • automatic anomaly detection in CPU load

  • they used this to catch someone running manual builds on a build server

  • required no setup / training

  • free trial: simple to get running, 10 servers, no time limit

Sponsor plug: Big Panda

  • we launched our private beta yesterday

  • we spend a lot of time tweaking tools, building thousands of alerts

  • what do you use to manage your response to issues?

  • jira, zendesk, email

  • those tools are meant for humans

  • they were not built for responding to tons of automatically created incidents, flapping alerts, etc.

  • bigpanda is basically jira for ops

  • live demo

  • home page "OpsBox" shows all alerts

  • UI should be very familiar to gmail users

  • star alerts, mute alerts

  • how do I rise above the noise of alerts?

  • shows a timeline of alerts, when did it start warning, when did it reach critical, when did it go back to normal

  • (pretty cool looking)

  • shows a lot more data in context

  • "Changes" view: event log of changes in your infrastructure

  • we're already helping people today respond to alerts in a much more intelligent manner

Sponsor plug: Datadog - Alexei

  • cofounder and CTO of Datadog

  • hosted monitoring service

  • easily monitor from 5 to 50,000 hosts

  • what have we been working on the past year?

  • better graphs

  • better visualizations, histograms

  • better counts & counters

  • heatmaps

  • better alerts, more sophisticated alerting

  • the ability to embed disturbing images into your dashboards (nicolas cage meme pics)

  • more integrations: fastly, google cloud, slack, new relic, 50-60 integrations total

  • monitoring is fun!

  • who here has learned a lot these past two days? (everyone)

  • who here wants to work on monitoring more? (still everyone)

  • that's good news because we're hiring ha ha laffs
