Monitorama 2018 Day 2

Want to solve Over-Monitoring and Alert Fatigue? Create the right incentives!

Kishore Jalleda - (fmr.) Yahoo and Zynga

Patients in hospitals have heart monitors, and they overalert by a ton. So occasionally patients actually die because of a missed/ignored alarm.

Zynga: 100k alerts/month across 25+ studios. 50+ SREs in 3 locations.

How to fix this fatigue and anxiety?

  • Can't add more people - doesn't scale.
  • Can't color code more - ran out of colors.

Goals:

  • Under 2 alerts/shift
  • Dev on call (direct escalation)
  • SRE does Eng (tooling, etc)
  • No more TVs kthxbai

Started having bad outages. What was the fix?

  • Deny SRE coverage based on alert budgets.
  • Go over the budget and you're kicked out! No more SRE support.

Woooooooooooo forcing functionnnnnnn.
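He didn't show any tooling for this, but the budget check itself is tiny. A minimal sketch of the idea - the 2-alerts/shift number comes from the goals above, while the alert log format, shift count, and function name are all hypothetical:

```python
from collections import Counter

ALERT_BUDGET_PER_SHIFT = 2  # the "under 2 alerts/shift" goal from the talk
SHIFTS_PER_MONTH = 60       # assumption: roughly two 12-hour shifts per day

def teams_over_budget(alert_log):
    """Return teams whose average alerts per shift blew the budget this month.

    alert_log is a list of (team, timestamp) pairs pulled from your paging system.
    """
    per_team = Counter(team for team, _ in alert_log)
    return {
        team: round(count / SHIFTS_PER_MONTH, 2)
        for team, count in per_team.items()
        if count / SHIFTS_PER_MONTH > ALERT_BUDGET_PER_SHIFT
    }

# 160 pages in a month for one team -> ~2.67 alerts/shift -> no more SRE support.
alert_log = [("payments", None)] * 160 + [("chat", None)] * 30
print(teams_over_budget(alert_log))  # {'payments': 2.67}
```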

People at Zynga lost their shit when he did this. Apparently they are whiners who whine a lot. Whether this is actually true depends, likely, on who you ask.

"Sorry, but not every team deserves SRE support; it must be earned." DUDE I'M SORRY THIS IS THE OPPOSITE OF DEVELOPER EMPATHY. You got yourself into this mess by not bridging Dev and Ops anyway. So you fixed it by... creating more hostility?!?!

This is a pretty controversial talk, at least from my perspective. It also, unintentionally, exposes what Zynga was like ca. 2013. And I'm glad I was not there for that period.

Okay now we're getting to "How do you reduce alert noise?" Lol ok so you create a best practices doc w/ 15m fix windows etc.

OMG he is encouraging "Public Shaming" and "Peer Pressure". What?! This talk is like the token conservative on MSNBC.

I'm not surprised this worked, but he even acknowledged that some folks left the company over this initiative. If this is okay at any scale (more than one or two) then the issue is who you hired in the first place imho.

They did get a 90% drop in false alarms; uptime up by a whole 9; unicorns and puppies frolicked through the meadows.

Next-Generation Observability for Next-Generation Data: Video, Sensors, Telemetry

Peter Bailis - Stanford CS

Holy crap they have a dream team at Stanford working across this crazy broad stack that's all data centric.

Cites: Hidden Technical Debt in Machine Learning Systems (Google paper; I think we had it in the reading group a while back).

"What if anyone with domain expertise could build their own prod-quality ML products using data at scale?" No PhD in ML, no expertise in systems, no latest hardware... it's happened before dun duuuun (teh googz)

300 hours of video uploaded to YouTube every minute.

Ok now we're talking about neural networks. In theory you can just download pretrained weights, run the model over a video, and count something in the video.
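He didn't show code, but the "download weights and count things" idea looks roughly like this - a minimal sketch using a stock torchvision detector, where the model choice, the 0.5 score threshold, the "car" class, and the video filename are my assumptions, not anything from the talk:

```python
import cv2    # pip install opencv-python
import torch
import torchvision

# Stock COCO-pretrained detector (torchvision >= 0.13); class 3 is "car" in COCO.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()
CAR_CLASS, SCORE_THRESHOLD = 3, 0.5

def count_cars(video_path, frame_stride=30):
    """Count car detections on every Nth frame of a video file."""
    capture = cv2.VideoCapture(video_path)
    counts, frame_idx = [], 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if frame_idx % frame_stride == 0:
            # BGR uint8 frame -> RGB float tensor in [0, 1], shape (3, H, W).
            tensor = torch.from_numpy(frame[:, :, ::-1].copy()).permute(2, 0, 1).float() / 255
            with torch.no_grad():
                detections = model([tensor])[0]
            keep = (detections["labels"] == CAR_CLASS) & (detections["scores"] > SCORE_THRESHOLD)
            counts.append(int(keep.sum()))
        frame_idx += 1
    capture.release()
    return counts

print(count_cars("traffic_cam.mp4"))  # hypothetical file; e.g. [4, 5, 5, 3, ...]
```

Which is exactly why the next point matters: a giant general-purpose detector on every frame of every feed is absurdly expensive, hence the specialized models.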

tl;dr specialized NN models use 10,000x fewer FLOPs and run 300x faster on GPU. (shocking!)

They've used FrameQL to write SQL against video feeds. This is really cool! They have some cool projects at dawn.stanford.edu.

tl;dr

  • Need new forms of infra for dealing with models to handle the rising volumes of data.
  • Abstractions + system optimizations will continue to prove necessary for modularity, scale, and impact.

Coordination through community: A swarm of friendly slack bots to improve knowledge sharing

Aruna Sankaranarayanan - Mapbox

Treating this as something of a retrospective on becoming an SRE at Mapbox. They use a 3-month buddy system with increasing amounts of responsibility.

This talk is really about their Slackbots. They run each command as an AWS Lambda.

A lot of their bots talk to SumoLogic, which is kinda-SQL-but-not-really. They just write the query inside Slack.
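They didn't share the bots themselves, so here's just the shape of the pattern: a minimal Lambda handler for a Slack slash command behind an API Gateway proxy integration. The command semantics and run_sumo_query are hypothetical placeholders.

```python
import json
from urllib.parse import parse_qs

def run_sumo_query(query):
    """Placeholder: call the SumoLogic search API and summarize the results."""
    return f"(pretend these are results for: {query})"

def handler(event, context):
    """AWS Lambda entry point for a Slack slash command via API Gateway proxy."""
    # Slack posts slash commands as application/x-www-form-urlencoded.
    params = parse_qs(event.get("body", ""))
    query = params.get("text", [""])[0]
    user = params.get("user_name", ["someone"])[0]

    reply = run_sumo_query(query)
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({
            "response_type": "in_channel",  # visible to the whole channel
            "text": f"@{user} asked: `{query}`\n{reply}",
        }),
    }
```

Real bots also have to verify Slack's signing secret and answer within three seconds (or use the response_url for slow queries), but that's the gist of "each command is a Lambda."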

There wasn't much new to me here. The overarching idea is that SRE should be a shared experience, so tools should reflect that and work in a manner conducive to sharing.

Automate Your Context

Andy Domeier - SPS Commerce

Lots of high-level velocity ideas. Complexity is increasing, and one of the main reasons is to enable velocity.

An org's efficiency can be directly correlated with how effective you are with your available context.

Ops confidence: If it's important at readiness, it's important over time.

He has a lot of things set up that automatically push context around. So like lambdas that flow out of - and back into - JIRA!
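He didn't show the Lambdas themselves; as a hedged illustration, the "push context into JIRA" half can be as small as a function that comments on a ticket via the Jira REST API. The host, credentials, issue key, and alert payload shape here are all hypothetical.

```python
import os
import requests  # pip install requests

JIRA_BASE = "https://yourcompany.atlassian.net"  # hypothetical
AUTH = (os.environ["JIRA_USER"], os.environ["JIRA_API_TOKEN"])

def push_alert_context(issue_key, alert):
    """Attach monitoring context to a JIRA ticket as a comment."""
    body = (
        f"Alert: {alert['name']}\n"
        f"Dashboard: {alert['dashboard_url']}\n"
        f"Runbook: {alert['runbook_url']}"
    )
    response = requests.post(
        f"{JIRA_BASE}/rest/api/2/issue/{issue_key}/comment",
        json={"body": body},
        auth=AUTH,
        timeout=10,
    )
    response.raise_for_status()

push_alert_context("OPS-123", {
    "name": "HighErrorRate",
    "dashboard_url": "https://grafana.example.com/d/abc",
    "runbook_url": "https://wiki.example.com/runbooks/errors",
})
```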

Slack in the Age of Prometheus

George Luong - Slack

So we've had lunch with George and his Slack folks already. Excited for his talk!

Nice bit on systems empathy:

  • monitoring systems get replaced, but not because they are bad...
  • because (their) needs have evolved!

  • 2015: ganglia, librato, elastic, icinga(nagios)
  • 2016: Replace ganglia and librato with graphite! (sound familiar?) and threw in statsd too.
  • 2017: they burned down graphite! (sound familiar? 18 months, about the same time as us...)

People stuff:

  • discoverability is bad... real bad...
  • query performance: slow rendering, 45s to 1m to load a chart.
  • no tags or labels (graphite 0.9.x)
  • too many aggregation layers (statsite -> carbon-relay)

Ops stuff:

  • team of six metrics folks, couldn't scale graphite horizontally
  • if you lose a node, you lose metrics across all your systems
  • devs could take down graphite with infinite cardinality
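That last point is easy to reproduce. A tiny sketch of the failure mode (the statsd client and metric names are hypothetical, not Slack's): every unique metric name becomes its own Graphite series/whisper file, so interpolating a user ID into the name creates unbounded cardinality.

```python
from statsd import StatsClient  # pip install statsd

statsd = StatsClient("localhost", 8125)

def record_login(user_id):
    # BAD: one metric name (and one Graphite series) per user.
    # A few million users later, the metrics cluster is on fire.
    statsd.incr(f"login.success.{user_id}")

def record_login_safely(user_id):
    # GOOD: bounded metric name; keep per-user detail in logs/traces instead.
    statsd.incr("login.success")
```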

UX needs:

  • discovery
  • response time
  • custom retention and ingestion
  • scalable
  • slice/dice based on dimensions
  • introspection: topN queries
  • an API

Ops needs:

  • single failure must not be catastrophic
  • teams want to own their monitoring (wow!)

Now: trying to move (more) stuff from graphite to prometheus.

Single region architecture: two hot frontends running prometheus, then N backends that are hashed/replicated. Multi region: two more nodes that sit in front of all the different regions.
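He didn't go into the hashing details. As a toy sketch of the scheme - deterministically assign each scrape target to a couple of the backends (Prometheus itself can do this with hashmod relabeling); the backend names and the replication factor of 2 are my assumptions:

```python
import hashlib

BACKENDS = ["prom-backend-1", "prom-backend-2", "prom-backend-3", "prom-backend-4"]
REPLICATION = 2  # assumption: each target is scraped by two backends

def shards_for_target(target):
    """Deterministically map a scrape target to REPLICATION backends."""
    h = int(hashlib.md5(target.encode()).hexdigest(), 16)
    first = h % len(BACKENDS)
    return [BACKENDS[(first + i) % len(BACKENDS)] for i in range(REPLICATION)]

print(shards_for_target("webapp-042:9090"))  # e.g. ['prom-backend-3', 'prom-backend-4']
```

Losing one node then only loses one replica of a slice of the metrics, instead of "metrics across all your systems" like with their Graphite setup.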

Using Terraform and Chef; Chef contains all their config rules for Prometheus.

But the big question... can Prometheus handle their gigantic PHP monolith?

  • each webapp server creates 70k metrics: 500 servers, 35M metrics
  • each job worker creates 79k metrics: 300 servers, 24M metrics

How'd he handle a geographically distributed team? He went to Melbourne to collaborate with Tom! They did have to write some PHP to make this all happen, but that's needed to deal with your monolith. (n.b., I take away from this that we should continue to embrace our own monolith as we continue to do the needful.) But now 83% of their services are in Prometheus, and the rest will come soon.

Pain points:

  • Developers are unfamiliar with Chef.
  • Service discovery (in their stack) is unsolved.

Thanos: a management layer that runs on prometheus boxes. It's pretty new so they're not certain about it yet, but they sound optimistic.

Sparky the fire dog: incident response as code

Tapasweni Pathak - Mapbox

This talk is - I think - supposed to be about an autoremediation strategy for alerting.

Unfortunately it was pretty nonsensical. I took a full page of notes on this, and they made no sense, so I guess I'm not the intended audience?

Reclaim your Time: Automating Canary Analysis

Megan Kanne - Twitter

Pretty much as the title states, which is a useful thing to be. She's going over a few different statistical modeling approaches that they can use to automate the analysis. Mann-Whitney U-Test is one of them. It's saved them from some pretty nefarious bugs multiple times.
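She didn't show code, but the core check is small. A minimal sketch of a Mann-Whitney comparison between baseline and canary latency samples - the 0.01 significance threshold and the fail-on-significance policy are my assumptions, not necessarily Twitter's:

```python
from scipy.stats import mannwhitneyu  # pip install scipy

def canary_regressed(baseline_latencies, canary_latencies, alpha=0.01):
    """Flag the canary if its latency distribution is significantly worse."""
    # One-sided test: is canary latency stochastically greater than baseline latency?
    _, p_value = mannwhitneyu(canary_latencies, baseline_latencies, alternative="greater")
    return p_value < alpha

baseline = [102, 98, 110, 95, 101, 99, 103, 97, 100, 104]      # ms
canary   = [130, 128, 141, 119, 135, 127, 133, 125, 138, 129]  # ms
print(canary_regressed(baseline_latencies=baseline, canary_latencies=canary))  # True
```

The appeal of a rank-based test here is that it doesn't assume normally distributed latencies, which real latency data never is.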

They do this for true canaries, but they are considering doing this per pull request. Pretty cool!

Lightning Talks

Throwing Spaghetti at a Blue Sky

Ted Young, Lightstep

Hijacking distributed tracing. One stack trace for a whole transaction, no matter how many services.

Whoever's writing the tracing doesn't care what you do with it. So monitoring systems can be swapped out.

Things to do that aren't tracing... with tracing:

  • suck app data out and steal it
  • debugger will jump around from one process/service to another!
  • Trace-driven development (lol!) - if it's 1:1 with a function call, couldn't you test against tracing data? (toy sketch after this list)
  • Oh no, you end up with formal verification! Except worse!
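As a toy illustration of the trace-driven-development idea (entirely hypothetical, not anything Lightstep showed): if spans map 1:1 to calls, you can write assertions against captured spans the way you'd assert on return values.

```python
# Hypothetical captured trace: one span per function/service call.
trace = [
    {"name": "checkout", "parent": None, "duration_ms": 480},
    {"name": "auth.verify", "parent": "checkout", "duration_ms": 30},
    {"name": "payments.charge", "parent": "checkout", "duration_ms": 410},
]

def test_checkout_trace():
    names = [span["name"] for span in trace]
    # "Unit tests" written against the trace instead of return values.
    assert "auth.verify" in names, "checkout must verify auth"
    assert names.index("auth.verify") < names.index("payments.charge"), \
        "auth must happen before charging the card"
    assert sum(s["duration_ms"] for s in trace if s["parent"] == "checkout") < 500

test_checkout_trace()
```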

Recruiting Open Source Contributors

Matt Broberg - Sensu

Lessons from Ben Franklin? He'd borrow books he'd already read from people he wanted to befriend; return them; tell them how much he liked them; then become their friend!

If I've done something already, I must have a reason to do it! (fallacy...)

Cite: Thinking Fast and Slow (yessss)

Ask people for help over and over and over. They'll want to help etc.

tl;dr

  • Just ask
  • Make it easy
  • Say thanks

How North Korea Helped Improve MTTR

Jamie Buchanan - Trimline

He put a log scale on a graph of 5xx counts surrounding the "Little Rocket Man" international almost-incident. Context matters. If you can understand graphs about microservices faster, you can have conversations about them faster.
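The mechanics are basically a one-liner. A trivial sketch (the data and labels are hypothetical) of why the log scale helps: a background-noise baseline and a 1000x spike both stay readable on the same axis.

```python
import matplotlib.pyplot as plt  # pip install matplotlib

minutes = list(range(60))
# Hypothetical 5xx counts: mostly background noise, one huge spike.
errors = [20] * 30 + [20000] + [20] * 29

fig, (linear_ax, log_ax) = plt.subplots(1, 2, figsize=(10, 3))
for ax, scale in ((linear_ax, "linear"), (log_ax, "log")):
    ax.plot(minutes, errors)
    ax.set_yscale(scale)
    ax.set_title(f"5xx count ({scale} scale)")
plt.show()
```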

On Wisdom

Richard Whitehead - Moogsoft

Military power supply: it's now illegal to electrocute soldiers! Need to keep them out of the box. It was secured with six screws - not enough to keep them out! The fix? Screws that need an Allen wrench.

Do not underestimate how a tiny little change can completely change an experience.

Next: the boxes must survive a military truck rolling over them. They tested with a 3-ton truck, and the boxes still broke in the field. Why? Users ran an 8-ton truck over them. And then a tank!

Never underestimate the end user's capacity to do things wrong. They are inquisitive and motivated to break things.

Be the Team Member You Wanted

Jon Cavanaugh - OP5

  • Failure breeds honesty. Makes you look for meaning.
  • The soccer team he coaches lost every game this season :(
  • But does winning even matter? Or is building self-confidence more important?
  • Back each other up. Means you have to run, not walk.
  • Repetitive small wins are important for people to build that confidence.
  • As a leader, being vulnerable is surprisingly useful. Because the group gets more honest.
  • Be grateful and say thank you. Listen, ask questions. Be consistent and work hard. This all creates a virtuous cycle.