I hereby claim:
- I am zachm on github.
- I am zachm (https://keybase.io/zachm) on keybase.
- I have a public key ASCTDpJFS9Ckm0hs45ZAKXbXSkMGbruCSpkeXeaQ-oKkyQo
To claim this, I am signing this object:
Morgan McLean - Google
This talk is all about opencensus.io; Morgan is the PM for it. He's arguing that the usual N pillars of observability are not sufficient to... have observability.
Instead he's saying context/topology+status+root cause analysis == observability.
Opencensus does:
Kishore Jalleda - (fmr.) Yahoo and Zynga
Patients in hospitals have heart monitors, and they overalert by a ton. So occasionally patients actually die because of a missed/ignored alarm.
Zynga: 100k alerts/month across 25+ studios. 50+ SREs in 3 locations.
How to fix this fatigue and anxiety?
Logan McDonald - BuzzFeed
She's talking about the ramp-up period as a new DevOps person. Her background is in cognitive science, so she's used that to push forward her own learning.
"Problem solving is easier with constraints." - Yes!
Google SRE Handbook - "Dickerson Hierarchy of Site Reliability" Base of this pyramid? Monitoring!
David Mills - Staff Architect, IT Operations Analytics
Basically we're not just looking at events. We're instead looking to tie events together with some ML, with some dashboards, and this ITSI tooling. They're using New Relic events as an example, but the workflow looks like you could just pump PagerDuty events into Splunk for a similar effect. (n.b. why are we not doing this?)
A little bit of discussion on defining good Opsy KPIs but nothing that doesn't follow. They wrap in Businessy KPIs,
They're doing logical actions, like opening tickets, paging people downstream, etc. I'm not sure we'd want to move straight to
Jeff Champagne, Principal Architect, Splunk
Events fall into buckets, 1+ buckets make up an index, indexes live on indexers.
Hot: At least 1 hot bucket per index, per indexer. More created for each parallel ingestion pipeline, or when quarantine is needed. Quarantine: Happens when you load in data from ages ago (too old). Also when timestamps are broken.
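A rough sketch of the hot-vs-quarantine decision as I understood it, not Splunk's actual implementation. The function name and the two thresholds are made up for illustration; the real limits are configured per index. A timestamp that parses to somewhere far in the future stands in for "broken" here.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical thresholds, chosen only for this example.
MAX_PAST = timedelta(days=900)
MAX_FUTURE = timedelta(days=30)


def choose_bucket(event_time, now, max_past=MAX_PAST, max_future=MAX_FUTURE):
    """Return 'quarantine' for events far outside the expected time range, else 'hot'."""
    too_old = event_time < now - max_past
    too_far_ahead = event_time > now + max_future
    if too_old or too_far_ahead:
        return "quarantine"
    return "hot"


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(choose_bucket(now - timedelta(hours=1), now))    # hot
print(choose_bucket(now - timedelta(days=2000), now))  # quarantine
```

The point of the separate bucket is that a handful of stray events don't stretch the time range of an otherwise tight hot bucket.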
Iman Makaremi - Senior Data Scientist, Splunk
Matthew Modestino - ITOA Practitioner, Splunk
So they want to move away from static alarming/decision making. Can the data itself tell you what's normal? Basically, looking for outliers with ML (and the MLTK). One of them is Ops, the other did the math.
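The simplest non-ML version of "let the data say what's normal" is a standard-deviation check against history. This is my own minimal sketch in Python, not the speakers' SPL and not what the MLTK does internally; the function name, the threshold of 3, and the sample counts are all made up.

```python
from statistics import mean, stdev


def is_outlier(history, value, threshold=3.0):
    """Flag value if it is more than `threshold` sample standard deviations from the historical mean."""
    if len(history) < 2:
        raise ValueError("need at least two historical points")
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat history: anything different is a deviation.
        return value != mu
    return abs(value - mu) / sigma > threshold


# Invented daily event counts for one sourcetype.
daily_event_counts = [1000, 1020, 980, 1010, 990, 1005, 995]
print(is_outlier(daily_event_counts, 1008))  # False
print(is_outlier(daily_event_counts, 1500))  # True
```

A static threshold on raw volume would need retuning whenever traffic grows; here the baseline moves with whatever history you feed in.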
"We know what's normal - we collect it every day." You already have the baseline. But how do you write SPL to detect deviation? (Hoping this next bit is relevant to sourcetype volume tracking and to larger anomaly detection work at Yelp.)
Jim Drewes @ Daugherty Business Solutions
It was okay - I didn't take many notes though.
Gartner's hype curve strongly implies that, with all the shiny new devops tools, the "Trough of Disillusionment" is soon to follow.
Basically, Jim believes that due to quickly-approaching enterprise adoption, devops is about to become "a hell of a lot less fun". This is probably the case, but I don't know that I agree completely - there are always going to be younger technologies and companies on the vanguard of the technical bits of the movement.
Bridget Kromhout @ Pivotal
Great overview of a lot of standard devops practices, some of the sorrows that can result, and so on. Bridget gives a lot of talks - she's in an evangelist role at Pivotal.
She emphasized a lot of communication issues within orgs. Some recruiter cold emailed her and used as a selling point that their company had two OpenStack deploys. Why is that a good thing?!
"Good to be explicit and not assume defaults." - A great lesson for everyone's documentation ever!
A longer version she gave at CONFENGINE: https://www.youtube.com/watch?v=UjhIA6QTy5k
Dan Norris @ DigitalOcean
They built DO using DO components, but because they obviously have a decent amount of infrastructure they use Terraform to manage it. Droplets module, then hook it into Chef - combine launch and provision steps.
Vault as a CA for Kubernetes - they have a blog post out on this. http://do.co/vault
Some examples are given of Terraform commands; they don't appear to have much sanity checking around their workflow (e.g. terraform apply vs make plan/apply). This might be simplified for the talk - for their sake I hope it is.
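For reference, the kind of sanity check I have in mind is "save a plan, look at it, then apply exactly that plan". A minimal sketch in Python, not anything shown in the talk: the wrapper name, the default plan filename, and the injectable run/confirm parameters are my own. It only relies on `terraform plan -out=FILE` and `terraform apply FILE`.

```python
import subprocess


def plan_then_apply(plan_file="tfplan", run=subprocess.run, confirm=input):
    """Write a plan to disk, ask for confirmation, then apply that saved plan.

    Returns True if the plan was applied, False if the operator declined.
    """
    # Save the plan so that what gets reviewed is what gets applied.
    run(["terraform", "plan", f"-out={plan_file}"], check=True)
    answer = confirm(f"Apply saved plan {plan_file}? [y/N] ")
    if answer.strip().lower() != "y":
        return False
    # Applying a saved plan file executes that plan rather than computing a new one.
    run(["terraform", "apply", plan_file], check=True)
    return True
```

Calling plan_then_apply() from a directory with Terraform configuration would run the two steps in order; a make plan/apply pair does the same job with less ceremony.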
terraform taint - using it to mark resources as requiring replacement.