Zach Musgrave zachm

Keybase proof

I hereby claim:

  • I am zachm on github.
  • I am zachm (https://keybase.io/zachm) on keybase.
  • I have a public key ASCTDpJFS9Ckm0hs45ZAKXbXSkMGbruCSpkeXeaQ-oKkyQo

To claim this, I am signing this object:

@zachm
zachm / monitorama_2018_day3.md
Last active June 6, 2018 23:53
Monitorama 2018 day 3

Achieving Google-levels of Observability into your Application with OpenCensus

Morgan McLean - Google

This talk is all about opencensus.io. Morgan is the PM for it. He's arguing that the usual N pillars of observability are not sufficient to... have observability.

Instead he's saying context/topology+status+root cause analysis == observability.

OpenCensus does:

  • distributed traces
@zachm
zachm / monitorama_2018_day2.md
Last active June 5, 2018 23:44
Monitorama 2018 Day 2

Want to solve Over-Monitoring and Alert Fatigue? Create the right incentives!

Kishore Jalleda - (fmr.) Yahoo and Zynga

Patients in hospitals have heart monitors, and they overalert by a ton. So occasionally patients actually die because of a missed/ignored alarm.

Zynga: 100k alerts/month across 25+ studios. 50+ SREs in 3 locations.

How to fix this fatigue and anxiety?

  • Can't add more people - doesn't scale.
@zachm
zachm / monitorama_2018_day1.md
Last active June 5, 2018 16:35
Monitorama 2018 Day 1

Optimizing for Learning

Logan McDonald - BuzzFeed

She's talking about the ramp-up period as a new DevOps person. Her background is in cognitive science, so she's used that to push forward her own learning.

"Problem solving is easier with constraints." - Yes!

Google SRE Handbook - "Dickerson Hierarchy of Site Reliability". Base of this pyramid? Monitoring!

@zachm
zachm / Splunk_.conf_17_day3.md
Last active September 28, 2017 17:48
Notes from day 3 of .conf

Splunk IT Service Intelligence (ITSI): Event management is dead - event analytics is revolutionizing IT

David Mills - Staff Architect, IT Operations Analytics

Basically we're not just looking at events. We're instead looking to tie events together with some ML, with some dashboards, and this ITSI tooling. They're using New Relic events as an example, but the workflow looks like you could just pump PagerDuty events into Splunk for a similar effect. (n.b. why are we not doing this?)

A little bit of discussion on defining good Opsy KPIs but nothing that doesn't follow. They wrap in Businessy KPIs,

They're doing logical actions, like opening tickets, paging people downstream, etc. I'm not sure we'd want to move straight to

@zachm
zachm / Splunk_.conf_17_day2.md
Last active September 27, 2017 20:02
Notes from day 2 of Splunk .conf

Splunk Data Lifecycle: Determining When and Where to Roll Your Data

Jeff Champagne, Principal Architect, Splunk

Events fall into buckets, 1+ buckets make up an index, indexes live on indexers.

  • As buckets grow, they roll hot->warm->cold->{frozen|delete}
  • Hot buckets live in the index's homePath
  • Data roll: Can roll out to HDFS

Hot: At least 1 hot bucket per index, per indexer. More are created for each parallel ingestion pipeline, or when quarantine is needed. Quarantine: Happens when you load in data from ages ago (too old), or when timestamps are broken.
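
To keep the roll order straight in my head, here's a rough sketch of the lifecycle as a tiny state machine. This is not Splunk code or configuration: `BucketState`, `next_state`, `should_quarantine`, and the 30-day cutoff are all hypothetical names and values made up for illustration. Only the hot->warm->cold->frozen order and the two quarantine triggers come from the talk.

```python
from enum import Enum
from typing import Optional


class BucketState(Enum):
    HOT = "hot"
    WARM = "warm"
    COLD = "cold"
    FROZEN = "frozen"  # archived or deleted, depending on how the index is set up


# Roll order as described in the talk: hot -> warm -> cold -> frozen.
ROLL_ORDER = [
    BucketState.HOT,
    BucketState.WARM,
    BucketState.COLD,
    BucketState.FROZEN,
]


def next_state(state: BucketState) -> Optional[BucketState]:
    """Return the state a bucket rolls to next, or None if it is already frozen."""
    idx = ROLL_ORDER.index(state)
    return ROLL_ORDER[idx + 1] if idx + 1 < len(ROLL_ORDER) else None


def should_quarantine(
    event_age_days: float,
    timestamp_ok: bool,
    max_age_days: float = 30.0,  # hypothetical cutoff, not a Splunk default
) -> bool:
    """Quarantine events that are too old or whose timestamps are broken."""
    return (not timestamp_ok) or event_age_days > max_age_days
```

So `next_state(BucketState.HOT)` gives `BucketState.WARM`, and a very old event or one with a broken timestamp lands in a quarantine bucket instead of the normal hot one.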

Detect Numeric Outliers – Advances

Iman Makaremi - Senior Data Scientist, Splunk

Matthew Modestino - ITOA Practitioner, Splunk

So they want to move away from static alarming/decision making. Can the data itself tell you what's normal? Basically, looking for outliers with ML (and the MLTK). One of them is Ops, the other did the math.

"We know what's normal - we collect it every day." You already have the baseline. But how do you write SPL to detect deviation? (Hoping this next bit is relevant to sourcetype volume tracking and to larger anomaly detection work at Yelp.)
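
For my own reference, the basic idea of "let the baseline define normal" looks something like the sketch below. This is plain Python, not the speakers' SPL and not how the MLTK does it; `find_outliers` and the threshold of 3 standard deviations are my own assumptions for illustration.

```python
import statistics
from typing import List


def find_outliers(
    baseline: List[float],
    recent: List[float],
    threshold: float = 3.0,  # assumed cutoff, in standard deviations
) -> List[float]:
    """Return values in `recent` that sit more than `threshold` standard
    deviations away from the mean of `baseline`."""
    mean = statistics.fmean(baseline)
    stdev = statistics.stdev(baseline)  # sample standard deviation
    if stdev == 0:
        # A perfectly flat baseline: anything different counts as an outlier.
        return [x for x in recent if x != mean]
    return [x for x in recent if abs(x - mean) > threshold * stdev]


# Hypothetical daily event counts for one sourcetype.
baseline_counts = [100, 102, 98, 101, 99, 100, 103, 97]
todays_counts = [101, 95, 140, 105]
outliers = find_outliers(baseline_counts, todays_counts)
```

A median/MAD or IQR version would be more robust to a baseline that already contains spikes, but the shape of the check is the same.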

@zachm
zachm / devopsdays_DTW_day_two.md
Last active October 13, 2016 18:44
DevOpsDays DTW

Enter The Trough Of Disillusionment

Jim Drewes @ Daugherty Business Solutions

It was okay - I didn't take many notes though.

Gartner's hype curve strongly implies that, with all the shiny new devops tools, the "Trough of Disillusionment" is soon to follow.

Basically, Jim believes that due to quickly-approaching enterprise adoption, devops is about to become "a hell of a lot less fun". This is probably the case, but I don't know that I agree completely - there are always going to be younger technologies and companies on the vanguard of the technical bits of the movement.

@zachm
zachm / devopsdays_DTW_day_one.md
Last active October 12, 2016 18:52
DevOpsDays DTW: Day One notes

Containers Will Not Fix Your Broken Culture (and Other Hard Truths)

Bridget Kromhout @ Pivotal

Great overview of a lot of standard devops practices, some of the sorrows that can result, and so on. Bridget gives a lot of talks - she's in an evangelist role at Pivotal.

She emphasized a lot of communication issues within orgs. Some recruiter cold emailed her and used as a selling point that their company had two OpenStack deploys. Why is that a good thing?!

"Good to be explicit and not assume defaults." - A great lesson for everyone's documentation ever!

A longer version she gave at CONFENGINE: https://www.youtube.com/watch?v=UjhIA6QTy5k

@zachm
zachm / automacon_notes_day_two.md
Last active November 2, 2017 15:21
Automacon

Automating Kubernetes Cluster Ops at DigitalOcean

Dan Norris @ DigitalOcean

They built DO using DO components, but because they obviously have a decent amount of infrastructure they use Terraform to manage it. Droplets module, then hook it into Chef - combine launch and provision steps.

Vault as a CA for Kubernetes - they have a blog post out on this (http://do.co/vault). Some examples are given of Terraform commands; they don't appear to have much sanity checking around their workflow (e.g. running terraform apply directly vs. a make plan/apply step). This might be simplified for the talk - for their sake I hope it is.

terraform taint - using it to mark resources as requiring replacement.