Logan McDonald - BuzzFeed
She's talking about the ramp-up period as a new DevOps person. Her background is in cognitive science, so she's used that to push forward her own learning.
"Problem solving is easier with constraints." - Yes!
Google SRE Handbook - "Dickerson Hierarchy of Site Reliability". Base of this pyramid? Monitoring!
"Rule Learning" - How a beginner internalizes patterns in their world.
We used to think the path to learning success is (re)reading the material, but they call this the "illusion" of learning. It's shown to be ineffective at getting things into your long-term memory. So, wat do?
- Test yourself often, in a low stakes/pressure environment.
- Need to learn how your brain (personally) actually works.
- Try to think of a solution before immediately looking it up. Makes for better recall.
- "Delayed retrieval and interleaving." - Try to split up your recall, don't focus on one thing forever.
Memorization Techniques
- "Leitner boxes" - three boxes of flash cards: promote/demote as you pass/fail the memorization.
- "Memory palaces" and the Tarantula Communications Protocol. Funny and so true - this is how med students do it.
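The Leitner box idea above is easy to sketch in code. This is a rough illustration only: the `LeitnerDeck` class and its method names are made up for these notes, and demoting by one box on a fail is an assumption (some versions send a failed card all the way back to the first box).

```python
from dataclasses import dataclass, field


@dataclass
class LeitnerDeck:
    """Flash cards spread over a few boxes; box 0 gets reviewed most often."""

    num_boxes: int = 3  # "three boxes", as in the talk
    boxes: dict = field(default_factory=dict)  # card -> box index

    def add(self, card: str) -> None:
        # New cards start in the first (most frequently reviewed) box.
        self.boxes[card] = 0

    def review(self, card: str, passed: bool) -> int:
        """Promote the card on a pass, demote it on a fail; return its new box."""
        current = self.boxes[card]
        if passed:
            new_box = min(current + 1, self.num_boxes - 1)
        else:
            # Assumption: demote by one box rather than resetting to box 0.
            new_box = max(current - 1, 0)
        self.boxes[card] = new_box
        return new_box
```

Cards in higher boxes would be reviewed less often, which is where the delayed retrieval comes in.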
Mental models.
- Faulty ones are scary. She's built hers all around observability.
- Reflection.
- Incident and project reviews, at every opportunity. Even when new to a team!
- Managers: Center your incident reviews around the less experienced team members! It helps them learn...
Cultural Memory
- Elephant clans led by older matriarchs survived a bad drought better, because they remembered watering holes they hadn't used for 60 years!
Growth Mindset
- Emphasizing natural intelligence is bad bad bad.
- Emphasizing hard work is much much better.
Psychological Safety
- Safe to ask questions that may be naive, etc.
- Need to feel the team fosters solving tough problems, but does not foster blame.
- If you can't sustain good mental health, your memory will suffer, so your problem solving will too!
Pam Selle - IOPipe
No seriously, this talk is partially about cats! <3 <3 <3
Cats are the "easy" pets. Lies, all lies. Pam has two cats and they both require a lot of TLC.
Lessons on "Serverless DevOps" - told via cats!
- Step One: Bought an automatic cat feeder. Hey, it solves the problem!
** n.b., battery powered and not on the Internet!
- How is this like serverless?
** "No server is easier to manage than 'no server.'"
** Concerns you do care about are offloaded to the platform.
** Event Driven Systems - since you can't control when things happen etc. Cats eat whenever they feel like it.
Cost(cats) = Food + Litter + Vet + Cat Sitter
Cost(serverless) = Compute Time + Invocations + Network
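The serverless cost formula is just a sum of three terms. A minimal sketch, assuming usage is metered in GB-seconds, invocation count, and GB transferred; the function name, the units, and every price passed in are placeholders for illustration, not any provider's real rates.

```python
def serverless_cost(
    gb_seconds: float,
    invocations: int,
    gb_transferred: float,
    *,
    price_per_gb_second: float,
    price_per_million_invocations: float,
    price_per_gb_transferred: float,
) -> float:
    """Cost(serverless) = Compute Time + Invocations + Network."""
    compute = gb_seconds * price_per_gb_second
    invoke = invocations / 1_000_000 * price_per_million_invocations
    network = gb_transferred * price_per_gb_transferred
    return compute + invoke + network


# Made-up prices, only to show how the three terms add up.
example = serverless_cost(
    100, 2_000_000, 10,
    price_per_gb_second=0.5,
    price_per_million_invocations=1.0,
    price_per_gb_transferred=0.1,
)
```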
Lesson Two: Know the limits
- Cat water fountain lasts ~5 days without adding more water.
- Same with Lambda: Must run in under 5 minutes. Hard resource limits, etc.
- Deal with it? Acceptance tests on the provider. Don't rely solely on an emulator!
- Deal with it? Use layers like queues and streams to manage flow between components, esp. with things like DBs.
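To make the queue point concrete, here is a small in-process sketch. It assumes a bounded `queue.Queue` standing in for a managed queue or stream, and a plain list standing in for the database; `run_pipeline` and `max_in_flight` are hypothetical names. The producer blocks when the buffer is full, so the slow component sets the pace.

```python
import queue
import threading


def run_pipeline(events, max_in_flight=2):
    """Push events through a bounded queue to a single slow 'DB writer'."""
    buffer = queue.Queue(maxsize=max_in_flight)
    written = []  # stand-in for rows written to a database

    def writer():
        while True:
            item = buffer.get()
            if item is None:  # sentinel: no more events
                break
            written.append(item)  # pretend this is a DB insert

    worker = threading.Thread(target=writer)
    worker.start()
    for event in events:
        # put() blocks while the queue is full, which is the backpressure:
        # producers cannot outrun the writer by more than max_in_flight items.
        buffer.put(event)
    buffer.put(None)
    worker.join()
    return written
```

Note that `None` is reserved as the stop signal here, so events themselves must not be `None`.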
Lesson Three: Exposing System Status
- Oh no... "we don't know what matters until it matters"
- Their alerting system is serverless too - N minute timer. Schedule and Execute functions.
- They dogfood - IOPipe to monitor IOPipe.
- They showed their live production monitoring dashboard. Looks pretty cool. Some NodeJS code too.
Lesson Four: When in doubt... throw some containers at the problem.
- When you have cats, you have N+1 litterboxes. News to me!
- How do you go five days without scooping them? You have more of them...
- Same with Lambda: when you send it an event, it can create a new container to handle it.
Future Work: Enhancing CatOps Visibility
- Actual real monitoring! (vision?)
- Mood monitoring, like in the Sims!
- Lambda guarantees at-least-once execution. Need to handle duplicate invocations with some sort of wrapper?
- They have agents in like four languages, Golang in alpha.
- Lots of crazy cool IoT cat stuff out there.
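On the at-least-once point: one possible shape for "some sort of wrapper" is an idempotency decorator. This is a sketch, not anything shown in the talk. It assumes each event carries a unique `"id"` key, and it keeps seen IDs in an in-memory dict, which would not survive across containers; a real version would need a shared store.

```python
import functools


def idempotent(handler):
    """Run the handler once per event ID; replay the saved result on repeats."""
    seen = {}  # event id -> result (assumption: in-memory only)

    @functools.wraps(handler)
    def wrapper(event):
        event_id = event["id"]  # assumption: events carry a unique "id"
        if event_id in seen:
            return seen[event_id]  # duplicate delivery, skip the side effect
        result = handler(event)
        seen[event_id] = result
        return result

    return wrapper


feedings = []  # records the side effect so duplicates are visible


@idempotent
def feed_cat(event):
    feedings.append(event["id"])
    return f"fed:{event['id']}"
```

Delivering the same event twice to `feed_cat` should feed the cat once and return the same result both times.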
They have a website called servers.lol. Yes, that's a web address.
Zach Musgrave, Angelo Licastro - Yelp
Oh hai. What's up Portlandia?
Dawn Parzych - Catchpoint
Unfortunately, I didn't really take notes on this one. What I heard from it, however, sounded great. Definitely need to watch later!
Jamie Wilkinson - Google
Started with crazy bad on-call load - the worst at Google. Generally, alert less. And alert on expectations that users/stakeholders have.
SLAs, SLOs, SLIs
- SLI (Indicator): a measurement, e.g. the distribution of measurements around the system.
- SLO (Objective): a goal, e.g. 99.9th percentile response time under X for Y% of the month.
- SLA (Agreement): or we get paged.
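A rough sketch of how the SLI and SLO fit together, under some assumptions: the SLI is a latency percentile computed per time window (nearest-rank method), and the SLO is met when enough windows come in under the threshold. The function names and the per-window framing are illustrative, not from the talk.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of measurements."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def slo_met(window_slis, threshold, required_fraction):
    """True if enough windows had their SLI under the threshold.

    window_slis: one SLI value per window (e.g. a latency percentile in ms).
    required_fraction: the "Y% of the month" part, as a fraction (0.75 = 75%).
    """
    if not window_slis:
        raise ValueError("need at least one window")
    good = sum(1 for value in window_slis if value < threshold)
    return good / len(window_slis) >= required_fraction
```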
Symptom: measured by the SLO.
"Absolute thresholds are ... not very good." - understatement of the year.
Definitely want to track your burn rate, both within and over the period of the SLO. (This is a great idea! We should build alerting that tracks this over a medium-term period!)
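Burn rate can be sketched as the observed error rate divided by the error budget the SLO allows. This uses the common definition where a burn rate of 1 means the budget runs out exactly at the end of the SLO period; the function names and the good/bad event framing are assumptions for illustration.

```python
def error_budget(slo_target):
    """Fraction of events allowed to be bad, e.g. 0.999 -> about 0.001."""
    return 1.0 - slo_target


def burn_rate(bad_events, total_events, slo_target):
    """Observed error rate relative to the budgeted error rate.

    1.0 means the budget is being spent exactly on schedule;
    above 1.0 means it will run out before the SLO period ends.
    """
    if total_events == 0:
        return 0.0  # no traffic, nothing burned
    observed_error_rate = bad_events / total_events
    return observed_error_rate / error_budget(slo_target)


# The same function works for a short window or for the period so far;
# comparing the two shows whether a spike is recent or sustained.
last_hour = burn_rate(bad_events=50, total_events=10_000, slo_target=0.999)
period_to_date = burn_rate(bad_events=10, total_events=10_000, slo_target=0.999)
```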
If alerts don't help, you don't have to burn them down. Just change their priority! (I like this because it reminds me of NORT tests...)
Conclusion
- Symptom-based alerts are good for your health
- SLO defined by you, customers, and system
- SLO implies error budget, informs engineering tolerance.
- Page only on SLO risk, because that's what matters.
Franka Schmidt - Mapbox
Peter Bourgon - Fastly
Aditya Mukerjee - Stripe
Ian Bennett - Twitter