DevOps Master Class & Opbeat Launch, 2014-09-26

DevOps Master Class & Opbeat Launch

Opbeat dude

  • 2 years ago, we decided to make it our mission to make ops better

  • flew around the world

  • met the best ops people, learned how to do proper ops

  • today, we decided to launch opbeat at an event like this, with some great ops speakers

  • traditional ops is being outsourced to github, aws, heroku, etc.

  • you can go from idea to launch in less time

  • after launch, you need to write more code & do ops

  • developers now have to do both jobs

  • this is messy

  • we built opbeat as an ops platform for devs, to let developers do ops

  • develop on github, deploy on heroku, ops on opbeat

Michael Friis - Incident management

  • PM @ Heroku, cofounder at AppHarbor (heroku for .net, YC W11)
  • 5+ billion requests per day (50k rps)
  • we take over ops for you
  • just ship new code, we handle the rest

Ops at heroku:

  • we're owned by salesforce, but very independent

  • ~150 people total

  • lots of teams, lots of software, lots of change

  • everything on AWS

  • total ownership: "you build it, you wear the pager"

  • SRE team monitors the systems, tracks uptime, etc.

  • when we launch a project, it has all the operational features baked in from launch

  • when we encounter incidents, we divide them into two parts: development (tools, control plane) & production (deployed apps)

Examples of incidents:

  • shellshock, heartbleed
  • AWS issues (network, capacity, etc.)
  • git push & build issues
  • internal incidents that don't affect customers

Incident management framework:

  • based on both the ops world and the real world (e.g. disaster relief)

  • first, monitoring continually checks if something is working

  • if it's not working, someone gets paged

  • everyone gets in a shared chatroom (hipchat)

  • someone is assigned IC (Incident Commander)

  • verify there's actually a problem

  • open a status issue on the status site to let users know

  • turn the red lanterns on (siren thingies that sit in the engineering department)

  • send internal sitrep: what do we think is happening, who is looking at it, what's the next step

  • this report goes out to everyone at Heroku

  • assess the problem, check with AWS (we have a direct line to them)

  • mitigate the problem, is there a workaround? (e.g. out of capacity? boot capacity in other zones)

  • "control rods" - flags for certain features that can be turned off when there's a problem (e.g. booting new servers); see the sketch after this list

  • coordinate response, get whoever is an expert / owner in the area working on the problem

  • continue updating status site while problem is resolved

  • post-incident cleanup: undo any manual steps / mitigation, control rod changes, turn off the red lights, close the issue on the status site
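
The "control rods" above are essentially operator-facing feature flags. A minimal sketch of the pattern in Python, assuming a shared flag store (Redis here); the flag names and storage choice are illustrative, not Heroku's actual implementation:

    import redis

    # Shared flag store that both operators (via a CLI or chat command) and
    # services can read. The flag names below are hypothetical examples.
    r = redis.StrictRedis(host="localhost", port=6379, db=0)

    def control_rod_engaged(name):
        """Return True if the named control rod has been pulled (feature disabled)."""
        return r.get("control_rod:%s" % name) == b"on"

    def engage(name):
        """Operator action during an incident: disable a risky code path."""
        r.set("control_rod:%s" % name, "on")

    def disengage(name):
        """Post-incident cleanup: re-enable the code path."""
        r.delete("control_rod:%s" % name)

    # Example: the service that boots new servers checks the rod first.
    def boot_new_server():
        if control_rod_engaged("boot_new_servers"):
            # Incident in progress; skip non-essential work instead of failing loudly.
            return None
        # ... normal provisioning logic here ...

Disengaging the rods is part of the post-incident cleanup step above, so normal behavior resumes once the incident is closed.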

Followup:

  • final phase is followup, which is the most important
  • get together a few days later, analyze root cause
  • do this for all issues, not just the major or public-facing ones; any off-hours pager event gets a followup
  • pager burn is very serious, you shouldn't be paged all the time
  • you can also share this followup info with customers to reassure them that the problem won't happen again

See our post on our engineering blog.

Q&A:

  • Q: how did your process evolve?
  • A: starting with appharbor, we used Sentry but no pagerduty; the first time a server went down without notifying us, we started monitoring servers; we were just 3 founders with a few employees, and we took too long to share pager responsibility outside the founders; we didn't track issues, it was just crash -> fix -> be annoyed; we tried to replicate what heroku was doing as much as possible
  • Q: what is the most common way you're notified of issues?
  • A: hard failures: an engineer gets paged, soft failures: twitter

Mike Krieger - How Ops & On-Call Evolved

  • cofounder of Instagram

  • you don't often get a glimpse inside product companies on how they do ops (vs. a company like Heroku)

  • going to show you lessons we've learned / mistakes we've made along the way

Early days

  • ops experience = none

  • running on a single server in LA

  • fabric tasks for everything (see the Fabric sketch after this list)

  • I'm kind of glad we didn't spend a lot of time on scale & ops up front, because we didn't even know if what we were launching was going to have any success

  • do the simple things first (KISS)

  • ops was all in python

  • our first problem: we had no DR (disaster recovery) plan at all

  • our stack: django & ubuntu & postgres & memcached & redis

  • we started to get traction very early, the one server caught fire soon after that

  • so we moved to AWS

  • devops cycle of pain: only 2 people, no chance to hire new people / improve, so the cycle continues

  • alerting & monitoring = munin & pingability

  • we bought MiFis

  • we had no idea how to do ops

  • one lesson we learned: measure & monitor everything

  • don't go down due to obvious things like no disk space

  • we were committed to hopping on and fixing things urgently

  • if something was easy to set up for monitoring, we used it

  • we dumped nagios because it was too hard to set up

  • munin had no way of muting alerts / scheduling maintenance at the time, so we were constantly getting paged for things that weren't problems
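
The Fabric tasks mentioned above are plain Python functions run over SSH via the fab command. A minimal sketch of what an early deploy task might have looked like, using the Fabric 1.x API; the host, user, and paths are made-up examples, not Instagram's setup:

    from fabric.api import cd, env, run, sudo, task

    # The single early server; host, user, and paths are hypothetical.
    env.hosts = ["app1.example.com"]
    env.user = "deploy"

    @task
    def deploy():
        """Pull the latest code and restart the app, all over SSH."""
        with cd("/srv/app"):
            run("git pull origin master")
            run("pip install -r requirements.txt")
        sudo("service gunicorn restart")

    @task
    def tail_logs():
        """Quick way to watch production logs from a laptop (or a MiFi)."""
        run("tail -n 100 /var/log/app/error.log")

Running fab deploy from a laptop executes the task against every host in env.hosts.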

Scaling up

  • both of us awake, but primarily me doing the fixing

  • don't underestimate solidarity with other ops folks

  • ChickenCoopOps - war story about getting paged and doing maintenance from a farm

  • hired our first dev (iOS + Infrastructure)

  • took my first trip abroad (started writing runbooks, but new problems showed up in Paris)

  • Pingability died, moved to Pingdom + PagerDuty

  • having early cross-stack employees means we had a few people who could make fixes

  • we've tried to keep that "total ownership" idea today

  • very easy to see problems and rollback, but still manual

  • but burnout was impending

Starting a team

  • hired 2 infra engineers

  • started writing an eng blog

  • no rotation yet, on-call shared between 4 people

  • traffic peaked on weekends, so that's when we were getting paged

  • replaced munin with sensu & ganglia (see the check sketch after this list)

  • sensu is very good, so much better than munin

  • still using PagerDuty

  • everyone knew expectations going in

  • android launch = 2x traffic in 6 months

  • no time to invest in ops & infrastructure improvements

  • living week-to-week

  • new AWS features (e.g. provisioned IOPS) saved us a few times

  • if everyone is responsible, then nobody is responsible
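
Sensu (like nagios before it) runs checks as ordinary executables and reads their exit code: 0 = OK, 1 = warning, 2 = critical. A minimal sketch of the kind of check that guards against the "obvious things like no disk space" lesson from the early days; the path and thresholds are arbitrary examples, not Instagram's configuration:

    #!/usr/bin/env python
    """Disk usage check in the Nagios/Sensu plugin convention:
    exit 0 = OK, 1 = WARNING, 2 = CRITICAL."""
    import os
    import sys

    PATH = "/"   # filesystem to check (example)
    WARN = 80    # percent used, warning threshold (example)
    CRIT = 90    # percent used, critical threshold (example)

    def percent_used(path):
        st = os.statvfs(path)
        total = st.f_blocks * st.f_frsize
        free = st.f_bavail * st.f_frsize
        return 100.0 * (total - free) / total

    used = percent_used(PATH)
    if used >= CRIT:
        print("DiskCheck CRITICAL: %.1f%% used on %s" % (used, PATH))
        sys.exit(2)
    elif used >= WARN:
        print("DiskCheck WARNING: %.1f%% used on %s" % (used, PATH))
        sys.exit(1)
    print("DiskCheck OK: %.1f%% used on %s" % (used, PATH))
    sys.exit(0)

Sensu schedules checks like this on an interval, and its handlers can forward failing results to a pager service such as PagerDuty.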

2012: Acquired by Facebook

  • 2 infra eng -> 6 infra eng in the next 3 months

  • started an on-call process

  • primary, secondary, tertiary

    • primary = can fix everything, has a laptop at all times
    • secondary = can also fix everything, but doesn't need to be glued to their laptop, more of a safety net
  • use Facebook messenger & IRC

  • increased runbook coverage

  • shadowing: clone your primary & secondary people!

  • should you start people on secondary first or primary first?

    • primary: get experience very quickly (this is what we chose)
    • secondary: already a bit fatigued when you reach primary, things only get harder
  • at FB each system has on-call people

  • at Instagram, we kept the ops people dedicated just to our product

  • we now have more specialized developers, but try to get them involved with ops

Stability

  • new rotation setup (L1/L2/L3):
  • L1: triages and responds if able (simple issues), but escalation is encouraged
  • new people can do the L1 responsibility very quickly
  • exit surveys for L1 shifts (how did it go? how many times did you get paged? did you lose sleep?)

Going forward

  • we still have some unsolved problems
  • which tech should be offloaded to FB? how do we make sure we're aligned?
    • e.g. should we turn over our memcached usage to the FB memcached team? how much work do we need to do that?
  • coordination of on-call responsibilities with FB (both ways)
  • small issues for FB can be large issues for us
  • scaling intra-team communication without interrupting everyone
  • how do you teach on-call triage & problem solving?
    • one idea: create AMIs with different scenarios and let people practice, but lots of problems are things you haven't seen before, so you need general problem-solving skills...

Q&A:

  • Q: what day do you do the rotation?

  • A: started off on Wednesday, but moved to Friday to get a fresh person on-call during peak time (the weekend)

  • Q: I'm a fan of 'you build it you run it', but does that mean everyone has to be good at everything? where is the divide?

  • A: our weak spot is that we still have super specialized systems & knowledge (e.g. deep DBA-level issues with postgres or cassandra), stuff where the bus factor is 1, try to get people involved where there are overlaps

  • Q: do you have a moral to the story for new teams / new devs doing ops? what to concentrate on?

  • A: use more hosted services (e.g. Parse), when you do port things to your own stack: keep things super simple, don't jump into a plethora of technologies/DBs/languages/etc., you never want to be in over your head on technical issues, use IaaS and PaaS, early on your choice of DB is not as important as whether you have product-market fit, have a clear triage process

  • Q: not ops-related, what was it like in the early days of launching a photo sharing app? didn't people think you were crazy for competing with e.g. Flickr?

  • A: yes people thought we were crazy, but entrepreneurship is a balance of being sane & insane, our key was that we were more social than anything else at the time, most photo apps were focused on photo editing

  • Q: what were your main evolutions of on-call?

  • A: runbooks, documentation, clear on-call process, standardizing our systems, using a real CM system

Andreas Ehn - Factoring out systems components

  • ex-CTO of Spotify & current CTO of Wrapp

Factoring:

  • finding & removing commonalities

  • ab+ac = a(b+c)

  • one pattern: internal project -> open source project

    • e.g. Django, built internally for newspaper CMS, turned into an open source framework used by tons of people
  • another pattern: productize some service (SaaS)

    • e.g. Opbeat, Heroku, Copperegg, Github
  • back in 1999 people spent millions of dollars launching startups on proprietary hardware & OS's (Sun & Solaris)

  • by 2006, when we launched Spotify, people were moving to linux and using open source, but still spending a lot of time maintaining and building out hardware

  • in 2011, with Wrapp, we do almost everything on SaaS, using dozens of products

  • all startups face the same kinds of problems, why reinvent the solutions each time?

  • let's do them together, either as open source or paying a service provider

Benefits of outsourcing to open source / SaaS:

  • focus

  • don't repeat yourself

  • someone else's headache to do maintenance

  • shared cost of development -> better product

  • amortized cost of scaling across many users

  • with hardware: you're continually either under- or over-provisioned, lots of negotiation, configuration, hardware arrives in batches

  • with AWS: pay only what you need, scale smoothly

Future of outsourcing:

  • login & user management (we have Github login, twitter login, OAuth, etc., but in the future why can't we outsource our entire user management system?)

  • CRM for consumer products (as opposed to CRM for sales, which is saturated): retention, push notifications, etc.

  • before you build: look at what services are available

  • if you do build: don't forget that maintenance is usually the expensive part

  • if you do build: open source it, and do so early

Q&A:

  • Q: when would you consider building something yourself?

  • A: when it's specifically addressing a problem that hasn't been solved before, or the existing solution isn't good enough

  • Q: what about latency of using SaaS? that can affect user experience

  • A: a lot of these services aren't in the direct path of the user's experience, it's at the fringes of the product, but yeah latency would be a factor in build vs. buy, some people do end up migrating off of AWS due to scale or price or some other requirement, also, maybe the SaaS approach is not cost-effective enough yet, but that will change..

Opbeat product demo

  • get all your data in one place

  • a coordination layer for your entire team

  • we want ops to be as easy as downloading an app from the app store

  • so you can focus on building your core product

  • get developers the info they need when they need it, anywhere they need it

      1. release tracking - who pushed what, at what time? (git & github & heroku)
      2. exception tracking - exception name, stacktrace, severity, frequency graph, correlation to releases
      3. team coordination - errors are assigned to users
  • (all of this demo is on an iphone, so I guess they're mobile-first?)

  • (ok.. now he switched to the browser)

  • (the projector keeps cutting out :E)

  • regressions: errors that are reopened after being marked as fixed

  • reopened errors are assigned back to the person who marked them as fixed

  • we can assign issues based on who wrote the code

  • "this is really powerful"

  • or reassign to others

  • accountability is very important so we know who is in charge of this issue

  • if everyone is emailed, nobody will act on it

  • but if someone is assigned, that person is clearly responsible

  • assignments trigger push notifications on the mobile app

  • assignment makes the workflow passive for other users: one person takes charge, and the issue stays out of everyone else's way

  • we also have a feed

  • which also shows a graph

  • we support multiple apps

  • available today

  • coupon code in the goodie bag, go grab one

  • try it out, let us know!

Q&A:

  • Q: can you trigger errors yourself? e.g. from the frontend?

  • A: uhh if you mean incident management (like the heroku talk), no, we're not really for that, but yea you can trigger custom errors in opbeat

  • Q: what affects the "time to release" metric? do you have any more analytics like that?

  • A: we calculate datediff(released on prod, first commit in release), we don't do anything more than that, but it's still pretty useful for showing whether a release was big or small

  • Q: how do you get the data?

  • A: we ask you to install a small agent / plugin in your app, and we pull data from github (code & users)

  • Q: how do you know when the code gets to the server?

  • A: it's just a simple endpoint you put in your deploy script, e.g. with curl (see the sketch after this Q&A)

  • Q: how do you handle two releases running at the same time?

  • A: uhh you mean different environments?

  • Q: no, what if I deploy one app to multiple clients?

  • A: oh, you should make multiple projects in opbeat

  • Q: what's on the roadmap?!

  • A: don't want to say too much, but there's lots of interesting data you could potentially pull into opbeat, also, more tools that let you view changes over time

  • Q: what data does opbeat need to run?

  • A: exceptions, code

  • Q: do you track dependencies between applications?

  • A: no, but good idea

  • Q: how much access do you have to our code?

  • A: github's API only allows you to get read access to the entire org, not repo-by-repo, but we are working on something to get around this
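
As the deploy-script answer above suggests, release tracking only needs a small HTTP call at the end of a deploy. A sketch of the idea in Python; the endpoint URL, token, and payload fields are hypothetical placeholders, not Opbeat's actual API, and the same thing can be done with a single curl line:

    import json
    import subprocess
    import urllib.request

    # Hypothetical endpoint and token: stand-ins for whatever your
    # error/release-tracking service actually expects.
    RELEASE_ENDPOINT = "https://example.com/api/releases/"
    API_TOKEN = "your-token-here"

    def notify_release():
        """Call this at the end of the deploy script, once the new code is live."""
        rev = subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip()
        body = json.dumps({"rev": rev, "status": "completed"}).encode()
        req = urllib.request.Request(
            RELEASE_ENDPOINT,
            data=body,
            headers={"Authorization": "Bearer " + API_TOKEN,
                     "Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)

    if __name__ == "__main__":
        notify_release()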
