DevOps Master Class & Opbeat Launch, 2014-09-26

DevOps Master Class & Opbeat Launch

Opbeat dude

  • 2 years ago, we decided to make it our mission to make ops better

  • flew around the world

  • met the best ops people, learned how to do proper ops

  • today, we decided to launch opbeat at an event like this, with some great ops speakers

  • traditional ops is being outsourced to github, aws, heroku, etc.

  • you can go from idea to launch in less time

  • after launch, you need to write more code & do ops

  • developers now have to do both jobs

  • this is messy

  • we built opbeat as an ops platform for devs, to let developers do ops

  • develop on github, deploy on heroku, ops on opbeat

Michael Friis - Incident management

  • PM @ Heroku, cofounder at AppHarbor (heroku for .net, YC W11)
  • 5+ billion requests per day (50k rps)
  • we take over ops for you
  • just ship new code, we handle the rest

Ops at heroku:

  • we're owned by salesforce, but very independent

  • ~150 people total

  • lots of teams, lots of software, lots of change

  • everything on AWS

  • total ownership: "you build it, you wear the pager"

  • SRE team monitors the systems, tracks uptime, etc.

  • when we launch a project, it has all the operational features baked in from launch

  • when we encounter incidents, we divide them into two parts: development (tools, control plane) & production (deployed apps)

Examples of incidents:

  • shellshock, heartbleed
  • AWS issues (network, capacity, etc.)
  • git push & build issues
  • internal incidents that don't affect customers

Incident management framework:

  • based on both the ops world and the real world (e.g. disaster relief)

  • first, monitoring continually checks if something is working

  • if it's not working, someone gets paged

  • everyone gets in a shared chatroom (hipchat)

  • someone is assigned IC (Incident Commander)

  • verify there's actually a problem

  • open a status issue on the status site to let users know

  • turn the red lanterns on (siren thingies that sit in the engineering department)

  • send internal sitrep: what do we think is happening, who is looking at it, what's the next step

  • this report goes out to everyone at Heroku

  • assess the problem, check with AWS (we have a direct line to them)

  • mitigate the problem, is there a workaround? (e.g. out of capacity? boot capacity in other zones)

  • "control rods" - flags for certain features that can be turned off when there's a problem (e.g. booting new servers); see the sketch after this list

  • coordinate response, get whoever is an expert / owner in the area working on the problem

  • continue updating status site while problem is resolved

  • post-incident cleanup: undo any manual steps / mitigation, control rod changes, turn off the red lights, close the issue on the status site
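
The "control rods" above are essentially operator-facing feature flags. A minimal sketch of the pattern in Python, assuming a shared flag store (Redis here); the flag names and storage choice are illustrative, not Heroku's actual implementation:

    import redis

    # Shared flag store that both operators (via a CLI or chat command) and
    # services can read. The flag names below are hypothetical examples.
    r = redis.StrictRedis(host="localhost", port=6379, db=0)

    def control_rod_engaged(name):
        """Return True if the named control rod has been pulled (feature disabled)."""
        return r.get("control_rod:%s" % name) == b"on"

    def engage(name):
        """Operator action during an incident: disable a risky code path."""
        r.set("control_rod:%s" % name, "on")

    def disengage(name):
        """Post-incident cleanup: re-enable the code path."""
        r.delete("control_rod:%s" % name)

    # Example: the service that boots new servers checks the rod first.
    def boot_new_server():
        if control_rod_engaged("boot_new_servers"):
            # Incident in progress; skip non-essential work instead of failing loudly.
            return None
        # ... normal provisioning logic here ...

Disengaging the rods is part of the post-incident cleanup step above, so normal behavior resumes once the incident is closed.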

Followup:

  • final phase is followup, which is the most important
  • get together a few days later, analyze root cause
  • do this for all issues, not just the major or public-facing ones; any off-hours pager event gets a followup
  • pager burn is very serious, you shouldn't be paged all the time
  • you can also share this followup info with customers to reassure them that the problem won't happen again

See our post on our engineering blog.

Q&A:

  • Q: how did your process evolve?
  • A: starting with appharbor, we used Sentry but no pagerduty; the first time a server went down without notifying us, we started monitoring servers; we were just 3 founders with a few employees, and we took too long to share pager responsibility outside the founders; we didn't track issues, it was just crash -> fix -> be annoyed; we tried to replicate what heroku was doing as much as possible
  • Q: what is the most common way you're notified of issues?
  • A: hard failures: an engineer gets paged, soft failures: twitter

Mike Krieger - How Ops & On-Call Evolved

  • cofounder of Instagram

  • you don't often get a glimpse inside product companies on how they do ops (vs. a company like Heroku)

  • going to show you lessons we've learned / mistakes we've made along the way

Early days

  • ops experience = none

  • running on a single server in LA

  • fabric tasks for everything (see the Fabric sketch after this list)

  • I'm kind of glad we didn't spend a lot of time on scale & ops up front, because we didn't even know if what we were launching was going to have any success

  • do the simple things first (KISS)

  • ops was all in python

  • our first problem: we had no DR (disaster recovery) plan at all

  • our stack: django & ubuntu & postgres & memcached & redis

  • we started to get traction very early, the one server caught fire soon after that

  • so we moved to AWS

  • devops cycle of pain: only 2 people, no chance to hire new people / improve, so the cycle continues

  • alerting & monitoring = munin & pingability

  • we bought MiFis

  • we had no idea how to do ops

  • one lesson we learned: measure & monitor everything

  • don't go down due to obvious things like no disk space

  • we were committed to hopping on and fixing things urgently

  • if something was easy to set up for monitoring, we used it

  • we dumped nagios because it was too hard to set up

  • munin had no way of muting alerts / scheduling maintenance at the time, so we were constantly getting paged for things that weren't problems
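
The Fabric tasks mentioned above are plain Python functions run over SSH via the fab command. A minimal sketch of what an early deploy task might have looked like, using the Fabric 1.x API; the host, user, and paths are made-up examples, not Instagram's setup:

    from fabric.api import cd, env, run, sudo, task

    # The single early server; host, user, and paths are hypothetical.
    env.hosts = ["app1.example.com"]
    env.user = "deploy"

    @task
    def deploy():
        """Pull the latest code and restart the app, all over SSH."""
        with cd("/srv/app"):
            run("git pull origin master")
            run("pip install -r requirements.txt")
        sudo("service gunicorn restart")

    @task
    def tail_logs():
        """Quick way to watch production logs from a laptop (or a MiFi)."""
        run("tail -n 100 /var/log/app/error.log")

Running fab deploy from a laptop executes the task against every host in env.hosts.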

Scaling up

  • both of us awake, but primarily me doing the fixing

  • don't underestimate solidarity with other ops folks

  • ChickenCoopOps - war story about getting paged and doing maintenance from a farm

  • hired our first dev (iOS + Infrastructure)

  • took my first trip abroad (started writing runbooks, but new problems showed up in Paris)

  • Pingability died, moved to Pingdom + PagerDuty

  • having early cross-stack employees means we had a few people who could make fixes

  • we've tried to keep that "total ownership" idea today

  • very easy to see problems and rollback, but still manual

  • but burnout was impending

Starting a team

  • hired 2 infra engineers

  • started writing an eng blog

  • no rotation yet, on-call shared between 4 people

  • traffic peaked on weekends, so that's when we were getting paged

  • replaced munin with sensu & ganglia (see the check sketch after this list)

  • sensu is very good, so much better than munin

  • still using PagerDuty

  • everyone knew expectations going in

  • android launch = 2x traffic in 6 months

  • no time to invest in ops & infrastructure improvements

  • living week-to-week

  • new AWS features (e.g. provisioned IOPS) saved us a few times

  • if everyone is responsible, then nobody is responsible
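
Sensu (like nagios before it) runs checks as ordinary executables and reads their exit code: 0 = OK, 1 = warning, 2 = critical. A minimal sketch of the kind of check that guards against the "obvious things like no disk space" lesson from the early days; the path and thresholds are arbitrary examples, not Instagram's configuration:

    #!/usr/bin/env python
    """Disk usage check in the Nagios/Sensu plugin convention:
    exit 0 = OK, 1 = WARNING, 2 = CRITICAL."""
    import os
    import sys

    PATH = "/"   # filesystem to check (example)
    WARN = 80    # percent used, warning threshold (example)
    CRIT = 90    # percent used, critical threshold (example)

    def percent_used(path):
        st = os.statvfs(path)
        total = st.f_blocks * st.f_frsize
        free = st.f_bavail * st.f_frsize
        return 100.0 * (total - free) / total

    used = percent_used(PATH)
    if used >= CRIT:
        print("DiskCheck CRITICAL: %.1f%% used on %s" % (used, PATH))
        sys.exit(2)
    elif used >= WARN:
        print("DiskCheck WARNING: %.1f%% used on %s" % (used, PATH))
        sys.exit(1)
    print("DiskCheck OK: %.1f%% used on %s" % (used, PATH))
    sys.exit(0)

Sensu schedules checks like this on an interval, and its handlers can forward failing results to a pager service such as PagerDuty.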

2012: Acquired by Facebook

  • 2 infra eng -> 6 infra eng in the next 3 months

  • started an on-call process

  • primary, secondary, tertiary

    • primary = can fix everything, has a laptop at all times
    • secondary = can also fix everything, but doesn't need to be glued to their laptop, more of a safety net
  • use Facebook messenger & IRC

  • increased runbook coverage

  • shadowing: clone your primary & secondary people!

  • should you start people on secondary first or primary first?

    • primary: get experience very quickly (this is what we chose)
    • secondary: already a bit fatigued when you reach primary, things only get harder
  • at FB each system has on-call people

  • at Instagram, we kept the ops people dedicated just to our product

  • we now have more specialized developers, but try to get them involved with ops

Stability

  • new rotation setup (L1/L2/L3):
  • L1: triages and responds if able (simple issues), but escalation is encouraged
  • new people can do the L1 responsibility very quickly
  • exit surveys for L1 shifts (how did it go? how many times did you get paged? did you lose sleep?)

Going forward

  • we still have some unsolved problems
  • which tech should be offloaded to FB? how do we make sure we're aligned?
    • e.g. should we turn over our memcached usage to the FB memcached team? how much work do we need to do that?
  • coordination of on-call responsibilities with FB (both ways)
  • small issues for FB can be large issues for us
  • scaling intra-team communication without interrupting everyone
  • how do you teach on-call triage & problem solving?
    • one idea: create AMIs with different scenarios and let people practice, but lots of problems are things you haven't seen before, so you need general problem-solving skills...

Q&A:

  • Q: what day do you do the rotation?

  • A: started off on Wednesday, but moved to Friday to get a fresh person on-call during peak time (the weekend)

  • Q: I'm a fan of 'you build it you run it', but does that mean everyone has to be good at everything? where is the divide?

  • A: our weak spot is that we still have super specialized systems & knowledge (e.g. deep DBA-level issues with postgres or cassandra), stuff where the bus factor is 1, try to get people involved where there are overlaps

  • Q: do you have a moral to the story for new teams / new devs doing ops? what to concentrate on?

  • A: use more hosted services (e.g. Parse), when you do port things to your own stack: keep things super simple, don't jump into a plethora of technologies/DBs/languages/etc., you never want to be in over your head on technical issues, use IaaS and PaaS, early on your choice of DB is not as important as whether you have product-market fit, have a clear triage process

  • Q: not ops-related, what was it like in the early days of launching a photo sharing app? didn't people think you were crazy for competing with e.g. Flickr?

  • A: yes people thought we were crazy, but entrepreneurship is a balance of being sane & insane, our key was that we were more social than anything else at the time, most photo apps were focused on photo editing

  • Q: what were your main evolutions of on-call?

  • A: runbooks, documentation, clear on-call process, standardizing our systems, using a real CM system

Andreas Ehn - Factoring out systems components

  • ex-CTO of Spotify & current CTO of Wrapp

Factoring:

  • finding & removing commonalities

  • ab+ac = a(b+c)

  • one pattern: internal project -> open source project

    • e.g. Django, built internally for newspaper CMS, turned into an open source framework used by tons of people
  • another pattern: productize some service (SaaS)

    • e.g. Opbeat, Heroku, Copperegg, Github
  • back in 1999 people spent millions of dollars launching startups on proprietary hardware & OS's (Sun & Solaris)

  • by 2006, when we launched Spotify, people were moving to linux and using open source, but still spending a lot of time maintaining and building out hardware

  • in 2011, with Wrapp, we do almost everything on SaaS, using dozens of products

  • all startups face the same kinds of problems, why reinvent the solutions each time?

  • let's do them together, either as open source or paying a service provider

Benefits of outsourcing to open source / SaaS:

  • focus

  • don't repeat yourself

  • someone else's headache to do maintenance

  • shared cost of development -> better product

  • amortized cost of scaling across many users

  • with hardware: you're continually either under- or over-provisioned, lots of negotiation, configuration, hardware arrives in batches

  • with AWS: pay only what you need, scale smoothly

Future of outsourcing:

  • login & user management (we have Github login, twitter login, OAuth, etc., but in the future why can't we outsource our entire user management system?)

  • CRM for consumer products (as opposed to CRM for sales, which is saturated): retention, push notifications, etc.

  • before you build: look at what services are available

  • if you do build: don't forget that maintenance is usually the expensive part

  • if you do build: open source it, and do so early

Q&A:

  • Q: when would you consider building something yourself?

  • A: when it's specifically addressing a problem that hasn't been solved before, or the existing solution isn't good enough

  • Q: what about latency of using SaaS? that can affect user experience

  • A: a lot of these services aren't in the direct path of the user's experience, it's at the fringes of the product, but yeah latency would be a factor in build vs. buy, some people do end up migrating off of AWS due to scale or price or some other requirement, also, maybe the SaaS approach is not cost-effective enough yet, but that will change..

Opbeat product demo

  • get all your data in one place

  • a coordination layer for your entire team

  • we want ops to be as easy as downloading an app from the app store

  • so you can focus on building your core product

  • get developers the info they need when they need it, anywhere they need it

      1. release tracking - who pushed what, at what time? (git & github & heroku)
      2. exception tracking - exception name, stacktrace, severity, frequency graph, correlation to releases
      3. team coordination - errors are assigned to users
  • (all of this demo is on an iphone, so I guess they're mobile-first?)

  • (ok.. now he switched to the browser)

  • (the projector keeps cutting out :E)

  • regressions: errors that are reopened after being marked as fixed

  • reopened errors are assigned back to the person who marked them as fixed

  • we can assign issues based on who wrote the code

  • "this is really powerful"

  • or reassign to others

  • accountability is very important so we know who is in charge of this issue

  • if everyone is emailed, nobody will act on it

  • but if someone is assigned, that person is clearly responsible

  • assignments trigger push notifications on the mobile app

  • assignment makes the workflow passive for other users: one person takes charge, and the issue stays out of everyone else's way

  • we also have a feed

  • which also shows a graph

  • we support multiple apps

  • available today

  • coupon code in the goodie bag, go grab one

  • try it out, let us know!

Q&A:

  • Q: can you trigger errors yourself? e.g. from the frontend?

  • A: uhh if you mean incident management (like the heroku talk), no, we're not really for that, but yea you can trigger custom errors in opbeat

  • Q: what affects the "time to release" metric? do you have any more analytics like that?

  • A: we calculate datediff(released on prod, first commit in release), we don't do anything more than that, but it's still pretty useful for showing whether a release was big or small

  • Q: how do you get the data?

  • A: we ask you to install a small agent / plugin in your app, and we pull data from github (code & users)

  • Q: how do you know when the code gets to the server?

  • A: it's just a simple endpoint you put in your deploy script, e.g. with curl (see the sketch after this Q&A)

  • Q: how do you handle two releases running at the same time?

  • A: uhh you mean different environments?

  • Q: no, what if I deploy one app to multiple clients?

  • A: oh, you should make multiple projects in opbeat

  • Q: what's on the roadmap?!

  • A: don't want to say too much, but there's lots of interesting data you could potentially pull into opbeat, also, more tools that let you view changes over time

  • Q: what data does opbeat need to run?

  • A: exceptions, code

  • Q: do you track dependencies between applications?

  • A: no, but good idea

  • Q: how much access do you have to our code?

  • A: github's API only allows you to get read access to the entire org, not repo-by-repo, but we are working on something to get around this
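
As the deploy-script answer above suggests, release tracking only needs a small HTTP call at the end of a deploy. A sketch of the idea in Python; the endpoint URL, token, and payload fields are hypothetical placeholders, not Opbeat's actual API, and the same thing can be done with a single curl line:

    import json
    import subprocess
    import urllib.request

    # Hypothetical endpoint and token: stand-ins for whatever your
    # error/release-tracking service actually expects.
    RELEASE_ENDPOINT = "https://example.com/api/releases/"
    API_TOKEN = "your-token-here"

    def notify_release():
        """Call this at the end of the deploy script, once the new code is live."""
        rev = subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip()
        body = json.dumps({"rev": rev, "status": "completed"}).encode()
        req = urllib.request.Request(
            RELEASE_ENDPOINT,
            data=body,
            headers={"Authorization": "Bearer " + API_TOKEN,
                     "Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)

    if __name__ == "__main__":
        notify_release()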
