@mjuarez
Created June 1, 2015 18:48
Velocity Conf 2015 Notes

Velocity Conf 2015

Day 1

Bootstrapping an Ops Team - Charity Majors (Parse)

Operations - anything pertaining to maintaining and implementing systems at scale

Do you really need an ops team?

Software engineers think they are off the hook with performance, management, scalability.

Operations Engineering at Scale is a specialized skillset. It is not "soft eng lite". It is not someone to do all the annoying parts of running systems for you.

You have hard operational problems

Hard Operational Problems

  • Extreme reliability demands
  • Extreme scalability (3x-10x year over year)
  • Extreme security requirements
  • Solving operational category problems for the whole internet (platforms, services)

What makes a good startup ops hire?

Look for strengths that are key to your company's success.

A good operations engineer is broadly literate and can go deep on at least one or two areas.

  • Strong automation instincts
  • Ownership over their systems
  • Strong opinions, weakly held
  • Simplify, simplify, simplify
  • Excellent communication skills in a crisis
  • Value process: it prevents the same mistake from happening over and over.
  • Empathy

Things that don't work

  • whiteboarding code
  • particular technology or language
  • particular degree
  • big company pedigree

Success at a startup

  • comfortable with chaos
  • knows when to solve 80% and move on
  • total responsibility for outcomes
  • good judgement
  • highly reactive
  • technical breadth

How do you interview and sort for those qualities?

First, do the job yourself

Figure out what strengths you really need. Reference: The Hard Thing About Hard Things (Horowitz)

Don't say no because someone has weaknesses; hire for strength rather than lack of weakness.

Hire for YOUR weaknesses. Find someone to fill them.

Good interview questions

  • are leading, to establish technical range
  • probe the candidate's self-reported strengths
  • related to your problems
  • ask culture questions. screen for learned helplessness

Bad interview

  • depends on a specific tech.
  • look for a reason to deny a candidate
  • designed to trip them up
  • deny candidates the resources they would use to solve something in the real world

ask culture questions. how they felt about their last job and coworkers. screen for learned helplessness. learned helplessness is like startup kryptonite.

"Employers put too much weight on interviews, and too little weight on references." - Reid Hoffman

You hired an ops engineer. Now what?

Include the ops team in product development.

Bad ops engineers / fire them

  • Tweaking indefinitely & pointlessly
  • Walling off prod from dev.
  • Adding complexity
  • Won't admit they don't know things
  • Disconnected from customer experience.

How to lose good ops engs

  • all the responsibility, no authority
  • all the tedious shitwork
  • blameful culture
  • no interesting operational problems

encouraging work-life balance, supporting team members

culture: the patterns you call out and celebrate will get repeated

Stream Processing and Anomaly Detection - Karthik Mittel (Twitter)

Streaming Analytics

Cube analytics (business intelligence), predictive analytics (statistics and machine learning)

Real Time or Batch

Analyze data as it is being produced: streaming

interactive: store data and provide results instantly when a query is posed

First gen. SQL

Aurora, Borealis, Cayuga, STREAM, NiagaraCQ (SIGMOD / CIDR)

Next gen:

S4, Storm, Samza, Spark, Pulsar

storm

  • guaranteed message protocol
  • horizontal scalability
  • robust fault tolerance
  • concise code - focus on logic

data model

topology - directed acyclic graph - vertices = computations, edges = streams of data

spouts - sources of data for the topology

bolts - units of computation on data
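A minimal sketch of the data model in plain Python, not Storm's actual API: generators stand in for streams, and the function names (`sentence_spout`, `split_bolt`, `count_bolt`, `run_topology`) are made up for illustration. It only shows how a spout feeds a chain of bolts along the edges of a small DAG.

```python
from collections import Counter
from typing import Iterable, Iterator


def sentence_spout(sentences: Iterable[str]) -> Iterator[str]:
    """Spout: a source that emits tuples (here, sentences) into the topology."""
    for sentence in sentences:
        yield sentence


def split_bolt(stream: Iterator[str]) -> Iterator[str]:
    """Bolt: consumes sentences and emits one lowercase word per tuple."""
    for sentence in stream:
        for word in sentence.lower().split():
            yield word


def count_bolt(stream: Iterator[str]) -> Counter:
    """Terminal bolt: aggregates the word stream into counts."""
    counts: Counter = Counter()
    for word in stream:
        counts[word] += 1
    return counts


def run_topology(sentences: Iterable[str]) -> Counter:
    """Wire the vertices together: spout -> split bolt -> count bolt."""
    return count_bolt(split_bolt(sentence_spout(sentences)))
```

A real topology runs each vertex in parallel across many workers; this sketch is single-process and only illustrates the shape of the graph.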

Storm Operations

  • Bad Host
  • Hot Keys
  • Network Issues

Anomaly Detection

Performance Bottlenecks

  • Real-time Processing
  • Failures (slow writes, connectivity issues)
  • Backpressure / Container Deaths (ms spent in backpressure)

Spike in input traffic; hot keys / connectivity issues; anomalous nodes - Kestrel spout lag

Finding

Automated

Statistically robust (minimise false positives)

  • R Package : Seasonality and Trend Aware (available on Twitter blog)
  • Key Features - Filter/Expected values/Long term
  • Widely Used Outside Twitter

Applicable to univariate time series

Leverage multiple metrics (minimize false positives)

Exploit correlation/topology - observed variables and latent variables

Host Health

Determine the intersection of the set of anomalies of each process

Service Component Health

Determine the intersection of the set of anomalies of each process HC

Anomaly Type - Container Death = all metrics of instances on that container had drops
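The intersection idea can be sketched in a few lines of Python. This is an illustration under assumptions, not the speaker's implementation: anomalies are represented as sets of timestamps per process, and the helper names (`common_anomalies`, `container_died`) are hypothetical.

```python
from typing import Dict, Set


def common_anomalies(anomalies_by_process: Dict[str, Set[int]]) -> Set[int]:
    """Return the timestamps that every process flagged as anomalous."""
    if not anomalies_by_process:
        return set()
    return set.intersection(*anomalies_by_process.values())


def container_died(dropped_by_metric: Dict[str, bool]) -> bool:
    """Container death: every metric of the instances on that container dropped."""
    return bool(dropped_by_metric) and all(dropped_by_metric.values())
```

Requiring agreement across processes (or across all metrics) is what keeps false positives down: a single noisy metric is not enough to flag a host.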

Failure is an Option - Ian Malpass - Etsy

How philosophy of failure is approached at Etsy

three truths

  • you will create bugs
  • you will build the wrong thing
  • you will not foresee the unexpected

There are costs for those truths:

  • money
  • time
  • data (loss)
  • customers
  • credibility

failure is inevitable

expensive failure is not

no barriers

just speed, just trust

Crafting Performance Alerting Tools - Allison McKnight - Etsy

logster to grab specific metrics out of logs

captures in range, aggregates, sends to graphite

reported weekly w/median and perc95
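The capture-aggregate-send flow can be sketched in plain Python. This does not use logster's own parser interface; the log format (`time_ms=...`), the metric prefix, and the function names are assumptions for illustration. The output lines follow Graphite's plaintext `path value timestamp` form.

```python
import math
import re
import statistics
from typing import Iterable, List

# Hypothetical log format, e.g. "GET /listing time_ms=182"
TIMING_RE = re.compile(r"time_ms=(\d+)")


def extract_timings(lines: Iterable[str]) -> List[int]:
    """Pull the page timing out of every log line that has one."""
    timings = []
    for line in lines:
        match = TIMING_RE.search(line)
        if match:
            timings.append(int(match.group(1)))
    return timings


def percentile(values: List[int], pct: int) -> int:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def graphite_lines(prefix: str, timings: List[int], timestamp: int) -> List[str]:
    """Format the median and 95th percentile as Graphite plaintext lines."""
    return [
        f"{prefix}.median {statistics.median(timings)} {timestamp}",
        f"{prefix}.perc95 {percentile(timings, 95)} {timestamp}",
    ]
```

Actually shipping the lines to a Graphite server is left out; the sketch stops at producing the text to send.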

Change alert mechanism

  • individual check for each page/api
  • individual threshold for each page/api

How do you choose thresholds?

Perfnag (Etsy)

95th percentile over 2 weeks * 1.1 (warning)

95th percentile over 2 weeks * something (critical)
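A rough sketch of the threshold rule in Python, not Perfnag's actual code. The 1.1 warning multiplier comes from the notes; the critical multiplier was not captured, so `CRITICAL_MULTIPLIER = 1.5` below is purely a placeholder assumption, as are the function names.

```python
from typing import Tuple

WARNING_MULTIPLIER = 1.1   # from the notes
CRITICAL_MULTIPLIER = 1.5  # assumption: the notes do not record this value


def thresholds(baseline_p95_ms: float) -> Tuple[float, float]:
    """Derive (warning, critical) from the two-week 95th percentile baseline."""
    return (
        baseline_p95_ms * WARNING_MULTIPLIER,
        baseline_p95_ms * CRITICAL_MULTIPLIER,
    )


def classify(current_p95_ms: float, warning: float, critical: float) -> str:
    """Map the current 95th percentile onto an alert state."""
    if current_p95_ms >= critical:
        return "critical"
    if current_p95_ms >= warning:
        return "warning"
    return "ok"
```

Deriving thresholds from each page's own baseline is what makes "individual threshold for each page/api" manageable: nobody has to pick numbers by hand.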

Changing alert format

github.com/etsy/nagios-herald

Better visualization for warning / critical states.

Alerting on improvements

focus on improvements

Database Engineering - Laine Campbell, Pythian

moving from DBA to database engineer

Databases at Scale - Laine Campbell and Charity Majors

engineering

  • quantitative
  • interdisciplinary
  • results focused
  • repeatable and code-driven

systems engineering

  • designing and managing complex systems for complete life-cycle
  • translation from biz to sys, focus

ops engineering

  • designing process to balance objectives
  • infrastructure to serve business

reliability engineering

  • focuses on the glue common to all services/platform
  • deployment, efficiency, scale, performance, observability
  • often done by systems and operations eng. rather than being their own

virtualization and cloud

  • forces horz. scaling
  • forces designing for resilience
  • elasticity drives new data store
  • management by api

infrastructure as code

devops cultures

  • lean manufacturing defines our workflows
  • tighter feedback loops require org. shifts
  • experimentation and controlled failure shift arch and proc. design
  • integration drives empathy

continuous delivery

  • brings us to the source code control paradigm
  • we must be teachers, not gatekeepers
  • testing and compliance become top priorities

polyglot persistence

  • relational is not the end of the line
  • data must be looked at end to end
  • function dictates form
  • we cannot predict all uses

db eng manifesto

  • its about the mission
  • protect the data
  • elim. waste
  • data-driven decision making
  • dbs are not special
  • eliminate the barriers between sw and ops

design for?

  • mission KPIs
  • function not form
  • operational processes and management

ex. kpis

velocity

how quickly can we pivot or change the datastore

efficiency

how elastic, adding resources, vendor lockin, cost per transaction?

security

user management, audit trail, data and connection encryption, vulnerability history

performance

how tunable, limits and curves

availability

spofs, backups, partitioning, failover/rebalancing, consistency

(VESPA)

Partitions always occur, whether outage or overload

polyglot love

  • E. F. Codd's 12 rules of relations
  • sql access
  • acid levels

config mgmt

files/APIs - anti-patterns: in memory/binaries

degradation mgmt: read-only modes, dynamic config, queue draining, timeouts

anti-pattern

static config, long timeouts, bad defaults

change mgmt

online changes, fast alters, atomic ops, instrumentation

anti-pattern: schema level locking

systems

voldemort hive

deploy

config. management orchestration self-service

disciplines and systems

anomaly tests and statistics

need to know

dataflow

lambda arch: pubsub, batch proc, hadoop

cache

immutable architectures will force us to create change at the template layer and redeploy

Day Two

Orbitz

Docker Slave - The Rickbot

Added consul (eventually consistent service registry) - register & lookup for port mapping containers

Chef to only manage on-box/host (outside of docker)

Host machine - Consul, logging, metric agg.

Marathon - Bamboo - HAProxy to register services w/haproxy

github.com/QubitProducts/bamboo

yeoman - bootstrap services

Spring Boot - java autoconfig

Dropwizard Metrics - java in app metrics

Consul Registration/Discovery (OrbitzWorldwide/consul-client)

Logstash/Logback

Swagger

Hystrix

Retrofit + Consul

Amazon ECS? Docker Swarm? Kubernetes?

Automation

  • Docker - repeatable apps
  • Chef - repeatable infra
  • Jenkins - repeatable releases

Delineate config concerns

  • compile time - bake into docker image
  • boot time - bake into playbook/launcher - parameter for Docker
  • anytime - externalize (consul kv, etcd, zookeeper)
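One way to picture the three layers is a lookup that checks them in order. This is a sketch under assumptions, not the Orbitz implementation: the `APP_` environment prefix, the `BAKED_DEFAULTS` dict, the `kv_lookup` callable (a stand-in for a Consul KV, etcd, or ZooKeeper client), and the precedence order are all hypothetical.

```python
import os
from typing import Callable, Mapping, Optional

# Compile time: values baked into the Docker image (hypothetical defaults).
BAKED_DEFAULTS = {"http_port": "8080", "log_level": "INFO"}


def resolve(
    key: str,
    env: Mapping[str, str] = os.environ,
    kv_lookup: Optional[Callable[[str], Optional[str]]] = None,
) -> Optional[str]:
    """Resolve a setting: external store first, then boot-time env, then image."""
    # Anytime: externalized config, changeable while the service is running.
    if kv_lookup is not None:
        value = kv_lookup(key)
        if value is not None:
            return value
    # Boot time: parameters handed to the container at launch.
    env_key = "APP_" + key.upper()
    if env_key in env:
        return env[env_key]
    # Compile time: fall back to what was baked into the image.
    return BAKED_DEFAULTS.get(key)
```

For example, `resolve("http_port", env={"APP_HTTP_PORT": "9000"})` picks the boot-time value over the baked-in default. Letting the most dynamic layer win is a design choice made for this sketch; the talk notes do not specify a precedence.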

Burnout in Tech - John Allspaw (Etsy)

Problem of Burnout

Exhaustion

cynicism

developing a hostile attitude towards their job, losing motivation and passion, shifting to bare minimum from your best

professional inefficacy

negative feelings turning inward on yourself, mistakes on your path, imposter syndrome

Workers are overwhelmed, unable to cope, unmotivated, and display negative attitudes and poor performance.

Stress phenomenon

  • prolonged response to chronic interpersonal stressors on the job
  • three dimensions
    • exhaustion: individual stress
    • cynicism: negative response to job
    • professional inefficacy: negative self-evaluation

outcomes of burnout

  • poor quality of work
  • low morale
  • absenteeism - goes up
  • turnover
  • health problems
  • depression

six strategic areas

mismatch/misfit between person and job predicts burnout. "is it the job or the person?" is the wrong question - both have to be taken into account

  • workload: not usually the issue

    sustainable workload: good

  • control: how much agency one has over their job, whether in a micromanaged or a chaotic work environment; feeling they have an appropriate level of control.

    choice and control: good

  • reward: not just tangible things

    recognition and reward: good

  • community: workplace social relationships with other people, colleagues, supervisors; gaining trust and spirit with each other. bad: unresolved conflicts and competing against each other, which prevent clear comms, sharing knowledge, and providing support; "socially toxic" gossiping/politics. This is becoming more important.

    supportive work community: good

  • fairness: how we do work, policies, rules. people who feel they aren't being treated fairly or with respect will become cynical, and counter-productive behaviors will evolve.

    fairness, respect, and social justice: good

  • values: not in conflict with what you hold. respecting others for theirs. value conflicts will erode.

    clear values and meaningful work: good

more mismatches = more burnout

more matches = more engagement

  • preventing burnout is a better strategy than waiting to treat it
  • building engagement is the best approach to preventing burnout
  • organizational intervention can be more productive than individual intervention

  • Christina Maslach books: The Truth About Burnout, Banishing Burnout

Panel talk

Values and community - recognizing them early as a manager and fostering them. it's a big responsibility

first person to burnout might not be the only

Organizational not individual, patronizing to isolate

very few orgs do work/health polling

pluralistic ignorance - i have to look like I'm doing the right thing, I'm okay, fitting in, mirroring others in the workplace -- then holding back on issues

creating a safe harbor for people to speak if they don't feel comfortable being explicit

trying extra hard to be a team

  • beyond cat gifs
  • being mindful of degrading our community

sharing stories, growing bonds, feeling you aren't alone

a culture that glamorizes the hero is toxic. favor teams that encourage people to disconnect. little things can mean a lot

focusing on when things go right, rather than wrong

larger outcomes, rather than smaller wins

teamwork and community != friendship; based more on trust: confidence and character

research work in canada: incivility in the community, rudeness, bullying, snarky, sarcastic, people reciprocate, and it spirals downward. number one sign in this group: eyerolling

Civility, Respect, Engagement @ Work (CREW)

Teams are Systems Too: Theory of Constraints in Action - Baron Schwartz

Read "The Goal" to discover constraints in action. Business Process Optimization

Goal of the business is to make money (derp)

  • throughput: generated by sales
  • inventory: money invested / wip / haven't earned back / feature
  • operating expense: time to ship feature, moving from inventory to throughput
  • businesses are systems to produce money

biz optimizations

  • theory of constraints: primary constraints hold back every step. documenting/blogging/etc can slow down a feature release. optimize process around constraints

constraints

  • dependencies: linkages to other events that have to happen before something happens

  • variations: changes in input will change output

how constraints impact systems

  • decouple to create "async" processing
  • gating work/buffering work to slowly release it out
  • trimming waste will be negative, since it creates fluctuations, unless you know for sure it is the bottleneck
  • "this is queueing theory", Amdahl's law
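A small numeric sketch of why only the bottleneck matters, assuming a simple serial pipeline where each stage has a fixed rate (work items per week). The stage names and rates are invented for illustration; `amdahl_speedup` is the standard Amdahl's law formula, 1 / ((1 - p) + p / s), where p is the fraction of work that is sped up and s is the speedup of that fraction.

```python
from typing import Dict


def pipeline_throughput(stage_rates: Dict[str, float]) -> float:
    """A serial pipeline ships no faster than its slowest stage."""
    return min(stage_rates.values())


def bottleneck(stage_rates: Dict[str, float]) -> str:
    """Name of the stage that constrains the whole system."""
    return min(stage_rates, key=stage_rates.get)


def amdahl_speedup(fraction_improved: float, local_speedup: float) -> float:
    """Overall speedup when only part of the work gets faster."""
    return 1.0 / ((1.0 - fraction_improved) + fraction_improved / local_speedup)


# Hypothetical team pipeline, in features per week.
rates = {"design": 10.0, "code": 6.0, "review": 2.0, "deploy": 8.0}
faster_code = dict(rates, code=12.0)    # optimizing a non-bottleneck stage
faster_review = dict(rates, review=5.0)  # optimizing the bottleneck
```

Doubling the coding rate leaves throughput unchanged because review still caps it, while speeding up review raises throughput until the next-slowest stage becomes the constraint.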

continuous improvement

  • make your bottleneck the leader to synchronize, then improve/remove constraints
  • Brent is the constraint/bottleneck

switching back and forth and being the dependency is bad

more problems

  • no prediction
  • lack of repeatable steps
  • lack of knowledge sharing when people were siloed

The E-Myth Revisited (book)

Github for Poets - Aaron Suggs - Kickstarter - @ktheory

what is g4p

live demo of a copy change, class open to all staff, intro to dev tools + process

reference to: liberal arts schools

github flow in the browser - lowest barrier of entry!!!!

  • shows how we test and deploy
  • makes engineering culture transparent so other teams can see how we work
  • everyone can commit code

why do it

  • lightweight process for making simple changes
  • for most trivial changes, lets you avoid building a CMS!

  • cultural values become transparent to engineering, which helps generate consensus
  • version control is a communication tool, creates history and story on all changes
  • transparency + consensus = blamelessness
  • increases your impact

how to do it

  • explain git branches + commits
  • explain file layout
  • always be learning (just in time learning)
  • DRY - don't repeat yourself; code at the limit of your understanding to improve

protip: safe deploy process

  • a git branch of your own
  • tests and continuous integration
  • deployer checks what's getting deployed

protip: explain what this means

WTF emojis mean

👍 means what for different groups

Demo

Experience D.U.T.

gh:hivequeen rack-attack

"It's a security liability" - But what else is?

TCP and the Lower Bound of Web App Performance - John Rauser - Pinterest

https://www.youtube.com/watch?v=G6ah2cq4LFY
