@philandstuff
Last active August 29, 2015 13:57
scale-summit 2014

Scale Summit 2014

Intro, MBS

ideas for sessions

  • bootstrapping environments (without object stores)
  • service discovery
  • removing spofs
  • modern monitoring – sensu, runbooks, dashboards
    • tradeoff between ease of management and sophistication
    • elastic sites?
  • surviving DDoS attacks when your site is transactional
  • modern cmdbs
  • ansible
  • icinga re-acknowledgement
    • ie I know disk is critical at 10%, but please re-alert at 5%
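
The re-acknowledgement idea can be sketched as a small decision function. This is only an illustration of the behaviour being asked for, not an icinga feature or API: `Acknowledgement`, `realert_below_pct` and `should_alert` are hypothetical names invented here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Acknowledgement:
    # Hypothetical field: the operator's "wake me again" level, as % free space.
    realert_below_pct: float


def should_alert(free_pct: float, critical_pct: float,
                 ack: Optional[Acknowledgement]) -> bool:
    """Return True if a notification should be sent for this disk reading."""
    if free_pct > critical_pct:
        return False  # not critical, nothing to notify about
    if ack is None:
        return True   # critical and nobody has acknowledged it yet
    # Acknowledged: stay quiet until the stricter threshold is crossed.
    return free_pct <= ack.realert_below_pct
```

With a critical threshold of 10% and an acknowledgement carrying `realert_below_pct=5`, a reading of 8% stays silent and a reading of 4% notifies again.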

session 1: monitoring & metrics

  • big infrastructure
    • shared web servers
    • shared tomcat servers
    • zenoss over snmp
      • snmp didn’t scale
    • problem: everything is averaged over 5 minutes
      • teams are spinning up their own graphite instances to monitor their own stuff
    • zenoss required 40 boxes, I expected 2
  • what does graphite look like at scale?
    • protip: buy fusion io
    • it can be hard to rebalance your metrics
      • particularly if you’re using consistent hashing
    • carbonate for migrating data to another graphite server
      • though you’ll probably end up with downtime
  • has anyone used skyline?
    • we looked at it, but we got lots of false alerts
      • my suspicion is that if we understood maths better, we could make it work really well
  • in sensu-community-plugins, there’s a check-graphite
    • it does nice things like exceeding N std deviations
  • what do people use below graphite?
    • we’re using collectd
      • the latest stuff has statsd and jmx connectors
  • anyone using ganglia?
    • we’re replacing ganglia with sensu stuff and diamond
      • why are you using diamond rather than the in-built sensu stuff?
        • because we’re a python shop
      • we push data over rabbitmq
      • and fan in to a big fat central fusionio graphite
      • how do you monitor rabbit?
        • sensu monitors rabbit using rabbit
        • there are healthchecks which should fire if rabbit is completely broken
        • we have a cron job on every rabbit and every sensu server to kill the process every hour
          • and it still works
  • is anyone using riemann?
    • is it worth spending time with?
    • where does it add value?
      • real time anomaly detection
      • it does events as well as numbers
      • it also has event timeouts – it can notify on an absence of events as well as presence
      • I think you could replace statsd with riemann
  • does anyone store second or subsecond data for a long time?
  • we have a single biggest day each year
    • we snapshot everything for that day - stats, logs, etc
    • use it to drive load testing for the next year
  • we’ve been trying redshift
  • how does elasticsearch cope with metrics?
    • we push quite large documents about everything to do with a web request
    • I often find log data in kibana much more useful than the same data in graphite
    • does anyone use realtime queries to drive alerting from elasticsearch?
  • one thing we’ve done recently is tune down the number of io operations per second that carbon uses; it massively reduces disk usage
    • or write to ram disk and sync once per minute
  • how do you get devs to make more metrics available?
    • you put them on call until they do
  • do people cull metrics at all?
    • i never have enough data
  • do people have app metrics measured by their continuous delivery pipeline?
    • our apps publish an xml document which is a schema of the types of metrics that they can publish
  • if I don’t hate myself, is there anything other than sensu I should use for monitoring that environment?
    • does anyone rely on cloudwatch?
      • we use it as a source for some data (ELB metrics)
        • you can get these delivered into S3 these days
      • but it only stores data for two weeks
  • does anyone using sensu miss nagios tactical view?
    • I miss having a decent dashboard
      • I don’t miss the 10 different nagioses per environment
      • I don’t miss the failover when we lost the primary nagios instance and all the state in it
    • we wrote a dashboard to query nagios and sensu
  • from the internet peanut gallery: is anyone using circonus?
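
The "exceeding N std deviations" style of check mentioned above can be sketched in a few lines. This is not the code from sensu-community-plugins; it is a minimal illustration using the standard library, and the function name and the default `n` are assumptions made for this example.

```python
from statistics import mean, stdev
from typing import Sequence


def exceeds_n_sigma(history: Sequence[float], latest: float, n: float = 3.0) -> bool:
    """True if `latest` is more than `n` sample standard deviations
    away from the mean of `history`."""
    if len(history) < 2:
        return False  # not enough data to estimate a deviation
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return latest != mu  # flat series: any change at all is a deviation
    return abs(latest - mu) > n * sigma
```

A check like this only looks at spread around the mean, so it will still misfire on seasonal or trending metrics, which may be part of the false-alert problem people described.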

Session 2: versioning of artefacts

  • my agenda:
    • the presence of artefacts I don’t necessarily own
      • large graphical images or video data
      • third party applications
    • I may wish to release the same artefact multiple times
      • we’ll use oracle 11 everywhere at one patch level
        • but in different configurations
    • windows images (VDIs)
  • fpm is useful
    • but it never generates a spec file or a source rpm
    • makes me uncomfortable
  • I’m not happy about rpms, because you can only have one version of one package installed at once
    • eg a simple webapp where we don’t want to do the loadbalancer dance
    • that also implies the app is relocatable which vendor binaries often aren’t
  • is containerization part of the solution?
    • it allows you to have multiple overlapping filesystems
    • a model: each customer has their own container
      • we haven’t done it
      • that sounds very expensive
    • how do you version control containers?
      • do you treat them as a single binary?
      • do you reconstruct it?
    • a lot of solutions assume all machines are stateless
      • someone else will deal with the databases
    • containers allow you to minimize surprise
      • a DBA logging into your container can find things where they expect, even if it’s from an underlying frankenstein filesystem
    • I don’t mind snapshots, but they should be generated mechanically and repeatably.
  • what tool would you love to exist in an ideal world?
    • I’d like the deployment database to do effectively dependency injection
      • I know where the dependencies are and what data I’m injecting, so I can use system monitoring to know what I’ve deployed

Session 2b: µservices

  • HTTP isn’t the best protocol in the world
  • use queues!
  • refactoring and testing is a better solved problem within the python programming language than over the network
    • I don’t think it’s hard to test µservices
      • there are clear contracts
        • that’s the theory, right?
  • we end up building lots of small monoliths and wiring together
  • we switched to using amazon SNS to manage notifications
  • how you get your ops team to support µservices is you get them to support as little as possible
    • they only work when the functional team owns the whole stack right to the bottom
  • services have a life cycle
    • we like building things
    • we should get better at killing things when nobody is using them
  • is there an additional cost to the organization for running µservices?
    • is there an organizational cost to having a 2 million line codebase?
  • ownership of services
    • handover of building team to ongoing running team
    • problems can get pushed back to the building team
  • antipattern around µservices:
    • developers think they’re clever
  • ntp is a µservice
  • aren’t µservices and SOA the same thing?
    • is it SOA done right?
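
The "use queues!" point is roughly that a producer should not need the consumer to be up at the moment it sends. A minimal in-process sketch, with the standard library `queue.Queue` standing in for a real broker; the message shape and the function names are invented for this example.

```python
import queue
import threading


def run_worker(inbox: queue.Queue, results: list) -> None:
    """Consume messages until a None sentinel arrives."""
    while True:
        msg = inbox.get()
        if msg is None:
            break
        results.append(msg["order_id"] * 2)  # stand-in for real work


def demo() -> list:
    inbox: queue.Queue = queue.Queue()
    results: list = []
    # The producer enqueues before the consumer has even started;
    # with a direct HTTP call these requests would simply have failed.
    for order_id in (1, 2, 3):
        inbox.put({"order_id": order_id})
    inbox.put(None)  # sentinel: no more work
    worker = threading.Thread(target=run_worker, args=(inbox, results))
    worker.start()
    worker.join()
    return results
```

The trade-off is that the caller no longer gets an immediate answer, so anything that needs a response has to be modelled as another message.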

Session 3: managing OSS software at work

  • how do you deal with PRs?
    • what about things that are not on your roadmap?
      • by not having a very good roadmap?
    • or moving in directions you don’t want to go?
    • it can be awkward because people might have put a lot of work in
      • but you need to explain “if you want to do that you need to fork it”
      • you can try to avoid it by writing a decent rationale of what you’re trying to do
      • though you can’t answer all the questions up front
  • you want to optimize for dragging people into your community
    • as the implementer, your documentation is going to be awful
    • because you already understand the whole system and don’t understand when you’re assuming tacit knowledge
    • whereas if you can attract users to your irc channel, and answer their questions really clearly, they can write great docs for you
    • I try to have a policy of: if anything confuses you, here’s my email, twitter, irc, etc and I will try to help you
    • encourage people to raise bugs against docs
    • I come from the perl community
      • there are 10-15 year old projects there where the maintainer has changed 4-6 times
      • have you got an example?
        • Catalyst
          • ~200 repos (core + plugins)
          • ~450 active committers
  • plugins are interesting: if people are trying to pull the project in different directions, you can let them through extensions but keep the core very small
  • does anyone have experience of running OSS projects at work?
      • how do you manage your time?
      • the important PRs to pay attention to are those from new contributors
        • certainly get back within 24 hours
        • don’t necessarily have to merge
  • why are you open sourcing this code?
    • to get the community using it
    • to get good publicity
  • do you have an OSS landing page?
    • yes, but it’s out of date
  • the OSS stuff that has mostly been infrastructure-related we’ve been trying to put into a separate github org
  • you imply some level of support here
    • running an OSS project is more than just making code open
    • to be able to do that successfully, you need to at least mentally divest yourself from your parent organization
  • what do you do if that project isn’t your main focus?
    • my OSS contributions are entirely selfish
    • you need a maintainer
      • there needs to be clear communication channels
  • what does a maintainer do?
    • is it always one person?
      • no! not if you can avoid it
      • once a project has a community it’s difficult for one person to maintain
      • even if you’re not writing code, managing the community can rapidly become a full-time job
    • what about the cost of maintenance?
      • use travis!
      • but please review the contribution even if the contribution passes the tests
    • problem of selectivity, vision and direction
      • mozilla in the early days, just accepted everything.
      • ended up having to rewrite as firebird (now firefox)
  • how do you ensure governance doesn’t become onerous?
    • example of people who forked their own project after it had become an apache project
    • example of gcc fork (egcs) which got merged back in
  • a lot comes back to documenting your original vision
    • I’ve been added as a maintainer in places, and sometimes there’s clear advice and sometimes there isn’t.
  • open sourcing a project that you don’t use is a recipe for abandonware.
    • we also have an organization for abandoned code to move it out of our main github org
  • forks
    • how do you transfer maintainership?
    • what happens if a project gets abandoned and then forked?
  • what are the good communication channels to have for an OSS project?
    • own website for announcement and discovery?
      • how do you summarize your project?
      • peeve: like <other project> but X
    • community of contributors comes from community of users
      • so good user documentation will foster contributors
    • issues
      • is it worth seeding the issues list even if we have an internal tracker?
      • yes, because it helps users google for error messages
      • they are effectively documentation
      • do you move to only use the external tracker or do you have an internal tracker too?
    • do you need a security contact?
      • yes, with a GPG key
    • people need to see activity
      • if all your activity is on your internal tracker & mailing list & private irc, people will think it’s dead
    • where do people host mailing lists?
      • google groups
    • a few people are averse to irc
      • people don’t realise that they won’t get an immediate response necessarily
      • irc shouldn’t be used alone
      • timezones are also an issue
    • ipython uses hangouts
    • gmane: a newsgroup view on your mailing list
    • don’t have a separate irc channel per project if you’re managing lots of projects
  • how do you host your docs?
    • you should control your domain?
    • when is a README not enough?
    • start with github pages, and you can migrate later
    • what should it have?
      • screenshots
      • getting started guides
    • github pages are a bad idea because you can’t version them
      • readthedocs keeps old versions too
    • contributions must update docs when they update behaviour
  • documentation & communication is super super important
    • careful with contributions from newbies
      • rejecting a contribution because of lack of tests can be tricky
        • they might not have written many tests in general
        • they might not understand your particular test framework
      • but rejecting because of no docs is more reasonable
      • you can write tests for them
        • and use this as a communication channel
        • “does this test look like it’s measuring the thing you’re trying to build?”
  • how do you handle trolls, griefers and timewasters?

Session 4: what’s changed since last scale camp?

  • what’s arrived? what’s died?
  • Big Data is now a thing people talk about
    • you’re now seeing adverts on the tube about it
  • is couchdb dead?
    • npm?
    • we still use it, but we only used it as a key-value store
  • still going:
    • mongo
    • riak
  • websockets are now standardized and supported by lbs, proxies
  • edgeconf
    • grunt and pig and oink and stuff
    • doing a js build and running tests
    • angularjs
  • ndoc has gone
  • flash is in its death throes
  • most video sites work on an ipad
  • webgl has taken hold
  • epic demoed unreal engine 4 in firefox
  • 60 fps on the web
  • docker!
    • although solaris has been doing it for yonks
  • golang has taken off
    • when did go hit 1.0?
    • people are rewriting individual bits in go (rather than everything)
  • is hacker news dead yet?
  • bitcoin happened
    • VPS providers have been getting attacked by people trying to steal bitcoins
    • people trawling github to find access keys
    • bitcoin mining in the browser
  • erlang
    • nobody’s started writing things in it
    • though there’s elixir
    • and julia
    • and idris
  • what’s falling out of favour?
    • ruby? no
    • scala? no
  • facebook’s hack
    • seems sensible if you’re already in a php environment
  • bittorrent
    • an incredibly good way of saturating your network
    • though this isn’t new
  • µservices
    • just due to containerization?
    • seems to be a bunch of ex-tw people
  • elasticsearch is now usable
    • and quite good
    • and they acquired logstash and kibana
  • logs being searchable in es
    • splunk has a reasonable oss competitor
  • graphite has grown
    • there’s experimentation going on there
      • storage backends (cassandra, leveldb)
  • what about lucene?
    • very few people use it directly these days
  • snowden
  • DC security
  • https everywhere
    • gmail is now ssl only
    • facebook
    • PFS
    • the perception that TLS is expensive
    • spdy
  • webp
  • IE6 is on its deathbed
  • winxp
    • though it’s still in cash terminals
  • mobile growth
    • many sites are on the verge of 50% mobile traffic
    • talk of mobile first and now mobile only
  • 4G
  • bootstrap
  • wearables & IoT
    • fitbit
    • pebble
    • automotive
      • tesla motors
  • security updates
    • wordpress now has autoupdate
  • nagios isn’t dead yet
    • sensu is still the hot new thing
    • riemann
    • flapjack
  • desktops are going away
    • except for gaming
  • centos is now owned by redhat
  • linux mint?
  • systemd
  • ubuntu as a server is now more probable
    • is upstart going away?
  • postgres got built-in replication
  • graph dbs (neo4j)
  • paas
    • people are still excited
    • it got even more complicated to install your own
  • where’s node going?
  • streaming extensions
    • rx in .NET
    • rise of functional
  • linux on the desktop?
    • the XPS13 is good
    • the rise of chromebooks
  • openstack?
    • everyone thinks it’s a great idea
  • private clouds?
    • azure will sell you an on-premise cloud thing
    • what’s the difference between an in-house cloud and a data centre?
  • drones, quadcopters, hexapods
    • for filming
  • what’s coming up? what will be important at the next scale summit?
    • net security is in flux
    • forks of android will be the new linux distro
    • http 2
    • IPv6?
    • anomaly detection
    • software defined networks
    • containerization
    • silicon roundabout?
      • it’s not a playground for children anymore
      • the adults have taken over
    • computing in government
      • US has 18F
      • GDS
    • I’d like there to be a world-class home grown east london startup doing technically challenging stuff
      • startups which solve technical problems don’t generally get funded
      • acquisitions
    • crowdfunding?
      • no one cares
  • what’s going to die?
    • couchdb
    • python 2 will not die

Session 5: mentoring

  • how do we hire & train new people into our industry?
  • we certainly have struggled to recruit
    • we’ve come to the realization that part of the solution is hiring junior people & growing them into the role
    • I’ve been asked to mentor a junior person but I’ve no idea what to do
  • I’m a recent junior
    • one on one time is quite good
    • I came in having a basic idea what I’d be doing
    • be open for questions
      • the devops world is really overwhelming
      • it’s so useful to be able to ask things
    • that’s one of the ground rules we’ve agreed on
      • ie that I’m interruptible
    • we’ve certainly noticed that hiring in the junior area is useful
  • it’s great having juniors because you get chaos monkeys as well
    • if you’re not prepared to let a junior touch something, you probably need to make it more resilient
  • ETO1: 12-week night course
    • teaches you how to teach
  • how do you get the theory? how do you talk about underlying principles that are independent of the particular situation at hand?
    • pair programming is really good for that
      • does that depend on the teaching style of the pair?
    • make the junior document the things that you’re teaching them
      • it helps ensure that they’ve understood it
  • I get irritated when technical people tweet complaining about the cost of interruptions
    • when you have new people, you have to empower them to interrupt
    • I don’t think you should have your entire team mentor a new starter
    • we use the red flag system
      • you put a red flag up if you don’t want to be interrupted
    • designated interruptible person
    • juniors also have a difficult time saying no
      • you want to make everyone happy and be helpful
    • do you have a system that makes work visible? eg kanban
      • we have a helpdesk system
      • but external people don’t use it for smaller tasks
        • raise a ticket on their behalf
    • how do we teach juniors that it’s ok to say no?
      • also, how to understand what the requestor is trying to achieve, rather than the specific task they want done, and recognize when it’s the wrong fit?
  • juniors are way more engaged if they get a choice (however constrained) on what they get to spend their time on
  • also allow people to fail
    • teach them that it’s okay to fail
    • I troll my junior developers sometimes
      • I lead them down the garden path
      • but then I’m there to pick up the pieces when they fail
    • do something that’s visible to other people in the company
      • so that they can show people what they’re capable of
  • how do you direct people through different areas of knowledge?
    • do you go shallow on lots of tools? Or really deep on one thing?
    • depends on the junior
      • throw things at them and see what sticks
    • go broad with the concepts early on
      • architecture, system, etc
  • onboarding
    • desk & computer should be ready
    • first week should be meeting all the people they need to know about
    • have monthly checkins with the mentor
      • checkins, not reviews!
    • get a sales person to give a demo of whatever it is you build
  • can anyone recommend useful resources for managing developers?
    • how to talk to your kids or something like that
  • how do you improve diversity?
    • how do juniors find your roles?
    • you don’t have to stick to the same old networks when hiring juniors
    • thoughtbot – structured apprentice schemes
    • I wonder if it would help to be more explicit & realistic in job postings about the experience required and the salary?
      • recruiters muddy the waters a lot
      • go direct if you can
  • how do you know when to stop mentoring? and how do you measure success?

lightning talks

tdoran docker to prod in 5 minutes

  • docker + 150 lines of shell

mirroring the internet

  • mirroring cpan, rubygems, npm
  • filesystems are good at serving things that look like files
  • you don’t need to use couch or
  • what was the easiest to mirror?
    • cpan – it has a single line rsync command to create a mirror
  • wikipedia is hard to mirror
    • each wikimedia site has a different set of plugins

analytics and search evaluation

  • it’s important to have good search for your site
  • we use google analytics. you can use this to find click behaviour for particular search terms
    • ie for term X, how often do people click on link 1, 2, 3, 4, etc
  • automate this!
  • crunch the most popular searches
  • identify how many clicks they got
  • use it to calculate how many more clicks we would have got if we had ordered the results better
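
The "how many more clicks" calculation can be sketched for a single search term. The talk did not say how it was computed, so everything here is an assumption: a simple position-bias model in which `examine_prob[i]` is the chance that a user looks at position i at all, with made-up values in the usage note.

```python
from typing import Sequence


def potential_extra_clicks(clicks_by_position: Sequence[int],
                           searches: int,
                           examine_prob: Sequence[float]) -> float:
    """Estimate clicks gained by showing the most-clicked-when-seen results first."""
    if len(clicks_by_position) != len(examine_prob):
        raise ValueError("need one examine probability per position")
    if any(p <= 0 for p in examine_prob):
        raise ValueError("examine probabilities must be positive")
    if searches <= 0:
        return 0.0
    # Appeal of each result: clicks per search, corrected for how rarely
    # its current position is looked at.
    appeal = [c / (searches * p)
              for c, p in zip(clicks_by_position, examine_prob)]
    # Best ordering under this model: most appealing result in the
    # most-examined slot, and so on down the page.
    best_order = sorted(appeal, reverse=True)
    slots = sorted(examine_prob, reverse=True)
    expected = searches * sum(a * p for a, p in zip(best_order, slots))
    return expected - sum(clicks_by_position)
```

For example, `potential_extra_clicks([10, 40, 5], 100, [1.0, 0.5, 0.25])` describes a term where the second link gets most of the clicks; summing this figure over the most popular searches gives a rough ranking of which terms are worth fixing first.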

juju

  • juju is a service orchestration tool

your laptop is not your friend

  • apple, facebook employees hacked via website malware, java vulnerability
  • data in transit protection
  • data at rest protection
  • authentication
    • user to device, user to service, device to service
  • secure boot
    • firmware
  • platform integrity and app sandboxing
  • app whitelisting
    • although the key here is to ensure that the whitelist doesn’t take too long to modify for new things
  • security policy
  • sounds like configuration management
  • external interface protection (firewalls)
  • device update policy
  • incident response
    • things will go wrong
  • although don’t worry too much about this
    • unless you have to.

write libraries, not services

  • scale using libraries
  • a library has all the modularity properties that services have
  • except you don’t need to worry about the network going down

we’re doing a festival called electromagnetic field

  • august 29th for 3 days
  • go here

outro

@petemounce

The book is "how to talk so kids will listen & listen so kids will talk" by Adele Faber and Elaine Mazlish.
