philandstuff/scale-summit.org

    Scale Summit 2014

Intro, MBS


  Chatham House Rule, so no attribution of ideas to people or
    companies

ideas for sessions


  bootstrapping environments (without object stores)
  service discovery
  removing spofs
  modern monitoring – sensu, runbooks, dashboards
    
      tradeoff between ease of management and sophistication
      elastic sites?
    
  
  surviving DDoS attacks when your site is transactional
  modern cmdbs
  ansible
  icinga re-acknowledgement
    
      ie I know disk is critical at 10%, but please re-alert at 5%
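
The re-acknowledgement idea can be sketched as tiered thresholds: an acknowledgement silences only the tier that was breached when it was made, and a deeper breach alerts again. This is an illustration, not an Icinga feature or API; `THRESHOLDS`, `breached_level` and `should_alert` are hypothetical names.

```python
# Percent of disk space still free; each lower value is a more severe tier.
# These values mirror the "critical at 10%, re-alert at 5%" example above.
THRESHOLDS = (10.0, 5.0)

def breached_level(free_percent, thresholds=THRESHOLDS):
    """Return how many tiers have been breached (0 means healthy)."""
    return sum(1 for limit in thresholds if free_percent <= limit)

def should_alert(free_percent, acked_level=0):
    """Alert only if the current breach is deeper than the acknowledged one."""
    return breached_level(free_percent) > acked_level

print(should_alert(8.0))                 # unacknowledged breach of the 10% tier
print(should_alert(8.0, acked_level=1))  # same tier, already acknowledged
print(should_alert(4.0, acked_level=1))  # crossed the 5% tier since the ack
```

Storing the acknowledged tier rather than a simple yes/no flag is what lets the second threshold cut through an existing acknowledgement.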
    
  
session 1: monitoring & metrics


  big infrastructure
    
      shared web servers
      shared tomcat servers
      zenoss over snmp
        
          snmp didn’t scale
        
      
      problem: everything is averaged over 5 minutes
        
          teams are spinning up their own graphite instances to monitor
            their own stuff
        
      
      zenoss required 40 boxes, I expected 2
    
  
  what does graphite look like at scale?
    
      protip: buy fusion io
      it can be hard to rebalance your metrics
        
          particularly if you’re using consistent hashing
        
      
      carbonate for migrating data to another graphite server
        
          though you’ll probably end up with downtime
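
Why rebalancing hurts under consistent hashing can be shown with a toy hash ring. This is a sketch, not carbon's actual relay code: the server names, metric names and replica count are made up. Adding a server changes where some metrics are routed, while their history stays on the old server until it is migrated.

```python
import bisect
import hashlib

def _hash(key):
    # md5 is used here only as a stable, well-spread hash, not for security.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Toy consistent-hash ring with virtual nodes (hypothetical, for illustration)."""

    def __init__(self, nodes, replicas=100):
        self._ring = sorted(
            (_hash(f"{node}:{i}"), node) for node in nodes for i in range(replicas)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key):
        # The first ring point clockwise from the key's hash owns the key.
        idx = bisect.bisect(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]

metrics = [f"servers.web{i:03d}.cpu.user" for i in range(1000)]
before = HashRing(["graphite-a", "graphite-b", "graphite-c"])
after = HashRing(["graphite-a", "graphite-b", "graphite-c", "graphite-d"])

# Metrics whose routing changed: new points land on the new server,
# but the existing history is still on disk on the old one.
moved = [m for m in metrics if before.node_for(m) != after.node_for(m)]
print(f"{len(moved)} of {len(metrics)} metrics now route to a different server")
```

Only a fraction of the metrics move, and all of them move to the new server, but each one still needs its old data copied across.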
        
      
  has anyone used skyline?
    
      we looked at it, but we got lots of false alerts
        
          my suspicion is that if we understood maths better, we could
            make it work really well
        
      
  in sensu-community-plugins, there’s a check-graphite
    
      it does nice things like exceeding N std deviations
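
The "exceeding N std deviations" idea can be sketched in a few lines. This is not the check-graphite plugin's implementation; `exceeds_n_sigma` and the sample values are hypothetical.

```python
from statistics import mean, stdev

def exceeds_n_sigma(history, latest, n=3.0):
    """True if `latest` is more than n sample standard deviations from the mean of `history`."""
    if len(history) < 2:
        return False  # not enough points to estimate a spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return latest != mu  # flat history: any change counts as a deviation
    return abs(latest - mu) > n * sigma

recent = [100, 102, 98, 101, 99, 100, 103, 97]  # made-up datapoints
print(exceeds_n_sigma(recent, 101))
print(exceeds_n_sigma(recent, 150))
```

A check like this assumes the metric is roughly stable; metrics with daily cycles or trends would need a window chosen to match.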
    
  
  what do people use below graphite?
    
      we’re using collectd
        
          the latest stuff has statsd and jmx connectors
        
      
  anyone using ganglia?
    
      we’re replacing ganglia with sensu stuff and diamond
        
          why are you using diamond rather than the in-built sensu stuff?
            
              because we’re a python shop
            
          
          we push data over rabbitmq
          and fan in to a big fat central fusionio graphite
          how do you monitor rabbit?
            
              sensu monitors rabbit using rabbit
              there are healthchecks which should fire if rabbit is
                completely broken
              we have a cron job on every rabbit and every sensu server to
                kill the process every hour
                
                  and it still works
                
              
  is anyone using riemann?
    
      is it worth spending time with?
      where does it add value?
        
          real time anomaly detection
          it does events as well as numbers
          it also has event timeouts – it can notify on an absence of
            events as well as presence
          I think you could replace statsd with riemann
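
Notifying on an absence of events amounts to giving each event a time-to-live and flagging sources that have gone quiet. The sketch below shows the idea in plain Python; it is not Riemann's API, and `HeartbeatWatcher` and the service names are hypothetical.

```python
import time

class HeartbeatWatcher:
    """Remember when each service last reported and flag those that went quiet."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.last_seen = {}

    def record(self, service, now=None):
        # `now` can be passed explicitly to make the example deterministic.
        self.last_seen[service] = time.time() if now is None else now

    def expired(self, now=None):
        """Return services whose last event is older than the TTL."""
        now = time.time() if now is None else now
        return sorted(s for s, seen in self.last_seen.items() if now - seen > self.ttl)

watcher = HeartbeatWatcher(ttl_seconds=60)
watcher.record("web", now=0)
watcher.record("db", now=50)
print(watcher.expired(now=90))
```

A threshold check on a metric's value cannot catch this case, because a dead service produces no datapoints to compare.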
        
      
  does anyone store second or subsecond data for a long time?
  we have a single biggest day each year
    
      we snapshot everything for that day - stats, logs, etc
      use it to drive load testing for the next year
    
  
  we’ve been trying redshift
  how does elasticsearch cope with metrics?
    
      we push quite large documents about everything to do with a web
        request
      I often find log data in kibana much more useful than the same
        data in graphite
      does anyone use realtime queries to drive alerting from
        elasticsearch?
        
          yes, from graylog2
        
      
  one thing we’ve done recently is tuning down the amount of io
    operations that carbon uses per second.  massively reduces disk
    usage
    
      or write to ram disk and sync once per minute
    
  
  how do you get devs to make more metrics available?
    
      you put them on call until they do
    
  
  do people cull metrics at all?
    
      i never have enough data
    
  
  do people have app metrics measured by their continuous delivery
    pipeline?
    
      our apps publish an xml document which is a schema of the types
        of metrics that they can publish
    
  
  if I don’t hate myself, is there anything other than sensu I
    should use for monitoring that environment?
    
      does anyone rely on cloudwatch?
        
          we use it as a source for some data (ELB metrics)
            
              you can get these delivered into S3 these days
            
          
          but it only stores data for two weeks
        
      
  does anyone using sensu miss nagios tactical view?
    
      I miss having a decent dashboard
        
          I don’t miss the 10 different nagioses per environment
          I don’t miss the failover when we lost the primary nagios
            instance and all the state in it
        
      
      we wrote a dashboard to query nagios and sensu
    
  
  from the internet peanut gallery: is anyone using circonus?

Session 2: versioning of artefacts


  my agenda:
    
      the presence of artefacts I don’t necessarily own
        
          large graphical images or video data
          third party applications
        
      
      I may wish to release the same artefact multiple times
        
          we’ll use oracle 11 everywhere at one patch level
            
              but in different configurations
            
          
      windows images (VDIs)
    
  
  fpm is useful
    
      but it never generates a spec file or a source rpm
      makes me uncomfortable
    
  
  I’m not happy about rpms, because you can only have one version of
    one package installed at once
    
      eg a simple webapp where we don’t want to do the loadbalancer
        dance
      that also implies the app is relocatable which vendor binaries
        often aren’t
    
  
  is containerization part of the solution?
    
      it allows you to have multiple overlapping filesystems
      a model: each customer has their own container
        
          we haven’t done it
          that sounds very expensive
        
      
      how do you version control containers?
        
          do you treat them as a single binary?
          do you reconstruct it?
        
      
      a lot of solutions assume all machines are stateless
        
          someone else will deal with the databases
        
      
      containers allow you to minimize surprise
        
          a DBA logging into your container can find things where they
            expect, even if it’s from an underlying frankenstein filesystem
        
      
      I don’t mind snapshots, but they should be generated
        mechanically and repeatably.
    
  
  what tool would you love to exist in an ideal world?
    
      I’d like the deployment database to do effectively dependency
        injection
        
          I know where the dependencies are and what data I’m injecting,
            so I can use system monitoring to know what I’ve deployed
        
      
Session 2b: µservices


  HTTP isn’t the best protocol in the world
  use queues!
  refactoring and testing is a better solved problem within the
    python programming language than over the network
    
      I don’t think it’s hard to test µservices
        
          there are clear contracts
            
              that’s the theory, right?
            
          
  we end up building lots of small monoliths and wiring them together
  we switched to using amazon SNS to manage notifications
  how you get your ops team to support µservices is you get them to
    support as little as possible
    
      they only work when the functional team owns the whole stack
        right to the bottom
    
  
  services have a life cycle
    
      we like building things
      we should get better at killing things when they’re not being used
    
  
  is there an additional cost to the organization for running
    µservices?
    
      is there an organizational cost to having a 2 million line
        codebase?
    
  
  ownership of services
    
      handover of building team to ongoing running team
      problems can get pushed back to the building team
    
  
  antipattern around µservices:
    
      developers think they’re clever
    
  
  ntp is a µservice
  aren’t µservices and SOA the same thing?
    
      is it SOA done right?
    
  
Session 3: managing OSS software at work


  how do you deal with PRs?
    
      what about things that are not on your roadmap?
        
          by not having a very good roadmap?
        
      
      or moving in directions you don’t want to go?
      it can be awkward because people might have put a lot of work in
        
          but you need to explain “if you want to do that you need to
            fork it”
          you can try to avoid it by writing a decent rationale of what
            you’re trying to do
          though you can’t answer all the questions up front
        
      
  you want to optimize for dragging people into your community
    
      as the implementer, your documentation is going to be awful
      because you already understand the whole system and don’t
        understand when you’re assuming tacit knowledge
      whereas if you can attract users to your irc channel, and
        answer their questions really clearly, they can write great
        docs for you
      I try to have a policy of: if anything confuses you, here’s my
        email, twitter, irc, etc and I will try to help you
      encourage people to raise bugs against docs
      I come from the perl community
        
          there are 10-15 year old projects there where the maintainer
            has changed 4-6 times
          have you got an example?
            
              Catalyst
                
                  ~200 repos (core + plugins)
                  ~450 active committers
                
              
  plugins are interesting: if people are trying to pull the
    project in different directions, you can let them through
    extensions but keep the core very small
  does anyone have experience of running OSS projects at work?
    
      how do you manage your time?
        
          the important PRs to pay attention to are those from new
            contributors
            
              certainly get back within 24 hours
              don’t necessarily have to merge
            
          
  why are you open sourcing this code?
    
      to get the community using it
      to get good publicity
    
  
  do you have an OSS landing page?
    
      yes, but it’s out of date
    
  
  the OSS stuff that has mostly been infrastructure-related we’ve
    been trying to put into a separate github org
  you imply some level of support here
    
      running an OSS project is more than just making code open
      to be able to do that successfully, you need to at least
        mentally divest yourself from your parent organization
    
  
  what do you do if that project isn’t your main focus?
    
      my OSS contributions are entirely selfish
      you need a maintainer
        
          there need to be clear communication channels
        
      
  what does a maintainer do?
    
      is it always one person?
        
          no! not if you can avoid it
          once a project has a community it’s difficult for one person
            to maintain
          even if you’re not writing code, managing the community can
            rapidly become a full-time job
        
      
      what about the cost of maintenance?
        
          use travis!
          but please review the contribution even if the contribution
            passes the tests
        
      
      problem of selectivity, vision and direction
        
          mozilla in the early days, just accepted everything.
          ended up having to rewrite as firebird (now firefox)
        
      
  how do you ensure governance doesn’t become onerous?
    
      example of people who forked their own project after it had
        become an apache project
      example of gcc fork (egcs) which got merged back in
    
  
  a lot comes back to documenting your original vision
    
      I’ve been added as a maintainer in places, and sometimes there’s
        clear advice and sometimes there isn’t.
    
  
  open sourcing a project that you don’t use is a recipe for
    abandonware.
    
      we also have an organization for abandoned code to move it out
        of our main github org
    
  
  forks
    
      how do you transfer maintainership?
      what happens if a project gets abandoned and then forked?
    
  
  what are the good communication channels to have for an OSS
    project?
    
      own website for announcement and discovery?
        
          how do you summarize your project?
          peeve: like <other project> but X
        
      
      community of contributors comes from community of users
        
          so good user documentation will foster contributors
        
      
      issues
        
          is it worth seeding the issues list even if we have an
            internal tracker?
          yes, because it helps users google for error messages
          they are effectively documentation
          do you move to only use the external tracker or do you have an
            internal tracker too?
        
      
      do you need a security contact?
        
          yes, with a GPG key
        
      
      people need to see activity
        
          if all your activity is on your internal tracker & mailing
            list & private irc, people will think it’s dead
        
      
      where do people host mailing lists?
        
          google groups
        
      
      a few people are averse to irc
        
          people don’t realise that they won’t get an immediate response
            necessarily
          irc shouldn’t be used alone
          timezones are also an issue
        
      
      ipython uses hangouts
      gmane: a newsgroup view on your mailing list
      don’t have a separate irc channel per project if you’re managing
        lots of projects
    
  
  how do you host your docs?
    
      you should control your domain?
      when is a README not enough?
      start with github pages, and you can migrate later
      what should it have?
        
          screenshots
          getting started guides
        
      
      github pages are a bad idea because you can’t version them
        
          readthedocs keeps old versions too
        
      
      contributions must update docs when they update behaviour
    
  
  documentation & communication is super super important
    
      careful with contributions from newbies
        
          rejecting a contribution because of lack of tests can be
            tricky
            
              they might not have written many tests in general
              they might not understand your particular test framework
            
          
          but rejecting because of no docs is more reasonable
          you can write tests for them
            
              and use this as a communication channel
              “does this test look like it’s measuring the thing you’re
                trying to build?”
            
          
  how do you handle trolls, griefers and timewasters?
    
      one small doc patch earns you a hundred stupid questions
      love your idiots
    
  
Session 4: what’s changed since last scale camp?


  what’s arrived?  what’s died?
  Big Data is now a thing people talk about
    
      you’re now seeing adverts on the tube about it
    
  
  is couchdb dead?
    
      npm?
      we still use it, but we only used it as a key-value store
    
  
  still going:
    
      mongo
      riak
    
  
  websockets are now standardized and supported by lbs, proxies
  edgeconf
    
      grunt and pig and oink and stuff
      doing a js build and running tests
      angularjs
    
  
  ndoc has gone
  flash is in its death throes
  most video sites work on an ipad
  webgl has taken hold
  epic demoed unreal engine 4 in firefox
  60 fps on the web
  docker!
    
      although solaris has been doing it for yonks
    
  
  golang has taken off
    
      when did go hit 1.0?
      people are rewriting individual bits in go (rather than
        everything)
    
  
  is hacker news dead yet?
  bitcoin happened
    
      VPS providers have been getting attacked by people trying to
        steal bitcoins
      people trawling github to find access keys
      bitcoin mining in the browser
    
  
  erlang
    
      nobody’s started writing things in it
      though there’s elixir
      and julia
      and idris
    
  
  what’s falling out of favour?
    
      ruby? no
      scala? no
    
  
  facebook’s hack
    
      seems sensible if you’re already in a php environment
    
  
  bittorrent
    
      an incredibly good way of saturating your network
      though this isn’t new
    
  
  µservices
    
      just due to containerization?
      seems to be a bunch of ex-tw people
    
  
  elasticsearch is now usable
    
      and quite good
      and they acquired logstash and kibana
    
  
  logs being searchable in es
    
      splunk has a reasonable oss competitor
    
  
  graphite has grown
    
      there’s experimentation going on there
        
          storage backends (cassandra, leveldb)
        
      
  what about lucene?
    
      very few people use it directly these days
    
  
  snowden
  DC security
  https everywhere
    
      gmail is now ssl only
      facebook
      PFS
      the perception that TLS is expensive
      spdy
    
  
  webp
  IE6 is on its deathbed
  winxp
    
      though it’s still in cash terminals
    
  
  mobile growth
    
      many sites are on the edge of 50% mobile
      talk of mobile first and now mobile only
    
  
  4G
  bootstrap
  wearables & IoT
    
      fitbit
      pebble
      automotive
        
          tesla motors
        
      
  security updates
    
      wordpress now has autoupdate
    
  
  nagios isn’t dead yet
    
      sensu is still the hot new thing
      riemann
      flapjack
    
  
  desktops are going away
    
      except for gaming
    
  
  centos is now owned by redhat
  linux mint?
  systemd
  ubuntu as a server is now more probable
    
      is upstart going away?
    
  
  postgres got built-in replication
  graph dbs (neo4j)
  paas
    
      people are still excited
      it got even more complicated to install your own
    
  
  where’s node going?
  streaming extensions
    
      rx in .NET
      rise of functional
    
  
  linux on the desktop?
    
      the XPS13 is good
      the rise of chromebooks
    
  
  openstack?
    
      everyone thinks it’s a great idea
    
  
  private clouds?
    
      azure will sell you an on-premise cloud thing
      what’s the difference between an in-house cloud and a data
        centre?
    
  
  drones, quadcopters, hexapods
    
      for filming
    
  
  what’s coming up?  what will be important at the next scale
    summit?
    
      net security is in flux
      forks of android will be the new linux distro
      http 2
      IPv6?
      anomaly detection
      software defined networks
      containerization
      silicon roundabout?
        
          it’s not a playground for children anymore
          the adults have taken over
        
      
      computing in government
        
          US has 18F
          GDS
        
      
      I’d like there to be a world-class home grown east london
        startup doing technically challenging stuff
        
          startups which solve technical problems don’t generally get
            funded
          acquisitions
        
      
      crowdfunding?
        
          no one cares
        
      
  what’s going to die?
    
      couchdb
      python 2 will not die
    
  
Session 5: mentoring


  how do we hire & train new people into our industry?
  we certainly have struggled to recruit
    
      we’ve come to the realization that part of the solution is
        hiring junior people & growing them into the role
      I’ve been asked to mentor a junior person but I’ve no idea what
        to do
    
  
  I’m a recent junior
    
      one on one time is quite good
      I came in having a basic idea what I’d be doing
      be open for questions
        
          the devops world is really overwhelming
          it’s so useful to be able to ask things
        
      
      that’s one of the ground rules we’ve agreed on
        
          ie that I’m interruptible
        
      
      we’ve certainly noticed that hiring in the junior area is useful
    
  
  it’s great having juniors because you get chaos monkeys as well
    
      if you’re not prepared to let a junior touch something, you
        probably need to make it more resilient
    
  
  ETO1: 12-week night course
    
      teaches you how to teach
    
  
  how do you get the theory?  how do you talk about underlying
    principles that are independent of the particular situation at
    hand?
    
      pair programming is really good for that
        
          does that depend on the teaching style of the pair?
        
      
      make the junior document the things that you’re teaching them
        
          it helps ensure that they’ve understood it
        
      
  I get irritated when technical people tweet complaining about the
    cost of interruptions
    
      when you have new people, you have to empower them to interrupt
      I don’t think you should have your entire team mentor a new
        starter
      we use the red flag system
        
          you put a red flag up if you don’t want to be interrupted
        
      
      designated interruptible person
      juniors also have a difficult time saying no
        
          you want to make everyone happy and be helpful
        
      
      do you have a system that makes work visible?  eg kanban
        
          we have a helpdesk system
          but external people don’t use it for smaller tasks
            
              raise a ticket on their behalf
            
          
      how do we teach juniors that it’s ok to say no?
        
          also, how to understand what the requestor is trying to
            achieve, rather than the specific task they want done, and
            recognize when it’s the wrong fit?
        
      
  juniors are way more engaged if they get a choice (however
    constrained) on what they get to spend their time on
  also allow people to fail
    
      teach them that it’s okay to fail
      I troll my junior developers sometimes
        
          I lead them down the garden path
          but then I’m there to pick up the pieces when they fail
        
      
      do something that’s visible to other people in the company
        
          so that they can show people what they’re capable of
        
      
  how do you direct people through different areas of knowledge?
    
      do you go shallow on lots of tools?  Or really deep on one
        thing?
      depends on the junior
        
          throw things at them and see what sticks
        
      
      go broad with the concepts early on
        
          architecture, system, etc
        
      
  onboarding
    
      desk & computer should be ready
      first week should be meeting all the people they need to know
        about
      have monthly checkins with the mentor
        
          checkins, not reviews!
        
      
      get a sales person to give a demo of whatever it is you build
    
  
  can anyone recommend useful resources for managing developers?
    
      how to talk to your kids or something like that
    
  
  how do you improve diversity?
    
      how do juniors find your roles?
      you don’t have to stick to the same old networks when hiring
        juniors
      thoughtbot – structured apprentice schemes
      I wonder if being more explicit & realistic about required
        experience and salaries in job postings would help?
        
          recruiters muddy the waters a lot
          go direct if you can
        
      
  how do you know when to stop mentoring?  and how do you measure
    success?

lightning talks

tdoran docker to prod in 5 minutes


  docker + 150 lines of shell

mirroring the internet


  mirroring cpan, rubygems, npm
  filesystems are good at serving things that look like files
  you don’t need to use couch or
  what was the easiest to mirror?
    
      cpan – it has a single line rsync command to create a mirror
    
  
  wikipedia is hard to mirror
    
      each wikimedia site has a different set of plugins
    
  
analytics and search evaluation


  it’s important to have good search for your site
  we use google analytics. you can use this to find click behaviour
    for particular search terms
    
      ie for term X, how often do people click on link 1, 2, 3, 4,
        etc
    
  
  automate this!
  crunch the most popular searches
  identify how many clicks they got
  use it to calculate how many more clicks we would have got if we
    had ordered the results better
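
One way to put a number on "how many more clicks" is a simple position-bias model. This is a sketch under loud assumptions, not the speaker's method: the `ATTENTION` weights, the search terms and the click counts are all invented. Each result's relevance is inferred as its clicks divided by the attention its position received, then the results are re-sorted by that relevance.

```python
def estimated_extra_clicks(searches, clicks_by_position, attention):
    """Estimate clicks gained by ordering results by inferred relevance.

    Assumed model: clicks at a position = searches * attention * relevance.
    """
    relevance = [
        clicks / (searches * weight)
        for clicks, weight in zip(clicks_by_position, attention)
    ]
    # Best ordering under this model: most relevant result in the top slot.
    best_order = sorted(relevance, reverse=True)
    expected = searches * sum(w * r for w, r in zip(attention, best_order))
    return expected - sum(clicks_by_position)

# Hypothetical share of searchers who look at positions 1 to 4.
ATTENTION = [1.0, 0.6, 0.4, 0.3]

# Hypothetical analytics export: term -> (number of searches, clicks per position).
popular = {
    "opening hours": (1000, [100, 300, 40, 30]),
    "contact": (500, [200, 30, 10, 5]),
}

for term, (searches, clicks) in popular.items():
    print(term, round(estimated_extra_clicks(searches, clicks, ATTENTION)))
```

A term whose second or third link out-clicks the first shows a positive gain, which makes it a candidate for reordering; a term that is already well ordered shows a gain of about zero.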

juju


  juju is a service orchestration tool

your laptop is not your friend


  apple, facebook employees hacked via website malware, java vulnerability
  data in transit protection
  data at rest protection
  authentication
    
      user to device, user to service, device to service
    
  
  secure boot
    
      firmware
    
  
  platform integrity and app sandboxing
  app whitelisting
    
      although key here is to ensure that whitelist doesn’t take too
        long to modify for new things
    
  
  security policy
  sounds like configuration management
  external interface protection (firewalls)
  device update policy
  incident response
    
      things will go wrong
    
  
  although don’t worry too much about this
    
      unless you have to.
    
  
write libraries, not services


  scale using libraries
  a library has all the modularity properties that services have
  except you don’t need to worry about the network going down

we’re doing a festival called electromagnetic field


  august 29th for 3 days
  go here

outro