Scale Summit 2014

Intro, MBS

ideas for sessions

  • bootstrapping environments (without object stores)
  • service discovery
  • removing spofs
  • modern monitoring – sensu, runbooks, dashboards
    • tradeoff between ease of management and sophistication
    • elastic sites?
  • surviving DDoS attacks when your site is transactional
  • modern cmdbs
  • ansible
  • icinga re-acknowledgement
    • ie I know disk is critical at 10%, but please re-alert at 5%
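
The re-acknowledgement idea above can be sketched as a small decision function: an acknowledgement silences a critical alert only until a tighter threshold is crossed. This is a minimal sketch and not Icinga's API. `Ack`, `should_alert` and the parameter names are hypothetical, and it assumes the percentages in the note refer to free disk space.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Ack:
    """An acknowledgement that lapses once free space drops below `realert_below`."""
    realert_below: float  # percent free at which we want to be notified again


def should_alert(free_pct: float, critical_below: float, ack: Optional[Ack]) -> bool:
    """Return True if a notification should fire for this reading."""
    if free_pct >= critical_below:
        return False  # not critical at all
    if ack is None:
        return True  # critical and nobody has acknowledged it
    # Critical but acknowledged: stay quiet until the tighter threshold is crossed.
    return free_pct < ack.realert_below
```

With `critical_below=10` and `Ack(realert_below=5)`, a reading of 8% free stays quiet while a reading of 4% free notifies again.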

Session 1: monitoring & metrics

  • big infrastructure
    • shared web servers
    • shared tomcat servers
    • zenoss over snmp
      • snmp didn’t scale
    • problem: everything is averaged over 5 minutes
      • teams are spinning up their own graphite instances to monitor their own stuff
    • zenoss required 40 boxes, I expected 2
  • what does graphite look like at scale?
    • protip: buy fusion io
    • it can be hard to rebalance your metrics
      • particularly if you’re using consistent hashing
    • carbonate for migrating data to another graphite server
      • though you’ll probably end up with downtime
  • has anyone used skyline?
    • we looked at it, but we got lots of false alerts
      • my suspicion is that if we understood maths better, we could make it work really well
  • in sensu-community-plugins, there’s a check-graphite
    • it does nice things like exceeding N std deviations
  • what do people use below graphite?
    • we’re using collectd
      • the latest stuff has statsd and jmx connectors
  • anyone using ganglia?
    • we’re replacing ganglia with sensu stuff and diamond
      • why are you using diamond rather than the in-built sensu stuff?
        • because we’re a python shop
      • we push data over rabbitmq
      • and fan in to a big fat central fusionio graphite
      • how do you monitor rabbit?
        • sensu monitors rabbit using rabbit
        • there are healthchecks which should fire if rabbit is completely broken
        • we have a cron job on every rabbit and every sensu server to kill the process every hour
          • and it still works
  • is anyone using riemann?
    • is it worth spending time with?
    • where does it add value?
      • real time anomaly detection
      • it does events as well as numbers
      • it also has event timeouts – it can notify on an absence of events as well as presence
      • I think you could replace statsd with riemann
  • does anyone store second or subsecond data for a long time?
  • we have a single biggest day each year
    • we snapshot everything for that day - stats, logs, etc
    • use it to drive load testing for the next year
  • we’ve been trying redshift
  • how does elasticsearch cope with metrics?
    • we push quite large documents about everything to do with a web request
    • I often find log data in kibana much more useful than the same data in graphite
    • does anyone use realtime queries to drive alerting from elasticsearch?
  • one thing we’ve done recently is tune down the number of io operations that carbon performs per second; it massively reduces disk usage
    • or write to ram disk and sync once per minute
  • how do you get devs to make more metrics available?
    • you put them on call until they do
  • do people cull metrics at all?
    • i never have enough data
  • do people have app metrics measured by their continuous delivery pipeline?
    • our apps publish an xml document which is a schema of the types of metrics that they can publish
  • if I don’t hate myself, is there anything other than sensu I should use for monitoring that environment?
    • does anyone rely on cloudwatch?
      • we use it as a source for some data (ELB metrics)
        • you can get these delivered into S3 these days
      • but it only stores data for two weeks
  • does anyone using sensu miss nagios tactical view?
    • I miss having a decent dashboard
      • I don’t miss the 10 different nagioses per environment
      • I don’t miss the failover when we lost the primary nagios instance and all the state in it
    • we wrote a dashboard to query nagios and sensu
  • from the internet peanut gallery: is anyone using circonus?
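
The "exceeding N std deviations" style of check mentioned in this session can be sketched in a few lines. This is an illustration of the general technique only, not the code of the check-graphite plugin. The function name `exceeds_n_sigma` and the default of 3 are assumptions, and `history` is assumed to be a list of recent datapoints already fetched from the metrics store.

```python
import statistics
from typing import Sequence


def exceeds_n_sigma(history: Sequence[float], latest: float, n: float = 3.0) -> bool:
    """True if `latest` is more than `n` standard deviations from the mean of `history`."""
    if len(history) < 2:
        return False  # not enough data to estimate the spread
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)  # sample standard deviation
    if stdev == 0:
        return latest != mean  # flat series: any change counts as a deviation
    return abs(latest - mean) > n * stdev
```

A check like this is only as good as the window it is given. A noisy or seasonal series will probably need a longer history or a larger `n`, which may be part of why naive anomaly detection produces false alerts.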

Session 2: versioning of artefacts

  • my agenda:
    • the presence of artefacts I don’t necessarily own
      • large graphical images or video data
      • third party applications
    • I may wish to release the same artefact multiple times
      • we’ll use oracle 11 everywhere at one patch level
        • but in different configurations
    • windows images (VDIs)
  • fpm is useful
    • but it never generates a spec file or a source rpm
    • makes me uncomfortable
  • I’m not happy about rpms, because you can only have one version of one package installed at once
    • eg a simple webapp where we don’t want to do the loadbalancer dance
    • that also implies the app is relocatable which vendor binaries often aren’t
  • is containerization part of the solution?
    • it allows you to have multiple overlapping filesystems
    • a model: each customer has their own container
      • we haven’t done it
      • that sounds very expensive
    • how do you version control containers?
      • do you treat them as a single binary?
      • do you reconstruct it?
    • a lot of solutions assume all machines are stateless
      • someone else will deal with the databases
    • containers allow you to minimize surprise
      • a DBA logging into your container can find things where they expect, even if it’s from an underlying frankenstein filesystem
    • I don’t mind snapshots, but they should be generated mechanically and repeatably.
  • what tool would you love to exist in an ideal world?
    • I’d like the deployment database to do effectively dependency injection
      • I know where the dependencies are and what data I’m injecting, so I can use system monitoring to know what I’ve deployed

Session 2b: µservices

  • HTTP isn’t the best protocol in the world
  • use queues!
  • refactoring and testing is a better solved problem within the python programming language than over the network
    • I don’t think it’s hard to test µservices
      • there are clear contracts
        • that’s the theory, right?
  • we end up building lots of small monoliths and wiring together
  • we switched to using amazon SNS to manage notifications
  • how you get your ops team to support µservices is you get them to support as little as possible
    • they only work when the functional team owns the whole stack right to the bottom
  • services have a life cycle
    • we like building things
    • we should get better at killing things when they’re not being used
  • is there an additional cost to the organization for running µservices?
    • is there an organizational cost to having a 2 million line codebase?
  • ownership of services
    • handover of building team to ongoing running team
    • problems can get pushed back to the building team
  • antipattern around µservices:
    • developers think they’re clever
  • ntp is a µservice
  • aren’t µservices and SOA the same thing?
    • is it SOA done right?

Session 3: managing OSS software at work

  • how do you deal with PRs?
    • what about things that are not on your roadmap?
      • by not having a very good roadmap?
    • or moving in directions you don’t want to go?
    • it can be awkward because people might have put a lot of work in
      • but you need to explain “if you want to do that you need to fork it”
      • you can try to avoid it by writing a decent rationale of what you’re trying to do
      • though you can’t answer all the questions up front
  • you want to optimize for dragging people into your community
    • as the implementer, your documentation is going to be awful
    • because you already understand the whole system and don’t understand when you’re assuming tacit knowledge
    • whereas if you can attract users to your irc channel, and answer their questions really clearly, they can write great docs for you
    • I try to have a policy of: if anything confuses you, here’s my email, twitter, irc, etc and I will try to help you
    • encourage people to raise bugs against docs
    • I come from the perl community
      • there are 10-15 year old projects there where the maintainer has changed 4-6 times
      • have you got an example?
        • Catalyst
          • ~200 repos (core + plugins)
          • ~450 active committers
  • plugins are interesting: if people are trying to pull the project in different directions, you can let them through extensions but keep the core very small
  • does anyone have experience of running OSS projects at work?
    • how do you manage your time?
      • the important PRs to pay attention to are those from new contributors
        • certainly get back within 24 hours
        • don’t necessarily have to merge
  • why are you open sourcing this code?
    • to get the community using it
    • to get good publicity
  • do you have an OSS landing page?
    • yes, but it’s out of date
  • the OSS stuff that has mostly been infrastructure-related we’ve been trying to put into a separate github org
  • you imply some level of support here
    • running an OSS project is more than just making code open
    • to be able to do that successfully, you need to at least mentally divest yourself from your parent organization
  • what do you do if that project isn’t your main focus?
    • my OSS contributions are entirely selfish
    • you need a maintainer
      • there needs to be clear communication channels
  • what does a maintainer do?
    • is it always one person?
      • no! not if you can avoid it
      • once a project has a community it’s difficult for one person to maintain
      • even if you’re not writing code, managing the community can rapidly become a full-time job
    • what about the cost of maintenance?
      • use travis!
      • but please review the contribution even if the contribution passes the tests
    • problem of selectivity, vision and direction
      • mozilla in the early days, just accepted everything.
      • ended up having to rewrite as firebird (now firefox)
  • how do you ensure governance doesn’t become onerous?
    • example of people who forked their own project after it had become an apache project
    • example of gcc fork (egcs) which got merged back in
  • a lot comes back to documenting your original vision
    • I’ve been added as a maintainer in places, and sometimes there’s clear advice and sometimes there isn’t.
  • open sourcing a project that you don’t use is a recipe for abandonware.
    • we also have an organization for abandoned code to move it out of our main github org
  • forks
    • how do you transfer maintainership?
    • what happens if a project gets abandoned and then forked?
  • what are the good communication channels to have for an OSS project?
    • own website for announcement and discovery?
      • how do you summarize your project?
      • peeve: like <other project> but X
    • community of contributors comes from community of users
      • so good user documentation will foster contributors
    • issues
      • is it worth seeding the issues list even if we have an internal tracker?
      • yes, because it helps users google for error messages
      • they are effectively documentation
      • do you move to only use the external tracker or do you have an internal tracker too?
    • do you need a security contact?
      • yes, with a GPG key
    • people need to see activity
      • if all your activity is on your internal tracker & mailing list & private irc, people will think it’s dead
    • where do people host mailing lists?
      • google groups
    • a few people are averse to irc
      • people don’t realise that they won’t get an immediate response necessarily
      • irc shouldn’t be used alone
      • timezones are also an issue
    • ipython uses hangouts
    • gmane: a newsgroup view on your mailing list
    • don’t have a separate irc channel per project if you’re managing lots of projects
  • how do you host your docs?
    • you should control your domain?
    • when is a README not enough?
    • start with github pages, and you can migrate later
    • what should it have?
      • screenshots
      • getting started guides
    • github pages are a bad idea because you can’t version them
      • readthedocs keeps old versions too
    • contributions must update docs when they update behaviour
  • documentation & communication is super super important
    • careful with contributions from newbies
      • rejecting a contribution because of lack of tests can be tricky
        • they might not have written many tests in general
        • they might not understand your particular test framework
      • but rejecting because of no docs is more reasonable
      • you can write tests for them
        • and use this as a communication channel
        • “does this test look like it’s measuring the thing you’re trying to build?”
  • how do you handle trolls, griefers and timewasters?

Session 4: what’s changed since last scale camp?

  • what’s arrived? what’s died?
  • Big Data is now a thing people talk about
    • you’re now seeing adverts on the tube about it
  • is couchdb dead?
    • npm?
    • we still use it, but we only used it as a key-value store
  • still going:
    • mongo
    • riak
  • websockets are now standardized and supported by lbs, proxies
  • edgeconf
    • grunt and pig and oink and stuff
    • doing a js build and running tests
    • angularjs
  • ndoc has gone
  • flash is in its death throes
  • most video sites work on an ipad
  • webgl has taken hold
  • epic demoed unreal engine 4 in firefox
  • 60 fps on the web
  • docker!
    • although solaris has been doing it for yonks
  • golang has taken off
    • when did go hit 1.0?
    • people are rewriting individual bits in go (rather than everything)
  • is hacker news dead yet?
  • bitcoin happened
    • VPS providers have been getting attacked by people trying to steal bitcoins
    • people trawling github to find access keys
    • bitcoin mining in the browser
  • erlang
    • nobody’s started writing things in it
    • though there’s elixir
    • and julia
    • and idris
  • what’s falling out of favour?
    • ruby? no
    • scala? no
  • facebook’s hack
    • seems sensible if you’re already in a php environment
  • bittorrent
    • an incredibly good way of saturating your network
    • though this isn’t new
  • µservices
    • just due to containerization?
    • seems to be a bunch of ex-tw people
  • elasticsearch is now usable
    • and quite good
    • and they acquired logstash and kibana
  • logs being searchable in es
    • splunk has a reasonable oss competitor
  • graphite has grown
    • there’s experimentation going on there
      • storage backends (cassandra, leveldb)
  • what about lucene?
    • very few people use it directly these days
  • snowden
  • DC security
  • https everywhere
    • gmail is now ssl only
    • facebook
    • PFS
    • the perception that TLS is expensive
    • spdy
  • webp
  • IE6 is on its deathbed
  • winxp
    • though it’s still in cash terminals
  • mobile growth
    • many sites are on the edge of 50% mobile traffic
    • talk of mobile first and now mobile only
  • 4G
  • bootstrap
  • wearables & IoT
    • fitbit
    • pebble
    • automotive
      • tesla motors
  • security updates
    • wordpress now has autoupdate
  • nagios isn’t dead yet
    • sensu is still the hot new thing
    • riemann
    • flapjack
  • desktops are going away
    • except for gaming
  • centos is now owned by redhat
  • linux mint?
  • systemd
  • ubuntu as a server is now more likely
    • is upstart going away?
  • postgres got built-in replication
  • graph dbs (neo4j)
  • paas
    • people are still excited
    • it got even more complicated to install your own
  • where’s node going?
  • streaming extensions
    • rx in .NET
    • rise of functional
  • linux on the desktop?
    • the XPS13 is good
    • the rise of chromebooks
  • openstack?
    • everyone thinks it’s a great idea
  • private clouds?
    • azure will sell you an on-premise cloud thing
    • what’s the difference between an in-house cloud and a data centre?
  • drones, quadcopters, hexapods
    • for filming
  • what’s coming up? what will be important at the next scale summit?
    • net security is in flux
    • forks of android will be the new linux distro
    • http 2
    • IPv6?
    • anomaly detection
    • software defined networks
    • containerization
    • silicon roundabout?
      • it’s not a playground for children anymore
      • the adults have taken over
    • computing in government
      • US has 18F
      • GDS
    • I’d like there to be a world-class home grown east london startup doing technically challenging stuff
      • startups which solve technical problems don’t generally get funded
      • acquisitions
    • crowdfunding?
      • no one cares
  • what’s going to die?
    • couchdb
    • python 2 will not die

Session 5: mentoring

  • how do we hire & train new people into our industry?
  • we certainly have struggled to recruit
    • we’ve come to the realization that part of the solution is hiring junior people & growing them into the role
    • I’ve been asked to mentor a junior person but I’ve no idea what to do
  • I’m a recent junior
    • one on one time is quite good
    • I came in having a basic idea what I’d be doing
    • be open for questions
      • the devops world is really overwhelming
      • it’s so useful to be able to ask things
    • that’s one of the ground rules we’ve agreed on
      • ie that I’m interruptible
    • we’ve certainly noticed that hiring in the junior area is useful
  • it’s great having juniors because you get chaos monkeys as well
    • if you’re not prepared to let a junior touch something, you probably need to make it more resilient
  • ETO1: 12-week night course
    • teaches you how to teach
  • how do you get the theory? how do you talk about underlying principles that are independent of the particular situation at hand?
    • pair programming is really good for that
      • does that depend on the teaching style of the pair?
    • make the junior document the things that you’re teaching them
      • it helps ensure that they’ve understood it
  • I get irritated when technical people tweet complaining about the cost of interruptions
    • when you have new people, you have to empower them to interrupt
    • I don’t think you should have your entire team mentor a new starter
    • we use the red flag system
      • you put a red flag up if you don’t want to be interrupted
    • designated interruptible person
    • juniors also have a difficult time saying no
      • you want to make everyone happy and be helpful
    • do you have a system that makes work visible? eg kanban
      • we have a helpdesk system
      • but external people don’t use it for smaller tasks
        • raise a ticket on their behalf
    • how do we teach juniors that it’s ok to say no?
      • also, how to understand what the requestor is trying to achieve, rather than the specific task they want done, and recognize when it’s the wrong fit?
  • juniors are way more engaged if they get a choice (however constrained) on what they get to spend their time on
  • also allow people to fail
    • teach them that it’s okay to fail
    • I troll my junior developers sometimes
      • I lead them down the garden path
      • but then I’m there to pick up the pieces when they fail
    • do something that’s visible to other people in the company
      • so that they can show people what they’re capable of
  • how do you direct people through different areas of knowledge?
    • do you go shallow on lots of tools? Or really deep on one thing?
    • depends on the junior
      • throw things at them and see what sticks
    • go broad with the concepts early on
      • architecture, system, etc
  • onboarding
    • desk & computer should be ready
    • first week should be meeting all the people they need to know about
    • have monthly checkins with the mentor
      • checkins, not reviews!
    • get a sales person to give a demo of whatever it is you build
  • can anyone recommend useful resources for managing developers?
    • how to talk to your kids or something like that
  • how do you improve diversity?
    • how do juniors find your roles?
    • you don’t have to stick to the same old networks when hiring juniors
    • thoughtbot – structured apprentice schemes
    • I wonder if it would help to be more explicit & realistic in job postings about the experience required and the salaries?
      • recruiters muddy the waters a lot
      • go direct if you can
  • how do you know when to stop mentoring? and how do you measure success?

lightning talks

tdoran docker to prod in 5 minutes

  • docker + 150 lines of shell

mirroring the internet

  • mirroring cpan, rubygems, npm
  • filesystems are good at serving things that look like files
  • you don’t need to use couch or
  • what was the easiest to mirror?
    • cpan – it has a single line rsync command to create a mirror
  • wikipedia is hard to mirror
    • each wikimedia site has a different set of plugins
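
The point that filesystems are good at serving things that look like files can be illustrated with the Python standard library alone: once a mirror exists on disk, a static file server is enough to serve it. This is a minimal sketch rather than anything shown in the talk. `serve_mirror` and the `root` directory are hypothetical, and a real mirror would more likely sit behind a proper web server.

```python
import functools
import http.server
import threading
from pathlib import Path


def serve_mirror(root: Path, port: int = 0) -> http.server.ThreadingHTTPServer:
    """Serve an on-disk mirror directory as plain static files.

    `root` is a hypothetical local mirror (for example, the target of an rsync).
    Port 0 asks the operating system for any free port.
    """
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=str(root)
    )
    server = http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
    # Run the accept loop in a background thread so the caller keeps control.
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

The chosen port is available as `server.server_address[1]`; call `server.shutdown()` and `server.server_close()` when finished.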

analytics and search evaluation

  • it’s important to have good search for your site
  • we use google analytics. you can use this to find click behaviour for particular search terms
    • ie for term X, how often do people click on link 1, 2, 3, 4, etc
  • automate this!
  • crunch the most popular searches
  • identify how many clicks they got
  • use it to calculate how many more clicks we would have got if we had ordered the results better
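
One way to put a number on "how many more clicks we would have got" is sketched below. The talk did not describe its calculation, so the model here is an assumption: each rank has a fixed chance of being looked at (`position_bias`, a hypothetical input that must be greater than zero for every rank), and dividing observed clicks by that chance estimates how appealing each result is on its own merits.

```python
from typing import Sequence


def extra_clicks_if_reordered(
    clicks: Sequence[float], position_bias: Sequence[float]
) -> float:
    """Estimate clicks gained by re-ranking the results for one search term.

    clicks[i]        -- observed clicks on the result shown at rank i (0 = top)
    position_bias[i] -- assumed chance that a searcher looks at rank i at all
    """
    if len(clicks) != len(position_bias):
        raise ValueError("need one bias value per rank")
    # Divide out the rank effect to estimate each result's intrinsic appeal.
    appeal = [c / b for c, b in zip(clicks, position_bias)]
    # Put the most appealing results in the most-viewed slots.
    best_order = sorted(appeal, reverse=True)
    best_slots = sorted(position_bias, reverse=True)
    predicted = sum(a * b for a, b in zip(best_order, best_slots))
    return predicted - sum(clicks)
```

For example, with clicks of `[10, 40, 5]` on ranks 1 to 3 and an assumed bias of `[1.0, 0.5, 0.25]`, the second result looks far more appealing than its slot suggests, and the estimate is 37.5 extra clicks from promoting it. Running this over the most popular search terms gives a ranked list of where reordering would probably pay off most.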

juju

  • juju is a service orchestration tool

your laptop is not your friend

  • apple, facebook employees hacked via website malware, java vulnerability
  • data in transit protection
  • data at rest protection
  • authentication
    • user to device, user to service, device to service
  • secure boot
    • firmware
  • platform integrity and app sandboxing
  • app whitelisting
    • although key here is to ensure that whitelist doesn’t take too long to modify for new things
  • security policy
  • sounds like configuration management
  • external interface protection (firewalls)
  • device update policy
  • incident response
    • things will go wrong
  • although don’t worry too much about this
    • unless you have to.

write libraries, not services

  • scale using libraries
  • a library has all the modularity properties that services have
  • except you don’t need to worry about the network going down

we’re doing a festival called electromagnetic field

  • august 29th for 3 days
  • go here

outro

@petemounce

commented Mar 22, 2014

The book is "how to talk so kids will listen & listen so kids will talk" by Adele Faber and Elaine Mazlish.
