scale summit 2018

scalable build pipelines

A

  • jenkins as a build tool
  • microservices
  • how do we standardize unit testing in our pipeline?
  • is a failure because we broke the pipeline, or because the code is bad?
  • we’ve recently been trying jenkins declarative pipelines
  • bash scripts


B

  • it’s good to have jenkins config in source control

C

  • what we did to scale up (in terms of lots of people able to create new projects)
  • need some toil to set up jenkins
    • automated this: jenkins detects new projects
    • version control deployments
    • team segregations, bringing up new agents
      • just change a couple of yaml lines

B

  • we use bazel within our jenkins & teamcity builds
    • bazel/teamcity autodetects flaky tests & reruns
    • remote caching & remote execution
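
A rough sketch of the sort of flags involved (cache.example.com is a placeholder, and exact flag spelling varies a little between bazel versions); the TeamCity-side flaky-test handling is its own feature, but on the bazel side it looks roughly like this:

```sh
# --flaky_test_attempts reruns failing tests and marks pass-on-retry as FLAKY
# --remote_cache shares action/test results across CI agents
bazel test //... \
  --flaky_test_attempts=3 \
  --remote_cache=https://cache.example.com
# remote execution adds something like --remote_executor=grpc://executor.example.com
```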

A

  • we have tension between dev teams
  • we want entirely reproducible builds
  • dev teams want to be able to control their own builds
  • (do you version control your shared libraries too?) yes
  • we have some standard pipelines rather than a jenkinsfile per repo
  • reduces influence the dev teams can have on the build
  • (does it cause tension because it slows them down?)
    • i find it hard to talk about from dev perspective

B

  • we have a different approach: we have 2 monorepos (!)
    • one server-side / platform / continuously deployed
    • one packaged software with longer release cadence
    • we source control the jenkins jobs in the platform monorepo
    • people will copy/paste the job builder files around
      • github status checks, etc

question

  • who uses something they didn’t build themselves for CI?
    • who’s using AWS codebuild, etc

D

  • circleci, travisci, codeship
  • all have different advantages & disadvantages (eg none supports windows)

B

  • charging models are tricky for CI SaaS
    • charging for builds disincentivizes devs from using it
  • https://buildkite.com/ was the nicest one we found

D

  • problem with circle & codeship
    • basic/pro tiers
    • completely different implementation
    • pro for codeship: ship us a dockerfile
    • basic for codeship: web interface
    • circleci are scrapping the web config

B

  • (how do you scale the number of agents?)
    • you add more agents

(survey around concurrent builds)

  • a few people doing more than 100
  • how do you trade off making things fast for dev teams against the size of the cluster for running jobs?

infrastructure testing

A

  • we (as far as i’m aware) don’t test our infrastructure code
  • in previous places I’ve done bits & bobs with test kitchen & serverspec

B

  • we use terraform & puppet
  • we share both in common modules
  • don’t really test terraform
  • puppet: test-kitchen
    • central product
    • jenkins master
    • rspec testing
  • (do you want to just run terraform plan to keep an eye on your infrastructure or something else?)
  • we have terraform running from jenkins (see the drift-check sketch after this list)
  • goss
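
A minimal sketch of the "run terraform plan to keep an eye on the infrastructure" idea as a scheduled CI job; the directory layout is a placeholder:

```sh
#!/usr/bin/env bash
# scheduled drift check: fail if live infrastructure no longer matches the
# committed config; terraform/production is a hypothetical module directory
set -euo pipefail
cd terraform/production
terraform init -input=false
# -detailed-exitcode: 0 = no changes, 1 = error, 2 = changes pending
terraform plan -input=false -detailed-exitcode || status=$?
case "${status:-0}" in
  0) echo "no drift" ;;
  2) echo "drift detected: plan shows pending changes" ; exit 1 ;;
  *) echo "terraform plan failed" ; exit 1 ;;
esac
```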

C

  • what is infrastructure testing for?
  • tests & declarations duplicate each other
  • testing infrastructure is more at the integration level
    • check that graphite is actually running on this port
    • (is that testing or is it monitoring?)
      • depends how frequently you run it
      • monitoring is continuous
  • interesting to hear about goss for healthchecks

D

  • https://github.com/aelsabbahy/goss
  • you can integrate it into your monitoring system
  • you can specify sets of tests you want
  • runs on server so it’s really fast - quick feedback
  • can autogenerate tests from an existing “perfect” environment
    • analyse state of all ports and create a config
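
A hedged sketch of that goss workflow (the service names are placeholders): snapshot a known-good host into a gossfile, validate it quickly, then serve the same checks as a health endpoint for the monitoring system:

```sh
# snapshot a "perfect" host into goss.yaml
goss add service carbon-cache     # placeholder service name
goss autoadd graphite             # auto-detects matching services/ports/processes
# run the checks - they execute on the box itself, so feedback is quick
goss validate --format documentation
# expose the same checks as an HTTP health endpoint for monitoring to poll
goss serve --listen-addr :8080    # monitoring polls GET /healthz
```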

E

  • i don’t think there needs to be a distinction between testing & monitoring
  • the test pyramid for app code testing doesn’t work for infrastructure
    • (eg: your cloud platform might make a breaking change to your API: your code might still pass but your system is broken)
  • we should talk about feedback cycles:
    • when you check in a piece of code, you want to know when it’s broken
  • swiss cheese model:
    • different layers of tests

F

  • we used puppet testing to manage our rollout from puppet 3 to puppet 5
    • infrastructure testing is useful for regression testing

D

  • it depends on the infrastructure you inherit
    • if you have inherited some pet-style (rather than cattle-style) infrastructure
    • you need to manage the pets even as you migrate to cattle
    • how do you build that pet (even if you don’t want to)

G

  • BDD models are more useful for infrastructure testing

H

  • you can repurpose tests like “is graphite running?” as monitoring
  • the more unit-level tests - serverspec - can ossify the codebase
    • it just tells me a person wrote the algorithm the way they intended to write it

I

  • triggering builds based on github labels
    • mark a PR as possibly affecting performance
    • webhook carries label, triggers gatling run
    • results are written back to PR as comment
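
A rough sketch of how that label-triggered flow might hang together from a CI job, using the GitHub issues API; the repo, PR number, label, gatling invocation, and report path are all placeholders, and a GITHUB_TOKEN is assumed to be in the environment:

```sh
#!/usr/bin/env bash
set -euo pipefail
# the real pipeline would take these from the webhook payload
repo="example-org/example-app"
pr=123

# only run the expensive gatling job if the PR carries the performance label
labels=$(curl -s -H "Authorization: token $GITHUB_TOKEN" \
  "https://api.github.com/repos/$repo/issues/$pr/labels" | jq -r '.[].name')
echo "$labels" | grep -qx "performance" || exit 0

./gradlew gatlingRun                                # hypothetical gatling run
summary=$(cat build/reports/gatling/summary.txt)    # hypothetical report path

# write the results back to the PR as a comment
curl -s -X POST -H "Authorization: token $GITHUB_TOKEN" \
  -d "$(jq -n --arg body "$summary" '{body: $body}')" \
  "https://api.github.com/repos/$repo/issues/$pr/comments"
```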

A

  • problem with developers making gradual changes against their local environments
  • we have a weekly local-environment-teardown ceremony
  • (what does “development environment” mean here?)
    • some people are using the kubernetes docker thing
    • some people have their own kubernetes clusters in aws
    • (are you talking about minikube?)
      • don’t know

J

  • most tools are instance-centric
    • with the growth of FaaS these tools don’t fit any more
    • IAM policies
    • security groups
  • eg you have a hosted RDS db that’s only accessible from a lambda fn
    • you want to verify that only certain ports open on the hosted db

K

  • security/audit people asking for documentation for infrastructure
    • then, asking for the tests
    • “tell me the ports you have open”
  • lynis - node auditing
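
For reference, a typical lynis run on a node looks roughly like this (the report paths are the usual defaults, worth checking against your version):

```sh
# audit the local node; lynis prints warnings, suggestions and a hardening index
sudo lynis audit system
# detailed findings land in a log plus a machine-readable report
less /var/log/lynis.log
less /var/log/lynis-report.dat
```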

GDPR

  • how do you get from the “panic” phase to a reasonable, constructive way of attacking this?
    • bad but effective: stand back long enough for someone else to screw it up
    • but:
      • we won’t know the outcome on machine learning stuff for a while
      • other things like: consent has to be granular: this is really clear
        • you can’t do pre-checked boxes
        • problematic overlays between cookies and consent regulations
      • even doing nothing is a decision, and your lawyers need to be aware
      • when you get reported and the ICO come knocking you need to explain your decision
      • “significant automated decision making”
        • you need a data protection officer who reports directly to board
      • machine learning can codify existing biases
  • we dwell on machine learning because it’s a gnarly edge case
    • there’s lots of lower hanging fruit to start with
    • poor security practices
    • being unaware of where data comes from and goes to
  • what data is kept where, and what consent was given by data subject when they submitted that data
    • you need version control for your privacy policy
  • if you have a marketing email list based on dark-pattern pre-checked tickboxes, can you use that list any more after 25th may?
  • ICO has limited money & people
    • they won’t go after everyone on day one
    • there will probably be some benefit of the doubt
    • some high-profile cases will generate case law

immutable root / OS

  • containeros / smartos / redhat
  • what are the motivations for wanting to do immutable OS?
    • security
    • consistency
    • avoiding snowflakes
    • thinking about host os
  • unikernels
    • i’ve done one
      • you’re responsible for everything
    • it’s an interesting concept
    • a friend does mirage
    • the whole mindset fights against the habit of cracking open a shell and working out what’s going on
      • when something goes wrong, it’s hard to debug
    • a lot of unikernel things are academic projects
      • there’s a lot of duct tape
    • firmwares are like unikernels
      • no visibility, telemetry
    • a lot of them did xen initially
      • they’re bright people
      • it’ll take them a while to get there

faas / serverless

  • how do you find monitoring lambda?
  • service map - visualization of all the traces
  • what are the biggest problems people have seen with serverless?
    • terraform is tricky
    • API gateway isn’t one thing - it’s loads of different things bundled
      • it has a hidden cloudfront in the middle of it
    • main pain point we see isn’t lambda, it’s dynamo
    • response time
      • we found putting lambdas into VPCs added an order of magnitude to response time (with standard test traffic)
    • concurrent execution limits
  • how do you view x-ray? do you aggregate into cloudwatch?
    • if there’s a jump in response time, you use x-ray to investigate
  • anyone using kinesis much?
    • shipping logs
    • my experience: it works fine, but you have to specify the number of shards
  • what languages are people writing lambdas in?
    • python, node, java
    • google functions
  • lambci - run lambda in a docker container (see the CLI sketch at the end of this list)
  • anyone using serverless marketplace?
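
A few hedged one-liners for the points above; the stream, function, and handler names are placeholders:

```sh
# kinesis: you have to pick the shard count up front
aws kinesis create-stream --stream-name app-logs --shard-count 4

# lambda: reserve (and thereby cap) concurrent executions for one function
aws lambda put-function-concurrency \
  --function-name checkout-handler \
  --reserved-concurrent-executions 50

# lambci: run the same handler locally in a lambda-like docker container
docker run --rm -v "$PWD":/var/task lambci/lambda:python3.6 \
  handler.lambda_handler '{"test": true}'
```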

gRPC / protobuf vs REST

  • how do you onboard people?
    • it was a real headfuck at first
    • RPC’s bad, right?
    • what is this binary format
    • i don’t know why we’re doing this
    • generating code as part of workflow? feels weird (see the protoc sketch at the end of this section)
    • you have to think about types a bit more
  • if you don’t have a monorepo, where do schemas live?
    • for a while we had one repo just for protos. that was terrible
    • two monorepos: one for platform (continuous deployment), one for product (6-8 week packaged software release)
    • linting is important to check you haven’t reused an index
    • you need to structure your repository to match your system boundaries
  • graphql as another alternative?
    • we like being able to deprecate things easily and to have a schema
    • django impl called graphene
    • js frontend
      • flow for types
      • relay - state management
      • same types going all the way through
  • we use thrift as part of our content api
  • binary messaging formats
    • capnproto
      • a friend looked at it and liked it for low-latency work
      • no deserialization - directly dump into memory
      • did they need that? I’m skeptical
      • changing schemas
    • simple binary encoding
    • https://dataintensive.net/ has a chapter on serialization formats (chapter 4: encoding and evolution)
  • https://gafferongames.com/ has some articles on designing custom network protocols for games
    • where TCP doesn’t fit but you still need reliability so you have to build something custom on UDP
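
On the "generating code as part of workflow" point, the step itself is small; a sketch assuming a shared proto/ directory and Python stubs (paths and filenames are placeholders for whatever the build actually uses):

```sh
# generate message classes from the shared schemas as a build step
mkdir -p gen/python
protoc --proto_path=proto \
  --python_out=gen/python \
  proto/orders.proto proto/users.proto
# schema linting (e.g. checking field numbers haven't been reused) runs as a
# separate CI step against the same proto/ directory
```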

standards vs autonomy

  • if you introduce something, who’s going to manage it when you go away?
  • autonomy doesn’t mean free from responsibility
    • SRE model: if you want to put something in prod, you have to tick these boxes
    • standard tooling
      • release engineering
    • well-trodden path
  • standards should empower autonomy
    • there should be multiple implementations of the standard to make things work well
  • what does your organization value? what are they concerned about?
    • this will inform your approach to standards and consistency
  • what are the tactics for fixing standards?
    • standards always slip and they’re never up-to-date
      • if you accept this, you build for it
      • where i currently am, we have fully autonomous code teams
        • if they want to use a new thing, they have to own it
        • no tossing over the fence
    • we had a devops rota
      • no constraints on language
      • team that did on call was drawn from devs on all different teams
      • extremely good monetary reward for being called out
        • (although: watch out for incentivising people to build crappy code)
    • changing standards is often a power question
      • a lot of orgs are top-down
        • but then enforcement is lacking
      • automation and testing make top-down command and control easier by detecting breaches
    • we need to precisely define who is involved in standards process
    • who’s accountable for the standard being met?
    • our org had the spotify model (guilds/squads/etc)
      • engineering guild owned how on call works
      • it was in my interest to be a member of the guild
      • we also had an “interested parties review” for a team to propose a novel change
        • amount of rigour proportional to size of change
          • from just a conversation on slack to a full document on wiki + formal meeting
    • our org is full TOGAF
      • services that are expensive have to budget for operational cost
      • have to go to a technical design authority with design for system for approval
        • they can check against standard or authorize an exception
      • it all sounds like it meets what we’re talking about but none of it works
      • ops team gets budget cut every year and says no to everything
      • TDA say no to everything and have never updated standards
      • none of the delivery teams ever deliver anything
    • where you can automate compliance checking
      • make it clear how to get the standard updated
      • http://danger.systems/ - failure reports have link to how to update the repo
    • it boils down to good team culture, good onboarding
    • we have a process where people have to present ideas
      • but it’s less about yes/no and more about: it looks like you’re trying to do monitoring. are you aware of these other things that are going on?
    • there’s different return on investment at different scales
      • if you only have 10 devs, you probably don’t want too much process
  • we’re talking about different kinds of standards:
    • languages / libraries / style
      • easily automatable
    • processes
      • much more difficult to automate
  • another org: lots and lots of little things
    • 7 programming languages
    • all operated by same ops
    • people burn out
    • we want to get somewhere where these people don’t feel like everything is their responsibility
    • if you put an app into heroku, and it crashes all the time, heroku won’t fix it for you
  • question: how do you introduce standards where divergence already exists? has anyone done this before?
  • question: how do you deprecate standards? how do you squash those last bits of (eg) php? has anyone done this?

Architecture decision records

  • we’ve only been doing ADRs for just over a year
    • we’re already seeing benefits from people asking questions
  • joyent have an open rfd process
  • rust have rfcs
  • feels like RFCs are about product decisions, whereas ADRs are documenting a technical decision based on an already-established product need
  • what’s the process for who gets to merge a PR?
    • two thumbs up 👍👍
  • numbered ADRs
    • don’t change a decision, supersede it
  • https://github.com/npryce/adr-tools (command sketch at the end of this section)
  • some people don’t like them because their opinion wasn’t the popular one
  • what’s useful to have in ADRs?
  • where are ADRs appropriate?
    • some things are relevant within a single file, can just be a code comment
      • “this is modelled as a finite state machine, so don’t worry if it looks like two nested switch statements which you’d normally run away from”
    • ADRs are for decisions at a larger scale
  • do you do anything to control drift from previous decisions?
    • no, not really
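
A quick sketch of the adr-tools flow described above (the directory and titles are made up):

```sh
# one-off setup: create the ADR directory and record decision 0001
adr init doc/architecture/decisions
# record a new decision - creates the next numbered markdown file
adr new "Use PostgreSQL for the orders service"
# don't change old decisions: supersede them (links the two documents)
adr new -s 9 "Replace PostgreSQL with DynamoDB for the orders service"
```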

when tech problems are really human problems

  • tech lead: derisking
  • talk about:
    • what to build
    • how you might build it
    • milestones
  • during reviews of decisions, i found i didn’t want to okay them
    • or at least i wanted to postpone them
  • i introspected to understand why
  • it was because i didn’t trust the specification of the product in the first place
  • i was trying to insulate the tech from what was actually a human problem
  • sometimes “we can’t deploy to this environment” means “we don’t trust your team to do it correctly”
  • once you realise you’re talking about a human problem as well as a tech problem, what do you do?
  • how do you build trust? esp within an outside party who doesn’t trust your team’s decision making?

live tweeting tech conferences

  • @bridgetkromhout
  • why?
    • visibility!
  • put your twitter handle on every slide please
    • lower the barrier to entry for people to attribute things to you
  • pre-livetweeting checklist
    • find conference hashtag
    • decide which talks (if multi-track)
    • find speaker twitter handles
    • draft tweets in tweetbot
  • even if you don’t use twitter (say you’re on mastodon.social), you could still park a twitter handle for people to use as a reference to you
  • who is the hashtag for?
    • people at the conference
      • attendees - which talk should i go to?
      • speakers
      • organizers who are too busy to go to talks!
    • people not at the conference
      • people with FOMO
      • your followers who aren’t interested in the event
        • they can mute the hashtag
  • #keep #hashtags #simple
    • you don’t need the year in the hashtag (looking at you #scalesummit18)
  • take photos!
    • choose the right angle
    • try to include:
      • speaker
      • slide
      • some of the room
    • what is your goal here?
  • when i don’t tweet
    • E_TOO_MANY_DISTRACTIONS (emails etc)
  • kindness > negativity
    • (backchannels > subtweets)
  • life is too short to argue on the internet
    • mute early, mute often
    • mute “well actually”
  • incidents & accidents
    • misquotes / misunderstandings
  • trip reports
    • highlight activities of your team
  • pages for your own talks
  • conference twitter isn’t the only twitter
    • we are entire human beings