scale summit 2018

scalable build pipelines

A

  • jenkins as a build tool
  • microservices
  • how do we standardize unit testing in our pipeline?
  • is a failure because we broke the pipeline, or because the code is bad?
  • we’ve recently been trying jenkins declarative pipelines
  • bash scripts


B

  • it’s good to have jenkins config in source control

C

  • what we did to scale up (in terms of lots of people able to create new projects)
  • need some toil to set up jenkins
    • automated this: jenkins detects new projects
    • version control deployments
    • team segregations, bringing up new agents
      • just change a couple of yaml lines

B

  • we use bazel within our jenkins & teamcity builds
    • bazel/teamcity autodetects flaky tests & reruns
    • remote caching & remote execution
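
A rough sketch of the sort of flags involved (cache.example.com is a placeholder, and exact flag spelling varies a little between bazel versions); the TeamCity-side flaky-test handling is its own feature, but on the bazel side it looks roughly like this:

```sh
# --flaky_test_attempts reruns failing tests and marks pass-on-retry as FLAKY
# --remote_cache shares action/test results across CI agents
bazel test //... \
  --flaky_test_attempts=3 \
  --remote_cache=https://cache.example.com
# remote execution adds something like --remote_executor=grpc://executor.example.com
```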

A

  • we have tension between dev teams
  • we want entirely reproducible builds
  • dev teams want to be able to control their own builds
  • (do you version control your shared libraries too?) yes
  • we have some standard pipelines rather than a jenkinsfile per repo
  • reduces influence the dev teams can have on the build
  • (does it cause tension because it slows them down?)
    • i find it hard to talk about from dev perspective

B

  • we have a different approach: we have 2 monorepos (!)
    • one server-side / platform / continuously deployed
    • one packaged software with longer release cadence
    • we source control the jenkins jobs in the platform monorepo
    • people will copy/paste the job builder files around
      • github status checks, etc

question

  • who uses something they didn’t build themselves for CI?
    • who’s using AWS codebuild, etc

D

  • circleci, travisci, codeship
  • all have different advantages & disadvantages (eg none supports windows)

B

  • charging models are tricky for CI SaaS
    • charging for builds disincentivizes devs from using it
  • https://buildkite.com/ was the nicest one we found

D

  • problem with circle & codeship
    • basic/pro tiers
    • completely different implementation
    • pro for codeship: ship us a dockerfile
    • basic for codeship: web interface
    • circleci are scrapping the web config

B

  • (how do you scale the number of agents?)
    • you add more agents

(survey around concurrent builds)

  • a few people doing more than 100
  • how do you trade off making things fast for dev teams against the size of the cluster for running jobs?

infrastructure testing

A

  • we (as far as i’m aware) don’t test our infrastructure code
  • in previous places I’ve done bits & bobs with test kitchen & serverspec

B

  • we use terraform & puppet
  • we share both in common modules
  • don’t really test terraform
  • puppet: test-kitchen
    • central product
    • jenkins master
    • rspec testing
  • (do you want to just run terraform plan to keep an eye on your infrastructure or something else?)
  • we have terraform running from jenkins (see the drift-check sketch after this list)
  • goss
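
A minimal sketch of the "run terraform plan to keep an eye on the infrastructure" idea as a scheduled CI job; the directory layout is a placeholder:

```sh
#!/usr/bin/env bash
# scheduled drift check: fail if live infrastructure no longer matches the
# committed config; terraform/production is a hypothetical module directory
set -euo pipefail
cd terraform/production
terraform init -input=false
# -detailed-exitcode: 0 = no changes, 1 = error, 2 = changes pending
terraform plan -input=false -detailed-exitcode || status=$?
case "${status:-0}" in
  0) echo "no drift" ;;
  2) echo "drift detected: plan shows pending changes" ; exit 1 ;;
  *) echo "terraform plan failed" ; exit 1 ;;
esac
```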

C

  • what is infrastructure testing for?
  • tests & declarations duplicate each other
  • testing infrastructure is more at the integration level
    • check that graphite is actually running on this port
    • (is that testing or is it monitoring?)
      • depends how frequently you run it
      • monitoring is continuous
  • interesting to hear about goss for healthchecks

D

  • https://github.com/aelsabbahy/goss
  • you can integrate it into your monitoring system
  • you can specify sets of tests you want
  • runs on server so it’s really fast - quick feedback
  • can autogenerate tests from an existing “perfect” environment
    • analyse state of all ports and create a config
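
A hedged sketch of that goss workflow (the service names are placeholders): snapshot a known-good host into a gossfile, validate it quickly, then serve the same checks as a health endpoint for the monitoring system:

```sh
# snapshot a "perfect" host into goss.yaml
goss add service carbon-cache     # placeholder service name
goss autoadd graphite             # auto-detects matching services/ports/processes
# run the checks - they execute on the box itself, so feedback is quick
goss validate --format documentation
# expose the same checks as an HTTP health endpoint for monitoring to poll
goss serve --listen-addr :8080    # monitoring polls GET /healthz
```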

E

  • i don’t think there needs to be a distinction between testing & monitoring
  • the test pyramid for app code testing doesn’t work for infrastructure
    • (eg: your cloud platform might make a breaking change to your API: your code might still pass but your system is broken)
  • we should talk about feedback cycles:
    • when you check in a piece of code, you want to know when it’s broken
  • swiss cheese model:
    • different layers of tests

F

  • we used puppet testing to manage our rollout from puppet 3 to puppet 5
    • infrastructure testing is useful for regression testing

D

  • it depends on the infrastructure you inherit
    • if you have inherited some pet-style (rather than cattle-style) infrastructure
    • you need to manage the pets even as you migrate to cattle
    • how do you build that pet (even if you don’t want to)

G

  • BDD models are more useful for infrastructure testing

H

  • you can repurpose tests like “is graphite running?” as monitoring
  • the more unit-level tests - serverspec - can ossify the codebase
    • it just tells me a person wrote the algorithm the way they intended to write it

I

  • triggering builds based on github labels
    • mark a PR as possibly affecting performance
    • webhook carries label, triggers gatling run
    • results are written back to PR as comment
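
A rough sketch of how that label-triggered flow might hang together from a CI job, using the GitHub issues API; the repo, PR number, label, gatling invocation, and report path are all placeholders, and a GITHUB_TOKEN is assumed to be in the environment:

```sh
#!/usr/bin/env bash
set -euo pipefail
# the real pipeline would take these from the webhook payload
repo="example-org/example-app"
pr=123

# only run the expensive gatling job if the PR carries the performance label
labels=$(curl -s -H "Authorization: token $GITHUB_TOKEN" \
  "https://api.github.com/repos/$repo/issues/$pr/labels" | jq -r '.[].name')
echo "$labels" | grep -qx "performance" || exit 0

./gradlew gatlingRun                                # hypothetical gatling run
summary=$(cat build/reports/gatling/summary.txt)    # hypothetical report path

# write the results back to the PR as a comment
curl -s -X POST -H "Authorization: token $GITHUB_TOKEN" \
  -d "$(jq -n --arg body "$summary" '{body: $body}')" \
  "https://api.github.com/repos/$repo/issues/$pr/comments"
```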

A

  • problem with developers making gradual changes against their local environments
  • we have a weekly local-environment-teardown ceremony
  • (what does “development environment” mean here?)
    • some people are using the kubernetes docker thing
    • some people have their own kubernetes clusters in aws
    • (are you talking about minikube?)
      • don’t know

J

  • most tools are instance-centric
    • with the growth of FaaS these tools don’t fit any more
    • IAM policies
    • security groups
  • eg you have a hosted RDS db that’s only accessible from a lambda fn
    • you want to verify that only certain ports open on the hosted db

K

  • security/audit people asking for documentation for infrastructure
    • then, asking for the tests
    • “tell me the ports you have open”
  • lynis - node auditing
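
For reference, a typical lynis run on a node looks roughly like this (the report paths are the usual defaults, worth checking against your version):

```sh
# audit the local node; lynis prints warnings, suggestions and a hardening index
sudo lynis audit system
# detailed findings land in a log plus a machine-readable report
less /var/log/lynis.log
less /var/log/lynis-report.dat
```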

GDPR

  • how do you get from the “panic” phase to a reasonable, constructive way of attacking this?
    • bad but effective: stand back long enough for someone else to screw it up
    • but:
      • we won’t know the outcome on machine learning stuff for a while
      • other things like: consent has to be granular: this is really clear
        • you can’t do pre-checked boxes
        • problematic overlays between cookies and consent regulations
      • even doing nothing is a decision, and your lawyers need to be aware
      • when you get reported and the ICO come knocking you need to explain your decision
      • “significant automated decision making”
        • you need a data protection officer who reports directly to board
      • machine learning can codify existing biases
  • we dwell on machine learning because it’s a gnarly edge case
    • there’s lots of lower hanging fruit to start with
    • poor security practices
    • being unaware of where data comes from and goes to
  • what data is kept where, and what consent was given by data subject when they submitted that data
    • you need version control for your privacy policy
  • if you have a marketing email list based on dark-pattern pre-checked tickboxes, can you use that list any more after 25th may?
  • ICO has limited money & people
    • they won’t go after everyone on day one
    • there will probably be some benefit of the doubt
    • some high-profile cases will generate case law

immutable root / OS

  • containeros / smartos / redhat
  • what are the motivations for wanting to do immutable OS?
    • security
    • consistency
    • avoiding snowflakes
    • thinking about host os
  • unikernels
    • i’ve done one
      • you’re responsible for everything
    • it’s an interesting concept
    • a friend does mirage
    • the whole mindset fights against the habit of cracking open a shell and working out what’s going on
      • when something goes wrong, it’s hard to debug
    • a lot of unikernel things are academic projects
      • there’s a lot of duct tape
    • firmwares are like unikernels
      • no visibility, telemetry
    • a lot of them did xen initially
      • they’re bright people
      • it’ll take them a while to get there

faas / serverless

  • how do you find monitoring lambda?
  • service map - visualization of all the traces
  • what are the biggest problems people have seen with serverless?
    • terraform is tricky
    • API gateway isn’t one thing - it’s loads of different things bundled
      • it has a hidden cloudfront in the middle of it
    • main pain point we see isn’t lambda, it’s dynamo
    • response time
      • we found putting lambdas into VPCs added an order of magnitude to response time (with standard test traffic)
    • concurrent execution limits
  • how do you view x-ray? do you aggregate into cloudwatch?
    • if there’s a jump in response time, you use x-ray to investigate
  • anyone using kinesis much?
    • shipping logs
    • my experience: it works fine, but you have to specify the number of shards
  • what languages are people writing lambdas in?
    • python, node, java
    • google functions
  • lambci - run lambda in a docker container (see the CLI sketch at the end of this list)
  • anyone using serverless marketplace?
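
A few hedged one-liners for the points above; the stream, function, and handler names are placeholders:

```sh
# kinesis: you have to pick the shard count up front
aws kinesis create-stream --stream-name app-logs --shard-count 4

# lambda: reserve (and thereby cap) concurrent executions for one function
aws lambda put-function-concurrency \
  --function-name checkout-handler \
  --reserved-concurrent-executions 50

# lambci: run the same handler locally in a lambda-like docker container
docker run --rm -v "$PWD":/var/task lambci/lambda:python3.6 \
  handler.lambda_handler '{"test": true}'
```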

gRPC / protobuf vs REST

  • how do you onboard people?
    • it was a real headfuck at first
    • RPC’s bad, right?
    • what is this binary format
    • i don’t know why we’re doing this
    • generating code as part of workflow? feels weird (see the protoc sketch at the end of this section)
    • you have to think about types a bit more
  • if you don’t have a monorepo, where do schemas live?
    • for a while we had one repo just for protos. that was terrible
    • two monorepos: one for platform (continuous deployment), one for product (6-8 week packaged software release)
    • linting is important to check you haven’t reused an index
    • you need to structure your repository to match your system boundaries
  • graphql as another alternative?
    • we like being able to deprecate things easily and to have a schema
    • django impl called graphene
    • js frontend
      • flow for types
      • relay - state management
      • same types going all the way through
  • we use thrift as part of our content api
  • binary messaging formats
    • capnproto
      • a friend looked at it and liked it for low-latency work
      • no deserialization - directly dump into memory
      • did they need that? I’m skeptical
      • changing schemas
    • simple binary encoding
    • https://dataintensive.net/ has a chapter on serialization formats (chapter 4: encoding and evolution)
  • https://gafferongames.com/ has some articles on designing custom network protocols for games
    • where TCP doesn’t fit but you still need reliability so you have to build something custom on UDP
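
On the "generating code as part of workflow" point, the step itself is small; a sketch assuming a shared proto/ directory and Python stubs (paths and filenames are placeholders for whatever the build actually uses):

```sh
# generate message classes from the shared schemas as a build step
mkdir -p gen/python
protoc --proto_path=proto \
  --python_out=gen/python \
  proto/orders.proto proto/users.proto
# schema linting (e.g. checking field numbers haven't been reused) runs as a
# separate CI step against the same proto/ directory
```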

standards vs autonomy

  • if you introduce something, who’s going to manage it when you go away?
  • autonomy doesn’t mean free from responsibility
    • SRE model: if you want to put something in prod, you have to tick these boxes
    • standard tooling
      • release engineering
    • well-trodden path
  • standards should empower autonomy
    • there should be multiple implementations of the standard to make things work well
  • what does your organization value? what are they concerned about?
    • this will inform your approach to standards and consistency
  • what are the tactics for fixing standards?
    • standards always slip and they’re never up-to-date
      • if you accept this, you build for it
      • where i currently am, we have fully autonomous code teams
        • if they want to use a new thing, they have to own it
        • no tossing over the fence
    • we had a devops rota
      • no constraints on language
      • team that did on call was drawn from devs on all different teams
      • extremely good monetary reward for being called out
        • (although: watch out for incentivising people to build crappy code)
    • changing standards is often a power question
      • a lot of orgs are top-down
        • but then enforcement is lacking
      • automation and testing make top-down command and control easier by detecting breaches
    • we need to precisely define who is involved in standards process
    • who’s accountable for the standard being met?
    • our org had the spotify model (guilds/squads/etc)
      • engineering guild owned how on call works
      • it was in my interest to be a member of the guild
      • we also had an “interested parties review” for a team to propose a novel change
        • amount of rigour proportional to size of change
          • from just a conversation on slack to a full document on wiki + formal meeting
    • our org is full TOGAF
      • services that are expensive have to budget for operational cost
      • have to go to a technical design authority with design for system for approval
        • they can check against standard or authorize an exception
      • it all sounds like it meets what we’re talking about but none of it works
      • ops team gets budget cut every year and says no to everything
      • TDA say no to everything and have never updated standards
      • none of the delivery teams ever deliver anything
    • where you can automate compliance checking
      • make it clear how to get the standard updated
      • http://danger.systems/ - failure reports have link to how to update the repo
    • it boils down to good team culture, good onboarding
    • we have a process where people have to present ideas
      • but it’s less about yes/no and more about: it looks like you’re trying to do monitoring. are you aware of these other things that are going on?
    • there’s different return on investment at different scales
      • if you only have 10 devs, you probably don’t want too much process
  • we’re talking about different kinds of standards:
    • languages / libraries / style
      • easily automatable
    • processes
      • much more difficult to automate
  • another org: lots and lots of little things
    • 7 programming languages
    • all operated by same ops
    • people burn out
    • we want to get somewhere where these people don’t feel like everything is their responsibility
    • if you put an app into heroku, and it crashes all the time, heroku won’t fix it for you
  • question: how do you introduce standards where divergence already exists? has anyone done this before?
  • question: how do you deprecate standards? how do you squash those last bits of (eg) php? has anyone done this?

Architecture decision records

  • we’ve only been doing ADRs for just over a year
    • we’re already seeing benefits from people asking questions
  • joyent have an open rfd process
  • rust have rfcs
  • feels like RFCs are about product decisions, whereas ADRs are documenting a technical decision based on an already-established product need
  • what’s the process for who gets to merge a PR?
    • two thumbs up 👍👍
  • numbered ADRs
    • don’t change a decision, supersede it
  • https://github.com/npryce/adr-tools (command sketch at the end of this section)
  • some people don’t like them because their opinion wasn’t the popular one
  • what’s useful to have in ADRs?
  • where are ADRs appropriate?
    • some things are relevant within a single file, can just be a code comment
      • “this is modelled as a finite state machine, so don’t worry if it looks like two nested switch statements which you’d normally run away from”
    • ADRs are for decisions at a larger scale
  • do you do anything to control drift from previous decisions?
    • no, not really
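
A quick sketch of the adr-tools flow described above (the directory and titles are made up):

```sh
# one-off setup: create the ADR directory and record decision 0001
adr init doc/architecture/decisions
# record a new decision - creates the next numbered markdown file
adr new "Use PostgreSQL for the orders service"
# don't change old decisions: supersede them (links the two documents)
adr new -s 9 "Replace PostgreSQL with DynamoDB for the orders service"
```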

when tech problems are really human problems

  • tech lead: derisking
  • talk about:
    • what to build
    • how you might build it
    • milestones
  • during reviews of decisions, i found i didn’t want to okay them
    • or at least i wanted to postpone them
  • i introspected to understand why
  • it was because i didn’t trust the specification of the product in the first place
  • i was trying to insulate the tech from what was actually a human problem
  • sometimes “we can’t deploy to this environment” means “we don’t trust your team to do it correctly”
  • once you realise you’re talking about a human problem as well as a tech problem, what do you do?
  • how do you build trust? esp within an outside party who doesn’t trust your team’s decision making?

live tweeting tech conferences

  • @bridgetkromhout
  • why?
    • visibility!
  • put your twitter handle on every slide please
    • lower the barrier to entry for people to attribute things to you
  • pre-livetweeting checklist
    • find conference hashtag
    • decide which talks (if multi-track)
    • find speaker twitter handles
    • draft tweets in tweetbot
  • even if you don’t use twitter (say you’re on mastodon.social), you could still park a twitter handle for people to use as a reference to you
  • who is the hashtag for?
    • people at the conference
      • attendees - which talk should i go to?
      • speakers
      • organizers who are too busy to go to talks!
    • people not at the conference
      • people with FOMO
      • your followers who aren’t interested in the event
        • they can mute the hashtag
  • #keep #hashtags #simple
    • you don’t need the year in the hashtag (looking at you #scalesummit18)
  • take photos!
    • choose the right angle
    • try to include:
      • speaker
      • slide
      • some of the room
    • what is your goal here?
  • when i don’t tweet
    • E_TOO_MANY_DISTRACTIONS (emails etc)
  • kindness > negativity
    • (backchannels > subtweets)
  • life is too short to argue on the internet
    • mute early, mute often
    • mute “well actually”
  • incidents & accidents
    • misquotes / misunderstandings
  • trip reports
    • highlight activities of your team
  • pages for your own talks
  • conference twitter isn’t the only twitter
    • we are entire human beings