configuration management camp 2016 notes

mark shuttleworth, juju

q&a

mitchellh, vault

what is it?

  • secret management
    • “secret” vs “sensitive”
    • secrets: creds, certs, keys, passwords
    • sensitive: phone numbers, mother’s maiden name, emails, dc locations, PII
    • we think correctly about secrets
    • maybe not so much with sensitive data
  • my definition: “anything which makes the news”
    • this includes “secrets” and “sensitive” categories above
  • vault is designed for this case
  • certificates
    • specific type of secret
    • backed by RFCs
    • used almost universally
    • historically a pain to manage

the current state

  • unencrypted secrets in files
    • fs perms
  • encrypted secrets with the key on the same fs
    • eg chef databags
  • also, how does the secret get on to the system?
  • why don’t we use config management for secrets? because config management tools don’t have features we need:
    • no access control
    • no auditing
    • no revocation
    • no key rolling
  • why not (online) databases?
    • rdbms, consul, zookeeper, etc
      • (one of the big motivations to create vault was we didn’t want people using consul to store secrets)
    • not designed for secrets
    • limited access controls
    • typically plaintext storage
      • (even if the filesystem is encrypted, which has its own issues)
    • no auditing or revocation abilities
  • how to handle secret sprawl?
    • secret material is distributed
      • don’t want one big secret with all the access to everything
    • who has access?
    • when were they used?
    • what is the attack surface?
    • in the event of a compromise, what happened? what was leaked?
      • can we design a system that allows us to audit what was compromised?
  • how to handle certs?
    • openssl command line (ugh)
    • where do you store the keys?
    • if you have an internal CA, how do you manage that?
    • how do you manage CRLs?

the glorious future

  • vault goals:
    • single source for secrets, certificates
    • programmatic application access (automated)
    • operator access (manual)
    • practical security
    • modern data center friendly
      • (private or cloud, commodity hardware, highly available, etc)
  • vault features
    • secure secret storage
    • full cert mgmt
    • dynamic secrets
    • leasing, renewal
    • auditing
    • acls
  • secure secret storage
    • encrypt data in transit and at rest
    • at rest: 256-bit AES-GCM
    • TLS 1.2 for clients
    • no HSM required (though you can use one if you want)
  • inspired by unix filesystem
    • mount points, paths
$ vault write secret/foo bar=bacon
success!
$ vault read secret/foo
Key             Value
lease_id        ...
lease_duration  ...
lease_renewable false
bar             bacon
  • dynamic secrets
    • never provide “root” creds to clients
    • secrets made on request per client
$ vault mount postgresql
$ vault write postgresql/config/connection value=...
# written once, can never be read back

$ vault read postgresql/creds/production
-> get back a freshly-created user
-> if you don't come back to vault within an hour, vault will drop the user
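(sketch, not from the talk: the lease_id returned above can be renewed or revoked with the CLI; <lease_id> is a placeholder)
$ vault renew <lease_id>
$ vault revoke <lease_id>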
  • auditing
    • pluggable audit backends
    • request and response logging
    • prioritizes safety over availability
    • secrets hashed in audits (salted HMAC)
      • searchable but not reversible
  • rich ACLs
  • flexible auth
    • pluggable backends
    • machine-oriented vs operator-oriented
    • tokens, github, appid, user/pass, TLS cert
    • separate authentication from authorization
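(sketch, not from the talk: what a couple of those auth backends look like from the CLI; the policy name, policy file and GitHub org are made up)
$ vault policy-write deploy deploy-policy.hcl
$ vault auth-enable github
$ vault write auth/github/config organization=<your-org>
$ vault auth -method=github token=$GITHUB_TOKEN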
  • high availability
    • leader election
    • active/standby
    • automatic failover
    • (depending on backend: consul, etcd, zookeeper)
  • “unsealing the vault”
    • data in vault is encrypted
    • vault requires encryption key, doesn’t store it
    • must be provided to unseal the vault
    • must be entered on every vault restart
    • turtles problem!
      • this secret is sometimes on a piece of paper in a physical safe
    • alternative: shamir’s secret sharing
      • we split the key
      • a number of operators have a share of the key
      • N shares, T required to recompute master
      • default: N:5 T:3
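(sketch, not from the talk: the init/unseal flow with those defaults; vault init prints the 5 key shares and the vault stays sealed until 3 of them are entered)
$ vault init -key-shares=5 -key-threshold=3
-> prints 5 unseal key shares plus an initial root token
$ vault unseal
# repeat vault unseal with different key shares until the threshold (3) is reached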

vault in practice

  • ways to access vault
    • http + TLS API, JSON
    • CLI
    • consul-template (for software you won’t rewrite to talk directly to vault)
  • application integration
    • the best way to access vault
    • vault-aware
    • native client libraries
    • secrets only in memory
    • safest but high-touch
  • consul-template
    • secrets templatized into application configuration
    • vault is transparent
    • lease management is automatic
    • non-secret configuration still via consul
    • (then put your secrets onto a ramdisk and make sure the ramdisk can’t be swapped)
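(sketch, not from the talk: a minimal consul-template run that renders a vault secret onto a ramdisk and reloads the app; template, output path and reload command are made up)
$ cat app.conf.ctmpl
password = "{{ with secret "secret/foo" }}{{ .Data.bar }}{{ end }}"
$ consul-template -template "app.conf.ctmpl:/dev/shm/app.conf:service app reload"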
  • PII
    • it’s everywhere
    • “transit” backend for vault
    • encrypt/decrypt data in transit
    • avoid secret management in client apps
    • builds on vault foundation
    • web server has no encryption keys
    • requires two-factor compromise (vault + datastore)
    • decouples storage from encryption and access control
$ vault write -f transit/keys/foo
$ vault read transit/keys/foo
-> don't get the key back, just metadata about the key
to send:
$ echo msg | vault write transit/encrypt/foo plaintext=-

to recv:
$ vault write transit/decrypt/foo ciphertext=-
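(sketch, not from the talk: in practice the transit backend takes and returns base64-encoded plaintext, roughly like this)
$ vault write transit/encrypt/foo plaintext=$(echo -n "my message" | base64)
-> returns a ciphertext of the form vault:v1:...
$ vault write transit/decrypt/foo ciphertext=vault:v1:...
-> returns base64 plaintext; decode it with base64 -d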
  • rich libraries to do this automatically: vault-rails
  • disadvantage: everything round-trips through vault
    • can increase performance with another trick (…?)
  • CA
    • vault acts as internal CA
    • vault stores root CA keys
    • dynamic secrets - generates signed TLS keys
    • no more tears
  • mutual TLS
$ vault mount pki
$ vault write pki/root/generate/internal common_name=myvault.com ttl=87600h
...
...

Q&A

  • do you support letsencrypt certs?
    • just started planning this morning
  • if I use vault with consul, can I use that consul for something else?
    • in theory yes
    • vault uses a subpath within consul, and encrypts everything
    • we recommend you use an ACL to prevent access to other apps (it’s all encrypted garbage but if someone flips a bit you lose everything)
  • what if there’s a rogue operator who’s keylogged?
    • if you’re running it the right way it should be okay
    • if the operator doesn’t have root on the machine running vault it should be okay
    • if the operator has root then they can just coredump the vault process and get the key that way
    • a rogue root user is not in our threat model

ignites

centos config mgmt sig, julien pivotto

  • puppet chef ansible salt juju
  • deeply linked with the OS, from the start, until EOL

where do you get tools?

  • vendors, epel, make install, gems etc

YAG

  • regular/commodity users -> EPEL
  • (gap)
  • advanced users

vendor packages

  • where is the srpm?
  • where is the buildchain?

we depend on those tools

  • they have bugs

CentOS everything

  • public build system
  • everything needed to build software

SIG

  • special interest group
  • topic-focussed
  • release RPMs
  • can we make a CFGMGMT SIG?
  • objectives:
    • recent versions of cfg mgmt tools

a post-CM infrastructure delivery pipeline: @beddari

  • we were using CM but not winning
  • what we had built with love
    • automated tests
    • monitoring
  • but it was a total failure
    • unmanageable rebuild times
    • envs were starting to leak
  • our systems are “eventually repeatable”
    • darn it, test that change in prod
  • docker docker docker docker
  • solution: stop doing configuration management!
  • artifacts and pipelines
  • inputs are typically managed artifacts
  • change
  • feed input to packer, which runs a builder that applies the change, producing output (see the packer sketch after this list)
  • output: a versioned artifact (rpm)
    • repos, packages, images, containers
  • abstraction is key to doing changes
  • defns:
    • an input-change-output chain is a project
    • a project is versioned in git
    • all artifacts are testable
  • your new job is: describing state to produce artifacts and keeping that state from drifting
  • http://nubis-docs.readthedocs.org/en/latest/MANIFESTO/
  • change from stateful VMs to managing artifacts
    • this worked really well
    • packer with masterless puppet
    • terraform and ansible
    • masterless puppet to audit and correct drift
    • yum upgrade considered harmful
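(sketch, not from the talk: the input → builder → versioned-artifact step as a plain packer invocation; template name and variable are made up)
$ packer validate app-image.json
$ packer build -var "version=1.2.3" app-image.json
-> output: a versioned image/rpm artifact that the pipeline can test and promote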

SSHave your puppets! - Thomas Gelf

  • own the tools you run
  • replace a bunch of shell scripts
  • puppet infrastructure setup sucks
    • tried to scale PuppetDB or PE?
  • puppet kick (deprecated)
  • mcollective is moribund
  • SSH - why not?
    • centralized secrets
ssh <node> 'facter -j'
puppet master --compile <node> > catalog
scp catalog <node>:catalog
ssh <node> 'puppet apply --catalog catalog'
  • problem: file server
  • shipping files with ssh?
    • fix the catalog on the fly – change puppet:// urls
  • problem: pluginsync
    • custom facts
  • reports, exported resources
    • they just work
  • one node per catalog - why?
    • I want partial catalogs, different schedules, different user accounts
  • tasks on demand
  • testing
  • docker docker

stop sharing “war stories” @felis_rex

  • “war stories”
  • “in the trenches”
  • how is this so much of a thing?
  • pop culture depictions of war are very common
  • mainstream media depictions of war
  • yes, our jobs can be stressful
    • we can be under pressure
    • but it’s still not an actual war situation
  • this is about culture
    • we’re a corner of OSS community
    • it’s not about making great code
    • the way we talk to each other matters
    • we should be inclusive
  • alternatives to “war story”?
    • anecdote
    • story
    • experience report

a pythonic approach to continuous delivery

  • I have working python code, how do I start now?
  • a proper deployment artifact:
    • python package
      • (debian package? docker?)
  • it should be uniquely versioned
  • it should manage dependencies
  • https://github.com/blue-yonder/pyscaffold
  • CI:
    • run tests, build package
  • push to artifact repository (see the devpi sketch after this list)
  • http://doc.devpi.net
  • automated deploy: ansible
    • we use virtualenvs to isolate dependencies
  • pip doesn’t do true dependency resolution
  • maintain and refactor your deployment
  • pypa/pip issue #988 (pip needs a dependency resolver)
  • OS package managers v pip: the two worlds should unite
  • pip is still optimized for a manual workflow (eg no --yes option)
  • you can build your own CD pipeline!
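(sketch, not from the talk: the build-and-publish leg of such a pipeline with devpi; index URL, user and package name are made up)
$ python setup.py test bdist_wheel
$ devpi use https://devpi.example.com/myteam/staging
$ devpi login myteam
$ devpi upload
# on the target host (eg via ansible):
$ virtualenv /opt/myapp
$ /opt/myapp/bin/pip install myapp==1.2.3 -i https://devpi.example.com/myteam/staging/+simple/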

felt - samuel vandamme

  • front-end load testing
  • browser-based load testing tool with scenarios
  • history:
    • looked at Squish - selenium alternative
      • uses chrome
    • looked at phantomjs
      • headless so it’s good
      • missed some APIs
    • looked at SlimerJS
      • not headless but still performant
  • use cases for felt
    • quick load test
    • FE apps (eg angularJS)
    • simulate user

essential application management with tiny puppet @alvagante

  • cfgmgmt not just about files
  • but also files
  • puppet module: example42/tp
  • http://tiny-puppet.com/
  • tp::conf { 'nginx': … }
  • tp::conf { 'nginx::example.com.conf': … }
  • tp::dir { 'nginx::www.example42.com': … }
  • tp::test { 'redis': }
  • tp::install { 'redis': … }

moving to puppet 4 at spotify

history:

  • puppet 2.7
    • dark, hacky features (eg dynamic scoping)
  • puppet 3
    • functional insanity with some pretty cool new tools and toys
    • rspec-puppet, librarian-puppet etc came along
    • we upped our game
  • puppet 4
    • language spec!
    • type system!
    • lambdas
    • iterations
    • all the things
    • sanity
    • ponies!
  • as a module maintainer, it’s painful
    • maintaining compatibility with 3 and 4 is frustrating

how we did it

  • step 1: breathe
    • talk it through on your team
  • step 2: get to puppet 3.8 first
    • the last 3.x release
    • it starts throwing deprecation warnings at you
      • fix these
      • scoping, templates, etc
    • upgrade your modules
      • vox pupuli and puppetlabs modules Just Work™
  • step 3: enable the future parser (see the config sketch after this list)
    • (don’t do this on puppet < 3.7.4)
    • types of defaults will matter
      • eg where default is empty string but you can pass in an array
      • that won’t work any more
    • clustered node data
      • $facts hash
        • unshadowable, unmodifiable
        • unlike the old $::operatingsystem top-scope fact lookup style
  • step 4: upgrade to puppet 4
    • two options:
      • distro packages
      • move to puppetlabs’s omnibus packages
    • recommend using omnibus, but changes some things:
      • /var/lib/puppet moves to /opt/puppetlabs/puppet
  • step 5: caek
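(sketch, not from the talk: enabling the future parser in step 3 is a one-line puppet.conf change on 3.8; the path assumes distro packages)
$ cat >> /etc/puppet/puppet.conf <<'EOF'
[main]
parser = future
EOF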

the actual upgrade

results

  • we did it in:
    • 1-2 weeks of prep
    • 1 week of rollout
    • 2-3 days of cleanup
    • 0 production incidents
    • (over 10k nodes)
  • but.. we cheated
    • we migrated to the future parser over a year ago :)

vox pupuli

  • 60 modules & tooling
  • 50 contributors
    • basically everyone has commit access to everything
  • join the revolution!

Q&A

  • bob: how many things broke when you enabled the future parser?
    • weird scoping with templates

etcd, Jonathan Boulle @baronboulle

what is etcd?

  • name: /etc distributed
  • a clustered key-value store
    • GET and SET ops
  • a building block for higher order systems
    • primitives for distributed systems
      • distributed locks
      • distributed scheduling

history of etcd

  • 2013.8: alpha (v0.x)
  • 2015.2: stable (v2.0+)
    • stable replication engine (new Raft impl)
    • stable v2 API
  • 2016.? (v3.0+)
    • efficient, powerful API
      • some operations we wanted to support couldn’t be done in the existing API
    • highly scalable backend
    • (ed: what does this mean?)

etcd today

  • production ready

why did we build etcd?

  • coreos mission: “secure the internet”
  • updating servers = rebooting servers
  • move towards app container paradigm
  • need a:
    • shared config store (for service discovery)
    • distributed lock manager (to coordinate reboots)
  • existing solutions were inflexible
    • (zookeeper undocumented binary API – expected to use C bindings)
    • difficult to configure

why use etcd?

  • highly available
  • highly reliable
  • strong consistency
  • simple, fast http API

how does it work?

  • raft
    • using a replicated log to model a state machine
    • “In Search of an Understandable Consensus Algorithm” (Ongaro, 2014)
      • response to paxos
      • (zookeeper had its own consensus algorithm)
      • raft is meant to be easier to understand and test
  • three key concepts:
    • leaders
    • elections
    • terms
  • the cluster elects a leader for every term
  • all log appends (…)
  • implementation
    • written in go, statically linked
    • /bin/etcd
      • daemon
      • 2379 (client requests/HTTP + JSON api)
      • 2380 (p2p/HTTP + protobuf)
    • /bin/etcdctl
      • CLI
    • net/http, encoding/json

etcd cluster basics

  • eg: have 5 nodes
    • can lose 2
    • lose 3, lose quorum -> cluster unavailable
  • prefer odd-numbers for cluster sizes
  • the more nodes you have, the more failures you can tolerate
    • but throughput drops, because every operation needs to hit a majority of nodes

simple HTTP API (v2)

  • GET /v2/keys/foo
  • GET /v2/keys/foo?wait=true
    • poll for changes, receive notifications
  • PUT /v2/keys/foo -d value=bar
  • DELETE /v2/keys/foo
  • PUT /v2/keys/foo?prevValue=bar -d value=ok
    • atomic compare-and-swap
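(sketch, not from the talk: the same calls as curl one-liners against a local member)
$ curl -XPUT http://127.0.0.1:2379/v2/keys/foo -d value=bar
$ curl http://127.0.0.1:2379/v2/keys/foo
$ curl "http://127.0.0.1:2379/v2/keys/foo?wait=true"
$ curl -XPUT "http://127.0.0.1:2379/v2/keys/foo?prevValue=bar" -d value=ok
$ curl -XDELETE http://127.0.0.1:2379/v2/keys/foo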

etcd applications

  • locksmith
    • cluster wide reboot lock - “semaphore for reboots”
    • CoreOS updates happen automatically
      • prevent all machines restarting at once
    • set key: Sem=1
      • take a ticket by CASing and decrementing the number
      • release by CASing and incrementing (see the curl sketch after this list)
  • flannel
    • virtual overlay network
      • provide a subnet to each host
      • handle all routing
    • uses etcd to store network configuration, allocated subnets, etc
  • skydns
    • service discovery and DNS server
    • backed by etcd for all configuration and records
  • vulcand
    • “programmatic, extendable proxy for microservices”
    • HTTP load balancer
    • config in etcd
    • (though actual proxied requests don’t touch etcd)
  • confd
    • simple config templating
    • for “dumb” applications
    • watch etcd for changes, render templates with new values, reload
    • (sounds like consul-template mentioned in the vault talk?)
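(sketch, not from the talk: the locksmith-style take-a-ticket flow expressed as raw CAS calls; key name and values are made up)
$ curl -XPUT http://127.0.0.1:2379/v2/keys/sem -d value=1
# take a ticket: succeeds only if nobody raced us to it
$ curl -XPUT "http://127.0.0.1:2379/v2/keys/sem?prevValue=1" -d value=0
# ... reboot ...
# release the ticket
$ curl -XPUT "http://127.0.0.1:2379/v2/keys/sem?prevValue=0" -d value=1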

scaling etcd

  • recent improvements (v2)
    • asynchronous snapshotting
      • append-only log-based system
        • grows indefinitely
      • snapshot, purge log
      • safest: stop-the-world while you do this
        • this is problematic because it blocks all writes
      • now: in-memory copy, write copy to disk
        • can continue serving while you purge the copy
    • raft pipelining
      • raft is based around a series of RPCs (eg AppendEntries)
      • etcd previously used synchronous RPCs
      • send next message only after receiving previous response
      • now: optimistically send series of messages without waiting for replies
      • (can these messages be reordered?)
  • future improvements (v3)
    • “scaling etcd to thousands of nodes”
    • efficient and powerful API
      • flat binary key space
      • multi-object transaction
        • extends CAS to allow conditions on multiple keys
      • native leasing API
      • native locking API
      • gRPC (HTTP2 + protobuf)
        • multiple streams sharing a single tcp connection
        • compacted encoding format
    • disk-backed storage
      • historically: everything had to fit in memory
      • keep cold historical data on disk
      • keep hot data in memory
      • support “entire history” watches
      • user-facing compaction API
    • incremental snapshots
      • only save the delta instead of the full data set
      • less I/O and CPU cost per snapshot
      • no bursty resource usage, more stable performance
    • upstream recipes for common usage patterns
      • leases: attaching ownership to keys
      • leader election
      • locking resources
      • client library to support these higher level use cases

War Games: flight training for devops, Jorge Salamero Sanz @bencerillo

  • server density: monitoring
  • the cost of uptime
  • expect downtime
    • prepare
    • respond
    • postmortem
  • incident example:
    • power failure to half our servers
      • primary generator failed
      • backup generator failed
      • UPS failed
    • automated failover unavailable
      • (known failure condition)
    • manual DNS switch required
    • expected impact: 20 minutes
    • actual impact: 43 minutes
  • human factor
    • unfamiliarity with process
    • pressure of time sensitive event (panic)
    • escalation introduces delays
  • documented procedures
    • checklists! ✓
    • not to follow blindly – knowledge and experience still valuable
    • independent system
    • searchable
    • list of known issues and documented workarounds/fixes
  • checklists – why?
    • humans have limitations
      • memory and attention
      • complexity
      • stress and fatigue
      • ego
    • pilots, doctors, divers:
      • Bruce Willis Ruins All Films
    • checklists help humans
      • increase confidence
      • reduce panic
  • realistic scenarios for your game day
    • replica environment
    • or mock command line
    • record actions and timing
    • multiple failures
    • unexpected results
  • simulation goals
    • team and individual test of response
    • run real commands
    • training the people
    • training the procedures
    • training the tools
  • postmortem
    • failure sucks
      • but it happens, and we should recognize this
    • fearless, blameless
    • significant learning
    • restores confidence
    • increases credibility
    • timing
      • short regular updates
      • even “we’re still looking in to it”
      • ~1 week to publish full version
        • follow-up incidents
        • check with 3rd party providers
        • timeline for required changes
    • content
      • root cause
      • turn of events which led to failure
      • steps to identify & isolate the cause

Empowering developers to deploy their own data stores with Terraform and puppet, @bobtfish

  • http://www.slideshare.net/bobtfish/empowering-developers-to-deploy-their-own-data-stores
  • https://github.com/bobtfish/AWSnycast
  • puppet data in modules
    • this is amazing. it changed our lives
    • apply regex to hostname search1-reviews-uswest1aprod to parse out cluster name
    • elasticsearch_cluster { 'reviews': }
    • developers can create a new cluster by writing a yaml file
    • pull the data out of the puppet hierarchy
    • reuse the same YAML for service discovery and provisioning
  • puppet ENC - external node classifier
    • a script called by puppetmaster to generate node definition
    • our ENC looks at AWS tags
      • cluster name, role, etc
    • puppet::role::elasticsearch_cluster => cluster_name = reviews
    • stop needing individual hostnames!
      • host naming schemes are evil!
        • silly naming schemes (themed on planets)
        • “sensible” naming schemes (based on descriptive role)
          • do you identify mysql master in hostname?
          • what happens when you failover?
    • customize your monitoring system to actually tell you what’s wrong
      • “the master db has crashed” v “a db has crashed”
  • terraform has most of the pieces
    • it’s awesome
      • as long as you don’t use it like puppet
      • roles/profiles => sadness
    • treat it as a low level abstraction
    • keep things in composable units
    • add enough workflow to not run with scissors
    • don’t put logic in your terraform code
    • it’s a sharp tool
      • can easily trash everything
    • it’s the most generic abstraction possible
    • map JSON (HCL) DSL => CRUD APIs
      • it will do anything
      • as a joke I wrote a softlayer terraform provider which used twilio to phone a man and request a server to be provisioned
    • cannot do implicit mapping
      • but puppet/ansible/whatever can?
      • “Name” tag => namevar
      • Only works in some cases - not everything has tags!
    • implicit mapping is evil
      • eg: puppet AWS
      • in March 2014, I wanted to automate EC2 provisioning
        • I could write a type and provider in puppet to generate VPCs
        • @garethr stole it and it’s now puppet AWS
      • BUG - prefetch method eats exceptions (fixed now)
        • you ask AWS for all VPCs up front (in prefetch)
        • if you throw an exception while prefetching, it was silently swallowed
        • so it looks like there are no VPCs
        • now you generate a whole bunch of duplicates
        • workaround: an exception class with an overridden to_s method which would kill -9 itself
          • works, but not pretty
        • I wouldn’t recommend puppet-aws unless you’re on puppet 4 which fixed this bug
  • terraform modules
    • reusable abstraction (in theory)
    • sharp edges abound if you have deep modules or complicated modules
      • these are bugs and will be fixed
      • you can’t treat terraform like puppet
    • use modules, but don’t nest modules
      • use version tags
      • use other git repos
        • split modules into git repos
  • state
    • why even is state?
    • how do you cope with state?
      • use hashicorp/atlas
        • it will run terraform for you
        • it solves these problems
      • we.. reinvented atlas
        • workflow (locking!) is your problem
        • if two people run terraform concurrently, you’ll have a bad time
        • state will diverge
        • merging is not fun
    • split state up by team
      • search team owns search statefile
    • S3 store
      • many read, few write
    • wrap it yourself (make, jenkins, etc)
      • don’t install terraform in $PATH
        • you don’t want people running terraform willy-nilly
  • jenkins to own the workflow
    • force people to generate a plan and okay it
      • people aren’t evil, but they will take shortcuts
      • if they can just run terraform apply without planning first, they will do so
      • protecting me from myself
    • “awsadmin” machine + IAM Role as slave
    • Makefile based workflow
    • jenkins job builder to template things
      • you shouldn’t have shell scripts typed in to the jenkins text boxes
    • split up the steps (see the terraform sketch after this list)
      • refresh state, and upload the refreshed state to S3
      • plan + save as an artefact
      • filter plan!
        • things in AWS that terraform doesn’t know about
        • lambda functions which tag instances based on who created them
        • terraform doesn’t know these tags, so will remove them
        • we filter this stuff out
      • approve plan
      • apply plan, save state
  • nirvana
    • self service cluster provisioning
      • devs define their own clusters
      • 1 click from ops to approve
    • owning team gets accounted
      • aws metadata added as needed
      • all metadata validated
    • clusters built around best practices
      • and when we update best practices, clusters get updated to match
    • can abstract further in future
    • opportunities to do clever things around accounting
      • dev requested m4.xlarges, but we have m4.2xlarges as reserved instances
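(sketch, not from the talk: the refresh / plan-as-artefact / apply split from the jenkins jobs as plain terraform commands; bucket and file names are made up)
$ terraform refresh
$ terraform plan -out=search-team.tfplan
# the plan file is archived as a jenkins artefact, filtered and approved, then:
$ terraform apply search-team.tfplan
$ aws s3 cp terraform.tfstate s3://my-tfstate-bucket/search/terraform.tfstate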

inspec: skynet testing, @arlimus

  • slides: http://ow.ly/XPkvT
  • InSpec: Infrastructure Specification
    • v similar to server-spec
    • started on top of the server-spec project
  • code breaks
    • normal accident theory
  • why?
    • reduce number of defects
    • security and compliance testing
  • test any target
    • bare metal / VMs / containers
    • linux / windows / other / legacy
  • tiny howto
    • install from rubygems, or clone git repo
    • (see slides)
    • test local node
    • test remote via ssh
      • (no ruby / agent on node)
    • test remote via winRM (still no agent)
    • test docker container
  • example test
describe package('wget') do
  it { should be_installed }
end

describe file('/fetch-all.sh') do
  it { should be_file }
  its('owner') { should eq 'root' }
  its('mode') { should eq 0640 }
end
inspec exec dtest.rb -t docker://f02e
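(sketch, not from the slides: the ssh/winrm targets mentioned above look like this; hosts and credentials are made up)
inspec exec dtest.rb -t ssh://root@web01 --key-files ~/.ssh/id_rsa
inspec exec dtest.rb -t winrm://Administrator@win01 --password 's3cret'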
  • run via test-kitchen
    • kitchen-inspec
  • demo: solaris box running within test kitchen (on vagrant)
    • test-kitchen verify normally takes a long time because it installs a bunch of stuff on the box
    • much faster to verify with inspec
    • solaris test:
describe os do
  it { should be_solaris }
end

describe package('network/ssh') do
  it { should be_installed }
end

describe service('/network/ssh') do
  it { should be_running }
end

being expressive

describe file('/etc/ssh/sshd_config') do
  its(:content) { should match /Protocol 2/ }
end

this regex is brittle. comment? prefix/suffix?

Better:

describe sshd_config do
  its('Protocol') { should cmp 2 }
end

custom resources help with this.

Configuration Management vs. Container Automation, @johscheuer and Arnold Bechtoldt

  • “containers suck too”
    • “docker security is a mess!”
      • physical separation?
    • “images on docker hub are insecure!”
      • just community contributions
        • lots of docker images contain bash with shellshock vulnerability
      • docker images are artefacts, treat them like vmdk/vhd/vdi/deb/rpm
      • build your own lightweight base images
      • use base images without lots of userland tools if possible (eg alpine linux)
    • dockerfile is “return of the bash”
      • over-engineered dockerfiles
      • replace large shell scripts with CM running outside the container
      • what we want is configuration management with a smaller footprint
      • avoid requiring ruby/python/etc inside the container just to get your CM tool running
    • scheduling/orchestration is a whole new area
  • http://rexify.org - a perl-based CM tool
  • “it doesn’t matter how many resources you have, if you don’t know how to use them, it will never be enough”
  • use cases
    • continuous integration & delivery

automating AIX with chef, Julian Dunn

  • AIX was first released in 1987 (?)
  • I first came in as an engineering manager, but I knew nothing about this platform
  • some quirks, some pains
  • Test Kitchen support – rough and unreleased
  • traditional management of AIX:
    • manually
    • SMIT - menu-driven config tool
    • transforming old-school shops has two routes:
      • migrate AIX to linux, then automate with chef
      • manage AIX with chef, then migrate to linux
        • second route is easier as it abstracts away the OS so there’s less to learn at each step
  • challenges
    • lack of familiarity with platform, hardware, setup
      • hypervisor-based but all in hardware
    • XLC - proprietary compiler
      • if you use XLC, output is guaranteed forward-compatible forever
      • binaries from 1989 still run on AIX today
    • can’t use GNU-isms
      • no bash, it’s korn shell
      • no less!
    • no real package manager
      • bff
    • you can use rpm
      • but no yum or anything
    • two init systems (init and SRC)
      • key features which are missing!
    • virtualization features
      • sometimes cool, sometimes not
  • platform quirks & features
    • all core chef resources work out of the box on AIX
    • special resources in core
      • bff_package
      • service - need to specify init or SRC, some actions don’t work
    • more specific AIX resources in aix library cookbook
      • manage inittab etc
  • chef’s installer is sh-compatible which is necessary for AIX
# WARNING: REQUIRES /bin/sh
#
# - must run on /bin/sh on solaris 9
# - must run on /bin/sh on AIX 6.x
#

rkt and Kubernetes: What’s new with Container Runtimes and Orchestration, Jonathan Boulle @baronboulle

  • appc pods ≅ kubernetes pods
  • rkt
    • simple cli tool
    • no (mandatory) daemon
      • a big daemon running as root doesn’t feel like the best default setup
    • no (mandatory) API
    • bash/systemd/kubelet -> rkt run -> application(s)
  • stage0 (rkt binary)
    • primary interface to rkt
    • discover, fetch, manage app images
    • set up pod filesystems
    • manage pod lifecycle
      • rkt run
      • rkt image list
  • stage1 (swappable execution engines)
    • default impl
      • systemd-nspawn+systemd
      • linux namespaces + cgroups
    • kvm impl
      • based on lkvm+systemd
      • hw virtualization for isolation
    • others?
  • rkt TPM measurement
    • used to “measure” system state
    • historically just use to verify bootloader/OS
    • CoreOS added support to GRUB
    • rkt can now record information about running pods in the TPM
    • tamper-proof audit log
  • rkt API service
    • optional gRPC-based API daemon
    • exposes information on pods and image
    • runs as unprivileged user
    • read-only
    • easier integration
  • recap: why rkt?
    • secure, standards, composable
  • rktnetes
    • using rkt as the kubelet’s container runtime
    • a pod-native runtime
    • first-class integration with systemd hosts
    • self-contained pods process model -> no SPOF
    • multiple-image compatibility (eg docker2aci)
    • transparently swappable container engines
  • possible topologies
    • kubelet -> systemd -> rkt pod
    • could remove systemd and run pod directly on kubelet (kubelet -> rkt pod)
  • using rkt to run kubernetes
    • kubernetes components are largely self-hosting, but not entirely
    • need a way to bootstrap kubelet on the host
    • on coreos, this means in a container (because that’s the only way to run things on coreos)..
    • ..but kubelet has some unique requirements
      • like mounting volumes on the host
    • rkt “fly” feature (new in rkt 0.15.0)
      • unlike rkt run, doesn’t run in pod; uncontained
      • has full access to host mount (and pid..) namespace
  • rkt networking
  • future
  • summary
    • use rkt
    • use kubernetes
    • get involved and help define future of app containers

Q&A

  • does use of KVM mean non-linux hosts can be run inside?
    • currently no
  • image format for a registry?
    • we don’t have a centralized registry
    • we want to get away from that model
  • can rkt run docker images
    • yes
    • the current kubernetes api only accepts docker images so it’s the only thing it can run
  • what do I have to actually do to use rkt?
    • there’s a using rkt with kubernetes guide