- app stats
- 32M page views/month
- 3.2M uniques/month
- Between 5k and 12k requests per minute
- range 80 - 200 per second
- my dad went to MIT, worked for IBM
- always a computer in the house
- went to school for MIS
- began at the VA
- java/.NET
- MUMPS (Massachusetts General Hospital Utility Multi-Programming System)
- ACID NOSQL before it was cool
- language and DB together
- variables
- global
- namespaced
- 8 byte maximum for variable and method names
- first three characters represented the application
- leaving 5 characters to name your variable
- met my current co-founder, Alex
- Wufoo (8 years ago)
- 2 years after inception
- bugged Ryan to hire me
- dragged Alex along
- PHP/mysql
- no framework, just muscle
- SurveyMonkey
- Wore my shackles
- Learned Ops
- wrote Doula
- wrote local dev stack w/vagrant
- Chris Coyier wanted to retain ownership of code snippets
- Believed that current tools were lacking
- Inside SM for a year, working nights/weekends
- Told management about it after the code went live
- lucky they said it was non-competitive
- RCE - remote code execution
- native code execution
- Haml/Slim - raw ruby
- file-system access
- SCSS/SASS - data-url
- Docker
- container per service
- read-only root system
- web server for speed
- forked call to protect instance variables
- restricted number of file descriptors so you can't shell out
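Those lockdowns map roughly onto `docker run` flags like the following; the image name and the exact limits are assumptions, not the real configuration:

```shell
# read-only root filesystem, with scratch space on tmpfs only;
# a tiny file-descriptor budget makes shelling out much harder;
# cgroup caps on CPU and memory; no network inside the sandbox
docker run --rm \
  --read-only --tmpfs /tmp:rw,size=16m \
  --ulimit nofile=64 \
  --cpus=0.5 --memory=256m \
  --network=none \
  registry.internal/haml-preprocessor
```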
- preprocessors
- the round-trip is amazing.
- On each key press
- ajax contents of the editor to /preprocessors
- AWS load-balanced call to application
- nginx proxy-passed to internal load-balancer
- sent to docker container doing A/B load balancing for the app stack
- proxy passed logical router (software to determine which preprocessor to choose)
- http call to appropriate preprocessor
- docker container with A/B load balancer proxy pass
- docker container with preprocessing server
- preprocessing
- call returns
- ajax call to boomerang, code stored in redis
- call returns
- iframe preview of code refreshes, calling boomerang
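The "logical router" step in the flow above (software that picks which preprocessor gets the request) could be as small as a lookup like this; the internal hostnames are invented for illustration:

```shell
#!/bin/sh
# Sketch of a logical router: map the editor's language to the
# internal preprocessor service that should handle it.
route_preprocessor() {
  case "$1" in
    scss|sass) echo "http://preprocessors.internal/sass" ;;
    haml)      echo "http://preprocessors.internal/haml" ;;
    slim)      echo "http://preprocessors.internal/slim" ;;
    *)         echo "http://preprocessors.internal/passthrough" ;;
  esac
}

route_preprocessor scss   # -> http://preprocessors.internal/sass
```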
What does your infrastructure look like right now? (Web servers, databases, instances, load balancers, languages, etc..)
- languages
- node and ruby
- frameworks
- rails, sinatra (padrino), express
- web servers
- 4 m3.large: web apps
- 3 m3.large: preprocessor servers, containing docker containers
- 3 t2.small: specialized preprocessor servers
- data store
- 3 m3.xlarge sql boxes: master/slave/export
- 2 r3.large solr: master/slave
- 2 m3.large: redis master/slave
- misc
- 1 deployment box
- 1 sidekiq
- 1 t2.small gitlab
- 1 vpn
- 1 t1.micro: hit counter
- 1 monitoring box (icinga)
- 1 NAT
- more
- load balancers
- two
- yes, but for fault-tolerance instead of scale.
- if a web server goes out, another one is cycled in
- we push code bundles to s3
- ansible-pull playbooks run by cloud-init
- we have CPU rules, but we don't run them because of race conditions
- simply: we never switched over to a CDN-ish way
- earlier this year we made our code CDN-ready
- we've not pulled the trigger
What is the most painful part of your infrastructure? (ie: what requires the most attention or fails first?)
- every part... it depends on which time period
- when we started, all on one box
- big ball of mud, no monitoring
- everything failed
- moved web servers out
- deployment slowdowns
- move to faster disks (ramdisk)
- use turbosprockets
- bought bigger web boxes
- tuning phusion passenger to handle our CPU/memory constraints
- moved storage stack out
- mysql failed because the index cache size was exceeded (pens too large)
- moved git in via gitlab
- memory constraints broke deployment
- installed solr
- corpus on disk exceeded memory
- search calls were holding processes open too long
- bigger boxes
- timeouts
- throttles
- when we started, all on one box
- DNS in AWS stopped working
- All of our servers were addressed by hostname
- freaked out, watched death-spiral
- waited for DNS to return
- contingency: swap IPs for host names in code
- mega-contingency: multi-az, but waiting in RDS
- depends on what you're referring to
- code deployments:
- phusion passenger handles graceful application restart
- web-server changes (server class changes, for example)
- build new load balancer
- place new boxes behind new LB
- change the apex DNS entry
- data-loss, time sensitive ones
- add a VIP and switch it at the control plane
- several reasons, depending on server type
- preprocessors:
- security:
- sandboxed code, small footprint
- cgroup lockdown for cpu
- ease-of-use development
- 5 containers total for preprocessors
- all started with docker-compose up on localhost
- security:
- process isolation
- one version of ansible to push web app code
- another version of ansible to push preprocessors
- vagrant has slow startup time
- if you make a mistake, you're re-provisioning
- the union file system caches previous steps
- you can do one-off docker run commands with the -it flag
- if you make a mistake, you kill the container and start again
- the sharing model is nicer/more secure
- internal registry
- disk management - easy to create bloated containers
- understanding all the widgets:
- volumes
- linked containers
- caching
- networking
- deployment
- orchestration
- No, but greatly reduced it
- ansible for glue or branching logic
- Dockerfile for linear tasks
- I could never clearly explain to someone how Chef worked.
- chef solo vs server
- pull vs push model
- try explaining variable precedence to someone
- my business partner asked me to switch
- Server configuration, minimal
- glue code for deployments
- provisioning new servers
- ansible-pull via cloud-init
- We use capistrano to deploy the queueing code
- because it is simple
- ansible playbooks
- preprocessor deployment
- pulls new image from registry
- starts new container
- tests container
- if working, take down old container
- delete old container
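A shell equivalent of the preprocessor deploy steps above might look like this; the registry path, container names, port, and /health endpoint are all assumptions:

```shell
#!/bin/sh
set -e
IMAGE=registry.internal/sass-preprocessor:latest

docker pull "$IMAGE"                                  # pull new image from registry
docker run -d --name sass-next -p 8081:8080 "$IMAGE"  # start new container
sleep 2
curl -fsS http://localhost:8081/health                # test it; set -e aborts on failure
docker stop sass && docker rm sass                    # take down and delete the old one
docker rename sass-next sass                          # promote the new container
```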
- web-app deployment
- deployment server
- git pull
- bundle install with vendoring
- sprockets precompile
- zip it all
- push bundle to s3
- update an s3 bucket with the latest bundle ID
- web servers
- pull latest bundle from s3
- unzip it
- symlink to new version
- touch restart.txt
- delete old versions
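The web-server half of those steps can be sketched as a script; every path here is a stand-in, and a dummy bundle takes the place of the real s3 download:

```shell
#!/bin/sh
# Sketch: unpack the latest bundle into a timestamped release dir,
# flip a `current` symlink, and touch restart.txt so Passenger
# restarts gracefully. Paths are demo stand-ins.
set -e
APP=/tmp/app-demo
mkdir -p "$APP/releases"

# Stand-in for `aws s3 cp s3://bucket/latest/bundle.tar.gz .`:
BUNDLE=/tmp/bundle.tar.gz
mkdir -p /tmp/bundle-src/tmp
echo ok > /tmp/bundle-src/VERSION
tar czf "$BUNDLE" -C /tmp/bundle-src .

RELEASE="$APP/releases/$(date +%s)"
mkdir -p "$RELEASE"
tar xzf "$BUNDLE" -C "$RELEASE"
ln -sfn "$RELEASE" "$APP/current"       # near-atomic cutover to the new version
touch "$APP/current/tmp/restart.txt"    # Passenger picks this up and restarts
# delete old versions, keeping the 5 newest
ls -1dt "$APP"/releases/* | tail -n +6 | xargs -r rm -rf
```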
- deployment server
- simple caching (volatile)
- queuing
- persistent storage (non-volatile)
- cache of most commonly-used items
- counts
- pens can get huge
- 11 billion calls in last 153 days
- 1016 requests per second
- 2 gigs, with 3.7g peak
- hashes mostly
- sorted sets for timed operations
- screenshot after 3 minutes
- save may happen 20 times during that time
- zadd for popularity scores
- incr for counts
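A redis-cli sketch of those patterns; the key names are invented. Note that repeated ZADDs for the same member just rewrite its score, so twenty saves still leave a single queued screenshot job:

```shell
redis-cli INCR hits:pen:42                       # per-pen hit counts
redis-cli ZADD popular:pens 1830 pen:42          # popularity score

# timed operation: schedule a screenshot 3 minutes out;
# re-saving updates the due-time rather than duplicating the job
redis-cli ZADD screenshots:due "$(($(date +%s) + 180))" pen:42

# a worker claims everything whose due-time has passed
redis-cli ZRANGEBYSCORE screenshots:due 0 "$(date +%s)"
```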
- Yes, but we wish we did a better job of failover.
- an outage would require a manual cutover
- we want to get sentinel going
- Nope, we don't have that kinda volume