@tsabat
Last active August 29, 2015 14:23

CodePen stats

  • app stats
  • 32M page views/month
  • 3.2M uniques/month
  • Between 5k and 12k requests per minute
  • range 80 - 200 per second

Can you give us a brief overview of your work/projects background?

  • my dad went to MIT, worked for IBM
    • always a computer in the house
  • went to school for MIS
  • began at the VA
    • java/.NET
    • MUMPS (Massachusetts General Hospital Utility Multi-Programming System)
      • ACID NoSQL before it was cool
      • language and DB together
      • variables
        • global
        • namespaced
        • 8-byte maximum for variable and method names
        • first three characters represented the application
        • leaving 5 characters to name your variable
    • met my current co-founder, Alex
  • Wufoo (8 years ago)
    • 2 years after inception
    • bugged Ryan to hire me
    • dragged Alex along
    • PHP/mysql
      • no framework, just muscle
  • SurveyMonkey
    • Wore my shackles
    • Learned Ops
    • wrote Doula
    • wrote local dev stack w/vagrant

How did you get started with CodePen?

  • Chris Coyier wanted to retain ownership of code snippets
  • Believed that current tools were lacking
  • Inside SM for a year, working nights/weekends
  • Told management about it after the code went live
  • lucky they said it was non-competitive

CodePen

What kinds of security issues have you run into by letting people code in your sandboxes?

  • RCE - remote code execution
  • native code execution
    • Haml/Slim - raw ruby
  • file-system access
    • SCSS/SASS - data-url

How do you run these sandboxes?

  • Docker
    • container per service
    • read-only root system
    • web server for speed
    • forked call to protect instance variables
    • restricted number of file descriptors so you can't shell out
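
The "forked call" idea above can be sketched in Ruby. This is a rough illustration of the pattern, not CodePen's actual code; `upcase` stands in for a real Haml/Slim render:

```ruby
# Render untrusted input in a child process so any state it mutates
# (globals, instance variables) dies with the child instead of
# polluting the long-lived server process.
def render_in_fork(source)
  reader, writer = IO.pipe
  pid = fork do
    reader.close
    # A real preprocessor call (Haml/Slim render, etc.) would go here;
    # `upcase` is just a stand-in transformation.
    writer.write(source.upcase)
    writer.close
    exit!(0)
  end
  writer.close
  output = reader.read   # read until the child closes its end
  Process.wait(pid)
  output
end
```

The parent only ever sees bytes coming back over the pipe, which is what protects its own state from whatever the rendering code does.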

What's the most resource-intensive part of your app?

  • preprocessors
  • the round-trip is amazing.
  • On each key press
    • ajax contents of the editor to /preprocessors
    • AWS load-balanced call to application
    • nginx proxy-passed to internal load-balancer
    • sent to docker container doing A/B load balancing for the app stack
    • proxy passed logical router (software to determine which preprocessor to choose)
    • http call to appropriate preprocessor
    • docker container with A/B load balancer proxy pass
    • docker container with preprocessing server
    • preprocessing
    • call returns
    • ajax call to boomerang, code stored in redis
    • call returns
    • iframe preview of code refreshes, calling boomerang
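
The "logical router" step in that chain might look roughly like this in Ruby — the endpoint URLs and syntax names are hypothetical, not CodePen's real ones:

```ruby
# Map the editor's declared syntax to an internal preprocessor
# endpoint; the real router would then proxy the HTTP call onward.
PREPROCESSOR_ROUTES = {
  "scss"   => "http://preprocessors.internal/scss",
  "haml"   => "http://preprocessors.internal/haml",
  "coffee" => "http://preprocessors.internal/coffee"
}.freeze

def route_for(syntax)
  PREPROCESSOR_ROUTES.fetch(syntax.downcase) do
    raise ArgumentError, "no preprocessor for #{syntax}"
  end
end
```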

AWS

What does your infrastructure look like right now? (Web servers, databases, instances, load balancers, languages, etc..)

  • languages
    • node and ruby
  • frameworks
    • rails, sinatra (padrino), express
  • web servers
    • 4 m3.large: web apps
    • 3 m3.large: preprocessor servers, containing docker containers
    • 3 t2.small: specialized preprocessor servers
  • data store
    • 3 m3.xlarge sql boxes: master/slave/export
    • 2 r3.large solr: master/slave
    • 2 m3.large: redis master/slave
  • misc
    • 1 deployment box
    • 1 sidekiq
    • 1 t2.small gitlab
    • 1 vpn
    • 1 m1.micro: hit counter
    • 1 monitoring box (icinga)
    • 1 NAT
    • more
  • load balancers
    • two

Are you using an autoscaling environment? If yes, how does that work?

  • yes, but for fault-tolerance instead of scale.
    • if a web server goes out, another one is cycled in
    • we push code bundles to s3
    • ansible-pull playbooks run by cloud-init
  • we have CPU rules, but we don't run them because of race conditions
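
The ansible-pull-on-boot pattern can be sketched as cloud-init user data. The repo URL and playbook name below are hypothetical:

```yaml
#cloud-config
# On first boot, a replacement instance pulls its playbook and
# configures itself -- no push step from a central server needed.
package_update: true
packages:
  - ansible
runcmd:
  - ansible-pull -U https://example.com/infra-playbooks.git web.yml
```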

It looks like you guys are hosting assets on a subdomain. Any reason for this over S3?

  • simply: we never switched over to a CDN-ish way
  • earlier this year we made our code CDN-ready
  • we've not pulled the trigger

What is the most painful part of your infrastructure? (ie: what requires the most attention or fails first?)

  • every part... it depends on which time period
    • when we started, all on one box
      • big ball of mud, no monitoring
      • everything failed
    • moved web servers out
    • deployment slowdowns
      • move to faster disks (ramdisk)
      • use turbosprockets
    • bought bigger web boxes
      • tuning phusion passenger to handle our CPU/memory constraints
    • moved storage stack out
      • mysql failed because the index cache size was exceeded (pens too large)
    • moved git in via gitlab
      • memory constraints broke deployment
    • installed solr
      • corpus on disk exceeded memory
      • search calls were holding processes open too long
        • bigger boxes
        • timeouts
        • throttles

You had an outage back in March of this year, can you tell us what happened?

  • DNS in AWS stopped working
  • all of our servers are addressed by DNS name
  • freaked out, watched death-spiral
  • waited for DNS to return
  • contingency: swap IPs for host names in code
  • mega-contingency: multi-AZ, but waiting on RDS

How do you change servers without downtime?

  • depends on what you're referring to
  • code deployments:
    • phusion passenger handles graceful application restart
  • web-server changes (server class changes, for example)
    • build new load balancer
    • place new boxes behind new LB
    • change the apex DNS entry
  • data-loss, time sensitive ones
    • add a VIP and switch it at the control plane

Docker

Are you using Docker in production now? If yes, why?

  • several reasons, depending on server type
  • preprocessors:
    • security:
      • sandboxed code, small footprint
      • cgroup lockdown for cpu
    • ease-of-use development
      • 5 containers total for preprocessors
      • all started with docker-compose up on localhost
  • process isolation
    • one version of ansible to push web app code
    • another version of ansible to push preprocessors
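
The local dev experience above might look something like this compose file — service and image names are made up for illustration:

```yaml
# docker-compose.yml sketch; `docker-compose up` brings the whole
# preprocessor stack up on localhost.
services:
  router:
    image: example/logical-router      # hypothetical image name
    ports:
      - "8000:8000"
  ruby-preprocessors:
    image: example/preprocessor-ruby
    read_only: true                    # read-only root filesystem
  node-preprocessors:
    image: example/preprocessor-node
    read_only: true
```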

How is it different than Vagrant?

  • vagrant has slow startup time
    • if you make a mistake, you're re-provisioning
  • docker's union file system caches previous steps
    • you can do one-off docker run commands with the -it flag
    • if you make a mistake, you kill the container and start again
  • the sharing model is nicer/more secure
    • internal registry

What are some challenges with Docker?

  • disk management - easy to create bloated containers
  • understanding all the widgets:
    • volumes
    • linked containers
    • caching
    • networking
    • deployment
    • orchestration

Ansible

Has Docker completely replaced your need for Ansible or Chef?

  • No, but greatly reduced it
    • ansible for glue or branching logic
    • Dockerfile for linear tasks

You used Chef quite a bit in the past, why the switch to Ansible?

  • I could never clearly explain to someone how Chef worked.
  • chef solo vs server
  • pull vs push model
  • try explaining variable precedence to someone
  • my business partner asked me to switch

What do (or did) you use Ansible and Chef for?

  • Server configuration, minimal
  • glue code for deployments
  • provisioning new servers
    • ansible-pull via cloud-init

Capistrano

Are you still using Capistrano?

  • We use capistrano to deploy the queueing code
  • because it is simple

What did you replace it with?

ansible playbooks

  • preprocessor deployment
    • pulls new image from registry
    • starts new container
    • tests container
    • if working, take down old container
    • delete old container
  • web-app deployment
    • deployment server
      • git pull
      • bundle install with vendoring
      • sprockets precompile
      • zip it all
      • push bundle to s3
      • update a s3 bucket with latest bundle ID
    • web servers
      • pull latest bundle from s3
      • unzip it
      • symlink to new version
      • touch restart.txt
      • delete old versions
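
The web-server side of those steps can be sketched in Ruby (paths and bundle IDs hypothetical); the real playbook pulls and unzips a bundle from S3 instead of writing a placeholder file:

```ruby
require "fileutils"

def deploy_bundle(bundle_id)
  release = File.join("releases", bundle_id)
  FileUtils.mkdir_p(release)
  # Stand-in for: pull latest bundle from S3, unzip into the release dir.
  File.write(File.join(release, "app.txt"), "precompiled app")
  # Repoint the live symlink at the new release (the cutover step).
  FileUtils.ln_sf(File.expand_path(release), "current")
  # Phusion Passenger gracefully restarts the app when tmp/restart.txt
  # is touched, which is what makes the deploy zero-downtime.
  FileUtils.mkdir_p("current/tmp")
  FileUtils.touch("current/tmp/restart.txt")
end
```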

Redis

How are you using Redis?

  • simple caching (volatile)
  • queuing
  • persistent storage (non-volatile)

What kind of data are you storing? How much?

  • cache of most commonly-used items
  • counts
  • pens can get huge
  • 11 billion calls in last 153 days
  • 1016 requests per second
  • 2 gigs, with 3.7g peak

What kinds of data types are you using? (Hashes, lists, sorted sets..?) How are you using them?

  • hashes mostly
  • sorted sets for timed operations
    • screenshot after 3 minutes
    • save may happen 20 times during that time
  • zadd for popularity scores
  • incr for counts
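
The screenshot debounce above maps naturally onto a sorted set. A rough sketch of the logic, with the Redis commands noted in comments and a Hash standing in so it runs without a Redis server:

```ruby
# pen_id => run_at; in Redis: ZADD screenshot_queue run_at pen_id
SCREENSHOT_QUEUE = {}

def schedule_screenshot(pen_id, now)
  # Every save just overwrites the score, so 20 saves in a burst
  # collapse into one screenshot job 3 minutes after the last save.
  SCREENSHOT_QUEUE[pen_id] = now + 180
end

def due_screenshots(now)
  # Equivalent of: ZRANGEBYSCORE screenshot_queue -inf <now>
  SCREENSHOT_QUEUE.select { |_, run_at| run_at <= now }.keys
end
```

A worker polling `due_screenshots` only ever sees one entry per pen, no matter how many saves happened in the window.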

Are you using replication with Master-Slave?

  • Yes, but we wish we did a better job of failover.
  • an outage would require a manual cutover
  • we want to get sentinel going

Are you clustering Redis?

  • Nope, we don't have that kinda volume