- app stats
- 32M page views/month
- 3.2M uniques/month
- Between 5k and 12k requests per minute
- range 80 - 200 per second
- my dad went to MIT, worked for IBM
- always a computer in the house
- went to school for MIS
- began at the VA
- java/.NET
- MUMPS (Massachusetts General Hospital Utility Multi-Programming System)
- ACID NOSQL before it was cool
- language and DB together
- variables
- global
- namespaced
- 8 byte maximum for variable and method names
- first three characters represented the application
- leaving 5 characters to name your variable
- met my current co-founder, Alex
- Wufoo (8 years ago)
- 2 years after inception
- bugged Ryan to hire me
- dragged Alex along
- PHP/mysql
- no framework, just muscle
- SurveyMonkey
- Wore my shackles
- Learned Ops
- wrote Doula
- wrote local dev stack w/vagrant
- Chris Coyier wanted to retain ownership of code snippets
- Believed that current tools were lacking
- Inside SM for a year, working nights/weekends
- Told management about it after the code went live
- lucky they said it was non-competitive
- RCE - remote code execution
- native code execution
- Haml/Slim - raw ruby
- file-system access
- SCSS/SASS - data-url
- Docker
- container per service
- read-only root system
- web server for speed
- forked call to protect instance variables
- restricted number of file descriptors so you can't shell out
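Those lockdowns map roughly onto `docker run` flags like the following; the image name and the exact limits are assumptions, not the real configuration:

```shell
# read-only root filesystem, with scratch space on tmpfs only;
# a tiny file-descriptor budget makes shelling out much harder;
# cgroup caps on CPU and memory; no network inside the sandbox
docker run --rm \
  --read-only --tmpfs /tmp:rw,size=16m \
  --ulimit nofile=64 \
  --cpus=0.5 --memory=256m \
  --network=none \
  registry.internal/haml-preprocessor
```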
- preprocessors
- the round-trip is amazing.
- On each key press
- ajax contents of the editor to /preprocessors
- AWS load-balanced call to application
- nginx proxy-passed to internal load-balancer
- sent to docker container doing A/B load balancing for the app stack
- proxy passed logical router (software to determine which preprocessor to choose)
- http call to appropriate preprocessor
- docker container with A/B load balancer proxy pass
- docker container with preprocessing server
- preprocessing
- call returns
- ajax call to boomerang, code stored in redis
- call returns
- iframe preview of code refreshes, calling boomerang
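The "logical router" step in the flow above (software that picks which preprocessor gets the request) could be as small as a lookup like this; the internal hostnames are invented for illustration:

```shell
#!/bin/sh
# Sketch of a logical router: map the editor's language to the
# internal preprocessor service that should handle it.
route_preprocessor() {
  case "$1" in
    scss|sass) echo "http://preprocessors.internal/sass" ;;
    haml)      echo "http://preprocessors.internal/haml" ;;
    slim)      echo "http://preprocessors.internal/slim" ;;
    *)         echo "http://preprocessors.internal/passthrough" ;;
  esac
}

route_preprocessor scss   # -> http://preprocessors.internal/sass
```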
What does your infrastructure look like right now? (Web servers, databases, instances, load balancers, languages, etc..)
- languages
- node and ruby
- frameworks
- rails, sinatra (padrino), express
- web servers
- 4 m3.large: web apps
- 3 m3.large: preprocessor servers, containing docker containers
- 3 t2.small: specialized preprocessor servers
- data store
- 3 m3.xlarge sql boxes: master/slave/export
- 2 r3.large solr: master/slave
- 2 m3.large: redis master/slave
- misc
- 1 deployment box
- 1 sidekiq
- 1 t2.small gitlab
- 1 vpn
- 1 t1.micro: hit counter
- 1 monitoring box (icinga)
- 1 NAT
- more
- load balancers
- two
- yes, but for fault-tolerance instead of scale.
- if a web server goes out, another one is cycled in
- we push code bundles to s3
- ansible-pull playbooks run by cloud-init
- we have CPU rules, but we don't run them because of race conditions
- simply: we never switched over to a CDN-ish way
- earlier this year we made our code CDN-ready
- we've not pulled the trigger
What is the most painful part of your infrastructure? (ie: what requires the most attention or fails first?)
- every part... it depends on which time period
- when we started, all on one box
- big ball of mud, no monitoring
- everything failed
- moved web servers out
- deployment slowdowns
- move to faster disks (ramdisk)
- use turbosprockets
- bought bigger web boxes
- tuning phusion passenger to handle our CPU/memory constraints
- moved storage stack out
- mysql failed because the index cache size was exceeded (pens too large)
- moved git in via gitlab
- memory constraints broke deployment
- installed solr
- corpus on disk exceeded memory
- search calls were holding processes open too long
- bigger boxes
- timeouts
- throttles
- when we started, all on one box
- DNS in AWS stopped working
- All of our servers were addressed by hostname
- freaked out, watched death-spiral
- waited for DNS to return
- contingency: swap IPs for host names in code
- mega-contingency: multi-az, but waiting in RDS
- depends on what you're referring to
- code deployments:
- phusion passenger handles graceful application restart
- web-server changes (server class changes, for example)
- build new load balancer
- place new boxes behind new LB
- change the apex DNS entry
- data-loss, time sensitive ones
- add a VIP and switch it at the control plane
- several reasons, depending on server type
- preprocessors:
- security:
- sandboxed code, small footprint
- cgroup lockdown for cpu
- ease-of-use development
- 5 containers total for preprocessors
- all started with docker-compose up on localhost
- security:
- process isolation
- one version of ansible to push web app code
- another version of ansible to push preprocessors
- vagrant has slow startup time
- if you make a mistake, you're re-provisioning
- the union file system caches previous steps
- you can do one-off docker run commands with the -it flag
- if you make a mistake, you kill the container and start again
- the sharing model is nicer/more secure
- internal registry
- disk management - easy to create bloated containers
- understanding all the widgets:
- volumes
- linked containers
- caching
- networking
- deployment
- orchestration
- No, but greatly reduced it
- ansible for glue or branching logic
- Dockerfile for linear tasks
- I could never clearly explain to someone how Chef worked.
- chef solo vs server
- pull vs push model
- try explaining variable precedence to someone
- my business partner asked me to switch
- Server configuration, minimal
- glue code for deployments
- provisioning new servers
- ansible-pull via cloud-init
- We use capistrano to deploy the queueing code
- because it is simple
- ansible playbooks
- preprocessor deployment
- pulls new image from registry
- starts new container
- tests container
- if working, take down old container
- delete old container
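A shell equivalent of the preprocessor deploy steps above might look like this; the registry path, container names, port, and /health endpoint are all assumptions:

```shell
#!/bin/sh
set -e
IMAGE=registry.internal/sass-preprocessor:latest

docker pull "$IMAGE"                                  # pull new image from registry
docker run -d --name sass-next -p 8081:8080 "$IMAGE"  # start new container
sleep 2
curl -fsS http://localhost:8081/health                # test it; set -e aborts on failure
docker stop sass && docker rm sass                    # take down and delete the old one
docker rename sass-next sass                          # promote the new container
```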
- web-app deployment
- deployment server
- git pull
- bundle install with vendoring
- sprockets precompile
- zip it all
- push bundle to s3
- update an s3 bucket with the latest bundle ID
- web servers
- pull latest bundle from s3
- unzip it
- symlink to new version
- touch restart.txt
- delete old versions
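The web-server half of those steps can be sketched as a script; every path here is a stand-in, and a dummy bundle takes the place of the real s3 download:

```shell
#!/bin/sh
# Sketch: unpack the latest bundle into a timestamped release dir,
# flip a `current` symlink, and touch restart.txt so Passenger
# restarts gracefully. Paths are demo stand-ins.
set -e
APP=/tmp/app-demo
mkdir -p "$APP/releases"

# Stand-in for `aws s3 cp s3://bucket/latest/bundle.tar.gz .`:
BUNDLE=/tmp/bundle.tar.gz
mkdir -p /tmp/bundle-src/tmp
echo ok > /tmp/bundle-src/VERSION
tar czf "$BUNDLE" -C /tmp/bundle-src .

RELEASE="$APP/releases/$(date +%s)"
mkdir -p "$RELEASE"
tar xzf "$BUNDLE" -C "$RELEASE"
ln -sfn "$RELEASE" "$APP/current"       # near-atomic cutover to the new version
touch "$APP/current/tmp/restart.txt"    # Passenger picks this up and restarts
# delete old versions, keeping the 5 newest
ls -1dt "$APP"/releases/* | tail -n +6 | xargs -r rm -rf
```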
- deployment server
- simple caching (volatile)
- queuing
- persistent storage (non-volatile)
- cache of most commonly-used items
- counts
- pens can get huge
- 11 billion calls in last 153 days
- 1016 requests per second
- 2 gigs, with 3.7g peak
- hashes mostly
- sorted sets for timed operations
- screenshot after 3 minutes
- save may happen 20 times during that time
- zadd for popularity scores
- incr for counts
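A redis-cli sketch of those patterns; the key names are invented. Note that repeated ZADDs for the same member just rewrite its score, so twenty saves still leave a single queued screenshot job:

```shell
redis-cli INCR hits:pen:42                       # per-pen hit counts
redis-cli ZADD popular:pens 1830 pen:42          # popularity score

# timed operation: schedule a screenshot 3 minutes out;
# re-saving updates the due-time rather than duplicating the job
redis-cli ZADD screenshots:due "$(($(date +%s) + 180))" pen:42

# a worker claims everything whose due-time has passed
redis-cli ZRANGEBYSCORE screenshots:due 0 "$(date +%s)"
```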
- Yes, but we wish we did a better job of failover.
- an outage would require a manual cutover
- we want to get sentinel going
- Nope, we don't have that kinda volume