@davedash
Last active December 23, 2015 08:08
Deploying software at Pinterest

tl;dr Pinterest uses Github Enterprise, Jenkins CI, and Zookeeper to continually deliver the best experiences for Pinners.

One of my first tasks on the Pinterest ops team was to fix bugs in our deploy scripts. I used this as an opportunity to define release engineering at our company and create some solid release tools.

The Tools of the Trade

We deliver Pinterest with a variety of tools, most of which are readily accessible to most companies.

  • Github Enterprise is our version-control overlay. It manages code reviews, facilitates merging and, most importantly, has a great API. The API lets us interact with our repository in a very clean, programmatic way.
  • Jenkins is our continuous integration system; we use it to package builds and run unit tests after each check-in.
  • Zookeeper manages our state. It tells each node what version of code it should be running, reports what each node is actually running, and holds all of our service configuration.
  • Amazon S3 is where we keep our builds. It is a simple way to share data and scales with no intervention on our end.

All of these tools can be substituted and you'd still be able to achieve a system similar to Pinterest's. We chose these specific tools because we were familiar with them.

Additionally, we have our custom "deploy-tools" that glue everything together.

The build pipeline

In practice, Pinterest is a continuous delivery* shop.

[Figure: the build pipeline]

Here's our build pipeline from the inception of a change to being served in production.

  1. An engineer makes a git branch.
  2. She pushes it to her fork in Github Enterprise and submits a Pull Request.
  3. Jenkins runs automated tests against her pull request (for services in our main repository).
  4. After the request is approved it is merged (by the original engineer).
  5. The newly integrated master branch triggers a Jenkins job that runs the same automated tests as in step 3.
  6. Upon success of the aforementioned job, a build task creates a tarball of the files and pushes it into S3. We also branch this build in git as jenkins-stable. (A rough sketch of this step follows the list.)
  7. Some systems are automatically deployed to (e.g. an internal copy of the site).
  8. A deploy is then manually initiated. We usually choose the same build that is currently jenkins-stable, but we can choose anything that's available in S3.
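
That build step could look roughly like the sketch below. This is only an illustration of the idea, not our actual tooling: the bucket name, repository, and environment variables are hypothetical, and it assumes the boto3 and requests libraries plus a GitHub Enterprise API token.

    # Package the workspace, push the build to S3, and point the jenkins-stable
    # branch at the built commit via the GitHub Enterprise API.
    import os
    import tarfile

    import boto3
    import requests

    BUILD_SHA = os.environ["GIT_COMMIT"]                 # provided by the Jenkins git plugin
    BUILD_NAME = "webapp-%s.tar.gz" % BUILD_SHA[:10]

    # 1. Create the tarball of the checked-out files.
    with tarfile.open(BUILD_NAME, "w:gz") as tar:
        tar.add(".", arcname="webapp")

    # 2. Push the build to S3 so the deploy system can fetch it later.
    boto3.client("s3").upload_file(BUILD_NAME, "example-deploy-builds", BUILD_NAME)

    # 3. Move the jenkins-stable branch to this commit through the API,
    #    with no local git operations at all.
    requests.patch(
        "https://github.example.com/api/v3/repos/pinterest/webapp/git/refs/heads/jenkins-stable",
        json={"sha": BUILD_SHA, "force": True},
        headers={"Authorization": "token %s" % os.environ["GITHUB_TOKEN"]},
    ).raise_for_status()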

The Deploy

A build can be deployed to a set of canary hosts, or to our entire fleet. We record this state in Zookeeper.

"Canarying" to a few hosts gives us time to validate that everything is working as planned. For the most part this involves looking at charts, error logs, error aggregators, and clicking around on the site. All very manual.

On each node sits a fairly robust Zookeeper client called deployd. It listens to either enabled or enabled/canary depending on its role (which is also defined in Zookeeper). If either of those znodes changes, a few checks are performed (a sketch of the client follows this list):

  • The node checks that it's serving the correct build.
  • If it's not, it downloads the correct build and extracts it.
  • It atomically flips a symlink on the node to point at the new build.
  • It runs any post-install steps, like installing dependencies. For example, we use pip to manage our Python requirements, so we run something like pip install -r requirements.txt after unpacking a new build.
  • The service is restarted as gracefully as possible.
  • The node's current status is reported back to Zookeeper.
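
Put together, a deployd-style client could look something like the sketch below. This is a simplification under assumptions, not our actual daemon: it uses the kazoo Zookeeper client and boto3, and the znode paths, bucket name, directory layout, and restart command are all hypothetical.

    import os
    import subprocess
    import time

    import boto3
    from kazoo.client import KazooClient

    SERVICE = "webapp"
    ENABLED_PATH = "/deploy/%s/enabled" % SERVICE        # canary hosts would watch enabled/canary
    STATUS_PATH = "/deploy/%s/status/%s" % (SERVICE, os.uname()[1])

    zk = KazooClient(hosts="zookeeper.example.com:2181")
    zk.start()


    def current_build():
        """Return the build the active symlink points at, if any."""
        try:
            return os.path.basename(os.readlink("/srv/%s/current" % SERVICE))
        except OSError:
            return None


    @zk.DataWatch(ENABLED_PATH)
    def on_change(data, stat):
        wanted = data.decode() if data else None
        if not wanted or wanted == current_build():
            return                                       # already serving the correct build
        build_dir = "/srv/%s/builds/%s" % (SERVICE, wanted)
        os.makedirs(build_dir)
        # Download and unpack the desired build from S3.
        boto3.client("s3").download_file("example-deploy-builds",
                                         wanted + ".tar.gz", build_dir + ".tar.gz")
        subprocess.check_call(["tar", "-xzf", build_dir + ".tar.gz", "-C", build_dir])
        # Post-install steps, e.g. Python dependencies.
        subprocess.check_call(["pip", "install", "-r",
                               os.path.join(build_dir, "requirements.txt")])
        # Atomically flip the symlink, restart, and report status back to Zookeeper.
        os.symlink(build_dir, "/srv/%s/current.new" % SERVICE)
        os.rename("/srv/%s/current.new" % SERVICE, "/srv/%s/current" % SERVICE)
        subprocess.check_call(["service", SERVICE, "restart"])
        zk.ensure_path(STATUS_PATH)
        zk.set(STATUS_PATH, wanted.encode())


    while True:                                          # keep the daemon alive for the watch
        time.sleep(60)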

Graceful restarts are somewhat complicated, and we employ different strategies for different services. In the most advanced form, multiple copies of the same service run on a machine, and through iptables we can turn off traffic to a few instances while we restart them. For most services we define a restart concurrency that controls how many nodes will be restarted at any given time. In some cases we can restart almost all the nodes at once with no user impact.
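
One way to enforce a restart concurrency across a fleet is sketched below. This is an illustration rather than the mechanism we actually use: each node takes a lease from a Zookeeper semaphore (via kazoo) before restarting, so at most max_leases nodes restart at the same time. The znode path and restart command are hypothetical.

    import subprocess

    from kazoo.client import KazooClient

    zk = KazooClient(hosts="zookeeper.example.com:2181")
    zk.start()

    # max_leases is the per-service "restart concurrency" knob.
    with zk.Semaphore("/deploy/webapp/restart", max_leases=10):
        subprocess.check_call(["service", "webapp", "restart"])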

We can then monitor our deploy in our state-of-the-art deploy monitoring tool.

[Figure: the deploy monitoring tool]

I'd like to point out that I'm no expert at curses, but I had the help of Erik Rose's blessings module, which makes light work of the terminal.
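
As a small illustration of why blessings is pleasant to work with, the snippet below draws a couple of styled status lines without any raw curses calls. The status text itself is made up.

    from blessings import Terminal

    term = Terminal()
    print(term.bold("webapp deploy: build abc123"))
    # Write a status line pinned near the bottom of the terminal.
    with term.location(0, term.height - 2):
        print(term.green("42/100 hosts on new build") + "  " + term.yellow("3 restarting"))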

Design principles

Early on, my intern at the time, Owen, and I set up some design principles. We were both opinionated developers, so setting up some ground rules made sense.

Don't touch the deployed code directly

Initially our deploy scripts lived in the same repository as the code we were deploying. This was painful. I would update the deploy scripts, do a deploy and lose my changes because part of the deploy meant doing some git operations on the repository I was working in. Most of what we needed to do was simple tagging; sometimes we also needed to do compilation work.

Even after we moved our tools to their own repository, we still did repository manipulations. This was gross. If someone wanted to deploy, we could ruin their working copy because we decided to muck with it. Furthermore, it was slow. If we wanted to avoid issues, we'd have to do a clean git clone.

We moved all the git operations into Github API calls. Any pull request that removed a subprocess.call(['git', ...]) and replaced it with a call to the API was a winner in my book. As a result our deploys sped up and we had fewer restrictions on where we could deploy from.
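
The flavor of that swap looks something like the sketch below (repository, tag name, sha, and token are all placeholders): a pair of git invocations against a working copy becomes a single HTTP call to the GitHub Enterprise "create a reference" endpoint, with no working copy needed at all.

    import os
    import subprocess

    import requests

    # Before: git operations that require a local checkout.
    subprocess.call(["git", "tag", "deploy-2015-12-01", "abc123"])
    subprocess.call(["git", "push", "origin", "deploy-2015-12-01"])

    # After: one API call, no checkout.
    requests.post(
        "https://github.example.com/api/v3/repos/pinterest/webapp/git/refs",
        json={"ref": "refs/tags/deploy-2015-12-01", "sha": "abc123"},
        headers={"Authorization": "token %s" % os.environ["GITHUB_TOKEN"]},
    ).raise_for_status()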

Keep a consistent interface

For a while our release engineering team of two (me and my intern) handled the majority of deploys. Despite that, other engineers would occasionally still need to deploy. We wanted the experience to be consistent regardless of any underlying changes we made.

Eventually my intern left and our team gained Nuo Yan, who took over a lot of the day-to-day build-out of the tool. We started transitioning deploys from our team to service owners, so it was even more important that we kept that promise of consistency to these other teams.

Some services used a combination of SSH and git to do deploys, while others used the S3/Zookeeper-based approach we use now. When we switched a service over, we made sure the change was as unnoticeable as possible.

The difficult thing about building tools is teaching people how to use them. If you don't think ahead of time about how you want things to work, it can be a chore to keep teaching people: "Oh, now you do X instead of Y."

We also made services conform to our requirements. Every service follows roughly the same flow. Whether it's in our main repository or not, or whether it's Python or Java, is irrelevant to the tool. Everything operates exactly the same. This helps our team avoid context switching when debugging issues.

Design to make things easy for the deployer

Our plan has never been to keep things manual. At every step of the game we ask: how can we make the deploy easier for the deployer? After all, for some time it was just us doing the deploys. Doing 3-6 deploys a day was long and arduous. It did, however, force us to automate as much as we could.

We would write manual scripts that validated whether a deploy worked. We would then teach people how to run them, and then figure out a way for them to run automatically. Eventually each check would find its way into the main deploy command.

Our plan is to continue to automate validations and any other manual steps. Our eventual goal is to automate the decision making to the point where we can remove the human element altogether.

Crash early... Crash often

Our deploy daemon attempts to catch all exceptions inside eventlets and the main thread and exit as quickly as possible. Rather than write complicated code that tries to fix the state of things, we write code that crashes early. We rely on a tool called supervisor to restart our daemons. This usually has the side effect of fixing everything.
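
The pattern is roughly the one sketched below. This is a simplified illustration assuming eventlet and supervisor's autorestart; the worker function is hypothetical.

    import logging
    import sys

    import eventlet


    def watch_zookeeper():
        # Hypothetical worker: the real daemon would watch znodes here.
        while True:
            eventlet.sleep(1)


    def main():
        worker = eventlet.spawn(watch_zookeeper)
        try:
            worker.wait()      # re-raises anything the green thread raised
        except Exception:
            logging.exception("deployd crashed; exiting so supervisor restarts us")
            sys.exit(1)


    if __name__ == "__main__":
        main()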

The last thing we wanted was code stuck in a loop that needed to be killed by hand. Thanks to health checks that close this loop, we can tell very quickly when a node is never going to come up correctly and fix the bug then. But this is rarely necessary.

 

We probably have more design decisions than those listed here, but these are some of the main ones. We never really wrote them down before the fact, but we would discuss them, and they would definitely come out in code reviews. They don't need to be set in stone, but it helps to have some idea of what things you are willing to do and what things you'd like to avoid when building a tool.

Our biggest challenges

We had a few challenges that we had to push through. Some were necessary, and some taught us some valuable lessons.

Provisioning

We use puppet as part of our provisioning process. A lot of our classes were coupled in such a way that almost every box had an rsync'd git checkout of our code base. Going forward we did not want to do that anymore, so we spent a lot of time refactoring our puppet code, decoupling where we could and building a two-provider system for delivering our software:

  1. The classic way of rsync'ing a git repository.
  2. The modern way of using our deployd clients to pull the right version from S3.

To this day we have a lot of checks in place that determine whether we'll use rsync or let deployd do its thing. As our puppet code base matures we decouple this more and more, but we'll be happy when we have very clear manifest files that don't try to stomp on each other.

Configuration

Initially configuration was bundled with the tools. This was cumbersome to update, so we moved service configuration into a configuration file. To update it we'd have to change the file and run puppet. Some boxes might not get the file because of puppet failures, and some boxes would get updates before others.

Nuo rewrote the configuration to store everything in Zookeeper. This meant that if we had to change the number of hosts restarting at once from 100 to 200, we could do it instantly and not have to worry about whether puppet ran successfully.
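
Once a value lives in Zookeeper, that kind of change is essentially a one-liner. A sketch using the kazoo client, with a hypothetical znode path:

    from kazoo.client import KazooClient

    zk = KazooClient(hosts="zookeeper.example.com:2181")
    zk.start()
    # Every client watching this znode sees the new value immediately; no puppet run involved.
    zk.set("/deploy/webapp/config/restart_concurrency", b"200")   # was b"100"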

A design decision of our operations team in general is to have as few moving parts for a given system as possible. Taking puppet out of the equation made a lot of sense.

Python Packaging

Python packaging was only tangentially related to our deploy process. Once upon a time puppet used to run pip install -r requirements.txt every few hours. Unfortunately this was never right after a deploy, so we trained our engineers that if they wanted to use a new package, they would have to deploy a requirements.txt change first and then later deploy the code that used it.

This was horribly inefficient, so we later had the deploy tools handle Python package installation. We learned a lot about pip. The most important lesson was that pip did not play well with system packages. We also found an edge case where cached copies of files were sometimes bad. Many of these problems went away with pip 1.4, and we have since moved our entire fleet to that version (rather than a mix of pip versions from 0.8 to 1.3.x).

We were also building a lot of system tools that had their own dependencies, so we now have the option of using virtualenv for our Python services so they don't conflict with our tools.
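
The per-service isolation step could look roughly like this sketch (paths are hypothetical): the service gets its own interpreter and packages, separate from whatever the deploy tools themselves depend on.

    import os
    import subprocess

    build_dir = "/srv/webapp/builds/abc123"
    venv = os.path.join(build_dir, "venv")

    # Create an isolated environment inside the build and install the
    # service's requirements into it.
    subprocess.check_call(["virtualenv", venv])
    subprocess.check_call([os.path.join(venv, "bin", "pip"), "install",
                           "-r", os.path.join(build_dir, "requirements.txt")])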

New services

Pinterest engineering was moving to SOA fast, and new services were being created while we were still scrambling to fix bugs in our Zookeeper system and transition older services onto it.

One of our earlier new services forced us to document how to create a Pinterest service and integrate it into the deploy tool. This documentation made it very clear to us how fragile our frameworks were.

Later on, some of our services were written in Java. I naively assumed this would just work, but when it came time to integrate them I realized how many assumptions we had made about the code we'd deploy:

  • It would come from the same repository.
  • It would have the same monitoring tools.
  • It would always be Python.

This forced us to write service configuration. While it wasn't the code we wanted to write, it was the code that we needed. The service-level configuration made it easier for other teams to use our tools. Eventually the configuration made its way to Zookeeper, making the barrier to deployability even lower.
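
To make the idea concrete, a service entry might carry fields like the ones below. This is a purely hypothetical shape, not our actual schema; the point is that everything the deploy tools need to know about a service lives in one declarative place rather than being hard-coded in the tools.

    SERVICE_CONFIG = {
        "name": "adserver",
        "repo": "example/adserver",              # not necessarily the main repository
        "language": "java",                      # Python is no longer assumed
        "build_artifact": "adserver-{sha}.tar.gz",
        "restart_concurrency": 20,
        "health_check": "http://localhost:8080/health",
    }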

More recently we deployed a new service that lived in a different repository from all our other Python services. This forced us to revisit all the assumptions we had made about deploying Python.

Now we have a tool that's a bit more robust and useful for other teams.

Current Issues

Despite all our valiant efforts we have some remaining issues.

Jenkins is slow for our main repository, which houses most of our services. Which is to say, we have a lot of tests, and many of them are slow. While this doesn't affect our deploys, it does affect how fast we can respond to problems. Our strategy has been to find pieces of our code that can be factored into libraries; we move the unit tests for those pieces out with the libraries, which results in speedier builds. Often we can find slow tests through Jenkins' test profiling. We will also experiment with parallelizing builds across multiple executors.

Another issue is that some services perform poorly on their first few requests after a restart. We are continually investigating ways to warm up our services so our deploys can be more stealthy.

The other major issue is S3 and our build sizes. Our largest build is nearly 200MB and can take quite some time to download. A download can take 30 seconds when S3 is behaving, or sometimes several minutes. We've recently instrumented our tools, and will focus on cutting this number down.

Other directions

As I mentioned earlier, there were a lot of options we could have chosen. In fact, we never really discounted any of them. We went with Python because that's what the scripts were originally written in. If we started over, we might consider something like Go.

For downloads we might look to a tool like Herd which uses BitTorrent to distribute code rather than S3.

We might even decide that changing machines in place isn't serving us well, and that adding baked AMIs to our deploy infrastructure is what we want instead.

We have a lot of options for these tools going forward. We'll continue to collect data, optimize and repeat.

Final thoughts

While some of this process suffers from NIH, existing deploy tools didn't seem to meet our needs. We'd love to one day pull the Pinterest-specific bits out of our tools and open source them. In the meantime, if you have questions, feel free to contact us.

Of course, if you like solving problems like these, join our team that helps give Pinners the best experiences as quickly as possible.

* Continuous delivery means that your master branch of code is always deployable.

@akondrahman

Wow... great article! I would like to know more, so I'm asking a few more questions if I may:

  1. Do you monitor the performance of each deploy?
  2. Do you use feature flags and dark launches as part of your deployment process?
  3. Does Pinterest use code review?
  4. What types of testing are done, e.g. unit testing, functional testing, integration testing, A/B testing?
  5. How do you get feedback from your customers about the software changes?
  6. Does the software team use tools like IRC/IM bots?
