rohansingh/rollout.md

## rollout.md

      
    rollout command

The helios rollout command rolls out a new version of a job to a set of hosts. This involves:


Determining the list and sequence of hosts to deploy to.


Undeploying any prior versions of the same job from a host before deploying the new version.


Deploying the new version of the job to each host sequentially, and waiting for the job to reach a RUNNING state before continuing.


Rolling back to the prior version(s) of the job in case an error occurs in the deployment.


Why?

This type of rollout of a new job is what users generally do after they have a new image built. We already have two different Python scripts internally that do this kind of thing (helios-helper and spheliosdeploy). This is definitely something that users want.
Legitimizing "rollout" as a first-class Helios operation gives us benefits over existing scripts:


Reduced confusion over what is part of Helios and what is part of the existing deployment scripts. This has proven to be a problem internally, and caused issues for teams trying to troubleshoot their deployment pipelines.


Ability to test the rollout mechanism as part of Helios's test suite. This is in stark contrast to the existing scripts, which have no integration test coverage.


More robust rollouts, which survive failure of the client and can guarantee completion or rollback as long as ZooKeeper is up. With existing scripts, the client or build machine failing generally means the deployment and rollback are aborted.


A central point for managing rollout strategies and behavior, and for collecting metrics on rollouts. We can iterate on rollouts in this one place, with any improvements becoming available to all Helios users. For example, we could add new rollout strategies or options over time.


How?

First, the user creates a new job version with the Helios CLI. For example, myservice:0.2. Next, they create a rollout that specifies the new job and a host filter regex. Here's what this might look like:
helios rollout myservice:0.2 '.*-myservice-.*'

In the example above, the goal of the rollout would be to deploy myservice:0.2 to all Helios agents whose hostnames match the regex: .*-myservice-.*
In the future, this could be augmented with additional host selectors (agent tags, Puppet, etc.) or rollout strategies.
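As a rough illustration of the host selection step, here is a minimal Python sketch. The agent hostnames are made up, and two details are assumptions rather than decided behavior: the regex must match the whole hostname, and the resulting hosts are sorted to get a stable sequence.

```python
import re


def select_hosts(agents, host_filter):
    """Return the agent hostnames that fully match the filter regex, sorted."""
    pattern = re.compile(host_filter)
    # Assumption: the whole hostname must match, and hosts are ordered by name.
    return sorted(host for host in agents if pattern.fullmatch(host))


# Hypothetical agent hostnames, for illustration only.
agents = ["lon1-myservice-a2", "lon1-otherservice-a1", "lon1-myservice-a1"]
selected = select_hosts(agents, r".*-myservice-.*")
```

Here `selected` contains only the two `myservice` agents, in name order.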
Creating the rollout

When a master receives a rollout request, it determines the list of applicable hosts and creates a rollout configuration in ZooKeeper. This rollout configuration has these initial fields:


Rollout operation ID

A UUID that is used to identify all tasks that are created by this rollout.


Rollback operation ID

A UUID that is used to identify all tasks that are created by a rollback of this rollout.


New job ID

The ID of the new job to roll out.


Rollout sequence

The exact sequence of hosts that the new job will be deployed to.


Rollout index

An atomic long used to track which host is currently being rolled out to.


Rollout status

Initially set to CREATED. Possible values include ROLLING_OUT, ROLLING_BACK, DONE, and FAILED.
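The fields above could be modeled roughly as follows. This is only a sketch: the class and attribute names are hypothetical, and it says nothing about how the configuration would actually be serialized into ZooKeeper.

```python
import uuid
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class RolloutStatus(Enum):
    CREATED = "CREATED"
    ROLLING_OUT = "ROLLING_OUT"
    ROLLING_BACK = "ROLLING_BACK"
    DONE = "DONE"
    FAILED = "FAILED"


def _new_operation_id():
    # Operation IDs are UUIDs, generated once when the rollout is created.
    return str(uuid.uuid4())


@dataclass
class RolloutConfig:
    new_job_id: str                      # the ID of the new job to roll out
    rollout_sequence: List[str]          # exact order of hosts to deploy to
    rollout_operation_id: str = field(default_factory=_new_operation_id)
    rollback_operation_id: str = field(default_factory=_new_operation_id)
    rollout_index: int = 0               # which host is currently being rolled out to
    status: RolloutStatus = RolloutStatus.CREATED


config = RolloutConfig("myservice:0.2", ["host-a", "host-b"])
```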


Processing the rollout

Each master watches the rollout node in ZooKeeper for rollouts. For any rollouts that are CREATED, ROLLING_OUT, or ROLLING_BACK, each master attempts to create the appropriate deploy or undeploy task for the current host as determined by the current rollout index. In the same transaction, the master increments the rollout index (or decrements it in case of a rollback).
When creating deploy and undeploy tasks for the rollout, every master uses the same predetermined rollout or rollback operation ID. This ensures that only one master will "win" and actually create the task and increment/decrement the rollout index.
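To illustrate why a shared operation ID lets only one master win, here is a small in-memory sketch. `TaskStore` stands in for a store with create-if-absent semantics, and the task path layout is purely hypothetical. The index update that would happen in the same transaction is left out.

```python
class TaskStore:
    """In-memory stand-in for a store where creating an existing node fails."""

    def __init__(self):
        self.nodes = {}

    def create(self, path, data):
        if path in self.nodes:
            return False  # another master already created this task
        self.nodes[path] = data
        return True


def try_create_deploy_task(store, operation_id, index, host, master):
    # Every master derives the same path from the shared operation ID and
    # the current index, so at most one create can succeed per step.
    path = f"/tasks/{operation_id}/{index}"
    return store.create(path, {"host": host, "created_by": master})


store = TaskStore()
results = [
    try_create_deploy_task(store, "op-123", 0, "host-a", master)
    for master in ("master-1", "master-2", "master-3")
]
```

Only the first call succeeds. The other masters see that the task already exists and move on.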
If the rollout status is currently CREATED, the master updates it to ROLLING_OUT as part of the same transaction.
If the rollout status is currently ROLLING_OUT, the master waits for the job on the previous host to reach RUNNING before creating any tasks. If this doesn't happen within a reasonable timeout, the master updates the rollout status to ROLLING_BACK and decrements the rollout index.
Once the last host is deployed or rolled back to, the master sets the rollout status to DONE or FAILED, respectively.
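The status transitions described in this section can be sketched as a single step function. This is one interpretation, not a settled design: it treats the rollout index as pointing at the next host to deploy to, treats a negative index during rollback as "everything has been rolled back", and the action names are hypothetical. Transactions and competition between masters are ignored here.

```python
def step(rollout, prev_host_state=None):
    """Advance the rollout by one step and return (action, host).

    prev_host_state describes the job on the most recently deployed host:
    "RUNNING", "TIMED_OUT", or anything else for "still starting".
    """
    seq = rollout["sequence"]
    status = rollout["status"]

    if status == "CREATED":
        if not seq:
            rollout["status"] = "DONE"
            return ("finish", None)
        rollout["status"] = "ROLLING_OUT"
        rollout["index"] = 1
        return ("deploy", seq[0])

    if status == "ROLLING_OUT":
        prev_host = seq[rollout["index"] - 1]
        if prev_host_state == "TIMED_OUT":
            # Step the index back so it points at the host that failed.
            rollout["status"] = "ROLLING_BACK"
            rollout["index"] -= 1
            return ("begin_rollback", prev_host)
        if prev_host_state != "RUNNING":
            return ("wait", prev_host)
        if rollout["index"] == len(seq):
            rollout["status"] = "DONE"
            return ("finish", None)
        host = seq[rollout["index"]]
        rollout["index"] += 1
        return ("deploy", host)

    if status == "ROLLING_BACK":
        if rollout["index"] < 0:
            rollout["status"] = "FAILED"
            return ("finish", None)
        host = seq[rollout["index"]]
        rollout["index"] -= 1
        return ("rollback", host)

    return ("noop", None)  # DONE and FAILED are terminal


rollout = {"sequence": ["host-a", "host-b", "host-c"], "index": 0, "status": "CREATED"}
actions = [
    step(rollout),               # deploy to host-a
    step(rollout, "RUNNING"),    # host-a is up, deploy to host-b
    step(rollout, "TIMED_OUT"),  # host-b never reached RUNNING
    step(rollout),               # roll back host-b
    step(rollout),               # roll back host-a
    step(rollout),               # nothing left to roll back
]
```

In this run the job on host-b times out, so host-b and host-a are rolled back in reverse order and the rollout ends as FAILED.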
Deploy and undeploy tasks

Rolling out to each host consists of creating an undeploy task for any previous versions of the job being rolled out, and creating a deploy task for the new version. When doing this, the master must also record in ZooKeeper the job ID of the version it undeployed, if any.
When rolling back, the master creates an undeploy task for the new job and a deploy task for the previously-deployed version that was recorded earlier.
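A minimal sketch of the per-host task planning, under simplifying assumptions: at most one previous version per host, plain dicts standing in for ZooKeeper state, and hypothetical function names and task tuples.

```python
def roll_out_host(deployed, previous, host, new_job):
    """Plan the tasks for rolling out to one host, recording what was undeployed."""
    tasks = []
    old_job = deployed.get(host)
    if old_job is not None:
        tasks.append(("undeploy", host, old_job))
    previous[host] = old_job  # remembered so a rollback can restore it
    tasks.append(("deploy", host, new_job))
    deployed[host] = new_job
    return tasks


def roll_back_host(deployed, previous, host, new_job):
    """Plan the tasks for rolling back one host to its recorded version."""
    tasks = [("undeploy", host, new_job)]
    old_job = previous.get(host)
    if old_job is not None:
        tasks.append(("deploy", host, old_job))
        deployed[host] = old_job
    else:
        deployed.pop(host, None)  # nothing was running here before the rollout
    return tasks


deployed = {"host-a": "myservice:0.1"}  # what each host currently runs
previous = {}                           # versions undeployed by this rollout
out_tasks = roll_out_host(deployed, previous, "host-a", "myservice:0.2")
back_tasks = roll_back_host(deployed, previous, "host-a", "myservice:0.2")
```

After the rollback, host-a is planned to run myservice:0.1 again, because that job ID was recorded during the rollout.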