Team 💙 Terraform Update (July 2016)

Hello from 💙! I wanted to give you all an update on one of the things we are working on and talk a bit about how our stuff works.

One of the main responsibilities of Team Blue is managing our build infrastructure(s). This means keeping our workers and build instances running on GCE, AWS, and MacStadium.

What is a ☁️

The way that "cloud" providers like GCE and AWS work is that you get virtualized hardware that you need to manage yourself. You press a button, you get a (Linux) server. You can SSH into it. Because pressing buttons is not so much fun, we have technology for that: you can automate things via the GCE and AWS APIs.

These APIs are great for one-off creation of machines. However, managing such servers also means changing machine configuration, re-creating parts of the infrastructure, dealing with failure (e.g. a crashed machine), keeping things up to date, rolling out new images, and so on.

Terraform

And that's where terraform comes in. Terraform is a HashiCorp™ tool for managing cloud infra "declaratively". You describe the things you want, press one button, terraform Makes It So.
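
For a concrete flavor of what "declaratively" means, here is a minimal sketch of a single GCE instance described in Terraform. The resource name, machine type, and image are made up for illustration, not our actual config:

```hcl
# Hypothetical example, not our actual config: one GCE instance,
# described declaratively.
resource "google_compute_instance" "example_worker" {
  name         = "example-worker-1"
  machine_type = "n1-standard-1"
  zone         = "us-central1-a"

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-8"
    }
  }

  network_interface {
    network = "default"
  }
}
```

`terraform plan` shows what would change, and `terraform apply` makes it so.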

We have been using terraform for quite some time. Unfortunately, at some point the setup got out of sync with production, and we have mostly been tweaking things manually since then. This has slowed down our ability to roll out changes quickly.

Over the last few months, we have been overhauling our terraform setup. At this point the GCE and AWS staging envs have been rebuilt and are completely managed with terraform again. However, there are still some pieces missing before we can productionize this.

Pudding 🍮 => Cyclist 🚴

One of the tools that we use in AWS is pudding. Pudding has quite a broad scope. It does the following:

  • Make and store cloud-init scripts: When you create an instance, you specify an image to boot from, as well as a cloud-init script that is executed exactly once on boot. That script allows you to inject configuration like keys, URLs, etc. The AWS auto-scaling group launches instances and has them fetch the cloud-init script from pudding, which means changing the cloud-init script involves quite a bit of manual work. The new staging env has the auto-scaling groups as well as the cloud-init scripts managed by terraform directly, no pudding needed (a rough sketch follows right after this list).

  • Instance status: Pudding stores some state about which instances exist and what state they are in. The worker periodically pings pudding to report its status and to find out whether it is expected to shut down gracefully. Instance status can be queried via an API in pudding.

  • ChatOps: We have some fancy Slack integration that lets you see instance status and create or terminate instances. This is a bot that talks to pudding's API.

  • Lifecycle hooks: Auto-scaling groups dynamically spin up and terminate instances based on some metric; in our case the metric is our build queue backlog. This allows us to only run as many instances as we need, and to automatically scale down (or "scale in") when demand decreases. However, by default AWS just terminates the instance. In our case we want to perform a graceful shutdown and finish the currently running jobs before we shut down. This can be done via lifecycle hooks. Pudding does that by talking to AWS SNS.
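
To make the cloud-init and auto-scaling part from the first bullet a bit more concrete, here is a rough sketch of a launch configuration plus auto-scaling group with the cloud-init script handled by terraform directly. All names, sizes, file paths, and variables here are hypothetical, not our actual setup:

```hcl
# Rough sketch, names and values are illustrative only.
resource "aws_launch_configuration" "worker" {
  name_prefix   = "worker-"
  image_id      = var.worker_ami                      # assumed variable
  instance_type = "c3.2xlarge"
  # cloud-init script shipped by terraform instead of fetched from pudding
  user_data     = file("${path.module}/cloud-init.yml")

  lifecycle {
    create_before_destroy = true
  }
}

resource "aws_autoscaling_group" "workers" {
  name                 = "workers"
  launch_configuration = aws_launch_configuration.worker.name
  min_size             = 1
  max_size             = 20
  vpc_zone_identifier  = var.subnet_ids               # assumed variable
}
```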

We are trying to move as much of this responsibility as possible out of pudding and into terraform. The main aspect that we will need to keep is the lifecycle hooks part, which also requires holding some state about instances that are shutting down, as well as getting notified about scale-out and scale-in events from the auto-scaling group.
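
For illustration, a termination lifecycle hook that notifies an SNS topic could look roughly like this in terraform. Again, the names and the IAM role variable are hypothetical; the small service described below would be the thing subscribing to the topic and coordinating the graceful shutdown:

```hcl
# Hypothetical sketch: notify an SNS topic before terminating an instance,
# so the worker can finish running jobs first.
resource "aws_sns_topic" "worker_lifecycle" {
  name = "worker-lifecycle-events"
}

resource "aws_autoscaling_lifecycle_hook" "worker_terminating" {
  name                    = "worker-terminating"
  autoscaling_group_name  = aws_autoscaling_group.workers.name
  lifecycle_transition    = "autoscaling:EC2_INSTANCE_TERMINATING"
  heartbeat_timeout       = 900                       # seconds to finish jobs
  default_result          = "CONTINUE"
  notification_target_arn = aws_sns_topic.worker_lifecycle.arn
  role_arn                = var.lifecycle_role_arn    # assumed IAM role
}
```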

We are in the process of building a much smaller app to handle that aspect specifically; it will be easier to manage. This app is called cyclist (yes, the name is amazing, we know).

JRuby worker => Go worker

Another goal of the infra overhaul is to move from the JRuby worker to the Go worker on AWS.

We have two versions of our worker app. The old one is written in JRuby and is called travis-worker; the new one is written in Go and is called worker.

We are running the Go worker on MacStadium and GCE, but we are still running the JRuby one on AWS. We want to move to the Go worker everywhere and finally retire the JRuby one. AWS is the last missing piece for that.

Roll-out

Since this is quite a large undertaking with many moving parts, we are still in the process of getting everything in place on staging.

Once we have a fully working staging environment we can start to plan the production roll-out.

There is no ETA yet. (Soon™?) Also this is just one of the things that 💙 is working on, so please 🐻 with us.

Thanks! 💙

@backspace

Thanks for sharing these details! I know people have been in love with Terraform but I didn’t really know why, good to know some more about it.

@acnagy

acnagy commented Jul 15, 2016

This is great! Thank you so much for clarifying all these details. Really appreciate seeing more of the context around what's been happening with all of this.

@carlad

carlad commented Jul 18, 2016

Thanks so much for this @igorwwwwwwwwwwwwwwwwwwww. Really, really appreciate the time and thoughtfulness you put into this. I love learning about the build infrastructure and what 💙 is working on (and why!) 💙

@aakritigupta

Cool! This is great work, all the best with the move.
