Concourse on Kubernetes

This document outlines Brandwatch's Concourse installation running on Kubernetes. The full configuration can be found at (internal only currently). It's a fairly new installation (1-2 weeks) and we're slowly migrating work from our existing BOSH installation to this.

Comments/questions welcome below.


  • Google GKE
  • ConcourseCI (from stable/concourse chart)
  • Prometheus / Alert Manager (Metrics, monitoring, alerting)
  • nginx-ingress-controller (TLS termination, routing)
  • kube-lego (letsencrypt certificates)
  • preemptible-killer (controlled shutdown of preemptible VM instances)
  • delete-stalled-concourse-workers (periodically checks for and kills stalled workers)


  • Kubernetes nodes run Ubuntu images, to allow for the overlay baggageclaimDriver. We did not find any configuration that could run successfully on COS (Container-Optimized OS) instances.
  • Runs across 2 AZs (so we can run minimum 2 nodes in a node-pool)
  • Cluster split into two node-pools
    • node-pool for Concourse Workers (auto-scaling). n1-standard-4 machines. We’ve generally found much better behaviour from workers once they have around 4 CPUs available.
    • node-pool for everything else. n1-standard-2 machines.
  • All instances are currently preemptible, so we trade off some stability of workers for much reduced cost (but continue to work on increasing stability).


Concourse is installed via the Helm charts.

  • Concourse v3.8.0 currently
  • baggageclaimDriver: overlay
  • Two web replicas
  • Between 2-6 workers (we scale up/down for work/non-work hours)
  • Service: clusterIP
  • Ingress (uses nginx-ingress-controller)
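
The work/non-work hours scaling of workers could be sketched roughly as below. Only the 2-6 range comes from this write-up; the Monday-to-Friday 08:00-18:00 window and the function name are assumptions for illustration, and applying the result to the worker deployment is not shown.

```python
from datetime import datetime

MIN_WORKERS = 2  # lower bound of the worker range described above
MAX_WORKERS = 6  # upper bound of the worker range described above

# Assumed working window (not stated in the write-up): Mon-Fri, 08:00-18:00.
WORK_START_HOUR = 8
WORK_END_HOUR = 18


def desired_worker_count(now: datetime) -> int:
    """Return the worker replica count to aim for at the given local time."""
    is_weekday = now.weekday() < 5  # Monday is 0, Sunday is 6
    in_hours = WORK_START_HOUR <= now.hour < WORK_END_HOUR
    return MAX_WORKERS if (is_weekday and in_hours) else MIN_WORKERS
```

Something like a scheduled job could call `desired_worker_count(datetime.now())` and set the worker replica count accordingly.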


The Nginx Ingress Controller is pretty vanilla, installed via the Helm stable chart.

  • v0.9.0
  • 2 replicas
  • kube-system/default-http-backend
  • Service bound to Google Network Load Balancer IP


Prometheus is installed via the Prometheus operator.

  • 1 replica
  • 2 alert-managers


The kube-lego process runs in the cluster and finds Ingress objects that require TLS certificates. It handles the interaction with Let's Encrypt and sets up the HTTP challenge. Installed via the Helm stable chart.

Preemptible workarounds

There's a bunch of clutter related to wanting to run workers on preemptible GKE instances. Preemptible GKE instances cost approximately 30% of the price of standard instances but can be preempted (shut down) at any time, and will be at least once every 24h.

If you are happy paying for non-preemptible instances you'll likely get more stable workers without any of these workarounds. On the other hand, you never know when a node will die underneath you for other reasons, so this is a more general problem that would be good to solve.


A basic attempt to control preemptible VM shutdowns. The controller adds annotations to preemptible nodes and, within 24 hours, does a controlled termination of all pods and shuts down the VM. This is preferable to the VM dying underneath us with no warning, which leads to stalled workers. We will likely adapt this to force a restart of preemptible VMs just prior to working hours, to reduce the chance of forced restarts during working hours.
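
The timing side of such a controller might look something like the sketch below. This is not the preemptible-killer's actual code: the one-hour safety margin, the "second half of the window" rule, and both function names are assumptions. A real controller would store the planned time as a node annotation and drain the node through the Kubernetes API, neither of which is shown here.

```python
import random
from datetime import datetime, timedelta
from typing import Optional

MAX_LIFETIME = timedelta(hours=24)  # preemptible VMs are shut down at least once every 24h
SAFETY_MARGIN = timedelta(hours=1)  # assumed buffer so the drain happens before the hard limit


def plan_shutdown(created_at: datetime, rng: Optional[random.Random] = None) -> datetime:
    """Pick a shutdown time before the 24h limit, jittered so nodes don't all go at once."""
    rng = rng or random.Random()
    latest = MAX_LIFETIME - SAFETY_MARGIN
    earliest = latest / 2  # assumed: never kill a node in the first half of its window
    offset = rng.uniform(earliest.total_seconds(), latest.total_seconds())
    return created_at + timedelta(seconds=offset)


def should_drain(planned_at: datetime, now: datetime) -> bool:
    """True once the planned shutdown time has been reached."""
    return now >= planned_at
```

Restarting nodes just before working hours, as mentioned above, would amount to replacing the random offset with a fixed time of day.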

We have experimented with shutdown scripts on preemptible nodes, but cannot get them to successfully delete worker pods during the shutdown phase. More experimentation required here, because I don’t understand why it’s not possible. We currently work around this problem with…

stalled worker cleanup

We run delete-stalled-concourse-workers in the cluster. Every minute it checks for stalled workers via the Concourse API and prunes any it finds.
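
The core of that check could be sketched as follows. It assumes each worker is returned as an object with `name` and `state` fields and that stalled workers report the state `stalled`; the two callables are hypothetical stand-ins for whatever talks to the Concourse API (or shells out to `fly prune-worker`), so no endpoint details are baked in.

```python
from typing import Callable, Dict, Iterable, List


def stalled_worker_names(workers: Iterable[Dict]) -> List[str]:
    """Return the names of workers whose reported state is 'stalled'."""
    return [w["name"] for w in workers if w.get("state") == "stalled"]


def prune_stalled(
    fetch_workers: Callable[[], Iterable[Dict]],  # hypothetical: lists workers from Concourse
    prune: Callable[[str], None],                 # hypothetical: prunes one worker by name
) -> List[str]:
    """Run one cleanup pass and return the names that were pruned."""
    names = stalled_worker_names(fetch_workers())
    for name in names:
        prune(name)
    return names
```

Running `prune_stalled` once a minute from a loop or a Kubernetes CronJob would give roughly the behaviour described above.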

william-tran commented Feb 5, 2018

Thanks for the write-up, why did you decide to migrate off BOSH deployed Concourse? We're running KOPS k8s on AWS, and are using all the things listed above minus the preemptible-killer. stalled worker cleanup for us runs every 10 seconds, prunes retiring workers as well, and restarts jobs that errored so no manual intervention is needed to deal with jobs that errored out due to transient issues.

ahume commented Feb 6, 2018

@william-tran The migration from BOSH was primarily about moving to a platform we have experience and skills with. I created the BOSH deployment without really getting to grips with the underlying platform, and upgrading/scaling/maintaining in general was always uncomfortable. I never managed to successfully complete the 1.7 & postgres upgrade, which was the final push really.

Your stalled worker clean-up is definitely more sophisticated than ours. Two questions to clarify if you don't mind. It is presumably safe to prune the worker before it has completed retiring? How do you differentiate between a job failing for some transient reason, and a legitimate CI/CD build failure?

william-tran commented Feb 12, 2018


It is presumably safe to prune the worker before it has completed retiring

This is probably too aggressive, and will result in errored builds; any build using that worker for a task will see that task disconnect, and the build will go orange. This isn't a big deal for us though, because of our job restarter.

How do you differentiate between a job failing for some transient reason, and a legitimate CI/CD build failure?

I guess what I really mean by "some transient reason" is builds with a status of errored rather than failed when retrieved from /api/v1/builds. errored builds are ones that didn't finish because of some concourse related issue, while failed builds happen when a process exits non-zero.
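
As an illustration of that distinction only (this is not the restarter discussed in the thread), selecting restart candidates could look like the sketch below. It assumes each build is an object with a `status` field; the other field names in the example are likewise assumptions.

```python
from typing import Dict, Iterable, List


def builds_to_restart(builds: Iterable[Dict]) -> List[Dict]:
    """Keep only builds that errored (a Concourse-side problem).

    Builds with status 'failed' (a process exited non-zero) are treated as
    legitimate CI/CD failures and left alone.
    """
    return [b for b in builds if b.get("status") == "errored"]


# Example with made-up data: only the errored build is selected.
example = [
    {"id": 1, "job_name": "unit", "status": "failed"},
    {"id": 2, "job_name": "deploy", "status": "errored"},
    {"id": 3, "job_name": "unit", "status": "succeeded"},
]
restart_ids = [b["id"] for b in builds_to_restart(example)]
```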

jschaul commented May 29, 2018

(...) restarts jobs that errored so no manual intervention is needed to deal with jobs that errored out due to transient issues.

@william-tran Are you able to share this code regarding job-restarts (or is the code that does the above already available somewhere - I wasn't able to find something here)?

rohithmn3 commented Jun 18, 2019


Could you please have this link open to external..!?
It would be really helpful.

dlbock commented Jan 13, 2021

@ahume, awesome writeup! We've recently started running Concourse at Instana (~4-5 months) and we've arrived at a similar-ish setup to yours after some recent tuning changes to stop workers from picking up tasks without limit and eventually getting overwhelmed. We're just starting to look into auto-scaling configuration for the workers. I wonder if you have any tips/tricks/gotchas for that?

sabbir123222 commented Nov 19, 2021

Hello Friends,

How can Concourse on Kubernetes be used with auto-scaling to reduce costs?

dlbock commented Dec 8, 2021

I recently wrote this up:, mostly for my future self just in case, but maybe it'll be helpful to someone else
