You probably don't really want CI-triggered CD

First, some terms and disambiguation:

  • CI -- Continuous Integration -- means tests which get triggered automatically when changes are pushed into source control.
  • CD -- Continuous Deployment -- means being able to deploy any checked-in state of source control into a fully active environment.

(It's funny because "integration" usually means the kind of tests you run with all parts of a system together, and meanings have drifted so hard that we'd now typically say that requires a "deployment" (to "stage", typically)... but never mind that historical baggage and comedy.)

CI is good. This is pretty much universally acknowledged.

CD is good. This term is newer, but the concept is typically well-regarded.

The definition of CD I've chosen here -- and I say "chosen" and "here" because an industry-wide agreed-upon definition is elusive -- is purposefully vague on a few points, though:

  • CD doesn't necessarily mean deployments need to be triggered on every commit.
  • CD doesn't necessarily have to be triggered by your CI.

And yet these are often confused. (My theory is that it's mostly because CI companies would love to capture the new buzzwords as well, even though it's not something they're necessarily well-positioned to do well.)

In fact, I think you probably do not want CI-triggered CD.

Deployment is bigger than any single project

Having a single source of truth that tracks the versions of all components in production is invaluable. Splitting that information up between several version control repositories is not viable (and neither is punting completely and simply not having it, obviously).

If version info for a project is based on "what's on the master branch (in $repo_hub) (on $date)", and you have multiple projects, you have several problems:

  • branches are not themselves version control. They can move. They are not audit logs.
  • it's pretty much impossible to express that two services expect to be deployed in correlated versions because there's no way to link them (other than something very hand-wavey involving date strings and prayers; do not go here).
  • you lack an enumeration of which projects are expected in a deployment in the first place.

That last one is killer.

You need a single source of truth if you have multiple projects. It can be another git repo, or something more creative; you just need something you can point to and say "There. That is Everything we expect in a deployment."
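
As a concrete sketch (everything here -- repo name, file name, services, registry -- is invented for illustration): the single source of truth can be as simple as one file in a dedicated git repo, enumerating every component and pinning each to an exact, immutable identifier.

```sh
# Hypothetical deployments repo: one file enumerates every component
# expected in a deployment, pinned to exact versions.
git init -q deployments && cd deployments

cat > components.txt <<'EOF'
# service        source commit   artifact
auth-service     2f9c1ab         registry.example.com/auth@sha256:<digest>
billing-service  77d0e42         registry.example.com/billing@sha256:<digest>
frontend         a31bb90         registry.example.com/frontend@sha256:<digest>
EOF

git add components.txt
git commit -q -m "deploy: pin auth 2f9c1ab, billing 77d0e42, frontend a31bb90"
```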

For declarative cluster systems users

(e.g. you live completely in k8s or something)

It's possible to have a production control system which itself tracks the total intended state of production, and modifies the set of machines and services running in order to anneal towards that state. Such a system can hypothetically be the solution to deployment tracking in a microservices environment.

Questions remain, however.

  • Does your cluster-keeper system keep a history of desired states, such that you can audit "what was running in Prod on YYYY-MM-DDT13:01Z"?
  • Does your cluster-keeper system have a clear source of truth itself, which perfectly maps to the human operators' intentions?
  • Can you cleanly go from the cluster-keeper state for a service to looking up the source in version control that spawned that service and its configuration?

Be careful of answering too glibly here. For example, if you've set up a pipeline using kubernetes which involves running kubectl apply -f on files in a git repo... what happens if you delete a service's config files? Does it get removed from the cluster? (Spoiler: nope!)
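
To make that concrete (with the caveat that kubectl's prune mode is an alpha feature whose semantics have shifted between versions, and the label here is invented):

```sh
# Plain apply creates and updates whatever it is handed; it never
# deletes a resource just because its file vanished from the repo:
kubectl apply -f ./manifests/

# Remove manifests/old-service.yaml, re-run the same command, and
# old-service keeps running in the cluster anyway.

# kubectl does have an opt-in (alpha) prune mode that deletes matching
# resources absent from the input -- but you must wire it up yourself:
kubectl apply -f ./manifests/ --prune -l app.kubernetes.io/managed-by=deploy-repo
```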

Given their commonplace presence in the industry right now, docker and dockerhub/OCI-registry-based images also deserve an explicit mention here. Images and registries tend to badly obscure what their sources are; make sure you account for this and work around it somehow.

For mono-repo advocates

"monorepos" -- the act of combining all of your company's development into one, massive single repository -- are sometimes touted as an answer to... well, all development organization problems, really. And I don't believe that in general; monorepos just pose very different problems that fester more slowly; they'll still generate organizational work that you'll end up spending years digging yourself out of. But that's a whole other series of topics.

"monorepos" do (sort of) address this particular heading: having one repo means you can version all the different projects and their deployments at once. However, it's not time to gloat and praise the monorepo; it doesn't solve:

  • Deployments are not atomic, meaning you're setting up this poor repo to end up in a state where it's doomed to tell lies simply because development and deployment move at different paces (see two sections down);
  • I bet your monorepo still deploys as several different packages (no? well, ok, I guess no one can stop you from rsyncing the entire rails app to every machine regardless of what they do, but maybe you'd like to join us in a future that's not that bleak?); it's still important to make sure this is accounted for clearly;
  • Hang on for the next section, because it also still completely applies to monorepos.

Version control for your source != Version control for your deployments

We already kinda covered why in the previous section on "deployment is bigger than any single project", but there's another reason to avoid conflating version control for source code with version tracking for deployments: deployments to production have state; you can't "roll back", only roll forward.

We certainly try to have a well-defined "rollback" plan for any major operation, but it's critical to be honest about this: unless "rollback" is a full destruction of every piece of state in the system followed by a hard reset to a complete snapshot before the rollout, it's not a "rollback". It's a state change that's attempting to roll "back" semantically... but it's still a change, not an undo, and there's a difference.

This is different from the branching models we can use in project source repos. In source repos, we can make branches -- prod doesn't branch -- and we can delete them freely -- droptables in prod isn't ideal -- and we can merge them -- prod doesn't "merge", what would that even mean?! -- and we can rebase them -- you wish you could rebase prod! -- in short, almost none of the things we can do to track progress in a source repository make any sense at all for describing how to track changes in production.

That's not to say we can't use git (or other version control systems) to track production -- but we have to go into it with eyes open. The correct way to represent history in production is by a linear series of commits; no branches, no tags, nothing; every commit is equally important, and critically they are in an absolutely straight line because there's no other sane way to have the model semantically line up with how other state -- SQL databases, hadoop state, log files, etc -- may have been mutating in production.
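
A sketch of what that looks like day-to-day, assuming the hypothetical deployments repo from earlier (with its components.txt at the root):

```sh
# Every deploy is exactly one commit on one branch -- a straight line:
git log --oneline

# Auditing "what was running in prod on 2019-01-10 at 13:01 UTC" is a
# plain history lookup, not archaeology:
git show "$(git rev-list -1 --before='2019-01-10T13:01Z' HEAD)":components.txt
```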

Since the way we track source development and the ways we can model changes in production deployment have completely different flow, it's simply asking for trouble and manufacturing a source of confusion if we try to cram them into the one shared instance of version control.

Make a separate repo for your production deploy history. Consider deployment itself a project. Much fog will be lifted.

Deployment is not atomic nor instant

Deployments do not happen in a finger-snap moment.

Deployments to production should typically involve a human's direct attention. In theory, your testing pipeline and quality control are so good that nothing ever goes wrong, and no one ever needs to push the "rollback" button, and you don't even need a canary system.

In practice, ahheh.

Deployments in a large system target a number of machines. At this point, whether you like it or not, you have a full blown distributed systems situation on your hands, and partial progress and partial failure are both options -- not every machine in a swarm will update in a single quantum of time; some may be separated by many minutes, even. And you probably want this: not every server in a busy system should reboot at the same time.

This conflicts fundamentally with how we perceive CI and what we expect from it. CI is supposed to complete in minutes (at most); less if possible; seconds is ideal if you have enough budget for resources and parallelizing the process. Production deployments and their monitoring exist on a different timescale. That means the UX and UI for dealing with them is completely different. A production deploy should not block a CI resource queue. A production deploy from a commit 20 minutes ago also shouldn't get buried under the 30 other commits your development team has pushed, triggering further CI. A production deploy, once started, cannot be trivially canceled with no side-effects. And so on.

CI services having keys to prod is Not Great

The job of a CI service is to run half-baked code.

It's Not Ideal to trust the same service that's in charge of building and running half-baked code with your keys to production.

In many companies, CI may also be an out-of-house purchased commodity. In that situation, CI having the ability to deploy into production fundamentally means trusting that external company providing CI services with the ability to completely torpedo your company. Not necessarily by malice; any slight slip-up they make -- and remember, their job is to run other people's code; their job is fundamentally to provide shell -- may escalate quickly to a situation where your company has to make A Security Disclosure.

A deployment system certainly does want to incorporate feedback from the CI system. But putting those two systems together is simply reckless.
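
One way to draw that boundary (a sketch; the registry name and tagging scheme are invented): CI's credentials end at the artifact registry, and the deploy system holds its own, entirely separate credentials.

```sh
# What CI is allowed to do: build, test, and publish an artifact.
docker build -t registry.example.com/auth:2f9c1ab .
docker push registry.example.com/auth:2f9c1ab   # CI's privileges stop here

# What CI is NOT allowed to do: anything past this line. A separate
# deploy system -- separate credentials, human-paced trigger -- picks
# the artifact up, e.g. by committing its digest into the deployments
# repo. Only that system can touch production.
```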

Named environments are Bad Mkay

This is the real money shot. All the other reasons pale in comparison to this. If you link someone to just one part of the article, it should probably be right here.

CI-triggered CD gets you on a swampy, soggy path to having named environments not just for prod, but for "stage".

And you don't want that.

A brief aside on named environments: it always starts with the best of intentions.

At first you have "prod". Of course you do. But shortly after that, you'll create "stage", because you need somewhere to test changes which have interactions between all the individual services -- you need to test deployment itself. Natural enough.

At first this is good, and perhaps you and the devops team will rest. But not for long. Because after "stage", a few months later you'll find someone -- maybe in another nearby department, surely not in your own well-honed team -- saying that they need to test combination X-Y-Z of things in "stage", so could your team please make new changes to system Y in a different environment? Maybe we'll call it the "testing" environment.

And this will keep happening. Soon you'll end up with a "stage-old", or, just as nonsensically, a "stage-new". And then a "dev", which you push to before "testing" (and you've forgotten what the original justification for "testing" was, but now we've got it, and someone's using it... aren't they? Probably).

Don't get on this road.

The solution is to avoid named environments like the plague they are. The instant an environment is named, it will become a pet -- and just like "cattle, not pets" is the goal with individual servers, so too it should be with environments.
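
On kubernetes, for instance, "environments as cattle" can be as lightweight as a throwaway namespace per change (a sketch; the naming scheme is invented):

```sh
# Stamp out an environment, use it, destroy it -- never name it by hand.
ENV="test-$(git rev-parse --short HEAD)"
kubectl create namespace "$ENV"
kubectl apply -n "$ENV" -f ./manifests/
# ...run the cross-service tests against this environment...
kubectl delete namespace "$ENV"   # gone before it can become a pet
```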

...

Okay, but how did this relate to CD, and CI-triggered CD?

The problem here is that when you use CI as the triggering system for CD, you've got a bunch of other questions to answer. And all the answers you can give are limited to pretty crappy options. For example:

Question: Where shall I put the rest of the config for my environment? DB passwords, the top-level external DNS entries, the replica counts for each service -- things like that which naturally differ per environment?

Bad answer 1: Put them in the CI service's dashboard! (This is Not Great, because the result is a lot of critical configuration ends up in a web dashboard, instead of in version control -- and yes, it is definitely important enough to version control.)

Bad answer 2: Put them in the project source repo! But we need at least two -- one for prod and one for "stage" so we can test rollouts -- so hm, then we'll just use one env var and set it in the CI dashboard to switch which one we're deploying! (This is better -- things are in version control -- but what just happened here?)

Both Bad Answer 1 and Bad Answer 2 have the same downstream effect: you're configuring the range of named environments your team can deploy to, and you're doing it manually because it needs to be done in the CI service's dashboard, and the set became finite because you're doing it manually. Congrats, you ended up with a wickedly high-touch manual system, and it needs to be configured again and kept up-to-date in every repo your organization has.

There's a third option, though. And it's still bad.

Bad answer 3: We'll put the templates for deploying to any environment into the source repo, and then hint to the CI service how to fill things in with the {branch name, tag}! (This is better -- we got to templates! Named environments are almost eliminated. But again, what other issues just got created?)

Each bad answer here is getting slightly less bad, but with this third one, we've reached the real crux of the issue: git metadata is not meant to be used as a message bus... and furthermore, git metadata like branches and tags aren't themselves version controlled. What you wanted was deploytool temp-env --destroy-after=1h and what you got was a hairball of stateful git manipulations, some very complicated watchdogs trying to trigger on them, an entire new section of the employee handbook teaching new hires about how to use (and how not to use) magically named strings in their git branches, you've got people trained to delete tags sometimes, none of these changes are entirely clear in any version control, you definitely don't have a rollback strategy in place for when someone deletes the wrong tag, and oh boy.

That's it. There's really no further improvement possible here. Git is not a message bus; don't try to make it pretend to be one.

Or.

Do make git a message bus -- but don't do it in the metadata like branches and tags. Do it in the repo contents. Do it in full files.
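
As a sketch of what "full files" could mean here (continuing the invented deployments-repo layout from earlier; the paths and digest are placeholders):

```sh
# The message bus is repo *contents*: change a whole file, commit it.
mkdir -p environments/prod
cat > environments/prod/auth-service.txt <<'EOF'
image: registry.example.com/auth@sha256:<digest>
replicas: 6
EOF
git add environments/prod/auth-service.txt
git commit -m "prod: roll auth-service forward to 2f9c1ab"

# Whatever watches this repo diffs file contents, not branch pointers;
# every environment's full history is plain, auditable commits.
```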

This leads you to the concept of a Deployments Repo -- which -- TODO -- needs another article. Teaser: the word "linearization" will come into play, and it will make your life much better.

Recap

  • Be mindful that deployments and automated testing consume different scales of time, and different kinds of resources. Testing can be stateless and from scratch; production is not so simple.
    • CI-triggered CD attempts to blur this boundary; this confusion will not make life better.
  • Be careful of scattering your deployment intentions too widely. Make sure you can name every single service that should be involved in a full deployment.
    • CI-triggered CD makes this harder, not easier.
  • Be careful of using version-control metadata as if it was version control. It's not.
    • CI-triggered CD actively leads you towards doing the wrong thing here. Resist.
  • Be careful of giving keys to your kingdom to a service that wasn't designed with arbitrating that kind of power in mind.
    • CI services were generally not designed with that power in mind.
  • Be careful around accidentally generating more than one named, "pet" environment.
    • CI-triggered CD makes this harder, not easier.
  • You do want CD. But you want it on your terms.