Issues with Terraform State Management

The idea of "state" is the lynchpin of Terraform, and yet Terraform's workflow is fraught with gotchas that can lead to the loss or destruction of state. This doc is a set of notes about issues I've encountered, what caused them, and in many cases ideas about how to improve Terraform to avoid or reduce the chances of them.

Each of these scenarios has occurred at least once within my team. Each time one of these occurs it erodes people's confidence in Terraform, giving it a reputation for being fragile and unforgiving of errors. This document is not written just to criticize, but rather to identify ways in which the situation could be improved.

Stale Plan Files

This one is not strictly related to Terraform state itself, but it has implications for state integrity and can be addressed in terms of it.

When running terraform plan -out=tfplan, a tfplan file is created with a serialized version of the created plan. This plan can then be applied with terraform apply tfplan.

Once applied, the tfplan file is left in the local directory and can accidentally be applied again. For many changes this results in an error, but in some cases it results in duplicated resources where only the new resources are actually tracked in the state.

One particular case where Terraform encourages mistakes is that when plan produces an empty diff the tfplan file is not updated to reflect that empty diff, leaving behind the result of some previous plan. However, since Terraform exited successfully the user (or some automated system looking at the exit status) is often tempted to run terraform apply tfplan anyway, at which point the stale plan is re-applied.
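
As a concrete illustration of the sequence described above (what the pending change is doesn't matter):

$ terraform plan -out=tfplan    # writes a plan for some pending change
$ terraform apply tfplan        # applies the change; the state is updated

# Some time later, with nothing left to change:
$ terraform plan -out=tfplan    # reports an empty diff, but does NOT rewrite tfplan
$ terraform apply tfplan        # re-applies the stale plan from the earlier run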

This gotcha could be addressed by the following improvements:

  • When writing out a plan file, include in the plan the serial number of the state payload that it was derived from. Before applying the plan, verify that the current serial matches what's in the plan and fail with an error if not. (A hypothetical example of such an error follows this list.)
  • When plan produces an empty diff and the -out argument is provided, write the empty diff out to the given file so that a subsequent terraform apply on that file will be a no-op.
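
If the first improvement were adopted, applying a stale plan might fail with something like the following. This output is purely hypothetical and is shown only to illustrate the idea:

$ terraform apply tfplan
This plan was created from state serial 4, but the current state has
serial 6. The plan is stale; run "terraform plan" again to produce a
plan against the latest state.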

terraform remote config can overwrite states

When running terraform remote config in a directory that already has a state file present, Terraform will try to upload the current state to the newly-configured location.

If some data was already present at the new location, this data is unconditionally overwritten. If the existing data happens to be another Terraform state, that state may then be lost.

This is particularly troublesome for configurations that are intended to be deployed multiple times with different variables: one must be very careful when switching between the states for different instances of the configuration to avoid replacing one instance's state with another.
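
For example, with states stored in S3, switching between two instances of the same configuration might look roughly like this (the bucket and key names are hypothetical):

# Working on the "staging" instance:
$ terraform remote config -backend=s3 \
    -backend-config="bucket=example-tfstate" \
    -backend-config="key=staging.tfstate"

# Later, intending to switch to the "production" instance, while the local
# .terraform/terraform.tfstate still holds the staging state:
$ terraform remote config -backend=s3 \
    -backend-config="bucket=example-tfstate" \
    -backend-config="key=production.tfstate"

# The staging state is uploaded to production.tfstate, unconditionally
# overwriting the production state that was already stored there.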

The core issue of accidentally replacing objects could be addressed by:

  • Making each fresh state file contain a "lineage" property that is unique to it. (#4389) A sketch of what this could look like in the state file follows this list.
  • Making terraform remote config first try to Read the configured location and, if it gets a non-error response, ensure that the retrieved data is a valid Terraform state of the same lineage as what is being written.
  • For extra safety: fail also if the already-stored remote state has a serial greater than the local serial. Making this check only during terraform remote config would not comprehensively deal with all situations of accidentally downgrading a state, but it would catch some mistakes and there's little legitimate reason to actually downgrade a state.
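
For reference, the metadata in question lives at the top of the state file. With the proposed lineage property added, it might look something like this (the values are made up and the remaining fields are elided):

{
    "serial": 6,
    "lineage": "5b3a5ba2-ff38-4f9e-9a36-7a2beac4e1f7",
    ...
}

terraform remote config would then refuse to overwrite a remote state whose lineage differs from (or whose serial is newer than) the state it is about to upload.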

Forgetting to run terraform remote config

For any project using remote state it's important to always run terraform remote config to set up the remote state before taking any other actions that interact with the state. However, it's easy to forget to do this.

If this is forgotten then running terraform apply will likely produce a duplicate set of resources due to the absence of a local state. If the operator then panics and runs terraform remote config, rather than destroying the erroneously-created resources directly, the previous issue causes the "true" state to be overwritten by the new state.
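
A typical version of this mistake, assuming a fresh checkout of a configuration whose "true" state already lives remotely (the repository, bucket and key names are hypothetical):

$ git clone git@example.com:infra/app.git && cd app

$ terraform apply
# No local or remote state is present, so Terraform sees an empty state and
# creates a fresh copy of every resource, duplicating the real deployment.

$ terraform remote config -backend=s3 \
    -backend-config="bucket=example-tfstate" \
    -backend-config="key=app.tfstate"
# The new local state, which tracks only the duplicates, is uploaded and
# overwrites the "true" state that was already stored at that key.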

This could be addressed by:

  • In the very short term, a mechanism in the Terraform configuration to indicate that remote state is required so that Terraform can refuse to run if it's not configured.
  • In the longer term, allowing a specific remote configuration to be provided within the configuration itself, using variable interpolations to accommodate configurations that produce multiple instances depending on arguments. (#1964) A hypothetical sketch of such a declaration follows this list.
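
Purely as a sketch of that longer-term idea (this is not existing Terraform syntax, and the environment variable is likewise hypothetical), such a declaration might look like:

remote_state {
    backend = "s3"

    config {
        bucket = "example-tfstate"
        key    = "app-${var.environment}.tfstate"
        region = "us-east-1"
    }
}

Here the key interpolates a variable so that each instance of the configuration automatically selects its own state.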

Incorrect provider config can completely destroy the state

Consider the following configuration:

variable "region" {}

provider "aws" {
    region = "${var.region}"
}

resource "aws_instance" "main" {
    // ....
}

When this is planned the user might run terraform plan -var="region=us-west-2" to deploy the app to us-west-2, and then use us-west-1 with a separate state to deploy the same instance in that region.

In this scenario the user must be very careful to keep the state selection aligned with the region variable. If plan is run with the region set to us-west-2 but with the state for the us-west-1 deployment, the "Refresh" phase will look up the AWS instance in the wrong region, see that it doesn't exist, and remove it from the state before generating a diff to replace it.

The only way to recover from this is to manually revert to an earlier version of the state that had the resource instance still listed.
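
The mistaken combination looks innocuous when run. Assuming the -state argument is used to select between the per-region states (the file names are hypothetical):

# Intending to update the us-west-1 deployment, but passing the wrong region:
$ terraform plan -state=us-west-1.tfstate -var="region=us-west-2" -out=tfplan
$ terraform apply -state=us-west-1.tfstate tfplan

# Refresh looked for the instance in us-west-2, found nothing, and dropped it
# from the state; the "replacement" is then created in us-west-2, leaving the
# original us-west-1 instance running but no longer tracked by any state.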

This one is tough to address due to Terraform's architecture but here are some ideas:

  • Allow providers to mark some attributes as "resource identity attributes", and require some sort of manual resolution when they change.
  • Have the AWS provider in particular remember the region that each resource was created in and only pay attention to the provider-specified region during Create, with Read, Update and Delete using the resource-recorded region. In this case moving a resource to another region would require tainting it.

Can accidentally run terraform apply with no plan file

Terraform supports running just terraform apply as a shorthand for terraform plan -out=tfplan && terraform apply tfplan.

This combined operation is handy when you're new to Terraform and you want to experiment, but it's generally a bad idea to do this on any real production deployment, since you don't get a chance to review what changes will be made and so you can end up inadvertently destroying important infrastructure.

I've observed people not quite understanding the flow and doing this:

  • terraform plan -out=tfplan
  • terraform apply

This appears to work and so people don't realize it's wrong, but then one day they end up applying something slightly different than what was planned.

There is a particularly awkward variation on this whose consequences are worse:

  • terraform plan -out=tfplan -target=something.something
  • terraform apply

Here the user wanted to apply just a subset of the config, but inadvertently ended up applying all of it.

There isn't any good way for Terraform to recognize and block this mistake automatically, since the plan file can be called anything and might be stale.

However, we could add a new top-level setting to the Terraform config that allows the config author to express that this config must always be planned separately from apply:

workflow {
    require_explicit_plan = true
}

When this flag is set, running terraform apply without a plan file would generate an error:

$ terraform apply
This configuration requires an explicit separate plan step.

To create a plan, run:
    terraform plan -out=tfplan
    
Once you've reviewed the plan and verified that it will act as expected, you can then apply it using:
    terraform apply tfplan