In software development, automated testing has long been accepted as best practice.
Test-driven development (TDD) or behavior-driven development (BDD) approaches often go as far as writing tests for functionality before it is implemented, and continuous integration and continuous delivery (CI/CD) pipelines are commonplace, automatically running tests against your codebase whenever a change is pushed to version control.
This kind of testing rigor, where your code is comprehensively and automatically exercised to ensure it does what it's supposed to do, and that recent changes haven't broken existing functionality (often referred to as introducing a "regression"), accelerates the software development process so that teams can go faster safely.
"Infrastructure as Code" (IaC) brings many of the benefits and tooling around modern software development to infrastructure - the servers, network components, backend storage and so on that our application code runs on in order to provide services to users. As this approach to infrastructure becomes more widespread, naturally people want to apply the same kind of testing rigor to their infrastructure code as their application code.
In this article I'm mainly going to discuss terraform as the code part of IaC, and AWS as the cloud provider. That's just because those are very common choices for defining and running infrastructure. The ideas and principles I'm talking about are also applicable to other IaC technologies and providers, and nothing here is meant as a criticism of either Terraform or AWS. Infrastructure testing is hard, whether you're deploying terraform code to AWS or Ansible code on Google Cloud. The same problems apply.
Although they're both "code", testing infrastructure code is different from testing application code.
The main problem is time.
TDD/BDD works best when you have fast feedback. Very often, you'll have your tests running automatically in another window whenever you save your file, so you see any problems almost immediately.
When your code is creating infrastructure, this kind of fast feedback is impossible. Spinning up servers, creating virtual private clouds (VPCs), and setting up load-balancers takes time - often several minutes, depending on the type of infrastructure resource and the cloud provider.
For example, creating an AWS RDS instance takes around 20 minutes - longer if you're creating a cluster or setting up read replicas. That's not a criticism of AWS - all of the major cloud providers have similar limitations. Building infrastructure just takes time, even in a modern, cloud-centric environment.
There are ways you can minimise these delays - for instance you can use a cloud datastore instead of creating database servers, or launch docker containers instead of virtual servers - but they're impossible to eliminate. And the more your development environment diverges from your production infrastructure, the less reliable your IaC tests become.
Running code which creates a virtual server, or launches a pod in a kubernetes cluster, is never going to be as fast as instantiating objects in memory. So, you're going to get feedback from your IaC tests much more slowly than from your application tests.
Another possibility for faster feedback from your IaC tests is emulation: rather than actually building the infrastructure your code defines, you use emulation to try to gain insights into its correctness.
A simple example of this would be to run `terraform plan` on your terraform code and see if it looks like it's going to do what you expect.
Although emulation can add value, the problem is that you're now exposed to multiple sources of error: mistakes in your IaC code, and errors in the emulation layer, where it doesn't provide a completely accurate representation of the behaviour of your infrastructure provider.
When you run `terraform plan`, the output is saying, in effect, "These are the AWS API calls I made, and these are the results I expect those calls will have." Very often, `terraform plan` will be completely correct, but sometimes the API has behaviours that are not emulated completely correctly.
An example of this is parameter name lengths. It's quite common for `terraform plan` to be completely happy with some code, but then the AWS API rejects a particular call because the name assigned to, say, an RDS instance, is too long.
That's just one example, that I happen to have seen a lot. I'm sure there are others, and again, this isn't a criticism of terraform in particular - emulating the entire API of a cloud provider is a huge task, and it's not surprising that (as far as I know), nobody does it perfectly.
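One cheap way to catch this class of error before anything touches the API is a pre-flight check in ordinary code. Here's a Go sketch; the function name is mine, and the 1-63 character range reflects the documented constraints on RDS DB instance identifiers (check the current AWS documentation before relying on it):

```go
package main

import (
	"fmt"
	"regexp"
)

// A simplified version of the documented constraints on RDS DB instance
// identifiers: start with a letter, then letters, digits and hyphens only.
// (The real rules also forbid trailing and consecutive hyphens.)
var rdsIdentifierPattern = regexp.MustCompile(`^[a-zA-Z][a-zA-Z0-9-]*$`)

// validRDSIdentifier reports whether a name would be accepted by the AWS
// API as an RDS DB instance identifier, under the simplified rules above.
func validRDSIdentifier(name string) error {
	if len(name) < 1 || len(name) > 63 {
		return fmt.Errorf("identifier %q must be 1-63 characters, got %d", name, len(name))
	}
	if !rdsIdentifierPattern.MatchString(name) {
		return fmt.Errorf("identifier %q must start with a letter and contain only letters, digits and hyphens", name)
	}
	return nil
}

func main() {
	for _, name := range []string{
		"webserver-db",
		"a-very-long-name-that-goes-on-and-on-and-on-and-on-and-far-past-sixty-three-characters",
	} {
		if err := validRDSIdentifier(name); err != nil {
			fmt.Println("FAIL:", err)
		} else {
			fmt.Println("OK:  ", name)
		}
	}
}
```

A check like this runs in microseconds, so it can sit in a pre-commit hook or an early CI stage, long before a twenty-minute RDS build fails.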
For the sake of completeness, I should mention that running a suite of IaC tests that actually builds and then tears down cloud infrastructure generally costs a lot more than running application tests, where you're just paying for the compute power to execute your test code. However, the hosting cost of briefly running some test infrastructure is almost always much smaller than the cost of engineers' time spent finding and fixing problems that could have been avoided with better testing.
Testing IaC code is slower and more difficult than testing application code, but it's still important, and the better you test your IaC code, the fewer problems you're going to have. So, how do you do it?
Whichever IaC technology you're using, there are usually several dedicated tools designed to help you create automated tests for it.
Some examples include terratest for terraform, litmus for testing Puppet modules, and test-kitchen for testing Chef code (these are examples, rather than endorsements).
The landscape of IaC testing tools changes quite quickly, so it's worth doing some basic due diligence to ensure that any tool you're planning to use is still being actively supported and developed, before you sink a lot of time and effort into it.
Let's walk through a very simple example (~~ripped off from~~ inspired by this terratest example) of using terratest to test some IaC code.
We're going to create some terraform code that deploys an EC2 instance, and some terratest code to check that our terraform code tags our instance the way we want it to.
In an empty directory, create two directories, `terraform` and `test`, and the following files:
This code is for terraform version 0.13.3 (the latest version at time of writing).
terraform/variables.tf

```hcl
variable "aws_access_key" {
  default = ""
}

variable "aws_secret_key" {
  default = ""
}

variable "aws_region" {
  default = "us-west-2"
}

variable "instance_name" {
  default = "Webserver"
}
```
terraform/main.tf

```hcl
provider "aws" {
  access_key = var.aws_access_key
  secret_key = var.aws_secret_key
  region     = var.aws_region
}
```
terraform/outputs.tf

```hcl
output "instance_id" {
  value = aws_instance.webserver.id
}
```
terraform/ec2_instance.tf

```hcl
resource "aws_instance" "webserver" {
  ami           = "ami-0841edc20334f9287" // AWS Linux AMI
  instance_type = "t2.micro"
  tags = {
    Name = var.instance_name
  }
}
```
We need AWS credentials to run this code:

```shell
export TF_VAR_aws_access_key="XXXXXXXXXXXXXXXXXXXX"
export TF_VAR_aws_secret_key="XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
export AWS_ACCESS_KEY_ID=${TF_VAR_aws_access_key}
export AWS_SECRET_ACCESS_KEY=${TF_VAR_aws_secret_key}
```
We need the `TF_VAR_` environment variables for our terraform code, but terratest also needs the AWS credentials to query the API and get details of our EC2 instance, so we need the same values with different environment variable names.
If you run this code, it will launch an EC2 instance with a `Name` tag with the value `Webserver`.

Don't forget to `terraform destroy` anything you create, or you may incur charges from AWS.
Now let's add a test to check that our terraform code applies the `Name` tag correctly:
test/ec2_instance_test.go

```go
package test

import (
	"fmt"
	"testing"

	"github.com/gruntwork-io/terratest/modules/aws"
	"github.com/gruntwork-io/terratest/modules/random"
	"github.com/gruntwork-io/terratest/modules/terraform"
	"github.com/stretchr/testify/assert"
)

func TestEC2Instance(t *testing.T) {
	t.Parallel()

	// Use a unique name, so we don't affect any of our "real" instances
	expectedName := fmt.Sprintf("terratest-%s", random.UniqueId())

	terraformOptions := &terraform.Options{
		TerraformDir: "../terraform",

		// Variables to pass to our Terraform code using -var options
		Vars: map[string]interface{}{
			"instance_name": expectedName,
		},
	}

	defer terraform.Destroy(t, terraformOptions)
	terraform.InitAndApply(t, terraformOptions)

	instanceID := terraform.Output(t, terraformOptions, "instance_id")
	awsRegion := "us-west-2" // Must match ../terraform/variables.tf "aws_region"

	instanceTags := aws.GetTagsForEc2Instance(t, awsRegion, instanceID)

	nameTag, containsNameTag := instanceTags["Name"]
	assert.True(t, containsNameTag)
	assert.Equal(t, expectedName, nameTag)
}
```
To run this test, cd into your `test` directory and run:

```shell
go mod init your_github_name/terratest_example # This just needs to be a unique package name
go test -v -timeout 30m
```
The first time you run this, it will download all the packages it needs, and then it will start to apply the terraform code. You can follow along in the AWS console and watch it create and then destroy an AWS instance.
At the end of the test run, you should see something like this (the time taken to run the test will be different):
```
--- PASS: TestEC2Instance (106.38s)
PASS
ok      your_github_name/terratest_example      106.807s
```
Try breaking the terraform code (e.g. by changing `Name` to `InstanceName`), and the test should fail.
Some of the benefits of terratest are:
- You write your tests in go, which is a language with wide adoption in the devops community. Both terraform and kubernetes are written in go, along with many other infrastructure tools, so lots of engineers are likely to have experience with the language (or be keen to acquire some).
- It supports a wide range of target platforms, including cloud servers on AWS and GCP, as well as kubernetes, packer machine images, and docker images.
On the other hand, the built-in go test framework, which terratest sits on top of, is quite basic compared with test frameworks in other languages such as ruby's RSpec (which underpins serverspec), and this can result in test code which is more verbose and harder to maintain.
If you already have experience writing tests for your application code, in whatever language that's written in, there's no reason you can't write your IaC tests in the same framework. Automated testing for IaC does have specific challenges, as discussed earlier, but fundamentally you're still setting up pre-conditions, making a change, and testing to see if you got the correct result.
Writing your IaC tests using the same framework as your application tests leverages the existing skills and experience of your engineering team, and allows you to take advantage of all the features and tooling you're used to.
Here's a simple example of using RSpec to test a kubernetes cluster:
namespace_spec.rb

```ruby
require "spec_helper"

describe "accessing namespace" do
  def can_i_get_pods(namespace, group)
    `kubectl auth can-i get pod --namespace #{namespace} --as test --as-group :#{group} --as-group system:authenticated`.chomp
  end

  let(:namespace) { "kube-system" }

  context "when group is sysadmin" do
    let(:group) { "sysadmin" }

    it "allows access to pods" do
      result = can_i_get_pods(namespace, group)
      expect(result).to eq("yes")
    end
  end

  context "when group is not sysadmin" do
    let(:group) { "not-sysadmin" }

    it "does not allow access to pods" do
      result = can_i_get_pods(namespace, group)
      expect(result).to eq("no")
    end
  end
end
```
Here, we are testing namespace access rules by using the `--as` flag to `kubectl` to "impersonate" a user from two different groups, confirming that members of the `sysadmin` group can perform an operation (`get pods` in the `kube-system` namespace) which non-members cannot.
The downside of writing tests in this way is that, because the test framework is not specifically designed for testing infrastructure, it's likely that you'll have to build some libraries to support your test code (such as the `can_i_get_pods` function in our example). Many of these will be built-in parts of dedicated IaC testing tools.
Up to now, we've mostly been discussing functional testing. Does your IaC code create the correct infrastructure setup for your needs?
Conformance testing (aka compliance testing) is a slightly different approach which tests whether the setup complies with the standards or rules we want to apply to our infrastructure.
Examples of this could include things like:
- Our docker containers should never run as `root`
- Our application servers should not be accessible from outside our VPC
- Users should not be able to launch pods on the master nodes of our kubernetes cluster
Automation around compliance testing usually involves checking IaC code to ensure it complies with defined policies and rules, and rejecting code which fails.
This is analogous to scanning tools like Sonarqube and Rubocop for application software, where the tool scans your code for known anti-patterns and vulnerabilities, and code which fails to meet a pre-defined quality threshold is automatically rejected.
One popular tool for conformance testing, particularly in kubernetes (although it is useful in other environments) is Open Policy Agent (OPA).
According to the project documentation, OPA is a "general-purpose policy engine." It includes its own policy language, Rego, in which you define the policies you want to enforce.
Let's look at an example, using OPA to apply some restrictions to a kubernetes cluster.
```rego
package ingress_clash

import data.kubernetes.ingresses

deny[msg] {
	input.request.kind.kind == "Ingress"
	id := concat("/", [input.request.object.metadata.namespace, input.request.object.metadata.name])
	host := input.request.object.spec.rules[_].host
	other_ingress := data.kubernetes.ingresses[other_namespace][other_name]
	id != concat("/", [other_namespace, other_name])
	host == other_ingress.spec.rules[_].host
	msg := sprintf("ingress host (%v) conflicts with ingress %v/%v", [host, other_namespace, other_name])
}
```
This policy ensures that no two ingresses in a kubernetes cluster are trying to handle traffic for the same hostname (this could be a problem in a cluster running multiple services, because it would be possible to accidentally "steal" traffic from a production service by defining the same hostname on a development ingress).
This line:

```rego
host == other_ingress.spec.rules[_].host
```

...evaluates to true, triggering the `deny`, if the `other_ingress` hostname matches our hostname, while this line:

```rego
id != concat("/", [other_namespace, other_name])
```

...stops the policy failing every time by excluding the comparison of an ingress with itself.
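To make the policy's logic concrete, here is the same check restated in ordinary Go. The `Ingress` struct and `clashes` function are my own illustrative stand-ins for the fields the Rego rules inspect:

```go
package main

import "fmt"

// Ingress is a minimal stand-in for the fields the policy inspects.
type Ingress struct {
	Namespace string
	Name      string
	Hosts     []string
}

func (i Ingress) ID() string { return i.Namespace + "/" + i.Name }

// clashes returns a message for every host on candidate that is already
// claimed by a *different* ingress -- the ID comparison mirrors the
// `id != concat(...)` line in the Rego policy.
func clashes(candidate Ingress, existing []Ingress) []string {
	var msgs []string
	for _, other := range existing {
		if other.ID() == candidate.ID() {
			continue // never compare an ingress with itself
		}
		for _, host := range candidate.Hosts {
			for _, otherHost := range other.Hosts {
				if host == otherHost {
					msgs = append(msgs, fmt.Sprintf("ingress host (%s) conflicts with ingress %s", host, other.ID()))
				}
			}
		}
	}
	return msgs
}

func main() {
	existing := []Ingress{{Namespace: "production", Name: "webapp", Hosts: []string{"app.example.com"}}}
	dev := Ingress{Namespace: "development", Name: "webapp", Hosts: []string{"app.example.com"}}
	fmt.Println(clashes(dev, existing))
}
```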
Rego has its own test framework, enabling you to write tests for your policies to ensure they have the effects you intend. Here is an example of some tests for the policy we created above (in a real-world scenario, you would need to duplicate these tests for `UPDATE` as well as `CREATE` operations).
```rego
package ingress_clash

# The policy above defines `deny`; this helper gives the tests a simple
# boolean to assert against.
denied {
	deny[_]
}

# generates an Ingress spec
new_ingress(namespace, name, host) = {
	"apiVersion": "extensions/v1beta1",
	"kind": "Ingress",
	"metadata": {
		"name": name,
		"namespace": namespace
	},
	"spec": {
		"rules": [{"host": host}]
	}
}

# generates an AdmissionReview payload (used to mock `input`)
new_admission_review(op, newObject, oldObject) = {
	"kind": "AdmissionReview",
	"apiVersion": "admission.k8s.io/v1beta1",
	"request": {
		"kind": {
			"kind": newObject.kind
		},
		"operation": op,
		"object": newObject,
		"oldObject": oldObject
	}
}

test_ingress_create_allowed {
	not denied with input as new_admission_review("CREATE", new_ingress("my-namespace", "ingress-1", "host-1.example.com"), null)
		with data.kubernetes.ingresses as {
			"my-namespace": {
				"ingress-2": new_ingress("my-namespace", "ingress-2", "host-2.example.com")
			}
		}
}

test_ingress_create_conflict {
	denied with input as new_admission_review("CREATE", new_ingress("my-namespace", "ingress-1", "same-host.example.com"), null)
		with data.kubernetes.ingresses as {
			"my-namespace": {
				"ingress-2": new_ingress("my-namespace", "ingress-2", "same-host.example.com")
			}
		}
}
```
The last part of OPA I want to talk about is Conftest.
Conftest extends the idea of compliance testing to a wide variety of structured data formats including kubernetes configuration files, Dockerfiles, and terraform.
Here's a simple example of using conftest on some terraform code.
In this case, we are enforcing a policy that any S3 buckets must have encryption enabled.
To start with, we need terraform code to create S3 buckets. We're going to create one bucket with server-side encryption, and one without:
variables.tf

```hcl
variable "aws_access_key" {
  default = ""
}

variable "aws_secret_key" {
  default = ""
}

variable "aws_region" {
  default = "us-west-2"
}
```
main.tf

```hcl
provider "aws" {
  access_key = var.aws_access_key
  secret_key = var.aws_secret_key
  region     = var.aws_region
}
```
s3.tf

```hcl
// S3 bucket with no encryption
resource "aws_s3_bucket" "cleartext-s3-bucket" {
  bucket = "cleartext-testing-terraform-with-conftest"
  acl    = "public"
  versioning {
    enabled = true
  }
}

// Encrypted S3 bucket
resource "aws_kms_key" "s3-bucket-key" {}

resource "aws_s3_bucket" "encrypted-s3-bucket" {
  bucket = "encrypted-testing-terraform-with-conftest"
  acl    = "public"
  versioning {
    enabled = true
  }
  server_side_encryption_configuration {
    rule {
      apply_server_side_encryption_by_default {
        kms_master_key_id = aws_kms_key.s3-bucket-key.arn
        sse_algorithm     = "aws:kms"
      }
    }
  }
}
```
As usual, we need to supply our AWS credentials as environment variables:

```shell
export TF_VAR_aws_access_key="XXXXXXXXXXXXXXXXXXXX"
export TF_VAR_aws_secret_key="XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
```

You will need valid AWS credentials if you want to run this code, even though we only need to run `terraform plan`.
Put all of these files in a directory, supplying valid AWS credentials, and run it like this:

```shell
terraform init
terraform plan
```
You should see output that includes this:

```
Terraform will perform the following actions:

  # aws_kms_key.s3-bucket-key will be created
  + resource "aws_kms_key" "s3-bucket-key" {
      + arn                      = (known after apply)
      + customer_master_key_spec = "SYMMETRIC_DEFAULT"
      + description              = (known after apply)
      + enable_key_rotation      = false
      + id                       = (known after apply)
      + is_enabled               = true
      + key_id                   = (known after apply)
      + key_usage                = "ENCRYPT_DECRYPT"
      + policy                   = (known after apply)
    }

  # aws_s3_bucket.cleartext-s3-bucket will be created
  + resource "aws_s3_bucket" "cleartext-s3-bucket" {
      + acceleration_status         = (known after apply)
      + acl                         = "public"
      + arn                         = (known after apply)
      + bucket                      = "cleartext-testing-terraform-with-conftest"
      + bucket_domain_name          = (known after apply)
      + bucket_regional_domain_name = (known after apply)
      + force_destroy               = false
      + hosted_zone_id              = (known after apply)
      + id                          = (known after apply)
      + region                      = (known after apply)
      + request_payer               = (known after apply)
      + website_domain              = (known after apply)
      + website_endpoint            = (known after apply)

      + versioning {
          + enabled    = true
          + mfa_delete = false
        }
    }

  # aws_s3_bucket.encrypted-s3-bucket will be created
  + resource "aws_s3_bucket" "encrypted-s3-bucket" {
      + acceleration_status         = (known after apply)
      + acl                         = "public"
      + arn                         = (known after apply)
      + bucket                      = "encrypted-testing-terraform-with-conftest"
      + bucket_domain_name          = (known after apply)
      + bucket_regional_domain_name = (known after apply)
      + force_destroy               = false
      + hosted_zone_id              = (known after apply)
      + id                          = (known after apply)
      + region                      = (known after apply)
      + request_payer               = (known after apply)
      + website_domain              = (known after apply)
      + website_endpoint            = (known after apply)

      + server_side_encryption_configuration {
          + rule {
              + apply_server_side_encryption_by_default {
                  + kms_master_key_id = (known after apply)
                  + sse_algorithm     = "aws:kms"
                }
            }
        }

      + versioning {
          + enabled    = true
          + mfa_delete = false
        }
    }

Plan: 3 to add, 0 to change, 0 to destroy.
```
Now that we have our terraform code, let's see how we can use conftest to check it against our bucket encryption policy.
Create a `policy` directory, and add this file:
policy/s3-encryption.rego

```rego
package main

encryption[config] {
	input.resource_changes[_].change.after.server_side_encryption_configuration = config
}

deny[msg] {
	encryption[[]]
	msg = "S3 bucket encryption settings must be specified."
}
```
This is a trivial policy which looks at the changes terraform is going to make, and alerts us if there are any resources where `server_side_encryption_configuration` is empty.
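Under the hood, this is a structural query over the plan JSON. As a mental model (not a substitute for conftest), this Go sketch applies the same rule to a hand-written fragment shaped like `terraform show -json` output; the `plan` type and `deniedBuckets` function are my own illustrative names:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// plan models only the fields our policy inspects in `terraform show -json` output.
type plan struct {
	ResourceChanges []struct {
		Address string `json:"address"`
		Change  struct {
			After struct {
				SSEConfig []json.RawMessage `json:"server_side_encryption_configuration"`
			} `json:"after"`
		} `json:"change"`
	} `json:"resource_changes"`
}

// deniedBuckets returns the addresses of resources whose
// server_side_encryption_configuration is present but empty -- the same
// condition the Rego policy's `encryption[[]]` clause matches.
func deniedBuckets(planJSON []byte) ([]string, error) {
	var p plan
	if err := json.Unmarshal(planJSON, &p); err != nil {
		return nil, err
	}
	var denied []string
	for _, rc := range p.ResourceChanges {
		if rc.Change.After.SSEConfig != nil && len(rc.Change.After.SSEConfig) == 0 {
			denied = append(denied, rc.Address)
		}
	}
	return denied, nil
}

func main() {
	// Hand-written fragment shaped like real plan JSON, for illustration only.
	planJSON := []byte(`{"resource_changes": [
		{"address": "aws_s3_bucket.cleartext-s3-bucket",
		 "change": {"after": {"server_side_encryption_configuration": []}}},
		{"address": "aws_s3_bucket.encrypted-s3-bucket",
		 "change": {"after": {"server_side_encryption_configuration": [{"rule": []}]}}}
	]}`)
	denied, err := deniedBuckets(planJSON)
	if err != nil {
		panic(err)
	}
	fmt.Println(denied)
}
```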
You can find more information about writing policies in the OPA documentation.
Conftest works by scanning the JSON output from `terraform plan`, which we can create like this:

```shell
terraform plan -out=plan.save
terraform show -json ./plan.save > plan.json
```
Now that we have our `plan.json` file, we can run conftest like this:

```shell
conftest test plan.json
```

conftest looks for policies in the `policy` directory by default. You can specify a different directory with the `--policy`/`-p` command-line option.
You should see output like this:

```
FAIL - plan.json - S3 bucket encryption settings must be specified.

1 test, 0 passed, 0 warnings, 1 failure, 0 exceptions
```
If you copy the `server_side_encryption_configuration` stanza into the cleartext bucket definition, and regenerate the `plan.json` file, the conftest output should change to:

```
1 test, 1 passed, 0 warnings, 0 failures, 0 exceptions
```
Although we've used terraform in this example to keep it simple, kubernetes lends itself particularly well to this kind of approach. As well as letting you find out what your IaC code says about how your infrastructure should be set up, the kubernetes API enables extensive introspection: you can scan a kubernetes cluster and find out exactly what is actually running, and how it's configured.
Tools like Sonobuoy make this easier, allowing you to automatically run reports on the setup of your cluster and the code running on it.
This is a huge topic, and I've barely scratched the surface with this article. But I hope I've shown you some of the tools and techniques available to allow you to apply some of the same testing rigor to your IaC code and configuration that you already apply to developing your application code.
Testing infrastructure code has some challenges, but by taking a layered approach, using a combination of techniques at different points in your infrastructure development lifecycle, you can gain a lot of confidence in your setup, and minimise the risk of later changes introducing errors or vulnerabilities.