
@digitalronin
Last active September 20, 2020 16:05

Automated testing tools for Infrastructure as Code

In software development, automated testing has long been accepted as best practice.

Test-driven development (TDD) or behavior-driven development (BDD) approaches often go as far as writing tests for functionality before it is implemented, and continuous integration and continuous delivery (CI/CD) pipelines are commonplace, automatically running tests against your codebase whenever a change is pushed to version control.

This kind of testing rigor - comprehensively and automatically exercising your code to ensure it does what it's supposed to do, and that recent changes haven't broken existing functionality (often referred to as introducing a "regression") - accelerates the software development process, so that teams can move faster, safely.

"Infrastructure as Code" (IaC) brings many of the benefits and tooling around modern software development to infrastructure - the servers, network components, backend storage and so on that our application code runs on in order to provide services to users. As this approach to infrastructure becomes more widespread, naturally people want to apply the same kind of testing rigor to their infrastructure code as their application code.

In this article I'm mainly going to discuss terraform as the code part of IaC, and AWS as the cloud provider. That's just because those are very common choices for defining and running infrastructure. The ideas and principles I'm talking about are also applicable to other IaC technologies and providers, and nothing here is meant as a criticism of either Terraform or AWS. Infrastructure testing is hard, whether you're deploying terraform code to AWS or Ansible code on Google Cloud. The same problems apply.

Problems with automated testing for IaC

Although they're both "code", testing infrastructure code is different from testing application code.

The main problem is time.

TDD/BDD works best when you have fast feedback. Very often, you'll have your tests running automatically in another window whenever you save your file, so you see any problems almost immediately.

When your code is creating infrastructure, this kind of fast feedback is impossible. Spinning up servers, creating virtual private clouds (VPCs), and setting up load-balancers takes time - often several minutes, depending on the type of infrastructure resource and the cloud provider.

For example, creating an AWS RDS instance takes around 20 minutes - longer if you're creating a cluster or setting up read replicas. That's not a criticism of AWS - all of the major cloud providers have similar limitations. Building infrastructure just takes time, even in a modern, cloud-centric environment.

Optimising for speed

There are ways you can minimise these delays - for instance you can use a cloud datastore instead of creating database servers, or launch docker containers instead of virtual servers - but they're impossible to eliminate. And the more your development environment diverges from your production infrastructure, the less reliable your IaC tests become.

Running code that creates a virtual server, or launches a pod in a kubernetes cluster, is never going to be as fast as instantiating objects in memory. So you're going to get feedback from your IaC tests much more slowly than from your application tests.

Emulating infrastructure

Another possibility for getting faster feedback from your IaC tests is emulation: rather than actually building the infrastructure your code defines, you use emulation to try to gain insight into its correctness.

A simple example of this would be to run terraform plan on your terraform code and see if it looks like it's going to do what you expect.

Although emulation can add value, the problem is that you're now exposed to multiple sources of error: mistakes in your IaC code, and inaccuracies in your emulation layer, where it doesn't provide a completely accurate representation of the behaviour of your infrastructure provider.

When you run terraform plan the output is saying, in effect, "These are the AWS API calls I made, and these are the results I expect those calls will have." Very often, terraform plan will be completely correct, but sometimes the API has behaviors that are not emulated completely correctly.

An example of this is parameter name lengths. It's quite common for terraform plan to be completely happy with some code, only for the AWS API to reject a particular call because the name assigned to, say, an RDS instance is too long.
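One cheap way to catch this particular class of error is a pre-apply validation step in your test suite. Here's a minimal Python sketch, assuming the documented AWS constraints on RDS DB instance identifiers (63-character limit, must start with a letter, no trailing or doubled hyphens - verify against the current AWS documentation):

```python
import re

# Assumed AWS constraints for RDS DB instance identifiers:
#   - at most 63 alphanumeric characters or hyphens
#   - must start with a letter
#   - must not end with a hyphen or contain two consecutive hyphens
MAX_RDS_IDENTIFIER_LENGTH = 63

def validate_rds_identifier(name):
    """Return a list of validation errors (empty if the name is valid)."""
    errors = []
    if len(name) > MAX_RDS_IDENTIFIER_LENGTH:
        errors.append(f"'{name}' is longer than {MAX_RDS_IDENTIFIER_LENGTH} characters")
    if not re.match(r"^[A-Za-z]", name):
        errors.append(f"'{name}' must start with a letter")
    if name.endswith("-") or "--" in name:
        errors.append(f"'{name}' must not end with a hyphen or contain '--'")
    return errors

print(validate_rds_identifier("my-database"))  # []
print(validate_rds_identifier("x" * 80))       # one length error
```

A check like this runs in milliseconds, so it can sit alongside your fast application tests and catch a whole class of failures before you ever call the real API.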

That's just one example that I happen to have seen a lot. I'm sure there are others, and again, this isn't a criticism of terraform in particular - emulating the entire API of a cloud provider is a huge task, and it's not surprising that (as far as I know) nobody does it perfectly.

Cost

For the sake of completeness, I should mention that running a suite of IaC tests that actually builds and then tears down cloud infrastructure generally costs a lot more than running application tests, where you're only paying for the compute power to execute your test code. However, the hosting cost of briefly running some test infrastructure is almost always much smaller than the cost of engineers' time spent finding and fixing problems that better testing could have avoided.

IaC Tools

Testing IaC code is slower and more difficult than testing application code, but it's still important, and the better you test your IaC code, the fewer problems you're going to have. So, how do you do it?

Dedicated IaC testing tools

Whichever IaC technology you're using, there are usually several dedicated tools designed to help you create automated tests for it.

Some examples include terratest for terraform, litmus for testing Puppet modules, and test-kitchen for testing Chef code (these are examples rather than endorsements).

The landscape of IaC testing tools changes quite quickly, so it's worth doing some basic due diligence to ensure that any tool you're planning to use is still being actively supported and developed, before you sink a lot of time and effort into it.

Testing AWS infrastructure code with terratest

Let's walk through a very simple example (inspired by this terratest example) of using terratest to test some IaC code.

We're going to create some terraform code that deploys an EC2 instance, and some terratest code to check that our terraform code tags our instance the way we want it to.

In an empty directory, create two directories terraform and test, and the following files:

This code is for terraform version 0.13.3 (the latest version at time of writing).

terraform/variables.tf

variable "aws_access_key" {
  default = ""
}

variable "aws_secret_key" {
  default = ""
}

variable "aws_region" {
  default = "us-west-2"
}

variable "instance_name" {
  default = "Webserver"
}

terraform/main.tf

provider "aws" {
  access_key = var.aws_access_key
  secret_key = var.aws_secret_key
  region     = var.aws_region
}

terraform/outputs.tf

output "instance_id" {
  value = aws_instance.webserver.id
}

terraform/ec2_instance.tf

resource "aws_instance" "webserver" {
  ami           = "ami-0841edc20334f9287"  // Amazon Linux AMI
  instance_type = "t2.micro"

  tags = {
    Name = var.instance_name
  }
}

We need AWS credentials to run this code:

export TF_VAR_aws_access_key="XXXXXXXXXXXXXXXXXXXX"
export TF_VAR_aws_secret_key="XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
export AWS_ACCESS_KEY_ID=${TF_VAR_aws_access_key}
export AWS_SECRET_ACCESS_KEY=${TF_VAR_aws_secret_key}

We need the TF_VAR_ environment variables for our terraform code, but terratest also needs the AWS credentials to query the API and get details of our EC2 instance, so we need the same values with different environment variable names.

If you run this code, it will launch an EC2 instance with a Name tag with the value Webserver.

Don't forget to terraform destroy anything you create, or you may incur charges from AWS.

Now let's add a test to check that our terraform code applies the Name tag correctly:

test/ec2_instance_test.go

package test

import (
	"fmt"
	"testing"

	"github.com/gruntwork-io/terratest/modules/aws"
	"github.com/gruntwork-io/terratest/modules/random"
	"github.com/gruntwork-io/terratest/modules/terraform"
	"github.com/stretchr/testify/assert"
)

func TestEC2Instance(t *testing.T) {
	t.Parallel()

	// Use a unique name, so we don't affect any of our "real" instances
	expectedName := fmt.Sprintf("terratest-%s", random.UniqueId())

	terraformOptions := &terraform.Options{
		TerraformDir: "../terraform",

		// Variables to pass to our Terraform code using -var options
		Vars: map[string]interface{}{
			"instance_name": expectedName,
		},
	}

	defer terraform.Destroy(t, terraformOptions)

	terraform.InitAndApply(t, terraformOptions)

	instanceID := terraform.Output(t, terraformOptions, "instance_id")

	awsRegion := "us-west-2" // Must match ../terraform/variables.tf "aws_region"

	instanceTags := aws.GetTagsForEc2Instance(t, awsRegion, instanceID)

	nameTag, containsNameTag := instanceTags["Name"]
	assert.True(t, containsNameTag)
	assert.Equal(t, expectedName, nameTag)
}

To run this test, cd into your test directory and run:

go mod init your_github_name/terratest_example  # This just needs to be a unique module path
go test -v -timeout 30m

The first time you run this, it will download all the packages it needs, and then it will start to apply the terraform code. You can follow along in the AWS console and watch it create and then destroy an AWS instance.

At the end of the test run, you should see something like this (the time taken to run the test will be different):

--- PASS: TestEC2Instance (106.38s)
PASS
ok      your_github_name/terratest_example  106.807s

Try breaking the terraform code (e.g. by changing Name to InstanceName), and the test should fail.

Some of the benefits of terratest are:

  • You write your tests in go, which is a language with wide adoption in the devops community. Both terraform and kubernetes are written in go, along with many other infrastructure tools, so lots of engineers are likely to have experience with the language (or be keen to acquire some).
  • It supports a wide range of target platforms, including cloud servers on AWS and GCP, as well as kubernetes, packer machine images, and docker images.

On the other hand, the built-in go test framework, which terratest sits on top of, is quite basic compared with test frameworks in other languages such as ruby's RSpec (which underpins serverspec), and this can result in test code which is more verbose and harder to maintain.

Use your existing test framework

If you already have experience writing tests for your application code, in whatever language that's written in, there's no reason you can't write your IaC tests in the same framework. Automated testing for IaC does have specific challenges, as discussed earlier, but fundamentally you're still setting up pre-conditions, making a change, and testing to see if you got the correct result.

Writing your IaC tests using the same framework as your application tests leverages the existing skills and experience of your engineering team, and allows you to take advantage of all the features and tooling you're used to.

Here's a simple example of using RSpec to test a kubernetes cluster:

namespace_spec.rb

require "spec_helper"

describe "accessing namespace" do
  def can_i_get_pods(namespace, group)
    `kubectl auth can-i get pod --namespace #{namespace} --as test --as-group :#{group} --as-group system:authenticated`.chomp
  end

  let(:namespace) { "kube-system" }

  context "when group is sysadmin" do
    let(:group) { "sysadmin" }

    it "allows access to pods" do
      result = can_i_get_pods(namespace, group)
      expect(result).to eq("yes")
    end
  end

  context "when group is not sysadmin" do
    let(:group) { "not-sysadmin" }

    it "does not allow access to pods" do
      result = can_i_get_pods(namespace, group)
      expect(result).to eq("no")
    end
  end
end

Here, we are testing namespace access rules by using the --as flag to kubectl to "impersonate" a user from two different groups, and confirm that members of the sysadmin group can perform an operation (get pods in the kube-system namespace), which non-members cannot.

The downside of writing tests in this way is that, because the test framework is not specifically designed for testing infrastructure, it's likely that you'll have to build some libraries to support your test code (such as the can_i_get_pods function in our example). Many of these will be built-in parts of dedicated IaC testing tools.
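As a sketch of what such a helper looks like under the hood, here's a Python version of a can_i_get_pods function (purely illustrative - the flags mirror the kubectl invocation in the RSpec example above, and the function names are my own):

```python
import subprocess

def build_can_i_get_pods_command(namespace, group):
    """Build the kubectl command used to check pod access for a group."""
    return [
        "kubectl", "auth", "can-i", "get", "pod",
        "--namespace", namespace,
        "--as", "test",
        "--as-group", group,
        "--as-group", "system:authenticated",
    ]

def can_i_get_pods(namespace, group):
    """Run the check and return kubectl's answer: 'yes' or 'no'."""
    result = subprocess.run(
        build_can_i_get_pods_command(namespace, group),
        capture_output=True, text=True,
    )
    return result.stdout.strip()
```

Keeping the command-building logic in a separate pure function means that part, at least, can be unit-tested without a cluster.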

Conformance / Compliance Testing

Up to now, we've mostly been discussing functional testing. Does your IaC code create the correct infrastructure setup for your needs?

Conformance testing (aka compliance testing) is a slightly different approach which tests whether the setup complies with the standards or rules we want to apply to our infrastructure.

Examples of this could include things like:

  • Our docker containers should never run as root
  • Our application servers should not be accessible from outside our VPC
  • Users should not be able to launch pods on the master nodes of our kubernetes cluster

Automation around compliance testing usually involves checking IaC code to ensure it complies with defined policies and rules, and rejecting code which fails.

This is analogous to scanning tools like Sonarqube and Rubocop for application software, where the tool scans your code for known anti-patterns and vulnerabilities, and code which fails to meet a pre-defined quality threshold is automatically rejected.

Open Policy Agent (OPA)

One popular tool for conformance testing, particularly in kubernetes (although it is useful in other environments) is Open Policy Agent (OPA).

According to the project documentation, OPA is a "general-purpose policy engine." It includes its own policy language, Rego, in which you define the policies you want to enforce.

Let's look at an example, using OPA to apply some restrictions to a kubernetes cluster.

package ingress_clash

import data.kubernetes.ingresses

deny[msg] {
  input.request.kind.kind == "Ingress"

  id := concat("/", [input.request.object.metadata.namespace, input.request.object.metadata.name])

  host := input.request.object.spec.rules[_].host

  other_ingress := data.kubernetes.ingresses[other_namespace][other_name]

  id != concat("/", [other_namespace, other_name])

  host == other_ingress.spec.rules[_].host

  msg := sprintf("ingress host (%v) conflicts with ingress %v/%v", [host, other_namespace, other_name])
}

This policy ensures that no two ingresses in a kubernetes cluster are trying to handle traffic for the same hostname (this could be a problem in a cluster running multiple services, because it would be possible to accidentally "steal" traffic from a production service by defining the same hostname on a development ingress).

This line:

host == other_ingress.spec.rules[_].host

...evaluates to true, triggering the deny, if the other_ingress hostname matches our hostname, while this line:

id != concat("/", [other_namespace, other_name])

...ensures that the policy doesn't fire every time, by excluding the case where an ingress is compared with itself.
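The logic is perhaps easier to follow outside of Rego syntax. Here's a rough Python equivalent (illustrative only - it operates on a plain dict of ingresses rather than a live admission request):

```python
def find_host_conflicts(ingresses):
    """Given {namespace: {name: ingress_manifest}}, return conflict
    messages for any two distinct ingresses claiming the same host."""
    # Flatten to (id, host) pairs, one per ingress rule
    entries = [
        (f"{ns}/{name}", rule["host"])
        for ns, by_name in ingresses.items()
        for name, ing in by_name.items()
        for rule in ing["spec"]["rules"]
    ]
    conflicts = []
    for ing_id, host in entries:
        for other_id, other_host in entries:
            # Skip the self-comparison, just as `id != ...` does in Rego
            if ing_id != other_id and host == other_host:
                conflicts.append(
                    f"ingress host ({host}) conflicts with ingress {other_id}"
                )
    return conflicts
```

Note how the `if ing_id != other_id` guard plays exactly the role of the `id != concat(...)` line in the Rego policy.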

Rego has its own test framework, enabling you to write tests for your policies to ensure they have the effects you intend. Here is an example of some tests for the policy we created above (in a real-world scenario, you would need to duplicate these tests for UPDATE as well as CREATE operations).

package ingress_clash

# generates an Ingress spec
new_ingress(namespace, name, host) = {
  "apiVersion": "extensions/v1beta1",
  "kind": "Ingress",
  "metadata": {
    "name": name,
    "namespace": namespace
  },
  "spec": {
    "rules": [{ "host": host }]
  }
}

# generates an AdmissionReview payload (used to mock `input`)
new_admission_review(op, newObject, oldObject) = {
  "kind": "AdmissionReview",
  "apiVersion": "admission.k8s.io/v1beta1",
  "request": {
    "kind": {
      "kind": newObject.kind
    },
    "operation": op,
    "object": newObject,
    "oldObject": oldObject
  }
}

test_ingress_create_allowed {
  not denied
    with input as new_admission_review("CREATE", new_ingress("my-namespace", "ingress-1", "host-1.example.com"), null)
    with data.kubernetes.ingresses as {
      "my-namespace": {
        "ingress-2": new_ingress("my-namespace", "ingress-2", "host-2.example.com")
      }
    }
}

test_ingress_create_conflict {
  denied
    with input as new_admission_review("CREATE", new_ingress("my-namespace", "ingress-1", "same-host.example.com"), null)
    with data.kubernetes.ingresses as {
      "my-namespace": {
        "ingress-2": new_ingress("my-namespace", "ingress-2", "same-host.example.com")
      }
    }
}

Conftest

The last part of OPA I want to talk about is Conftest.

Conftest extends the idea of compliance testing to a wide variety of structured data formats including kubernetes configuration files, Dockerfiles, and terraform.

Here's a simple example of using conftest on some terraform code.

In this case, we are enforcing a policy that any S3 buckets must have encryption enabled.

To start with, we need terraform code to create S3 buckets. We're going to create one bucket with server-side encryption, and one without:

variables.tf

variable "aws_access_key" {
  default = ""
}

variable "aws_secret_key" {
  default = ""
}

variable "aws_region" {
  default = "us-west-2"
}

main.tf

provider "aws" {
  access_key = var.aws_access_key
  secret_key = var.aws_secret_key
  region     = var.aws_region
}

s3.tf

// S3 bucket with no encryption

resource "aws_s3_bucket" "cleartext-s3-bucket" {
  bucket = "cleartext-testing-terraform-with-conftest"
  acl    = "private"
  versioning {
    enabled = true
  }
}

// Encrypted S3 bucket

resource "aws_kms_key" "s3-bucket-key" {}

resource "aws_s3_bucket" "encrypted-s3-bucket" {
  bucket = "encrypted-testing-terraform-with-conftest"
  acl    = "private"
  versioning {
    enabled = true
  }
  server_side_encryption_configuration {
    rule {
      apply_server_side_encryption_by_default {
        kms_master_key_id = aws_kms_key.s3-bucket-key.arn
        sse_algorithm     = "aws:kms"
      }
    }
  }
}

As usual, we need to supply our AWS credentials as environment variables:

export TF_VAR_aws_access_key="XXXXXXXXXXXXXXXXXXXX"
export TF_VAR_aws_secret_key="XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"

You will need valid AWS credentials if you want to run this code, even though we only need to run terraform plan.

Put all of these files in a directory and run:

terraform init
terraform plan

You should see output that includes this:

Terraform will perform the following actions:

  # aws_kms_key.s3-bucket-key will be created
  + resource "aws_kms_key" "s3-bucket-key" {
      + arn                      = (known after apply)
      + customer_master_key_spec = "SYMMETRIC_DEFAULT"
      + description              = (known after apply)
      + enable_key_rotation      = false
      + id                       = (known after apply)
      + is_enabled               = true
      + key_id                   = (known after apply)
      + key_usage                = "ENCRYPT_DECRYPT"
      + policy                   = (known after apply)
    }

  # aws_s3_bucket.cleartext-s3-bucket will be created
  + resource "aws_s3_bucket" "cleartext-s3-bucket" {
      + acceleration_status         = (known after apply)
      + acl                         = "private"
      + arn                         = (known after apply)
      + bucket                      = "cleartext-testing-terraform-with-conftest"
      + bucket_domain_name          = (known after apply)
      + bucket_regional_domain_name = (known after apply)
      + force_destroy               = false
      + hosted_zone_id              = (known after apply)
      + id                          = (known after apply)
      + region                      = (known after apply)
      + request_payer               = (known after apply)
      + website_domain              = (known after apply)
      + website_endpoint            = (known after apply)

      + versioning {
          + enabled    = true
          + mfa_delete = false
        }
    }

  # aws_s3_bucket.encrypted-s3-bucket will be created
  + resource "aws_s3_bucket" "encrypted-s3-bucket" {
      + acceleration_status         = (known after apply)
      + acl                         = "private"
      + arn                         = (known after apply)
      + bucket                      = "encrypted-testing-terraform-with-conftest"
      + bucket_domain_name          = (known after apply)
      + bucket_regional_domain_name = (known after apply)
      + force_destroy               = false
      + hosted_zone_id              = (known after apply)
      + id                          = (known after apply)
      + region                      = (known after apply)
      + request_payer               = (known after apply)
      + website_domain              = (known after apply)
      + website_endpoint            = (known after apply)

      + server_side_encryption_configuration {
          + rule {
              + apply_server_side_encryption_by_default {
                  + kms_master_key_id = (known after apply)
                  + sse_algorithm     = "aws:kms"
                }
            }
        }

      + versioning {
          + enabled    = true
          + mfa_delete = false
        }
    }

Plan: 3 to add, 0 to change, 0 to destroy.

Now that we have our terraform code, let's see how we can use conftest to check it against our bucket encryption policy.

Create a policy directory, and add this file:

policy/s3-encryption.rego

package main

encryption[config] {
  input.resource_changes[_].change.after.server_side_encryption_configuration = config
}

deny[msg] {
  encryption[[]]
  msg = "S3 bucket encryption settings must be specified."
}

This is a trivial policy which looks at the changes terraform is going to make, and alerts us if there are any resources where server_side_encryption_configuration is empty.
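To see what the policy is actually matching against, here's a rough Python equivalent of the same check, run over the parsed plan.json (illustrative only - conftest and Rego do this evaluation for us):

```python
import json

def deny_unencrypted_buckets(plan):
    """Return a denial message for each planned resource whose
    server_side_encryption_configuration is empty."""
    msgs = []
    for change in plan.get("resource_changes", []):
        after = change.get("change", {}).get("after") or {}
        if after.get("server_side_encryption_configuration") == []:
            msgs.append("S3 bucket encryption settings must be specified.")
    return msgs

# Usage: deny_unencrypted_buckets(json.load(open("plan.json")))
```

The policy's `encryption[[]]` clause corresponds to the `== []` comparison here: an unencrypted bucket appears in the plan with an empty server_side_encryption_configuration list.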

You can find more information about writing policies in the OPA documentation.

Conftest works by scanning the JSON output from terraform plan, which we can create like this:

terraform plan -out=plan.save
terraform show -json ./plan.save > plan.json

Now that we have our plan.json file, we can run conftest like this:

conftest test plan.json

conftest looks for policies in the policy directory by default. You can specify a different directory with the --policy/-p command-line option.

You should see output like this:

FAIL - plan.json - S3 bucket encryption settings must be specified.

1 test, 0 passed, 0 warnings, 1 failure, 0 exceptions

If you copy the server_side_encryption_configuration stanza into the cleartext bucket definition, and regenerate the plan.json file, the conftest output should change to:

1 test, 1 passed, 0 warnings, 0 failures, 0 exceptions

Kubernetes and conformance testing

Although we've used terraform in this example to keep things simple, Kubernetes lends itself particularly well to this kind of approach. As well as telling you what your IaC code says about how your infrastructure should be set up, the kubernetes API enables extensive introspection: you can scan a kubernetes cluster and find out exactly what is actually running, and how it's configured.
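For example, one of the policies mentioned earlier - containers should never run as root - can be checked against the live cluster rather than the code. Here's a sketch of the core check, operating on pod manifests as returned by the kubernetes API (the field names follow the pod spec schema; fetching the pods is left out):

```python
def pods_running_as_root(pods):
    """Return names of pods with any container that may run as root.
    `pods` is a list of pod manifests (dicts) as returned by the API."""
    offenders = []
    for pod in pods:
        spec = pod["spec"]
        pod_ctx = spec.get("securityContext", {})
        for container in spec["containers"]:
            # Container-level securityContext overrides the pod-level one
            ctx = {**pod_ctx, **container.get("securityContext", {})}
            # Treat a missing or false runAsNonRoot as a potential violation
            if not ctx.get("runAsNonRoot", False):
                offenders.append(pod["metadata"]["name"])
                break
    return offenders
```

Because this inspects the cluster's actual state rather than its intended state, it will also catch resources that were created outside your IaC pipeline.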

Tools like Sonobuoy make this easier, allowing you to automatically run reports on the setup of your cluster and the code running on it.

Conclusion

This is a huge topic, and I've barely scratched the surface with this article. But I hope I've shown you some of the tools and techniques available to allow you to apply some of the same testing rigor to your IaC code and configuration that you already apply to developing your application code.

Testing infrastructure code has some challenges, but by taking a layered approach, using a combination of techniques at different points in your infrastructure development lifecycle, you can gain a lot of confidence in your setup, and minimise the risk of later changes introducing errors or vulnerabilities.
