In software development, automated testing has long been accepted as best practice.
Test-driven development (TDD) or behavior-driven development (BDD) approaches often go as far as writing tests for functionality before it is implemented, and continuous integration and continuous delivery (CI/CD) pipelines are commonplace, automatically running tests against your codebase whenever a change is pushed to version control.
This kind of testing rigor, where your code is comprehensively and automatically exercised to ensure it does what it's supposed to do, and that recent changes haven't broken existing functionality (often referred to as introducing a "regression"), accelerates the software development process so that teams can go faster safely.
"Infrastructure as Code" (IaC) brings many of the benefits and tooling around modern software development to infrastructure - the servers, network components, backend storage and so on that our application code runs on in order to provide services to users. As this approach to infrastructure becomes more widespread, naturally people want to apply the same kind of testing rigor to their infrastructure code as their application code.
In this article I'm mainly going to discuss terraform as the code part of IaC, and AWS as the cloud provider. That's just because those are very common choices for defining and running infrastructure. The ideas and principles I'm talking about are also applicable to other IaC technologies and providers, and nothing here is meant as a criticism of either Terraform or AWS. Infrastructure testing is hard, whether you're deploying terraform code to AWS or Ansible code on Google Cloud. The same problems apply.
Although they're both "code", testing infrastructure code is different from testing application code.
The main problem is time.
TDD/BDD works best when you have fast feedback. Very often, you'll have your tests running automatically in another window whenever you save your file, so you see any problems almost immediately.
When your code is creating infrastructure, this kind of fast feedback is impossible. Spinning up servers, creating virtual private clouds (VPCs), and setting up load-balancers takes time - often several minutes, depending on the type of infrastructure resource and the cloud provider.
For example, creating an AWS RDS instance takes around 20 minutes - longer if you're creating a cluster or setting up read replicas. That's not a criticism of AWS - all of the major cloud providers have similar limitations. Building infrastructure just takes time, even in a modern, cloud-centric environment.
There are ways you can minimise these delays - for instance you can use a cloud datastore instead of creating database servers, or launch docker containers instead of virtual servers - but they're impossible to eliminate. And the more your development environment diverges from your production infrastructure, the less reliable your IaC tests become.
Running code which creates a virtual server, or launches a pod in a kubernetes cluster, is never going to be as fast as instantiating objects in memory. So, you're going to get feedback from your IaC tests much more slowly than from your application tests.
Another possibility for faster feedback from your IaC tests is emulation: rather than actually building the infrastructure your code defines, you use emulation to try to gain insights into its correctness.
A simple example of this would be to run `terraform plan` on your terraform code and see if it looks like it's going to do what you expect.
Although emulation can add value, the problem is that you're now exposed to multiple sources of error: mistakes in your IaC code, and errors in the emulation layer, where it doesn't provide a completely accurate representation of the behaviour of your infrastructure provider.
When you run `terraform plan`, the output is saying, in effect, "These are the AWS API calls I made, and these are the results I expect those calls will have." Very often, `terraform plan` will be completely correct, but sometimes the API has behaviours that are not emulated completely correctly.
An example of this is parameter name lengths. It's quite common for `terraform plan` to be completely happy with some code, but then the AWS API rejects a particular call because the name assigned to, say, an RDS instance, is too long.
That's just one example, that I happen to have seen a lot. I'm sure there are others, and again, this isn't a criticism of terraform in particular - emulating the entire API of a cloud provider is a huge task, and it's not surprising that (as far as I know), nobody does it perfectly.
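One cheap way to catch this class of error before anything touches the API is a pre-flight check in ordinary code. Here's a Go sketch; the function name is mine, and the 1-63 character range reflects the documented constraints on RDS DB instance identifiers (check the current AWS documentation before relying on it):

```go
package main

import (
	"fmt"
	"regexp"
)

// A simplified version of the documented constraints on RDS DB instance
// identifiers: start with a letter, then letters, digits and hyphens only.
// (The real rules also forbid trailing and consecutive hyphens.)
var rdsIdentifierPattern = regexp.MustCompile(`^[a-zA-Z][a-zA-Z0-9-]*$`)

// validRDSIdentifier reports whether a name would be accepted by the AWS
// API as an RDS DB instance identifier, under the simplified rules above.
func validRDSIdentifier(name string) error {
	if len(name) < 1 || len(name) > 63 {
		return fmt.Errorf("identifier %q must be 1-63 characters, got %d", name, len(name))
	}
	if !rdsIdentifierPattern.MatchString(name) {
		return fmt.Errorf("identifier %q must start with a letter and contain only letters, digits and hyphens", name)
	}
	return nil
}

func main() {
	for _, name := range []string{
		"webserver-db",
		"a-very-long-name-that-goes-on-and-on-and-on-and-on-and-far-past-sixty-three-characters",
	} {
		if err := validRDSIdentifier(name); err != nil {
			fmt.Println("FAIL:", err)
		} else {
			fmt.Println("OK:  ", name)
		}
	}
}
```

A check like this runs in microseconds, so it can sit in a pre-commit hook or an early CI stage, long before a twenty-minute RDS build fails.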
For the sake of completeness, I should mention that running a suite of IaC tests that actually builds and then tears down cloud infrastructure generally costs a lot more than running application tests, where you're just paying for the compute power to execute your test code. However, the hosting cost of briefly running some test infrastructure is almost always much smaller than the cost of engineers' time spent finding and fixing problems that could have been avoided with better testing.
Testing IaC code is slower and more difficult than testing application code, but it's still important, and the better you test your IaC code, the fewer problems you're going to have. So, how do you do it?
Whichever IaC technology you're using, there are usually several dedicated tools designed to help you create automated tests for it.
Some examples include terratest for terraform, litmus for testing Puppet modules, and test-kitchen for testing Chef code (these are examples, rather than endorsements).
The landscape of IaC testing tools changes quite quickly, so it's worth doing some basic due diligence to ensure that any tool you're planning to use is still being actively supported and developed, before you sink a lot of time and effort into it.
Let's walk through a very simple example (~~ripped off from~~ inspired by this terratest example) of using terratest to test some IaC code.
We're going to create some terraform code that deploys an EC2 instance, and some terratest code to check that our terraform code tags our instance the way we want it to.
In an empty directory, create two directories, `terraform` and `test`, and the following files:
This code is for terraform version 0.13.3 (the latest version at time of writing).
terraform/variables.tf

```hcl
variable "aws_access_key" {
  default = ""
}

variable "aws_secret_key" {
  default = ""
}

variable "aws_region" {
  default = "us-west-2"
}

variable "instance_name" {
  default = "Webserver"
}
```
terraform/main.tf

```hcl
provider "aws" {
  access_key = var.aws_access_key
  secret_key = var.aws_secret_key
  region     = var.aws_region
}
```
terraform/outputs.tf

```hcl
output "instance_id" {
  value = aws_instance.webserver.id
}
```
terraform/ec2_instance.tf

```hcl
resource "aws_instance" "webserver" {
  ami           = "ami-0841edc20334f9287" // AWS Linux AMI
  instance_type = "t2.micro"
  tags = {
    Name = var.instance_name
  }
}
```
We need AWS credentials to run this code:

```shell
export TF_VAR_aws_access_key="XXXXXXXXXXXXXXXXXXXX"
export TF_VAR_aws_secret_key="XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
export AWS_ACCESS_KEY_ID=${TF_VAR_aws_access_key}
export AWS_SECRET_ACCESS_KEY=${TF_VAR_aws_secret_key}
```
We need the `TF_VAR_` environment variables for our terraform code, but terratest also needs the AWS credentials to query the API and get details of our EC2 instance, so we need the same values with different environment variable names.
If you run this code, it will launch an EC2 instance with a `Name` tag with the value `Webserver`.

Don't forget to `terraform destroy` anything you create, or you may incur charges from AWS.
Now let's add a test to check that our terraform code applies the `Name` tag correctly:
test/ec2_instance_test.go

```go
package test

import (
	"fmt"
	"testing"

	"github.com/gruntwork-io/terratest/modules/aws"
	"github.com/gruntwork-io/terratest/modules/random"
	"github.com/gruntwork-io/terratest/modules/terraform"
	"github.com/stretchr/testify/assert"
)

func TestEC2Instance(t *testing.T) {
	t.Parallel()

	// Use a unique name, so we don't affect any of our "real" instances
	expectedName := fmt.Sprintf("terratest-%s", random.UniqueId())

	terraformOptions := &terraform.Options{
		TerraformDir: "../terraform",

		// Variables to pass to our Terraform code using -var options
		Vars: map[string]interface{}{
			"instance_name": expectedName,
		},
	}

	defer terraform.Destroy(t, terraformOptions)
	terraform.InitAndApply(t, terraformOptions)

	instanceID := terraform.Output(t, terraformOptions, "instance_id")
	awsRegion := "us-west-2" // Must match ../terraform/variables.tf "aws_region"

	instanceTags := aws.GetTagsForEc2Instance(t, awsRegion, instanceID)

	nameTag, containsNameTag := instanceTags["Name"]
	assert.True(t, containsNameTag)
	assert.Equal(t, expectedName, nameTag)
}
```
To run this test, cd into your `test` directory and run:

```shell
go mod init your_github_name/terratest_example # This just needs to be a unique package name
go test -v -timeout 30m
```
The first time you run this, it will download all the packages it needs, and then it will start to apply the terraform code. You can follow along in the AWS console and watch it create and then destroy an AWS instance.
At the end of the test run, you should see something like this (the time taken to run the test will be different):
```
--- PASS: TestEC2Instance (106.38s)
PASS
ok      your_github_name/terratest_example      106.807s
```
Try breaking the terraform code (e.g. by changing `Name` to `InstanceName`), and the test should fail.
Some of the benefits of terratest are:
- You write your tests in go, which is a language with wide adoption in the devops community. Both terraform and kubernetes are written in go, along with many other infrastructure tools, so lots of engineers are likely to have experience with the language (or be keen to acquire some).
- It supports a wide range of target platforms, including cloud servers on AWS and GCP, as well as kubernetes, packer machine images, and docker images.
On the other hand, the built-in go test framework, which terratest sits on top of, is quite basic compared with test frameworks in other languages such as ruby's RSpec (which underpins serverspec), and this can result in test code which is more verbose and harder to maintain.
If you already have experience writing tests for your application code, in whatever language that's written in, there's no reason you can't write your IaC tests in the same framework. Automated testing for IaC does have specific challenges, as discussed earlier, but fundamentally you're still setting up pre-conditions, making a change, and testing to see if you got the correct result.
Writing your IaC tests using the same framework as your application tests leverages the existing skills and experience of your engineering team, and allows you to take advantage of all the features and tooling you're used to.
Here's a simple example of using RSpec to test a kubernetes cluster:
namespace_spec.rb

```ruby
require "spec_helper"

describe "accessing namespace" do
  def can_i_get_pods(namespace, group)
    `kubectl auth can-i get pod --namespace #{namespace} --as test --as-group :#{group} --as-group system:authenticated`.chomp
  end

  let(:namespace) { "kube-system" }

  context "when group is sysadmin" do
    let(:group) { "sysadmin" }

    it "allows access to pods" do
      result = can_i_get_pods(namespace, group)
      expect(result).to eq("yes")
    end
  end

  context "when group is not sysadmin" do
    let(:group) { "not-sysadmin" }

    it "does not allow access to pods" do
      result = can_i_get_pods(namespace, group)
      expect(result).to eq("no")
    end
  end
end
```
Here, we are testing namespace access rules by using the `--as` flag to `kubectl` to "impersonate" a user from two different groups, confirming that members of the `sysadmin` group can perform an operation (`get pods` in the `kube-system` namespace) which non-members cannot.
The downside of writing tests in this way is that, because the test framework is not specifically designed for testing infrastructure, it's likely that you'll have to build some libraries to support your test code (such as the `can_i_get_pods` function in our example). Many of these will be built-in parts of dedicated IaC testing tools.
Up to now, we've mostly been discussing functional testing. Does your IaC code create the correct infrastructure setup for your needs?
Conformance testing (aka compliance testing) is a slightly different approach which tests whether the setup complies with the standards or rules we want to apply to our infrastructure.
Examples of this could include things like:
- Our docker containers should never run as `root`
- Our application servers should not be accessible from outside our VPC
- Users should not be able to launch pods on the master nodes of our kubernetes cluster
Automation around compliance testing usually involves checking IaC code to ensure it complies with defined policies and rules, and rejecting code which fails.
This is analogous to scanning tools like Sonarqube and Rubocop for application software, where the tool scans your code for known anti-patterns and vulnerabilities, and code which fails to meet a pre-defined quality threshold is automatically rejected.
One popular tool for conformance testing, particularly in kubernetes (although it is useful in other environments) is Open Policy Agent (OPA).
According to the project documentation, OPA is a "general-purpose policy engine." It includes its own policy language, Rego, in which you define the policies you want to enforce.
Let's look at an example, using OPA to apply some restrictions to a kubernetes cluster.
```rego
package ingress_clash

import data.kubernetes.ingresses

deny[msg] {
	input.request.kind.kind == "Ingress"
	id := concat("/", [input.request.object.metadata.namespace, input.request.object.metadata.name])
	host := input.request.object.spec.rules[_].host
	other_ingress := data.kubernetes.ingresses[other_namespace][other_name]
	id != concat("/", [other_namespace, other_name])
	host == other_ingress.spec.rules[_].host
	msg := sprintf("ingress host (%v) conflicts with ingress %v/%v", [host, other_namespace, other_name])
}
```
This policy ensures that no two ingresses in a kubernetes cluster are trying to handle traffic for the same hostname (this could be a problem in a cluster running multiple services, because it would be possible to accidentally "steal" traffic from a production service by defining the same hostname on a development ingress).
This line:

```rego
host == other_ingress.spec.rules[_].host
```

...evaluates to true, triggering the `deny`, if the `other_ingress` hostname matches our hostname, while this line:

```rego
id != concat("/", [other_namespace, other_name])
```

...stops the policy failing every time by excluding the comparison of an ingress with itself.
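To make the policy's logic concrete, here is the same check restated in ordinary Go. The `Ingress` struct and `clashes` function are my own illustrative stand-ins for the fields the Rego rules inspect:

```go
package main

import "fmt"

// Ingress is a minimal stand-in for the fields the policy inspects.
type Ingress struct {
	Namespace string
	Name      string
	Hosts     []string
}

func (i Ingress) ID() string { return i.Namespace + "/" + i.Name }

// clashes returns a message for every host on candidate that is already
// claimed by a *different* ingress -- the ID comparison mirrors the
// `id != concat(...)` line in the Rego policy.
func clashes(candidate Ingress, existing []Ingress) []string {
	var msgs []string
	for _, other := range existing {
		if other.ID() == candidate.ID() {
			continue // never compare an ingress with itself
		}
		for _, host := range candidate.Hosts {
			for _, otherHost := range other.Hosts {
				if host == otherHost {
					msgs = append(msgs, fmt.Sprintf("ingress host (%s) conflicts with ingress %s", host, other.ID()))
				}
			}
		}
	}
	return msgs
}

func main() {
	existing := []Ingress{{Namespace: "production", Name: "webapp", Hosts: []string{"app.example.com"}}}
	dev := Ingress{Namespace: "development", Name: "webapp", Hosts: []string{"app.example.com"}}
	fmt.Println(clashes(dev, existing))
}
```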
Rego has its own test framework, enabling you to write tests for your policies to ensure they have the effects you intend. Here is an example of some tests for the policy we created above (in a real-world scenario, you would need to duplicate these tests for `UPDATE` as well as `CREATE` operations).
```rego
package ingress_clash

# The policy above defines `deny`; this helper gives the tests a simple
# boolean to assert against.
denied {
	deny[_]
}

# generates an Ingress spec
new_ingress(namespace, name, host) = {
	"apiVersion": "extensions/v1beta1",
	"kind": "Ingress",
	"metadata": {
		"name": name,
		"namespace": namespace
	},
	"spec": {
		"rules": [{"host": host}]
	}
}

# generates an AdmissionReview payload (used to mock `input`)
new_admission_review(op, newObject, oldObject) = {
	"kind": "AdmissionReview",
	"apiVersion": "admission.k8s.io/v1beta1",
	"request": {
		"kind": {
			"kind": newObject.kind
		},
		"operation": op,
		"object": newObject,
		"oldObject": oldObject
	}
}

test_ingress_create_allowed {
	not denied with input as new_admission_review("CREATE", new_ingress("my-namespace", "ingress-1", "host-1.example.com"), null)
		with data.kubernetes.ingresses as {
			"my-namespace": {
				"ingress-2": new_ingress("my-namespace", "ingress-2", "host-2.example.com")
			}
		}
}

test_ingress_create_conflict {
	denied with input as new_admission_review("CREATE", new_ingress("my-namespace", "ingress-1", "same-host.example.com"), null)
		with data.kubernetes.ingresses as {
			"my-namespace": {
				"ingress-2": new_ingress("my-namespace", "ingress-2", "same-host.example.com")
			}
		}
}
```
The last part of OPA I want to talk about is Conftest.
Conftest extends the idea of compliance testing to a wide variety of structured data formats including kubernetes configuration files, Dockerfiles, and terraform.
Here's a simple example of using conftest on some terraform code.
In this case, we are enforcing a policy that any S3 buckets must have encryption enabled.
To start with, we need terraform code to create S3 buckets. We're going to create one bucket with server-side encryption, and one without:
variables.tf

```hcl
variable "aws_access_key" {
  default = ""
}

variable "aws_secret_key" {
  default = ""
}

variable "aws_region" {
  default = "us-west-2"
}
```
main.tf

```hcl
provider "aws" {
  access_key = var.aws_access_key
  secret_key = var.aws_secret_key
  region     = var.aws_region
}
```
s3.tf

```hcl
// S3 bucket with no encryption
resource "aws_s3_bucket" "cleartext-s3-bucket" {
  bucket = "cleartext-testing-terraform-with-conftest"
  acl    = "public"
  versioning {
    enabled = true
  }
}

// Encrypted S3 bucket
resource "aws_kms_key" "s3-bucket-key" {}

resource "aws_s3_bucket" "encrypted-s3-bucket" {
  bucket = "encrypted-testing-terraform-with-conftest"
  acl    = "public"
  versioning {
    enabled = true
  }
  server_side_encryption_configuration {
    rule {
      apply_server_side_encryption_by_default {
        kms_master_key_id = aws_kms_key.s3-bucket-key.arn
        sse_algorithm     = "aws:kms"
      }
    }
  }
}
```
As usual, we need to supply our AWS credentials as environment variables:

```shell
export TF_VAR_aws_access_key="XXXXXXXXXXXXXXXXXXXX"
export TF_VAR_aws_secret_key="XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
```

You will need valid AWS credentials if you want to run this code, even though we only need to run `terraform plan`.
Put all of these files in a directory, supplying valid AWS credentials, and run it like this:

```shell
terraform init
terraform plan
```
You should see output that includes this:

```
Terraform will perform the following actions:

  # aws_kms_key.s3-bucket-key will be created
  + resource "aws_kms_key" "s3-bucket-key" {
      + arn                      = (known after apply)
      + customer_master_key_spec = "SYMMETRIC_DEFAULT"
      + description              = (known after apply)
      + enable_key_rotation      = false
      + id                       = (known after apply)
      + is_enabled               = true
      + key_id                   = (known after apply)
      + key_usage                = "ENCRYPT_DECRYPT"
      + policy                   = (known after apply)
    }

  # aws_s3_bucket.cleartext-s3-bucket will be created
  + resource "aws_s3_bucket" "cleartext-s3-bucket" {
      + acceleration_status         = (known after apply)
      + acl                         = "public"
      + arn                         = (known after apply)
      + bucket                      = "cleartext-testing-terraform-with-conftest"
      + bucket_domain_name          = (known after apply)
      + bucket_regional_domain_name = (known after apply)
      + force_destroy               = false
      + hosted_zone_id              = (known after apply)
      + id                          = (known after apply)
      + region                      = (known after apply)
      + request_payer               = (known after apply)
      + website_domain              = (known after apply)
      + website_endpoint            = (known after apply)

      + versioning {
          + enabled    = true
          + mfa_delete = false
        }
    }

  # aws_s3_bucket.encrypted-s3-bucket will be created
  + resource "aws_s3_bucket" "encrypted-s3-bucket" {
      + acceleration_status         = (known after apply)
      + acl                         = "public"
      + arn                         = (known after apply)
      + bucket                      = "encrypted-testing-terraform-with-conftest"
      + bucket_domain_name          = (known after apply)
      + bucket_regional_domain_name = (known after apply)
      + force_destroy               = false
      + hosted_zone_id              = (known after apply)
      + id                          = (known after apply)
      + region                      = (known after apply)
      + request_payer               = (known after apply)
      + website_domain              = (known after apply)
      + website_endpoint            = (known after apply)

      + server_side_encryption_configuration {
          + rule {
              + apply_server_side_encryption_by_default {
                  + kms_master_key_id = (known after apply)
                  + sse_algorithm     = "aws:kms"
                }
            }
        }

      + versioning {
          + enabled    = true
          + mfa_delete = false
        }
    }

Plan: 3 to add, 0 to change, 0 to destroy.
```
Now that we have our terraform code, let's see how we can use conftest to check it against our bucket encryption policy.
Create a `policy` directory, and add this file:
policy/s3-encryption.rego

```rego
package main

encryption[config] {
	input.resource_changes[_].change.after.server_side_encryption_configuration = config
}

deny[msg] {
	encryption[[]]
	msg = "S3 bucket encryption settings must be specified."
}
```
This is a trivial policy which looks at the changes terraform is going to make, and alerts us if there are any resources where `server_side_encryption_configuration` is empty.
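Under the hood, this is a structural query over the plan JSON. As a mental model (not a substitute for conftest), this Go sketch applies the same rule to a hand-written fragment shaped like `terraform show -json` output; the `plan` type and `deniedBuckets` function are my own illustrative names:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// plan models only the fields our policy inspects in `terraform show -json` output.
type plan struct {
	ResourceChanges []struct {
		Address string `json:"address"`
		Change  struct {
			After struct {
				SSEConfig []json.RawMessage `json:"server_side_encryption_configuration"`
			} `json:"after"`
		} `json:"change"`
	} `json:"resource_changes"`
}

// deniedBuckets returns the addresses of resources whose
// server_side_encryption_configuration is present but empty -- the same
// condition the Rego policy's `encryption[[]]` clause matches.
func deniedBuckets(planJSON []byte) ([]string, error) {
	var p plan
	if err := json.Unmarshal(planJSON, &p); err != nil {
		return nil, err
	}
	var denied []string
	for _, rc := range p.ResourceChanges {
		if rc.Change.After.SSEConfig != nil && len(rc.Change.After.SSEConfig) == 0 {
			denied = append(denied, rc.Address)
		}
	}
	return denied, nil
}

func main() {
	// Hand-written fragment shaped like real plan JSON, for illustration only.
	planJSON := []byte(`{"resource_changes": [
		{"address": "aws_s3_bucket.cleartext-s3-bucket",
		 "change": {"after": {"server_side_encryption_configuration": []}}},
		{"address": "aws_s3_bucket.encrypted-s3-bucket",
		 "change": {"after": {"server_side_encryption_configuration": [{"rule": []}]}}}
	]}`)
	denied, err := deniedBuckets(planJSON)
	if err != nil {
		panic(err)
	}
	fmt.Println(denied)
}
```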
You can find more information about writing policies in the OPA documentation.
Conftest works by scanning the JSON output from `terraform plan`, which we can create like this:

```shell
terraform plan -out=plan.save
terraform show -json ./plan.save > plan.json
```
Now that we have our `plan.json` file, we can run conftest like this:

```shell
conftest test plan.json
```

conftest looks for policies in the `policy` directory by default. You can specify a different directory with the `--policy`/`-p` command-line option.
You should see output like this:

```
FAIL - plan.json - S3 bucket encryption settings must be specified.

1 test, 0 passed, 0 warnings, 1 failure, 0 exceptions
```
If you copy the `server_side_encryption_configuration` stanza into the cleartext bucket definition, and regenerate the `plan.json` file, the conftest output should change to:

```
1 test, 1 passed, 0 warnings, 0 failures, 0 exceptions
```
Although we've used terraform in this example to keep it simple, kubernetes lends itself particularly well to this kind of approach. As well as letting you find out what your IaC code says about how your infrastructure should be set up, the kubernetes API enables extensive introspection: you can scan a kubernetes cluster and find out exactly what is actually running, and how it's configured.
Tools like Sonobuoy make this easier, allowing you to automatically run reports on the setup of your cluster and the code running on it.
This is a huge topic, and I've barely scratched the surface with this article. But I hope I've shown you some of the tools and techniques available to allow you to apply some of the same testing rigor to your IaC code and configuration that you already apply to developing your application code.
Testing infrastructure code has some challenges, but by taking a layered approach, using a combination of techniques at different points in your infrastructure development lifecycle, you can gain a lot of confidence in your setup, and minimise the risk of later changes introducing errors or vulnerabilities.