Infrastructure Best Practices

Overview

This doc aims to provide some general guidance on infrastructure best practices. It is by no means a rigid set of rules, but rather a handful of tips that may help.

Guidance

I attempted to break these recommendations into distinct files:

  • General FAQ and Tooling: General guidance in the form of a FAQ. Also contrasts some other tooling.
  • Terraform: General guidance regarding Terraform that I have found to improve productivity.
  • Local development: General guidance on how to improve local development

Bash Helpers

Here are some helper bash scripts that I put in my ~/.bash_profile - they're somewhat documented.

Reference

I put most of this together as I was prototyping different combinations of tooling in https://github.com/JAMSUPREME/tf-multi-account

Overview

Below is a list of the topics I cover:

Below are my brief analyses of some tooling alternatives:

How many OUs should I have?

At the moment, I recommend 1 per environment. You get the advantages of isolating environments without the overhead of wiring too many things up. I also recommend having 1 shared infrastructure repo if you need to share any resources across applications in an environment. Here's an example breakdown:

/app1         (dev, prod)
  lambda
/app2         (dev, prod)
  ecs
/shared-infra (dev, prod)
  vpc
  elasticache

These articles are useful reads for understanding elaborate multi-account setups:

What about project-wide resources?

Sometimes you may have a tool or application that applies to all of your applications (across all environments) but which itself has only a single environment (due to licensing or other reasons).

An example of this might be something like:

  • An on-prem SonarQube setup
  • A single large Jenkins server to save costs
  • A single GitLab setup isolated in its own account

Reiterating my example above, you might have something like this:

app name        (account name)

/app1             (dev, prod)
/app2             (dev, prod)
/shared-infra     (dev, prod) <-- shared infrastructure per-environment
/jenkins          (jenkins)   <-- one jenkins shared across all apps and all environments
/sonarqube        (sonarqube)

However, if you don't want 1 account per shared product, you might wish to simplify it to something like this:

/jenkins   (tools)
/sonarqube (tools)  <-- both jenkins and sonarqube exist in a "tools" account

In this context, I'm using the term tools deliberately: you'd probably want a unique term so people don't confuse shared infrastructure (per environment) with shared infrastructure (across an entire project). Some other naming thoughts:

  • standalone (though I feel this could be ambiguous since there are contexts in which you create a standalone environment for an application to test it)
  • ops - It might make sense to call it an "ops" account since they are operational tools

Why not several OUs?

I also prototyped having an additional "infrastructure" account in https://github.com/JAMSUPREME/tf-multi-account but it ultimately provided less value than anticipated. Here is some of my reasoning:

  • When talking about infrastructure it wouldn't be clear if we are discussing shared infrastructure, or the actual "infrastructure" account
  • If we need shared infrastructure, we end up with multiple repos and their purposes (and accounts) aren't clearly defined, e.g.
/app1         (dev, prod)
/app2         (dev, prod)
/shared-infra (dev, prod)
/infra        (infra)      <- this becomes ambiguous
  • It may or may not be practical to have a shared CodeBuild/CodePipeline in the infrastructure repo. If apps need unexpected customizations, we'll be cluttering both repos and introducing confusion.
  • The promotion model can be made fairly simple by using SNS, so we won't be tightly coupling environments (see "Benefits" below for my earlier thoughts)
  • The problem of a central ECR repo (or any other resource) can be solved by a distinct shared-infra repo, and there is no need for a distinct AWS account
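
To make that last point concrete, here is a rough sketch of what the shared-infra repo might own for a central ECR repository (the account ID and repository name are placeholders, not taken from tf-multi-account):

# shared-infra/terraform/ecr.tf (illustrative)
resource "aws_ecr_repository" "app1" {
  name = "app1"
}

# Let the application accounts pull images from the shared repository
resource "aws_ecr_repository_policy" "app1_pull" {
  repository = aws_ecr_repository.app1.name
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Sid       = "AllowAppAccountPull"
      Effect    = "Allow"
      Principal = { AWS = "arn:aws:iam::111111111111:root" } # placeholder account ID
      Action    = ["ecr:GetDownloadUrlForLayer", "ecr:BatchGetImage", "ecr:BatchCheckLayerAvailability"]
    }]
  })
}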

How do I promote code across environments?

This varies slightly depending on whether or not you are using multiple OUs:

  • Single OU: Since your DEV, QA, and PROD all live in the same account, you can control promotion via your buildspec.yml or Jenkinsfile (see one such example https://github.com/ICF-ITModernization/base-api-java/blob/master/Jenkinsfile#L128). This makes promotion obvious and makes it easy to interject special behavior for different environments.
  • Multiple OUs: Since DEV, QA, and PROD are all in distinct accounts, we need to have them communicate in some manner. As demonstrated in tf-multi-account we can send promotion events to other OUs in our organization, and then listen for those events in the higher environment. For this example, I used only Cloudwatch/EventBridge in an attempt to avoid complexity, but you could also use any pub/sub architecture you want.
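
Here's a rough sketch of that cross-account wiring (account IDs, names, and the event detail-type are placeholders rather than the exact setup from tf-multi-account):

# In the lower (dev) account: forward promotion events to the higher account's default event bus
resource "aws_cloudwatch_event_rule" "promotion" {
  name = "send-promotion"
  event_pattern = jsonencode({
    "detail-type" = ["app-promotion"] # placeholder detail-type emitted by the pipeline
  })
}

resource "aws_cloudwatch_event_target" "to_higher_env" {
  rule     = aws_cloudwatch_event_rule.promotion.name
  arn      = "arn:aws:events:us-east-1:222222222222:event-bus/default" # placeholder higher-env account
  role_arn = aws_iam_role.event_forwarder.arn # role allowed to call events:PutEvents (assumed to exist elsewhere)
}

# In the higher (prod) account: allow the lower account to put events onto the default bus
resource "aws_cloudwatch_event_permission" "from_lower_env" {
  principal    = "111111111111" # placeholder lower-env account ID
  statement_id = "AllowPromotionEvents"
}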

How should I provision the infrastructure?

Broadly, I recommend using Terraform to bootstrap an initial pipeline (CodeBuild/Jenkins). Thereafter, your pipeline can adjust its own infrastructure by running terraform apply. If all of your resources are public, Github Actions may be a simpler alternative.

How will the infrastructure be deployed? How to bootstrap initially?

Generally,

  • Write the infrastructure code for your pipeline (CodeBuild, Jenkins, etc.)
  • Initially create that from your local machine
  • Once it exists, create an application pipeline (buildspec.yml, Jenkinsfile) that can build the application and then apply infrastructure changes. e.g.
# in a buildspec
phases:
  install:
    runtime-versions:
      java: corretto11
  build:
    commands:
      - docker build .
      - terraform apply -auto-approve
      
# or Jenkinsfile (may be prettier if you use fancy plugins)
pipeline {
  agent any
  stages {
    stage('Build and deploy') {
      steps {
        sh "docker build ."
        sh "terraform apply -auto-approve"
      }
    }
  }
}

Since logs will exist, and you can add a manual approval step before promoting to production, I think it's perfectly fine for the pipeline to adjust its own infrastructure.

How do I set up Configuration Management?

There are different ways in which you might use configuration management:

These tools can often be used in conjunction with your infrastructure provisioning (Terraform/CloudFormation). However, caution should be used when bouncing between many tools, as it can be difficult to follow the steps involved if it constantly jumps around from Terraform to UserData to Chef to Dockerfile to Task Definition, etc.

How should the application be built or compiled?

Jenkins, CodeBuild, GitHub Actions, etc. all work equally well. The drawback of having a long-lived Jenkins server is that it requires more upkeep, though you may get better visualization. With Actions or a CodeBuild using an up-to-date base image, you don't need to worry about the server.

One of the related questions to this is whether or not you can reuse similar infrastructure code so you aren't copy/pasting the same shell scripts all over. This can be done whether you use a long-lived option like Jenkins or a bootstrapped custom image with helper scripts. Which leads to the next question...

How do I reuse scripts across multiple pipelines?

If you're using Jenkins, the easiest way is to use a shared library. See https://github.com/ICF-ITModernization/base-jenkins-shared-library for some examples, along with an example on how to let the shared library run its own methods in its own pipeline to verify that they work before it publishes itself. It also has a self-documenting function reference, so this repo should serve as a handy starting point for any new library.

If you aren't using something long-lived like Jenkins, then an alternative is to write a set of helpful bash scripts that you can source on startup. This way your buildspec.yml can use any helper scripts already defined.

How can I use the same configs locally and in the cloud?

If you use Kubernetes, it is possible to reuse your configuration locally and remotely, but the caveat is that you would need to run your databases, caches, etc. within your k8s cluster. If you are OK with this, then k8s is definitely a good fit. If you wish to use a cloud managed tool, then drift is inevitable.

Do I need EC2s or containers?

Short answer: No. (though if you run a build server like Jenkins, that may be a special exception)

It's important to keep in mind that in many cases, going "serverless" is cheaper and less complicated than wiring together containers.

Here's a quick set of questions to ask yourself:

  • Do I perform basic input/output or CRUD operations? Serverless (a database and API gateway can be built with Terraform, and you simply deploy your desired endpoints/functions to Lambda for CRUD operations; see the sketch after this list)
  • Do I have a static website? Serverless (or maybe even basic s3/cloudfront hosting)
  • Do I have long-lived jobs or operations? Containers/EC2 - If you have something that takes a LONG time (5 minutes or more) it is probably advisable to use containers or EC2. If it's asynchronous, you can even get away with cheaper spot pricing.
  • Do I have an unsupported language (R, Elixir, Rust)? Containers/EC2 - If you are using a less-popular language, you won't be able to use Lambda, so you'll need to spin up your own containers. Something like https://www.openfaas.com/ may be relevant in this scenario if you desire a FaaS stack but want an unsupported stack. (Sidenote: For small binaries, you can also make a custom lambda runtime. See https://github.com/awslabs/aws-lambda-rust-runtime)
  • Do I need to maintain persistent state in memory/on-disk? EC2 - If, for some reason, you need to maintain some sort of state on disk or in memory (and you aren't using a managed service like RDS, RedShift, EFS, EMR, etc...) then you might have a use case where you need to have a long-lived EC2 instance attached to a recoverable EBS volume. This is not a common use case.
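
To make the first bullet concrete, a minimal serverless CRUD sketch might look like the following (function name, runtime, artifact path, and the execution role are illustrative; the stage and Lambda invoke permission are omitted for brevity):

resource "aws_lambda_function" "crud" {
  function_name = "crud-handler"                # illustrative
  runtime       = "nodejs18.x"
  handler       = "index.handler"
  filename      = "build/crud.zip"              # illustrative build artifact
  role          = aws_iam_role.lambda_exec.arn  # execution role assumed to be defined elsewhere
}

resource "aws_apigatewayv2_api" "api" {
  name          = "crud-api"
  protocol_type = "HTTP"
}

# Proxy every route to the Lambda function
resource "aws_apigatewayv2_integration" "lambda" {
  api_id                 = aws_apigatewayv2_api.api.id
  integration_type       = "AWS_PROXY"
  integration_uri        = aws_lambda_function.crud.invoke_arn
  payload_format_version = "2.0"
}

resource "aws_apigatewayv2_route" "default" {
  api_id    = aws_apigatewayv2_api.api.id
  route_key = "$default"
  target    = "integrations/${aws_apigatewayv2_integration.lambda.id}"
}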

Using containers anyway

I'll add a footnote here that it may still be prudent to use containers for development just to isolate your tooling. For example, managing JDK aliases and local installs can be a headache on a single machine, but if you spin up an image you are guaranteed to get the version you want.

The same goes for tools like Jenkins (and its many plugins). Installing all of these things and potentially making a mess of your machine can be unpleasant, so there may still be value in using a container to isolate your work environment. That being said, baking a basic image just for local development is a lot simpler than building something stable for production and maintaining it.

How do I share resources across application infrastructure?

My current recommendation is to have a distinct shared-infrastructure repository in which you place all shared resources. This same practice applies regardless of whether you're using CloudFormation, Terraform, or something else.

As a brief example, the folder structure might look a bit like this:

/Github
  /app-one
    /terraform
      main.tf    (has app 1 resources)
  /app-two
    /terraform
      main.tf    (has app 2 resources)
  /shared-infra
    /terraform
      main.tf    (has VPC setup)
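
An application repo can then look up the shared VPC instead of redefining it. A minimal sketch, assuming shared-infra tags its VPC with an environment-qualified Name (the tag value is illustrative):

# app-one/terraform/network.tf (illustrative)
data "aws_vpc" "shared" {
  tags = {
    Name = "shared-infra-${var.deploy_env}" # assumes shared-infra tags its VPC this way
  }
}

data "aws_subnets" "shared_private" {
  filter {
    name   = "vpc-id"
    values = [data.aws_vpc.shared.id]
  }
}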

How do I share code across application infrastructure?

  • With Terraform, this is primarily done via modules (see the sketch after this list).
  • With CloudFormation, you would generally use nested stacks (via TemplateURL).
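
For the Terraform case, each application instantiates the shared module; a minimal sketch (module path and inputs are illustrative):

module "ecs_app" {
  source = "../modules/ecs-app" # illustrative relative path to the shared module

  app_name   = "app-one"
  deploy_env = var.deploy_env
  image_uri  = var.image_uri
}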

Regardless of tool, you should be very confident that there is a consistent set of resources to be created across applications, or you'll end up making very brittle templates that don't serve their purpose. Ideally they should be very focused templates that help you reduce boilerplate.

Some examples:

  • ECS app: If you have a lot of simple apps that need to go into a very similar ECS cluster with a load balancer and serve HTTP, it might reduce clutter to roll all the IAM, ECS, ELB, etc. into a single template.
  • Lambda function: If you have a lot of lambdas running with similar configurations, it may be easiest to abstract default params, VPC configs, etc. into a single template.

The important thing to remember is that you want to create a useful abstraction. If it doesn't represent a unique concept in the infrastructure, you are probably better off just copy/pasting, since many of the resources in TF/CF already represent a single unit. You won't get much value from abstracting unless you can fill in a lot of default values or simplify something with a lot of complex parts.

How can multiple developers work on infrastructure concurrently?

Regardless of tool, there are some important conventions that will help enable this:

  • Ensure everything can be named uniquely (environment is an easy way to achieve this; see the sketch after this list)
  • Increase any account limits if you are hitting them (VPC caps are easy to hit if you use them a lot)
  • If you have resources that are shared across applications, be careful not to break or corrupt them with your standalone environment
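
For the naming point above, a common pattern is to prefix every resource name with the environment so standalone environments never collide (the queue name is illustrative):

resource "aws_sqs_queue" "jobs" {
  name = "${var.deploy_env}-app-one-jobs" # e.g. "dev-justin-app-one-jobs" in a standalone environment
}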

For Terraform, see the Best Practices For Multiple Developers section below

For Cloudformation,

  • Create a unique stack based on the same template as your current application stack
  • If you are using nested stacks, you may wish to duplicate them into a bucket of your own so you can edit them safely
  • Note that the above steps could also be automated by having your CI/CD pipeline generate a unique bucket per-branch

How do I test infrastructure changes?

With respect to Terraform/CloudFormation, there are a couple of things you can do:

  • Create the infrastructure in a standalone environment and manually verify behavior
  • Create the infrastructure in dev or QA and let e2e tests verify behavior

In many cases, you'll want to do both. Usually the creation of the infrastructure will involve some manual adjustments and verification, and then once it seems stable, you would run your e2e tests against it to ensure that nothing was unintentionally broken.

Configuration Management Tests

Configuration Management is also an important aspect of infrastructure, and there are a few ways it can be tested:

  • If using Chef, then InSpec and ChefSpec are useful for verifying the end-state of configuration
  • If not, using goss or serverspec works just as well

While your e2e tests may indirectly confirm that your CM is functioning as expected, adding some specs can help document expected behavior and cover more nuanced aspects of the configuration (crons, SELinux settings, file permissions, etc.)

How do I secure my infrastructure?

Here are a few tools that can help with securing containers and infrastructure:

  • Prisma Cloud (formerly Twistlock) or openSCAP
  • AWS Tools: Inspector, WAF & Shield, GuardDuty
  • Web scanner, e.g. OWASP ZAP

This isn't comprehensive by any means, but the above are commonly integrated into the CI/CD pipeline.

How do I monitor my infrastructure?

There are a few ways in which you can monitor your application:

  • Application Performance Monitoring (APM) - APM is generally most useful for keeping an eye on unhandled exceptions in your application. It's also good for getting a bird's-eye view of your application health at a glance.
  • Distributed tracing - Adding an additional point onto APM, if you have microservices, you will likely need distributed tracing either via your APM, logging, or an additional tool like X-Ray.
  • Logging - Whether it is HTTP traffic, warnings, or errors, you will likely have a lot of logs that need monitoring.
  • Infrastructure - Load balancers, container clusters, DNS, EC2 instances, etc. All of these exist outside the scope of your application but could impact it, so you would likely need a separate dashboard like Cloudwatch to observe infrastructure at a glance.
  • Custom metrics - In addition to the above (or in lieu of), you may want some sort of custom instrumentation like https://prometheus.io/

I STRONGLY recommend having one or more of these tools set up correctly prior to launching your app in production.

Newrelic and Datadog both have fairly robust tool suites that let you opt into which kinds of monitoring you need.

How do I avoid downtime?

This isn't a comprehensive list, but here are some resources that help avoid downtime:

  • Container cluster or EC2 health check - At the lowest level, you generally want some sort of check at your EC2 or container cluster level to ensure that your image is running as expected.
  • Autoscaling Groups (ASGs) - First and foremost, if you don't have an ASG, then any time your EC2 instance goes down for any reason, your app is down. If you do have an ASG, it's also possible that it was misconfigured and is failing to bring the instances up in a timely manner.
  • Elastic Load Balancer (ELBs) - You should have a load balancer that is distributing traffic across multiple Availability Zones (AZs) in case a single zone fails.
  • DNS routing - With DNS failovers (or other configs) you can distribute traffic across regions. This will prevent any outages isolated to a single region.

What kinds of disaster recovery do I plan for?

In addition to "downtime", you may also end up with some sort of disaster that results in data loss or corruption, for example if someone accidentally drops a database or deletes a production file system.

Here are some examples of backups you can take:

  • EBS backups - If you store anything important on your EC2 instances, then you will want to take EBS backups and potentially re-mount them whenever starting up new instances.
  • AWS Backup (EFS) - AWS also provides a "generic" backup for things like the Elastic File System. If you have a CMS web site of some sort, then you'd likely want to take backups periodically so you could recover from a bad deletion or admin blunder.
  • S3 replication - In case a single region became unavailable, or you accidentally lost the data, it's useful to replicate data.
  • Database replicas - For RDS databases, there are a lot of features for supporting read replicas and promoting replica-to-master. Aurora also has minute-level restores from its backups.

How do I keep my images and applications up to date?

Generally, you should have a pipeline that runs something like yum update -y weekly or monthly and then re-tags the result as your latest image. This process is effectively the same whether you're using EC2 instances or containers.

Once the latest image has been created, you would then trigger an ASG refresh (for EC2s) or deploy a new task definition (container cluster). This will safely roll out the new image without downtime.
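
For the EC2 case, the rolling replacement can be expressed directly on the ASG. A sketch, assuming a launch template that points at the freshly patched AMI (names, sizes, and the subnet variable are illustrative):

resource "aws_autoscaling_group" "app" {
  name                = "app-one-asg"
  min_size            = 2
  max_size            = 4
  vpc_zone_identifier = var.private_subnet_ids # assumed variable

  launch_template {
    id      = aws_launch_template.app.id # assumed to reference the re-tagged latest image
    version = "$Latest"
  }

  # Roll instances whenever the launch template changes, keeping most of the fleet in service
  instance_refresh {
    strategy = "Rolling"
    preferences {
      min_healthy_percentage = 90
    }
  }
}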

Why not use Waypoint?

I think Waypoint is primarily targeted at people who have some code and just want to drop it "anywhere". It might be practical for very small apps or apps that are fairly isolated. For apps that have a lot of wiring and already have a fairly robust Terraform setup, I don't see a big win, especially if you are customizing a lot of things (load balancers, images, etc.)

Concerns

I have a few reservations about Waypoint:

  • Limited documentation/features. Most behavior is intentionally opaque.
  • Limited guidance on big setups (e.g. multiple dependent applications, or even an app with a DB)
  • The HTTPS support is via public DNS, and doesn't support non-HTTPS protocols (like db connections)
  • I have not yet figured out how one would set up a long-lived Waypoint server for an internal pipeline (I've only played with the local server)

Benefits

  • Easy to spin up
  • Gives HTTPS and DNS, so you don't have to worry about port collision
  • Has a helpful UI, similar to k8s UI
  • It can run against ECS/Fargate or Kubernetes


Concluding thoughts on Waypoint

  • Might be viable for local development, but has drawbacks for dependencies (e.g. a database)
  • Wouldn't recommend trying it for a production app, since we might hit customizations that Waypoint can't support
  • It requires a long-lived Waypoint server, which is also a drawback

Reference

Why not use Kubernetes?

I wouldn't dissuade anyone from using k8s if it fits their needs and they have at least a few people familiar with it or comfortable ramping up quickly. It is particularly useful when you have a VERY LARGE fleet or you want to enforce consistent standards across a company or program. For small projects, I think it is likely to be overkill (but OK if your whole team likes it).

Local options

There are a handful of options for running k8s locally: https://kubernetes.io/docs/tasks/tools/

There are also other tools like skaffold for making the development cycle easier.

Do we need it?

In this particular scenario, I would compare the following combinations:

  • docker-compose (local setup), terraform, ECS Fargate
  • k8s + minikube (local setup), terraform, EKS

Benefits

  • Kubernetes offers a lot of features
  • More elaborate support for various types of clustering and deployment
  • Good handling for resource management and caps

Drawbacks

  • Kubernetes is fairly complicated
  • Kubernetes (in my opinion) isn't particularly valuable for small-scale setups with no intention of federation or additional oversight
  • Additional local tooling (k8s, minikube) to be set up on top of docker
  • Additional learning for everyone to understand k8s basics
  • Some effort would need to be put into figuring out desired base templates and conventions
  • Doesn't solve our problem of sharing configuration from local to prod, since databases won't generally go into the cluster; it also blurs which tool provisions infrastructure (e.g. k8s dictating the load balancer)

Why not use CloudFormation?

If you're already largely dependent on it (or the CDK) then it may be practical to continue using it. However, if you have a greenfield project, I would recommend Terraform. For a more detailed comparison, see the Terraform vs. CloudFormation section below.

Why not use the CDK or SAM?

My experience with the CDK was limited, but from what I've experienced:

  • The CDK sometimes has bugs; you get some simplicity up front, but when something is buggy or broken, or you need further customization, it can get in the way
  • The CDK can sometimes obscure what is getting created, or result in spaghetti code that is more complicated than the Terraform alternative.

That being said, I haven't used it successfully enough to accurately articulate its strong points.

What about Terraform CDK?

In addition to the vanilla AWS CDK, there is also a CDK target for terraform: https://github.com/hashicorp/terraform-cdk (a.k.a. tfcdk or cdktf)

I did a bit of prototyping with it. See https://github.com/JAMSUPREME/tf-multi-account/tree/main/tf-cdk and README for more info.

Recommendation

My current recommendation is to use Terraform CDK to augment a normal Terraform setup. It excels in a few scenarios in which HashiCorp Configuration Language (HCL) can get clunky, but otherwise the two are nearly interchangeable, since the majority of the TF CDK is generating Terraform-compatible JSON configuration.

When to use CDK?

I would generally recommend augmenting vanilla TF with the CDK if you need to do any of these things:

  • Conditional resource creation. With HCL you need count and indexed references:
# with terraform
resource "aws_sns_topic" "build_emailer" {
  count = var.add_sns_topic ? 1 : 0
  name = "build_email"
  tags = local.global_tags
}
# dependent resources must also have `count` and reference the resource via `aws_sns_topic.build_emailer[0]`

# with CDK
if (add_sns_topic) {
  const topic = new SnsTopic(this, 'myFirstTopic', {
    name: 'first-topic',
    displayName: 'first-topic-display'
  });
  // make other dependent resources here that reference `topic`
}
  • Elaborate looping (for_each) or a lot of dynamic blocks. HCL gets clunky when using one or both of these.
  • You want to use inheritance:
class DefaultCloudwatchLogGroup extends CloudwatchLogGroup {
  // put default retention, KMS key, etc.
}
  • You want flexible/functional/chained/fluent composition:
// polymorphic tagging
function addTags(resource) {
  resource.tags = globalTags; // e.g. a shared tags object
}
// builder-style resource composition
class S3BucketBuilder {
  buildLifecyclePolicy() {}
  buildVersioning() {}
}

Benefits

  • It can be used in a "hybrid" setup in conjunction with a "vanilla Terraform" setup.
  • Conditional resource creation is much easier (no need for count and resource[0] usage)
  • You could use composition or inheritance in a much simpler way (as opposed to TF modules)
  • Looping is much simpler (compared to HCL)
  • It should be possible to follow similar conventions for both styles

Concerns

  • The CDK doesn't automatically keep all resources in scope (must explicitly pass variables around)
  • The diff report cdktf diff is not as detailed as the normal terraform plan
  • The documentation isn't nearly as robust as normal TF https://registry.terraform.io/providers/hashicorp/aws/latest/docs (though you can inspect the TypeScript types)
  • The cdktf deploy hasn't run as cleanly as a normal terraform apply (not sure why)
  • It is technically still in alpha and they mention it is not ready for production use, though it seems pretty safe to use. There may also be significant new feature flags or backwards-incompatible changes, though we don't know for sure.

Similarities

  • Both can be linted and formatted consistently
  • Both can support targeting distinct backends per environment
  • Both can do templating

Why use Terraform?

For anyone unfamiliar with Terraform, I'll highlight some of its best features briefly:

  • It can plug into several providers. This means you could provision some resources in AWS and some in GCP, while also setting up NewRelic monitoring or even a Cloudflare setup. It is not intended to be an abstraction over each provider (like Kubernetes load balancers), but rather to make it easy to manage multiple providers simultaneously (see the sketch below).
  • It has excellent drift detection & correction capabilities.
  • It has easy state management.
  • It has easy templating.
  • It has easy variable / input / output management.

These are just a few highlights, but they are very high value compared to the alternatives.
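
To illustrate the multi-provider point from the first bullet, a single configuration can drive AWS and New Relic side by side (the variables and resources below are illustrative):

terraform {
  required_providers {
    aws      = { source = "hashicorp/aws" }
    newrelic = { source = "newrelic/newrelic" }
  }
}

provider "aws" {
  region = "us-east-1"
}

provider "newrelic" {
  account_id = var.newrelic_account_id # assumed variables
  api_key    = var.newrelic_api_key
}

resource "aws_s3_bucket" "assets" {
  bucket = "my-great-assets-bucket"
}

resource "newrelic_alert_policy" "app" {
  name = "app-one-alerts"
}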

Folder structure

/terraform
  /config
    dev-backend.conf   (configures the backend TF store)
    prod-backend.conf
    dev.tfvars         (provides config values for running TF apply)
    prod.tfvars
  configuration.tf     (designates provider and TF versions)
  input_variables.tf   (contains input variable definitions which are provided by dev.tfvars)
  .tflint.hcl          (linting rule configuration, advisable to put in git)
  *.auto.tfvars        (auto-loaded variables must be in the root, e.g. secrets)
  
  *.tf                 (all your resources)
  sns.tf
  s3.tf

File naming conventions

In my experience, it is best to name things after the service, potentially adding a qualifier for what it's doing.

Using the tf-multi example:

cloudwatch_send_promotion.tf
cloudwatch_receive_promotion.tf

The common exception to this is IAM roles/policies. A large number of services need to be tied to an IAM role (and policy), so it's more prudent to keep those next to their respective resources.

Template naming conventions

  • Template files will have .tpl prepended to their file extension. For example, file.tpl.json or file.tpl.xml

Local, Data, and variable declarations

  • Place locals applicable to a single file at the top of the file
  • Place global locals at the bottom of the input_variables.tf file
  • Put global data declarations at the bottom of the input_variables.tf file
  • tflint should be used to enforce consistent variable naming
  • terraform fmt should be used to enforce consistent formatting (spacing, etc.)

Useful tagging

In your input_variables.tf, I recommend declaring a global_tags local and ensuring you use it on all resources that support tagging. For example:

locals {
  global_tags = {
    "SourceRepo"  = "my-great-app-1"
    "Environment" = var.deploy_env
  }
}

Example resource:

resource "aws_s3_bucket" "terraform_state_bucket" {
  bucket = "my-great-state-bucket"
  tags = local.global_tags
}

Customized tags (via merge):

resource "aws_s3_bucket" "terraform_state_bucket" {
  bucket = "my-great-state-bucket"
  tags = merge(
    local.global_tags,
    {
      LineOfBusiness = "Your-LOB-ID",
    }
  )
}

Benefits:

  • Everything gets tagged correctly, and you can easily replace or add tags
  • The "source repo" is very helpful for finding out where the infrastructure came from
  • If you only use a single OU, "environment" is helpful for distinguishing or finding resources
  • It can be very useful for billing analytics

Failing Pull Requests

When creating a pull request, it should trigger a build that runs both tflint and terraform fmt -check; if either returns a non-zero exit code, the build should fail.

This helps ensure that no poorly formatted code makes it into the main/master branch.

Helper scripts

With the aforementioned folder structure, we can make a few helper scripts to shorten some of the tedious copy/pasting for common commands. The Terraform helpers are included in the Bash Helpers section at the end of this doc.

They provide a bit of shorthand:

Standard (no helper): terraform apply -var-file="config/dev.tfvars"

Shorthand (with helper): tfapply dev

Different resources per environment

Because the whole point of Terraform is to keep every environment consistent, going against this pattern can be a little clunky. There are 2 main ways I've found of making resources distinct per environment:

  • Gratuitous usage of count - This is a little clunky because you must ensure all dependent resources are omitted as well, and you must access your resource by index (aws_sns_topic.my_topic[0])
  • Abstract distinct resources into a module - You'll still need to use count for the module, but this way you abstract all the special resources into a different module that only applies to a particular environment (see the sketch after this list)
  • What about when you need different configuration?
    • Usually using different variable values is sufficient
    • Other times, you want an entirely different configuration of a nested block. This is more difficult, but can sometimes be solved with a dynamic block
    • If neither of the above works, you may need to make a completely distinct resource (or module), but you should first make sure you really want such substantial differences between environments
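
A sketch of the module approach from the second bullet, plus a dynamic block for the nested-block case (the module path and resources are illustrative; module-level count needs Terraform 0.13+, and the lifecycle_rule syntax is the older v3 AWS provider style):

# Only create the prod-only resources when deploying prod
module "prod_only" {
  source = "./modules/prod-only" # illustrative module holding the environment-specific resources
  count  = var.deploy_env == "prod" ? 1 : 0

  deploy_env = var.deploy_env
}
# Dependent references must account for the count, e.g. module.prod_only[0].some_output

# A nested block that only exists in prod, via dynamic
resource "aws_s3_bucket" "logs" {
  bucket = "${var.deploy_env}-app-one-logs"

  dynamic "lifecycle_rule" {
    for_each = var.deploy_env == "prod" ? [1] : []
    content {
      enabled = true
      expiration {
        days = 365
      }
    }
  }
}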

Best Practices for multiple developers

This is a brief guide on how to use distinct workspaces with Terraform while multiple people modify the same environment.

Additive Standalone environment

The following is an effective strategy when you largely want to share an environment (e.g. dev) but both you and another developer are adding new resources. This strategy will copy the existing state into a new file and allow you to modify the new state so that it contains your resources. Your new resource can then be safely created or deleted without impacting other developers.

Note: This strategy will not work effectively if multiple developers need to make incompatible updates to existing resources. For certain shared resources, it won't be possible to create/modify them in isolation. If possible, you should update the resource in a standalone branch and merge into master so everyone can pull the change into their own branch. If this is not possible, see the next section on Unique Environments

  • Copy /terraform/config/backend-standalone.conf.example and rename it to backend-<ENV>-<NAME>.conf where <NAME> is your name and <ENV> is the target environment, e.g. backend-dev-justin.conf
  • Uncomment the values in backend-<ENV>-<NAME>.conf and set a unique key value (your name, for example)
  • Run terraform init -backend-config=config/backend-dev-justin.conf
    • When asked to copy state, answer yes
  • Add/Import/etc. your new resources and run terraform apply accordingly.

Here's an example file structure:

/terraform
  /config
    backend-dev.conf
    backend-dev-justin.conf

And example file content:

# backend-dev-justin.conf

# Make sure you are using the correct bucket for your env (dev, prod, etc.)
bucket = "tfmulti-dev-terraform-813871934424"
profile = "sdc"
# Change the following key to be a unique identifier for you (e.g. name)
key = "terraform-justin.tfstate"

Once your infrastructure is stable, you should do the following:

  • Switch back to the normal dev state: terraform init -backend-config=config/backend-dev.conf
    • When asked to copy state, ANSWER NO !!!!! (It will not be good if you overwrite dev's state)
  • Use terraform import to pull in the resources you created in your custom state

Unique environments

You may be doing a substantial rewrite to existing infrastructure, in which case copying state simply won't be useful since you will be modifying several existing resources.

If possible, in this scenario you should create an entirely new set of infrastructure that is distinct from the current stack.

For example, if you need to make substantial changes to dev, you would create a standalone dev-justin set of infrastructure that has its own state and entirely its own resources. You can then create/modify/destroy anything in this environment without affecting others.

Things to keep in mind with this approach:

  • All resources must be uniquely named
  • The new stack should not cause any side effects in the normal working environment (e.g. dev)
  • If possible, there should be minimal or no manual provisioning involved to avoid blunders or botched cleanup
  • You must remember to destroy this environment after completing your work

Terraform Format

Terraform comes with a built-in formatter that adheres to its canonical conventions. Simply run terraform fmt in the terraform directory. If you have modules, use terraform fmt -recursive to ensure subdirectories get formatted.

TFLint

I recommend using https://github.com/terraform-linters/tflint to enforce some common conventions and help avoid blunders.
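
A minimal .tflint.hcl sketch (the plugin version is illustrative; pin whatever the current tflint-ruleset-aws release is):

plugin "aws" {
  enabled = true
  version = "0.21.1" # illustrative; check the ruleset's releases for a current version
  source  = "github.com/terraform-linters/tflint-ruleset-aws"
}

# Enforce consistent naming across variables, resources, etc.
rule "terraform_naming_convention" {
  enabled = true
}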

IDE support

The Terraform language server https://github.com/hashicorp/terraform-ls can be used via multiple IDEs. I recommend using the respective plugin for your IDE.

VSCode

With VSCode, I recommend adding the Hashicorp Terraform plugin and making sure you keep it and the associated language server up-to-date.

Make sure you have the following in your VSCode settings.json:

"terraform.languageServer": {
  "external": true,
  "args": [
    "serve"
  ]
}

Footnote: There is also a plugin named HCL - this and other plugins might interfere with your terraform file behavior, so I recommend not installing them.

vs CloudFormation

There are a number of things I find more convenient in Terraform:

  • Drift detection & correction. Terraform will detect drift and correct it. CloudFormation stacks are effectively broken if any drift occurs.
  • Easy state management. You can import/remove resources from the Terraform state. This is convenient for importing manually-created resources or removing something from state that you wish to manually alter.
  • Easy state rollback. If you ever end up making a mess of the state, you can easily revert it in S3 (note that git versioning won't help you if you are adding/removing from state)
  • Easy variable & output management. Unlike the convoluted Outputs/Parameters declarations that you need to make for all stacks, TF lets you declare your variables, and anything in scope can use the variables. Any resource is available by its name, and you can declare data attributes to read values from external resources.
  • Easy templating. For anyone who's written a nasty !Join or !Sub block, you'll be overjoyed to use TF templating which allows basic interpolation for whichever file type you want. This also makes development easier since you'll get syntax highlighting instead of a wall of text.

Other downsides to CloudFormation

  • It is not easy to change a child stack. Nested stacks let you reuse modular YAML files, but it generally isn't easy to tear down a nested stack on its own and then bring it back up. You must manually modify the parent stack to remove it, and then re-add it.
  • Changesets are (currently) of little value. At this juncture, changesets generally just tell you everything changed, which makes them functionally worthless. They also don't detect drift, so if any drift occurred, your changeset will fail.
  • Things get stuck. Whether it was because of drift (or another stack touching something) CloudFormation changes can sometimes get infinitely stuck. At least if Terraform gets stuck, you can enable TF_LOG and often this will indicate that there is some sort of AWS degraded service.
  • Transforms. Transforms are useful, but can be a bit complicated and I find that the Terraform alternatives are much more approachable.

Overview

This mini-guide is intended to help you figure out how to best manage your local development setup.

Your tooling choice largely impacts best practices!

I'll largely be focusing my recommendations on the following assumptions:

  • You have an application that bundles up into a container, or can easily be run on a vanilla container image
  • You are going to deploy that app to ECS or Lambda or EC2

What if I'm using other tools?

Respectively:

  • Waypoint: I'd recommend making a waypoint-local.hcl and a waypoint-aws.hcl or something to that effect so you can easily waypoint up with whichever config is correct depending on where you need to deploy
  • Kubernetes: You could probably get away with a setup fairly similar to using docker-compose, but you would probably put your database/cache/etc. in distinct YAML files that wouldn't get deployed to cloud environments. Then you can use k8s exclusively locally, and for prod you would have k8s plus a few managed services (like RDS).

Docker-Compose shared or per-repo?

I recommend creating a docker-compose per application for the following reasons:

  • Mild convention - The only assumption you will need is that your dependent apps reside in the same folder as your current application (unless you wish to use static images). This applies to any approach if you need to develop in multiple apps simultaneously.
  • Explicit dependencies - One of the big advantages (from my viewpoint) is that you have a declaration of your app's dependencies in the form of a docker compose. If it depends on 2 other apps, anyone who develops the app will immediately see this.
  • Easy DB/cache composition - It is simple enough to create a snippet for the database and/or cache, and copy/pasting that isn't overly redundant

How will shared resources work? - As much as possible, I would suggest we avoid sharing resources across applications if they introduce coupling. However, I think it would be possible to set up shared volumes or services if that was necessary.

For example,

  • If 2 apps logged to the same s3 bucket, that is OK and isn't tight coupling.
  • If 2 apps read/write from the same database tables, that might be a sign of tight coupling or bad domain separation.

Standalone alternative

Creating a docker-compose in a standalone "local-development" repo is also a viable alternative. Some benefits and drawbacks:

  • Benefit: One large docker-compose, so only one place to look for updates.
  • Drawback: The large compose may contain a lot of apps you don't currently care about
  • Drawback: If any app is non-functioning, you will need to fix it to get your compose running correctly
  • Benefit: You can share the same database/cache across apps (if desired) within the same compose config.
Bash Helpers

#!/bin/bash
# Some AWS-related helpers

# 1: The pem file to check
get-aws-fingerprint(){
  openssl pkcs8 -in "$1" -nocrypt -topk8 -outform DER | openssl sha1 -c
}

# 1: Your AWS profile designating the desired account so you don't mix them up
# 2: The MFA token (123456) from your MFA authenticator
get-mfa-token(){
  if [ -z "$1" ]; then
    echo 'Arg $1 must be a profile from your ~/.aws/credentials file!'
    return
  fi
  if [ -z "$2" ]; then
    echo 'Arg $2 must be a MFA token!'
    return
  fi
  TARGET_AWS_PROFILE=$1
  MFA_TOKEN=$2
  MFA_ARN=arn:aws:iam::123456:mfa/put.account.here   # replace with your MFA device ARN
  export AWS_PROFILE=$TARGET_AWS_PROFILE
  aws sts get-session-token --serial-number "$MFA_ARN" --token-code "$MFA_TOKEN"
}

# Gets session token and exports necessary AWS variables
# 1: Your AWS profile designating the desired account so you don't mix them up
# 2: The MFA token (123456) from your MFA authenticator
set-mfa-token(){
  JQ_LOC=$(which jq)
  if [ -z "$JQ_LOC" ]; then
    echo 'You need jq to use this function!'
    return
  fi
  MFA_OUTPUT=$(get-mfa-token "$1" "$2")
  KEY_ID=$(jq -r .Credentials.AccessKeyId <<< "$MFA_OUTPUT")
  KEY=$(jq -r .Credentials.SecretAccessKey <<< "$MFA_OUTPUT")
  TOKEN=$(jq -r .Credentials.SessionToken <<< "$MFA_OUTPUT")
  export AWS_ACCESS_KEY_ID=$KEY_ID
  export AWS_SECRET_ACCESS_KEY=$KEY
  export AWS_SESSION_TOKEN=$TOKEN
  echo 'Successfully assigned AWS variables'
}

# Gets the binary secret, decodes it, and outputs it to a file
# 1: The secret ID
# 2: Optional output file
get-binary-secret(){
  SECRET_JSON=$(aws secretsmanager get-secret-value --secret-id "$1")
  SECRET_BINARY=$(jq -r .SecretBinary <<< "$SECRET_JSON")
  DECODED_BINARY=$(echo "$SECRET_BINARY" | base64 -D)   # note: -D is the macOS flag; use -d on Linux
  if [ -z "$2" ]; then
    echo "$DECODED_BINARY"
  else
    echo "$DECODED_BINARY" > "$2"
  fi
}

# Terraform helpers: tfinit/tfplan/tfapply/tfdestroy <env>
tfinit(){
  if [ -z "$1" ]; then
    echo 'Arg $1 should be environment!'
    return
  fi
  terraform init -backend-config=config/backend-$1.conf
}
tfplan(){
  if [ -z "$1" ]; then
    echo 'Arg $1 should be environment!'
    return
  fi
  terraform plan -var-file="config/$1.tfvars"
}
tfdestroy(){
  if [ -z "$1" ]; then
    echo 'Arg $1 should be environment!'
    return
  fi
  terraform destroy -var-file="config/$1.tfvars"
}
tfapply(){
  if [ -z "$1" ]; then
    echo 'Arg $1 should be environment!'
    return
  fi
  echo "Apply started at $(date)"
  terraform apply -var-file="config/$1.tfvars"
  echo "Apply completed at $(date)"
}