@armenr
Last active February 20, 2024 23:48
Wait for EC2 to Become Reachable

EC2 Wait Until Ready

This script is part of a broader library of utilities used alongside Terraform to make life better/easier for Ops & SRE.

Use-Case

Not everything begins and ends with Kubernetes. Sometimes you've got things to do directly on an EC2 instance. It (almost) always goes the same way:

  1. Create an instance
  2. Wait for that instance to "come online"
  3. Ensure that all default cloud-init scripts (and any other UserData) have executed to completion
  4. Do useful work 🫠

As it turns out, handling that particular case cleanly in Terraform is neither easy nor straightforward.

That's what this script is for.
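The wait-and-retry shape of steps 1–3 is easy to sketch in isolation. Below is a minimal, self-contained sketch of that pattern; probe here is a stand-in that only becomes "ready" after a few attempts, whereas the real script substitutes an SSM connectivity check:

```shell
set -euo pipefail

# Stand-in probe: succeeds once a counter file records 3 prior attempts.
# In the real script this is an `aws ssm start-session` attempt.
probe() {
  local count
  count=$(cat "${COUNT_FILE}")
  echo $((count + 1)) > "${COUNT_FILE}"
  (( count >= 3 ))
}

COUNT_FILE=$(mktemp)
echo 0 > "${COUNT_FILE}"

n=0
# Running the probe in the loop condition keeps a failed attempt from
# tripping `set -e`.
until probe; do
  if (( n >= 10 )); then
    echo "never became ready" >&2
    exit 1
  fi
  n=$((n + 1))
done

rm -f "${COUNT_FILE}"
echo "ready after ${n} retries"
```

The key design point is bounding the retries: without the `exit 1` branch, a dead instance would spin the loop forever inside a Terraform apply.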

Usage/Example

Let's say that you've got an EC2 instance you want to provision as soon as it's created. Let's also assume that you want to provision it with something like Ansible.

First, we create a null_resource:

# Blocks ansible run until new/all hosts are ready
resource "null_resource" "verify_instance_readiness" {

  # Run this always...any new instance and/or existing instance should always
  # be ready before the terraform run proceeds
  triggers = { always_run = timestamp() }

  provisioner "local-exec" {
    command = "${path.module}/ec2-wait-until-ready.sh ${aws_instance._.id}"
  }
}

Next, we create a second null_resource, which depends on the verify_instance_readiness null_resource:

resource "null_resource" "ansible" {

  # Triggers matter - we need to ensure we trigger on every possible change to
  # any relevant data, vars, or files
  # ‼️ 👉 This is just ONE *EXAMPLE* trigger, from an existing implementation ‼️
  triggers = {
    # trigger on changes to ansible vars or instance IDs
    instance_id           = aws_instance._.id
    # ...other triggers
  }

  # Block until instance readiness has been verified
  depends_on = [
    null_resource.verify_instance_readiness
    # ...other dependencies
  ]

  provisioner "local-exec" {
    command = <<-EOT
      ansible-playbook \
        --connection=aws_ssm \
        --inventory ${local_file.ansible_inventory.filename} \
        --extra-vars='${jsonencode(local.aspera_ansible_vars)}' \
      ${local_file.ansible_playbook.filename}
    EOT

    environment = {
      ANSIBLE_REMOTE_TEMP                 = "/tmp/.ansible/tmp"
      ANSIBLE_STDOUT_CALLBACK             = "yaml"
      AWS_PROFILE                         = var.aws_cli_profile
      OBJC_DISABLE_INITIALIZE_FORK_SAFETY = "YES"
    }
  }
}
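One detail worth calling out in the provisioner above: jsonencode(local.aspera_ansible_vars) renders the vars as a single JSON object, and ansible-playbook treats an --extra-vars value that begins with { as JSON rather than key=value pairs. A quick local sanity check of that quoting, using a hypothetical stand-in for the rendered vars:

```shell
set -euo pipefail

# Hypothetical stand-in for what jsonencode(local.aspera_ansible_vars)
# would render at plan time -- a single-line JSON object.
extra_vars='{"app_port":8080,"env_name":"dev"}'

# Validate that the string is well-formed JSON locally, before the
# provisioner hands it to ansible-playbook.
echo "${extra_vars}" | python3 -m json.tool > /dev/null
echo "extra-vars JSON is valid"
```

Wrapping the interpolation in single quotes, as the heredoc does, matters: the rendered JSON contains double quotes that would otherwise be eaten by the shell.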

Outcomes/Behavior

On First-Run/EC2 Instance Creation

  1. Your EC2 is created
  2. The first null_resource, verify_instance_readiness, runs
  3. It waits until all cloud-init and UserData scripts have run to completion
  4. It returns successfully
  5. Your next null_resource then runs, executing your provisioning steps --> Bash scripts, Ansible playbooks, etc.

On Subsequent terraform runs

  1. The null_resource is set to run every time because of triggers = { always_run = timestamp() }
  2. It runs
  3. It instantly connects to the existing EC2, sees that everything's great, and returns
  4. It amounts to a totally innocuous NOOP
#!/bin/bash
set -euo pipefail

# This is a simple script which waits until an EC2 instance is reachable via
# SSM Session Manager. This allows us to "block" our Ansible provisioner until
# the instance is ready.
#
# It is assumed that this script resides in the same directory as the
# Terraform module that uses it! It may also require some minor changes if you
# want to explicitly pass a region to the underlying aws command.
#
# Usage: ./ec2-wait-until-ready.sh <INSTANCE_ID>
# Tested with: Amazon Linux 2 + Terraform

instanceId=$1
n=0

echo "[*] WAIT_FOR_EC2: Checking instance connectivity and state..."

# Phase 1: retry until an SSM session can be opened against the instance.
# Note: `start-session` requires the AWS CLI Session Manager plugin. Running
# the aws call in the loop condition keeps a failed attempt from tripping
# `set -e`.
until aws ssm start-session --target "${instanceId}" >/dev/null 2>&1; do
  if [[ "${n}" -ge 10 ]]; then
    echo "[* ${instanceId}]: SSM connectivity could not be verified after ${n} attempts" >&2
    exit 1
  fi
  echo "[* ${instanceId}]: SSM Connectivity attempt #${n}"
  n=$((n + 1))
  sleep 5
done

echo "[* ${instanceId}]: SSM-session connectivity verified!"

# Phase 2: poll until cloud-init (and therefore all UserData) has finished.
tries=0
RESPONSE_CODE=1

while [[ "${RESPONSE_CODE}" != 0 && "${tries}" -le 50 ]]; do
  echo "[* ${instanceId}]: Checking if cloud-init is still running - attempt #${tries}"

  cmdId=$(
    aws ssm send-command \
      --document-name AWS-RunShellScript \
      --instance-ids "${instanceId}" \
      --parameters commands="sudo cloud-init status --wait > /dev/null 2>&1" \
      --query Command.CommandId \
      --output text \
      --no-paginate \
      --no-cli-pager
  )

  sleep 5

  # The invocation may not be queryable yet (or may still be in flight);
  # fall back to a non-zero sentinel instead of letting `set -e` kill the
  # script mid-poll.
  RESPONSE_CODE=$(
    aws ssm get-command-invocation \
      --command-id "${cmdId}" \
      --instance-id "${instanceId}" \
      --query ResponseCode \
      --output text \
      --no-paginate \
      --no-cli-pager || echo "None"
  )

  if [[ "${RESPONSE_CODE}" != 0 ]]; then
    echo "[* ${instanceId}]: cloud-init is still running. Retrying in 5 seconds..."
    sleep 5
  fi

  # `tries=$((tries + 1))` rather than `((tries++))`, which returns a
  # non-zero status when tries is 0 and would abort under `set -e`.
  tries=$((tries + 1))
done

if [[ "${RESPONSE_CODE}" != 0 ]]; then
  echo "[* ${instanceId}]: cloud-init did not finish within the retry budget" >&2
  exit 1
fi

echo "[* ${instanceId}]: response_code => ${RESPONSE_CODE}"
echo "[* ${instanceId}]: cloud-init is no longer running."
echo "[* ${instanceId}]: Let's get to work!"