Cilium vxlan overlay for EKS clusters

Cilium vxlan overlay w/ Terraform

Why?

The AWS EKS team works extremely hard. We appreciate all of their effort.

But the aws-vpc-cni requires fine-tuning of complex settings, and:

  1. Limits the number of pods you can run on an EC2 instance, based on the number of ENIs (and IPs per ENI) that the instance type can support. Pod density is valuable.
  2. Requires you to tune settings like WARM_ENI_TARGET, WARM_IP_TARGET, WARM_PREFIX_TARGET, etc...
  3. Runs into conditions where Pods get stuck in "ContainerCreating," since IP management gets tricky under cluster pod churn, and ENABLE_PREFIX_DELEGATION + branch ENIs can lead to a lot of wasted IPs

If you've ever sat there and watched your cluster or pods "get stuck" because of a failure to assign an IP address, you know the pain of this.

Cilium's vxlan overlay approach obviates these problems completely.

Trade-offs and Benefits

  • Trade-off: You won't be able to use "Pod Security Groups" with this implementation
  • Trade-off: You can no longer create services of type LoadBalancer with the AWS Load Balancer Controller using NLBs in IP mode. In other words, NLBs can no longer send traffic directly to your pods; they can only send traffic to instances listening on a NodePort. This means you cannot use the controller's Pod readiness gates, and you can no longer guarantee zero-downtime deployments/upgrades of your ingress pods/nodes. If you're okay with a few dropped requests, then great. If not, think twice! --> thanks to /u/DPRegular
  • Benefit: No more "stuck pods" or IP starvation, ever
  • Benefit: No more pod density/max-pods limitations on your nodes - you can safely use t3/t4 micro, small, etc. with your autoscaler of choice. WE RECOMMEND KARPENTER!

Assumptions

Place _cilium-provisioner.sh, cilium-provisioner.tf, and dynamic-cilium-values.tpl in the same folder as your EKS terraform module.

You can also use _cilium-provisioner.sh + dynamic-cilium-values.tpl without terraform. Just read the instructions in the script, rename dynamic-cilium-values.tpl to cilium-values.yaml, and hard-code your values.
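For reference, a standalone run might look like the sketch below (cluster name, region, and context alias are placeholders; it assumes your AWS credentials and kubectl are already configured, and that a rendered cilium-values.yaml sits next to the script):

export CLUSTER_NAME="my-eks-cluster"
export REGION="us-west-2"
export KUBECTX_ALIAS="my-eks-cluster"

./_cilium-provisioner.sh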

Prerequisites

Read this, and make sure you have all the necessary port/firewall configs in place: https://docs.cilium.io/en/v1.9/operations/system_requirements/#firewall-rules
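For a typical setup where all worker nodes share one security group, the key pieces are node-to-node VXLAN (UDP 8472) and cilium-health (TCP 4240) traffic. A sketch with the AWS CLI (the security group ID is a placeholder; see the linked docs for the complete list of ports):

NODE_SG="sg-0123456789abcdef0"  # placeholder: your worker-node security group

# VXLAN overlay traffic between nodes
aws ec2 authorize-security-group-ingress --group-id "${NODE_SG}" \
  --protocol udp --port 8472 --source-group "${NODE_SG}"

# cilium-health checks between nodes
aws ec2 authorize-security-group-ingress --group-id "${NODE_SG}" \
  --protocol tcp --port 4240 --source-group "${NODE_SG}"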

This assumes:

  1. You're using the EKS module for terraform
  2. You're using Linux Kernel 5.15.x on your AL2/BottleRocket/Ubuntu nodes
  3. Your EKS Service IPv4 range --> 10.100.0.0/16
  4. Your EKS cluster endpoint is accessible 🫠
  5. You're using the latest cilium-cli release
  6. Worker nodes are in private subnets, and have NAT gateways setup on the VPC
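Assumptions 2 and 3 are easy to double-check up front (cluster name and region are placeholders):

# Node kernel versions
kubectl get nodes -o custom-columns=NAME:.metadata.name,KERNEL:.status.nodeInfo.kernelVersion

# Service IPv4 CIDR the cluster was created with
aws eks describe-cluster --name "${CLUSTER_NAME}" --region "${REGION}" \
  --query "cluster.kubernetesNetworkConfig.serviceIpv4Cidr" --output text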

‼️ Important

Tag your nodes properly if you're going to use the installation script

For your nodegroups, you need to find some standard way to tag your EC2s (the installation script relies on this fact).

This is because you have to flush IPTables on any existing/running nodes in your cluster that were using the default aws-vpc-cni plugin, after it is disabled and cilium is installed.

The installation script retrieves those instance IDs by tag, automatically, and then flushes IPTables on those nodes using aws ssm send-command.

You'll probably want to modify/tweak this to fit your setup.

instance_ids=$(aws ec2 describe-instances --filters "Name=tag:Name,Values=*${CLUSTER_NAME}-base-compute*" "Name=instance-state-name,Values=running" --query "Reservations[].Instances[].InstanceId" --output text)

In our case, we tag our "base" nodes (those which are not auto-scaled with karpenter) with this pattern:

nameOfCluster-base-compute-* --> example EC2 name: v3-qa-1-base-compute-1

You will see on line 59 of the _cilium-provisioner.sh script that we use this tag to identify the existing nodes that we have just installed cilium on.

This is an extremely important step. Don't skip it.
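Before running the script, you can sanity-check that the tag filter matches exactly the nodes you expect (adjust the pattern to your own naming convention):

aws ec2 describe-instances \
  --filters "Name=tag:Name,Values=*${CLUSTER_NAME}-base-compute*" "Name=instance-state-name,Values=running" \
  --query "Reservations[].Instances[].[InstanceId, Tags[?Key=='Name']|[0].Value]" \
  --output table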

Changes to NodeGroups and/or Karpenter Provisioners

At startup, all of your nodes will need to be tainted with:

  - key: node.cilium.io/agent-not-ready
    value: "true"
    effect: NoExecute
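The Cilium operator is expected to remove this taint from each node once the agent is ready there. To see which nodes (if any) are still waiting, a quick check (assumes jq is installed):

kubectl get nodes -o json \
  | jq -r '.items[] | select(any(.spec.taints[]?; .key == "node.cilium.io/agent-not-ready")) | .metadata.name'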

Cilium Setup

  • This particular configuration doesn't include L7 traffic shaping or loadbalancing
  • This particular configuration doesn't rely on an egress gateway, although our testing showed that using cilium's egress gateway implementation also works (we tend to avoid any additional complexity wherever possible)
  • This particular configuration is ready for use with cilium's magical cluster mesh networking
  • This particular configuration does not rely on cilium's ingress gateway (we use aws-load-balancer-controller for that)

Caveats & Required Changes

Make sure that:

  • karpenter --> hostNetwork: true
  • aws-load-balancer-controller --> hostNetwork: true
  • metrics-server --> hostNetwork.enabled: true

...otherwise, they won't work.
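If you deploy these via Helm, the switches can be flipped in place. A sketch (release names, namespaces, and chart sources are assumptions based on the upstream charts; adapt them to your own releases):

# Karpenter
helm upgrade karpenter oci://public.ecr.aws/karpenter/karpenter \
  --namespace karpenter --reuse-values --set hostNetwork=true

# aws-load-balancer-controller (eks/ = https://aws.github.io/eks-charts)
helm upgrade aws-load-balancer-controller eks/aws-load-balancer-controller \
  --namespace kube-system --reuse-values --set hostNetwork=true

# metrics-server (metrics-server/ = https://kubernetes-sigs.github.io/metrics-server/)
helm upgrade metrics-server metrics-server/metrics-server \
  --namespace kube-system --reuse-values --set hostNetwork.enabled=true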

We only install the following "EKS addons" generically:

  • kube-proxy
  • coreDNS
  • aws-vpc-cni

We also install the following from their most recent helm charts, and not as addons, since "addon" versions gave us issues:

  • aws-efs-csi-driver
  • aws-ebs-csi-driver
  • external-dns
  • aws-load-balancer-controller

Revert/Restore/Undo

To undo the changes made by cilium and completely uninstall it, just run:

cilium uninstall

This will restore the aws-vpc-cni plugin.

After removing cilium, you'll need to make sure you restart all your pods and/or revert any changes to your helm values.yaml for any of the aforementioned services. You'll definitely want to make sure you restart kube-proxy and coreDNS as well.
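For example (a minimal sketch; extend it to whatever else runs in kube-system in your cluster):

kubectl -n kube-system rollout restart daemonset/aws-node daemonset/kube-proxy deployment/coredns
kubectl -n kube-system rollout status daemonset/aws-node
kubectl -n kube-system rollout status daemonset/kube-proxy
kubectl -n kube-system rollout status deployment/coredns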

#!/bin/bash
#shellcheck disable=SC2154
set -euo pipefail
# !! this is important, don't touch it... !!
export CILIUM_CLI_MODE=helm
###################################################################################################
## This script expects the following environment variables be set prior to execution,
## otherwise it will fail.
#
# CLUSTER_NAME : the name of your cluster
# REGION : the AWS region you're running in
# KUBECTX_ALIAS : the kubecontext identifier/name/alias of the cluster you intend to target
#
###################################################################################################
check_deployment_ready() {
  local deployment=$1
  kubectl wait --for=condition=available --timeout=300s "deployment/${deployment}" -n kube-system
}

check_daemonset_ready() {
  local daemonset=$1
  # DaemonSets don't expose an "Available" condition, so wait on the rollout instead
  kubectl rollout status "daemonset/${daemonset}" -n kube-system --timeout=300s
}
# update kubeconfig
echo "Adding/updating kubeconfig for env..."
aws eks --region "${REGION}" update-kubeconfig --name "${CLUSTER_NAME}" --alias "${KUBECTX_ALIAS}"
# Check readiness for deployments
for deployment in coredns; do
  check_deployment_ready "${deployment}" &
done

# Check readiness for daemonsets
for daemonset in aws-node kube-proxy; do
  check_daemonset_ready "${daemonset}" &
done
# Wait for all readiness checks to complete
wait
echo "All specified deployments and daemonsets are ready."
# Install/upgrade Cilium as appropriate
if helm list --namespace kube-system | grep -q cilium; then
  echo "Cilium already installed."
  echo "Upgrading Cilium..."
  cilium upgrade \
    --cluster-name "${CLUSTER_NAME}" \
    --datapath-mode tunnel \
    --helm-values cilium-values.yaml
else
  echo "Cilium is not installed."
  cilium install \
    --cluster-name "${CLUSTER_NAME}" \
    --datapath-mode tunnel \
    --helm-values cilium-values.yaml
fi
# Wait for Cilium to be ready & healthy
cilium status --wait
echo "Cilium is ready - flushing iptables on base nodes..."
# Flush IPTables on base nodes
# Get Instance Ids whose names contain cluster name and are in 'running' state
instance_ids=$(aws ec2 describe-instances \
  --filters "Name=tag:Name,Values=*${CLUSTER_NAME}-base-compute*" "Name=instance-state-name,Values=running" \
  --query "Reservations[].Instances[].InstanceId" \
  --output text)

# Iterate through each instance
# (text output returns the IDs whitespace-separated, so default word splitting handles them)
for id in ${instance_ids}; do
  echo "FLUSHING DEFAULT IPTABLES ON BASE NODE: ${id}"
  # Send command using AWS SSM
  aws ssm send-command \
    --instance-ids "${id}" \
    --document-name "AWS-RunShellScript" \
    --parameters commands='sudo iptables -t nat -F AWS-SNAT-CHAIN-0 && sudo iptables -t nat -F AWS-SNAT-CHAIN-1 && sudo iptables -t nat -F AWS-CONNMARK-CHAIN-0 && sudo iptables -t nat -F AWS-CONNMARK-CHAIN-1' \
    --comment "Flushing IPTables Chains" \
    --query Command.CommandId \
    --output text \
    --no-paginate \
    --no-cli-pager
done
echo "Rolling all stateful sets and deployments..."
# Restart and wait for DaemonSets in the kube-system namespace
daemonsets=$(kubectl get daemonsets --namespace kube-system --output jsonpath='{.items[*].metadata.name}')

for daemonset in ${daemonsets}; do
  # Restart the DaemonSet
  kubectl rollout restart daemonset "${daemonset}" --namespace kube-system

  # Wait for the rollout to complete
  echo "Waiting for DaemonSet ${daemonset} to become ready..."
  kubectl rollout status daemonset "${daemonset}" --namespace kube-system
done

# Restart and wait for Deployments in the kube-system namespace
deployments=$(kubectl get deployments --namespace kube-system --output jsonpath='{.items[*].metadata.name}')

for deployment in ${deployments}; do
  # Restart the Deployment
  kubectl rollout restart deployment "${deployment}" --namespace kube-system

  # Wait for the rollout to complete
  echo "Waiting for Deployment ${deployment} to become ready..."
  kubectl rollout status deployment "${deployment}" --namespace kube-system
done
echo "Cilium installed successfully, cluster & cluster networking are ready!"
# this file belongs in the same folder as your EKS module
# dynamic cilium-values file content
data "template_file" "cilium_values_template" {
template = templatefile(
"${path.module}/dynamic-cilium-values.tpl",
{
cluster_name = var.env_name
}
)
}
# dynamic cilium-values file
resource "local_file" "cilium_values_file" {
content = data.template_file.cilium_values_template.rendered
filename = "${path.module}/cilium-values.yaml"
}
resource "null_resource" "setup_cilium" {
depends_on = [module.eks.cluster_endpoint]
triggers = {
cluster_id = module.eks.cluster_id
cluster_oidc_issuer_url = module.eks.cluster_oidc_issuer_url
cluster_version = module.eks.cluster_version
cilium_values = local_file.cilium_values_file.content
}
provisioner "local-exec" {
command = "${path.module}/_cilium-provisioner.sh ${var.region} ${var.env_name} ${local.eks_kubectx_alias}"
environment = {
CLUSTER_NAME = var.env_name
KUBECTX_ALIAS = local.eks_kubectx_alias
REGION = var.region
VPC_CIDR_BLOCK = var.vpc_cidr_block
CILIUM_CLI_MODE = "helm"
}
}
}
---
# this file belongs in the same folder as _cilium-provisioner.sh and cilium-provisioner.tf
bandwidthManager:
  enabled: true
bpf:
  masquerade: true
cluster:
  id: 0
  name: ${cluster_name}
encryption:
  nodeEncryption: false
eni:
  enabled: false
externalIPs:
  enabled: true
hostPort:
  enabled: true
k8sServicePort: 443
kubeProxyReplacement: strict
MTU: 9000
nodePort:
  enabled: true
operator:
  replicas: 1
serviceAccounts:
  cilium:
    name: cilium
  operator:
    name: cilium-operator
socketLB:
  enabled: true
tunnel: vxlan
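Once the provisioner has finished and cilium status --wait reports healthy, you can optionally run Cilium's built-in end-to-end checks (this spins up test workloads in the cluster, so run it when that's acceptable):

cilium status --wait
cilium connectivity test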