Cilium vxlan overlay for EKS clusters

Cilium vxlan overlay w/ Terraform

Why?

The AWS EKS team works extremely hard. We appreciate all of their effort.

But the aws-vpc-cni requires fine-tuning of complex settings, and:

  1. Limits the number of pods you can run on an EC2 instance, based on the number of ENIs (and IPs per ENI) that the instance type can support. Pod density is valuable.
  2. Requires you to tune settings like WARM_ENI_TARGET, WARM_IP_TARGET, WARM_PREFIX_TARGET, etc...
  3. Runs into conditions where Pods get stuck in "ContainerCreating," since IP management gets tricky under cluster pod churn, and ENABLE_PREFIX_DELEGATION + branch ENIs can lead to a lot of wasted IPs

If you've ever sat there and watched your cluster or pods "get stuck" because of a failure to assign an IP address, you know the pain of this.

Cilium's vxlan overlay approach obviates these problems completely.

Trade-offs and Benefits

  • Trade-off: You won't be able to use "Pod Security Groups" with this implementation
  • Trade-off: You can no longer create services of type LoadBalancer with the AWS Load Balancer Controller using NLBs in IP mode. In other words, NLBs can no longer send traffic directly to your pods; they can only send traffic to instances listening on a NodePort. This means you cannot use the controller's Pod readiness gates, and you can no longer guarantee zero-downtime deployments/upgrades of your ingress pods/nodes. If you're okay with a few dropped requests, then great. If not, think twice! --> thanks to /u/DPRegular
  • Benefit: No more "stuck pods" or IP starvation, ever
  • Benefit: No more pod density/max-pods limitations on your nodes - you can safely use t3/t4 micro, small, etc. with your autoscaler of choice. WE RECOMMEND KARPENTER!

Assumptions

Place _cilium-provisioner.sh, cilium-provisioner.tf, and dynamic-cilium-values.tpl in the same folder as your EKS terraform module.

You can also use _cilium-provisioner.sh + dynamic-cilium-values.tpl without terraform. Just read the instructions in the script, rename dynamic-cilium-values.tpl to cilium-values.yaml, and hard-code your values.
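For reference, a standalone run might look like the sketch below (cluster name, region, and context alias are placeholders; it assumes your AWS credentials and kubectl are already configured, and that a rendered cilium-values.yaml sits next to the script):

export CLUSTER_NAME="my-eks-cluster"
export REGION="us-west-2"
export KUBECTX_ALIAS="my-eks-cluster"

./_cilium-provisioner.sh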

Prerequisites

Read this, and make sure you have all the necessary port/firewall configs in place: https://docs.cilium.io/en/v1.9/operations/system_requirements/#firewall-rules
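For a typical setup where all worker nodes share one security group, the key pieces are node-to-node VXLAN (UDP 8472) and cilium-health (TCP 4240) traffic. A sketch with the AWS CLI (the security group ID is a placeholder; see the linked docs for the complete list of ports):

NODE_SG="sg-0123456789abcdef0"  # placeholder: your worker-node security group

# VXLAN overlay traffic between nodes
aws ec2 authorize-security-group-ingress --group-id "${NODE_SG}" \
  --protocol udp --port 8472 --source-group "${NODE_SG}"

# cilium-health checks between nodes
aws ec2 authorize-security-group-ingress --group-id "${NODE_SG}" \
  --protocol tcp --port 4240 --source-group "${NODE_SG}"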

This assumes:

  1. You're using the EKS module for terraform
  2. You're using Linux Kernel 5.15.x on your AL2/BottleRocket/Ubuntu nodes
  3. Your EKS Service IPv4 range --> 10.100.0.0/16
  4. Your EKS cluster endpoint is accessible 🫠
  5. You're using the latest cilium-cli release
  6. Worker nodes are in private subnets, and have NAT gateways setup on the VPC
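Assumptions 2 and 3 are easy to double-check up front (cluster name and region are placeholders):

# Node kernel versions
kubectl get nodes -o custom-columns=NAME:.metadata.name,KERNEL:.status.nodeInfo.kernelVersion

# Service IPv4 CIDR the cluster was created with
aws eks describe-cluster --name "${CLUSTER_NAME}" --region "${REGION}" \
  --query "cluster.kubernetesNetworkConfig.serviceIpv4Cidr" --output text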

‼️ Important

Tag your nodes properly if you're going to use the installation script

For your nodegroups, you need to find some standard way to tag your EC2s (the installation script relies on this fact).

This is because you have to flush IPTables on any existing/running nodes in your cluster that were using the default aws-vpc-cni plugin, after it is disabled and cilium is installed.

The installation script retrieves those instance IDs by tag, automatically, and then flushes IPTables on those nodes using aws ssm send-command.

You'll probably want to modify/tweak this to fit your setup.

instance_ids=$(aws ec2 describe-instances --filters "Name=tag:Name,Values=*${CLUSTER_NAME}-base-compute*" "Name=instance-state-name,Values=running" --query "Reservations[].Instances[].InstanceId" --output text)

In our case, we tag our "base" nodes (those which are not auto-scaled with karpenter) with this pattern:

nameOfCluster-base-compute-* --> example EC2 name: v3-qa-1-base-compute-1

You will see on line 59 of the _cilium-provisioner.sh script that we use this tag to identify the existing nodes that we have just installed cilium on.

This is an extremely important step. Don't skip it.
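Before running the script, you can sanity-check that the tag filter matches exactly the nodes you expect (adjust the pattern to your own naming convention):

aws ec2 describe-instances \
  --filters "Name=tag:Name,Values=*${CLUSTER_NAME}-base-compute*" "Name=instance-state-name,Values=running" \
  --query "Reservations[].Instances[].[InstanceId, Tags[?Key=='Name']|[0].Value]" \
  --output table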

Changes to NodeGroups and/or Karpenter Provisioners

At startup, all of your nodes will need to be tainted with:

  - key: node.cilium.io/agent-not-ready
    value: "true"
    effect: NoExecute
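The Cilium operator is expected to remove this taint from each node once the agent is ready there. To see which nodes (if any) are still waiting, a quick check (assumes jq is installed):

kubectl get nodes -o json \
  | jq -r '.items[] | select(any(.spec.taints[]?; .key == "node.cilium.io/agent-not-ready")) | .metadata.name'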

Cilium Setup

  • This particular configuration doesn't include L7 traffic shaping or loadbalancing
  • This particular configuration doesn't rely on an egress gateway, although our testing showed that using cilium's egress gateway implementation also works (we tend to avoid any additional complexity wherever possible)
  • This particular configuration is ready for use with cilium's magical cluster mesh networking
  • This particular configuration does not rely on cilium's ingress gateway (we use aws-load-balancer-controller for that)

Caveats & Required Changes

Make sure that:

  • karpenter --> hostNetwork: true
  • aws-load-balancer-controller --> hostNetwork: true
  • metrics-server --> hostNetwork.enabled: true

...otherwise, they won't work.
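If you deploy these via Helm, the switches can be flipped in place. A sketch (release names, namespaces, and chart sources are assumptions based on the upstream charts; adapt them to your own releases):

# Karpenter
helm upgrade karpenter oci://public.ecr.aws/karpenter/karpenter \
  --namespace karpenter --reuse-values --set hostNetwork=true

# aws-load-balancer-controller (eks/ = https://aws.github.io/eks-charts)
helm upgrade aws-load-balancer-controller eks/aws-load-balancer-controller \
  --namespace kube-system --reuse-values --set hostNetwork=true

# metrics-server (metrics-server/ = https://kubernetes-sigs.github.io/metrics-server/)
helm upgrade metrics-server metrics-server/metrics-server \
  --namespace kube-system --reuse-values --set hostNetwork.enabled=true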

We only install the following "EKS addons" generically:

  • kube-proxy
  • coreDNS
  • aws-vpc-cni

We also install the following from their most recent helm charts, and not as addons, since "addon" versions gave us issues:

  • aws-efs-csi-driver
  • aws-ebs-csi-driver
  • external-dns
  • aws-load-balancer-controller

Revert/Restore/Undo

To undo the changes made by cilium and completely uninstall it, just run:

cilium uninstall

This will restore the aws-vpc-cni plugin.

After removing cilium, you'll need to make sure you restart all your pods and/or revert any changes to your helm values.yaml for any of the aforementioned services. You'll definitely want to make sure you restart kube-proxy and coreDNS as well.
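For example (a minimal sketch; extend it to whatever else runs in kube-system in your cluster):

kubectl -n kube-system rollout restart daemonset/aws-node daemonset/kube-proxy deployment/coredns
kubectl -n kube-system rollout status daemonset/aws-node
kubectl -n kube-system rollout status daemonset/kube-proxy
kubectl -n kube-system rollout status deployment/coredns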

#!/bin/bash
#shellcheck disable=SC2154
set -euo pipefail
# !! this is important, don't touch it... !!
export CILIUM_CLI_MODE=helm
###################################################################################################
## This script expects the following environment variables be set prior to execution,
## otherwise it will fail.
#
# CLUSTER_NAME : the name of your cluster
# REGION : the AWS region you're running in
# KUBECTX_ALIAS : the kubecontext identifier/name/alias of the cluster you intend to target
#
###################################################################################################
check_deployment_ready() {
  local deployment=$1
  kubectl wait --for=condition=available --timeout=300s "deployment/${deployment}" -n kube-system
}

check_daemonset_ready() {
  local daemonset=$1
  # DaemonSets don't expose an "Available" condition, so wait on the rollout instead
  kubectl rollout status "daemonset/${daemonset}" -n kube-system --timeout=300s
}
# update kubeconfig
echo "Adding/updating kubeconfig for env..."
aws eks --region "${REGION}" update-kubeconfig --name "${CLUSTER_NAME}" --alias "${KUBECTX_ALIAS}"
# Check readiness for deployments
for deployment in coredns; do
  check_deployment_ready "${deployment}" &
done

# Check readiness for daemonsets
for daemonset in aws-node kube-proxy; do
  check_daemonset_ready "${daemonset}" &
done
# Wait for all readiness checks to complete
wait
echo "All specified deployments and daemonsets are ready."
# Install/upgrade Cilium as appropriate
if helm list --namespace kube-system | grep -q cilium; then
  echo "Cilium already installed."
  echo "Upgrading Cilium..."
  cilium upgrade \
    --cluster-name "${CLUSTER_NAME}" \
    --datapath-mode tunnel \
    --helm-values cilium-values.yaml
else
  echo "Cilium is not installed."
  cilium install \
    --cluster-name "${CLUSTER_NAME}" \
    --datapath-mode tunnel \
    --helm-values cilium-values.yaml
fi
# Wait for Cilium to be ready & healthy
cilium status --wait
echo "Cilium is ready - flushing iptables on base nodes..."
# Flush IPTables on base nodes
# Get Instance Ids whose names contain cluster name and are in 'running' state
instance_ids=$(aws ec2 describe-instances \
  --filters "Name=tag:Name,Values=*${CLUSTER_NAME}-base-compute*" "Name=instance-state-name,Values=running" \
  --query "Reservations[].Instances[].InstanceId" \
  --output text)

# Iterate through each instance
# (text output returns the IDs whitespace-separated, so default word splitting handles them)
for id in ${instance_ids}; do
  echo "FLUSHING DEFAULT IPTABLES ON BASE NODE: ${id}"
  # Send command using AWS SSM
  aws ssm send-command \
    --instance-ids "${id}" \
    --document-name "AWS-RunShellScript" \
    --parameters commands='sudo iptables -t nat -F AWS-SNAT-CHAIN-0 && sudo iptables -t nat -F AWS-SNAT-CHAIN-1 && sudo iptables -t nat -F AWS-CONNMARK-CHAIN-0 && sudo iptables -t nat -F AWS-CONNMARK-CHAIN-1' \
    --comment "Flushing IPTables Chains" \
    --query Command.CommandId \
    --output text \
    --no-paginate \
    --no-cli-pager
done
echo "Rolling all stateful sets and deployments..."
# Restart and wait for DaemonSets in the kube-system namespace
daemonsets=$(kubectl get daemonsets --namespace kube-system --output jsonpath='{.items[*].metadata.name}')

for daemonset in ${daemonsets}; do
  # Restart the DaemonSet
  kubectl rollout restart daemonset "${daemonset}" --namespace kube-system

  # Wait for the rollout to complete
  echo "Waiting for DaemonSet ${daemonset} to become ready..."
  kubectl rollout status daemonset "${daemonset}" --namespace kube-system
done

# Restart and wait for Deployments in the kube-system namespace
deployments=$(kubectl get deployments --namespace kube-system --output jsonpath='{.items[*].metadata.name}')

for deployment in ${deployments}; do
  # Restart the Deployment
  kubectl rollout restart deployment "${deployment}" --namespace kube-system

  # Wait for the rollout to complete
  echo "Waiting for Deployment ${deployment} to become ready..."
  kubectl rollout status deployment "${deployment}" --namespace kube-system
done
echo "Cilium installed successfully, cluster & cluster networking are ready!"
# this file belongs in the same folder as your EKS module
# dynamic cilium-values file content
data "template_file" "cilium_values_template" {
template = templatefile(
"${path.module}/dynamic-cilium-values.tpl",
{
cluster_name = var.env_name
}
)
}
# dynamic cilium-values file
resource "local_file" "cilium_values_file" {
content = data.template_file.cilium_values_template.rendered
filename = "${path.module}/cilium-values.yaml"
}
resource "null_resource" "setup_cilium" {
depends_on = [module.eks.cluster_endpoint]
triggers = {
cluster_id = module.eks.cluster_id
cluster_oidc_issuer_url = module.eks.cluster_oidc_issuer_url
cluster_version = module.eks.cluster_version
cilium_values = local_file.cilium_values_file.content
}
provisioner "local-exec" {
command = "${path.module}/_cilium-provisioner.sh ${var.region} ${var.env_name} ${local.eks_kubectx_alias}"
environment = {
CLUSTER_NAME = var.env_name
KUBECTX_ALIAS = local.eks_kubectx_alias
REGION = var.region
VPC_CIDR_BLOCK = var.vpc_cidr_block
CILIUM_CLI_MODE = "helm"
}
}
}
---
# this file belongs in the same folder as _cilium-provisioner.sh and cilium-provisioner.tf
bandwidthManager:
  enabled: true
bpf:
  masquerade: true
cluster:
  id: 0
  name: ${cluster_name}
encryption:
  nodeEncryption: false
eni:
  enabled: false
externalIPs:
  enabled: true
hostPort:
  enabled: true
k8sServicePort: 443
kubeProxyReplacement: strict
MTU: 9000
nodePort:
  enabled: true
operator:
  replicas: 1
serviceAccounts:
  cilium:
    name: cilium
  operator:
    name: cilium-operator
socketLB:
  enabled: true
tunnel: vxlan
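Once the provisioner has finished and cilium status --wait reports healthy, you can optionally run Cilium's built-in end-to-end checks (this spins up test workloads in the cluster, so run it when that's acceptable):

cilium status --wait
cilium connectivity test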