Angi Test

A brief summary of the tools that were chosen and why.

Summary

Things that were done:

  • CI/CD pipeline to build, test and deploy a hello world web application
  • The above application gets deployed to a kubernetes cluster (EKS)

Bonus:

  • Use infrastructure as code
  • The application can be scaled automatically to handle increased load (cluster autoscaling is also implemented in case we ever run out of nodes)
  • Changes to the application can be audited and rolled back if needed (via GitOps, or manually through Argo CD)

Description

I decided to break the test down into three different logical pieces.

Infra: The Infra piece is responsible for setting up the EKS cluster along with all of its dependencies, such as the VPC, subnets, SGs, IAM roles and policies, ECR repos, etc.

App: I decided to write a very simple web application (as my hello world application). I kept things very simple and wrote a python file using the FastAPI library, a pip requirements file and a Dockerfile. The python file and its dependencies get bundled into a container image that is pushed to the ECR repo using GitHub Actions. Once a new version is pushed, we also get a Slack notification about it, prompting whether we want to deploy this new image version to the EKS cluster (the Slack action is not configured yet).
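
A rough sketch of what such a build-and-push workflow could look like (the workflow name, branch, IAM role ARN and image name below are illustrative placeholders, not the exact values used in the angi-test-app repo):

# .github/workflows/build-and-push.yaml -- illustrative sketch only
name: build-and-push
on:
  push:
    branches: [main]
jobs:
  build-and-push:
    runs-on: ubuntu-latest
    permissions:
      id-token: write      # needed when authenticating to AWS via OIDC
      contents: read
    steps:
      - uses: actions/checkout@v3
      - uses: aws-actions/configure-aws-credentials@v2
        with:
          role-to-assume: arn:aws:iam::111111111111:role/github-actions-ecr   # placeholder role
          aws-region: us-east-1
      - id: ecr
        uses: aws-actions/amazon-ecr-login@v1
      - name: Build and push the image
        run: |
          docker build -t ${{ steps.ecr.outputs.registry }}/test-app:${{ github.sha }} .
          docker push ${{ steps.ecr.outputs.registry }}/test-app:${{ github.sha }}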

App Manifests: For managing the CD piece of the workflow, I decided to use Argo CD. The ang-test-charts repo contains a helm chart for the Argo CD applications (leveraging the app-of-apps pattern) and one for deploying the above python app via Argo CD.
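
In case it helps, the app-of-apps pattern boils down to a parent Argo CD Application that points at a path containing the child Application manifests. A rough sketch (the repo URL, path and sync options here are placeholders, not necessarily what the chart renders):

# Parent "app of apps" Application -- illustrative sketch
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/charts.git   # placeholder for the charts repo
    targetRevision: main
    path: apps                                       # directory holding child Application manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true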

A high-level architecture diagram is attached to this gist: High level architecture

A sample Slack notification is also attached: Slack notification

Tools Chosen

Terraform:

Purpose: I chose terraform for managing all the pieces of infrastructure (and for some helm charts which I feel would be common across any number of EKS clusters we spawn).

Why this tool: Using terraform, I was able to define the infrastructure for the EKS cluster (and its dependencies) in a declarative fashion and abstract reusable pieces into modules, keeping the terraform code DRY and relatively easy to read. Another important reason is that terraform stores the expected "state" of the infrastructure in a state file (an S3 bucket in our case). Every terraform plan against the dev env compares the real state of the infrastructure with what we want it to be; if there is any drift between the two, terraform apply will change the actual infrastructure resources to match the desired state. Terraform is also currently the industry standard for managing infrastructure.

Other tool options: There are a few options in the IaC world, such as Pulumi, CloudFormation and a couple of others. I chose terraform because of my experience with the tool and the fact that it's not a PaaS-vendor-provided tool. Unlike terraform, vendor-specific tools like CloudFormation make the knowledge of the tool almost non-transferable to other platforms due to the lack of common standards. Some tools like Pulumi promise to be more OOP-sy, but in my experience things can get quite complicated and slow when using their OSS offering. Moreover, the pricing of Pulumi has been all over the place (at least until a few months ago). Given that Pulumi itself uses terraform providers under the hood, troubleshooting any bugs in a provider becomes quite tricky due to the abstraction. So until the product matures, I prefer avoiding Pulumi. (I hope the above didn't come out as a rant against Pulumi :) )

Helm:

Purpose: I chose helm to be the package manager for deploying our demo app, as well as a few other charts, to the kubernetes cluster. For the app, I wrote a helm chart that creates all the resources needed to make it horizontally scalable and accessible from outside of the kubernetes cluster. For my CD implementation, Argo CD's Application specification is also written as a helm chart, which gets deployed automatically after the EKS cluster is built.

Why this tool: Helm charts help template out the kubernetes yaml specifications for all of the kubernetes resources, making deployments of complex applications easier to keep track of. Helm also has the concept of hooks, which can be leveraged for sequencing the deploy of your application. For example, if we want TLS certs (present in a volume or secret) to be mounted in the pod before deploying an Ingress, we can do that using the pre-install hooks. Helm also keeps track of upgrades and previous versions, which makes rolling back trivial. On the topic of keeping things DRY, besides templating, helm (in the form of library charts) also allows snippets of code to be re-used across charts.
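
As a generic illustration of the hook idea (not a template from this project's charts), a Job annotated as a pre-install hook runs before the rest of the release is installed:

# templates/tls-prep-job.yaml -- generic illustration of a helm pre-install hook
apiVersion: batch/v1
kind: Job
metadata:
  name: {{ .Release.Name }}-tls-prep
  annotations:
    "helm.sh/hook": pre-install
    "helm.sh/hook-weight": "-5"                 # lower weights run first
    "helm.sh/hook-delete-policy": hook-succeeded
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: tls-prep
          image: bitnami/kubectl:latest          # placeholder image
          command: ["sh", "-c", "echo 'prepare the TLS secret here'"]

And rolling a release back is just helm rollback <release> <revision>.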

Other tool options: There are a few other widely used tools, like kustomize. I chose helm since it encapsulates the kubernetes objects into a single deployable unit, hiding a lot of the complexity of the deployment. Kustomize is quite different: rather than a templating engine like helm, it is more of a tiered configuration management tool. It works on the concept of bases and overlays, where the yaml specifications from the base are loaded at run time and the overlay, if any, is merged on top of it. Both have their uses (for example, I would gladly incorporate kustomize if we had a lot of different environments to deploy helm charts or plain yaml specifications to). Mainly, I chose helm because of my experience with it and given that this is a small project comprising just one environment.
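
For comparison, the kustomize base/overlay idea looks roughly like this (hypothetical paths and file names):

# overlays/dev/kustomization.yaml -- hypothetical example of an overlay patching a base
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base                 # plain yaml specs shared by every environment
patches:
  - path: replica-count.yaml   # dev-only tweak merged on top of the base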

GitHub Actions:

Purpose: I chose GitHub Actions as my CI tool for testing, building and pushing the application container image to the ECR repo.

Why this tool: It's a SaaS offering from GitHub, and I'm hosting my repositories on GitHub. It's very straightforward for the simple CI tests and builds that I wanted to run. Well, this is only partly true. Another reason is that I haven't used EKS exhaustively for my personal (non-work-related) projects (it's a money burner!!), so setting things up for EKS in terraform (the infra repo) took me a fair bit of time to figure out. (I've used kind and k3s a lot, so I wanted to use this test as an excuse to get my hands dirty with EKS.)

Other tool options: I had toyed with the idea of using AWS CodeBuild or self-hosting some OSS CI tool, but in the end, due to time constraints, I decided to go with GitHub Actions.

ArgoCD:

Purpose: I chose Argo CD for providing continuous deployment of my demo web application.

Why this tool: Argo CD (and the other Argo project offerings) is something I have wanted to sink my teeth into for quite some time. It made sense to me to have the deployment piece handled "automatically" once the application has been tested and approved. I haven't explicitly implemented the approval feature in my CI pipeline (something that seems trivial), but I'm relying on the fact that someone with the appropriate privileges is able to merge the PR.

Other tool options: I have played around with Rancher Fleet but didn't want to deploy the whole Rancher kubernetes orchestration tool just to perform GitOps-style CD. I wanted to try my hand at Argo Workflows, but it seemed a bit non-trivial to set up in the 3 days I had for this project. Another option was using GitHub Actions itself as a CD tool for true GitOps functionality, but that would mean setting up a whole other set of RBAC roles and policies in the kubernetes cluster, not to mention that centralizing the audit records would be another issue. Moreover, I eventually want to implement Argo Workflows (for automated rollbacks and b/g deploys), so going with Argo CD made more sense.

Docker:

Purpose: I've used a Dockerfile for packaging the application into a container image using the docker CLI tool (not the Docker Engine).

Why this tool: I needed to build container images in the most straightforward way possible, and since there were no restrictions on the tool I could choose for this, I chose the docker CLI (with the containerd container runtime).

Other tool options: (I use colima on my mac with the containerd runtime.) I have played with nerdctl, which is almost a drop-in replacement, but decided to just use the docker CLI since it's installed on my machine as well as present in the GitHub Actions ubuntu base image I'm using. I could've also used Kaniko for building the image in kubernetes, but that felt like a bit too much work for this project. There are some other options like buildah or buildkit which can be explored if needed.

Slack:

Purpose: Sending notifications about new app releases, changes to infra, deployment success/failure, etc.

Why this tool: Honestly, just a preference. I've used it at work and for home projects.

Other tool options: So, so many to even list. However, in my experience most orgs use Slack.

Testing

Coz you don't wanna take my word for it when I say "Trust me.. it all works!"

Infra: The state file is stored encrypted in an S3 bucket, so to be able to run terraform operations against it, your user/role will need some permissions. The easiest way (since this is just a test) would be to create an admin user like mine and do the following:

git clone git@github.com:Mohitsharma44/angi-test-infra.git
cd angi-test-infra/dev
terraform init
terraform plan

You should see no changes to the infrastructure, so let's access the kubernetes cluster. Assuming you have the aws CLI installed, run the output of the following:

terraform output | sed -ne "s/.*configure_kubectl = //p" | tr -d '"'

Now you can check all the resources deployed for our demo app (assuming you have kubectl installed):

kubectl get all -n testapp

Let's hit the endpoint and see if we get the right version (I don't want to write a go template, so here are my jq parsing skills at their best :D):

endpoint=$(kubectl get ing -n testapp -o json | jq -r '.items[].status.loadBalancer.ingress[].hostname')

curl -L http://$endpoint/

# You can also see some metrics, but for that you need to port-forward locally (we don't want to expose our app metrics to the world now, do we? Side topic, but an interesting read: https://sysdig.com/blog/exposed-prometheus-exploit-kubernetes-kubeconeu/)
kubectl port-forward svc/test-app -n testapp 8090:80

curl http://localhost:8090/metrics

The version should match this: https://github.com/Mohitsharma44/angi-test-app/blob/main/main.py#L4

But this is no fun.. We want to see it scale up AND down based on the load.

Let's hit it with an http load-testing tool. I'm using plow for generating load. (I had some fd (file descriptor) issues running Apache Benchmark; I think ab wasn't closing the connections properly when using concurrency, and was waiting for the ALB to do it 🤷 Didn't have much time to investigate.)

Anyways, back to making our pods sweat a bit.. and having the HPA come to their rescue:

plow http://$endpoint/ -c 200 -n 100000 -d 5m

In a separate terminal, let's see some scaling action:

watch -n 1 'kubectl top po -n testapp --use-protocol-buffers && kubectl get hpa -n testapp'

Yeah, kubectl top doesn't support -w/ --watch

At some point (in 10+ or so seconds), you should see REPLICAS scale up to 2 and then 3 (you can check the scaling criteria in the helm chart; a rough sketch of the rendered HPA is shown below).
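
For reference, the HPA rendered by the chart would look something along these lines (the thresholds and replica counts here are illustrative; the real values live in the chart):

# Illustrative HPA -- the real thresholds/replica counts are defined in the app's helm chart
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: test-app
  namespace: testapp
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: test-app
  minReplicas: 1
  maxReplicas: 3
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50   # placeholder threshold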

Now stop the plow script (ctrl + c) and wait a few seconds (well, some 15+ or so seconds) for the replicas to scale back down.

For auditing the changes to the application, we are using GitOps, so there is that. Next, we can log into the Argo CD UI and get events from there:

kubectl port-forward svc/argo-cd-argocd-server -n argocd 8080:80

The password to log into Argo CD can be obtained from a kubernetes secret:

kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d

Now log into the UI using admin as the user and the password obtained above.

Improvements:

Quite a few. To save you from reading more lines of text, we can discuss these over a call, but here are two things I wasn't able to achieve (followed by a few other improvements I'd make):

  • CI/CD for Infra: I was able to get most of the stuff working, but during one of the steps in the infrastructure setup, I use the kubernetes and helm terraform providers to deploy certain helm charts. These providers rely on obtaining the kubeconfig using the aws CLI. Now, for some odd reason, terraform plan in the GitHub Actions run is complaining about the aws executable not being found, which is a bit odd since I'm using the aws CLI in previous build steps (in the same image). Check the logs if you'd like to: https://github.com/Mohitsharma44/angi-test-infra/runs/7201102173?check_suite_focus=true#step:7:573 It's probably something minor that I'm missing, because of which the exec plugins are not able to see the installed aws CLI.

  • Auto-update the helm chart whenever a container image is pushed to the ECR repo: This is fairly trivial to implement and just needs a bit more time to set up. There are a couple of ways this can be achieved:

    • Have a merge to master in the app repo trigger a CI job in the helm charts repo that updates the image version in the application helm chart (this will be picked up by Argo CD and we'll have a true CD pipeline).
    • Have a lambda be triggered whenever a new image version is pushed to our repo in ECR. This lambda can update the image version, and the process can continue like above.
    • Have a periodic/cron CI job in the manifest repo that checks whether the version in the app repo matches the image version in the manifest repo ...
  • For terraform and kubernetes -- Implementing policy as code (OPA), integration testing and module unit testing.

  • For kubernetes -- Have a domain with DNS records and ACM certs for easy access to the kubernetes cluster and for publishing the app securely.

  • Role-based access to the kubernetes cluster instead of credential-based access.

  • Scaling the app based on the number of incoming requests instead of relying on CPU and memory metrics -- I actually wanted to use KEDA with Prometheus for this. I've deployed Prometheus and I'm exposing metrics from the app that Prometheus is scraping; I just need to plug them into KEDA's ScaledObject specification (a rough sketch follows this list).
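
A rough sketch of what that ScaledObject could look like (the Prometheus address, metric query and threshold below are assumptions, not values from the repo):

# Illustrative KEDA ScaledObject scaling on request rate from Prometheus
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: test-app
  namespace: testapp
spec:
  scaleTargetRef:
    name: test-app               # the Deployment created by the app's helm chart
  minReplicaCount: 1
  maxReplicaCount: 5
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-server.monitoring.svc:80              # placeholder address
        query: sum(rate(http_requests_total{namespace="testapp"}[2m]))         # placeholder metric/query
        threshold: "100"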

I think I'll stop with this doc now. There are a lot of pieces that I haven't discussed.. things like why I did not go with Fargate, why a single NAT gateway, why m5 instances, and how I decided on the cluster autoscaling strategy. There are also future plans like monitoring using Grafana, centralized log shipping, HA Prometheus, scaling apps based on Prometheus metrics, alerting, etc.

Cheers,

-Mohit
