Opportunistic Autoscaling with the Kubernetes HPA

Since version 1.6 of Kubernetes, it is possible to use custom metrics to autoscale an application within a cluster.

This means you can expose metrics such as the number of hits on an API or its latency, and scale the serving pods according to that metric instead of CPU load.

Now let's say you are operating a cluster that you rent to your customers via a serverless framework such as Kubeless, Nuclio or Fission. You charge your customers by GB-s used and number of invocations. On your end, however, your cost is running the machines, so you would like to make sure that the cluster is used as much as possible.

You notice over time that the cluster is in fact only used at 60% on average, with peaks at certain times of the day where it runs at 90%. These peaks are really hard to predict because they depend on your customers' business.

You would like to leverage the remainder of the capacity, so that the cluster is permanently used at its maximum. You found that NiceHash would be happy to buy it. Now you need to find a way to dynamically allocate the right amount of resources.

"Opportunistic Scaling"

You create an application that looks into the cluster, extracts the capacity currently being used, and computes the remaining capacity.

This remaining capacity is fed to an autoscaler, which will scale an app to consume it.

As a result:

  • If the "paid" load on your cluster goes UP, the remaining resources go DOWN, and you DOWNSCALE the mining app.
  • If on the contrary your paid load goes DOWN, the remaining capacity goes UP, and you SCALE UP the number of mining containers.

The target is to keep the load as constant as possible around a threshold you define (85%, let's say), thus collecting an average of 25% of otherwise unused power and monetizing it.

It may look like a silly application, but I can tell you that mining pools show a clear load variation when office hours finish, suggesting that business resources are definitely being used for mining at night!

In any case, this is what I define here as Opportunistic Scaling: an application consuming only the resources that are left available, without overruling other apps.

In a real-life scenario, crypto mining effectively adds resources to a compute grid, so it also makes sense beyond the hype and fun. There are plenty of other interesting use cases as well. Among others:

  • Lambda on the edges using a serverless framework (Telco / Cloud Operator)
  • Elastic transcoding (Media Lab / Cloud): Think of what Ikea is doing on workstations but in a compute cluster
  • AI on the edges (Media Lab / Cloud)
  • Caching (CDN)
  • The cool use case you'll share in the comments!

Now let's see how we can do that with Kubernetes.

Using Reversed Custom Metrics

In this blog, we will create a K8s cluster with a custom metrics API on bare metal.

We then create an app that exposes (among other metrics) a metric defined as:

Remaining CPU = (total allocatable CPU in the cluster) - (CPU requested by all applications other than ours)

This metric decreases when the load of the cluster grows, and grows when the load shrinks. It effectively mirrors the requested load on the cluster.
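A rough sketch of how such a metric could be derived with kubectl and jq (this is not the exporter used in this post, and the app=etn-miner label is just a placeholder for however the mining pods are identified):

```sh
# jq helper: convert Kubernetes CPU quantities ("250m" or "4") to cores.
TO_CORES='def cores: if test("m$") then (sub("m$"; "") | tonumber / 1000) else tonumber end;'

# Total allocatable CPU across all nodes.
ALLOCATABLE=$(kubectl get nodes -o json \
  | jq "$TO_CORES [.items[].status.allocatable.cpu | cores] | add")

# CPU requested by every container except those belonging to the miner.
REQUESTED=$(kubectl get pods --all-namespaces -o json \
  | jq "$TO_CORES [.items[]
                   | select(.metadata.labels.app != \"etn-miner\")
                   | .spec.containers[].resources.requests.cpu // \"0\"
                   | cores] | add")

# The value the autoscaler will consume.
echo "cpu_capacity_remaining: $(echo "$ALLOCATABLE - $REQUESTED" | bc)"
```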

Then we will use this metric to configure a Horizontal Pod Autoscaler (HPA) in Kubernetes. This will result in keeping the load in the cluster as high as possible.

Information about this post

This blog post has been sponsored by my friends at Kontron, who graciously allowed me to play with a 6-node cluster of their latest SymKloud platform. Each node has 24 to 32 cores, and some of them have NVIDIA P4 cards.

</insert Kontron Content>

In addition, my friend Ronan Delacroix helped me with the code base and wrote most of the Python needed for this experiment.

</insert Ronan Content>

Setup

Kubernetes Cluster

There are so many solutions out there to build one that we cannot even count them anymore. As usual for my posts on bare metal, I will use the Canonical Distribution of Kubernetes (CDK), in version 1.8.

So we assume that you have a K8s Cluster in version 1.8, with RBAC active, and an admin role.

Important Note: The APIs we will be using here are very unstable and subject to big changes. I really recommend you read the K8s change log to check on them.

  • For example, there was a change in 1.8 to the names of the APIs. If you run a 1.7 cluster, this will impact you.
  • There are also changes in 1.9, where custom.metrics.k8s.io moves from v1alpha1 to v1beta1.

Some configuration details we will see today are done a certain way on CDK and may be slightly different on self-hosted clusters. I will try to mention them whenever possible.

Foreplay

Make sure you have kubectl and Helm installed locally and pointed at your cluster.

RBAC Configuration

NOTE: this applies to CDK, and will apply to GKE when the API Aggregation is GA with K8s 1.9.

On CDK and GKE the default user is not a real admin from an RBAC perspective, so you need to update it before you can create other Cluster Role Bindings that extend your own role.

First, make sure you know your identity.

  • On CDK, you are the user admin:

https://gist.github.com/66aec7469e2ba3d4cb02f00fa885d6cf

  • On GKE:

https://gist.github.com/af3f5fa01b725b1cecbc517d62ed0f02

Now grant yourself the cluster-admin role:

https://gist.github.com/73418507d43ea5ace26f3130db0ff919

You can then check it out with:

https://gist.github.com/0ecd96030f734e0726bf52bb8042b37e
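Since the gists are not inlined here, a minimal sketch of what the grant and the check boil down to (the binding name is arbitrary):

```sh
# Bind the current user (admin on CDK) to the built-in cluster-admin ClusterRole.
kubectl create clusterrolebinding admin-cluster-admin \
  --clusterrole=cluster-admin \
  --user=admin

# Verify that the binding exists and that you now hold full rights.
kubectl get clusterrolebinding admin-cluster-admin -o yaml
kubectl auth can-i '*' '*' --all-namespaces
```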

Helm Install

In order for Helm to operate in a RBAC cluster, we need to add it as a cluster-admin as well:

https://gist.github.com/224fea6e44a3b0db3e6d999007957927
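As a sketch, the usual Helm 2 / Tiller sequence for this looks roughly like the following (the service account and binding names are conventions, not requirements):

```sh
# Give Tiller its own service account and bind it to cluster-admin.
kubectl -n kube-system create serviceaccount tiller
kubectl create clusterrolebinding tiller-cluster-admin \
  --clusterrole=cluster-admin \
  --serviceaccount=kube-system:tiller

# Deploy (or re-deploy) Tiller using that service account.
helm init --service-account tiller --upgrade
```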

OK, we are done prepping our cluster.

Note: By default, Helm deploys Tiller without resource constraints. When deliberately saturating a cluster to maximize its usage, this means Tiller will be among the pods that may get evicted when resources are exhausted. If you do not want that to happen, you can edit the manifest and reapply it:

https://gist.github.com/0ea6fefc8a9c306b064f92bedc1ab174

AutoScaling: Preparing the cluster

Introduction & References

First of all, I recommend you have a look at the documentation about extending Kubernetes.

Then also look at the documentation about extending Kubernetes with the Aggregation Layer.

The last theoretical doc is about setting up an API server, and you can find it here.

Once the aggregation layer is active, we will deploy 2 custom APIs: the Metrics Server and a Custom Metrics Adapter.

OK now that you are versed in what we need to do, let's get started.

Configuring the control plane

In order to activate the Aggregation Layer, we must add a few flags to our API Server:

https://gist.github.com/e54ce501402ba9417e5ce7971571aaa8

You will note that we do not activate the flags

https://gist.github.com/6a84b9bf8fc7bc92976ff480f53ac6e4

This is because the proxy in CDK uses a Kubeconfig and not a client certificate.

However, we do enable aggregator routing because the control plane of Kubernetes is not self-hosted, so we fall into the case "If you are not running kube-proxy on a host running the API server, then you must make sure that the system is enabled with the enable-aggregator-routing flag".

Also we added the client-ca-file flag to export the CA of the API server in the cluster.

Now for the Controller Manager, we must tell it to use the HPA, which we do with:

https://gist.github.com/602f7e038bb24236aa0e0823a1f8191b

Note that the last 2 options here are really for demos, to make it quick to observe the results of actions. You may not need to change them for your use case (they default to 3m and 5m).
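For reference, the kube-controller-manager flags involved are most likely along these lines (flag names from the 1.8 docs; the demo values are illustrative, check your release):

```sh
--horizontal-pod-autoscaler-use-rest-clients=true   # read metrics from the aggregated metrics APIs
--horizontal-pod-autoscaler-upscale-delay=1m        # demo value; the default is 3m
--horizontal-pod-autoscaler-downscale-delay=2m      # demo value; the default is 5m
```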

Just to make sure the settings are applied, restart the 2 services with:

https://gist.github.com/d0aa0de080d2fb58e8a4910b59409e1b

This will make Kubernetes create a configmap in the kube-system namespace called extension-apiserver-authentication, which contains all the additional flags we generated and their configuration. You can have a look at it via

https://gist.github.com/1464d853e2d2922a152e607e5a0035b2

Each of the API servers will now need RBAC authorization to read this ConfigMap. Thankfully, K8s automatically creates a role for it:

https://gist.github.com/c97cc5e3522ea4af1378f4101a713a35

Last but not least, you won't need Heapster for now, so make sure it is not there via:

https://gist.github.com/35c2b3c8a456efc47f90df3927c6e966

Initial API State

Before you start having fun with API servers, have a look at the status of your cluster with:

https://gist.github.com/3caeb8727aa59320d2489c5e1c969711
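If you want to run the check by hand, it most likely boils down to:

```sh
# List the API groups/versions currently served by the cluster.
kubectl api-versions

# Once aggregation is enabled, registered extension APIs also show up here.
kubectl get apiservices
```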

At the end of the next sections, you will have 3 more APIs in this list:

  • monitoring.coreos.com/v1, for the Prometheus Operator
  • metrics.k8s.io, for the Metrics Server that collects metrics for CPU and Memory
  • custom.metrics.k8s.io, for the custom metrics you want to expose

Adding the Metrics Server API

There are 2 implementations of the Metrics API (metrics.k8s.io) at this stage: Heapster and the Metrics Server. At the time of this writing, the Metrics Server has a simple deployment method, while Heapster required some work on my end, and I was too lazy to write the code.

We can simply deploy it with

https://gist.github.com/963f11177617eda64168348a0ffe88d0

This manifest contains:

  • the Service Account for the Metrics Server
  • a RoleBinding so that the Metrics Server can read the configmap above
  • a ClusterRoleBinding so that the Metrics Server inherits the system:auth-delegator ClusterRole (you can find documentation about that here)
  • a Deployment and ClusterIP Service for the Metrics Server
  • an APIService object, which is a registration of the new API into the API Server.
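For illustration, the APIService registration for the Metrics Server roughly looks like this sketch (field values are the usual upstream ones; the deployed manifest is authoritative):

```sh
cat <<EOF | kubectl apply -f -
apiVersion: apiregistration.k8s.io/v1beta1
kind: APIService
metadata:
  name: v1beta1.metrics.k8s.io
spec:
  group: metrics.k8s.io
  version: v1beta1
  service:
    name: metrics-server        # the ClusterIP Service deployed above
    namespace: kube-system
  insecureSkipTLSVerify: true   # fine for a lab, not for production
  groupPriorityMinimum: 100
  versionPriority: 100
EOF
```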

Now check our APIs again:

https://gist.github.com/33b126d3c79468c7be1be3f535537665

Awesome... But does it really work? Query the endpoint of the API via kubectl to make sure

https://gist.github.com/8cad8bd8cdf157b301ea043c5d39fb41

Good job, NodeMetrics and PodMetrics are exposed. Look into what you can use from there:

https://gist.github.com/088b373562a7b3c4542e36a9380a0c72

And

https://gist.github.com/243688abadd186b30219d57ca3050a39

No big surprise here, you can access the CPU and memory consumption in real time. Refer to the docs for more details about how to query the API.
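In practice, the raw queries are probably along these lines (jq is only there for pretty-printing):

```sh
# CPU and memory usage for every node, straight from the metrics API.
kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes | jq .

# Same thing for the pods of a given namespace.
kubectl get --raw /apis/metrics.k8s.io/v1beta1/namespaces/kube-system/pods | jq .
```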

Installing the Custom Metrics Pipeline

In the previous section we took a shortcut: the metrics pipeline is directly exposable as an aggregated API. Unfortunately, in the case of custom metrics, we must do this in 2 distinct steps.

First of all we must deploy the custom metrics pipeline, which will give us the ability to collect metrics. We use Prometheus for that part as the canonical example of metrics collection system on K8s.

Then we will expose these metrics via a specific API server. For that, we will use the work of Solly (@DirectXMan12), which can be found here.

Prometheus has many installation methods. My personal favorite is the Prometheus Operator. It takes a lot of effort to architect a piece of software using traditional methods; crafting a software model that ties beautifully into the underlying distributed platform is closer to art than anything else.

That is essentially what the Operator is: it models how Prometheus should behave given a set of conditions, then realizes that model in Kubernetes. Wow, good job @CoreOS.

Note that you can create an Operator for anything, and something similar is coming for TensorFlow, judging by the APIs that are appearing... Anyway, let's not get distracted.

Install the Prometheus Operator with:

https://gist.github.com/d28da0eccc0b0a52cadcf2e70dcf8e72

This contains:

  • a Service Account for the operator
  • a ClusterRole and ClusterRoleBinding that are fairly extensive, so that the Operator can deploy Custom Resource Definitions for Prometheus (instances of), Alert Managers and Service Monitors.
  • a Deployment for the Operator pod.

This will let the Operator add the monitoring API as well:

https://gist.github.com/6c7819029b7ace33e39cb4da777e73bf

Now create an instance of Prometheus with:

https://gist.github.com/b2e1625577267dc5134715b138c5b956

The RBAC manifest will allow Prometheus to read the metrics it needs in the cluster and /metrics endpoints of any object (pod or service). The Prometheus manifest defines an instance and a service to expose it as a nodePort (so we can have a look at the UI).

What is important in this second file is the section:

https://gist.github.com/17c49bc59e3aefb80641723e4e0fb932

This essentially dedicates the Prometheus instance to Service Monitors carrying this label (or set of labels). When we define the applications we want to monitor, and how, we will need that information.
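As a sketch, the relevant part of the Prometheus custom resource looks something like this (the label key/value is a placeholder for whatever the manifest actually uses):

```sh
cat <<EOF | kubectl apply -f -
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
spec:
  serviceAccountName: prometheus
  serviceMonitorSelector:
    matchLabels:
      team: opportunistic    # only ServiceMonitors carrying this label are scraped
  resources:
    requests:
      memory: 400Mi
EOF
```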

Note that this is a trivial example of deployment, with no persistent storage or any fancy thingy. If you are contemplating using this for a more production grade usage, you will need to spend some time on this.

OK, now you can connect to the UI and check that everything deployed correctly. It is pretty empty for now...


Installing the Custom Metrics Adapter

Now that we have the ability to collect metrics via our Prometheus pipeline, we want to expose them under the aggregated API.

First of all, you will need some certificates. Joy. This is all documented here. Run the following commands to generate your precious:

https://gist.github.com/405bf8028e99bce99bb1eee1576dd0e4

In order to authenticate our extended API server against the Kubernetes API Server, we have several options:

  • Using a client certificate
  • Using a Kubeconfig file
  • Using BasicAuth or Token authentication

Adding users with certificates in CDK is a project in itself and would deserve its own blog post. If you are interested, ping me in the comments and we can discuss it in DMs. BasicAuth and tokens are easy, but they also require editing /root/cdk/known_tokens.csv or /root/cdk/basic_auth.csv on all masters and restarting the API server daemon everywhere.

So the solution with the least complexity is actually the Kubeconfig file. Thanks to RBAC, the only thing we need to create a new user is a service account, which will give us access to an authentication token, which we can then put into our kubeconfig.

https://gist.github.com/ebe8e4fb6ca71907fab13e00d6e3c58a

You can then create a copy of your .kube/config file and edit the user section to add the custom-api-server:

https://gist.github.com/fb916f7f7481e9dd1133a301ee6d1d7b

Do not forget to also edit the contexts to map to this user instead of admin.
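A sketch of how to extract the token and wire it in, assuming the service account is called custom-api-server and lives in kube-system:

```sh
# Find the secret backing the service account, then decode its token.
SECRET=$(kubectl -n kube-system get serviceaccount custom-api-server \
  -o jsonpath='{.secrets[0].name}')
TOKEN=$(kubectl -n kube-system get secret "$SECRET" \
  -o jsonpath='{.data.token}' | base64 --decode)

# Add the user to a copy of your kubeconfig; then edit the context in that
# copy so it maps to custom-api-server instead of admin, as described above.
cp ~/.kube/config custom-metrics.kubeconfig
kubectl --kubeconfig=custom-metrics.kubeconfig config set-credentials custom-api-server --token="$TOKEN"
```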

Now edit a cm-values.yaml file for the helm chart:

https://gist.github.com/1fe8613a730d65a6a57166892f8e3ad9

OK you are now ready to download the chart and install it

https://gist.github.com/7fa2058e2c91b51f159fb289ba7be7e3

And we check that the new API is registered in Kubernetes:

https://gist.github.com/3047a2195737c87e159b03417124f65e

Great. Now let us check that everything works properly by querying the K8s endpoint for it:

https://gist.github.com/5c0ad9ac3a931e60a126008311d4cc9a

At this point, if we get a 200 answer and NOT a 404, we are good: our custom metrics API is up and running. If you get a 404, then something did not work properly.
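The check itself is most likely a raw GET against the new group (the version is v1alpha1 on 1.8 and v1beta1 from 1.9 onwards):

```sh
# A 200 with a (possibly empty) resource list means the adapter is wired in.
kubectl get --raw /apis/custom.metrics.k8s.io/v1alpha1 | jq .
```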

Summary

In the long section above, we have done the following:

  1. Installed the new Metrics Server, adding the metrics.k8s.io API to the cluster. This gave us access to an equivalent of Heapster, but exposing metrics under a classic API.
  2. Installed a custom metrics pipeline to be able to collect arbitrary metrics. We did this via Prometheus, using the Operator to create a Prometheus instance.
  3. Installed the Custom Metrics API custom.metrics.k8s.io via the installation of a Prometheus adapter.

For each step, we validated that the cluster worked properly and as intended. Now we need to put it to good use.

Using Custom Metrics

Demo Application: http_requests

First of all, we will test our setup with a very simple application written by @luxas that exposes an http_requests metric on /metrics. You can deploy it with:

https://gist.github.com/738fb8431fbd1bc3e0cd88b54a20e3da

This manifest contains:

  • the Deployment and Service so we can query the application
  • a Service Monitor, which will indicate to the Prometheus instance that it should scrape the metrics of the application
  • a Horizontal Pod Autoscaler (HPA), which consumes the http_requests metric and uses it to scale the application.

Let us look into the HPA for a moment:

https://gist.github.com/b93e819dd3fd4194964a077265568590

As you can see, we have here

  • a Target (our deployment), with a minReplicas and a maxReplicas.
  • a metric of type Pods, which tries to make sure that each pod serves an average of 500m queries per second (slightly above the standard background load from Kubernetes + Prometheus)

This means an application does not need to be scaled on its own metrics: you could potentially target any application's metrics and use them to manage another application. A very powerful principle.

Let us say, for example, that you manage an application based on the principle of decoupled invocation, such as a chat or an order management solution. One day you start getting a peak of requests on the frontend, and the backend does not follow. The queue fills up, and you start experiencing delays in processing the requests. Well, now you can scale the workers that process the queue based on the requests made on the frontend: you reference the object that carries the http_requests metric (the frontend), while the scale target is your worker deployment. It is as simple as that.
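Here is a sketch of what such an HPA could look like, using an Object metric measured on the frontend Service to scale the worker Deployment (all names are made up for the example):

```sh
cat <<EOF | kubectl apply -f -
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: queue-workers
spec:
  scaleTargetRef:
    kind: Deployment
    name: worker             # the deployment we actually scale
  minReplicas: 2
  maxReplicas: 50
  metrics:
  - type: Object
    object:
      target:
        kind: Service
        name: frontend       # the object whose metric we watch
      metricName: http_requests
      targetValue: 500m      # desired requests per second measured on the frontend
EOF
```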

Now look at how the Custom Metrics API reacts to this (it may take a couple of minutes before this works):

https://gist.github.com/5b2d621dc36c7dc652dac8a67f4582af

And we can then query the service itself:

https://gist.github.com/f8c9150459d36dfff2e6a63d498dcc9e

And we then look at our HPA:

https://gist.github.com/59f008e98ef98854f45c10c921ac8cc9

We can see here that just the requests for status account for 866m. Now deploy a shell app so we can create some load:

https://gist.github.com/2a18fa1787a9018de563975ab186936d

Now prepare 2 shells. In the first one, connect into your container with

https://gist.github.com/515b33d01810145015dee24cec204c47

And in the second one, track the HPA with

https://gist.github.com/efad7d91010d4d53cbd16e7279a83a9b
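From the first shell, a simple loop is enough to generate load; something like this sketch (the service name is an assumption, adjust it to the manifest you deployed):

```sh
# Roughly 10 requests per second for 5 minutes against the demo service.
for i in $(seq 1 3000); do
  curl -s http://sample-metrics-app.default.svc.cluster.local/ > /dev/null
  sleep 0.1
done
```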

This is where Ronan's Python application comes in: it watches the cluster, computes the remaining capacity, and exposes it as a custom metric. In addition, he created a nice UI that presents the values in real time.

This requires a Grafana installation. You can install both apps with:

https://gist.github.com/9334cb37f8152e21a9c6f6c8d0d65c40

These manifests contain

  • Cluster Roles and bindings for collecting metrics
  • Deployments for both Grafana and the python application
  • Nodeports services on ports 30505 (app) and 30902 (Grafana)
  • Config maps to configure both

What is of interest to us in this example is the "cpu_capacity_remaining" metric. As mentioned in the intro, thanks to Kontron I had access to a 184-core cluster. I decided to "reserve" 30 cores, roughly 15% of my capacity, to leave room for load peaks. This gave me an autoscaler looking like this:

https://gist.github.com/6fe140aa88640b1625133094210bca69

You will note I am using Electroneum as my crypto. The reason for this is practical. It is a very new cryptocurrency, with limited mining resources allocated to it right now, which means you can directly measure your impact and see daily returns, which is cool for demos. In case you wonder, as this currency requires a Monero miner, this setup can easily be converted into something more lucrative by pointing it to a real monero pool.

To replicate this blog with your own machines, edit the src/manifest-etn.yaml file according to your own cluster, then deploy with:

https://gist.github.com/edaf158000e16fc6163de8bfa8f61b58

This manifest contains:

  • a Deployment of the miner
  • a Horizontal Pod Autoscaler as seen above
  • a service to expose the UI on port 30500 of nodes.

Now let us check on our HPAs with:

https://gist.github.com/79921139b13a74820c3a342a8082c8d4

Alright, we are all set! Now we can finally check how our application reacts to load.

Opportunistic Autoscaling in motion

In order to saturate our cluster, we reuse our shell-demo application and generate 10 hits per second on the API for 5 minutes. Because the HPA targets only 0.5 hits per second per pod, this will quickly trigger the scale-out:

https://gist.github.com/a37f5c09d1bac99bd09062f7b09a3080

There you go, we can see the new pods coming in. Each new pod requests 4 CPU cores from the cluster. This unbalances the miner's HPA, which counters by releasing miners. Over 5 minutes, our app scales up to 17 replicas, claiming 68 cores from the cluster, cores that are freed by the mining app. After 5 minutes, the load returns to normal and the simple app scales back down from 17 pods to its stable 2 replicas. The HPA for the miner reacts and starts harvesting the capacity again.

This can be seen in the UI on the CPU capacity graph:

[Figure: CPU capacity graph from the UI]

That's it: we have an application that opportunistically adjusts itself to the load created by other applications in the cluster.

Some thoughts about the HPA

While creating this blog, I had a really hard time configuring the HPA to make it stable and convergent rather than completely erratic. One must understand that the HPA in K8s is, so far, pretty dumb. It does not learn from the past; it systematically repeats the same reaction pattern regardless of whether it succeeded or failed before.

Let's say a custom metric is at 150% of its target value: the HPA will then scale the replica count to 150% of its current value. This means that if scaling your application by 1% moves the metric by 2%, you will enter a turbulence zone, with the HPA proving incapable of converging.
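For reference, the reconciliation rule the HPA applies at each iteration is roughly:

```
desiredReplicas = ceil( currentReplicas * currentMetricValue / targetMetricValue )
```

which only converges if the metric reacts less than proportionally to a change in replica count.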

For our application, this means the mining application was designed so that each replica consumes 1 CPU core, and the max value was set to the maximum we could use for this specific app. Other things were running there consuming ~30 cores, hence a maxReplicas of 150 ≈ 184 - 30.

Long story short: if you do this at home, be thoughtful about your HPA design, and do some experiments. If the HPA does not learn, you certainly should.

References

I would like to thank @Luxas and @DirectXMan12 for inspiring this work, and for the fantastic walkthroughs they wrote here and there that helped me a lot while writing this.
