Created February 10, 2025 04:55
Kubernetes Auto Scaling Best Practices

Resource Allocation and Scaling in Kubernetes

Namespace limits

When you segregate your cluster into namespaces, you should protect against misuse of resources.

You shouldn't allow users to consume more resources than you agreed on in advance.

Cluster administrators can use quotas and limit ranges to constrain the number of objects or the amount of computing resources used in a namespace.

Namespaces have LimitRange

Containers without limits can lead to resource contention with other containers and unoptimized consumption of computing resources.

Kubernetes has two features for constraining resource utilisation: ResourceQuota and LimitRange.

With the LimitRange object, you can define default values for resource requests and limits for individual containers inside namespaces.

Any container created inside that namespace, without request and limit values explicitly specified, is assigned the default values.

Example Code:

apiVersion: v1
kind: LimitRange
metadata:
  name: cpu-resource-constraint
spec:
  limits:
    - default: # this section defines default limits
        cpu: 500m
      defaultRequest: # this section defines default requests
        cpu: 500m
      max: # max and min define the limit range
        cpu: "1"
      min:
        cpu: 100m
      type: Container

Namespaces have ResourceQuotas

With ResourceQuotas, you can limit the total resource consumption of all Pods/containers inside a Namespace.

Defining a resource quota for a namespace limits the total amount of CPU, memory or storage resources that can be consumed by all containers belonging to that namespace.

You can also set quotas for other Kubernetes objects such as the number of Pods in the current namespace.

If you're worried that someone could exploit your cluster and create 20000 ConfigMaps, a ResourceQuota is how you can prevent that. Example:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: configmap-quota
  namespace: my-namespace # Change to your namespace
spec:
  hard:
    configmaps: "10" # Maximum of 10 ConfigMaps allowed in the namespace

For CPU and memory limits, scoped by PriorityClass:

apiVersion: v1
kind: List
items:
  - apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: pods-high
    spec:
      hard:
        cpu: "1000"
        memory: "200Gi"
        pods: "10"
      scopeSelector:
        matchExpressions:
          - operator: In
            scopeName: PriorityClass
            values: ["high"]
  - apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: pods-medium
    spec:
      hard:
        cpu: "10"
        memory: "20Gi"
        pods: "10"
      scopeSelector:
        matchExpressions:
          - operator: In
            scopeName: PriorityClass
            values: ["medium"]
  - apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: pods-low
    spec:
      hard:
        cpu: "5"
        memory: "10Gi"
        pods: "10"
      scopeSelector:
        matchExpressions:
          - operator: In
            scopeName: PriorityClass
            values: ["low"]
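The quota check itself is simple arithmetic: a create request is rejected when current usage plus the requested amount would exceed the hard limit. The sketch below models that rule only; `exceeds_quota` and the dictionaries are hypothetical names for illustration, not part of any Kubernetes client library.

```python
def exceeds_quota(used, requested, hard):
    """Return the resource names whose hard limit would be exceeded.

    used, requested, hard: dicts mapping a resource name to a number.
    Illustrative model only; the real check runs inside the API server.
    """
    return [
        name
        for name, limit in hard.items()
        if used.get(name, 0) + requested.get(name, 0) > limit
    ]


hard = {"configmaps": 10, "pods": 10}
used = {"configmaps": 10, "pods": 3}

# The namespace already holds 10 ConfigMaps, so one more is rejected.
print(exceeds_quota(used, {"configmaps": 1}, hard))  # ['configmaps']
# Only 3 of 10 pods are in use, so one more pod fits.
print(exceeds_quota(used, {"pods": 1}, hard))  # []
```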

How does this impact your autoscaling?

If you create a Deployment without resource requests and limits, the LimitRange assigns the default values configured for the namespace. This allows the HPA to calculate CPU and memory utilization, which is expressed as a percentage of the container's resource request.

Best Practice: Always define requests and limits to prevent a single pod from consuming excessive resources.
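To see why requests matter for the HPA, here is a minimal sketch of how a utilization percentage can be derived from usage and requests. It assumes utilization is total usage divided by total requests across the pods; `average_utilization` is a hypothetical helper, and the millicore figures are made up.

```python
def average_utilization(usage_millicores, request_millicores):
    """Utilization in percent: total usage relative to total requests."""
    total_request = sum(request_millicores)
    if total_request == 0:
        # Without requests there is nothing to compare usage against.
        raise ValueError("no CPU requests set; utilization is undefined")
    return 100 * sum(usage_millicores) / total_request


# Two pods, each with the 500m default request from the LimitRange above.
print(average_utilization([400, 300], [500, 500]))  # 70.0
```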

How HPA Works Internally

For the HPA to work, it needs metrics. For example, to scale on CPU and memory utilization, make sure the Metrics Server is installed in the cluster. It provides the CPU and memory metrics that the HPA uses by default.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50

The HPA determines the desired number of replicas using this formula:

desiredReplicas = ceil(currentReplicas * (currentMetricValue / targetMetricValue))

In the example above, averageUtilization is set to 50%. Assume the current CPU utilization is 80% and the Deployment is running 2 replicas:

desiredReplicas = ceil((80/50) * 2) = ceil(3.2) = 4

The result is rounded up, so the HPA scales to 4 replicas.
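The replica calculation can be sketched in a few lines. This is a simplified model, not the controller's code: `desired_replicas` is a hypothetical helper, the result is rounded up with `ceil`, and the `tolerance` parameter assumes the controller's default of 0.1, inside which no scaling happens.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=None, tolerance=0.1):
    """Simplified model of the HPA replica calculation."""
    ratio = current_metric / target_metric
    # Skip scaling when the metric is close enough to the target.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds (minReplicas / maxReplicas).
    desired = max(desired, min_replicas)
    if max_replicas is not None:
        desired = min(desired, max_replicas)
    return desired


print(desired_replicas(2, 80, 50, min_replicas=2, max_replicas=10))  # 4
print(desired_replicas(4, 52, 50, min_replicas=2, max_replicas=10))  # 4
print(desired_replicas(4, 20, 50, min_replicas=2, max_replicas=10))  # 2
```

The second call stays at 4 replicas because 52% against a 50% target is within the assumed tolerance.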

Scaling Behaviors & Considerations

You can also control the velocity of scaling. Add a behavior section to the HPA manifest to set different velocities for scaling up and scaling down, for example:

behavior:
  scaleUp:
    policies:
      - type: Percent
        value: 900
        periodSeconds: 60
  scaleDown:
    policies:
      - type: Pods
        value: 1
        periodSeconds: 600 # (i.e., scale down one pod every 10 min)

The value 900 means that up to 9 times the current number of pods can be added per period, effectively making the number of replicas 10 times the current size. All other parameters are left unspecified, so default values are used. If the application starts with 1 pod and the load keeps demanding more, it scales up through the following pod counts (subject to maxReplicas):

1 -> 10 -> 100 -> 1000

Scale-down, by contrast, is gradual: one pod is removed every 10 minutes.
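The growth sequence follows directly from the Percent policy arithmetic. The sketch below simulates the upper bound per period, assuming load is always high enough to request the maximum; `max_after_scale_up` and `simulate_scale_up` are hypothetical names.

```python
import math


def max_after_scale_up(current, percent):
    """Upper bound on replicas after one period under a Percent policy."""
    return current + math.ceil(current * percent / 100)


def simulate_scale_up(start, percent, periods, max_replicas):
    """Replica counts per period, assuming demand never falls short."""
    sizes = [start]
    for _ in range(periods):
        nxt = max_after_scale_up(sizes[-1], percent)
        sizes.append(min(nxt, max_replicas))  # maxReplicas caps growth
    return sizes


print(simulate_scale_up(1, 900, 3, max_replicas=1000))  # [1, 10, 100, 1000]
print(simulate_scale_up(1, 900, 3, max_replicas=10))    # [1, 10, 10, 10]
```

With maxReplicas set to 10, as in the earlier manifest, growth stops after the first period.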

stabilizationWindowSeconds: how far back the HPA controller looks at previous recommendations to prevent flapping of the replica count. When scaling up, it helps avoid reacting to false-positive signals. When scaling down, it keeps pods from being removed too early if late load spikes are expected.
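One way to picture the stabilization window: the controller keeps recent recommendations and only scales down to the highest one seen in the scale-down window, and only scales up to the lowest one seen in the scale-up window. The sketch below is a simplified model under that assumption. `stabilize` is a hypothetical helper, and the window values used (300 seconds for scale-down, 0 for scale-up) are assumed defaults.

```python
def stabilize(current_replicas, history, now, up_window, down_window):
    """Pick a replica count from timestamped recommendations.

    history: list of (timestamp_seconds, recommended_replicas); the most
    recent entry must be stamped with `now`.
    """
    up_limit = min(r for t, r in history if now - t <= up_window)
    down_limit = max(r for t, r in history if now - t <= down_window)
    result = current_replicas
    if result < up_limit:
        result = up_limit    # every recent recommendation asks for more
    if result > down_limit:
        result = down_limit  # every recent recommendation asks for less
    return result


history = [(0, 10), (60, 8), (120, 4)]
# Load dropped only 2 minutes ago: the recommendation of 10 is still in the window.
print(stabilize(10, history, now=120, up_window=0, down_window=300))  # 10

history.append((400, 4))
# The higher recommendations have aged out, so scale-down proceeds.
print(stabilize(10, history, now=400, up_window=0, down_window=300))  # 4
```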

Using Custom metrics

The HPA also supports custom and external metrics (data sources can be Prometheus, Datadog, AWS CloudWatch, etc.). If you have non-functional requirements, you can use these metrics to scale the application, for example http_requests_per_second (this metric can be made available from your ingress). Example:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2
  maxReplicas: 15
  metrics:
    - type: External
      external:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: 100

How VPA works

/Add Content

Best Practices for HPA

  1. Use the Right Metrics for Scaling your Application

    1. Default Metrics: CPU, Memory
    2. Custom Metrics: HTTP request rate (RPS), failed requests per second
    3. External Metrics: API call rate
  2. Avoid Overly Aggressive Scaling

    Use stabilization windows to prevent frequent scaling (flapping).

  3. Combine HPA with Readiness & Liveness Probes

    New pods take time to reach the Ready state, so make sure your liveness and readiness probes are configured correctly.

  4. Set Min and Max Replicas Properly

  5. Scale Using Multiple Metrics
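When an HPA lists several metrics, a desired replica count is computed for each one and the largest wins. A minimal sketch of that rule, using hypothetical helper names and made-up metric values:

```python
import math


def desired_for_metric(current_replicas, current_value, target_value):
    """Desired replicas for a single metric, rounded up."""
    return math.ceil(current_replicas * current_value / target_value)


def desired_across_metrics(current_replicas, metrics):
    """metrics: dict of name -> (current_value, target_value)."""
    return max(
        desired_for_metric(current_replicas, current, target)
        for current, target in metrics.values()
    )


metrics = {
    "cpu_utilization": (80, 50),             # asks for ceil(6.4) = 7
    "http_requests_per_second": (150, 100),  # asks for 6
}
print(desired_across_metrics(4, metrics))  # 7
```

Taking the maximum means the most demanding metric drives the replica count, so adding a metric can only increase the result, never lower it.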

You can also use dedicated tools for event-driven autoscaling.
