
@shpwrck
Last active September 29, 2022 15:15

Config required to scale and secure Istio for production

*note: working document, may not apply to all installations/architectures

Cert Management

Manage Certificates with Cert-Manager

Benefit:

Manual deployment of certificates is error prone.

Notes:

Vault can also be leveraged as a certificate source.
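As a rough illustration of declarative certificate management, a cert-manager Certificate can request a TLS certificate and keep it renewed in a Kubernetes secret. This is a minimal sketch, not a definitive setup: the issuer name, hostname, and secret name are hypothetical placeholders, and it assumes the gateway runs in istio-system.

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: ingress-cert              # hypothetical name
  namespace: istio-system         # assumption: gateway runs in istio-system
spec:
  secretName: ingress-cert        # secret a Gateway's credentialName could reference
  dnsNames:
  - app.example.com               # hypothetical hostname
  issuerRef:
    name: example-ca-issuer       # hypothetical issuer; it could be backed by Vault
    kind: ClusterIssuer
```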

Istio CNI

Install Istio CNI

Benefit:

Installing the Istio CNI plugin allows users to deploy Istio workloads without granting them the NET_ADMIN and NET_RAW capabilities.

Notes:

The istio-cni plugin may cause issues with init containers. Follow these steps to address them.
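A minimal sketch of enabling the CNI component, assuming the installation is managed through an IstioOperator resource (as the overlay example later in this document suggests):

```yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  components:
    cni:
      enabled: true   # deploys the istio-cni node agent alongside the control plane
```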

IstioD

Scale Istiod

Benefit:

An Istiod outage can impact the data plane through validation. By default Istiod is configured with an HPA, but with a minimum of only one replica.

Notes:

At minimum Istiod should have 2 replicas.

Increase Maximum HPA

Benefit:

The default HPA maximum of 5 replicas suits small environments. Larger environments need a higher cap.

Notes:

Change the value from 5 to 10.

Reduce HPA Target Utilization

Benefit:

Lowering the target average utilization for the Istiod pods triggers scaling events sooner, which better suits production usage patterns.

Notes:

Reducing the utilization from 80% to 60% will suffice in many situations.
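The three Istiod autoscaling recommendations above (minimum replicas, HPA maximum, target utilization) could be expressed together roughly as follows. This is a sketch that assumes an IstioOperator-based install using the istiod chart values; depending on Istio version and profile, the same settings may instead need to go under components.pilot.k8s.hpaSpec.

```yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  values:
    pilot:
      autoscaleEnabled: true
      autoscaleMin: 2                  # at least two replicas
      autoscaleMax: 10                 # raised from the default of 5
      cpu:
        targetAverageUtilization: 60   # lowered from 80
```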

Set Istiod Requests Appropriately

Benefit:

The default installation requests of 500m CPU and 2048Mi memory are sized for small pilot installations.

Notes:

[ref] At scale, Istiod uses 1 vCPU and 1.5 GB of memory.
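A sketch of raising the Istiod requests, again assuming an IstioOperator-based install. The numbers are illustrative only, loosely based on the at-scale figures quoted above, and should be replaced with values observed in your own environment.

```yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  values:
    pilot:
      resources:
        requests:
          cpu: 1000m       # illustrative: one full vCPU
          memory: 2048Mi   # illustrative: headroom above the quoted 1.5 GB
```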

Disable or Diminish Trace Sampling

Benefit:

The default 1% sampling rate gathers more traces than necessary above 10,000 requests per second.

Notes:

Depending on the environment, 0.1% or 0.01% may be appropriate.
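One way to lower the sampling rate mesh-wide is through the proxy defaults in meshConfig. A minimal sketch, assuming an IstioOperator-based install; the value is a percentage, so 0.1 means 0.1% of requests.

```yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    defaultConfig:
      tracing:
        sampling: 0.1   # percentage of requests traced; tune for your traffic volume
```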

Leverage Revisions and Revision Tags

Benefit:

Revisions and revision tags enable effective upgrades and reduce resource maintenance.

Notes:

Be aware that tags and revisions have separate format requirements.
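As a sketch of how workloads consume a revision tag: a namespace opts into injection by pointing its istio.io/rev label at the tag, and the tag is later moved to a new revision during an upgrade without relabeling the namespace. The namespace name and the tag name prod-stable are hypothetical.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: example-app              # hypothetical namespace
  labels:
    istio.io/rev: prod-stable    # hypothetical revision tag, not a revision name
```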

Leverage Pilot Pod Anti-Affinity/Affinity

Benefit:

Running multiple pilot replicas on the same node still leaves a single point of failure.

Notes:

Newer feature that is not fully documented in istio/istio. ref
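A minimal sketch of spreading Istiod replicas across nodes, assuming an IstioOperator-based install and that Istiod pods carry the app: istiod label. The required rule shown here blocks scheduling when there are fewer eligible nodes than replicas; a preferred rule is a softer alternative.

```yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  components:
    pilot:
      k8s:
        affinity:
          podAntiAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: istiod                       # assumption: default istiod pod label
              topologyKey: kubernetes.io/hostname   # at most one replica per node
```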

Disable envoyMetricsService/envoyAccessLogService

Benefit:

These features are primarily diagnostic and should be used outside of performance-sensitive environments.

Notes:

Tuning of the logs and metrics may eliminate the need to disable these.

Configure Certificate TTL

Benefit:

The default and maximum workload certificate TTLs should be configured to suit your TLS requirements. Set them using environment variables on Istiod.

Notes:

DEFAULT_WORKLOAD_CERT_TTL

MAX_WORKLOAD_CERT_TTL
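These variables could be set on the Istiod deployment roughly as follows. This is a sketch assuming an IstioOperator-based install; the durations are illustrative, not recommendations, and the maximum should be at least as long as the default.

```yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  components:
    pilot:
      k8s:
        env:
        - name: DEFAULT_WORKLOAD_CERT_TTL
          value: "12h"    # illustrative: lifetime used when a workload does not request one
        - name: MAX_WORKLOAD_CERT_TTL
          value: "48h"    # illustrative: longest lifetime a workload may request
```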

Tune Pilot Debounce Settings

Benefit:

Debouncing batches configuration changes before they are pushed to the proxies, which reduces the burden on Istiod in highly dynamic systems.

Notes:

PILOT_DEBOUNCE_AFTER

PILOT_DEBOUNCE_MAX
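A sketch of tuning the debounce window through Istiod environment variables, assuming an IstioOperator-based install. The values are illustrative: longer windows mean fewer, larger pushes at the cost of slower configuration propagation.

```yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  components:
    pilot:
      k8s:
        env:
        - name: PILOT_DEBOUNCE_AFTER
          value: "500ms"   # illustrative: quiet period to wait after the last change
        - name: PILOT_DEBOUNCE_MAX
          value: "30s"     # illustrative: upper bound on how long a push can be delayed
```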

Gateways

Enable Affinity/Anti-Affinity on Gateways

Benefit:

Same as the Istiod example: replicas that share a node remain a single point of failure.

Notes:

In heterogeneous environments, network topology may be an additional consideration here.

Scale Gateways

Benefit:

The gateways are critical to the data path and should run with minReplicas greater than 1.

Notes:

2 is the absolute minimum; more may be necessary.
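The gateway anti-affinity and scaling recommendations above could be combined roughly like this. It is a sketch assuming an IstioOperator-managed ingress gateway named istio-ingressgateway with the istio: ingressgateway pod label; the replica counts are illustrative.

```yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  components:
    ingressGateways:
    - name: istio-ingressgateway        # assumption: default gateway name
      enabled: true
      k8s:
        hpaSpec:
          minReplicas: 2                # absolute minimum for the data path
          maxReplicas: 5                # illustrative cap
        affinity:
          podAntiAffinity:
            preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    istio: ingressgateway           # assumption: default gateway pod label
                topologyKey: kubernetes.io/hostname # prefer one replica per node
```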

Add Gateway PreStop Patch to allow for all connections to close

Benefit:

Delaying pod shutdown gives slow or long-lived connections additional time to close before the gateway terminates.

        overlays:
          - apiVersion: apps/v1
            kind: Deployment
            name: istio-ingressgateway-${ISTIO_REVISION}
            patches:
            # Sleep 25s on pod shutdown to allow connections to drain
            - path: spec.template.spec.containers.[name:istio-proxy].lifecycle
              value:
                preStop:
                  exec:
                    command:
                    - sleep
                    - "25"  

Notes:

The 25-second value is arbitrary. Tune it for your environment.

Proxy and Connections

Upgrade to http2 wherever possible

Benefit:

HTTP/2 generally outperforms HTTP/1.1 in situations where it can be leveraged.

Notes:

An explanation of how it can be configured is here
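One way to request the upgrade for a single destination is the h2UpgradePolicy field of a DestinationRule. A minimal sketch with hypothetical resource, namespace, and host names; it assumes the destination service can actually speak HTTP/2.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: example-h2-upgrade        # hypothetical name
  namespace: example-app          # hypothetical namespace
spec:
  host: example-svc.example-app.svc.cluster.local   # hypothetical service
  trafficPolicy:
    connectionPool:
      http:
        h2UpgradePolicy: UPGRADE  # upgrade HTTP/1.1 connections to HTTP/2 for this host
```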

Tune proxy concurrency

Benefit:

By default the sidecar will create 2 threads. This may leave capacity underutilized.

Notes:

Available options are 0 (to leverage all cores based on limits and requests) or a predefined number based on tuning. ref
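A sketch of setting the proxy concurrency mesh-wide, assuming an IstioOperator-based install. A single workload can override the mesh default through the proxy.istio.io/config pod annotation.

```yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    defaultConfig:
      concurrency: 0   # 0 lets the proxy size its worker threads from available cores
```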

Leverage an Envoy Filter to Balance Connections over Worker Threads

Benefit:

Evenly distributes work across threads.

Notes:

Whether this is required depends on the number of worker threads in use.

...
spec:
  configPatches:
  - applyTo: LISTENER
    match:
      context: GATEWAY
      listener:
        portNumber: 8443
    patch:
      operation: MERGE
      value:
        connection_balance_config:
          exact_balance: {}
  workloadSelector:
    labels:
      app: ingress
...

Leverage Sidecar Resource to Minimize Config and Enforce mTLS

Benefit:

Reducing the configuration available to Envoy aids both security and performance.

Notes:

In Gloo Mesh this can be done using ServiceIsolation in WorkspaceSettings.
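With plain Istio resources, a comparable effect can be sketched with a namespace-wide Sidecar that limits the configuration pushed to each proxy, paired with a PeerAuthentication that requires mTLS. The namespace name is hypothetical, and the egress host list is an assumption that must be extended to cover every dependency the workloads really call.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  name: default
  namespace: example-app     # hypothetical namespace
spec:
  egress:
  - hosts:
    - "./*"                  # services in the same namespace
    - "istio-system/*"       # control plane and gateways
---
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: example-app     # hypothetical namespace
spec:
  mtls:
    mode: STRICT             # reject plaintext traffic to workloads in this namespace
```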
