Removing a cluster from the mesh

While running a multi-cluster Istio service mesh can increase capacity and reliability, it also introduces new operational concerns. Removing a cluster from the mesh, whether temporarily or permanently, requires special consideration.

Simple case

The easiest way to disconnect the workloads in one cluster of your mesh from those in another is to delete the remote secret that allows the control plane to access the remote cluster's API server.

There are a few downsides to this:

  1. Deleting the remote secret will drop the endpoints, but it will not close open connections. You will need to verify that these connections have fully drained before considering the cluster "out of rotation" (see Monitoring active connections below).
  2. The cluster's ability to receive new connections immediately drops to zero. If load suddenly shifts elsewhere, there could be service degradation or other unpredictable consequences.

Deleting the remote secret is fine when experimenting, but for production clusters it should be the last step when removing a cluster from rotation.

Deleting the remote secret

First, find and delete the remote secret.

$ kubectl --context "${CTX_CLUSTER1}" -n istio-system get secrets
NAME                                               TYPE                                  DATA   AGE
...
istio-remote-secret-cluster-2                      Opaque                                1      16m
...
$ kubectl --context "${CTX_CLUSTER1}" -n istio-system delete secret istio-remote-secret-cluster-2

After doing this, you should no longer see endpoints from cluster-2 on proxies in cluster-1. You can verify the endpoints using istioctl, and compare them with the Pod IPs in the remote cluster as given by kubectl.

$ istioctl --context $CTX_CLUSTER1 \
    proxy-config ep \
    $(kubectl --context $CTX_CLUSTER1 -n sample get po -lapp=sleep -ojsonpath='{.items[0].metadata.name}').sample

Verify that the output doesn't contain the Pod IPs from the remote cluster:

$ kubectl --context $CTX_CLUSTER2 get pods -n sample -owide
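
To automate the comparison, you can grep the endpoint list for each Pod IP in the remote cluster. This is a sketch assuming the sleep and helloworld sample apps from the multi-cluster verification step; no output means no stale endpoints remain:

$ SLEEP_POD=$(kubectl --context $CTX_CLUSTER1 -n sample get po -lapp=sleep -ojsonpath='{.items[0].metadata.name}')
$ for ip in $(kubectl --context $CTX_CLUSTER2 -n sample get pods -o jsonpath='{.items[*].status.podIP}'); do
    istioctl --context $CTX_CLUSTER1 proxy-config ep "$SLEEP_POD.sample" | grep "$ip"
  done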

Production case

In a production environment, more care must be taken when removing a cluster. Rather than immediately moving all traffic to the remaining cluster(s), we should shift it over gradually. This can be done with a simple traffic-shifting rule using the transparent label topology.istio.io/cluster.

Removing traffic to cluster-2

The following rule is based on the apps in the multi-cluster verification step. We can shift most of our traffic over, and eventually change the weights to 100 and 0 for cluster-1 and cluster-2, respectively.

NOTE: The rule will need to be adjusted for, and applied to, every cluster besides the one you are removing.

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: helloworld
spec:
  hosts:
  - helloworld
  http:
  - route:
    - destination:
        host: helloworld
        subset: cluster-1
      weight: 80
    - destination:
        host: helloworld
        subset: cluster-2
      weight: 20
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: helloworld
spec:
  host: helloworld
  subsets:
  - name: cluster-1
    labels:
      topology.istio.io/cluster: cluster-1
  - name: cluster-2
    labels:
      topology.istio.io/cluster: cluster-2
---
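
Assuming the manifest above is saved as helloworld-shift.yaml (a filename chosen for illustration) and the sample apps live in the sample namespace, apply it to each remaining cluster:

$ kubectl --context "${CTX_CLUSTER1}" -n sample apply -f helloworld-shift.yaml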

At each step of the traffic shift, monitor metrics and alerts in case the new load on cluster-1 causes issues. Once you've confirmed that things are stable, advance the shifting percentage until cluster-2 has a weight of 0.
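
If you scrape Istio's standard metrics, a Prometheus query along these lines can show how requests are splitting across clusters (a sketch; the destination_cluster label assumes Istio's default multi-cluster telemetry, and the hostname assumes the helloworld sample):

sum(rate(istio_requests_total{destination_service="helloworld.sample.svc.cluster.local"}[5m])) by (destination_cluster)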

Removing traffic from cluster-2

TODO: MCS (the multi-cluster services API) will make this much better

There are two options for stopping traffic from leaving the cluster:

  1. Using traffic-shifting rules similar to those above.
  2. Changing the serviceSettings in mesh config.

While traffic-shifting rules must be applied per service, they're a bit safer because they won't suddenly shift all traffic originating within the cluster to endpoints within the cluster.

A benefit of the mesh config approach is that it can be applied to all services in the cluster at once, using the rule below:

serviceSettings:
- settings:
    clusterLocal: true
  hosts:
  - "*"

The hosts field accepts wildcards, allowing you to enforce cluster-local rules at the service, namespace, or cluster scope.
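
For example, to make only the services in the sample namespace cluster-local (the namespace here is illustrative):

serviceSettings:
- settings:
    clusterLocal: true
  hosts:
  - "*.sample.svc.cluster.local"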

Keep in mind that, in a primary-remote setup, this rule will affect every cluster that receives config from the primary cluster where the serviceSettings are applied.

Fully removing the cluster

Once you've confirmed that the remaining clusters are stable, it is safe to remove the remote secret. This step is necessary because new services added during the "maintenance period" may not have traffic rules in place to keep them from contacting the removed cluster.
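
As in the simple case above, find and delete the remote secret for the cluster being removed:

$ kubectl --context "${CTX_CLUSTER1}" -n istio-system delete secret istio-remote-secret-cluster-2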

Monitoring active connections

TODO: find metrics that allow us to ensure cross-cluster connections have actually closed; long-lived TCP connections can stay open even after the secrets/endpoints are removed.
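
One possible approach in the meantime is to inspect Envoy's upstream_cx_active stats on the client-side proxies and confirm that the count of active connections toward the removed cluster's services drops to zero. This is a sketch assuming the sleep and helloworld sample apps; the grep pattern depends on your service names:

$ kubectl --context $CTX_CLUSTER1 -n sample exec deploy/sleep -c istio-proxy -- \
    pilot-agent request GET stats | grep 'helloworld.*upstream_cx_active'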
