Skip to content

Instantly share code, notes, and snippets.

@chadlwilson
Last active July 20, 2022 07:46
Show Gist options
  • Save chadlwilson/c4653f6a5a1da43ccacf46589d7bf482 to your computer and use it in GitHub Desktop.
Save chadlwilson/c4653f6a5a1da43ccacf46589d7bf482 to your computer and use it in GitHub Desktop.
Notes on Istio 1.4.7 to 1.5.4 upgrade via HELM

Upgrading to 1.5.4 via Helm

This is the rough outline of how we successfully did an in-place control + data plane upgrade from Istio 1.4.7 -> 1.5.4 via the official Helm charts.

Upgrade was

  • applied via scripting/automation
  • on a mesh using
    • mTLS
    • Istio RBAC via AuthorizationPolicy
    • telemetry v1
    • tracing enabled, but Jaeger not deployed via istio chart
    • istio ingress gateway + secondary istio ingress gateway
    • active traffic flowing through without any observed increase in error rates

1. Review upgrade nodes

This ignores anything specifically mentioned in the upgrade notes.

2. Pre-upgrade gotchas

  • Bug in RBAC backward compatibility with 1.4 in 1.5.0 -> 1.5.2, fixed in 1.5.3
  • issue with visibility of ServiceEntrys being scoped using Sidecar resource
  • All traffic ports are now captured by default; this caused our non-mTLS metrics ports to start enforcing mTLS which they previously did not do on 1.4.7
    • Fix: exclude metrics ports via sidecar annotations traffic.sidecar.istio.io/excludeInboundPorts: "9080, 15090"

3. Pre-upgrade scripting

#!/usr/bin/env bash

# In 1.4 Galley manages the webhook configuration; in 1.5 Helm manages it and it is patched by galley dynamically 
# without `ownerReferences`, so we can detect if we have upgraded Galley already
if kubectl get validatingwebhookconfiguration/istio-galley -n istio-system -o yaml | grep ownerReferences; then
  echo "Detected 1.4 installation - preparing Helm upgrade to 1.5.x by deleting galley-managed webhook..."

  # Disable webhook reconciliation so we can delete the webhook
  kubectl get deployment/istio-galley -n istio-system -o yaml | \
    sed 's/enable-reconcileWebhookConfiguration=true/enable-reconcileWebhookConfiguration=false/' | \
    kubectl apply -f -

  # Wait for Galley to come back up
  kubectl rollout status deployment/istio-galley -n istio-system --timeout 60s

  # Delete the webhook
  kubectl delete validatingwebhookconfiguration/istio-galley -n istio-system

  # Now we can proceed to helm upgrade to 1.5 which will recreate the webhook
fi

3. Istio Upgrade

Not to be taken literally - this is pseudo-script...

helm upgrade --install --wait --atomic --cleanup-on-fail istio-init istio-init-1.5.4.tgz

# scripting to wait for jobs to complete goes here

helm upgrade --install --wait --atomic --cleanup-on-fail istio istio-1.5.4.tgz

# scripting to bounce `Deployment`s for injected services goes here

3. Post-upgrade gotchas

We noticed issues with ingress gateways coming up during the control plane upgrade.

It appears there was some kind of race condition when starting new 1.5.4 ingressgateway instances while parts of the 1.4.7 control plane were still running. Suspect new ingressgateway talking to old verison pilot problem perhaps?

Symptom

  • lots of weird errors about invalid configuration being received from pilot relating to tracing on the new version ingressgateway logs
  • a subset of new version ingress gateways would not become ready which could cause the helm upgrade --wait to get stuck

Fix

  • delete the pods that fail to become ready (manual intervention in our case, although technically possible to automate)
  • the new pods automatically re-created always came ready
  • helm upgrade will go to completion
@jlcrow
Copy link

jlcrow commented Jan 11, 2021

The person you're wanting to tag is chadlwilson (notice the l in the middle). I'm chadwilson :)

lol, missed that, sorry to bother you :)

@chadlwilson
Copy link
Author

Error: UPGRADE FAILED: cannot patch "istioproxy" with kind attributemanifest: Timeout: request did not complete within requested timeout 30s && cannot patch "kubernetes" with kind attributemanifest

These errors look like issues with the galley-managed ValidatingWebhookConfiguration- it implies the Kube API is trying to make calls to validate custom resources. There are issues with this between 1.4 and 1.5 which is why my scripting above temporarily disables galley reconciliation and deletes the webhook - did you make sure the webhook was deleted prior to the upgrade?

While my approach to deleting the webhook and disabling reconcilation has some risks, it came from this discussion and then my own trial and error.

Failed to get secret for proxy "router~172.16.10.235~istio-internal-ingressgateway-5b69f8cb69-xvmd8.istio-system~istio-system.svc.cluster.local" from secret cache

Yeah, this does look like an SDS error; my understanding with 1.6 is that SDS is built into the proxy itself rather than relying on the separate node-agent, so perhaps there is some trust issue between the ingressgateway and the control plane. Not really sure about that - but if your acceptable upgrade path requires you to not drop traffic perhaps it's better to focus on getting the 1.4 -> 1.5 upgrade working reliably rather than put yourself on the edge with an attempt at a 1.4 -> 1.6 upgrade without having the Mesh first reconciled on 1.5.

@jlcrow
Copy link

jlcrow commented Feb 8, 2021

@chadlwilson Good news, we were finally able to upgrade our production istio environment from 1.4.9 to 1.6.14, with zero downtime. Basically went back to the canary route, with istioctl and a control plane yaml that matched our existing gateways, I was able to identify a mistake in our scripts using verify-install. The 1.4.x helm removal script we were using had a mistake in it that removed the service account from one of the gateways, which caused the SDS issues in our first attempt. Appreciate all the help and feedback, happy to be past all this.

@chadlwilson
Copy link
Author

chadlwilson commented Feb 9, 2021

Great news @jlcrow - that's fantastic! And yes, the removal of an older Helmed install is really sensitive - lots of overlapping resources and easy to make a mistake :-(

Copy link

ghost commented Jul 20, 2022

# Delete the webhook
kubectl delete validatingwebhookconfiguration/istio-galley -n istio-system

no need namespace, actually.
warning: deleting cluster-scoped resources, not scoped to the provided namespace

@chadlwilson
Copy link
Author

# Delete the webhook
no need namespace, actually. warning: deleting cluster-scoped resources, not scoped to the provided namespace

True :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment