@chadlwilson
Last active July 20, 2022 07:46
Notes on Istio 1.4.7 to 1.5.4 upgrade via Helm

Upgrading to 1.5.4 via Helm

This is the rough outline of how we successfully did an in-place control + data plane upgrade from Istio 1.4.7 -> 1.5.4 via the official Helm charts.

Upgrade was

  • applied via scripting/automation
  • on a mesh using
    • mTLS
    • Istio RBAC via AuthorizationPolicy
    • telemetry v1
    • tracing enabled, but Jaeger not deployed via istio chart
    • istio ingress gateway + secondary istio ingress gateway
  • with active traffic flowing through and no observed increase in error rates

1. Review upgrade notes

These notes don't repeat anything specifically mentioned in the official upgrade notes.

2. Pre-upgrade gotchas

  • A bug in RBAC backward compatibility with 1.4 exists in 1.5.0 -> 1.5.2; it was fixed in 1.5.3
  • An issue with the visibility of ServiceEntrys scoped using the Sidecar resource
  • All traffic ports are now captured by default; this caused our non-mTLS metrics ports to start enforcing mTLS, which they previously did not do on 1.4.7
    • Fix: exclude the metrics ports via the sidecar annotation traffic.sidecar.istio.io/excludeInboundPorts: "9080, 15090" (example below)
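
A minimal sketch of applying that exclusion, assuming a hypothetical Deployment my-service in namespace my-namespace (the annotation needs to go on the pod template, not the Deployment itself):

kubectl -n my-namespace patch deployment my-service --type merge -p \
  '{"spec":{"template":{"metadata":{"annotations":{"traffic.sidecar.istio.io/excludeInboundPorts":"9080, 15090"}}}}}'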

3. Pre-upgrade scripting

#!/usr/bin/env bash

# In 1.4 Galley manages the webhook configuration; in 1.5 Helm manages it and it is patched by galley dynamically 
# without `ownerReferences`, so we can detect if we have upgraded Galley already
if kubectl get validatingwebhookconfiguration/istio-galley -n istio-system -o yaml | grep ownerReferences; then
  echo "Detected 1.4 installation - preparing Helm upgrade to 1.5.x by deleting galley-managed webhook..."

  # Disable webhook reconciliation so we can delete the webhook
  kubectl get deployment/istio-galley -n istio-system -o yaml | \
    sed 's/enable-reconcileWebhookConfiguration=true/enable-reconcileWebhookConfiguration=false/' | \
    kubectl apply -f -

  # Wait for Galley to come back up
  kubectl rollout status deployment/istio-galley -n istio-system --timeout 60s

  # Delete the webhook
  kubectl delete validatingwebhookconfiguration/istio-galley -n istio-system

  # Now we can proceed to helm upgrade to 1.5 which will recreate the webhook
fi

4. Istio upgrade

Not to be taken literally - this is pseudo-script...

helm upgrade --install --wait --atomic --cleanup-on-fail istio-init istio-init-1.5.4.tgz

# scripting to wait for jobs to complete goes here (rough sketch below)

helm upgrade --install --wait --atomic --cleanup-on-fail istio istio-1.5.4.tgz

# scripting to bounce `Deployment`s for injected services goes here (rough sketch below)
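
A minimal sketch of those two placeholder steps, assuming the charts install into istio-system and that injected workloads live in hypothetical namespaces team-a and team-b:

# wait for the istio-init CRD jobs to finish before upgrading the main chart
for job in $(kubectl get jobs -n istio-system -o name); do
  kubectl wait --for=condition=complete --timeout=300s -n istio-system "$job"
done

# bounce injected Deployments so pods pick up the new sidecar version
for ns in team-a team-b; do
  kubectl rollout restart deployment -n "$ns"
done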

5. Post-upgrade gotchas

We noticed issues with ingress gateways coming up during the control plane upgrade.

It appears there was some kind of race condition when starting new 1.5.4 ingressgateway instances while parts of the 1.4.7 control plane were still running - perhaps a case of the new ingressgateways talking to the old-version Pilot?

Symptom

  • lots of errors in the new-version ingressgateway logs about invalid configuration received from Pilot, relating to tracing
  • a subset of the new-version ingress gateways would not become ready, which could cause the helm upgrade --wait to get stuck

Fix

  • delete the pods that failed to become ready, as sketched below (manual intervention in our case, although technically possible to automate)
  • the automatically re-created replacement pods always became ready
  • the helm upgrade then ran to completion
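
A rough sketch of that manual intervention, assuming the standard app=istio-ingressgateway label and a hypothetical stuck pod name:

# find gateway pods that are stuck not-ready
kubectl get pods -n istio-system -l app=istio-ingressgateway

# delete a stuck pod so its Deployment re-creates it (pod name is an example)
kubectl delete pod -n istio-system istio-ingressgateway-7d4f8b9c6d-abcde
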
@jlcrow

jlcrow commented Dec 15, 2020

Cool gist - wondering, did you get past 1.5.x yet and into the new model? If so, how did you tackle that? I've tried jumping from 1.4 to 1.6 using the canary option, but ran into issues post-upgrade where the gateways and sidecar proxies didn't have matching certs across all gateways, so traffic was dying at the gateway and never reaching the pods. So I ended up rolling back with helm; just trying to find a way forward since 1.4.x isn't getting any more updates.

@chadlwilson
Author

Hi @jlcrow - sorry for the slow response - forgot about this one.

We have subsequently moved 1.5 -> 1.6 (and are about to do 1.6 -> 1.7) with zero downtime, migrating to installing our control plane by defining the IstioOperator resource in source control and then using istioctl install to actually manage the canary rollout rather than Helm. We used istioctl to do an initial conversion of our Helm resources/values and then manually corrected niggles.

We didn't have issues with the gateways, however it's worth noting that

  • skip-version upgrades are (in general) not supported in Istio without dropping traffic, due to the potential for old proxies to be unable to understand config from newer versions and vice versa
  • the gateway changes (along with v1 istio-telemetry, which we still relied on) were one part of the process that couldn't be canaried, so we were upgrading the ingressgateway pods in-place, driven by the IstioOperator resource and istioctl. Thus we focused a lot on ensuring the transition from Helm-managed components to IstioOperator was seamless, that new gateways could talk to "old" 1.5 proxies, and also that if we needed to roll back the gateways we had a seamless process to re-run the Helm deploy to correct any issues.

Our process essentially was

  1. Start with Helm-managed 1.5 control plane, gateways & data plane
  2. Install the canary 1.6 control plane + in-place upgrade of ingressgateways + telemetry. Gateways are now talking to the 1.6 control plane, but we have mixed data plane versions between gateway -> service proxy. (Check traffic is still flowing.)
  3. Label namespaces with the appropriate canary revision and bounce the data plane service proxies/Deployments. (Everything is now on 1.6 - check traffic is still flowing; rough commands are sketched after this list.)
  4. ... the old 1.5 Helm-managed control plane is now essentially unused/idle. We left it there a few days before blowing it away with manual kubectl deletes.
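
For steps 2-3, a rough command sketch, assuming a hypothetical revision name 1-6-x, an operator spec istio-operator.yaml, and an injected namespace my-namespace:

# install the canary control plane under a revision (the existing control plane keeps running)
istioctl install -f istio-operator.yaml --set revision=1-6-x

# point a namespace at the new revision, then bounce its workloads to pick up the new sidecars
kubectl label namespace my-namespace istio-injection-
kubectl label namespace my-namespace istio.io/rev=1-6-x --overwrite
kubectl rollout restart deployment -n my-namespace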

Your issue sounds like perhaps the new citadel-within-istiod doesn't share a root of trust with the old citadel-managed root certs and is issuing incompatible certs to your new gateways that can't be trusted by old service proxies? Were you installing the new 1.6 control plane into the same istio-system namespace as the old Helm-managed components so that it can see/share the same root certs in istio-ca-secret? Did you happen to notice whether there were changes to the istio-ca-secret post-install?
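
If it helps, one rough way to check (assuming the default self-signed CA setup, where istio-ca-secret in istio-system holds the root under ca-cert.pem) is to compare the root cert fingerprint before and after the canary install:

kubectl get secret istio-ca-secret -n istio-system -o jsonpath='{.data.ca-cert\.pem}' \
  | base64 -d | openssl x509 -noout -fingerprint -subject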

We moved away from Helm because there was no longer a supported Helm-managed deployment process; however, I note that the Istio team seem to have added one back again recently (which is rather frustrating given the difficulty in moving away from it and the mixed messaging around supporting helm install|upgrade). However, I'm not sure there is an entirely Helm-managed process that will get you from 1.4 -> 1.8 without dropping traffic, due to forward/backward compatibility of control plane components. In our case we seek to never drop traffic during production deployments, so that probably meant we had no choice but to move towards the IstioOperator; however, if this isn't a concern for you there are probably more options available and lower engineering effort.

@jlcrow

jlcrow commented Jan 11, 2021

@chadlwilson Thanks so much for getting back to me. Yeah, like you, the Helm abandonment by Istio has led to a lot of grief for us. We're also in a place where we don't have the luxury of downtime in production. The script you created was helpful and I was able to perform a clean helm upgrade in a test cluster from 1.4.9 to 1.5.10. In order to get traffic flowing we also had to remove this env from pilot: PILOT_DISABLE_XDS_MARSHALING_TO_ANY: true. When I attempted this in our staging cluster, I still ran into trouble going from 1.4.9 -> 1.5.10: helm timed out trying to patch a bunch of resources, and after several failed attempts to move on I reverted everything to the previous 1.4.9 state.

In my previous attempt I did try to jump from 1.4 to a canary of 1.6 like you described, matching the 1.4 config. This is the odd piece about that whole situation: same namespace (istio-system), updated all the namespaces with the canary label, bounced the workloads, and everything was beautiful and confirmed running. It wasn't until the next day, after a helm deployment of one of our applications, that traffic across one gateway stopped being able to reach the services due to certificate issues between the gateway and all the services it provided ingress to. Not sure on the istio-ca-secret and whether it changed, but they were in the same namespace. This is my 2nd failed attempt to move forward off of 1.4 in this cluster - lots of fun; apparently we picked the wrong time to start using istio.

@chadlwilson
Author

@jlcrow

helm timed out trying to patch a bunch of resources

Hmm, do you recall which resources the 1.5 upgrade timed out on? Might be able to trigger some memories. Wonder if it was waiting for a Deployment of some sort or another to become ready and it never did (e.g. due to the issues I noted above in ingressgateways).

It wasn't until the next day after a helm deployment of one of our applications that traffic across one gateway stopped being able to talk to the services due to certificate issues between the gateway and all the services it provided ingress to.

This is rather odd; I never saw anything like this. I can't really understand what might cause it, and we haven't had certificate issues that I can recall since we switched to node-agent/SDS, a long time prior to its retirement in 1.6. The mTLS and cert rotation has been the most stable bit for us. Did you use SDS on your 1.4 install? Perhaps it doesn't like co-existing with a 1.6 control plane somehow. Again, skip-version upgrades and peaceful co-existence are a pretty scary and untested endeavour (as I understand it), so I guess anything could break :-/

This is my 2nd failed attempt to move forward off of 1.4 in this cluster, lots of fun, apparently we picked the wrong time to start using istio

Yeah, I hear ya. We've been using Istio since 1.0, and upgrades were pretty smooth through to 1.4 (albeit with keeping on top of the API deprecations where we were using alpha-ish stuff such as RBAC; and our zero-downtime/zero-traffic-loss focus only really started around our 1.3 usage), but 1.5 and then 1.6 have honestly been a nightmare of breaking changes, upgrade bugs & churn in supported deployment tooling (and, at times, weaknesses in the "supported" deployment tooling's upgrade path - e.g. Helm install deprecated/unsupported in 1.5 with no actual migration path to istiod with istioctl being supported).

The payoff/light at the end of the tunnel is the introduction of control plane canarying in 1.6, which should hopefully take a lot of the fear/stress out of the process in production.

We're about to do our 1.7 upgrade, so I hope that is smoother (blocked earlier on needing to get beyond Kubernetes 1.15).

@jlcrow

jlcrow commented Jan 11, 2021

@chadlwilson

Hmm, do you recall which resources the 1.5 upgrade timed out on? Might be able to trigger some memories. Wonder if it was waiting for a Deployment of some sort or another to become ready and it never did (e.g. due to the issues I noted above in ingressgateways).

I actually did have the ingress issue - I saw their readiness probes returning 503s and deleted the pods pretty early in the upgrade process.

Error: UPGRADE FAILED: cannot patch "istioproxy" with kind attributemanifest: Timeout: request did not complete within requested timeout 30s && cannot patch "kubernetes" with kind attributemanifest: Timeout: request did not complete within requested timeout 30s && cannot patch "stdio" with kind handler: Timeout: request did not complete within requested timeout 30s && cannot patch "prometheus" with kind handler: context deadline exceeded && cannot patch "kubernetesenv" with kind handler: Timeout: request did not complete within requested timeout 30s && cannot patch "accesslog" with kind instance: Timeout: request did not complete within requested timeout 30s && cannot patch "tcpaccesslog" with kind instance: Timeout: request did not complete within requested timeout 30s && cannot patch "requestcount" with kind instance: Timeout: request did not complete within requested timeout 30s && cannot patch "requestduration" with kind instance: Timeout: request did not complete within requested timeout 30s && cannot patch "requestsize" with kind instance: Timeout: request did not complete within requested timeout 30s && cannot patch "responsesize" with kind instance: Timeout: request did not complete within requested timeout 30s && cannot patch "tcpbytesent" with kind instance: Timeout: request did not complete within requested timeout 30s && cannot patch "tcpbytereceived" with kind instance: Timeout: request did not complete within requested timeout 30s && cannot patch "tcpconnectionsopened" with kind instance: Timeout: request did not complete within requested timeout 30s && cannot patch "tcpconnectionsclosed" with kind instance: Timeout: request did not complete within requested timeout 30s && cannot patch "attributes" with kind instance: Timeout: request did not complete within requested timeout 30s && cannot patch "stdio" with kind rule: Timeout: request did not complete within requested timeout 30s && cannot patch "stdiotcp" with kind rule: Timeout: request did not complete within requested timeout 30s && cannot patch "promhttp" with kind rule: Timeout: request did not complete within requested timeout 30s && cannot patch "promtcp" with kind rule: Timeout: request did not complete within requested timeout 30s && cannot patch "promtcpconnectionopen" with kind rule: Timeout: request did not complete within requested timeout 30s && cannot patch "promtcpconnectionclosed" with kind rule: Timeout: request did not complete within requested timeout 30s && cannot patch "kubeattrgenrulerule" with kind rule: Timeout: request did not complete within requested timeout 30s && cannot patch "tcpkubeattrgenrulerule" with kind rule: Timeout: request did not complete within requested timeout 30s

This is rather odd; I never saw anything like this. I can't really understand what might cause it, and we haven't had certificate issues that I can recall since we switched to node-agent/SDS, a long time prior to its retirement in 1.6. The mTLS and cert rotation has been the most stable bit for us. Did you use SDS on your 1.4 install? Perhaps it doesn't like co-existing with a 1.6 control plane somehow. Again, skip-version upgrades and peaceful co-existence are a pretty scary and untested endeavour (as I understand it), so I guess anything could break :-/

We do run SDS in the 1.4 cluster, and I had the same thought when the gateways came up and I saw 1/1 instead of 2/2 containers on the pods. These are the errors we saw - mind you, this happened after we were completely on the 1.6 canary and 1.4 had been deleted from the cluster.

2020-09-28T17:44:25.996687Z warning envoy config [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:92] StreamSecrets gRPC config stream closed: 16, request authenticate failure
2020-09-28T17:44:28.115769Z info sds resource:default new connection
2020-09-28T17:44:28.115925Z info sds Skipping waiting for ingress gateway secret
2020-09-28T17:44:28.426937Z error citadelclient Failed to create certificate: rpc error: code = Unauthenticated desc = request authenticate failure
2020-09-28T17:44:28.426986Z error cache resource:default request:4efd50bb-5705-44a2-81e0-4bd71ce0cd13 CSR hit non-retryable error (HTTP code: 0). Error: rpc error: code = Unauthenticated desc = request authenticate failure
2020-09-28T17:44:28.427020Z error cache resource:default failed to generate secret for proxy: rpc error: code = Unauthenticated desc = request authenticate failure
2020-09-28T17:44:28.427040Z error sds resource:default Close connection. Failed to get secret for proxy "router~172.16.10.235~istio-internal-ingressgateway-5b69f8cb69-xvmd8.istio-system~istio-system.svc.cluster.local" from secret cache: rpc error: code = Unauthenticated desc = request authenticate failure
2020-09-28T17:44:28.427155Z info sds resource:default connection is terminated: rpc error: code = Canceled desc = context canceled
2020-09-28T17:44:28.427409Z warning envoy config [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:92] StreamSecrets gRPC config stream closed: 16, request authenticate failure

Same here - upgrades were smooth throughout 1.3 and 1.4, and we're very comfortable with helm and use it to template a lot of our deployments. I appreciate the conversation; we'll persist on finding a way forward, but we're getting close to just setting up a cluster next door with 1.8 installed and migrating the workload to it. Good luck on your 1.7 upgrade.

@chadwilson

The person you're wanting to tag is chadlwilson (notice the l in the middle). I'm chadwilson :)

@jlcrow

jlcrow commented Jan 11, 2021

The person you're wanting to tag is chadlwilson (notice the l in the middle). I'm chadwilson :)

lol, missed that, sorry to bother you :)

@chadlwilson
Author

Error: UPGRADE FAILED: cannot patch "istioproxy" with kind attributemanifest: Timeout: request did not complete within requested timeout 30s && cannot patch "kubernetes" with kind attributemanifest

These errors look like issues with the galley-managed ValidatingWebhookConfiguration - it implies the Kube API is trying to make calls to validate custom resources. There are issues with this between 1.4 and 1.5, which is why my scripting above temporarily disables galley reconciliation and deletes the webhook - did you make sure the webhook was deleted prior to the upgrade?
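
As a rough check (the webhook should no longer exist at the point you run the helm upgrade):

kubectl get validatingwebhookconfiguration istio-galley --ignore-not-found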

While my approach of deleting the webhook and disabling reconciliation has some risks, it came from this discussion and then my own trial and error.

Failed to get secret for proxy "router~172.16.10.235~istio-internal-ingressgateway-5b69f8cb69-xvmd8.istio-system~istio-system.svc.cluster.local" from secret cache

Yeah, this does look like an SDS error; my understanding with 1.6 is that SDS is built into the proxy itself rather than relying on the separate node-agent, so perhaps there is some trust issue between the ingressgateway and the control plane. Not really sure about that - but if your acceptable upgrade path requires you not to drop traffic, perhaps it's better to focus on getting the 1.4 -> 1.5 upgrade working reliably rather than put yourself on the edge with an attempt at a 1.4 -> 1.6 upgrade without having the mesh first reconciled on 1.5.

@jlcrow

jlcrow commented Feb 8, 2021

@chadlwilson Good news - we were finally able to upgrade our production istio environment from 1.4.9 to 1.6.14 with zero downtime. Basically we went back to the canary route, with istioctl and a control plane YAML that matched our existing gateways, and using verify-install I was able to identify a mistake in our scripts. The 1.4.x helm removal script we were using had a mistake in it that removed the service account from one of the gateways, which caused the SDS issues in our first attempt. Appreciate all the help and feedback - happy to be past all this.

@chadlwilson
Author

chadlwilson commented Feb 9, 2021

Great news @jlcrow - that's fantastic! And yes, the removal of an older Helm-managed install is really sensitive - lots of overlapping resources and it's easy to make a mistake :-(


ghost commented Jul 20, 2022

# Delete the webhook
kubectl delete validatingwebhookconfiguration/istio-galley -n istio-system

There's no need for the -n namespace flag there, actually; kubectl warns:
warning: deleting cluster-scoped resources, not scoped to the provided namespace
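
i.e. since ValidatingWebhookConfiguration is cluster-scoped, this is enough:

kubectl delete validatingwebhookconfiguration istio-galley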

@chadlwilson
Author

# Delete the webhook
There's no need for the -n namespace flag there, actually; kubectl warns: warning: deleting cluster-scoped resources, not scoped to the provided namespace

True :)
