This is a rough outline of how we successfully did an in-place control + data plane upgrade from Istio 1.4.7 -> 1.5.4 via the official Helm charts.
The upgrade was
- applied via scripting/automation
- on a mesh using
  - mTLS
  - Istio RBAC via `AuthorizationPolicy`
  - telemetry v1
  - tracing enabled, but with Jaeger not deployed via the istio chart
  - an istio ingress gateway plus a secondary istio ingress gateway
- done with active traffic flowing through, with no observed increase in error rates
This ignores anything specifically mentioned in the upgrade notes. Gotchas we hit that were not covered there at the time:
- Bug in RBAC backward compatibility with 1.4, present in 1.5.0 -> 1.5.2 and fixed in 1.5.3
- Issue with the visibility of `ServiceEntry`s being scoped using the `Sidecar` resource - istio/istio#24251, subsequently added to the upgrade notes
- All traffic ports are now captured by default; this caused our non-mTLS metrics ports to start enforcing mTLS, which they previously did not do on 1.4.7
  - Fix: exclude the metrics ports via sidecar annotations, e.g. `traffic.sidecar.istio.io/excludeInboundPorts: "9080, 15090"` (see the sketch below)
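For illustration, a minimal sketch of where that annotation lives on an injected workload. The Deployment name, labels and image are hypothetical; only the annotation key and our port numbers come from the fix above:

```yaml
# Illustrative only - a hypothetical Deployment showing annotation placement.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-service
spec:
  selector:
    matchLabels:
      app: example-service
  template:
    metadata:
      labels:
        app: example-service
      annotations:
        # Keep plaintext scraping of the metrics ports working by excluding
        # them from sidecar (mTLS) traffic capture.
        traffic.sidecar.istio.io/excludeInboundPorts: "9080, 15090"
    spec:
      containers:
        - name: example-service
          image: example-service:latest
          ports:
            - containerPort: 9080
```

Note that the annotation goes on the pod template, not on the `Deployment` metadata itself.

The pre-upgrade step we ran to deal with the Galley-managed validating webhook: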
```bash
#!/usr/bin/env bash
# In 1.4 Galley manages the webhook configuration; in 1.5 Helm manages it and it is
# patched by Galley dynamically without `ownerReferences`, so we can detect whether
# we have upgraded Galley already.
if kubectl get validatingwebhookconfiguration/istio-galley -o yaml | grep -q ownerReferences; then
  echo "Detected 1.4 installation - preparing Helm upgrade to 1.5.x by deleting galley-managed webhook..."

  # Disable webhook reconciliation so we can delete the webhook
  kubectl get deployment/istio-galley -n istio-system -o yaml | \
    sed 's/enable-reconcileWebhookConfiguration=true/enable-reconcileWebhookConfiguration=false/' | \
    kubectl apply -f -

  # Wait for Galley to come back up
  kubectl rollout status deployment/istio-galley -n istio-system --timeout 60s

  # Delete the webhook (ValidatingWebhookConfiguration is cluster-scoped)
  kubectl delete validatingwebhookconfiguration/istio-galley

  # Now we can proceed to `helm upgrade` to 1.5, which will recreate the webhook
fi
```
The upgrade itself then ran roughly as follows (not to be taken literally - this is pseudo-script):
```bash
helm upgrade --install --wait --atomic --cleanup-on-fail istio-init istio-init-1.5.4.tgz

# scripting to wait for the istio-init jobs to complete goes here (sketched below)

helm upgrade --install --wait --atomic --cleanup-on-fail istio istio-1.5.4.tgz

# scripting to bounce `Deployment`s for injected services goes here (sketched below)
```
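As a rough sketch of those two placeholder steps - assuming the charts install into the default `istio-system` namespace, that injected workloads live in namespaces labelled `istio-injection=enabled`, and with arbitrary timeouts - something like this could work:

```bash
#!/usr/bin/env bash
# Illustrative only - not the exact automation we ran.

# Wait for the istio-init CRD jobs to complete before upgrading the main chart
kubectl -n istio-system wait --for=condition=complete job --all --timeout=300s

# ... `helm upgrade` of the main istio chart happens in between ...

# Bounce injected workloads so their sidecars are re-created on the 1.5.4 proxy
for ns in $(kubectl get namespaces -l istio-injection=enabled -o name | cut -d/ -f2); do
  for deploy in $(kubectl -n "$ns" get deployments -o name); do
    kubectl -n "$ns" rollout restart "$deploy"
    kubectl -n "$ns" rollout status "$deploy" --timeout 300s
  done
done
```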
We noticed issues with ingress gateways coming up during the control plane upgrade. There appears to have been some kind of race condition when starting new 1.5.4 `ingressgateway` instances while parts of the 1.4.7 control plane were still running - we suspect a problem with the new `ingressgateway` talking to the old-version Pilot.
Symptom
- lots of weird errors in the new-version `ingressgateway` logs about invalid configuration being received from Pilot, relating to tracing
- a subset of the new-version ingress gateways would not become ready, which could cause the `helm upgrade --wait` to get stuck
Fix
- delete the pods that fail to become ready (manual intervention in our case, although technically possible to automate - see the sketch below)
- the replacement pods that were automatically created always came up ready, and the `helm upgrade` then ran to completion
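A rough sketch of how that manual intervention could be automated - we did not actually run this, and it assumes the default gateway carries the standard `app=istio-ingressgateway` label in `istio-system` (a secondary gateway would need its own label selector):

```bash
#!/usr/bin/env bash
# Illustrative only: delete ingress gateway pods that are running but not Ready,
# so they are re-created and can connect cleanly to the upgraded control plane.
kubectl -n istio-system get pods -l app=istio-ingressgateway \
  -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.containerStatuses[0].ready}{"\n"}{end}' | \
  awk '$2 != "true" { print $1 }' | \
  xargs -r kubectl -n istio-system delete pod
```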
@chadlwilson
I actually did have the ingress issue - I saw their readiness probes returning 503s and deleted the pods pretty early in the upgrade process. The `helm upgrade` failed with:
```
Error: UPGRADE FAILED: cannot patch "istioproxy" with kind attributemanifest: Timeout: request did not complete within requested timeout 30s &&
cannot patch "kubernetes" with kind attributemanifest: Timeout: request did not complete within requested timeout 30s &&
cannot patch "stdio" with kind handler: Timeout: request did not complete within requested timeout 30s &&
cannot patch "prometheus" with kind handler: context deadline exceeded &&
cannot patch "kubernetesenv" with kind handler: Timeout: request did not complete within requested timeout 30s &&
cannot patch "accesslog" with kind instance: Timeout: request did not complete within requested timeout 30s &&
cannot patch "tcpaccesslog" with kind instance: Timeout: request did not complete within requested timeout 30s &&
cannot patch "requestcount" with kind instance: Timeout: request did not complete within requested timeout 30s &&
cannot patch "requestduration" with kind instance: Timeout: request did not complete within requested timeout 30s &&
cannot patch "requestsize" with kind instance: Timeout: request did not complete within requested timeout 30s &&
cannot patch "responsesize" with kind instance: Timeout: request did not complete within requested timeout 30s &&
cannot patch "tcpbytesent" with kind instance: Timeout: request did not complete within requested timeout 30s &&
cannot patch "tcpbytereceived" with kind instance: Timeout: request did not complete within requested timeout 30s &&
cannot patch "tcpconnectionsopened" with kind instance: Timeout: request did not complete within requested timeout 30s &&
cannot patch "tcpconnectionsclosed" with kind instance: Timeout: request did not complete within requested timeout 30s &&
cannot patch "attributes" with kind instance: Timeout: request did not complete within requested timeout 30s &&
cannot patch "stdio" with kind rule: Timeout: request did not complete within requested timeout 30s &&
cannot patch "stdiotcp" with kind rule: Timeout: request did not complete within requested timeout 30s &&
cannot patch "promhttp" with kind rule: Timeout: request did not complete within requested timeout 30s &&
cannot patch "promtcp" with kind rule: Timeout: request did not complete within requested timeout 30s &&
cannot patch "promtcpconnectionopen" with kind rule: Timeout: request did not complete within requested timeout 30s &&
cannot patch "promtcpconnectionclosed" with kind rule: Timeout: request did not complete within requested timeout 30s &&
cannot patch "kubeattrgenrulerule" with kind rule: Timeout: request did not complete within requested timeout 30s &&
cannot patch "tcpkubeattrgenrulerule" with kind rule: Timeout: request did not complete within requested timeout 30s
```
We do run SDS in the 1.4 cluster, and I had this same thought when the gateways came up and I saw 1/1 instead of 2/2 containers on the pod. These are the errors we saw - mind you, this happened after we were completely on the 1.6 canary and 1.4 had already been deleted from the cluster:
```
2020-09-28T17:44:25.996687Z warning envoy config [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:92] StreamSecrets gRPC config stream closed: 16, request authenticate failure
2020-09-28T17:44:28.115769Z info sds resource:default new connection
2020-09-28T17:44:28.115925Z info sds Skipping waiting for ingress gateway secret
2020-09-28T17:44:28.426937Z error citadelclient Failed to create certificate: rpc error: code = Unauthenticated desc = request authenticate failure
2020-09-28T17:44:28.426986Z error cache resource:default request:4efd50bb-5705-44a2-81e0-4bd71ce0cd13 CSR hit non-retryable error (HTTP code: 0). Error: rpc error: code = Unauthenticated desc = request authenticate failure
2020-09-28T17:44:28.427020Z error cache resource:default failed to generate secret for proxy: rpc error: code = Unauthenticated desc = request authenticate failure
2020-09-28T17:44:28.427040Z error sds resource:default Close connection. Failed to get secret for proxy "router~172.16.10.235~istio-internal-ingressgateway-5b69f8cb69-xvmd8.istio-system~istio-system.svc.cluster.local" from secret cache: rpc error: code = Unauthenticated desc = request authenticate failure
2020-09-28T17:44:28.427155Z info sds resource:default connection is terminated: rpc error: code = Canceled desc = context canceled
2020-09-28T17:44:28.427409Z warning envoy config [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:92] StreamSecrets gRPC config stream closed: 16, request authenticate failure
```
Same here - upgrades were smooth throughout 1.3 and 1.4, and we're very comfortable with Helm and use it to template a lot of our deployments. I appreciate the conversation and we'll keep looking for a way forward, but we're getting close to just setting up a cluster next door with 1.8 installed and migrating the workload to it. Good luck on your 1.7 upgrade.