@issacg
Last active June 12, 2023 12:23
runai-logs
[LOG] initializing Kubernetes client...
[LOG] successfully initialized Kubernetes client
cleaning up previous deployment if it exists...
waiting for all resources to be deleted...
all resources were successfully deleted
deploying runai diagnostics tool...
[TEST] running external cluster tests...
--------------------------------------------------
[TEST] GPU Nodes
--------------------------------------------------
[LOG] please verify that the list above includes all GPU nodes in the cluster
[LOG] if you suspect GPU nodes are missing from the list above, gpu-feature-discovery might be malfunctioning
[PASS]
[TEST] Nvidia device plugin
--------------------------------------------------
[PASS]
[TEST] DCGM Exporter
--------------------------------------------------
[PASS]
[TEST] Nginx Ingress Controller
--------------------------------------------------
[WARNING] nginx ingress controller is installed in the cluster
[WARNING]
[TEST] Cluster Version
--------------------------------------------------
[LOG] Kubernetes Cluster Version: v1.25.8-gke.1000
[PASS]
[TEST] Storage Classes
--------------------------------------------------
[LOG] StorageClasses in cluster:
[LOG] premium-rwo
[LOG] standard
[LOG] standard-rwo
[PASS]
[TEST] Prometheus check
--------------------------------------------------
[WARNING] prometheus is installed in the cluster
[WARNING]
[TEST] Node Feature Discovery
--------------------------------------------------
[WARNING] node-feature-discovery is installed in the cluster
[WARNING]
[TEST] GPU Feature Discovery
--------------------------------------------------
[PASS]
[TEST] List Pods
--------------------------------------------------
[LOG] Namespace/Name/Phase
[LOG] cert-manager/cert-manager-544cd78564-khrft/Running
[LOG] cert-manager/cert-manager-cainjector-676b44b449-8c5pp/Running
[LOG] cert-manager/cert-manager-webhook-5c64c6c6f9-c9brc/Running
[LOG] default/tensorflow-benchmarks-launcher-w2hk2/Succeeded
[LOG] gpu-operator/gpu-operator-1686569784-node-feature-discovery-master-76b4z5gq6/Running
[LOG] gpu-operator/gpu-operator-1686569784-node-feature-discovery-worker-8kwz9/Running
[LOG] gpu-operator/gpu-operator-1686569784-node-feature-discovery-worker-mftp9/Running
[LOG] gpu-operator/gpu-operator-6495fb4657-kpggj/Running
[LOG] kube-system/event-exporter-gke-755c4b4d97-rnbfb/Running
[LOG] kube-system/fluentbit-gke-2lzz6/Running
[LOG] kube-system/fluentbit-gke-74krn/Running
[LOG] kube-system/fluentbit-gke-7zg6s/Running
[LOG] kube-system/fluentbit-gke-9fszk/Running
[LOG] kube-system/fluentbit-gke-q89lw/Running
[LOG] kube-system/gke-metadata-server-44v9w/Running
[LOG] kube-system/gke-metadata-server-5sb6x/Running
[LOG] kube-system/gke-metadata-server-kxz4q/Running
[LOG] kube-system/gke-metadata-server-vcrlh/Running
[LOG] kube-system/gke-metadata-server-zx6x6/Running
[LOG] kube-system/gke-metrics-agent-2d2gz/Running
[LOG] kube-system/gke-metrics-agent-4lv8t/Running
[LOG] kube-system/gke-metrics-agent-mpcwf/Running
[LOG] kube-system/gke-metrics-agent-r2vhh/Running
[LOG] kube-system/gke-metrics-agent-zh649/Running
[LOG] kube-system/kube-dns-5b5dfcd97b-bxb8v/Running
[LOG] kube-system/kube-dns-5b5dfcd97b-lbhmv/Running
[LOG] kube-system/kube-dns-autoscaler-5f56f8997c-lcwpw/Running
[LOG] kube-system/kube-proxy-gke-runai-mvp-runai-gpu-pool-a3994fbf-5xvz/Running
[LOG] kube-system/kube-proxy-gke-runai-mvp-runai-gpu-pool-cbec8e56-fzdb/Running
[LOG] kube-system/kube-proxy-gke-runai-mvp-runai-pool-4a223292-5mql/Running
[LOG] kube-system/kube-proxy-gke-runai-mvp-runai-pool-a6becd78-j7qs/Running
[LOG] kube-system/kube-proxy-gke-runai-mvp-system-b41ae89e-9ibp/Running
[LOG] kube-system/l7-default-backend-8cdcff48c-6fj8s/Running
[LOG] kube-system/metrics-server-v0.5.2-855ff55569-w8p6z/Running
[LOG] kube-system/netd-26hrz/Running
[LOG] kube-system/netd-bjr2g/Running
[LOG] kube-system/netd-jkcrc/Running
[LOG] kube-system/netd-vdvcq/Running
[LOG] kube-system/netd-zmx6v/Running
[LOG] kube-system/nvidia-driver-installer-2phgd/Running
[LOG] kube-system/nvidia-driver-installer-qfxfd/Running
[LOG] kube-system/nvidia-gpu-device-plugin-medium-d79kl/Running
[LOG] kube-system/nvidia-gpu-device-plugin-medium-fj8fv/Running
[LOG] kube-system/pdcsi-node-2dnb2/Running
[LOG] kube-system/pdcsi-node-9ddlb/Running
[LOG] kube-system/pdcsi-node-qnt6w/Running
[LOG] kube-system/pdcsi-node-s654d/Running
[LOG] kube-system/pdcsi-node-vpkxh/Running
[LOG] monitoring/alertmanager-prometheus-kube-prometheus-alertmanager-0/Running
[LOG] monitoring/prometheus-kube-prometheus-operator-6ddd77f99b-db29x/Running
[LOG] monitoring/prometheus-kube-state-metrics-7b7455ff5d-h8gzr/Running
[LOG] monitoring/prometheus-prometheus-kube-prometheus-prometheus-0/Running
[LOG] monitoring/prometheus-prometheus-node-exporter-6ml69/Running
[LOG] monitoring/prometheus-prometheus-node-exporter-9pmb5/Running
[LOG] monitoring/prometheus-prometheus-node-exporter-l8dkz/Running
[LOG] monitoring/prometheus-prometheus-node-exporter-mzvwq/Running
[LOG] monitoring/prometheus-prometheus-node-exporter-qc65k/Running
[LOG] mpi-operator/mpi-operator-76fbc4d578-7jfmv/Running
[LOG] nginx-ingress/nginx-ingress-ingress-nginx-controller-867dc6b6c5-4h2nz/Running
[LOG] runai-preinstall-diagnostics/runai-preinstall-diagnostics-bc48w/Pending
[LOG] runai-preinstall-diagnostics/runai-preinstall-diagnostics-qgss6/Pending
[PASS]
[TEST] running internal cluster tests using image gcr.io/run-ai-lab/preinstall-diagnostics:v2.4.0...
--------------------------------------------------
[LOG] not all pods are ready [0/2], retrying in 5 seconds
[LOG] not all pods are ready [0/2], retrying in 5 seconds
[LOG] not all pods are ready [0/2], retrying in 5 seconds
[LOG] all daemonset pods are available
[LOG] logs for [gke-runai-mvp-runai-pool-4a223292-5mql/runai-preinstall-diagnostics-bc48w] are not ready yet, retrying in 5 seconds...
[LOG] logs for [gke-runai-mvp-runai-pool-4a223292-5mql/runai-preinstall-diagnostics-bc48w] are not ready yet, retrying in 5 seconds...
========================== LOGS FROM NODE gke-runai-mvp-runai-pool-4a223292-5mql ==========================
Logs for [gke-runai-mvp-runai-pool-4a223292-5mql/runai-preinstall-diagnostics-bc48w]:
[LOG] initializing Kubernetes client...
[LOG] successfully initialized Kubernetes client
[TEST] Run:AI service access: https://app.run.ai
--------------------------------------------------
[LOG] Run:AI service is accessible from within the cluster
[PASS]
[TEST] DNS Servers access
--------------------------------------------------
[LOG] Address for [app.run.ai] is [104.21.95.156], resolved by [1.1.1.1:53]
[LOG] Address for [app.run.ai] is [172.67.145.177], resolved by [8.8.8.8:53]
[PASS]
[TEST] Dynu DNS service access: https://api.dynu.com
--------------------------------------------------
[LOG] Dynu DNS service is accessible from within the cluster
[PASS]
[TEST] Connectivity to runai container registry: https://gcr.io/run-ai-prod
--------------------------------------------------
[LOG] Run:AI container registry is accessible
[PASS]
[TEST] DNS Resolver
--------------------------------------------------
[WARNING] Backend FQDN was not provided using the --domain flag, skipping test
[SKIP]
[PASS]
[TEST] Print resolv.conf
--------------------------------------------------
[LOG] search runai-preinstall-diagnostics.svc.cluster.local svc.cluster.local cluster.local me-west1-b.c.runai-prod.internal c.runai-prod.internal google.internal
nameserver 10.0.32.10
options ndots:5
[PASS]
[TEST] Run:AI Helm Repository
--------------------------------------------------
[PASS]
[TEST] DockerHub
--------------------------------------------------
[PASS]
[TEST] Quay.io
--------------------------------------------------
[PASS]
[TEST] Run:AI Prometheus
--------------------------------------------------
[PASS]
[TEST] Run:AI Auth Provider
--------------------------------------------------
[PASS]
[TEST] OS Information
--------------------------------------------------
[LOG] Os Info: Linux runai-preinstall-diagnostics-bc48w 5.15.89+ #1 SMP Sat Mar 18 09:27:02 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
[PASS]
[TEST] Node connectivity check
--------------------------------------------------
[LOG] not all pods are ready [0/2], retrying in 5 seconds
[LOG] not all pods are ready [0/2], retrying in 5 seconds
[LOG] all daemonset pods are available
[LOG] attempting to ping pod [gke-runai-mvp-runai-pool-4a223292-5mql/runai-preinstall-diagnostics-bc48w]...
[LOG] [gke-runai-mvp-runai-pool-4a223292-5mql/runai-preinstall-diagnostics-bc48w] -> [gke-runai-mvp-runai-pool-4a223292-5mql/runai-preinstall-diagnostics-bc48w]: successfully pinged
[LOG] [gke-runai-mvp-runai-pool-4a223292-5mql/runai-preinstall-diagnostics-bc48w] -> [gke-runai-mvp-runai-pool-4a223292-5mql/runai-preinstall-diagnostics-bc48w]: node clocks are in sync
[LOG] attempting to ping pod [gke-runai-mvp-runai-pool-a6becd78-j7qs/runai-preinstall-diagnostics-qgss6]...
[LOG] [gke-runai-mvp-runai-pool-4a223292-5mql/runai-preinstall-diagnostics-bc48w] -> [gke-runai-mvp-runai-pool-a6becd78-j7qs/runai-preinstall-diagnostics-qgss6]: successfully pinged
[LOG] [gke-runai-mvp-runai-pool-4a223292-5mql/runai-preinstall-diagnostics-bc48w] -> [gke-runai-mvp-runai-pool-a6becd78-j7qs/runai-preinstall-diagnostics-qgss6]: node clocks are in sync
[PASS]
[COMPLETE]
========================== LOGS FROM NODE gke-runai-mvp-runai-pool-a6becd78-j7qs ==========================
Logs for [gke-runai-mvp-runai-pool-a6becd78-j7qs/runai-preinstall-diagnostics-qgss6]:
[LOG] initializing Kubernetes client...
[LOG] successfully initialized Kubernetes client
[TEST] Run:AI service access: https://app.run.ai
--------------------------------------------------
[LOG] Run:AI service is accessible from within the cluster
[PASS]
[TEST] DNS Servers access
--------------------------------------------------
[LOG] Address for [app.run.ai] is [172.67.145.177], resolved by [1.1.1.1:53]
[LOG] Address for [app.run.ai] is [104.21.95.156], resolved by [8.8.8.8:53]
[PASS]
[TEST] Dynu DNS service access: https://api.dynu.com
--------------------------------------------------
[LOG] Dynu DNS service is accessible from within the cluster
[PASS]
[TEST] Connectivity to runai container registry: https://gcr.io/run-ai-prod
--------------------------------------------------
[LOG] Run:AI container registry is accessible
[PASS]
[TEST] DNS Resolver
--------------------------------------------------
[WARNING] Backend FQDN was not provided using the --domain flag, skipping test
[SKIP]
[PASS]
[TEST] Print resolv.conf
--------------------------------------------------
[LOG] search runai-preinstall-diagnostics.svc.cluster.local svc.cluster.local cluster.local me-west1-c.c.runai-prod.internal c.runai-prod.internal google.internal
nameserver 10.0.32.10
options ndots:5
[PASS]
[TEST] Run:AI Helm Repository
--------------------------------------------------
[PASS]
[TEST] DockerHub
--------------------------------------------------
[PASS]
[TEST] Quay.io
--------------------------------------------------
[PASS]
[TEST] Run:AI Prometheus
--------------------------------------------------
[PASS]
[TEST] Run:AI Auth Provider
--------------------------------------------------
[PASS]
[TEST] OS Information
--------------------------------------------------
[LOG] Os Info: Linux runai-preinstall-diagnostics-qgss6 5.15.89+ #1 SMP Sat Mar 18 09:27:02 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
[PASS]
[TEST] Node connectivity check
--------------------------------------------------
[LOG] not all pods are ready [0/2], retrying in 5 seconds
[LOG] not all pods are ready [0/2], retrying in 5 seconds
[LOG] all daemonset pods are available
[LOG] attempting to ping pod [gke-runai-mvp-runai-pool-4a223292-5mql/runai-preinstall-diagnostics-bc48w]...
[LOG] [gke-runai-mvp-runai-pool-a6becd78-j7qs/runai-preinstall-diagnostics-qgss6] -> [gke-runai-mvp-runai-pool-4a223292-5mql/runai-preinstall-diagnostics-bc48w]: successfully pinged
[LOG] [gke-runai-mvp-runai-pool-a6becd78-j7qs/runai-preinstall-diagnostics-qgss6] -> [gke-runai-mvp-runai-pool-4a223292-5mql/runai-preinstall-diagnostics-bc48w]: node clocks are in sync
[LOG] attempting to ping pod [gke-runai-mvp-runai-pool-a6becd78-j7qs/runai-preinstall-diagnostics-qgss6]...
[LOG] [gke-runai-mvp-runai-pool-a6becd78-j7qs/runai-preinstall-diagnostics-qgss6] -> [gke-runai-mvp-runai-pool-a6becd78-j7qs/runai-preinstall-diagnostics-qgss6]: successfully pinged
[LOG] [gke-runai-mvp-runai-pool-a6becd78-j7qs/runai-preinstall-diagnostics-qgss6] -> [gke-runai-mvp-runai-pool-a6becd78-j7qs/runai-preinstall-diagnostics-qgss6]: node clocks are in sync
[PASS]
[COMPLETE]
cleaning up...
waiting for all resources to be deleted...
all resources were successfully deleted
[WARNING] Cluster setup includes components that will require the customization of Run:AI installation. For more details, see installation instructions