Example of Troubleshooting Prom GUI and GMP on GKE Autopilot

Overview

The following is based on two GCP / GKE how-to guides: Google's docs for managed collection with GMP (Google Managed Prometheus) and for querying it via the Prometheus frontend UI (see https://cloud.google.com/stackdriver/docs/managed-prometheus).

What this gist adds are additional checks and verification commands that can be run to help troubleshoot.

Step 0: Prereqs

  • provisioned GKE Autopilot cluster v1.26.5 with all defaults
  • alias k=kubectl
  • get kubectl access to it
  • make sure kubectl get node shows at least 1 node. (Note: a fresh GKE Autopilot cluster waits for you to deploy a pod before it provisions nodes, so if you don't see at least 1 node, run k run nginx --image=nginx to force provisioning.)

Step 1: Set bash env vars

# v-- replace with your own GCP project ID & preferred namespace / SA names
export PROJECT=chrism-playground-369416
export NAMESPACE=test
export SA_SHORT_NAME=gmp-test-sa
# v-- full name of the GCP service account that will be created in Step 5B
export SA_NAME=$SA_SHORT_NAME@$PROJECT.iam.gserviceaccount.com

Step 2: GKE Autopilot defaults to GMP (Google Managed Prometheus) enabled & workload identity enabled

# Verify you see the 4 workloads associated with GMP enabled
k get po -n=gke-gmp-system 
# from --^, you should see alertmanager, gmp-operator, rule-evaluator, and collector,
# if you don't see collector check that you have at least 1 node running,
# since collector is a daemonset it'll wait for 1 node to exist before showing up.
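
# v-- optional: tail the collector logs to look for scrape / config errors
#     (a sketch: the container name "prometheus" is an assumption based on the
#     GMP collector pod spec; confirm with k get po -n=gke-gmp-system -o yaml)
k logs ds/collector -n=gke-gmp-system -c prometheus --tail=20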

# Verify nodes labeled with metadata-server-enabled (means workload identity enabled)
k get node -L=iam.gke.io/gke-metadata-server-enabled
# NAME                                           STATUS   ROLES    AGE     VERSION            GKE-METADATA-SERVER-ENABLED
# gk3-autopilot-cluster-1-pool-1-16322d28-zqw4   Ready    <none>   175m    v1.26.5-gke.1200   true
# gk3-autopilot-cluster-1-pool-1-4be8ad63-jlbw   Ready    <none>   6m31s   v1.26.5-gke.1200   true
# gk3-autopilot-cluster-1-pool-1-dbcf5863-76pf   Ready    <none>   168m    v1.26.5-gke.1200   true
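
You can also confirm workload identity at the cluster level with gcloud (a sketch: the cluster name is inferred from the node names above; swap in your own cluster name & location):

# v-- prints <project>.svc.id.goog when workload identity is enabled
gcloud container clusters describe autopilot-cluster-1 \
  --location=us-central1 \
  --format='value(workloadIdentityConfig.workloadPool)'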

Step 3: Deploy test workload and see prom metrics via curl

# Following comes from the docs:
###################################################################
# Deploy the demo app & podmonitor custom resource

k create ns $NAMESPACE

kubectl -n $NAMESPACE apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/prometheus-engine/v0.7.0/examples/pod-monitoring.yaml

kubectl -n $NAMESPACE apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/prometheus-engine/v0.7.0/examples/example-app.yaml

k scale deploy prom-example --replicas=1 -n=$NAMESPACE
####################################################################
# Verification Commands:

k get po -o wide -n=$NAMESPACE
# (pod IP of prom-example-5cd7b77867-plwh4 is 10.8.0.72)

k get deploy prom-example -n=$NAMESPACE -o yaml | grep "name: metrics" -B 2
#        ports:
#        - containerPort: 1234
#          name: metrics
# (^-- says metrics available on port 1234)
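
# v-- verify the podmonitoring custom resource was created & accepted by the
#     operator (a sketch: the resource name prom-example comes from the
#     upstream pod-monitoring.yaml; status condition details may vary by version)
k get podmonitoring -n=$NAMESPACE
k get podmonitoring prom-example -n=$NAMESPACE -o yaml | grep -A 10 "status:"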

k run -it curl --image=curlimages/curl -n=$NAMESPACE -- sh
# ^-- run this from laptop shell, to gain pod shell

exit
# ^-- exit from pod shell to laptop shell

k exec -it pod/curl -n=$NAMESPACE -- curl 10.8.0.72:1234/metrics
# ^-- run this from laptop shell, runs curl from within pod named curl without switching shell context
# ^-- shows prometheus metrics
# 
# ...
# example_random_numbers_bucket{le="+Inf"} 4.4757322148e+10
# example_random_numbers_sum 3.571108803002773e+10
# example_random_numbers_count 4.4757322148e+10
# # HELP example_requests_total Total number of HTTP requests by status code and method.
# # TYPE example_requests_total counter
# example_requests_total{code="200",method="get"} 8611
# ...
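
# v-- optional: avoid hardcoding the pod IP (a sketch: the
#     app.kubernetes.io/name=prom-example label is an assumption based on the
#     upstream example-app.yaml; confirm with k get po --show-labels -n=$NAMESPACE)
POD_IP=$(k get po -n=$NAMESPACE -l app.kubernetes.io/name=prom-example -o jsonpath='{.items[0].status.podIP}')
k exec pod/curl -n=$NAMESPACE -- curl -s $POD_IP:1234/metrics | head -n 20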
#####################################################################

Step 4: Let's look at the metrics from the GCP GUI

  1. Go to the following URL
    https://console.cloud.google.com/monitoring/metrics-explorer
  2. If necessary update the URL to contain your project
    https://console.cloud.google.com/monitoring/metrics-explorer?project=chrism-playground-369416
  3. Click through the following to access PromQL query mode (example queries below):
  • Switch Mode from Builder to Code (Query Language). (screenshot)
  • Switch Query Language to PromQL. (screenshot)
  • Update the timeframe to something like last 3 hours, and reference a prometheus metric that you saw when curling the metrics endpoint in Step 3 (e.g. example_requests_total). (screenshot)
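
A couple of example PromQL queries to try in that mode (the metric name comes from the demo app scraped in Step 3):

# raw counter from the demo app
example_requests_total
# per-second request rate over the last 5 minutes
rate(example_requests_total[5m])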

Step 5: Deploy the prometheus GUI

Step 5A: Overview

  • The following link offers a good architecture diagram of how the prometheus GUI works
    https://cloud.google.com/stackdriver/docs/managed-prometheus#gmp-system-overview
  • To summarize it:
    • Google has a metrics database called Monarch; it has a Prometheus API compatibility layer that's ~95% compatible with prometheus.
    • The gmp-operator pod in the gke-gmp-system namespace configures the collector pods in the same namespace, based on podmonitoring custom resources.
    • A collector pod is basically a prometheus metric-shipping agent: it ships to the Monarch prom API endpoint, and the metrics are stored in Monarch (you can curl that endpoint directly, as sketched below).
    • In Step 5C, we'll deploy a prom GUI frontend that reads from Monarch and presents the data more like a traditional prometheus install, which something like Grafana can read from.
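
Because Monarch exposes a Prometheus-compatible query API, you can sanity-check it before deploying any frontend. A sketch using a gcloud access token against the documented Managed Service for Prometheus query endpoint:

# v-- query Monarch's prom-compatible API directly (run from laptop shell)
curl -s -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://monitoring.googleapis.com/v1/projects/$PROJECT/location/global/prometheus/api/v1/query" \
  --data-urlencode "query=example_requests_total"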

Step 5B: Make a service account for the Prometheus Frontend GUI & Verify it has the correct Rights

# v-- set gcloud context
gcloud config set project $PROJECT

# v-- create GCP SA
gcloud iam service-accounts create $SA_SHORT_NAME

# v-- annotate the default Kube SA in the namespace, with a reference to the GCP SA, 
#     to establish a link between them as needed by GKE workload identity.
kubectl annotate serviceaccount \
  default \
  --namespace $NAMESPACE \
  iam.gke.io/gcp-service-account=$SA_NAME

# v-- grant the workloadIdentityUser GCP IAM role on the GCP SA to the default
#     Kubernetes service account in the kube namespace, so it can impersonate the GCP SA
gcloud iam service-accounts add-iam-policy-binding \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:$PROJECT.svc.id.goog[$NAMESPACE/default]" \
  $SA_NAME

# v-- add monitoring.viewer GCP IAM role to the GCP SA, which the kube SA
#     is now linked to
gcloud projects add-iam-policy-binding $PROJECT \
  --member=serviceAccount:$SA_NAME \
  --role=roles/monitoring.viewer

# Note: we won't specify a service account for the Prometheus GUI, so it'll
#       use the default service account in that namespace; the commands above
#       gave the needed rights to that default service account.

#############################################################################
# Verification Commands to validate what was just done

k get sa default -n=$NAMESPACE -o yaml | grep annotation -A 1
#  annotations:
#    iam.gke.io/gcp-service-account: gmp-test-sa@chrism-playground-369416.iam.gserviceaccount.com

gcloud projects get-iam-policy $PROJECT \
 --flatten="bindings[].members" \
 --format='table(bindings.role)' \
 --filter="bindings.members:$SA_NAME" 
# ROLE
# roles/monitoring.viewer

gcloud asset search-all-iam-policies --scope=projects/$PROJECT --query="$SA_NAME"
# ...
# policy:
#   bindings:
#   - members:
#     - serviceAccount:chrism-playground-369416.svc.id.goog[test/default]
#     role: roles/iam.workloadIdentityUser
# ...
# policy:
#   bindings:
#   - members:
#     - serviceAccount:gmp-test-sa@chrism-playground-369416.iam.gserviceaccount.com
#     role: roles/monitoring.viewer
# ...
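
To confirm workload identity end-to-end, you can ask the GKE metadata server which identity a pod in the namespace actually gets (a sketch; wi-test is just a throwaway pod name):

# v-- run a temporary pod & query the metadata server for the active SA
k run wi-test -n=$NAMESPACE --rm -it --restart=Never --image=curlimages/curl -- \
  curl -s -H "Metadata-Flavor: Google" \
  http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/email
# ^-- should print gmp-test-sa@<your-project>.iam.gserviceaccount.com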

Step 5C: Deploy Prometheus Frontend GUI

  • These instructions are mostly based on the docs, but with a purposeful mistake left in, which can help with practicing debugging (the fix is sketched at the end of this step).
# v-- The sed is where the mistake happens: single quotes stop the shell from
#     expanding $PROJECT, so the literal string $PROJECT lands in the YAML
curl https://raw.githubusercontent.com/GoogleCloudPlatform/prometheus-engine/v0.7.0/examples/frontend.yaml |
sed 's/\$PROJECT_ID/$PROJECT/' |
kubectl apply -n $NAMESPACE -f -

k scale deploy frontend -n=$NAMESPACE --replicas=1
#  ^-- it defaults to 2 replicas which is unnecessary

k get po -n=$NAMESPACE

kubectl -n $NAMESPACE port-forward svc/frontend 9090
  • If the frontend.yaml is incorrect, the GUI may show errors like this. (screenshot)
  • Here's the problem: --query.project-id was set to the literal string $PROJECT. (screenshot)
  • Once the --query.project-id flag in the YAML is fixed to correctly reference the project the IAM is configured for, it'll start to work and look like the following. (screenshot)
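
A sketch of the corrected deploy: double quotes let the shell expand $PROJECT, while the \$ keeps the $PROJECT_ID placeholder literal for sed:

curl https://raw.githubusercontent.com/GoogleCloudPlatform/prometheus-engine/v0.7.0/examples/frontend.yaml |
sed "s/\$PROJECT_ID/$PROJECT/" |
kubectl apply -n $NAMESPACE -f -
# ^-- re-applying updates the deployment's --query.project-id flag & triggers a rollout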
@mcscwizzy commented:

I cannot thank you enough!!! I absolutely despise Google Cloud and GKE. I'm more of an Azure/AWS person. I came into a new job and one of their first tasks was for me to fix their GMP/Grafana setup. Been at this for three days and you have saved my backside. I can't thank you enough!!!!
