@lvaylet
Last active March 13, 2024 11:01
SLO Generator Demo Environment

  1. Create a new project, for example slo-generator-demo, and set it as the current project with:

    gcloud projects create slo-generator-demo --set-as-default
  2. Open Cloud Shell and save the project ID and project number to environment variables with:

    export PROJECT_ID=$(gcloud config get-value project)
    export PROJECT_NUMBER=$(gcloud projects list --filter="${PROJECT_ID}" --format="value(PROJECT_NUMBER)")
  3. If provisioning in Argolis, create the expected default network and override any organization policies that would block resources required by the Cloud Ops Sandbox, such as the GKE cluster or the public Cloud SQL instance, for example with:

    gcloud services enable compute.googleapis.com
    
    gcloud compute networks create default
    
    for boolean_policy_id in sql.restrictPublicIp compute.requireShieldedVm compute.requireOsLogin
    do
        gcloud resource-manager org-policies \
          disable-enforce constraints/${boolean_policy_id} \
          --project=${PROJECT_ID}
    done
    
    cat <<EOF | gcloud resource-manager org-policies set-policy --project=${PROJECT_ID} /dev/stdin
    constraint: constraints/compute.vmExternalIpAccess
    listPolicy:
      allValues: ALLOW
    EOF   
    
    cat <<EOF | gcloud resource-manager org-policies set-policy --project=${PROJECT_ID} /dev/stdin
    constraint: constraints/compute.vmCanIpForward
    listPolicy:
      allValues: ALLOW
    EOF  
  4. Provision the Cloud Ops Sandbox resources in this existing project with:

    pip3 install google-cloud-pubsub
    git clone https://github.com/GoogleCloudPlatform/cloud-ops-sandbox
    cd cloud-ops-sandbox/provisioning
    ./sandboxctl create -p ${PROJECT_ID}
  5. Wait a while (around 20 minutes for a full provisioning) for the success message:

    module.monitoring.google_monitoring_alert_policy.availability_slo_burn_alert[0]: Creation complete after 26s [id=projects/united-concord-398009/alertPolicies/8248738219467070757]
    module.monitoring.google_monitoring_alert_policy.availability_slo_burn_alert[8]: Creation complete after 26s [id=projects/united-concord-398009/alertPolicies/452382904543783588]
    module.monitoring.google_monitoring_alert_policy.availability_slo_burn_alert[1]: Creation complete after 25s [id=projects/united-concord-398009/alertPolicies/16173015502557114609]
    
    Apply complete! Resources: 83 added, 0 changed, 0 destroyed.
    
    Outputs:
    
    frontend_external_ip = "34.31.187.72"
    
    Explore Cloud Ops Sandbox features by browsing
    
    GKE Dashboard: https://console.cloud.google.com/kubernetes/workload?project=united-concord-398009
    Monitoring Workspace: https://console.cloud.google.com/monitoring/?project=united-concord-398009
    Try Online Boutique at http://34.31.187.72/
  6. Click every URL and confirm everything works as expected.

  7. Install the SLO Generator from PyPI, export the required environment variables, and run a simple example from the documentation:

    pip3 install "slo-generator[cloud-monitoring]"  # quotes keep shells like zsh from globbing the brackets
    
    mkdir slo-generator-demo
    cd slo-generator-demo
    
    export GAE_PROJECT_ID=${PROJECT_ID}
    export CLOUD_OPS_PROJECT_ID=${PROJECT_ID}
    export COLORED_OUTPUT=1
    
    cat <<EOF > slo_gae_app_availability.yaml
    apiVersion: sre.google.com/v2
    kind: ServiceLevelObjective
    metadata:
      name: gae-app-availability
      labels:
        service_name: gae
        feature_name: app
        slo_name: availability
    spec:
      description: Availability of App Engine app
      backend: cloud_monitoring
      method: good_bad_ratio
      exporters:
      - cloud_monitoring
      service_level_indicator:
        filter_good: >
          project=${GAE_PROJECT_ID}
          metric.type="appengine.googleapis.com/http/server/response_count"
          resource.type="gae_app"
          ( metric.labels.response_code = 429 OR
            metric.labels.response_code = 200 OR
            metric.labels.response_code = 201 OR
            metric.labels.response_code = 202 OR
            metric.labels.response_code = 203 OR
            metric.labels.response_code = 204 OR
            metric.labels.response_code = 205 OR
            metric.labels.response_code = 206 OR
            metric.labels.response_code = 207 OR
            metric.labels.response_code = 208 OR
            metric.labels.response_code = 226 OR
            metric.labels.response_code = 304 )
        filter_valid: >
          project=${GAE_PROJECT_ID}
          metric.type="appengine.googleapis.com/http/server/response_count"
      goal: 0.95
    EOF
    
    cat <<EOF > shared_config.yaml
    backends:
      cloud_monitoring:
        project_id: ${CLOUD_OPS_PROJECT_ID}
    
    exporters:
      cloud_monitoring:
        project_id: ${CLOUD_OPS_PROJECT_ID}
    
    error_budget_policies:
      default:
        steps:
        - name: 1 hour
          burn_rate_threshold: 9
          alert: true
          message_alert: Page to defend the SLO
          message_ok: Last hour on track
          window: 3600
        - name: 12 hours
          burn_rate_threshold: 3
          alert: true
          message_alert: Page to defend the SLO
          message_ok: Last 12 hours on track
          window: 43200
        - name: 7 days
          burn_rate_threshold: 1.5
          alert: false
          message_alert: Dev team dedicates 25% of engineers to the reliability backlog
          message_ok: Last week on track
          window: 604800
        - name: 28 days
          burn_rate_threshold: 1
          alert: false
          message_alert: Freeze release, unless related to reliability or security
          message_ok: Unfreeze release, per the agreed roll-out policy
          window: 2419200
    EOF
    
    slo-generator compute -f slo_gae_app_availability.yaml -c shared_config.yaml
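The error budget policy above can be read as "how quickly would a sustained burn rate of this size exhaust a 28-day error budget". A quick sanity check of the thresholds, using plain arithmetic (no slo-generator code involved):

```python
# At a sustained burn rate b, the error budget for the whole SLO period
# is consumed in period / b. The SLO period here is 28 days.
SLO_PERIOD_DAYS = 28

steps = [("1 hour", 9), ("12 hours", 3), ("7 days", 1.5), ("28 days", 1)]
for name, threshold in steps:
    days = SLO_PERIOD_DAYS / threshold
    print(f"{name:>8}: burn rate {threshold} exhausts the budget in {days:.1f} days")
```

So the paging thresholds (9 and 3) correspond to burning through a month of error budget in roughly 3 to 9 days, which is why they alert, while the slower burns only trigger process changes.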
  8. Confirm the SLO is computed and displayed correctly over all four error budget policy windows, with output like:

    INFO - gae-app-availability             | 1 hour   | SLI: 100.0   % | SLO: 95.0 % | Gap: +5.0  % | BR: 0.0 / 9.0 | Alert: 0 | Good: 1085     | Bad: 0
    INFO - gae-app-availability             | 12 hours | SLI: 99.7078 % | SLO: 95.0 % | Gap: +4.71 % | BR: 0.1 / 3.0 | Alert: 0 | Good: 13647    | Bad: 40
    INFO - gae-app-availability             | 7 days   | SLI: 99.5062 % | SLO: 95.0 % | Gap: +4.51 % | BR: 0.1 / 1.5 | Alert: 0 | Good: 50382    | Bad: 250
    INFO - gae-app-availability             | 28 days  | SLI: 99.5062 % | SLO: 95.0 % | Gap: +4.51 % | BR: 0.1 / 1.0 | Alert: 0 | Good: 50382    | Bad: 250
    INFO - Run finished successfully in 3.0s.
    INFO - Run summary | SLO Configs: 1 | Duration: 3.0s
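As a cross-check, the SLI and BR columns follow directly from the good/bad counts: SLI = good / (good + bad), and burn rate = (1 - SLI) / (1 - SLO). Recomputing the 12-hour row from the sample output above:

```python
# Values taken from the "12 hours" row of the sample output.
good, bad = 13647, 40
slo = 0.95

sli = good / (good + bad)          # fraction of good events
burn_rate = (1 - sli) / (1 - slo)  # share of the error budget consumed

print(f"SLI: {sli * 100:.4f} %")   # matches the 99.7078 % in the report
print(f"BR:  {burn_rate:.1f}")     # matches the 0.1 in the report
```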
  9. Go to Monitoring > Metrics explorer and confirm that the following MQL query yields the same good-to-valid ratio:

    fetch gae_app
    | metric 'appengine.googleapis.com/http/server/response_count'
    | { filter
          (metric.response_code == 200
           || metric.response_code == 201
           || metric.response_code == 202
           || metric.response_code == 203
           || metric.response_code == 204
           || metric.response_code == 205
           || metric.response_code == 206
           || metric.response_code == 207
           || metric.response_code == 208
           || metric.response_code == 226
           || metric.response_code == 304
           || metric.response_code == 429)
      ; ident }
    | ratio
    
  10. Provision a Cloud Run service that randomly fails and returns a 500 code. The Cloud Code sample Python project (with Flask, not Django) can easily be modified to offer this “feature”, for example by returning an error 25% of the time based on a random number generator:

    import os
    from random import randint
    from flask import Flask, abort  # abort() raises the HTTP 500 below
    [...]
    @app.route('/')
    def hello():
        """Return either a friendly HTTP greeting or a 5xx error."""
        return_an_error = (randint(1, 4) == 1)
        if return_an_error:
            abort(500)
    
        message = "It's running!"
        [...]

    Note that you might have to override another Org Policy in Argolis to allow unauthenticated access to this Cloud Run service. See Argolis Troubleshooting Tips for details. Alternatively, configure the service to require authentication and call it from the command line as an authenticated user:

    $ curl -H "Authorization: Bearer $(gcloud auth print-identity-token)" https://cloud-run-randomly-fails-4zbr2zmcxq-od.a.run.app
    <!doctype html>
    <html lang=en>
    <title>500 Internal Server Error</title>
    <h1>Internal Server Error</h1>
    <p>The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.</p>
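The `randint(1, 4) == 1` check above fails roughly one request in four. An empirical sanity check of that rate, with a fixed seed so the result is reproducible:

```python
from random import Random

# Simulate 100,000 requests and count how many would have been
# aborted with a 500 by the randint(1, 4) == 1 check.
rng = Random(42)
errors = sum(1 for _ in range(100_000) if rng.randint(1, 4) == 1)
print(f"simulated error rate: {errors / 100_000:.1%}")  # close to 25%
```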
  11. Retrieve the service URL with:

    export CLOUD_RUN_SERVICE_URL=$(gcloud run services list --filter="cloud-run-randomly-fails" --format="value(URL)")
  12. Generate traffic manually, for example with:

    TOKEN=$(gcloud auth print-access-token)      # OAuth access token, for the Monitoring API call in the next step
    ID_TOKEN=$(gcloud auth print-identity-token) # Cloud Run expects an OIDC identity token, not an access token
    for i in {1..10}; do curl -H "Authorization: Bearer ${ID_TOKEN}" ${CLOUD_RUN_SERVICE_URL}; done
  13. Query the API using MQL with:

    cat > query.json << EOF
    {
        "query": "fetch cloud_run_revision | metric 'run.googleapis.com/request_count' | { filter metric.response_code_class == '2xx' ; ident } | ratio | group_by [] | within 3600s | every 3600s"
    }
    EOF
    curl -d @query.json \
        -H "Authorization: Bearer ${TOKEN}" \
        --header "Content-Type: application/json" \
        -X POST \
        https://monitoring.googleapis.com/v3/projects/${PROJECT_ID}/timeSeries:query
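The response is a QueryTimeSeriesResponse JSON document; the computed ratio lands under `timeSeriesData[].pointData[].values[].doubleValue`. The fragment below is a hypothetical response, trimmed of the descriptor and timestamp fields a real payload also carries:

```python
import json

# Hypothetical, trimmed timeSeries:query response. A real payload also
# includes timeSeriesDescriptor metadata and point time intervals.
payload = """
{
  "timeSeriesData": [
    {"pointData": [{"values": [{"doubleValue": 0.74}]}]}
  ]
}
"""

response = json.loads(payload)
ratio = response["timeSeriesData"][0]["pointData"][0]["values"][0]["doubleValue"]
print(f"2xx ratio over the last hour: {ratio:.0%}")
```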
  14. Generate traffic and historical data every minute for later reuse with a Cloud Scheduler Job:

    gcloud projects add-iam-policy-binding ${PROJECT_ID} \
        --member serviceAccount:${PROJECT_NUMBER}-compute@developer.gserviceaccount.com \
        --role roles/run.invoker
    gcloud scheduler jobs create http cloud-run-randomly-fails-load-tester \
        --schedule "* * * * *" \
        --uri "${CLOUD_RUN_SERVICE_URL}" \
        --http-method GET \
        --oidc-service-account-email ${PROJECT_NUMBER}-compute@developer.gserviceaccount.com \
        --location europe-west1

TODO

  • Use a dedicated project for the Cloud Operations Sandbox, so it is easier to provision and tear down? Or let the Cloud Operations Sandbox CLI create one under the admin@lvaylet.altostrat.com identity to avoid billing account issues? Then keep slo-generator-demo lean, with just the randomly failing Cloud Run service and a load generator (a Cloud Scheduler job). Here the sandbox had to be destroyed with:

    laurent@cloudshell:~ (slo-generator-demo)$ cd cloud-ops-sandbox/terraform/
    laurent@cloudshell:~ (slo-generator-demo)$ project_id=$(gcloud config get-value project)
    laurent@cloudshell:~ (slo-generator-demo)$ bucket_name="${project_id}-bucket"
    laurent@cloudshell:~ (slo-generator-demo)$ terraform init -backend-config "bucket=${bucket_name}"
    laurent@cloudshell:~ (slo-generator-demo)$ terraform destroy -var="project_id=${project_id}" -var="bucket_name=${bucket_name}"

    And no file from the original repo was modified:

    laurent@cloudshell:~/cloud-ops-sandbox (slo-generator-demo)$ git status
    On branch main
    Your branch is up to date with 'origin/main'.
    
    nothing to commit, working tree clean