@lvaylet
Last active March 13, 2024 11:01
SLO Generator Demo Environment

  1. Create a new project, for example slo-generator-demo, and set it as the current project with:

    gcloud projects create slo-generator-demo --set-as-default
  2. Open Cloud Shell and save the project ID and project number to environment variables with:

    export PROJECT_ID=$(gcloud config get-value project)
    export PROJECT_NUMBER=$(gcloud projects list --filter="${PROJECT_ID}" --format="value(PROJECT_NUMBER)")
  3. If provisioning in Argolis, create the expected default network and override any organization policies that would block resources required by the Cloud Ops Sandbox, such as the GKE cluster or the public Cloud SQL instance, for example with:

    gcloud services enable compute.googleapis.com
    
    gcloud compute networks create default
    
    for boolean_policy_id in sql.restrictPublicIp compute.requireShieldedVm compute.requireOsLogin
    do
        gcloud resource-manager org-policies \
          disable-enforce constraints/${boolean_policy_id} \
          --project=${PROJECT_ID}
    done
    
    cat <<EOF | gcloud resource-manager org-policies set-policy --project=${PROJECT_ID} /dev/stdin
    constraint: constraints/compute.vmExternalIpAccess
    listPolicy:
      allValues: ALLOW
    EOF   
    
    cat <<EOF | gcloud resource-manager org-policies set-policy --project=${PROJECT_ID} /dev/stdin
    constraint: constraints/compute.vmCanIpForward
    listPolicy:
      allValues: ALLOW
    EOF  
  4. Provision the Cloud Ops Sandbox resources in this existing project with:

    pip3 install google-cloud-pubsub
    git clone https://github.com/GoogleCloudPlatform/cloud-ops-sandbox
    cd cloud-ops-sandbox/provisioning
    ./sandboxctl create -p ${PROJECT_ID}
  5. Wait a while (around 20 minutes for a full provisioning) for the success message:

    module.monitoring.google_monitoring_alert_policy.availability_slo_burn_alert[0]: Creation complete after 26s [id=projects/united-concord-398009/alertPolicies/8248738219467070757]
    module.monitoring.google_monitoring_alert_policy.availability_slo_burn_alert[8]: Creation complete after 26s [id=projects/united-concord-398009/alertPolicies/452382904543783588]
    module.monitoring.google_monitoring_alert_policy.availability_slo_burn_alert[1]: Creation complete after 25s [id=projects/united-concord-398009/alertPolicies/16173015502557114609]
    
    Apply complete! Resources: 83 added, 0 changed, 0 destroyed.
    
    Outputs:
    
    frontend_external_ip = "34.31.187.72"
    
    Explore Cloud Ops Sandbox features by browsing
    
    GKE Dashboard: https://console.cloud.google.com/kubernetes/workload?project=united-concord-398009
    Monitoring Workspace: https://console.cloud.google.com/monitoring/?project=united-concord-398009
    Try Online Boutique at http://34.31.187.72/
  6. Click every URL and confirm everything works as expected.

  7. Install the SLO Generator from PyPI, export the required environment variables, and run a simple example from the documentation:

    pip3 install "slo-generator[cloud-monitoring]"  # quotes keep shells like zsh from globbing the brackets
    
    mkdir slo-generator-demo
    cd slo-generator-demo
    
    export GAE_PROJECT_ID=${PROJECT_ID}
    export CLOUD_OPS_PROJECT_ID=${PROJECT_ID}
    export COLORED_OUTPUT=1
    
    cat <<EOF > slo_gae_app_availability.yaml
    apiVersion: sre.google.com/v2
    kind: ServiceLevelObjective
    metadata:
      name: gae-app-availability
      labels:
        service_name: gae
        feature_name: app
        slo_name: availability
    spec:
      description: Availability of App Engine app
      backend: cloud_monitoring
      method: good_bad_ratio
      exporters:
      - cloud_monitoring
      service_level_indicator:
        filter_good: >
          project=${GAE_PROJECT_ID}
          metric.type="appengine.googleapis.com/http/server/response_count"
          resource.type="gae_app"
          ( metric.labels.response_code = 429 OR
            metric.labels.response_code = 200 OR
            metric.labels.response_code = 201 OR
            metric.labels.response_code = 202 OR
            metric.labels.response_code = 203 OR
            metric.labels.response_code = 204 OR
            metric.labels.response_code = 205 OR
            metric.labels.response_code = 206 OR
            metric.labels.response_code = 207 OR
            metric.labels.response_code = 208 OR
            metric.labels.response_code = 226 OR
            metric.labels.response_code = 304 )
        filter_valid: >
          project=${GAE_PROJECT_ID}
          metric.type="appengine.googleapis.com/http/server/response_count"
      goal: 0.95
    EOF
    
    cat <<EOF > shared_config.yaml
    backends:
      cloud_monitoring:
        project_id: ${CLOUD_OPS_PROJECT_ID}
    
    exporters:
      cloud_monitoring:
        project_id: ${CLOUD_OPS_PROJECT_ID}
    
    error_budget_policies:
      default:
        steps:
        - name: 1 hour
          burn_rate_threshold: 9
          alert: true
          message_alert: Page to defend the SLO
          message_ok: Last hour on track
          window: 3600
        - name: 12 hours
          burn_rate_threshold: 3
          alert: true
          message_alert: Page to defend the SLO
          message_ok: Last 12 hours on track
          window: 43200
        - name: 7 days
          burn_rate_threshold: 1.5
          alert: false
          message_alert: Dev team dedicates 25% of engineers to the reliability backlog
          message_ok: Last week on track
          window: 604800
        - name: 28 days
          burn_rate_threshold: 1
          alert: false
          message_alert: Freeze release, unless related to reliability or security
          message_ok: Unfreeze release, per the agreed roll-out policy
          window: 2419200
    EOF
    
    slo-generator compute -f slo_gae_app_availability.yaml -c shared_config.yaml
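The error budget policy above can be read as "how quickly would a sustained burn rate of this size exhaust a 28-day error budget". A quick sanity check of the thresholds, using plain arithmetic (no slo-generator code involved):

```python
# At a sustained burn rate b, the error budget for the whole SLO period
# is consumed in period / b. The SLO period here is 28 days.
SLO_PERIOD_DAYS = 28

steps = [("1 hour", 9), ("12 hours", 3), ("7 days", 1.5), ("28 days", 1)]
for name, threshold in steps:
    days = SLO_PERIOD_DAYS / threshold
    print(f"{name:>8}: burn rate {threshold} exhausts the budget in {days:.1f} days")
```

So the paging thresholds (9 and 3) correspond to burning through a month of error budget in roughly 3 to 9 days, which is why they alert, while the slower burns only trigger process changes.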
  8. Confirm the SLO is computed and displayed correctly over all four error budget policy windows, with output like:

    INFO - gae-app-availability             | 1 hour   | SLI: 100.0   % | SLO: 95.0 % | Gap: +5.0  % | BR: 0.0 / 9.0 | Alert: 0 | Good: 1085     | Bad: 0
    INFO - gae-app-availability             | 12 hours | SLI: 99.7078 % | SLO: 95.0 % | Gap: +4.71 % | BR: 0.1 / 3.0 | Alert: 0 | Good: 13647    | Bad: 40
    INFO - gae-app-availability             | 7 days   | SLI: 99.5062 % | SLO: 95.0 % | Gap: +4.51 % | BR: 0.1 / 1.5 | Alert: 0 | Good: 50382    | Bad: 250
    INFO - gae-app-availability             | 28 days  | SLI: 99.5062 % | SLO: 95.0 % | Gap: +4.51 % | BR: 0.1 / 1.0 | Alert: 0 | Good: 50382    | Bad: 250
    INFO - Run finished successfully in 3.0s.
    INFO - Run summary | SLO Configs: 1 | Duration: 3.0s
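As a cross-check, the SLI and BR columns follow directly from the good/bad counts: SLI = good / (good + bad), and burn rate = (1 - SLI) / (1 - SLO). Recomputing the 12-hour row from the sample output above:

```python
# Values taken from the "12 hours" row of the sample output.
good, bad = 13647, 40
slo = 0.95

sli = good / (good + bad)          # fraction of good events
burn_rate = (1 - sli) / (1 - slo)  # share of the error budget consumed

print(f"SLI: {sli * 100:.4f} %")   # matches the 99.7078 % in the report
print(f"BR:  {burn_rate:.1f}")     # matches the 0.1 in the report
```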
  9. Go to Monitoring > Metrics explorer and confirm that the following MQL query yields the same good-to-valid ratio:

    fetch gae_app
    | metric 'appengine.googleapis.com/http/server/response_count'
    | { filter
          (metric.response_code == 200
           || metric.response_code == 201
           || metric.response_code == 202
           || metric.response_code == 203
           || metric.response_code == 204
           || metric.response_code == 205
           || metric.response_code == 206
           || metric.response_code == 207
           || metric.response_code == 208
           || metric.response_code == 226
           || metric.response_code == 304
           || metric.response_code == 429)
      ; ident }
    | ratio
    
  10. Provision a Cloud Run service that randomly fails and returns a 500 code. The Cloud Code sample Python project (with Flask, not Django) can easily be modified to offer this “feature”, for example by returning an error 25% of the time based on a random number generator:

    import os
    from random import randint
    from flask import Flask, abort  # abort() raises the HTTP 500 below
    [...]
    @app.route('/')
    def hello():
        """Return either a friendly HTTP greeting or a 5xx error."""
        return_an_error = (randint(1, 4) == 1)
        if return_an_error:
            abort(500)
    
        message = "It's running!"
        [...]

    Note that you might have to override another Org Policy in Argolis to allow unauthenticated access to this Cloud Run service. See Argolis Troubleshooting Tips for details. Alternatively, configure the service to require authentication and call it from the command line as an authenticated user:

    $ curl -H "Authorization: Bearer $(gcloud auth print-identity-token)" https://cloud-run-randomly-fails-4zbr2zmcxq-od.a.run.app
    <!doctype html>
    <html lang=en>
    <title>500 Internal Server Error</title>
    <h1>Internal Server Error</h1>
    <p>The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.</p>
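The `randint(1, 4) == 1` check above fails roughly one request in four. An empirical sanity check of that rate, with a fixed seed so the result is reproducible:

```python
from random import Random

# Simulate 100,000 requests and count how many would have been
# aborted with a 500 by the randint(1, 4) == 1 check.
rng = Random(42)
errors = sum(1 for _ in range(100_000) if rng.randint(1, 4) == 1)
print(f"simulated error rate: {errors / 100_000:.1%}")  # close to 25%
```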
  11. Retrieve the service URL with:

    export CLOUD_RUN_SERVICE_URL=$(gcloud run services list --filter="cloud-run-randomly-fails" --format="value(URL)")
  12. Generate traffic manually, for example with:

    TOKEN=$(gcloud auth print-access-token)      # OAuth access token, for the Monitoring API call in the next step
    ID_TOKEN=$(gcloud auth print-identity-token) # Cloud Run expects an OIDC identity token, not an access token
    for i in {1..10}; do curl -H "Authorization: Bearer ${ID_TOKEN}" ${CLOUD_RUN_SERVICE_URL}; done
  13. Query the API using MQL with:

    cat > query.json << EOF
    {
        "query": "fetch cloud_run_revision | metric 'run.googleapis.com/request_count' | { filter metric.response_code_class == '2xx' ; ident } | ratio | group_by [] | within 3600s | every 3600s"
    }
    EOF
    curl -d @query.json \
        -H "Authorization: Bearer ${TOKEN}" \
        --header "Content-Type: application/json" \
        -X POST \
        https://monitoring.googleapis.com/v3/projects/${PROJECT_ID}/timeSeries:query
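The response is a QueryTimeSeriesResponse JSON document; the computed ratio lands under `timeSeriesData[].pointData[].values[].doubleValue`. The fragment below is a hypothetical response, trimmed of the descriptor and timestamp fields a real payload also carries:

```python
import json

# Hypothetical, trimmed timeSeries:query response. A real payload also
# includes timeSeriesDescriptor metadata and point time intervals.
payload = """
{
  "timeSeriesData": [
    {"pointData": [{"values": [{"doubleValue": 0.74}]}]}
  ]
}
"""

response = json.loads(payload)
ratio = response["timeSeriesData"][0]["pointData"][0]["values"][0]["doubleValue"]
print(f"2xx ratio over the last hour: {ratio:.0%}")
```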
  14. Generate traffic and historical data every minute for later reuse with a Cloud Scheduler Job:

    gcloud projects add-iam-policy-binding ${PROJECT_ID} \
        --member serviceAccount:${PROJECT_NUMBER}-compute@developer.gserviceaccount.com \
        --role roles/run.invoker
    gcloud scheduler jobs create http cloud-run-randomly-fails-load-tester \
        --schedule "* * * * *" \
        --uri "${CLOUD_RUN_SERVICE_URL}" \
        --http-method GET \
        --oidc-service-account-email ${PROJECT_NUMBER}-compute@developer.gserviceaccount.com \
        --location europe-west1

TODO

  • Use a dedicated project for the Cloud Operations Sandbox, so it is easier to provision and tear down? Or let the Cloud Operations Sandbox CLI create one under the admin@lvaylet.altostrat.com identity to avoid billing account issues? Then keep slo-generator-demo lean, with just the randomly failing Cloud Run service and a load generator (a Cloud Scheduler job). Here the sandbox had to be destroyed with:

    laurent@cloudshell:~ (slo-generator-demo)$ cd cloud-ops-sandbox/terraform/
    laurent@cloudshell:~ (slo-generator-demo)$ project_id=$(gcloud config get-value project)
    laurent@cloudshell:~ (slo-generator-demo)$ bucket_name="${project_id}-bucket"
    laurent@cloudshell:~ (slo-generator-demo)$ terraform init -backend-config "bucket=${bucket_name}"
    laurent@cloudshell:~ (slo-generator-demo)$ terraform destroy -var="project_id=${project_id}" -var="bucket_name=${bucket_name}"

    And no file from the original repo was modified:

    laurent@cloudshell:~/cloud-ops-sandbox (slo-generator-demo)$ git status
    On branch main
    Your branch is up to date with 'origin/main'.
    
    nothing to commit, working tree clean