
Application Telemetry in Go Services

Philosophy

The only way to understand the code you've written and its impact on business value is to measure the usage and performance of that code. Please take a few minutes and read Coda Hale’s slides from "Metrics, Metrics Everywhere".

Application metrics come in these basic flavors:

  • Histograms
    • a sampled value with derived statistics (min, max, mean, percentiles)
  • Meters
    • count events
    • aggregated into rates
    • moving averages provide recency
  • Timers
    • A histogram of durations and a meter of calls
  • Counters
    • stateful
    • like number of open connections, or total bytes processed
    • things that you want to increment or decrement over time
  • Gauges
    • stateless
    • instant value of something like "healthy" or "unhealthy"

Tools / Libraries

  1. Go Metrics
  2. InfluxDB Reporter for Go Metrics

Service Instrumentation BKM

General Guidance

  1. Always emit a consistent set of metrics - typically taken care of by framework.
  2. Collect a timer for every IO operation. To be clear, this does not mean "collect lots of timers"; it means collect a timer for every IO operation. That includes disk reads, calls to service dependencies, calls to databases, calls to caches, log appends (you shouldn't be logging manually), and any other operation that involves IO.
  3. Collect a timer for CPU intensive operations. How much time does your application spend rendering documents? Calculating hashes? (De)serializing JSON documents?
  4. Volumes are important. These could be requests, fatals, errors, specific errors, retries, etc. Use meters for these to understand event frequency and volume.

Metric Classes

There are three classes of metrics that should be collected for any instrumented service: critical metrics, triage metrics, and client metrics.

Critical Metrics

Critical metrics are those that indicate overall service health and performance. They are the minimum set of metrics a service author is expected to publish. While all metrics should be named and tagged using a consistent scheme, critical metrics in particular should be named consistently across all services and carry the same meaning everywhere. Common critical metrics for a service named "Service" with a method named "Method" include:

  • Service.Volume
  • Service.Latency
  • Service.Availability (1-overall fatal rate)
  • Service.Fatal-Rate
  • Service.Error-Rate
  • Service.Method.Volume
  • Service.Method.Latency
  • Service.Method.Availability (1-method fatal rate)
  • Service.Method.Fatal-Rate
  • Service.Method.Error-Rate

With these metrics a service owner can build rich alarms and dashboards that minimize the mean time to resolution of operational events. For example, alarm systems that provide suppression mechanics can suppress alarms for services that share a common dependency on an unhealthy service. Service-specific dashboards can include the critical and triage metrics for the target service alongside the critical metrics for its dependencies.

Triage Metrics

Triage metrics provide insight into the internal operation of the service. These include code path counters, dependency service call retry counters, dependency service call timers, failure cause counters (and rates), and payload size counters (summable with increment-by-X semantics). Consider the following example:

func getDataById(id string) (interface{}, error) {
  if id == "" {
    return nil, fmt.Errorf("id is empty")
  }

  // attempt to retrieve from cache
  cr, err := cache.get(id)
  if err != nil {
    // Don't overload the backend DB if the cache breaks. Avoid cascading failure.
    switch err.(type) {
    case *cache.ConnectionRefused:
      fallthrough
    case *cache.ConnectionTimeout:
      fallthrough
    case *cache.Unauthorized:
      return nil, err
    default:
      return nil, Unmodeled(err)
    }
  }
  if cr != nil {
    return cr, nil
  }

  i := 0
  var r interface{}
  for {
    i++

    // attempt to retrieve from DB
    dr, err := db.get(id)

    // handle failure
    if err != nil {
      switch err.(type) {
      case *db.ConnectionRefused:
        fallthrough
      case *db.ConnectionTimeout:
        fallthrough
      case *db.NoSuchDB:
        fallthrough
      case *db.Unauthorized:
        if i > 3 {
          return nil, err
        }
      default:
        if i > 3 {
          return nil, Unmodeled(err)
        }
      }
      // retry rather than falling through to success handling with a nil dr
      continue
    }

    // handle success
    if dr == nil {
      return nil, nil
    } else {
      r = dr
      break
    }
  }

  // attempt to place in cache
  err = cache.put(id, r)
  if err != nil {
    switch err.(type) {
    case *cache.ConnectionRefused:
      fallthrough
    case *cache.ConnectionTimeout:
      fallthrough
    case *cache.Unauthorized:
      fallthrough
    default:
      // the result is still good; ignore cache write failures
    }
  }

  return r, nil
}

This bit of code should feel familiar to service owners. It attempts to retrieve a value for a given ID. First it attempts to load the value from cache, handling any errors that may occur during that activity. If the value is not in cache it attempts to retrieve the value from the database, again handling any errors. Finally, if a value was retrieved it is cached and returned; otherwise nil is simply returned. There are several code paths, failure modes, and IO operations in this example. Now consider this fully instrumented version:

func getDataById(id string) (interface{}, error) {
  startTime := time.Now()
  totalLatency := metrics.GetOrRegisterTimer("Service.getDataById.Latency", nil)
  // wrap in a closure so time.Since is evaluated when the function returns,
  // not when the defer statement executes
  defer func() { totalLatency.Update(time.Since(startTime)) }()

  cacheGetLatency := metrics.GetOrRegisterTimer("Service.getDataById.CacheRetrieval-Latency", nil)
  dbGetLatency    := metrics.GetOrRegisterTimer("Service.getDataById.DatabaseRetrieval-Latency", nil)
  cachePutLatency := metrics.GetOrRegisterTimer("Service.getDataById.CachePut-Latency", nil)

  metrics.GetOrRegisterMeter("Service.getDataById.Request", nil).Mark(1)

  mFatal := metrics.GetOrRegisterMeter("Service.getDataById.Fatal", nil)
  mError := metrics.GetOrRegisterMeter("Service.getDataById.Error", nil)
  mCached := metrics.GetOrRegisterMeter("Service.getDataById.UsedCachedResult", nil)
  mDB := metrics.GetOrRegisterMeter("Service.getDataById.UsedDatabaseResult", nil)
  mER := metrics.GetOrRegisterMeter("Service.getDataById.UsedEmptyResult", nil)
  
  mCacheGetCR := metrics.GetOrRegisterMeter("Service.getDataById.CacheRetrievalError-ConnectionRefused", nil)
  mCacheGetCT := metrics.GetOrRegisterMeter("Service.getDataById.CacheRetrievalError-ConnectionTimeout", nil)
  mCacheGetUA := metrics.GetOrRegisterMeter("Service.getDataById.CacheRetrievalError-Unauthorized", nil)
  mCacheGetUM := metrics.GetOrRegisterMeter("Service.getDataById.CacheRetrievalError-Unmodeled", nil)
  
  mDBCR := metrics.GetOrRegisterMeter("Service.getDataById.DatabaseError-ConnectionRefused", nil)
  mDBCT := metrics.GetOrRegisterMeter("Service.getDataById.DatabaseError-ConnectionTimeout", nil)
  mDBUA := metrics.GetOrRegisterMeter("Service.getDataById.DatabaseError-Unauthorized", nil)
  mDBUM := metrics.GetOrRegisterMeter("Service.getDataById.DatabaseError-Unmodeled", nil)
  mDBNDB := metrics.GetOrRegisterMeter("Service.getDataById.DatabaseError-NoDatabase", nil)
  
  mCachePutCR := metrics.GetOrRegisterMeter("Service.getDataById.CachePutError-ConnectionRefused", nil)
  mCachePutCT := metrics.GetOrRegisterMeter("Service.getDataById.CachePutError-ConnectionTimeout", nil)
  mCachePutUA := metrics.GetOrRegisterMeter("Service.getDataById.CachePutError-Unauthorized", nil)
  mCachePutUM := metrics.GetOrRegisterMeter("Service.getDataById.CachePutError-Unmodeled", nil)
  
  mDBAttempt := metrics.GetOrRegisterMeter("Service.getDataById.DatabaseRetrieval-Attempt", nil)
  mCachePutAttempt := metrics.GetOrRegisterMeter("Service.getDataById.CachePut-Attempt", nil)
  mUncached := metrics.GetOrRegisterMeter("Service.getDataById.UncachedResult", nil)

  if id == "" {
    mError.Mark(1)
    return nil, fmt.Errorf("id is empty")
  }

  // attempt to retrieve from cache

  it := time.Now()
  cr, err := cache.get(id)
  cacheGetLatency.Update(time.Since(it))
  if err != nil {
    switch err.(type) {
    case *cache.ConnectionRefused:
      mCacheGetCR.Mark(1)
    case *cache.ConnectionTimeout:
      mCacheGetCT.Mark(1)
    case *cache.Unauthorized:
      mCacheGetUA.Mark(1)          
    default:
      mCacheGetUM.Mark(1)
      err = Unmodeled(err)
    } 
    mFatal.Mark(1)
    return nil, err
  }
  if cr != nil {
    mCached.Mark(1)
    return cr, nil
  }

  // enter DB retrieval loop, capture result with r
  i := 0
  var r interface{}
  for {
    mDBAttempt.Mark(1)
    i++

    // attempt to retrieve from DB
    it := time.Now()
    dr, err := db.get(id)
    dbGetLatency.Update(time.Since(it))

    // handle failures
    if err != nil {
      switch err.(type) {
      case *db.ConnectionRefused:
        mDBCR.Mark(1)
      case *db.ConnectionTimeout:
        mDBCT.Mark(1)
      case *db.Unauthorized:
        mDBUA.Mark(1)
      case *db.NoSuchDB:
        mDBNDB.Mark(1)
      default:
        mDBUM.Mark(1)
        err = Unmodeled(err)
      }

      if i > 3 {
        mFatal.Mark(1)
        return nil, err
      }
      // retry rather than falling through to success handling with a nil dr
      continue
    }

    // handle successes
    if dr == nil {
      mER.Mark(1)
      return nil, nil
    } else {
      mDB.Mark(1)
      r = dr
      break
    }
  }

  // attempt to place in cache
  mCachePutAttempt.Mark(1)
  it = time.Now() // assignment, not :=, since it is already declared above
  err = cache.put(id, r)
  cachePutLatency.Update(time.Since(it))
  if err != nil {
    switch err.(type) {
    case *cache.ConnectionRefused:
      mCachePutCR.Mark(1)
    case *cache.ConnectionTimeout:
      mCachePutCT.Mark(1)
    case *cache.Unauthorized:
      mCachePutUA.Mark(1)          
    default:
      mCachePutUM.Mark(1)
    } 
    mUncached.Mark(1)
  }

  return r, nil
}

As the example above shows, a properly instrumented function can be dominated by accounting code. This code should be non-branching, and the best library implementations collect metrics via inlined functions. The overall performance impact should be minimal, as none of these operations involve IO. Where the language does not support inlining, the developer might refactor to collect data in local variables and make a single set of publishing calls just before exiting the function (perhaps from a try-finally block in Java, or a deferred function in Go).

Triage metrics help minimize time-to-root-cause for both operational events and debugging activities. A properly instrumented application will allow developers to triage bugs without ever consulting logs. These are particularly handy during late-night operational events.

Client Metrics

Client metrics present critical metrics from a client perspective and include network failures and latency. These are collected on the client side and are therefore more difficult to implement. Collection is simple if your clients are internal services and you trust them to publish metrics into a shared time-series database. If your clients are third-parties or applications deployed in hostile environments then another collection mechanism might be required.

Metric Taxonomies

Many metric libraries, wire protocols, and time-series databases support metric taxonomies. These might be called labels, tags, or properties. They are commonly used to index time-series data, so be sure to understand the implications and limitations your time-series database places on tag cardinality.

It is a good practice to tag service metrics with environment dimensions such as region, hostname, or software version. Dimensions that scale with your dataset or customer base are typically not appropriate for tags because they do not aggregate well. Those dimensions are usually associated with request tracing or debugging a specific failure, and are better addressed with log analysis. You might encounter a specific segment - for example, VIP customers - that you want to track in aggregate. That is a mild compromise that provides visibility into the experience of a particular user segment.

Publishing

Different metric collection libraries use different mechanisms for publishing those metrics periodically. If you use the go-metrics library you should use the go-metrics-influxdb reporter library. Starting the reporter requires that your service run the following code during startup:

import (
    "time"

    metrics "github.com/rcrowley/go-metrics"
    influxdb "github.com/vrischmann/go-metrics-influxdb"
)

// these parameters should be injected via application configuration
go influxdb.InfluxDB(
    metrics.DefaultRegistry, // metrics registry
    time.Second * 10,        // interval
    "http://metrics:8086",   // the InfluxDB url - we will standardize
    "mydb",                  // your InfluxDB database
    "myuser",                // your InfluxDB user
    "mypassword",            // your InfluxDB password
)