@unders
Last active May 30, 2019 05:31

Challenges

Metrics

It goes like this: “Once upon a time…something bad happened. The end.” How do you like this story?

Metrics, or stats, are numerical measures recorded by the application, such as counters, gauges, or timers. Metrics are very cheap to collect, since numeric values can be easily aggregated to reduce the overhead of transmitting that data to the monitoring system. They are also fairly accurate, which is why they are very useful for the actual monitoring (as the dictionary defines it) and alerting.

Yet the same capacity for aggregation is what makes metrics ill-suited for explaining the pathological behavior of the application. By aggregating data, we are throwing away all the context we had about the individual transactions.
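
To make the loss of context concrete, here is a minimal sketch in Python. The `Metrics` class, the metric name `checkout.latency`, and the request ids are all hypothetical, invented for illustration; real metrics libraries differ in detail. It shows how a counter and a timer fold four requests into two numbers.

```python
from collections import defaultdict


class Metrics:
    """Hypothetical in-process registry: one counter and one timer per name."""

    def __init__(self):
        self.counts = defaultdict(int)
        self.total_ms = defaultdict(float)

    def record(self, name, duration_ms):
        # Only the aggregate is updated; nothing about the request is kept.
        self.counts[name] += 1
        self.total_ms[name] += duration_ms

    def snapshot(self):
        # This is all that would be sent to the monitoring system.
        return {
            name: {
                "count": self.counts[name],
                "mean_ms": self.total_ms[name] / self.counts[name],
            }
            for name in self.counts
        }


# Made-up request ids and latencies; req-3 is the pathological one.
requests = [("req-1", 12.0), ("req-2", 11.0), ("req-3", 950.0), ("req-4", 13.0)]

metrics = Metrics()
for request_id, duration_ms in requests:
    metrics.record("checkout.latency", duration_ms)

summary = metrics.snapshot()
print(summary)
```

The snapshot is small and cheap to ship, and the inflated mean is enough to raise an alert. It cannot say that `req-3` was the slow request, because the request id was never part of what got recorded.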

Logs

To reconstruct the flight of a request from the many log streams, we need powerful log aggregation technology and a distributed context propagation capability that tags the logs in every process with a unique request id, which we can then use to stitch those logs together. We might as well be using real distributed tracing infrastructure at this point! Yet even after tagging the logs with a unique request id, we still cannot assemble them into an accurate sequence, because timestamps from different servers are generally not comparable due to clock skew.
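
A small Python sketch of that stitching problem, under stated assumptions: the log records, server names, request ids, and the 50 ms skew are all invented for illustration, and `stitch` is a hypothetical helper, not part of any logging library.

```python
from collections import defaultdict

# Hypothetical log records: (server, local_timestamp_ms, request_id, message).
# Assumption: server B's clock runs about 50 ms behind server A's.
logs = [
    ("A", 1000, "req-7", "A: received request"),
    ("A", 1005, "req-7", "A: calling B"),
    ("B", 960, "req-7", "B: handling call from A"),
    ("B", 970, "req-7", "B: sending response"),
    ("A", 1030, "req-7", "A: got response from B"),
    ("A", 1002, "req-8", "A: received request"),
]


def stitch(records):
    """Group log lines by request id, then order each group by timestamp."""
    by_request = defaultdict(list)
    for server, ts, request_id, message in records:
        by_request[request_id].append((ts, message))
    return {
        request_id: [message for _, message in sorted(entries)]
        for request_id, entries in by_request.items()
    }


stitched = stitch(logs)
for message in stitched["req-7"]:
    print(message)
```

The request id is enough to pull the lines for `req-7` out of the mixed stream. Sorting them by timestamp, however, places both of B's lines before A has even received the request, because each timestamp comes from a different clock.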
