Metrics naming & standards

Metrics are a contract

An application's metric names and types make up an implicit contract; the consequences of ignoring it can be serious.

Humans depend on Metrics

Metrics are used by system admins to configure monitoring systems. Good metric names accurately and consistently convey the meaning of the associated metric data. Admins should not have to read the application code to understand what each metric represents; the name of a metric is often the only documentation available. Critical human decisions may have to be made quickly based on metrics, and in those situations their names should be as helpful and trustworthy as possible.

Machines depend on Metrics

Dashboards and alerting systems depend entirely on the metrics applications provide them. Changing the identifiers or meaning of metrics will break these downstream applications, negating the very reason to emit metrics in the first place:

  • Alarms that shouldn't have gone off will go off
  • Alarms that should have gone off will remain silent
  • Conditions will be ignored or overlooked because of unwarranted prior or current alarm noise!

In fact, even altering the code around metrics may break downstream expectations (e.g. alerting thresholds calibrated against current behavior).

Metrics need conventions!

Because metrics are non-business, output-only data, they are often an afterthought in application programming. But because of their importance to the continuing success of a project, much care should be taken when instrumenting an application, whether existing or new. Following a few simple conventions will go a long way toward making metrics monitoring easier, more reliable and more efficient.


Naming Individual Metrics

Every new metric in an application should adhere to the PUMAC criteria:

  • Permanent: Renaming metrics is painful and dangerous. You will not do it, or you will suffer the consequences. Once the app starts emitting a metric (even in QA), its name effectively becomes permanent. Think early, think twice!

  • Universal: Metric names are made up of words separated by underscores. Words use only lowercase letters and numbers (7-bit ASCII). Metric names should start with a letter. Any other punctuation or symbols are proscribed, as they might have special meaning in a downstream system (a validation sketch follows this list).

  • Meaningful: Every metric name should start with its subject. The subject can be made of many words. The subject should be consistent with domain terminology, so that analysts and admins speak the same language.

  • Accurate: Because metrics may propagate down through several subsystems (aggregation -> monitoring -> alerting -> paging) using different scales, names should remain explicit about units of measure. One system's time convention may not match the next's.

  • Concise: Keep it short, but readable. Prefer shorter words to truncation. No acronyms except for measurement units (e.g. ms for milliseconds) or industry-standard terms (e.g. tp99). If bandwidth is tight, prefer sending metrics a little less frequently (using aggregation or sampling) to compromising on naming conventions.
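The Universal rule is mechanical enough to enforce in code. Below is a minimal Python sketch of such a check; the helper name is hypothetical, but the regular expression encodes the rule as stated above:

```python
import re

# One or more lowercase alphanumeric words joined by underscores,
# starting with a letter, per the Universal rule.
METRIC_NAME = re.compile(r"[a-z][a-z0-9]*(_[a-z0-9]+)*")

def is_valid_metric_name(name: str) -> bool:
    return METRIC_NAME.fullmatch(name) is not None

assert is_valid_metric_name("response_bytes_sent")
assert not is_valid_metric_name("Response.BytesSent")  # mixed case, dot
assert not is_valid_metric_name("2xx_responses")       # starts with a digit
```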

Regular counters and markers

By convention, the subject should be pluralized for counter and marker metrics, as they are their own unit of measurement. A counter's subject should generally be domain-related rather than "mechanical", unless it is defined in a low-level subsystem where no application-level subjects are identifiable.

A qualifier should be appended to the subject to put it in context. The qualifier should be a single-word adjective. It is suggested to qualify the subject even if it is used in only a single metric, to prevent any confusion arising from later adding a second metric for the same subject.

  • passengers_boarded
  • train_cars_attached
  • db_records_committed
  • response_bytes_sent
  • bytes_returned
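As a quick illustration, here is a minimal Python sketch emitting counters under these names; the in-process registry is a stand-in for a real backend such as StatsD or Prometheus:

```python
from collections import Counter

# In-process registry standing in for a real metrics backend.
counters: Counter = Counter()

def count(name: str, value: int = 1) -> None:
    counters[name] += value

# Pluralized, domain-level subjects with a single-word qualifier:
count("passengers_boarded")
count("train_cars_attached", 3)
count("response_bytes_sent", 2048)
```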

Error markers

Error metrics are a special case of the counter naming rules. Because of their importance in monitoring, they should be easy to identify and thus carry an additional constraint. Error metrics should be described either as plural_subjects followed by the "_failed" qualifier, or as first-tier pluralized subjects ending in "_errors".

  • db_commit_errors
  • user_auths_failed

It is suggested to use the _errors notation for internal server errors (db timeouts, resource exhaustion, etc.) and _failed for client errors (failed authentication, bad requests, etc.) as a hint to the source of the problem.
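A hypothetical sketch of how this hint plays out at instrumentation time, with a stubbed count() helper standing in for a real backend:

```python
def count(name: str, value: int = 1) -> None:
    print(f"count {name} += {value}")  # stand-in for a real metrics backend

# Internal faults end in "_errors"; client-caused failures end in "_failed".
def on_commit_timeout() -> None:
    count("db_commit_errors")     # server-side problem

def on_bad_credentials() -> None:
    count("user_auths_failed")    # client-side problem
```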

Timers

Timers should use subject (singular) + verb + abbreviated time unit (us, ms, s). A timer's subject will often be more "mechanical" in nature than a counter's, especially when measuring I/O operations.

  • transaction_commit_ms
  • http_request_process_ms
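A minimal Python sketch of a timer emitting under such a name; the decorator and emitter are hypothetical, but the name carries its unit as the convention requires:

```python
import time
from functools import wraps

def timed_ms(name: str):
    # Decorator sketch: emits elapsed wall time in milliseconds under `name`,
    # which spells out its unit (ms) per the timer convention.
    def wrap(fn):
        @wraps(fn)
        def inner(*args, **kwargs):
            start = time.monotonic()
            try:
                return fn(*args, **kwargs)
            finally:
                elapsed_ms = (time.monotonic() - start) * 1000.0
                print(f"timer {name}={elapsed_ms:.1f}")  # stand-in emitter
        return inner
    return wrap

@timed_ms("transaction_commit_ms")
def commit_transaction():
    time.sleep(0.01)  # placeholder work

commit_transaction()
```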

Gauges

Contrary to counters, gauges are generally measured outside of any application operation context (instant measurement = context-less) and may thus omit a qualifier. But as gauge subjects often have multiple measurable dimensions (e.g. memory: free vs. used), the metric subject should be specific about which dimension and unit are measured.

  • system_memory_free_bytes
  • pooled_db_connections_in_use
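A sketch of gauge sampling under such a name (Python; the portability caveats are noted in the comments):

```python
import os

# Gauges are sampled on a schedule, outside any request context; each name
# spells out the dimension (free) and unit (bytes). POSIX-only sketch:
# os.sysconf is unavailable on Windows, and SC_AVPHYS_PAGES is Linux-specific.
def sample_gauges() -> dict:
    page_size = os.sysconf("SC_PAGE_SIZE")
    free_pages = os.sysconf("SC_AVPHYS_PAGES")
    return {"system_memory_free_bytes": page_size * free_pages}
```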

Metric Namespaces

Large metric systems will cover many or all aspects of an organization so that data can be correlated. Deep hierarchies evolve from the merging of these metric data sources. The precedence of each namespace component does not matter much, but their relative ordering should be fixed across the whole organization (a sketch assembling a fully qualified name follows the component list below).

Metric type

Graphite + StatsD split metrics between counters and timers at the very top of the namespace. This is rather strange: one would expect metrics from the same application to be siblings. Then again, maybe not, if you never correlate action time with item count.

Department

Accounting, engineering, etc.

Environment

Name of the environment (state) shared by multiple applications. The production environment should have a reserved name (e.g. "prod") that is used consistently across the org. If an application bridges multiple environments (e.g. tests feeding off actual production data for realism): if the app consumes from prod, the name of the environment farthest from production is used; if the app produces prod data from another environment (staging, planning), the app should be labelled as prod.

Software System

The software system's name (not the application's) as it is known by technical personnel. Systems are composed of multiple applications.

Geographical location

The host's geographical location. Can be a city name, an airport code, a postal code, a data center id, a cloud region...

Host name

With hosts auto-provisioned in the cloud, host names mean less than they used to, but they can still be useful.

Application

Actual name of the program being executed.

PID

Process ID of the originating application. Can be useful if multiple instances of the same app are running on the same host.

Application role

If an application can be run in multiple modes, the role identifies the mode this instance is running in.

Module / Subsystem

Large applications may be composed of multiple subsystems. In smaller apps, this can be used as the metric subject.
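Putting the components together, here is a hypothetical Python sketch assembling a fully qualified metric path. Every component value is made up for illustration; the separator depends on the downstream system (dots shown, Graphite style):

```python
import os
import socket

# The fixed ordering of components is what matters, not the values.
# Dots inside a value (e.g. an FQDN hostname) would need escaping.
def qualified_name(subject: str) -> str:
    parts = [
        "engineering",                       # department
        "prod",                              # environment
        "ticketing",                         # software system
        "yul",                               # geographical location (airport code)
        socket.gethostname().split(".")[0],  # host name (short form)
        "booking_api",                       # application
        str(os.getpid()),                    # PID
        "worker",                            # application role
        "checkout",                          # module / subsystem
        subject,
    ]
    return ".".join(parts)

print(qualified_name("passengers_boarded"))
```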
