Logs vs Metrics and implementations
In working out my thoughts, I am borrowing from several sources, notably:
Monitoring means knowing what’s going on inside your system: how much traffic it’s getting, how it’s performing, how many errors there are. This is not the end goal, though, merely a means. Our goal is to be able to detect, debug and resolve any problems that occur, and monitoring is an integral part of that process.
There are two broad approaches to collecting monitoring data: logging, as exemplified by Elasticsearch as part of the ELK stack (Elasticsearch, Logstash and Kibana), and metrics, as exemplified by the TICK stack (Telegraf, InfluxDB, Chronograf / Grafana, Kapacitor).
Log messages are notifications about events as they pertain to a specific transaction. Metrics are notifications that an event occurred, without any ties to a transaction.
OK, so what’s the difference? Putting on my Operations hat again, metrics can be far smaller because they convey considerably less information. They’re also much easier to evaluate. Both points affect how we store, process and retain metrics.
A log file, however, gives you details on a transaction, which may let you tell a more complete story for a given event. In aggregate, the transactional nature of log messages gives you much more flexibility in surfacing information (not just data) about the business.
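To make the distinction concrete, here is a minimal sketch of the same three events recorded both ways. The event fields (`txn_id`, `path`, `status`, `ms`) and the metric names are made up for illustration; no particular logging or metrics library is implied.

```python
import json
from collections import Counter

# Hypothetical request events; the field names are illustrative only.
events = [
    {"txn_id": "a1", "path": "/checkout", "status": 200, "ms": 120},
    {"txn_id": "b2", "path": "/checkout", "status": 500, "ms": 340},
    {"txn_id": "c3", "path": "/search", "status": 200, "ms": 45},
]


def to_log_line(event):
    """A log message keeps the whole event, including the transaction id."""
    return json.dumps(event, sort_keys=True)


def to_metrics(all_events):
    """Metrics only record that something happened; the transaction is gone."""
    counts = Counter()
    for event in all_events:
        counts["requests_total"] += 1
        if event["status"] >= 500:
            counts["errors_total"] += 1
    return dict(counts)


log_lines = [to_log_line(e) for e in events]
metrics = to_metrics(events)
```

The log lines grow with every request and can still answer "what happened to transaction b2?". The metrics stay the same size no matter how many requests arrive, but they can only answer "how many?".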
Logging has other business purposes beyond monitoring, which are not relevant to my analysis here.
Both logs and metrics need to be collected, and there's a variety of ways to collect them.
ELK Stack (or ELKK, EFKK)
Summary: ELK is a popular open source application stack for visualizing and analyzing logs.
- Elasticsearch: Distributed real-time search and analytics engine.
- Logstash: Collects and parses all data sources into an easy-to-read JSON format (Fluentd, the F in EFKK, is a common alternative)
- Kibana: Elasticsearch data visualization engine
- Kafka: Data transport, queue, buffer, and short term storage
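The "parse into JSON" step is the part of this stack that is easiest to picture in code. Below is a rough sketch of the idea in plain Python, not Logstash or Fluentd configuration. The log layout and the field names are assumptions chosen for the example.

```python
import json
import re

# Hypothetical access-log layout: <client> [<timestamp>] "<method> <path>" <status>
LINE_RE = re.compile(
    r'(?P<client>\S+) \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+)" (?P<status>\d{3})'
)


def parse_line(line):
    """Turn one raw log line into a dict, or None if it does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    doc = match.groupdict()
    doc["status"] = int(doc["status"])  # numeric fields become searchable as numbers
    return doc


raw = '10.0.0.7 [2017-03-01T12:00:00Z] "GET /checkout" 500'
document = parse_line(raw)
as_json = json.dumps(document, sort_keys=True)
```

Once every line is a structured document like this, a search engine can index the individual fields instead of treating the line as one opaque string.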
TICK Stack (or TIGK)
Summary: Solution for collecting, storing, visualizing and alerting on time-series data at scale. All components of the platform are designed to work together seamlessly.
- Telegraf: Collects time-series data from a variety of sources
- InfluxDB: Eventually consistent time-series database
- Chronograf: Visualization and graphing; sometimes replaced with Grafana
- Kapacitor: Alerting, ETL and anomaly detection on time-series data
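For a feel of what actually travels through this stack, here is a small sketch that formats a data point in InfluxDB's line protocol (measurement, tags, fields, nanosecond timestamp). The helper names are my own, the measurement and tag names are invented, and escaping of spaces and commas is deliberately left out, so treat it as an illustration and not a client library.

```python
def format_field(value):
    """Format one field value: integers get an 'i' suffix, strings are quoted."""
    if isinstance(value, bool):  # check bool first, since bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        return repr(value)
    return '"' + str(value).replace('"', '\\"') + '"'


def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Build one line: measurement,tag=value field=value timestamp.

    Assumes names and tag values contain no spaces, commas or equals signs.
    """
    tag_part = "".join(f",{k}={v}" for k, v in sorted(tags.items()))
    field_part = ",".join(
        f"{k}={format_field(v)}" for k, v in sorted(fields.items())
    )
    return f"{measurement}{tag_part} {field_part} {timestamp_ns}"


point = to_line_protocol(
    "cpu",
    {"host": "web01", "region": "us-east"},  # hypothetical tags
    {"usage_idle": 92.5, "cores": 4},        # hypothetical fields
    1488369600000000000,
)
```

Note that there is no transaction id anywhere in the point: tags identify the series, fields carry the numbers, and that is all.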
Metrics old school stack
Summary: Well understood, established ecosystem.
- Metrics Gatherer - (statsd, collectd, dropwizard metrics)
- Listener (Carbon)
- Storage Database (Whisper or InfluxDB)
- Visualizer (Grafana, Graphite-Web)
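The wire formats in this stack are simple enough to write by hand. The sketch below builds a statsd counter and timer line and a Carbon plaintext line; the metric names are hypothetical, and `send_udp` is a helper I made up that assumes a statsd daemon listening on the default UDP port 8125.

```python
import socket
import time


def statsd_counter(name, value=1):
    """statsd counter: <name>:<value>|c"""
    return f"{name}:{value}|c"


def statsd_timer(name, millis):
    """statsd timer: <name>:<milliseconds>|ms"""
    return f"{name}:{millis}|ms"


def carbon_line(path, value, timestamp=None):
    """Carbon plaintext protocol: <path> <value> <unix seconds>, newline-terminated."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return f"{path} {value} {ts}\n"


def send_udp(payload, host="127.0.0.1", port=8125):
    """Fire-and-forget push to a statsd daemon (host and port are assumptions)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload.encode(), (host, port))


counter = statsd_counter("checkout.errors")
timer = statsd_timer("checkout.latency", 340)
stored = carbon_line("web01.checkout.errors", 3, 1488369600)
```

The application pushes tiny UDP packets and moves on; if nothing is listening, the packet is simply lost, which is part of why this model is cheap for the sender.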
Summary: Pull-based metrics model
- Pushgateway (a Prometheus component): lets ephemeral or batch jobs push their metrics so they can still be scraped
- uh...? I'm not well versed
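As far as I understand the pull model, the application only keeps its current values in memory and exposes them over HTTP, and the collector decides when to fetch them. Here is a minimal sketch using only the standard library; the class and function names are my own, the metric name is hypothetical, and the text layout imitates the plain-text exposition style used by Prometheus without covering labels or other metric types.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class Counter:
    """A monotonically increasing value held in the application's memory."""

    def __init__(self, name):
        self.name = name
        self.value = 0

    def inc(self, amount=1):
        self.value += amount


def render(counters):
    """Render the current values as exposition-style text."""
    lines = []
    for c in counters:
        lines.append(f"# TYPE {c.name} counter")
        lines.append(f"{c.name} {c.value}")
    return "\n".join(lines) + "\n"


def parse_samples(text):
    """Collector side: turn scraped text back into {name: value}."""
    samples = {}
    for line in text.splitlines():
        if line and not line.startswith("#"):
            name, value = line.rsplit(" ", 1)
            samples[name] = float(value)
    return samples


def start_exporter(counters, port=0):
    """Serve the current values over HTTP in a background thread."""

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            body = render(counters).encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, *args):
            pass  # keep the example quiet

    server = HTTPServer(("127.0.0.1", port), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def scrape(url):
    """The pull itself: the collector fetches whatever the target reports now."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return parse_samples(response.read().decode())
```

A collector would call `scrape("http://127.0.0.1:<port>/metrics")` on an interval against the server returned by `start_exporter`. A batch job that exits before the next scrape never gets asked, which is the gap a push gateway is meant to fill.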