About MetaBrainz system administration (December 2021)

System administration has always been handled by a small subset of the MetaBrainz (MeB) team. Since zas joined as System Administrator, there is at least one person dedicated to this topic, but it is a cross-cutting concern that is not limited to the infrastructure. With the growing number of projects, the workload needs to be shared better among MeB team members. The current monitoring systems are somewhat of a black box to most of the team, partly because the monitoring software was not designed to be operated by a team. Other upcoming challenges are deployment flexibility and security.

Monitoring systems

General architecture of monitoring systems

A monitoring system will usually:

  1. Collect raw data such as CPU load, disk usage, message queue size, HTTP return codes, and so on (a.k.a. exporting metrics: gauges and counters)
  2. Process (and store) data to make it meaningful
  3. Draw relevant graphs for visual analysis
  4. Throw relevant alerts from automated analysis

Software will usually cover one or more of these tasks. The data processing task (#2) is only partially covered by most available software, probably because it is by far the most difficult point to address.
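
To make the first task concrete, here is a minimal sketch of a standalone exporter using the Python prometheus_client library as one concrete example; the metric names and port are illustrative, not part of the MeB setup:

    # Minimal exporter sketch: publishes one counter and one gauge on an
    # HTTP /metrics endpoint that a collector can scrape.
    import shutil
    import time

    from prometheus_client import Counter, Gauge, start_http_server

    # Counter: monotonically increasing value (e.g. processed jobs).
    jobs_processed_total = Counter("jobs_processed_total", "Jobs processed")
    # Gauge: value that can go up and down (e.g. free disk space).
    disk_free_bytes = Gauge("disk_free_bytes", "Free bytes on /")

    if __name__ == "__main__":
        start_http_server(9100)  # serve /metrics on port 9100
        while True:
            disk_free_bytes.set(shutil.disk_usage("/").free)
            jobs_processed_total.inc()  # stand-in for real work
            time.sleep(15)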

Current monitoring systems at MeB

There are currently three parallel ways to collect data, process data, and throw alerts:

  • Nagios (historic): low-level metrics about hosts and gateways
  • Telegraf + InfluxDB (main): most of the above, plus applications
  • Consul + Prometheus (experimental): hosts only for now

For the latter two, graphs are built and displayed (3) through Grafana, which also processes some data (2) and throws some alerts (4):

  • stats.mb.o for Telegraf+InfluxDB data
  • promgraf.mb.o for Prometheus data (experimental)

Alerts are conveyed through Telegram channels:

  • one channel for Nagios,
  • one channel for InfluxDB/Grafana.

It is clearly a transitional situation. Nagios is still used because it just works: it is very reliable and has almost never needed manual intervention. InfluxDB is not as reliable as Nagios but is more versatile, even though it is far from perfect. Prometheus is very promising but is still at an early experimentation stage in the MeB infrastructure. A positive side effect is that these systems are (partially) redundant: if InfluxDB failed, Nagios would still work.

Current issues with monitoring at MeB

Data processing happens in many different places (Nagios, Telegraf, InfluxDB, Prometheus, Grafana), which makes it difficult to maintain:

  • Each software has its own logic and language to process data;
  • Each software has its own storage backend for configuration.

Grafana’s configuration is stored in JSON, but it only keeps history per dashboard, not globally; therefore it would not be reasonable to share administration access with the whole MeB team. Current roles don’t allow for fine-grained permissions either.

All alerts are conveyed through Telegram channels split per monitoring system, whereas they should be conveyed through topical channels (general infrastructure, MB, LB, BB, AB, and so on).

Configuration has to be updated for each host; it is semi-automated using Fabric for Telegraf but still requires a lot of manual work when deploying a new server.
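
As an illustration of that kind of per-host semi-automation, here is a hypothetical Fabric (Python) sketch pushing a Telegraf configuration to a list of hosts; host names and paths are made up, and this is not the actual MeB tooling:

    # Hypothetical sketch only: push a Telegraf config to each host and
    # restart the service, using the Fabric 2.x API.
    from fabric import Connection

    HOSTS = ["host1.example.org", "host2.example.org"]

    def deploy_telegraf_config(local_conf="telegraf.conf"):
        for host in HOSTS:
            conn = Connection(host)
            conn.put(local_conf, "/tmp/telegraf.conf")
            # Assumes passwordless sudo on the target hosts.
            conn.sudo("mv /tmp/telegraf.conf /etc/telegraf/telegraf.conf")
            conn.sudo("systemctl restart telegraf")

    if __name__ == "__main__":
        deploy_telegraf_config()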

More generally, for alerts to be reliable, server applications need to reliably report issues. This most often goes unnoticed in development setups as it doesn’t impair the application itself. However, it may affect the deployment environment, whose expectations are not always known to the developers, or simply throw false alerts. A recent example is the MB server returning inappropriate 5xx HTTP codes.

Pertinence of Prometheus

Prometheus development is supported by the same large and healthy community that originated Kubernetes. Much of the major software we are already using (HAProxy, Docker) has started to support Prometheus. Its exposition format is the basis of OpenMetrics, the de facto standard for metrics.

Its data processing language, PromQL, is far better designed than the languages of the other monitoring software in place, according to zas, who dealt with (and is still dealing with) all of these.
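
For a taste of PromQL, here is a sketch querying a hypothetical Prometheus endpoint over its HTTP API for the ratio of 5xx responses per service over the last 5 minutes; the server URL, metric, and label names are assumptions for illustration:

    # Illustration only: evaluate a PromQL expression through the
    # standard /api/v1/query HTTP endpoint of Prometheus.
    import requests

    PROMETHEUS = "http://prometheus.example.org:9090"

    # Share of 5xx responses per service over the last 5 minutes.
    query = (
        'sum by (service) (rate(http_requests_total{code=~"5.."}[5m]))'
        " / sum by (service) (rate(http_requests_total[5m]))"
    )

    resp = requests.get(PROMETHEUS + "/api/v1/query", params={"query": query})
    for result in resp.json()["data"]["result"]:
        print(result["metric"].get("service"), result["value"][1])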

Its configuration is stored in files, so it can be maintained collaboratively through a git repository, as the MeB team is used to (with per-rule history, global history, and a pull request workflow). It can also be deployed through Consul for convenience.

Alerts could be sent to different channels (topical channels on Telegram, email…) by combining tags with Prometheus Alertmanager.

The previous systems are configured per host, and that is still how Prometheus is set up, by deploying Node Exporter on each host. But implementing a Prometheus /metrics endpoint in MeB applications would also allow configuring it per application, which would help with deploying any application on any server at any time.
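
A minimal sketch of what such an application-level endpoint could look like, assuming a Flask application and the Python prometheus_client library (route and metric names are hypothetical, not an existing MeB implementation):

    # Hypothetical sketch: expose /metrics from within a web application
    # so Prometheus can scrape per-application metrics.
    from flask import Flask
    from prometheus_client import CONTENT_TYPE_LATEST, Counter, generate_latest

    app = Flask(__name__)
    requests_total = Counter(
        "app_http_requests_total",
        "HTTP requests handled by the application",
        ["endpoint", "code"],
    )

    @app.route("/ping")
    def ping():
        requests_total.labels(endpoint="/ping", code="200").inc()
        return "pong"

    @app.route("/metrics")
    def metrics():
        # Prometheus scrapes this endpoint at each configured interval.
        return generate_latest(), 200, {"Content-Type": CONTENT_TYPE_LATEST}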

Finally, it supports throwing predictive alerts based on curve analysis rather than waiting (often too late) for thresholds to be reached.
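
For instance, the following PromQL expression (shown as a Python string; the metric name is the usual node_exporter one, used purely as an illustration) would match when the root filesystem is predicted to be full within 4 hours, based on the last 6 hours of samples:

    # Illustration of a predictive PromQL expression, as could be used
    # in an alerting rule evaluated by Prometheus.
    PREDICT_DISK_FULL = (
        'predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h],'
        " 4 * 3600) < 0"
    )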

Deployment flexibility

Servers used to be physical hosts running Unix services. They are gradually being replaced with virtual hosts running Docker. Ultimately, it should be possible to quickly move applications to other hosts depending on host availability and application needs. This is the general direction to follow when making server applications/Docker services easier to deploy, operate, and monitor.

Availability

Longer-term consideration (Summit 22?): Have a general status page. There are many prerequisites to it, including having a proper monitoring system.

Security

Longer-term consideration (Summit 22?): Use Vault to manage secrets, which are currently spread among various inappropriate places.
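
As a rough sketch of what that could look like from an application’s point of view, using the hvac Python client for Vault (the address, token handling, and secret path are made up):

    # Hypothetical sketch: read a database password from Vault instead of
    # keeping it in a configuration file.
    import os

    import hvac

    client = hvac.Client(
        url="https://vault.example.org:8200",
        token=os.environ["VAULT_TOKEN"],
    )
    secret = client.secrets.kv.v2.read_secret_version(path="someapp/db")
    db_password = secret["data"]["data"]["password"]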

Suggested plan of action

Immediately actionable tasks

  • List procedures and requirements for deploying/operating/monitoring services in general
  • Document these procedures (e.g. stopping a service) for each service/project

Mid-term tasks

  • Document network architecture and deployment platform
  • Implement all of these requirements (e.g. image tags) for the main services/projects
  • Normalize services which currently are out of sight (e.g. CAA helper on purple)

Longer-term tasks

  • Migrate from InfluxDB to Prometheus
  • Implement Prometheus /metrics endpoint for each server application:
    • For system administration at first;
    • For the application itself (e.g. usage statistics) eventually.