About MetaBrainz system administration (December 2021)

System administration has always been handled by a small subset of the MetaBrainz (MeB) team. Since zas joined as System Administrator, there is at least one person dedicated to this topic, but it is a cross-cutting concern that is not limited to the infrastructure. With the growing number of projects, the workload needs to be shared better among MeB team members. The current monitoring systems are somewhat of a black box to most of the team, partly because the monitoring software was not designed to be operated by a team. Other upcoming challenges are deployment flexibility and security.

Monitoring systems

General architecture of monitoring systems

A monitoring system will usually:

  1. Collect raw data such as CPU load, disk usage, message queue size, HTTP return codes, and so on (a.k.a. exporting metrics: gauges and counters)
  2. Process (and store) data to make it meaningful
  3. Draw relevant graphs for visual analysis
  4. Throw relevant alerts from automated analysis

Software will usually cover one or more of these tasks. The data processing task (#2) is only partially covered by most available software, probably because it is by far the most difficult point to address.
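
To make the first task concrete, here is a minimal sketch of a standalone exporter using the Python prometheus_client library as one concrete example; the metric names and port are illustrative, not part of the MeB setup:

    # Minimal exporter sketch: publishes one counter and one gauge on an
    # HTTP /metrics endpoint that a collector can scrape.
    import shutil
    import time

    from prometheus_client import Counter, Gauge, start_http_server

    # Counter: monotonically increasing value (e.g. processed jobs).
    jobs_processed_total = Counter("jobs_processed_total", "Jobs processed")
    # Gauge: value that can go up and down (e.g. free disk space).
    disk_free_bytes = Gauge("disk_free_bytes", "Free bytes on /")

    if __name__ == "__main__":
        start_http_server(9100)  # serve /metrics on port 9100
        while True:
            disk_free_bytes.set(shutil.disk_usage("/").free)
            jobs_processed_total.inc()  # stand-in for real work
            time.sleep(15)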

Current monitoring systems at MeB

There are currently three parallel ways to collect data, process data, and throw alerts:

  • Nagios (historic): low-level metrics about hosts and gateways
  • Telegraf + InfluxDB (main): most of the above, plus applications
  • Consul + Prometheus (experimental): hosts only for now

For the latter two, graphs are built and displayed (3) through Grafana, which also processes some data (2) and throws some alerts (4):

  • stats.mb.o for Telegraf+InfluxDB data
  • promgraf.mb.o for Prometheus data (experimental)

Alerts are conveyed through Telegram channels:

  • one channel for Nagios,
  • one channel for InfluxDB/Grafana.

It is clearly a transitional situation. Nagios is still used because it just works: it is very reliable and has almost never needed manual intervention. InfluxDB is not as reliable as Nagios but is more versatile, even though it is far from perfect. Prometheus is very promising but is still at an early experimentation stage in the MeB infrastructure. A positive side effect is that these systems are (partially) redundant: if InfluxDB failed, Nagios would still work.

Current issues with monitoring at MeB

Data processing happens in many different places (Nagios, Telegraf, InfluxDB, Prometheus, Grafana), which makes it difficult to maintain:

  • Each software has its own logic and language to process data;
  • Each software has its own storage backend for configuration.

Grafana’s configuration is stored in JSON, but it only keeps history per dashboard, not globally; therefore it would not be reasonable to share administration access with the whole MeB team. Current roles don’t allow for fine-grained permissions either.

All alerts are conveyed through Telegram channels split per monitoring system, whereas they should be conveyed through topical channels (general infrastructure, MB, LB, BB, AB, and so on).

Configuration has to be updated for each host; it is semi-automated using Fabric for Telegraf but still requires a lot of manual work when deploying a new server.
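
As an illustration of that kind of per-host semi-automation, here is a hypothetical Fabric (Python) sketch pushing a Telegraf configuration to a list of hosts; host names and paths are made up, and this is not the actual MeB tooling:

    # Hypothetical sketch only: push a Telegraf config to each host and
    # restart the service, using the Fabric 2.x API.
    from fabric import Connection

    HOSTS = ["host1.example.org", "host2.example.org"]

    def deploy_telegraf_config(local_conf="telegraf.conf"):
        for host in HOSTS:
            conn = Connection(host)
            conn.put(local_conf, "/tmp/telegraf.conf")
            # Assumes passwordless sudo on the target hosts.
            conn.sudo("mv /tmp/telegraf.conf /etc/telegraf/telegraf.conf")
            conn.sudo("systemctl restart telegraf")

    if __name__ == "__main__":
        deploy_telegraf_config()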

More generally, for alerts to be reliable, server applications need to reliably report issues. This most often goes unnoticed in development setups as it doesn’t impair the application itself. However, it may affect the deployment environment, whose expectations are not always known to the developers, or simply throw false alerts. A recent example is the MB server returning inappropriate 5xx HTTP codes.

Pertinence of Prometheus

Prometheus development is supported by the same large and healthy community that originated Kubernetes. Much of the major software we are already using (HAProxy, Docker) has started to support Prometheus. Its exposition format is the basis of OpenMetrics, the de facto standard for metrics.

Its data processing language, PromQL, is far better designed than the languages of the other monitoring software in place, according to zas, who dealt with (and is still dealing with) all of these.
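
For a taste of PromQL, here is a sketch querying a hypothetical Prometheus endpoint over its HTTP API for the ratio of 5xx responses per service over the last 5 minutes; the server URL, metric, and label names are assumptions for illustration:

    # Illustration only: evaluate a PromQL expression through the
    # standard /api/v1/query HTTP endpoint of Prometheus.
    import requests

    PROMETHEUS = "http://prometheus.example.org:9090"

    # Share of 5xx responses per service over the last 5 minutes.
    query = (
        'sum by (service) (rate(http_requests_total{code=~"5.."}[5m]))'
        " / sum by (service) (rate(http_requests_total[5m]))"
    )

    resp = requests.get(PROMETHEUS + "/api/v1/query", params={"query": query})
    for result in resp.json()["data"]["result"]:
        print(result["metric"].get("service"), result["value"][1])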

Its configuration is stored in files, so it can be maintained collaboratively through a git repository, as the MeB team is used to (with per-rule history, global history, and a pull request workflow). It can also be deployed through Consul for convenience.

Alerts could be sent to different channels (topical channels on Telegram, email…) by combining tags with Prometheus Alertmanager.

The previous systems are configured per host, and that is still how Prometheus is set up, by deploying Node Exporter on each host. But implementing a Prometheus /metrics endpoint in MeB applications would also allow configuring it per application, which would help with deploying any application on any server at any time.
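
A minimal sketch of what such an application-level endpoint could look like, assuming a Flask application and the Python prometheus_client library (route and metric names are hypothetical, not an existing MeB implementation):

    # Hypothetical sketch: expose /metrics from within a web application
    # so Prometheus can scrape per-application metrics.
    from flask import Flask
    from prometheus_client import CONTENT_TYPE_LATEST, Counter, generate_latest

    app = Flask(__name__)
    requests_total = Counter(
        "app_http_requests_total",
        "HTTP requests handled by the application",
        ["endpoint", "code"],
    )

    @app.route("/ping")
    def ping():
        requests_total.labels(endpoint="/ping", code="200").inc()
        return "pong"

    @app.route("/metrics")
    def metrics():
        # Prometheus scrapes this endpoint at each configured interval.
        return generate_latest(), 200, {"Content-Type": CONTENT_TYPE_LATEST}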

Finally, it supports throwing predictive alerts based on curve analysis rather than waiting (often too late) for thresholds to be reached.
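
For instance, the following PromQL expression (shown as a Python string; the metric name is the usual node_exporter one, used purely as an illustration) would match when the root filesystem is predicted to be full within 4 hours, based on the last 6 hours of samples:

    # Illustration of a predictive PromQL expression, as could be used
    # in an alerting rule evaluated by Prometheus.
    PREDICT_DISK_FULL = (
        'predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h],'
        " 4 * 3600) < 0"
    )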

Deployment flexibility

Servers used to be physical hosts running Unix services. They are gradually being replaced with virtual hosts running Docker. Ultimately, it should be possible to quickly move applications to other hosts depending on host availability and application needs. This is the general direction to follow when making server applications/Docker services easier to deploy, operate, and monitor.

Availability

Longer-term consideration (Summit 22?): Have a general status page. There are many prerequisites to it, including having a proper monitoring system.

Security

Longer-term consideration (Summit 22?): Use Vault to manage secrets, which are currently spread among various inappropriate places.
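
As a rough sketch of what that could look like from an application’s point of view, using the hvac Python client for Vault (the address, token handling, and secret path are made up):

    # Hypothetical sketch: read a database password from Vault instead of
    # keeping it in a configuration file.
    import os

    import hvac

    client = hvac.Client(
        url="https://vault.example.org:8200",
        token=os.environ["VAULT_TOKEN"],
    )
    secret = client.secrets.kv.v2.read_secret_version(path="someapp/db")
    db_password = secret["data"]["data"]["password"]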

Suggested plan of action

Immediately actionable tasks

  • List procedures and requirements for deploying/operating/monitoring services in general
  • Document these procedures (e.g. stopping a service) for each service/project

Mid-term tasks

  • Document network architecture and deployment platform
  • Implement all of these requirements (e.g. image tags) for the main services/projects
  • Normalize services which currently are out of sight (e.g. CAA helper on purple)

Longer-term tasks

  • Migrate from InfluxDB to Prometheus
  • Implement Prometheus /metrics endpoint for each server application:
    • For system administration at first;
    • For the application itself (e.g. usage statistics) eventually.