- https://prometheus.io/
- https://prometheus.io/docs/introduction/overview/
- https://github.com/prometheus
- https://github.com/prometheus/nagios_plugins a nagios plugin to query prometheus
- https://github.com/cloudflare/unsee alerts dashboard
- https://github.com/ncabatoff/prombench for generating a lot of scrape to benchmark volume
- https://github.com/qvl/promplot for generating static plot graphs (for mail reports?)
- https://coreos.com/operators/prometheus/docs/latest/high-availability.html about high availability
- https://gitlab.com/gitlab-org/gitlab-monitor ruby code for a web exporter for application custom monitoring
- https://github.com/prometheus/alertmanager#high-availability if we need a deathstar
- https://github.com/UnderGreen/ansible-prometheus if we use ansible
- https://github.com/weaveworks/grafanalib grafanalib - python lib to create grafana templates
- https://prometheus.io/webtools/alerting/routing-tree-editor/ alertmanager routes visualization and testing
- https://github.com/weaveworks/cortex prometheus as a service (could make sense for customers?) uses generic remote read+write
- http://chromix.io - https://github.com/ChronixDB/chronix.ingester - read-only for analysis
- https://github.com/google/mtail transform logs into metrics
- https://gitlab.com/gitlab-com/runbooks/blob/master/howto/monitoring-overview.md prometheus architecture at gitlab
- https://gitlab.com/gitlab-com/runbooks/tree/master/alerts gitlab config for alerts
- https://gitlab.com/gitlab-com/runbooks/blob/master/.gitlab-ci.yml rules checker for prometheus in Gitlab CI
- https://www.robustperception.io/scaling-and-federating-prometheus/ about scaling up
- https://www.robustperception.io/federation-what-is-it-good-for/ about federation and aggregation
- https://prometheus.io/blog/2017/04/10/promehteus-20-sneak-peak/ the next version of prometheus
- https://fabxc.org/blog/2017-04-10-writing-a-tsdb/ details on what the new storage in v2 will be based on
- http://blog.alexellis.io/prometheus-monitoring/ general overview of a prometheus setup
- https://www.digitalocean.com/community/tutorials/how-to-query-prometheus-on-ubuntu-14-04-part-1 PromQL missing doc
- https://www.youtube.com/watch?v=MuHkckZg5L0&list=PLqm7NmbgjUExeDZU8xb2nxz-ysnjuC2Mz&index=7 remote storage
- https://www.influxdata.com/prometheus-influxdb-thoughts/
- https://blog.acolyer.org/2017/03/10/chronix-long-term-storage-and-retrieval-technology-for-anomaly-detection-in-operational-data/ for long-term retention
- https://www.youtube.com/playlist?list=PLoz-W_CUquUlCq-Q0hy53TolAhaED9vmU promcon august 2016
- https://www.youtube.com/playlist?list=PLqm7NmbgjUExeDZU8xb2nxz-ysnjuC2Mz cloudnativecon prometheus track april 2017
- https://www.youtube.com/watch?v=likpVWB5Lvo&index=4&list=PLoz-W_CUquUlCq-Q0hy53TolAhaED9vmU digitalocean scaling prometheus to 2 million servers
- one single prometheus server is ok until 1k or 2k nodes -> need sharding + prometheus proxy for grafana
- with more than 8G of ram, tuning is mandatory
- alertmanager to replace nagios, but it lacks high availability.
- sharding issue: when adding a shard, old data is not imported. they considered kafka for storage, now use a cassandra store
- give data to customers on per-vm basis
- vulcan: fork of prometheus (on github), ahead of upstream prometheus. designed to store on cassandra
- downsampling done before cassandra, because need for 8 month data retention (10 cassandra nodes cluster for 40 metrics on a million machines)
- https://www.youtube.com/watch?v=Cvbc60T1uUY&index=14&list=PLoz-W_CUquUlCq-Q0hy53TolAhaED9vmU debian support for prometheus
- prometheus, alertmanager and pushgateway are packaged. node-exporter, mysql exporter also in package. in unstable, and pass to testing in 5 days. testing packages work fine on stable.
- https://www.youtube.com/watch?v=XvqaYbiTOMg&list=PLoz-W_CUquUlCq-Q0hy53TolAhaED9vmU&index=15 highly available alertmanager
- work in progress (as of august 2016)
- https://www.youtube.com/watch?v=r6N5-1Jyifk&index=16&list=PLoz-W_CUquUlCq-Q0hy53TolAhaED9vmU vulcan
- the digitalocean api-compatible alternative
- long-term storage
- requires kafka, zookeeper, cassandra, elasticsearch, it's a lot
- https://www.youtube.com/watch?v=KoU_DquChS8&index=21&list=PLoz-W_CUquUlCq-Q0hy53TolAhaED9vmU grafana master class
- mostly basic information about grafana. good primer.
- use annotations from search or state change
- on templates, if using multiselect, need to use the =~ matcher
- panel repeater using template variables
- simple json datasource for pushing arbitrary events, may be of some use for mep process
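
The multiselect note can be illustrated with a small sketch: several selected values have to be matched at once, so the equality matcher no longer fits and a regex alternation with =~ is needed. Grafana does this interpolation itself; the helper name `selector_for` and the metric and label names below are hypothetical, and values are assumed to contain no regex metacharacters.

```python
def selector_for(metric, label, values):
    """Build a PromQL selector that matches any of the selected values."""
    # A multi-value variable becomes a regex alternation: web1|web2
    alternation = "|".join(values)
    # '=' would compare against the literal string "web1|web2",
    # '=~' treats it as a regular expression.
    return f'{metric}{{{label}=~"{alternation}"}}'


query = selector_for("node_load1", "instance", ["web1", "web2"])
print(query)
```

With a single selected value the same selector still works, which is why =~ is the safe default for template variables.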
- https://www.youtube.com/watch?v=yrK6z3fpu1E&index=22&list=PLoz-W_CUquUlCq-Q0hy53TolAhaED9vmU alerting in prometheus
- 4 golden signals: latency, traffic, errors, cause-based warnings (capacity/saturation)
- avoid static thresholds: make thresholds relative to context (ex. errors relative to traffic)
- predict_linear for predictive alerts
- alerts grouping (includes summary) and alert inhibition (alerts dependencies) for less alerts traffic
- anomaly detection with holt_winters
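
To make the relative threshold and predictive alert ideas concrete, here is a minimal sketch in plain Python. It is not the Prometheus implementation: `predict_linear_sketch` and `error_ratio` are hypothetical names, the samples are made up, and the regression is an ordinary least-squares fit, which is the general idea behind predict_linear.

```python
def predict_linear_sketch(samples, eval_time, seconds_ahead):
    """Fit a line through (timestamp, value) samples and extrapolate it."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = cov / var
    intercept = mean_v - slope * mean_t
    return intercept + slope * (eval_time + seconds_ahead)


def error_ratio(error_rate, traffic_rate):
    """Errors relative to traffic, instead of a static error threshold."""
    return error_rate / traffic_rate if traffic_rate > 0 else 0.0


# Made-up disk-free samples (GiB), one per minute, losing 1 GiB per minute.
samples = [(0, 100.0), (60, 99.0), (120, 98.0), (180, 97.0)]
predicted = predict_linear_sketch(samples, eval_time=180, seconds_ahead=4 * 3600)

disk_alert = predicted < 0                   # disk predicted full within 4 hours
errors_alert = error_ratio(5, 1000) > 0.01   # alert above 1% of requests failing
print(disk_alert, errors_alert)
```

The disk alert fires long before a static "less than 10% free" threshold would, while the error alert stays quiet at 0.5% regardless of how much traffic there is.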
- https://www.youtube.com/watch?v=b5-SvvZ7AwI&list=PLoz-W_CUquUlCq-Q0hy53TolAhaED9vmU&index=5 labels
- regexp on labels, relabelling
- metric relabelling is another type: applied after the scrape, it only changes how data is stored
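
A rough sketch of what a "replace" relabelling rule does, assuming the usual behaviour: source label values are joined with a separator, matched against a fully anchored regex, and the result is written to a target label. The function name `relabel_replace` is hypothetical, and the replacement uses Python's `\1` syntax rather than the `$1` used in Prometheus configuration.

```python
import re


def relabel_replace(labels, source_labels, regex, target_label,
                    replacement, separator=";"):
    """Return a copy of labels with target_label rewritten if regex matches."""
    value = separator.join(labels.get(name, "") for name in source_labels)
    match = re.fullmatch(regex, value)  # the regex must match the whole value
    result = dict(labels)
    if match is not None:
        result[target_label] = match.expand(replacement)
    return result


labels = {"__address__": "web1.example.com:9100", "job": "node"}
out = relabel_replace(labels, ["__address__"], r"([^:]+):\d+", "instance", r"\1")
print(out["instance"])
```

Here the port is stripped from the scrape address to build a shorter instance label; when the regex does not match, the labels are returned unchanged.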
- https://www.youtube.com/watch?v=KXq5ibSj2qA write an exporter
- when app doesn't have a /metrics in prometheus format
- ipmi, snmp and blackbox
- metrics naming:
- need to include the unit. use seconds instead of milliseconds
- need to have explicit prefix
- need suffix with type of data (_sum or _count)
- _ratio 0 to 1
- _total is a counter, gauge should not have suffix
- process_ and scrape_ prefix are reserved
- use label partitioning when there are various values (like disk free) except latency
- min, max, stddev are useless
- on failure, return a 500: the up metric for the target will then be 0
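
The naming rules above can be sketched as a tiny exporter using only the Python standard library. Everything application-specific is an assumption: the `myapp_` prefix, the `collect` stand-in and its values, and the port are hypothetical. The output follows the Prometheus text exposition format.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def collect():
    """Stand-in for querying the real application (made-up values)."""
    return {
        "requests_total": 42,
        "request_duration_seconds_sum": 3.5,
        "request_duration_seconds_count": 42,
        "disk_free_bytes": {"/": 1000, "/var": 2000},
    }


def render(stats):
    """Render stats as text: explicit prefix, base units, typed suffixes."""
    lines = [
        "# TYPE myapp_requests_total counter",
        f"myapp_requests_total {stats['requests_total']}",
        "# TYPE myapp_request_duration_seconds summary",
        f"myapp_request_duration_seconds_sum {stats['request_duration_seconds_sum']}",
        f"myapp_request_duration_seconds_count {stats['request_duration_seconds_count']}",
        "# TYPE myapp_disk_free_bytes gauge",
    ]
    # One series per mountpoint: partition with a label, not with the name.
    for mountpoint, free in sorted(stats["disk_free_bytes"].items()):
        lines.append(f'myapp_disk_free_bytes{{mountpoint="{mountpoint}"}} {free}')
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        try:
            body = render(collect()).encode()
        except Exception:
            self.send_error(500)  # a failed scrape marks the target as down
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body)


def serve(port=9999):
    """Start the exporter; blocks until interrupted."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` and pointing a scrape job at port 9999 would be enough to try it; a real exporter would more likely use the official client library.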
- https://www.youtube.com/watch?v=gNmWzkGViAY talk about borgmon from google, which is essentially the same as prometheus
- https://www.youtube.com/watch?v=NFPGtbQfL1A what gitlab does with prometheus (up to customer facing usage)
- https://www.youtube.com/watch?v=MuHkckZg5L0 about the remote write/read possibilities
- this was experimental at the time the video was shot
- there is sample bridge for influx, graphite, opentsdb
- but those backends suck a bit:
- influxdb open source version doesn't do clustering. But maybe we don't need clustering for the db? remote write from a dual-prometheus setup is already something
- opentsdb is backed by hadoop, hbase, java heavy machinery, hard to maintain
- graphite brings yet another backend like cassandra or cyanite
- https://www.youtube.com/watch?v=XQdEVY2l2e0 about high availability of the alert manager
- UUID is generated
- gossip protocol to share data about aggregation, alerting etc..
- https://youtu.be/67Ulrq6DxwA?list=PLqm7NmbgjUExeDZU8xb2nxz-ysnjuC2Mz counting with prometheus
- aliasing: why reloading a graph does not always give the same graph
- when counter resets to 0, new value is added to old value because counters never decrease so if it decreases it's interpreted as a reset
- irate only uses the last 2 values of the given time range: more accurate and spiky
- increase() is just rate() multiplied by the number of seconds of the time range
- histograms are expensive, as are quantiles
- in rate() the time range should be at least 2 or 3 scrape intervals
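
The counter notes above can be sketched in a few lines of Python. This is a simplification under stated assumptions: real Prometheus also extrapolates to the edges of the time range, which is skipped here, so the rate is computed over the span between the first and last sample. The function names mirror the PromQL ones only for readability.

```python
def increase(samples):
    """Total increase over (timestamp, value) samples, handling resets."""
    total = 0.0
    prev = samples[0][1]
    for _, value in samples[1:]:
        if value < prev:
            # Counters never decrease, so a drop means a restart from 0:
            # the whole new value counts as increase.
            total += value
        else:
            total += value - prev
        prev = value
    return total


def rate(samples):
    """Average per-second increase between the first and last sample."""
    span = samples[-1][0] - samples[0][0]
    return increase(samples) / span


def irate(samples):
    """Per-second increase using only the last two samples."""
    return rate(samples[-2:])


# Scraped every 15s; the process restarted between t=15 and t=30.
samples = [(0, 10), (15, 20), (30, 5), (45, 15)]
print(increase(samples), rate(samples), irate(samples))
```

With fewer than two samples in the range there is nothing to compute, which is why the range should cover at least 2 or 3 scrape intervals.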
- https://youtu.be/XQdEVY2l2e0?list=PLqm7NmbgjUExeDZU8xb2nxz-ysnjuC2Mz Alertmanager on Its Way to High Availability
- using gossip for having alertmanagers communicate silences and notifications between other instances
- using mesh from weaveworks
- ensures at least one notif is sent out, sometimes gossip is hindered by network and then there is duplication
- no master but a different treatment delay on each alertmanager is used to let gossip communicate its shit
- memory usage of alertmanager is low because retention of alerts is low
- when a new alertmanager is started, it gets updated by gossip, plus gossip does consistency checks from time to time
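
A toy simulation of the staggered delay idea, as I understood it from the talk: every instance waits a bit longer than the previous one and skips the notification if gossip already reports it as sent. The function name, the peer names and the 15 second step are hypothetical, and gossip is reduced to a single on/off flag.

```python
def simulate(peers, gossip_works, step_delay=15):
    """Return the (peer, send_time) notifications sent for one alert."""
    sent = []
    for position, peer in enumerate(peers):
        wait = position * step_delay  # each instance waits a little longer
        # By the time this peer's delay expires, gossip (if it works) has
        # told it whether an earlier peer already notified.
        already_notified = gossip_works and len(sent) > 0
        if not already_notified:
            sent.append((peer, wait))
    return sent


peers = ["am0", "am1", "am2"]
print(simulate(peers, gossip_works=True))   # only the first instance notifies
print(simulate(peers, gossip_works=False))  # partition: duplicates, none lost
```

This shows the trade-off from the notes: at least one notification always goes out, at the cost of duplicates when gossip is hindered.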
- https://youtu.be/jpb6fLQOgn4?list=PLqm7NmbgjUExeDZU8xb2nxz-ysnjuC2Mz Understanding and Extending Prometheus AlertManager
- good general overview of Alertmanager
- alertmanager has its own api for third party notifications
- there is a visual editor and tester for alertmanager routes https://prometheus.io/webtools/alerting/routing-tree-editor/
- shows how to hack alertmanager code to add features. UI is in angular but planned to move to elm and/or react
- https://youtu.be/bfSMDERvkZY?list=PLqm7NmbgjUExeDZU8xb2nxz-ysnjuC2Mz Grafana is Not Enough: DIY User Interfaces for Prometheus
- grafana templates as code (in python) = grafanalib https://github.com/weaveworks/grafanalib
- interesting example for postmortems, custom dashboards with editable queries
- https://youtu.be/MuHkckZg5L0?list=PLqm7NmbgjUExeDZU8xb2nxz-ysnjuC2Mz Integrating Long-Term Storage with Prometheus
- remote storage is an old story (issue #10)
- local storage can optionally work in combination
- http + protobuf protocol -> bridge -> remote storage
- influxdb, opentsdb and graphite are just example bridges
- the remote_write conf directive just takes a destination url, or remote_read for read
- read is centralized promql evaluation
- future: federation of shards for reads
- cortex is the flagship of remote storage
- still experimental - remote will be only in next release 1.7
- https://youtu.be/lo0Y1QyGh0k?list=PLqm7NmbgjUExeDZU8xb2nxz-ysnjuC2Mz Prometheus: The Unsung Heroes
- usage of mtail to scrape logs from brocade syslog (smaller footprint than grok_exporter)
- usage of snmp_exporter, featuring a generator
- https://youtu.be/hPC60ldCGm8?list=PLqm7NmbgjUExeDZU8xb2nxz-ysnjuC2Mz Configuring Prometheus for High Performance
- at start prometheus memory usage increases, but at some point it plateaus. where the plateau will be is hard to predict
- storage.local.memory-chunks : chunks are current unpersisted chunks plus persisted and cached ones (total memory / 6)
- max-chunks-to-persist (memory-chunks / 2)
- see https://www.robustperception.io/how-much-ram-does-my-prometheus-need-for-ingestion/
- prometheus 2.0 is totally rewritten and will have better memory management (to be released in 2017)
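
The two rules of thumb for the 1.x flags can be turned into a quick calculation. This is a sketch under assumptions: "total memory / 6" is read here as RAM expressed in 1 KiB chunks divided by 6, and `suggest_flags` is a hypothetical helper, so check the result against the Robust Perception article before using it.

```python
CHUNK_SIZE_BYTES = 1024  # assumed size of one chunk in Prometheus 1.x


def suggest_flags(total_memory_bytes):
    """Apply the talk's rules of thumb to a machine's total RAM."""
    memory_chunks = total_memory_bytes // CHUNK_SIZE_BYTES // 6
    max_chunks_to_persist = memory_chunks // 2
    return {
        "storage.local.memory-chunks": memory_chunks,
        "storage.local.max-chunks-to-persist": max_chunks_to_persist,
    }


flags = suggest_flags(8 * 1024 ** 3)  # a machine with 8 GiB of RAM
for name, value in flags.items():
    print(f"-{name}={value}")
```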
- https://youtu.be/U_nQObcAxSk?list=PLqm7NmbgjUExeDZU8xb2nxz-ysnjuC2Mz Monitor My Socks: Using Prometheus in a Polyglot Open Source Microservices
- about integrating metrics endpoints in a microservices framework