@ssm ssm/talk.md
Last active Oct 8, 2018

lightning talk about rrdcached and munin

Scaling munin

Running a large munin master is all about herding bottlenecks. Memory, CPU and storage I/O are all candidates for "bottleneck of the day".

Munin used to be cron driven.

Then we got CGI.

Which then became "FastCGI" (quotation marks intended). FastCGI is great, but it was neither really fast nor particularly stable until munin 2.x. (The FastCGI container would restart FastCGI scripts that died, leaving us with a missing image now and then.)

munin-update

Connects to all the nodes, in parallel, and reads plugin config and values.

Writes to RRD files as fast as it can. A great candidate for IO bottlenecks.

munin-html

When in cron mode, writes all static pages to a directory. This is really a disk killer. It is also a single-threaded process, so it takes a lot of time as well.

When in CGI mode, writes a Storable file for munin-cgi-html.

munin-graph

When in cron mode, writes all the daily graph images to disk, every 5 minutes. Weekly, monthly and yearly graphs are graphed less often.

munin-cgi-html

Role: generate web pages, on demand.

munin-cgi-html does not read the configuration in /etc/munin, at all. It reads the data structures left by "munin-html", which runs every 5 minutes. That means:

If you change something, you need to run "munin-html". This will write a new storable, and munin-cgi-html will pick this up.

The next page load after munin-html has run will be slower, since the data structure in the storable has to be loaded into the FastCGI process.

On a master with a few hundred nodes, this will be noticeably slower. (As in: "dammit"-slower.) Not really helping. :)

munin-cgi-graph

Role: serve images on demand.

Checks if the cache has a recent enough image.

If not: generate a graphing command for RRD, which should hopefully result in some sort of image describing exactly what our original problem is, so we can solve it.

(Usually, this will be: /var somewhere is full, and I could not notify you, since /var was full, so I could not send you a mail. Did you know that /var was full? I did...)
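
The "recent enough" check above can be illustrated with a few lines of shell. This is only a sketch of the idea, not how munin-cgi-graph does it; the function name `cache_is_fresh`, the image path and the 5 minute limit are all assumptions made for the example.

```shell
#!/bin/sh
# Illustration only: munin-cgi-graph has its own cache logic.
# cache_is_fresh FILE MINUTES: succeeds if FILE exists and was
# modified less than MINUTES minutes ago.
cache_is_fresh() {
    [ -n "$(find "$1" -maxdepth 0 -mmin "-$2" 2>/dev/null)" ]
}

img="${1:-/tmp/example-load-day.png}"   # hypothetical cached image path
if cache_is_fresh "$img" 5; then
    echo "serve cached image: $img"
else
    echo "regenerate the image from the RRD files, then cache it: $img"
fi
```

The point of the check is that only graphs somebody actually looks at are ever drawn, and a graph that was drawn a moment ago is not drawn again.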

This is the place where you'd look up "munin-limits" and integration with Nagios. Or you'd look at trends and predictions with munin, which helps if you look at your graphs often, and not when you are figuring out what's wrong now.

Tradeoffs

CGI and FastCGI are a tradeoff. Munin does not spend time graphing everything, and in return we wait a bit longer for each graph to be generated.

We lose simplicity (serve static files from this directory).

We lose web page serving speed (images and pages are generated).

We gain capacity. (Store raw data only. Do not read all RRD files, and do not write all images and pages every 5 minutes.)

This enables us to scale to a larger number of nodes per master.

Scaling with rrdcached

The RRD cache daemon (rrdcached) helps with the following:

Writes: All the writes from munin-update are spooled, and RRD files are written in the background.

Reads: For any RRD files read, the RRD cache daemon will write any spooled data for them, and then say "OK, you can read that file".
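
The flush-before-read step can also be triggered by hand, which is a quick way to check that the daemon is reachable. A minimal sketch, assuming the socket path used in the flags section of this talk and a made-up RRD file name; `rrdtool flushcached` asks the daemon to write out any spooled updates for the named file.

```shell
#!/bin/sh
# Socket path as used elsewhere in this talk. The RRD file name is a
# hypothetical example; substitute one of your own.
SOCK="unix:/run/munin/rrdcached.sock"
RRD="/var/lib/munin/example.com/host.example.com-load-load-g.rrd"

# Only talk to the daemon if rrdtool is installed and the socket exists.
if command -v rrdtool >/dev/null 2>&1 && [ -S "${SOCK#unix:}" ]; then
    # Ask rrdcached to write spooled updates for this file to disk.
    rrdtool flushcached --daemon "$SOCK" "$RRD"
else
    echo "would run: rrdtool flushcached --daemon $SOCK $RRD"
fi
```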

The effect is dramatic: The image at http://goo.gl/RWQwW describes what happened when I added rrdcached to a busy munin master.

Note that this is a logarithmic graph.

What is needed?

An instance of rrdcached with the ability to read and write to the munin directory.

A line of configuration for munin, telling it to use rrdcached.

(I suggest you get the rrdcached going before you configure munin)
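
The line of configuration is the `rrdcached_socket` directive in munin.conf. Here is a small sketch that adds it only if it is missing. The helper name `add_rrdcached_socket` and the `MUNIN_CONF` variable are made up for this example, and by default it works on a scratch file so nothing under /etc is touched.

```shell
#!/bin/sh
# add_rrdcached_socket CONF SOCKET: append the rrdcached_socket
# directive to CONF unless one is already present.
add_rrdcached_socket() {
    if grep -q '^rrdcached_socket[[:space:]]' "$1" 2>/dev/null; then
        echo "already configured in $1"
    else
        printf 'rrdcached_socket %s\n' "$2" >> "$1"
        echo "added rrdcached_socket to $1"
    fi
}

# Dry run against a scratch file. Set MUNIN_CONF=/etc/munin/munin.conf
# (and run as root) once rrdcached itself is up.
conf="${MUNIN_CONF:-$(mktemp)}"
add_rrdcached_socket "$conf" /run/munin/rrdcached.sock
cat "$conf"
```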

Supervise this process

If rrdcached stops, munin will stop. Run this process supervised. That means:

If you've got Ubuntu, write an upstart config for it. If you've got Debian, you've got lots of choices, including upstart, systemd, monit and runit.

If you haven't got anything, make a cron job. :)
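
For the cron job fallback, a rough sketch of what such a check could look like. It assumes the socket path from the flags section, treats a missing socket as "not running", and only prints what it found; the function name `rrdcached_is_up` and the restart command in the comment are placeholders, not something munin or rrdcached ships.

```shell
#!/bin/sh
# Hypothetical cron watchdog. A proper supervisor is still the better choice.
SOCK="${SOCK:-/run/munin/rrdcached.sock}"

rrdcached_is_up() {
    # The daemon creates its UNIX socket on startup, so a missing
    # socket is treated as "not running". A stale socket left behind
    # by a crash would fool this check.
    [ -S "$1" ]
}

if rrdcached_is_up "$SOCK"; then
    echo "rrdcached socket present: $SOCK"
else
    echo "rrdcached socket missing: $SOCK"
    # Restart it here with whatever your system uses, for example:
    #   service rrdcached start
fi
```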

Important flags

Paths for rrdcached:

-B -b /var/lib/munin/
-j /var/lib/munin/rrdcached-journal/

Communications socket:

-m 0660 -l unix:/run/munin/rrdcached.sock

Note: You'll need to add read and write permissions on this socket for the FastCGI / CGI processes started by the web server as well.

Performance settings:

-F      # always flush data on shutdown

-w 1800 # Wait 30 minutes before writing data

-z 1800 # Delay writes by a random factor of up to 30 minutes
        # (this should be equal to, or lower than, “-w”)

-f 3600 # Flush all data every hour
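
Put together, the flags above make up one command line. The sketch below only collects them in a variable and prints the result for review. The variable name `RRDCACHED_OPTS` and running the daemon as the munin user are assumptions about your setup.

```shell
#!/bin/sh
# All flags from this section in one place. "-m" is given before "-l"
# because the mode applies to the sockets listed after it.
RRDCACHED_OPTS="-B -b /var/lib/munin/ \
-j /var/lib/munin/rrdcached-journal/ \
-m 0660 -l unix:/run/munin/rrdcached.sock \
-F -w 1800 -z 1800 -f 3600"

# Print the resulting command line for review.
echo "rrdcached $RRDCACHED_OPTS"

# To start it for real, run it as the user that owns the munin
# directory (assumed here to be "munin"), for example:
#   sudo -u munin rrdcached $RRDCACHED_OPTS
```

With these values, an update may sit in memory for up to an hour before it reaches the RRD file (30 minutes of waiting plus up to 30 minutes of random delay), which is why the journal directory and the flush on shutdown matter.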

Documentation

Using "rrdcached" to scale munin is documented at https://munin.readthedocs.org/en/latest/master/rrdcached.html
