Running a large munin master is all about herding bottlenecks. Memory, CPU and storage IO are all candidates for "bottleneck of the day".
Munin used to be cron driven.
Then we got CGI.
Which then became "FastCGI" (quotation marks intended). FastCGI is great, but it was neither really fast nor particularly stable until munin 2.x. (The FastCGI container would restart FastCGI scripts that died, leaving us with a missing image now and then.)
Connects to all the nodes, in parallel, and reads plugin config and values.
Writes to RRD files as fast as it can. A great candidate for IO bottlenecks.
When in cron mode, writes all static pages to a directory. This is really a disk killer. It is also a single-threaded process, so it takes a lot of time as well.
When in cgi mode, writes a storable file for munin-cgi-html.
When in cron mode, writes all the daily graph images to disk, every 5 minutes. Weekly, monthly and yearly graphs are graphed less often.
Role: generate web pages, on demand.
munin-cgi-html does not read the configuration in /etc/munin, at all. It reads the data structures left by "munin-html", which runs every 5 minutes. That means:
If you change something, you need to run "munin-html". This will write a new storable, and munin-cgi-html will pick this up.
The next page load, after munin-html has run, will be slower, since the data structure in the storable will be loaded into the fastcgi process.
On a master with a few hundred nodes, this will be noticeably slower. (As in: "dammit"-slower.) Not really helping. :)
Role: serve images on demand.
Checks if the cache has a recent enough image.
If not: Generate graphing command for rrd, which should hopefully result in some sort of image, describing exactly what our original problem is, so we can solve it.
(Usually, this will be: /var somewhere is full, and I could not notify you, since /var was full, so I could not send you a mail. Did you know that /var was full? I did...)
This is the place where you'd look up "munin-limits" and integration with Nagios. Or you'd look at trends and predictions with munin, which helps if you look at your graphs often, and not only when you are figuring out what's wrong right now.
CGI and FastCGI are a tradeoff. Munin does not spend time graphing everything up front, and in return we wait a bit longer for each graph to be generated on request.
We lose simplicity (serve static files from this directory).
We lose web page serving speed (images and pages are generated).
We gain capacity. (store raw data only. Do not read all RRD files, and do not write all images and pages every 5 minutes)
This enables us to scale to a larger number of nodes per master.
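Switching from cron mode to CGI mode is a change in the master configuration. A minimal sketch, assuming a munin 2.x master and the directive names I know from its munin.conf (check the munin.conf documentation for your version). The web server setup needed to run the CGI scripts is not shown.

```
# /etc/munin/munin.conf (fragment)
# Generate graphs and HTML pages on demand instead of from cron.
graph_strategy cgi
html_strategy  cgi
```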
Scaling with rrdcached
RRD cache daemon helps with the following:
Writes: All the writes from munin-update are spooled, and RRD files are written in the background.
Reads: For any RRD files read, the RRD cache daemon will write any spooled data for them, and then say "OK, you can read that file".
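To watch the read path work, you can ask the daemon to flush a file by hand. This is a sketch only: it assumes the rrdtool command-line client is installed and that the daemon listens on the socket used further down this page. flush_rrd is a hypothetical helper name, and the path in the usage comment is made up.

```shell
#!/bin/sh
# Address of the daemon; rrdtool also honours this environment variable.
RRDCACHED_ADDRESS="${RRDCACHED_ADDRESS:-unix:/run/munin/rrdcached.sock}"

# flush_rrd FILE...  (hypothetical helper)
# Asks rrdcached to write any spooled updates for the given RRD files,
# so that a following read sees current data.
flush_rrd() {
    rrdtool flushcached --daemon "$RRDCACHED_ADDRESS" "$@"
}

# Usage (illustrative path only):
#   flush_rrd /var/lib/munin/example.com/host-load-load-g.rrd
```

Graphing through the daemon does this flush for you; the manual command is mostly useful when you want to inspect an RRD file with other tools.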
The effect is dramatic: The image at http://goo.gl/RWQwW describes what happened when I added rrdcached to a busy munin master.
Note that this is a logarithmic graph.
What is needed?
An instance of rrdcached with the ability to read and write to the munin directory.
A line of configuration for munin, telling it to use rrdcached.
(I suggest you get the rrdcached going before you configure munin)
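The line of munin configuration is, as far as I know, the rrdcached_socket directive on munin 2.x. A sketch, assuming the socket path used in the rrdcached options on this page:

```
# /etc/munin/munin.conf (fragment)
# Must match the address given to rrdcached with "-l".
rrdcached_socket /run/munin/rrdcached.sock
```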
Supervise this process
If rrdcached stops, munin will stop. Run this process supervised. That means:
If you've got Ubuntu, write an upstart config for it. If you've got Debian, you've lots of choices, including upstart, systemd, monit, runit.
If you haven't got anything, make a cron job. :)
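For the cron fallback, something like the following could work. It is a sketch under assumptions, not a tested recipe: the pidfile path is made up and must match whatever rrdcached is told with "-p" (or its packaged default), and is_alive and ensure_running are hypothetical helper names.

```shell
#!/bin/sh
# Hypothetical cron watchdog: start rrdcached again if it is not running.
# PIDFILE is an assumed path, not something munin or rrdcached dictates.
PIDFILE="${PIDFILE:-/run/munin/rrdcached.pid}"

# Succeeds if the pidfile names a live process that we may signal.
is_alive() {
    [ -r "$1" ] || return 1
    pid="$(cat "$1")"
    [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null
}

# ensure_running PIDFILE COMMAND...
# Runs COMMAND only when the daemon is not alive.
ensure_running() {
    pidfile="$1"; shift
    is_alive "$pidfile" || "$@"
}

# A cron-driven script would end with a call such as:
#   ensure_running "$PIDFILE" rrdcached -p "$PIDFILE" <options from this page>
```

Run it as the same user that owns the rrdcached process (or as root), since kill -0 fails for processes you are not allowed to signal. A real supervisor restarts the daemon immediately; cron only notices on its next run.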
Paths for rrdcached:
-B -b /var/lib/munin/ -j /var/lib/munin/rrdcached-journal/
-m 0660 -l unix:/run/munin/rrdcached.sock
Note: You'll also need to give the FastCGI / CGI processes started by the web server read and write access to the socket.
-F       # Always flush data on shutdown
-w 1800  # Wait 30 minutes before writing data
-z 1800  # Delay writes by a random factor of up to 30 minutes
         # (this should be equal to, or lower than, "-w")
-f 3600  # Flush all data every hour
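Put together, the options on this page form one command line. The sketch below only prints it, so you can review it before pasting it into your supervisor's configuration. RRDCACHED_BIN and rrdcached_cmd are names I made up for the example.

```shell
#!/bin/sh
# Sketch: assemble the rrdcached options from this page into one command line.
# RRDCACHED_BIN is an assumed variable; point it at your rrdcached binary.
RRDCACHED_BIN="${RRDCACHED_BIN:-rrdcached}"

# Prints the full command line on a single line.
rrdcached_cmd() {
    printf '%s ' "$RRDCACHED_BIN" \
        -B -b /var/lib/munin/ \
        -j /var/lib/munin/rrdcached-journal/ \
        -m 0660 -l unix:/run/munin/rrdcached.sock \
        -F -w 1800 -z 1800 -f 3600
    printf '\n'
}

# Print the command for review.
rrdcached_cmd
```

If a supervisor such as systemd or runit starts the daemon, you will probably also want rrdcached's "-g" option, which keeps it in the foreground. That option is not part of the list above.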
Using "rrdcached" to scale munin is documented at https://munin.readthedocs.org/en/latest/master/rrdcached.html