@jessereynolds
Created June 6, 2012 00:05
collectd exec plugin fork freeze
#!/bin/bash
# exec plugin script: emits a constant gauge value once per interval
HOSTNAME="${COLLECTD_HOSTNAME:-localhost}"
INTERVAL="${COLLECTD_INTERVAL:-10}"
while sleep "$INTERVAL"; do
  VALUE=1.23
  echo "PUTVAL \"$HOSTNAME/exec-magic/gauge-magic_level\" interval=$INTERVAL N:$VALUE"
done
root@jesse-precise-desktop:~# bin/test_collectd_exec.sh
Wed Jun 6 09:39:24 CST 2012: starting collectd
Starting statistics collection and monitoring daemon: collectd,,.
Wed Jun 6 09:39:24 CST 2012: exec started?
2461 pts/2 S+ 0:00 \_ /bin/bash bin/test_collectd_exec.sh
2470 pts/2 S+ 0:00 \_ grep collect
2466 ? Ss 0:00 /usr/sbin/collectdmon -P /var/run/collectdmon.pid -- -C /etc/collectd/collectd.conf
2468 ? SLl 0:00 \_ collectd -C /etc/collectd/collectd.conf -f
Wed Jun 6 09:39:26 CST 2012: stopping collectd
Stopping statistics collection and monitoring daemon: collectd.
collectd is stopped.
collectd is stopped.
collectd is stopped.
collectd is stopped.
Wed Jun 6 09:39:30 CST 2012: killing any remaining processes
collectd: no process found
Wed Jun 6 09:39:31 CST 2012: starting collectd
Starting statistics collection and monitoring daemon: collectd,,.
Wed Jun 6 09:39:31 CST 2012: exec started?
2461 pts/2 S+ 0:00 \_ /bin/bash bin/test_collectd_exec.sh
2520 pts/2 S+ 0:00 \_ grep collect
2516 ? Ss 0:00 /usr/sbin/collectdmon -P /var/run/collectdmon.pid -- -C /etc/collectd/collectd.conf
2518 ? RLl 0:00 \_ collectd -C /etc/collectd/collectd.conf -f
Wed Jun 6 09:39:33 CST 2012: stopping collectd
Stopping statistics collection and monitoring daemon: collectd.
collectd (2531) is running.
collectd (2531) is running.
^C
root@jesse-precise-desktop:~# ps afx | grep collectd
2531 ? S 0:00 collectd -C /etc/collectd/collectd.conf -f
root@jesse-precise-desktop:~# strace -p 2531
Process 2531 attached - interrupt to quit
futex(0x7fa983a9edb0, FUTEX_WAIT_PRIVATE, 2, NULL^C <unfinished ...>
Process 2531 detached
#!/bin/bash
# this starts and stops collectd and looks to see if any collectd processes remain
while true ; do
  echo "$(date): starting collectd"
  /etc/init.d/collectd start
  echo "$(date): exec started?"
  ps afx | grep collect
  sleep 2
  echo "$(date): stopping collectd"
  /etc/init.d/collectd stop
  # poll the status four times, one second apart, to see whether a process lingers
  /etc/init.d/collectd status
  sleep 1
  /etc/init.d/collectd status
  sleep 1
  /etc/init.d/collectd status
  sleep 1
  /etc/init.d/collectd status
  echo "$(date): killing any remaining processes"
  killall -9 collectd
  sleep 1
done
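
The loop above relies on reading the `status` and `ps` output by eye. A minimal sketch of how the leftover check could be automated follows, assuming input in the `ps afx` format shown in the output earlier. The function name `leftover_collectd_pids` is hypothetical and not part of the original script.

```shell
#!/bin/bash
# Hypothetical helper (not in the original gist): read "ps afx"-style lines on
# stdin and print the PID of every process whose command is collectd itself.
# collectdmon and the "grep collect" line are deliberately not matched.
leftover_collectd_pids() {
    awk '{
        i = 5
        # Skip the tree-drawing tokens that "ps f" puts before the command.
        while (i <= NF && ($i == "\\_" || $i == "|")) i++
        if (i <= NF && ($i == "collectd" || $i ~ /\/collectd$/)) print $1
    }'
}
```

Running `ps afx | leftover_collectd_pids` a few seconds after `/etc/init.d/collectd stop` would print any surviving collectd PIDs. The loop could then stop at that point, instead of running `killall -9`, so the stuck process is still there to inspect with `strace -p`.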
@jessereynolds

It seems to be easier to reproduce after a fresh reboot of the VM.

The problem was first observed on Ubuntu Precise 12.04 LTS (Server) 64-bit running on VMware ESX:

Linux bp-dvmh-collectd-01 3.2.0-24-generic #37-Ubuntu SMP Wed Apr 25 08:43:22 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

The environment used to reproduce this error (above) is Ubuntu Precise 12.04 LTS (Desktop) 64 bit running on VirtualBox 4 on Mac OS X 10.7.4:

Linux jesse-precise-desktop 3.2.0-24-generic #39-Ubuntu SMP Mon May 21 16:52:17 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

@jessereynolds

collectd configuration (contents of /etc/collectd) from the jesse-precise-desktop vm can be downloaded from here: http://jessereynolds.com/etc_collectd_precise_desktop-20120606-0940.tgz

@jessereynolds

I've managed to reproduce this on a Lucid VM (Ubuntu 10.04.3, 32-bit) with collectd 4.8.2, but only once out of about fifty attempts:

Wed Jun 6 22:24:14 CST 2012: starting collectd
Starting statistics collection and monitoring daemon: collectd.
Wed Jun 6 22:24:15 CST 2012: exec started?
1518 pts/0 S+ 0:00 \_ /bin/bash ./test_collectd_exec.sh
2065 pts/0 S+ 0:00 \_ grep collect
2052 ? Ss 0:00 /usr/sbin/collectdmon -P /var/run/collectdmon.pid -- -C /etc/collectd/collectd.conf
2054 ? Sl 0:00 \_ collectd -C /etc/collectd/collectd.conf -f
2062 ? S 0:00 \_ collectd -C /etc/collectd/collectd.conf -f
Wed Jun 6 22:24:17 CST 2012: stopping collectd
Stopping statistics collection and monitoring daemon: collectd.
collectd (2062) is running.
collectd (2062) is running.
collectd (2062) is running.
collectd (2062) is running.
Wed Jun 6 22:24:21 CST 2012: killing any remaining processes

Linux lucid32 2.6.32-33-generic #70-Ubuntu SMP Thu Jul 7 21:09:46 UTC 2011 i686 GNU/Linux

@edmondac

edmondac commented Jun 6, 2012

On a slackware-ish linux running collectd 4.10.7 I get this error about one in ten times (roughly) - but it seems to go away when I comment out the ping plugin...

[/var/log/collectd.log] 2012-06-06 16:06:24 UTC exec plugin: exec_read_one: Waiting for `/usr/bin/diskmonitor' to exit.
[/var/log/collectd.log] 2012-06-06 16:06:24 UTC exec plugin: Child 7067 exited with status 15.
[/var/log/collectd.log] 2012-06-06 16:06:24 UTC exec plugin: Sent SIGTERM to 0
[/var/log/syslog] [2012-06-06 16:06:24 UTC] [INFO] daemon 127.0.0.1 collectd[7055]: exec plugin: Sent SIGTERM to 0

With collectd 5 this happens much more often, and again commenting out the ping plugin makes the problem go away.
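
The `Sent SIGTERM to 0` line in the log above is worth a note. Under POSIX, a PID of 0 passed to `kill` addresses every process in the caller's process group rather than one child. Whether that is what happens inside the exec plugin is not established in this thread. The following is only a sketch, in shell rather than collectd's C, of guarding a signal call against such a PID. The function name `safe_term` is hypothetical.

```shell
#!/bin/bash
# Hypothetical guard (an illustration, not collectd code): send SIGTERM only
# when the argument is a positive decimal PID, so that 0 (the caller's process
# group) and negative values (other process groups) are never signalled.
safe_term() {
    local pid="$1"
    case "$pid" in
        ''|*[!0-9]*)
            echo "refusing to signal invalid PID: '$pid'" >&2
            return 1
            ;;
    esac
    if [ "$pid" -le 0 ]; then
        echo "refusing to signal PID $pid" >&2
        return 1
    fi
    kill -TERM "$pid"
}
```

With this guard, `safe_term 0` returns a non-zero status without sending anything, while `safe_term "$child_pid"` behaves like a plain `kill -TERM`.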

@jessereynolds

I've raised a bug report here: collectd/collectd#89
