wzin/gist:bd32463788cc1a3b2eb5f351e23bc34a

## gistfile1.txt
# howto
- know your metric (events vs sampled measurements)
 - create grafana dashboard for that
 - define data ranges
 - understand the data and define thresholds for triggers
- make sure you have a way to monitor your numbers
- start sampling numbers with items
- apply a derivative-type (delta) or integral-type (sum over time) function to the data to detect errors
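
A minimal Python sketch of the two function types from the last step. The names `delta` and `rolling_sum` are hypothetical helpers for illustration only; they are not tied to any particular monitoring system.

```python
from typing import List


def delta(samples: List[float]) -> List[float]:
    """Derivative type: difference between each sample and the previous one."""
    return [b - a for a, b in zip(samples, samples[1:])]


def rolling_sum(samples: List[float], window: int) -> List[float]:
    """Integral type: sum of the last `window` samples at each position."""
    return [
        sum(samples[max(0, i - window + 1): i + 1])
        for i in range(len(samples))
    ]


# A counter that only grows when something happens: nonzero deltas mark change.
counter_deltas = delta([10, 10, 12, 12])

# Event counts per interval: the sum keeps an event visible for `window` samples.
summed = rolling_sum([0, 0, 1, 0], window=2)
```

delta fits metrics stored as ever-growing counters; a sum over time fits per-interval event counts, where a single nonzero sample would otherwise disappear on the next check.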

# influx samples (sampled every 10 seconds):

telegraf..ros_respawns <- measurement that stores events

2017-03-30T12:52:45.417421562Z	"lg-head-350"	"roscoe"	"Service: kmlsync (localhost:8765) hasn't become accessible within 15 seconds"	"lg_earth"	"testhq"	1
2017-03-30T12:52:45.526780797Z	"lg-head-350"	"roscoe"	"timeout exceeded while waiting for service /readiness_node/ready"	"lg_adhoc_browser"	"testhq"	1
2017-03-30T12:52:52.339374351Z	"lg-head-350"	"roscoe"	"timeout exceeded while waiting for service /readiness_node/ready"	"lg_adhoc_browser"	"testhq"	1
2017-03-30T12:52:53.139542817Z	"lg-head-350"	"roscoe"	"timeout exceeded while waiting for service /readiness_node/ready"	"lg_adhoc_browser"	"testhq"	1


SELECT count(*) FROM ros_respawns WHERE time > now() - 300s => returns 4

^ run it every 5 minutes

[   5m=0e   ][   5m=0e   ][   5m=1e   ][   5m=0e   ]

samples:
0,0,1,0
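
The 5-minute count can be reproduced offline as a rough Python sketch. It assumes the upper bound of the window is the time of the check; `parse_ts` and `count_recent` are hypothetical helper names, and the timestamps are the four events listed above.

```python
from datetime import datetime, timedelta, timezone


def parse_ts(ts: str) -> datetime:
    """Parse an influx-style timestamp, dropping the sub-second part."""
    return datetime.strptime(ts[:19], "%Y-%m-%dT%H:%M:%S").replace(tzinfo=timezone.utc)


def count_recent(event_times, now, window=timedelta(seconds=300)):
    """Count events with now - window < time <= now (mirrors time > now() - 300s)."""
    return sum(1 for t in event_times if now - window < t <= now)


events = [
    parse_ts("2017-03-30T12:52:45.417421562Z"),
    parse_ts("2017-03-30T12:52:45.526780797Z"),
    parse_ts("2017-03-30T12:52:52.339374351Z"),
    parse_ts("2017-03-30T12:52:53.139542817Z"),
]

# Check at 12:55:00 sees all four events; the next check at 13:00:00 sees none.
first_check = count_recent(events, datetime(2017, 3, 30, 12, 55, tzinfo=timezone.utc))
next_check = count_recent(events, datetime(2017, 3, 30, 13, 0, tzinfo=timezone.utc))
```

Running the check every 5 minutes with a 5-minute window means each event is counted in exactly one sample, which is what produces a series like the one above.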

We want to detect nonzero values and raise an alarm about them.

- for intermittent errors the alarm will come and go every 5 minutes, since the item frequency is 5m
- for persistent problems (e.g. respawns of a ROS node due to misconfiguration) the alarm will persist
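
One way to tell the two cases apart, sketched in Python. The `classify` function and the `persist_after` threshold of 3 consecutive nonzero samples are assumptions made for illustration, not something the notes above prescribe.

```python
from typing import List


def classify(samples: List[int], persist_after: int = 3) -> List[str]:
    """Return one state per sample: 'ok', 'alarm' or 'persistent'."""
    states = []
    streak = 0  # number of consecutive nonzero samples seen so far
    for s in samples:
        streak = streak + 1 if s > 0 else 0
        if streak == 0:
            states.append("ok")
        elif streak < persist_after:
            states.append("alarm")
        else:
            states.append("persistent")
    return states


intermittent = classify([0, 0, 1, 0])   # alarm raised once, then cleared
persistent = classify([1, 1, 1, 1])     # alarm escalates after 3 samples
```

With 5m samples, `persist_after=3` means a problem has to last about 15 minutes before it is treated as persistent; the right value depends on the data ranges defined earlier.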