@binarytemple
Forked from angrycub/MonitoringRiak.md
Created October 17, 2017 10:22
Additional Information about Monitoring Riak

Riak is a complex system with many moving parts to monitor: the health of the hardware, the health of the software, and the responsiveness of the network. This document lists the metrics, thresholds, and log strings that should cause your monitoring system to raise an alarm.

System metrics to monitor

Metric            Threshold
CPU               75% * number of cores
Memory            70% - buffers
Disk Space        75%
Network           70% sustained
File Descriptors  75% of ulimit
Swap              > 0 KB
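
As a rough illustration of how two of these thresholds might be checked from a script, here is a minimal Python sketch. It assumes the CPU row can be read as "1-minute load average above 0.75 per core", which is one possible interpretation rather than something this document states, and it assumes a Unix host. The function names and the `data_dir` parameter are hypothetical.

```python
import os
import shutil

# Thresholds from the table above, expressed as fractions.
CPU_LOAD_PER_CORE = 0.75
DISK_USED_MAX = 0.75


def cpu_load_limit(cores):
    """Load-average ceiling: 0.75 per core (an assumed reading of the CPU row)."""
    return CPU_LOAD_PER_CORE * cores


def disk_used_fraction(path):
    """Fraction of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def check_system(data_dir="/", cores=None, load=None):
    """Return a list of alarm strings; an empty list means nothing tripped."""
    cores = cores or os.cpu_count() or 1
    if load is None:
        load = os.getloadavg()[0]  # 1-minute load average (Unix only)
    alarms = []
    if load > cpu_load_limit(cores):
        alarms.append(f"CPU load {load:.2f} exceeds {cpu_load_limit(cores):.2f}")
    used = disk_used_fraction(data_dir)
    if used > DISK_USED_MAX:
        alarms.append(f"Disk {data_dir} is {used:.0%} full")
    return alarms
```

In practice `data_dir` would point at the volume that holds the Riak data directory, and the remaining rows (memory, network, file descriptors, swap) would need platform-specific sources that are not shown here.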

Log Events to Alert On

String            Log File      Reason
eacces            console.log   File/Directory Permissions Issue
emfile            console.log   Exhausted File Handles
erofs             console.log   File System Mounted in Read-Only Mode
noproc            console.log   Unexpectedly Missing Process
undef             console.log   Missing/Incorrect Erlang Modules
system_limit      console.log   Erlang VM Resource Exhaustion
Compaction error  LevelDB LOGs  LevelDB Compaction Error
waiting           LevelDB LOGs  LevelDB Stalls
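
A log watcher for these strings can be as simple as a substring scan. The following Python sketch shows one way to do it. The function name `find_alerts` and the dictionary layout are illustrative choices, not part of Riak or of any monitoring tool.

```python
# Alert strings from the table above, mapped to the reason for the alert.
# "eacces" is the Erlang/POSIX spelling of the permissions error; as a
# substring it also matches the spelling "eaccess".
CONSOLE_LOG_ALERTS = {
    "eacces": "File/Directory Permissions Issue",
    "emfile": "Exhausted File Handles",
    "erofs": "File System Mounted in Read-Only Mode",
    "noproc": "Unexpectedly Missing Process",
    "undef": "Missing/Incorrect Erlang Modules",
    "system_limit": "Erlang VM Resource Exhaustion",
}

LEVELDB_LOG_ALERTS = {
    "Compaction error": "LevelDB Compaction Error",
    "waiting": "LevelDB Stalls",
}


def find_alerts(lines, alerts):
    """Yield (line_number, pattern, reason) for every pattern found in a line."""
    for number, line in enumerate(lines, start=1):
        for pattern, reason in alerts.items():
            if pattern in line:
                yield number, pattern, reason
```

Passing an open file object as `lines` works because file objects iterate line by line; the location of console.log and of the LevelDB LOG files depends on the installation, so no path is assumed here. Plain substring matching can produce false positives (for example "undef" inside "undefined"), so a production check would likely want tighter patterns.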

Riak Nagios Project

Located at http://github.com/basho-labs/riak_nagios. The provided checks include:

  • check_connection_pools
  • check_file_handle_count
  • check_leveldb_compaction
  • check_node
  • check_node_up
  • check_port_count
  • check_riak_kv_up
  • check_riak_repl

Relevant Blog Posts on Monitoring

Riak-Admin Stat (Exometer) [Riak 2.0+ only]

Listing all of the statistics and their values

$ riak-admin stat show '*.**'

Note: The single quotes are necessary to prevent the shell from interpreting the '*' as a globbing operator.
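
The quoting only matters when a shell parses the command line. A monitoring script that starts riak-admin directly can pass the pattern as a plain argument. Here is a minimal Python sketch of that idea. The helper names are hypothetical, and it assumes `riak-admin` is on the `PATH` of the user running the script.

```python
import subprocess


def stat_show_command(pattern):
    """Build the argument list for `riak-admin stat show <pattern>`."""
    # Each list element reaches riak-admin as one argument, untouched by any
    # shell, so the pattern needs no surrounding quotes here.
    return ["riak-admin", "stat", "show", pattern]


def stat_show(pattern, run=subprocess.run):
    """Run the command and return its standard output as text."""
    result = run(
        stat_show_command(pattern),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout
```

The `run` parameter exists only so the function can be exercised without a Riak node; calling `stat_show("*.**")` on a real node would execute the command shown above.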

Querying for a specific statistic

$ riak-admin stat show riak.riak_kv.vnode.gets  
 [riak,riak_kv,vnode,gets]: [{count,0},{one,0}]

Querying for child statistics

$ riak-admin stat show 'riak.riak_kv.vnode.gets.*'  
 [riak,riak_kv,vnode,gets,time]: [{n,0},{mean,0},{min,0},{max,0},{median,0},{50,0},{75,0},{90,0},{95,0},{99,0},{999,0}]

Note: The single quotes are necessary to prevent the shell from interpreting the '*' as a globbing operator.

Querying for a subtree

$ riak-admin stat show 'riak.riak_kv.vnode.gets.**'  
 [riak,riak_kv,vnode,gets]: [{count,0},{one,0}]
 [riak,riak_kv,vnode,gets,time]: [{n,0},{mean,0},{min,0},{max,0},{median,0},{50,0},{75,0},{90,0},{95,0},{99,0},{999,0}]

Note: The single quotes are necessary to prevent the shell from interpreting the '*' as a globbing operator.
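
To feed these values into a monitoring system, the output has to be parsed. The Python sketch below handles lines of the shape shown in the examples above. It assumes every datapoint value is numeric, which holds for the sample output here but may not hold for every statistic; non-numeric datapoints are skipped. The function name is hypothetical.

```python
import re

# Matches lines such as: [riak,riak_kv,vnode,gets]: [{count,0},{one,0}]
STAT_LINE = re.compile(r"^\s*\[([^\]]+)\]:\s*\[(.*)\]\s*$")
# Matches one {name,value} pair with a numeric value.
DATAPOINT = re.compile(r"\{(\w+),(-?\d+(?:\.\d+)?)\}")


def parse_stat_output(text):
    """Map dotted statistic names to {datapoint: value} dictionaries."""
    stats = {}
    for line in text.splitlines():
        match = STAT_LINE.match(line)
        if not match:
            continue  # prompt lines, blank lines, anything unrecognized
        name = match.group(1).replace(",", ".")
        stats[name] = {
            key: float(value)
            for key, value in DATAPOINT.findall(match.group(2))
        }
    return stats
```

Datapoint names such as `50` or `999` are kept as strings, so the 99th percentile of get time would be read as `stats["riak.riak_kv.vnode.gets.time"]["99"]`.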

Getting a list of all statistics and their type

$ riak-admin stat info -type '*.**'
 [riak,common,cpu_stats]: type = cpu
 [riak,common,mem_stats]: type = function
 [riak,common,memory_stats]: type = function
 [riak,riak_api,pbc_connects]: type = spiral
 [riak,riak_api,pbc_connects,active]: type = function
 [riak,riak_core,converge_delay]: type = duration
 [riak,riak_core,dropped_vnode_requests_total]: type = counter
 [riak,riak_core,gossip_received]: type = spiral
 [riak,riak_core,handoff_timeouts]: type = counter
 [riak,riak_core,ignored_gossip_total]: type = counter
 [riak,riak_core,rebalance_delay]: type = duration
 ...

This output could be used as input to a script that generates a CollectD configuration.
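
A first step for such a script might be grouping the statistic names by type, since each type usually needs different handling. The Python sketch below does only that. It does not emit CollectD configuration, because the right stanza depends on which CollectD plugin is used. The function name is hypothetical.

```python
import re
from collections import defaultdict

# Matches lines such as: [riak,common,cpu_stats]: type = cpu
TYPE_LINE = re.compile(r"^\s*\[([^\]]+)\]:\s*type\s*=\s*(\w+)\s*$")


def stats_by_type(text):
    """Group dotted statistic names by the type reported by `stat info -type`."""
    grouped = defaultdict(list)
    for line in text.splitlines():
        match = TYPE_LINE.match(line)
        if match:
            name = match.group(1).replace(",", ".")
            grouped[match.group(2)].append(name)
    return dict(grouped)
```

With the sample output above, `stats_by_type(output)["counter"]` would list the three `riak_core` counters, which a generator could then turn into whatever configuration the chosen plugin expects.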
