Riemann is a quite simple but powerful monitoring system. In its core it is an event processing framework allowing a user to handle an incoming event stream. Events are small objects that originate from a host, that has a service and metric value associated with it. Events may also have tags and a free text description. This simple data structure enables an application transmit both exception tracebacks and application metric data using the same mechanism.
Riemann has a few problems though. First, it is a single-machine application, only allowing vertical scaling. Secondly, there's no redundancy or availability solution. In this post we'll try to address the first issue.
The obvious solution is to have multiple Riemann instances running and then partition the
event stream based on the service
field of the event. Given a set of Riemann instances,
S, a partitioning router R can take events and forward them to an instance in S according
to consistent hashing on the key (service field).
Since configuration is code in Riemann this "partitioning router" can be a running instance
of Rieeman that has a set of tcp-client
s that is spreads events over.