
@warmfusion
Created October 5, 2015 14:17
failshell.io/sensu/2013/05/08/high-availability-sensu

High availability monitoring with Sensu

sensu | May 8, 2013

Redundancy. Availability. Scalability. These should sound familiar if you work on the web. Every system I build has to meet those three main criteria. I would also throw in manageability: if I can't manage it with Chef, I'm probably using the wrong tool for the job.

There's always been one exception though: my monitoring tool. Nagios. Zabbix. Zenoss. Shinken. I've used them all. Each of them has shortcomings when it comes to the four criteria listed above. Different ones for each.

So, a few months back, I went looking for something fresh. Something I could easily manage with Chef, because I tend to forget things and I wanted to automate our monitoring solution as much as possible. Don't get me wrong, there's nothing as good as developers for monitoring your stuff, but I like to know something is wrong before they show up at my desk. Even if it's only by 1-2 minutes ;)

And there it was: Sensu. Simple. Lightweight. Elegant. And to top it off, it’s written in Ruby and uses RabbitMQ. I was intrigued. After testing it, and building a proof of concept, my team and I decided to adopt it.

Like everything else, it started slow and simple. Then, a few days ago, I decided to build as much redundancy into it as possible. Turns out, I now have full HA on all components of my stack: Sensu (admin, server and API), RabbitMQ and Redis.

You'll find below how I got there, along with some gotchas. As with all my other posts, this assumes you know your way around a UNIX system. I'm not going to hold your hand; it's definitely NOT a copy/paste kinda tutorial.

Load balancing

Simple enough: 2 VMs running Keepalived and HAproxy. Keepalived only manages the VIP, through the magic of VRRP. HAproxy takes care of load balancing the different protocols at layers 4 and 7. Active/passive. Nothing fancy.
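
For reference, the VIP side of such a setup boils down to a small Keepalived VRRP block on each load balancer VM. A minimal sketch, assuming eth0 and a hypothetical VIP of 10.0.0.10 (the passive node gets state BACKUP and a lower priority):

vrrp_instance VI_1 {
    state MASTER                  # BACKUP on the passive VM
    interface eth0                # assumption: adjust to your NIC
    virtual_router_id 51
    priority 101                  # lower it (e.g. 100) on the passive VM
    advert_int 1
    virtual_ipaddress {
        10.0.0.10                 # hypothetical VIP that HAproxy binds to
    }
}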

RabbitMQ cluster

Again, real easy to achieve. Well documented and easily implemented. Set your HAproxy frontend/backend to tcp mode and that’s it. You can even go nuts and load balance the management web interface.
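
A minimal sketch of what that HAproxy configuration could look like, with hypothetical node names and IPs (5672 being the standard AMQP port):

frontend rabbitmq
    bind 10.0.0.10:5672
    mode tcp
    default_backend rabbitmq_nodes

backend rabbitmq_nodes
    mode tcp
    balance roundrobin
    server rabbit-01 1.2.3.6:5672 check
    server rabbit-02 1.2.3.7:5672 check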

NOTE: Make sure to replicate your queues across your cluster. See the RabbitMQ documentation on mirrored queues.
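
With RabbitMQ 3.x, queue replication is handled by a mirroring policy. Something along these lines, assuming the default vhost (adjust the vhost and pattern to wherever your Sensu queues live):

rabbitmqctl set_policy ha-all ".*" '{"ha-mode":"all"}'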

Sensu server

Easy peasy. Just deploy 2 nodes with the same configuration. The rest is automagic, all built in. Almost too easy :)

Sensu API

Same thing.

Sensu admin

Regular web app. Nothing to write home about :)

Redis cluster

Now, that was some good old nerd fun. Some hacking fun. I had to come up with something novel I'd never used before. I wish I could claim the idea as my own, but it's not :) In any case, for the first time in my life, I actually had to enable xinetd instead of disabling it. I was shocked!

You will need 2 VMs, Xinetd, and the redis-role.sh script referenced below.

NOTE: This is tested with HAproxy, but should work with any layer 7 HTTP load balancer.

First, deploy a master/slave Redis setup; you should easily find documentation online for that. Once you have that, deploy Redis Sentinel to monitor the nodes and promote the slave to master if the current master fails for some reason.
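
For completeness, the slave side of a basic master/slave pair is a single directive in redis.conf, pointing at the master from the Sentinel configuration below:

slaveof 1.2.3.4 6379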

Here’s my configuration file located at /etc/redis-sentinel.conf:

sentinel monitor redis-01 1.2.3.4 6379 1
sentinel down-after-milliseconds redis-01 1200
sentinel failover-timeout redis-01 2000
sentinel can-failover redis-01 yes
sentinel parallel-syncs redis-01 1

sentinel monitor redis-02 1.2.3.5 6379 2
sentinel down-after-milliseconds redis-02 1200
sentinel failover-timeout redis-02 2000
sentinel can-failover redis-02 yes
sentinel parallel-syncs redis-02 1

The values for down-after-milliseconds and failover-timeout are the lowest I could configure before getting into a flapping state with this setup.

Now, how does this all work? If redis-01 fails for any reason, Sentinel will reconfigure redis-02 to become the master. Since we're behind a load balancer, the application won't notice the downtime, or should at least be smart enough to reconnect.

So what about when redis-01 comes back online? Well, Sentinel will reconfigure it as a slave.

The problem is that, with all this, both instances are running (unless one is down), and HAproxy doesn't understand Redis' protocol, so it can't tell which is the master and which is the slave. It will forward packets to both, and some queries will fail because they end up on a read-only slave.

At that point, it looked like I had hit a wall. Or not, as it turned out. Read on to learn the secret sauce!

Xinetd

That pesky daemon. It's responsible for a lot of false positives when you scan a machine with a tool like Nessus. I always disabled it, or even uninstalled the package. Well, not in this case.

Turns out, you can configure Xinetd to listen on a port, and when it receives a connection, it can run a script for you. Super handy. Remember that script I mentioned earlier? We’ll be needing it now.

So, HAproxy doesn't understand Redis' protocol, but it understands HTTP very well. We trick it by having it monitor a port we set up with Xinetd; that port answers with HTTP/1.1 200 OK or HTTP/1.1 503 Service Unavailable depending on whether we're the master or a slave. That way, if we're a slave, as far as HAproxy is concerned the node is down and it won't route traffic to it. But if we're the master, we get all the traffic. Only one node is available at any given time, but there's another on standby, with all our current data, ready to take over within about 5 seconds.
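
On the HAproxy side, that translates into a TCP backend whose health check is an HTTP probe against the Xinetd port. A sketch using the IPs from the Sentinel configuration above, a hypothetical VIP of 10.0.0.10 and the check port 12345 configured in Xinetd below:

frontend redis
    bind 10.0.0.10:6379
    mode tcp
    default_backend redis_nodes

backend redis_nodes
    mode tcp
    option httpchk GET /
    server redis-01 1.2.3.4:6379 check port 12345 inter 1000
    server redis-02 1.2.3.5:6379 check port 12345 inter 1000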

Here’s my /etc/xinetd.d/redis as an example:

# default: on
# description: redis role check
service redischk
{
        flags           = REUSE
        socket_type     = stream
        port            = 12345
        wait            = no
        user            = nobody
        server          = /usr/sbin/redis-role.sh
        log_on_failure  += USERID
        disable         = no
        only_from       = 0.0.0.0/0
        per_source      = UNLIMITED
}
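
The original /usr/sbin/redis-role.sh isn't reproduced here, but the idea is simple: ask the local Redis for its role and answer the connection with a matching HTTP status line. A hedged reconstruction, assuming redis-cli is available on the node:

#!/bin/bash
# Hypothetical reconstruction of the redis-role.sh referenced above.
# Answers with HTTP 200 if the local Redis is a master, 503 otherwise.

ROLE=$(redis-cli -h 127.0.0.1 -p 6379 info replication 2>/dev/null | grep -c "role:master")

if [ "$ROLE" -eq 1 ]; then
    printf "HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\nContent-Length: 7\r\n\r\nmaster\n"
else
    printf "HTTP/1.1 503 Service Unavailable\r\nContent-Type: text/plain\r\nContent-Length: 6\r\n\r\nslave\n"
fi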

That's it. The only issue I had was when I migrated from the old Redis to the new Redis cluster: Sensu lost all its state. No biggie in our case. You could probably copy the data over to the cluster before switching if that's critical for you.
