@mlafeldt mlafeldt/x.md
Last active May 29, 2016


# Nomad Resilience Testing with Vagrant

Based on https://github.com/hashicorp/nomad/tree/master/demo/vagrant, I did some resilience testing to figure out how reliable Nomad's periodic jobs are.

## Test scenario

I used this minimal periodic job, which executes a shell script every minute:

```hcl
# example.nomad
job "example" {
  type        = "batch"
  datacenters = ["dc1"]

  group "example" {
    task "example" {
      driver = "raw_exec"

      config {
        command = "/bin/bash"
        args    = ["/vagrant/slack.sh"]
      }

      resources {
      }
    }
  }

  periodic {
    cron             = "* * * * *"
    prohibit_overlap = true
  }
}
```

The shell script sends a message to Slack:

```bash
#!/bin/bash
# slack.sh (hacked together from StackOverflow)

# Channel and username to post as (adjust to taste)
channel="#general"
username="nomad"

escapedText="Nomad $(date) $@ ${NOMAD_ALLOC_ID}"

webhook_url=https://hooks.slack.com/services/<...>

json="{\"channel\": \"$channel\", \"username\": \"$username\", \"icon_emoji\": \":ghost:\", \"attachments\": [{\"color\": \"danger\", \"text\": \"$escapedText\"}]}"

curl -s -d "payload=$json" "$webhook_url"
```
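Before wiring the script into Nomad, the payload construction can be sanity-checked standalone (a sketch; the channel, username, and fake alloc ID are placeholders, and nothing is posted to Slack):

```shell
#!/bin/bash
# Build the same JSON payload slack.sh sends, but only print it -- no webhook call.
channel="#test"
username="nomad"
NOMAD_ALLOC_ID="local-test"

escapedText="Nomad $(date) $NOMAD_ALLOC_ID"
json="{\"channel\": \"$channel\", \"username\": \"$username\", \"icon_emoji\": \":ghost:\", \"attachments\": [{\"color\": \"danger\", \"text\": \"$escapedText\"}]}"

echo "$json"
```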

For testing, I started a Nomad server and two clients (each agent blocks, so run the commands in separate terminals):

```shell
vagrant ssh -c '/vagrant/nomad agent -config /vagrant/server.hcl'
vagrant ssh -c '/vagrant/nomad agent -config /vagrant/client1.hcl'
vagrant ssh -c '/vagrant/nomad agent -config /vagrant/client2.hcl'
```

I then started the cron job inside the VM:

```shell
nomad run example.nomad
```

I used `watch nomad status` to monitor the execution.

## Failure modes

I tested the following failure modes:

- Server down (SIGINT)
- One client down (SIGINT)
- Both clients down (SIGINT)
- Server and clients down (SIGINT)
- VM restart via `vagrant reload`
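The SIGINT scenarios can be induced inside the VM with `pkill` (a sketch, assuming the agents are still running with the exact commands shown above; `pkill -INT` sends the same signal as Ctrl-C in the agent's terminal):

```shell
# Server down -- match the server agent's command line and send SIGINT
pkill -INT -f 'nomad agent -config /vagrant/server.hcl'

# One client down
pkill -INT -f 'nomad agent -config /vagrant/client1.hcl'

# Both clients down: also stop the second client
pkill -INT -f 'nomad agent -config /vagrant/client2.hcl'

# VM restart (run on the host, not inside the VM)
vagrant reload
```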

I also toggled the `prohibit_overlap` setting between true and false to test whether Nomad actually prohibits/allows overlapping executions of the job.

## Results

During resilience testing I unfortunately encountered a critical bug in Nomad causing successful batch jobs to be run again after restarting a stopped client. Luckily, the bug was fixed yesterday (no kidding). I did another round of testing using a local master build of Nomad. This time, the results were very satisfying, with Nomad picking up the periodic job every single time after starting the server and at least one client. The `prohibit_overlap` feature also worked as expected. 👍

Restarting the VM caused Nomad to forget all running jobs. After some investigation, I found the reason: the server was configured to write data to `/tmp/server1`, which doesn't survive reboots. After changing the `data_dir` setting to `/home/vagrant/server1`, the jobs survived rebooting too. 👍
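The fix boils down to one line in the server configuration (a sketch; the `server` block mirrors the single-server setup from the Nomad Vagrant demo this is based on):

```hcl
# server.hcl -- persist Nomad state across reboots
data_dir = "/home/vagrant/server1"  # was /tmp/server1, which is wiped on reboot

server {
  enabled          = true
  bootstrap_expect = 1
}
```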

All in all, local testing with Vagrant was a success.
