@angrycub
Last active June 9, 2021 00:26
Job Anti-Affinity in Action

Overview

This guide walks you through creating and running a job that demonstrates Nomad's job anti-affinity rules and, in clusters with memory-limited Nomad clients, node filtering based on resource exhaustion.

Sample Environment

  • One Nomad Server Node
  • Three Nomad Client Nodes
    • 768 MB RAM total (providing 761 MB RAM in nomad node-status -self)

Process

Create the sample job by running nomad init.

$ nomad init
Example job file written to example.nomad

Optionally, you can filter out all of the default job file's commentary with the following command:

$ grep -v -e "^\s*#" example.nomad | grep -v -e '^[[:space:]]*$' > example2.nomad && mv example2.nomad example.nomad

This will create the sample job we use in the Nomad getting started guide, which spins up a Docker container running Redis.

We will want to change the count to a number higher than your Nomad client count. My sample cluster has three (3) client nodes, so I am going to set count = 5. This can be done with a text editor, but here is a sed one-liner that will make the change.

sed -i 's/count = 1/count = 5/g' example.nomad
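After the edit, the relevant group stanza of example.nomad should look roughly like this (abbreviated sketch; only the count line changed, and the rest of the default task body is elided):

```hcl
group "cache" {
  count = 5

  task "redis" {
    driver = "docker"
    # ... remainder of the default task stanza unchanged ...
  }
}
```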

Plan the job with nomad plan:

$ nomad plan example.nomad
+ Job: "example"
+ Task Group: "cache" (5 create)
  + Task: "redis" (forces create)

Scheduler dry-run:
- All tasks successfully allocated.

Job Modify Index: 0
To submit the job with version verification run:

nomad run -check-index 0 example.nomad

When running the job with the check-index flag, the job will only be run if the
server side version matches the job modify index returned. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.

This plan indicates that we will have 5 allocations scheduled when we run the job. Let's run nomad run example.nomad to submit the job to our cluster.

$ nomad run example.nomad 
==> Monitoring evaluation "0b415d85"
    Evaluation triggered by job "example"
    Allocation "19640ba9" created: node "05129072", group "cache"
    Allocation "2ea73fda" created: node "1dabfc7d", group "cache"
    Allocation "3f8ae2ea" created: node "ab58ba15", group "cache"
    Allocation "5782372b" created: node "ab58ba15", group "cache"
    Allocation "69b40fa5" created: node "1dabfc7d", group "cache"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "0b415d85" finished with status "complete"

As you can see, we have 5 new allocations; two nodes, "ab58ba15" and "1dabfc7d", each have two allocations of the job, while "05129072" has one.

We can see the anti-affinity rule's impact on scoring when we consider the allocations that are colocated on a single node. For the following commands, I will look at allocations "3f8ae2ea" and "5782372b", both of which landed on node "ab58ba15".

Running the nomad alloc-status command with the -verbose flag will provide the scoring information in the Placement Metrics section. For example:

$ nomad alloc-status -verbose 3f8ae2ea
ID                  = 3f8ae2ea-8d62-f31d-1c91-b64e8ee92595
Eval ID             = 0b415d85-741f-ea5b-e2d5-f44632b334a4
Name                = example.cache[2]
Node ID             = ab58ba15-6591-2b37-f9e8-4720bd07189a
Job ID              = example
Client Status       = running
Client Description  = <none>
Desired Status      = run
Desired Description = <none>
Created At          = 08/01/17 10:48:16 EDT
Evaluated Nodes     = 2
Filtered Nodes      = 0
Exhausted Nodes     = 0
Allocation Time     = 21.649µs
Failures            = 0

Task "redis" is "running"
Task Resources
CPU        Memory           Disk     IOPS  Addresses
3/500 MHz  988 KiB/256 MiB  300 MiB  0     db: 10.0.0.23:30265

Recent Events:
Time                   Type        Description
08/01/17 10:48:17 EDT  Started     Task started by client
08/01/17 10:48:16 EDT  Task Setup  Building Task Directory
08/01/17 10:48:16 EDT  Received    Task received by client

Placement Metrics
  * Score "ab58ba15-6591-2b37-f9e8-4720bd07189a.binpack" = 8.269191
  * Score "1dabfc7d-a92f-00f2-1cb6-0be3f000e542.binpack" = 12.803662
  * Score "1dabfc7d-a92f-00f2-1cb6-0be3f000e542.job-anti-affinity" = -20.000000

Here we can see that this allocation considered two equally weighted nodes, whose full UUID node IDs form the first element of each dotted attribute name. Node "1dabfc7d" is penalized for already running a copy of the job, which drops its total score (12.80 - 20 = -7.20) below the 8.27 of "ab58ba15", so the allocation was placed on "ab58ba15". Since the binpack scoring algorithm has a 0-18 scoring range, subtracting 20 is sufficient to ensure that any node already running the job scores lower than any node that is not.
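The ranking above can be sketched in a few lines. This is a simplified illustration of the scoring idea, not Nomad's actual implementation; the scores are taken from allocation "3f8ae2ea"'s Placement Metrics.

```python
# Simplified sketch of Nomad's node ranking (not the real scheduler code):
# each candidate node gets a binpack score, and any node already running
# an allocation of the same job takes a flat -20 job-anti-affinity penalty.

BINPACK_MAX = 18.0            # binpack scores fall roughly in 0..18
ANTI_AFFINITY_PENALTY = -20.0  # always outweighs the binpack range

def rank_nodes(binpack_scores, nodes_running_job):
    """Return candidate nodes ordered best-first by total score."""
    totals = {}
    for node, score in binpack_scores.items():
        if node in nodes_running_job:
            score += ANTI_AFFINITY_PENALTY  # existing copy of the job
        totals[node] = score
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# "1dabfc7d" already runs a copy of the job, so its higher binpack
# score (12.80 - 20 = -7.20) loses to "ab58ba15"'s 8.27.
ranked = rank_nodes(
    {"ab58ba15": 8.269191, "1dabfc7d": 12.803662},
    nodes_running_job={"1dabfc7d"},
)
print(ranked[0][0])  # ab58ba15
```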

Comparing the Placement Metrics of all of the running allocations:

Allocation "19640ba9"
Placement Metrics
  * Score "ab58ba15-6591-2b37-f9e8-4720bd07189a.binpack" = 12.803662
  * Score "ab58ba15-6591-2b37-f9e8-4720bd07189a.job-anti-affinity" = -20.000000
  * Score "05129072-6258-4ea6-79bf-03bd31418ac7.binpack" = 8.269191

Allocation "2ea73fda"
Placement Metrics
  * Score "1dabfc7d-a92f-00f2-1cb6-0be3f000e542.binpack" = 8.269191
  * Score "05129072-6258-4ea6-79bf-03bd31418ac7.binpack" = 8.269191

Allocation "3f8ae2ea"
Placement Metrics
  * Score "ab58ba15-6591-2b37-f9e8-4720bd07189a.binpack" = 8.269191
  * Score "1dabfc7d-a92f-00f2-1cb6-0be3f000e542.binpack" = 12.803662
  * Score "1dabfc7d-a92f-00f2-1cb6-0be3f000e542.job-anti-affinity" = -20.000000

Allocation "5782372b"
Placement Metrics
  * Resources exhausted on 1 nodes
  * Dimension "memory exhausted" exhausted on 1 nodes
  * Score "ab58ba15-6591-2b37-f9e8-4720bd07189a.binpack" = 12.803662
  * Score "ab58ba15-6591-2b37-f9e8-4720bd07189a.job-anti-affinity" = -20.000000
  * Score "05129072-6258-4ea6-79bf-03bd31418ac7.binpack" = 12.803662
  * Score "05129072-6258-4ea6-79bf-03bd31418ac7.job-anti-affinity" = -20.000000

Allocation "69b40fa5"
Placement Metrics
  * Score "1dabfc7d-a92f-00f2-1cb6-0be3f000e542.binpack" = 12.803662
  * Score "1dabfc7d-a92f-00f2-1cb6-0be3f000e542.job-anti-affinity" = -20.000000
  * Score "05129072-6258-4ea6-79bf-03bd31418ac7.binpack" = 12.803662
  * Score "05129072-6258-4ea6-79bf-03bd31418ac7.job-anti-affinity" = -20.000000

We can even see that when Nomad was scheduling allocation "5782372b", one of the nodes no longer had sufficient free memory to be considered a viable candidate for this job and was filtered out before scoring.
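This filter-then-score behavior can be sketched as follows. The free-memory figures here are hypothetical, chosen only to match the shape of the output above (a 761 MB client already running two 256 MB tasks has roughly 249 MB left); this is not Nomad's actual code.

```python
# Simplified sketch of feasibility filtering: nodes that cannot satisfy
# the task's memory ask are removed before any scoring happens, and the
# exhausted dimension is recorded for the Placement Metrics output.

def filter_feasible(nodes_free_mem_mb, required_mem_mb):
    """Split nodes into (feasible, exhausted) by free memory."""
    feasible, exhausted = [], []
    for node, free in nodes_free_mem_mb.items():
        (feasible if free >= required_mem_mb else exhausted).append(node)
    return feasible, exhausted

# Hypothetical snapshot: one client is down to ~249 MB free, so it is
# filtered out before scoring ("memory exhausted" on 1 node).
feasible, exhausted = filter_feasible(
    {"1dabfc7d": 249, "ab58ba15": 505, "05129072": 505},
    required_mem_mb=256,
)
print(exhausted)  # ['1dabfc7d']
```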

Resource Exhaustion and the Scheduler

If resource exhaustion prevents an allocation of the job from being scheduled, this will be noted in the output of both nomad run and nomad status for your job.

For example, if I attempt to run 7 copies of the example job in my sample cluster, I will not have enough uncommitted RAM to place a third copy of the job on any cluster node: each client reports 761 MB of RAM, and three 256 MB allocations would need 768 MB, so each node holds at most two copies and the cluster at most six. The scheduler knows this and will wait to place the remaining allocation until sufficient resources are available.

Evaluation "a23ca152" finished with status "complete" but failed to place all allocations:
    Task Group "cache" (failed to place 1 allocation):
      * Resources exhausted on 3 nodes
      * Dimension "memory exhausted" exhausted on 3 nodes
    Evaluation "f3ee745a" waiting for additional capacity to place remainder

or in the output of nomad status «job-id»:

$ nomad status example
...
Placement Failure
Task Group "cache":
  * Resources exhausted on 3 nodes
  * Dimension "memory exhausted" exhausted on 3 nodes
...

Hopefully this walkthrough clarifies job anti-affinity and resource filtering during the placement process.
