This guide walks you through creating and executing a job that demonstrates Nomad's job anti-affinity rules and, in clusters with memory-limited Nomad clients, filtering based on resource exhaustion.
- One Nomad server node
- Three Nomad client nodes, each with 768 MB RAM total (providing 761 MB RAM in nomad node-status -self)
Create the sample job by running nomad init.
$ nomad init
Example job file written to example.nomad
Optionally, you can filter out all of the default job file's commentary with the following command:
$ grep -v -e "^\s*#" example.nomad | grep -v -e '^[[:space:]]*$' > example2.nomad && mv example2.nomad example.nomad
This will create the sample job we use in the Nomad getting started guide which spins up a Docker instance running Redis.
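For reference, the portions of example.nomad that matter for this walkthrough look roughly like the following. This is a sketch reconstructed from the resource figures shown later in this guide (500 MHz CPU, 256 MiB memory); the exact file nomad init generates varies by Nomad version.

```hcl
job "example" {
  datacenters = ["dc1"]

  group "cache" {
    # The number of instances to run; we will raise this below.
    count = 1

    task "redis" {
      driver = "docker"

      config {
        image = "redis:3.2"
      }

      # These reservations drive both binpack scoring and
      # the memory-exhaustion filtering shown later.
      resources {
        cpu    = 500 # MHz
        memory = 256 # MiB
      }
    }
  }
}
```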
We will want to change the count to a number higher than your Nomad client count. My sample cluster has three (3) client nodes, so I am going to set count = 5. This can be done with a text editor, but here is a sed one-liner that will make the change:
sed -i 's/count = 1/count = 5/g' example.nomad
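If you want to confirm the substitution took, a quick grep works. The snippet below demonstrates the same sed edit on a throwaway stand-in file (a hypothetical path, so it can be run anywhere); the real command above targets your example.nomad:

```shell
# Fabricate a minimal stand-in with the default count (hypothetical file path)
printf 'group "cache" {\n  count = 1\n}\n' > /tmp/example-demo.nomad

# Same substitution as above, applied to the stand-in
sed -i 's/count = 1/count = 5/g' /tmp/example-demo.nomad

# Verify the change took effect
grep 'count' /tmp/example-demo.nomad
```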
Plan the job with nomad plan:
$ nomad plan example.nomad
+ Job: "example"
+ Task Group: "cache" (5 create)
+ Task: "redis" (forces create)
Scheduler dry-run:
- All tasks successfully allocated.
Job Modify Index: 0
To submit the job with version verification run:
nomad run -check-index 0 example.nomad
When running the job with the check-index flag, the job will only be run if the
server side version matches the job modify index returned. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.
This plan indicates that we will have 5 allocations scheduled when we run the job. Let's run nomad run example.nomad to execute the plan in our cluster.
$ nomad run example.nomad
==> Monitoring evaluation "0b415d85"
Evaluation triggered by job "example"
Allocation "19640ba9" created: node "05129072", group "cache"
Allocation "2ea73fda" created: node "1dabfc7d", group "cache"
Allocation "3f8ae2ea" created: node "ab58ba15", group "cache"
Allocation "5782372b" created: node "ab58ba15", group "cache"
Allocation "69b40fa5" created: node "1dabfc7d", group "cache"
Evaluation status changed: "pending" -> "complete"
==> Evaluation "0b415d85" finished with status "complete"
As you can see, we have 5 new allocations: two nodes, "ab58ba15" and "05129072", have two allocations of the job each, while "1dabfc7d" has one.
We can see the anti-affinity rule's impact on scoring when we consider the allocations that are colocated on a single node. For the following commands I will look at allocations "3f8ae2ea" and "5782372b", both placed on node "ab58ba15".
Running the nomad alloc-status command with the -verbose flag will provide the scoring information in the Placement Metrics section. For example:
$ nomad alloc-status -verbose 3f8ae2ea
ID = 3f8ae2ea-8d62-f31d-1c91-b64e8ee92595
Eval ID = 0b415d85-741f-ea5b-e2d5-f44632b334a4
Name = example.cache[2]
Node ID = ab58ba15-6591-2b37-f9e8-4720bd07189a
Job ID = example
Client Status = running
Client Description = <none>
Desired Status = run
Desired Description = <none>
Created At = 08/01/17 10:48:16 EDT
Evaluated Nodes = 2
Filtered Nodes = 0
Exhausted Nodes = 0
Allocation Time = 21.649µs
Failures = 0
Task "redis" is "running"
Task Resources
CPU Memory Disk IOPS Addresses
3/500 MHz 988 KiB/256 MiB 300 MiB 0 db: 10.0.0.23:30265
Recent Events:
Time Type Description
08/01/17 10:48:17 EDT Started Task started by client
08/01/17 10:48:16 EDT Task Setup Building Task Directory
08/01/17 10:48:16 EDT Received Task received by client
Placement Metrics
* Score "ab58ba15-6591-2b37-f9e8-4720bd07189a.binpack" = 8.269191
* Score "1dabfc7d-a92f-00f2-1cb6-0be3f000e542.binpack" = 12.803662
* Score "1dabfc7d-a92f-00f2-1cb6-0be3f000e542.job-anti-affinity" = -20.000000
Here we can see that this allocation considered two equally weighted nodes whose full UUID node names are the first element of the dotted attribute name. We can also see that both nodes are being penalized for having an existing copy of the job running on them. Since the binpack scoring algorithm has a 0-18 scoring range, subtracting 20 is sufficient to cause any node already running a job to have a lower score than one that is not.
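To make the arithmetic concrete, here is the net score each candidate node ends up with in the output above (a simple sum of its binpack score and any anti-affinity penalty):

```shell
# Node 1dabfc7d: better binpack score, but already running a copy of the job,
# so the -20 anti-affinity penalty pushes it below zero
awk 'BEGIN { printf "1dabfc7d net: %.6f\n", 12.803662 - 20.000000 }'

# Node ab58ba15: lower binpack score but no penalty, so it wins the placement
awk 'BEGIN { printf "ab58ba15 net: %.6f\n", 8.269191 }'
```

This matches the allocation's actual placement on node "ab58ba15" shown in the Node ID field above.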
Comparing the Placement Metrics of all of the running allocations:
Allocation "19640ba9"
Placement Metrics
* Score "ab58ba15-6591-2b37-f9e8-4720bd07189a.binpack" = 12.803662
* Score "ab58ba15-6591-2b37-f9e8-4720bd07189a.job-anti-affinity" = -20.000000
* Score "05129072-6258-4ea6-79bf-03bd31418ac7.binpack" = 8.269191
Allocation "2ea73fda"
Placement Metrics
* Score "1dabfc7d-a92f-00f2-1cb6-0be3f000e542.binpack" = 8.269191
* Score "05129072-6258-4ea6-79bf-03bd31418ac7.binpack" = 8.269191
Allocation "3f8ae2ea"
Placement Metrics
* Score "ab58ba15-6591-2b37-f9e8-4720bd07189a.binpack" = 8.269191
* Score "1dabfc7d-a92f-00f2-1cb6-0be3f000e542.binpack" = 12.803662
* Score "1dabfc7d-a92f-00f2-1cb6-0be3f000e542.job-anti-affinity" = -20.000000
Allocation "5782372b"
Placement Metrics
* Resources exhausted on 1 nodes
* Dimension "memory exhausted" exhausted on 1 nodes
* Score "ab58ba15-6591-2b37-f9e8-4720bd07189a.binpack" = 12.803662
* Score "ab58ba15-6591-2b37-f9e8-4720bd07189a.job-anti-affinity" = -20.000000
* Score "05129072-6258-4ea6-79bf-03bd31418ac7.binpack" = 12.803662
* Score "05129072-6258-4ea6-79bf-03bd31418ac7.job-anti-affinity" = -20.000000
Allocation "69b40fa5"
Placement Metrics
* Score "1dabfc7d-a92f-00f2-1cb6-0be3f000e542.binpack" = 12.803662
* Score "1dabfc7d-a92f-00f2-1cb6-0be3f000e542.job-anti-affinity" = -20.000000
* Score "05129072-6258-4ea6-79bf-03bd31418ac7.binpack" = 12.803662
* Score "05129072-6258-4ea6-79bf-03bd31418ac7.job-anti-affinity" = -20.000000
We can even see that, when Nomad was scheduling allocation "5782372b", one of the nodes no longer had sufficient free memory to be considered a viable candidate for this job and was filtered out before scoring.
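The filtering follows directly from the numbers in this guide's setup. Assuming each client reports 761 MB schedulable memory and each allocation reserves 256 MB (as shown in the Task Resources output above), a node can hold at most two copies:

```shell
awk 'BEGIN {
  avail = 761   # MB schedulable per client (from nomad node-status -self)
  per   = 256   # MB reserved per allocation
  printf "allocations per node: %d\n", int(avail / per)
  printf "MB left after two:    %d\n", avail - 2 * per
}'
```

Since the 249 MB remaining after two allocations is less than the 256 MB a third would reserve, the node fails the memory dimension and is filtered.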
If resource exhaustion is causing an allocation of the job to fail to be scheduled, it will be noted in the output of nomad run and nomad status for your job.
For example, if I attempt to run 7 copies of the example job in my sample cluster, I will not have enough uncommitted RAM to place a third copy of the job on any cluster node. The scheduler knows this and will wait to place that allocation until sufficient resources are available.
Evaluation "a23ca152" finished with status "complete" but failed to place all allocations:
Task Group "cache" (failed to place 1 allocation):
* Resources exhausted on 3 nodes
* Dimension "memory exhausted" exhausted on 3 nodes
Evaluation "f3ee745a" waiting for additional capacity to place remainder
or in nomad status «job-id»:
$ nomad status example
...
Placement Failure
Task Group "cache":
* Resources exhausted on 3 nodes
* Dimension "memory exhausted" exhausted on 3 nodes
...
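The single unplaced allocation is exactly what the capacity math predicts. Assuming two allocations fit per node (as computed from the 761 MB / 256 MB figures earlier), three clients cap out at six copies:

```shell
awk 'BEGIN {
  nodes = 3; per_node = 2; requested = 7
  placeable = nodes * per_node
  printf "placeable: %d, unplaced: %d\n", placeable, requested - placeable
}'
```

That one unplaceable copy corresponds to the "failed to place 1 allocation" message above.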
Hopefully this walkthrough clarifies job anti-affinity and resource filtering during the placement process.