Created
November 3, 2017 18:33
Relaxed Agent state recovery
# Problem
Most experienced Mesos users will have encountered the following error message:
E1103 18:30:03.451825 12204 slave.cpp:6302] EXIT with status 1: Failed to perform recovery: Incompatible agent info detected.
------------------------------------------------------------
[...]
------------------------------------------------------------
To remedy this do as follows:
Step 1: rm -f /path/to/work_dir/meta/slaves/latest
This ensures agent doesn't recover old live executors.
Step 2: Restart the agent.
Whenever a mesos-agent process is started, it checks for checkpointed
information from its previous run by looking for a symlink called
`slaves/latest` in its state directory.
If this information exists but doesn't match the slave configuration that
was passed via command-line flags or the environment, the slave
refuses to start until either the state directory is erased or the flags
are changed to match the previous state.
In particular, the hostname, port, resources, attributes and domain of the agent
are required to be the same. This can make automated deployments of Mesos
configuration all but impossible, since any change in the agent configuration
can require administrators to manually visit each host and remove the state
directory.
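As a rough illustration of the check described above, the following Python sketch looks for the `latest` symlink and compares the fields that are required to match. It is not Mesos code: the function names and the dictionary representation of the agent info are assumptions made for this example.

```python
import os

# Fields that currently have to match between the checkpointed and the new
# agent info (taken from the list in the text above).
CHECKED_FIELDS = ("hostname", "port", "resources", "attributes", "domain")


def has_checkpointed_state(work_dir):
    """Return True if the `meta/slaves/latest` symlink exists under work_dir."""
    return os.path.islink(os.path.join(work_dir, "meta", "slaves", "latest"))


def mismatched_fields(checkpointed, current):
    """Return the names of all checked fields whose values differ.

    Both arguments are plain dicts standing in for the agent info
    (a hypothetical representation chosen for this sketch).
    """
    return [f for f in CHECKED_FIELDS if checkpointed.get(f) != current.get(f)]
```

With today's behaviour, any non-empty result from `mismatched_fields` corresponds to the agent refusing to start.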
The objective of this document is to provide users with a way to relax this
check by adding one or two new command-line flags to the mesos-agent binary.
# Design Considerations
There are several closely related questions that must be considered:
1) Should agents be allowed to change some of their persistent state while keeping the same agent id?
2) Should agents be allowed to discard the old slave state without user intervention?
3) Should tasks be allowed to keep running after their slave changed parts of its state?
The current Mesos implementation answers (1) and (2) with "yes, sometimes",
and (3) with "no". In particular, agents are allowed to recycle older agent
ids after the machine was rebooted.
For (3), as far as I'm aware there are two main reasons for the current behaviour:
First, if we allow resources to be changed while a task is running, we might end up
with, e.g., a database that assumes it has x GiB of memory available being killed
for using just a fraction of that.
The second is the example of a hypothetical "secure" attribute:
A framework wants to run its tasks only on hosts marked "secure", but after the task
has started an admin changes the firewall rules so that the host is now connected to the public internet,
and therefore considered insecure, and restarts the agent.
If arbitrary attribute changes are permitted, the framework will happily continue running its task
on an insecure host.
However, it should be noted that this problem already exists in the current implementation:
Trivially, if the agent is not restarted after changing the firewall settings in our example,
the scheduler will happily continue to run its tasks on an insecure host. And even if the agent
is shut down and refuses to restart, it is up to the executor if and when it wants to kill the
task, a decision it would have to make without ever knowing that the attribute was changed.
# Proposal
## Agent changes
a)
Add a flag `--ignore_checkpointed_state=[resources|attributes|domains|none]` whose value
is a JSON array of strings.
The implementation would be straightforward: the only change required seems to be making
the comparison in slave.cpp more fine-grained, so that it can check whether the mismatches
are contained in the list of allowed exceptions, followed by writing the updated
state back to disk.
The default value would be `["domains"]`, since that is still new enough that no existing
frameworks should depend on the domain being constant.
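A minimal sketch of how the more fine-grained comparison from (a) could look, written in Python rather than the C++ of slave.cpp. Everything here is illustrative: the helper names, the dict-based agent info, the assumption that hostname and port always remain checked, and the treatment of `none` as "ignore nothing" are assumptions of this example and not part of the proposal text.

```python
import json

# Assumed for this sketch: hostname and port are never ignorable.
ALWAYS_CHECKED = ("hostname", "port")

# Maps values of the proposed flag to the (hypothetical) agent info field.
FLAG_TO_FIELD = {
    "resources": "resources",
    "attributes": "attributes",
    "domains": "domain",
}


def parse_ignore_flag(value='["domains"]'):
    """Parse the JSON array passed to the proposed flag into a set of fields."""
    names = json.loads(value)
    if not isinstance(names, list) or not all(isinstance(n, str) for n in names):
        raise ValueError("expected a JSON array of strings")
    unknown = set(names) - set(FLAG_TO_FIELD) - {"none"}
    if unknown:
        raise ValueError("unknown value(s): " + ", ".join(sorted(unknown)))
    # "none" is treated as "ignore nothing" (an assumption of this sketch).
    return {FLAG_TO_FIELD[n] for n in names if n != "none"}


def blocking_mismatches(checkpointed, current, ignored_fields):
    """Return the mismatched fields that are NOT covered by the exceptions."""
    fields = ALWAYS_CHECKED + tuple(FLAG_TO_FIELD.values())
    return [
        f for f in fields
        if f not in ignored_fields and checkpointed.get(f) != current.get(f)
    ]
```

If `blocking_mismatches` returns an empty list, the agent would continue recovery and write the updated state back to disk; otherwise it would fail as it does today.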
b)
When running an old master together with an agent that already implements the proposed
changes, it can happen that the master doesn't correctly update its information about
the slave state after reregistration, and sends offers to frameworks that cannot actually
be fulfilled. (As explained above, that can even happen right now, but is much harder to
trigger.)
Therefore, a warning should be prominently displayed stating that by allowing the slave to ignore
parts of the checkpointed slave state, the user understands and accepts that the above situation
can occur.
If this turns out not to be enough, a new check should be added: if one of
the proposed options is set to a non-default value and the master version is too old,
the agent refuses to connect and exits with an appropriate error message.
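The version check mentioned in (b) could be sketched as follows. This is a hypothetical Python illustration: the function names are invented, the minimum master version is deliberately left as a parameter because the proposal does not name one, and version strings are assumed to be plain dotted numbers without suffixes.

```python
def parse_version(text):
    """Turn a dotted version string such as "1.4.0" into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def may_connect(master_version, relaxed_options_in_use, min_master_version):
    """Decide whether the agent should connect to the given master.

    relaxed_options_in_use: True if one of the proposed flags was set to a
    non-default value.
    min_master_version: oldest master version that handles changed agent
    state correctly (not specified by the proposal; supplied by the caller).
    """
    if not relaxed_options_in_use:
        return True
    return parse_version(master_version) >= parse_version(min_master_version)
```

Comparing tuples of integers instead of raw strings avoids ordering "1.10.0" before "1.5.0".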
c)
Additionally, when the slave detects that a running task had its environment changed
in such a manner, it will generate a TASK_RUNNING update with reason
REASON_SLAVE_STATE_CHANGED, and the data field of this update will contain the
new SlaveInfo. Note that the scheduler is only guaranteed to receive this update if
slave checkpointing is enabled on the agents.
Most likely we should additionally require a new framework capability SLAVE_STATE_OBSERVER
in order to send these messages, to avoid generating a lot of network
traffic for frameworks which are not prepared to handle this anyway.
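To make (c) more concrete, here is a small Python sketch of how the updates could be generated and filtered by capability. The dict layout of tasks and updates, the function name, and the JSON encoding of the new SlaveInfo in the data field are assumptions of this example; only the names TASK_RUNNING, REASON_SLAVE_STATE_CHANGED and SLAVE_STATE_OBSERVER come from the proposal.

```python
import json


def state_changed_updates(running_tasks, framework_capabilities, new_slave_info):
    """Build one status update per running task of an opted-in framework.

    running_tasks: list of dicts with "task_id" and "framework_id" keys.
    framework_capabilities: dict mapping framework id to a set of capability names.
    new_slave_info: dict standing in for the new SlaveInfo.
    """
    updates = []
    for task in running_tasks:
        capabilities = framework_capabilities.get(task["framework_id"], set())
        if "SLAVE_STATE_OBSERVER" not in capabilities:
            # Frameworks that did not opt in receive no extra traffic.
            continue
        updates.append({
            "task_id": task["task_id"],
            "state": "TASK_RUNNING",
            "reason": "REASON_SLAVE_STATE_CHANGED",
            # Serialising as JSON is an assumption made for this sketch.
            "data": json.dumps(new_slave_info, sort_keys=True),
        })
    return updates
```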
d)
Add a boolean flag `--recover_state=true|false` that controls the behaviour of
the agent when encountering an existing state directory:
- true: current behaviour, only restart if current and previous slave states match
- false: keep the slave id, replace the old state and kill all running tasks. (This is the current behaviour the first time the agent is started after the host was rebooted.)
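The two settings of the proposed flag can be summarised as a small decision function. This Python sketch only mirrors the list above; the function name and the returned action strings are made up for illustration.

```python
def recovery_action(recover_state, state_dir_exists, states_match):
    """Return the action the agent would take on startup under proposal (d)."""
    if not state_dir_exists:
        # Nothing was checkpointed, so there is nothing to recover or replace.
        return "start_fresh"
    if recover_state:
        # Current behaviour: only restart if old and new slave states match.
        return "recover" if states_match else "refuse_to_start"
    # --recover_state=false: keep the slave id, replace the old state and
    # kill all running tasks, regardless of whether the states match.
    return "keep_id_replace_state_kill_tasks"
```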
Note that (d) can be implemented almost completely independently from (a), (b) and (c), which
form a group. It could therefore be implemented as a separate step. It solves a closely related
but still separate problem, and would likely be easier for administrators to use.
Any of these changes requires the corresponding changes in the master described below.
## Master changes
On the master, the code path that handles agent failover already allows the agent
to change its version and capability bits. This needs to be extended such that the
new slave state fully replaces the previous one.
In particular, it must be verified that the allocator can handle a slave whose
tasks consume more than 100% of the total available resources.
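The over-commit case can be illustrated with a short Python sketch. It does not describe how the Mesos allocator is actually implemented; it only shows the property that would need to hold, namely that the resources still available for offers never become negative when an agent's total shrinks below what its tasks already use. The function name and the flat dict representation of resources are assumptions.

```python
def available_resources(total, used):
    """Split total minus used into what can be offered and what is over-committed.

    total, used: dicts mapping a resource name (e.g. "cpus") to a number.
    Returns (available, overcommitted); available values are never negative.
    """
    available = {}
    overcommitted = {}
    for name in set(total) | set(used):
        diff = total.get(name, 0) - used.get(name, 0)
        available[name] = max(diff, 0)
        if diff < 0:
            overcommitted[name] = -diff
    return available, overcommitted
```

For example, an agent that re-registers with 2 cpus while its tasks still hold 4 would have 0 cpus available and 2 cpus over-committed.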
Curiously, the code path that handles master failover already does what is needed:
When a slave that was found in the replicated log ("recovered") reregisters, the old
slave info is deleted and the *new* slave info is stored in the list of active slaves,
without verifying that they match. Most likely it was implemented like this to