Relaxed Agent state recovery
# Problem
Most experienced Mesos users will have encountered the following error message:

```
E1103 18:30:03.451825 12204 slave.cpp:6302] EXIT with status 1: Failed to perform recovery: Incompatible agent info detected.
------------------------------------------------------------
[...]
------------------------------------------------------------
To remedy this do as follows:
Step 1: rm -f /path/to/work_dir/meta/slaves/latest
        This ensures agent doesn't recover old live executors.
Step 2: Restart the agent.
```
Whenever a mesos-agent process is started, it checks for checkpointed
information from its previous run by looking for a symlink called
`slaves/latest` in its state directory.
If this information exists but doesn't match the slave configuration that
was passed via command-line flags or the environment, the slave will
detect this situation and refuse to start until either the state directory
is erased or the flags are changed to match the previous state.
In particular, the hostname, port, resources, attributes, and domain of the
agent are required to be the same. This can make automated deployment of
Mesos configuration changes all but impossible, since any change to the
agent configuration can require administrators to manually visit each host
and remove the state directory.
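To illustrate the current all-or-nothing behaviour, here is a small,
self-contained sketch; `AgentInfo` and its fields are simplified stand-ins
for `mesos::SlaveInfo`, not the actual Mesos types:

```cpp
// Illustration of the recovery check described above. Any mismatch
// between the checkpointed and the freshly configured info aborts
// recovery, regardless of which field changed.
#include <iostream>
#include <string>

struct AgentInfo {
  std::string hostname;
  int port;
  std::string resources;
  std::string attributes;
  std::string domain;

  bool operator==(const AgentInfo& other) const {
    return hostname == other.hostname && port == other.port &&
           resources == other.resources && attributes == other.attributes &&
           domain == other.domain;
  }
};

int main() {
  // Checkpointed info, recovered via the `slaves/latest` symlink.
  AgentInfo checkpointed{"host1", 5051, "cpus:4;mem:8192", "rack:a", "dc1"};
  // Info built from the current command-line flags / environment.
  AgentInfo current{"host1", 5051, "cpus:8;mem:8192", "rack:a", "dc1"};

  if (!(checkpointed == current)) {
    std::cerr << "EXIT with status 1: Failed to perform recovery: "
                 "Incompatible agent info detected." << std::endl;
    return 1;
  }
  return 0;
}
```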
The objective of this document is to provide users with a way to relax this
check, by adding one or two new command-line flags to the mesos-agent binary.
# Design Considerations
There are actually several closely related questions that must be considered:
1) Should agents be allowed to change some of their persistent state while keeping the same agent id?
2) Should agents be allowed to discard the old slave state without user intervention?
3) Should tasks be allowed to keep running after their slave changed parts of its state?
In the current Mesos implementation, the answers to (1) and (2) are "yes,
sometimes", and the answer to (3) is "no". In particular, agents are allowed
to recycle older agent ids after the machine was rebooted.
For (3), as far as I'm aware there are two main reasons for the current
behaviour.
First, if we allow resources to be changed while a task is running, we might
end up with, e.g., a database that assumes it has x GiB of memory available
being killed for using just a fraction of that.
The other is the example of a hypothetical "secure" attribute:
a framework wants to run its tasks only on hosts marked "secure", but after
the task has started, an admin changes the firewall rules so that the host is
now connected to the public internet, and therefore considered insecure, and
restarts the agent.
If arbitrary attribute changes are permitted, the framework will happily
continue running its task on an insecure host.
However, it should be noted that this problem already exists in the current
implementation: trivially, if the agent is not restarted after changing the
firewall settings in our example, the scheduler will happily continue to run
its tasks on an insecure host. And even if the agent is shut down and refuses
to restart, it is up to the executor if and when it kills the task, which it
would have to do without ever knowing that the attribute was changed.
# Proposal
## Agent changes
a)
Add a flag `--ignore_checkpointed_state=[resources|attributes|domains|none]`
whose value is a JSON array of strings.
The implementation would be straightforward: the only required change seems
to be making the comparison in slave.cpp fine-grained enough to check whether
all mismatches are contained in the list of allowed exceptions, followed by
writing the updated state back to disk.
The default value would be `["domains"]`, since domains are still new enough
that no existing frameworks should depend on the domain being constant.
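A minimal sketch of what that fine-grained comparison could look like; the
field set, names, and `AgentInfo` type are illustrative stand-ins, not the
actual `mesos::SlaveInfo` API:

```cpp
// Sketch of the fine-grained comparison proposed for (a). Mesos flag
// parsing and checkpointing are omitted.
#include <algorithm>
#include <set>
#include <string>

struct AgentInfo {
  std::string hostname;
  int port;
  std::string resources;
  std::string attributes;
  std::string domain;
};

// Collect the names of all fields that differ between the checkpointed
// and the freshly configured agent info.
std::set<std::string> mismatchedFields(
    const AgentInfo& checkpointed, const AgentInfo& current) {
  std::set<std::string> diff;
  if (checkpointed.hostname != current.hostname) diff.insert("hostname");
  if (checkpointed.port != current.port) diff.insert("port");
  if (checkpointed.resources != current.resources) diff.insert("resources");
  if (checkpointed.attributes != current.attributes) diff.insert("attributes");
  if (checkpointed.domain != current.domain) diff.insert("domain");
  return diff;
}

// Recovery is allowed iff every mismatch is listed in
// `--ignore_checkpointed_state`; on success, the updated state would
// then be written back to disk.
bool recoveryAllowed(const std::set<std::string>& mismatches,
                     const std::set<std::string>& ignored) {
  return std::includes(
      ignored.begin(), ignored.end(), mismatches.begin(), mismatches.end());
}
```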
b)
When running an old master together with an agent that already implements the
proposed changes, it can happen that the master doesn't correctly update its
information about the slave state after reregistration, and sends offers to
frameworks that cannot actually be fulfilled. (As explained above, this can
even happen right now, but it is much harder to trigger.)
Therefore, a warning should be prominently displayed stating that by allowing
the slave to ignore parts of the checkpointed state, the user understands and
accepts that the above situation can occur.
If this turns out not to be enough, a new check should be added: if one of the
proposed options is set to a non-default value and the master version is too
old, the agent will refuse to connect and exit with an appropriate error
message.
c)
Additionally, when the slave detects that a running task had its environment
changed in such a manner, it will generate a TASK_RUNNING update with reason
REASON_SLAVE_STATE_CHANGED, and the data field of this update will contain
the new SlaveInfo. Note that the scheduler is only guaranteed to receive this
update if checkpointing is enabled on the agent.
Most likely we should additionally require a new framework capability,
SLAVE_STATE_OBSERVER, in order to send these messages, to avoid generating a
lot of network traffic for frameworks that are not prepared to handle them
anyway.
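A sketch of how the agent could construct such an update follows;
REASON_SLAVE_STATE_CHANGED is the *proposed* new enum value and does not
exist in mesos.proto today, and `task` and `updatedSlaveInfo` are assumed to
be in scope:

```cpp
// Hypothetical construction of the proposed status update. The reason
// REASON_SLAVE_STATE_CHANGED would have to be added to the
// TaskStatus::Reason enum in mesos.proto; the other fields follow the
// existing TaskStatus message.
TaskStatus status;
status.mutable_task_id()->CopyFrom(task.task_id());
status.set_state(TASK_RUNNING);
status.set_reason(TaskStatus::REASON_SLAVE_STATE_CHANGED);  // proposed

// The data field carries the serialized new SlaveInfo, so frameworks
// with the proposed SLAVE_STATE_OBSERVER capability can inspect the
// changed resources, attributes, or domain.
status.set_data(updatedSlaveInfo.SerializeAsString());
```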
d)
Add a boolean flag `--recover_state=true|false` that controls the behaviour of
the agent when it encounters an existing state directory (a rough sketch of
the two code paths follows below):
- true: current behaviour; only restart if the current and previous slave states match.
- false: keep the slave id, replace the old state, and kill all running tasks. (This is the current behaviour the first time the agent is started after the host was rebooted.)
Note that (d) can be implemented almost completely independently from (a),
(b), and (c), which form a group; it could therefore be implemented as a
separate step. It solves a closely related but still separate problem, and
would likely be easier for administrators to use.
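The sketch below illustrates the proposed branching; the helper functions are
hypothetical stand-ins for Mesos internals, not real slave.cpp functions:

```cpp
// Sketch of the proposed --recover_state handling at agent startup.
#include <cstdlib>
#include <iostream>

bool statesMatch();              // compare checkpointed vs. current SlaveInfo
void recoverOldState();          // today's recovery path
void replaceStateAndKillTasks(); // keep the slave id, drop everything else

void startAgent(bool recover_state) {
  if (recover_state) {
    // --recover_state=true: current behaviour, refuse to start on mismatch.
    if (!statesMatch()) {
      std::cerr << "Failed to perform recovery: "
                   "Incompatible agent info detected." << std::endl;
      std::exit(1);
    }
    recoverOldState();
  } else {
    // --recover_state=false: keep the slave id, replace the old state,
    // and kill all running tasks (as on first start after a reboot).
    replaceStateAndKillTasks();
  }
}
```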
Each of these changes requires the corresponding master-side changes described below.
## Master changes
On the master, the code path that handles agent failover already allows the
agent to change its version and capability bits. This needs to be extended so
that the new slave state fully replaces the previous one.
In particular, it must be verified that the allocator can handle a slave where
tasks consume more than 100% of its total available resources.
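A rough sketch of the master-side replacement on reregistration; the `slaves`
map and `SlaveInfo` struct here are simplified stand-ins for the master's
actual bookkeeping:

```cpp
#include <map>
#include <string>

// Simplified stand-in for mesos::SlaveInfo.
struct SlaveInfo {
  std::string hostname;
  std::string resources;
  std::string attributes;
};

// Simplified stand-in for the master's registry of active agents,
// keyed by slave id.
std::map<std::string, SlaveInfo> slaves;

// On reregistration, store the new info wholesale instead of requiring
// it to match the previous one; the allocator must then cope with
// tasks that may now exceed the new resource totals.
void reregisterSlave(const std::string& slaveId, const SlaveInfo& newInfo) {
  slaves[slaveId] = newInfo;
}
```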
Curiously, the code path that handles master failover already does what is
needed: when a slave that was found in the replicated log ("recovered")
reregisters, the old slave info is deleted and the *new* slave info is stored
in the list of active slaves, without verifying that they match. Most likely
it was implemented like this to