GN&C Fault Protection Fundamentals Notes
Link: https://pub-lib.jpl.nasa.gov/docushare/dsweb/Get/Document-316/08-031+GN%26C+Fault+Protection+Fundamentals.pdf
Fault tolerance vs. Variation tolerance
Variations - the changes to system behavior that are within the design of the system
Fault - departure from intended functionality
q - As we model a system, can the architecture be changed to turn previous faults into variations?
As a system scales, do previous faults, such as hardware failures, become variations due to their frequency?
As fault protection systems grow, they can become the source of failure (think Oracle RAC or Cisco Spanning Tree). The
increased complexity and ad hoc nature of fault protection mechanisms are signs of a loss of architectural integrity.
Definitions
States - What can change in a system.
Behaviors - the rules that determine which "histories of system state over time are possible"; how the set of available
states can change.
q - Should I think of a state as a collection of sets and the behaviors as a set of functions that can operate on each of
the state sets?
Scope - Which states and behaviors are considered inside the system architecture and which are outside of it.
Objectives - The set of acceptable behaviors. A system failure is a violation of the system's objectives.
Control - "The deliberate exercise of influence on an open system to achieve an objective".
q - If each possible state is a vertex in a graph, would behaviors be the edges in a directed graph connecting the states?
If so, can control be seen as a path finding exercise moving from the current state to the objective state?
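A toy sketch of that graph idea (mine, not the paper's), in Python: states as vertices, behaviors as directed edges, and
control as a breadth-first path search from the current state to the objective state. The states here are made up.

```python
from collections import deque

# behaviors: each entry maps a state to the states reachable by one behavior (invented)
behaviors = {
    "tumbling":      ["detumbled"],
    "detumbled":     ["sun_pointed", "tumbling"],
    "sun_pointed":   ["earth_pointed"],
    "earth_pointed": [],
}

def control(current, objective):
    """Breadth-first search for a sequence of states leading to the objective."""
    queue = deque([[current]])
    visited = {current}
    while queue:
        path = queue.popleft()
        if path[-1] == objective:
            return path                      # a plan: the states to pass through
        for nxt in behaviors.get(path[-1], []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None                              # no known behavior reaches the objective

print(control("tumbling", "earth_pointed"))
# ['tumbling', 'detumbled', 'sun_pointed', 'earth_pointed']
```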
Control System - A separate system which has knowledge about the system's states and behaviors and uses that knowledge to
influence (control) the system to reach its objectives. (q - The path finding algorithm?)
Control loops - The closed loop interaction between a control system and the system being designed. A system
operating with its only control coming from control loops is said to be goal based.
Cognizance - The knowledge about a system that a control system has in order to influence the system towards its
objectives. Through cognizance, a control system determines 'how' a system can reach its objectives.
An example of a simple control system -> system interaction:
"an objective presented to an attitude control system might have been to achieve a particular sun
sensor output voltage (corresponding to some angle)."
"In more sophisticated systems, it is common to see models and the state we care about appear explicitly"
Transparency of objectives - an objective on a closed loop system is a "model of desired behavior." Transparency is only
achieved when the objective defines all behaviors in a way where success or failure is obvious to both the issuer and the
control system.
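A rough sketch of what a transparent objective might look like in code (my own illustration, not from the paper): the
objective carries an explicit success test that the issuer and the control system both evaluate the same way. The
sun-pointing objective and the 1-degree threshold are invented.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Objective:
    name: str
    is_satisfied: Callable[[dict], bool]   # explicit model of desired behavior

# Issuer and control system share the same, unambiguous definition of success.
sun_point = Objective(
    name="point at sun",
    is_satisfied=lambda state: abs(state["sun_angle_deg"]) < 1.0,
)

state = {"sun_angle_deg": 0.4}
print(sun_point.is_satisfied(state))   # True -> objective met; False -> failure
```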
In a closed loop system, definition of the objectives is extremely important. A control system with a poorly defined
objective can lead to a system failure despite proper operation of all components.
While a well-defined closed loop system is preferable to an open loop system from a fault protection perspective,
operators will often prefer a transparent open loop system because most implemented closed loop systems are poorly defined
and opaque to the operator.
Contemporary fault monitoring systems often do not have knowledge of the objectives when a fault occurs. Fault protection
systems then make corrections based on predefined rules rather than on the objective directly. These rules are
created based on existing system biases, which leads to vulnerabilities when those biases go outside their expected
range - when the deviation moves beyond a variation into a fault.
q - Consider default Nagios monitoring of a system. It notifies when a disk is full or close to full, but has no knowledge
of why such a thing matters. Simplistic methods to correct such an error (automated log rotate, etc.) are very fragile as
soon as the cause of the full disk is outside of the expected set. Is this understanding correct?
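A sketch of the contrast I have in mind (not from the paper; the numbers are made up): a rule-based disk check versus a
check that asks whether the objective - keep accepting writes for the next hour - can still be met.

```python
def rule_based_check(disk_used_pct):
    # predefined rule: alert at 90%, with no knowledge of why that matters
    return "CRITICAL" if disk_used_pct >= 90 else "OK"

def objective_based_check(free_bytes, expected_write_bytes_per_hour):
    # objective: keep accepting writes for at least the next hour
    hours_of_headroom = free_bytes / expected_write_bytes_per_hour
    return "OK" if hours_of_headroom >= 1.0 else "CRITICAL"

print(rule_based_check(85))              # OK, even if writes will fill the disk in minutes
print(objective_based_check(2e9, 8e9))   # CRITICAL: under an hour of headroom left
```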
"There are two parts then to making objectives truly transparent: 1) be explicit about the existence and
full meaning (success versus failure) of every objective, and 2) give full responsibility for managing
objectives, including their failure, to the control system responsible for achieving them in the first place."
Transparency of models
Models provide an ideal state to compare the current state to. If the current state differs from the model's expectation, a
fault may have occurred and a correction must be made.
Two distinct types of error.
Control errors - a problem in meeting objectives. The control system does not know of a set of behaviors that will get from
the current state to an objective state. Requires a fault response: how to return to a state from which the objective can be reached.
Expectation errors - a problem in the knowledge of state. The current state does not match any of the states a control
system has a model for. Requires a change in knowledge by identifying how this state occurred. A possible response is the
diagnosis of a fault.
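A minimal sketch of the two error types (my own, reusing the toy state-graph idea from earlier; the state names are
invented):

```python
known_states = {"tumbling", "detumbled", "sun_pointed", "earth_pointed"}

def classify_error(observed_state, plan):
    # Expectation error: the observed state is outside the control system's model.
    if observed_state not in known_states:
        return "expectation error: update knowledge / diagnose how we got here"
    # Control error: the state is understood, but no plan reaches the objective.
    if plan is None:
        return "control error: issue a fault response to regain the objective"
    return "no error"

print(classify_error("spinning_unmodeled", plan=None))   # expectation error
print(classify_error("tumbling", plan=None))             # control error
```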
If a model is not transparent and a fault protection system is activated, the fault protection system may not know how to return
to 'normal' operations once the fault is corrected. How far along a behavior path did the system go before the fault? What
behavior actions remain? Is it safe to restart the behavior?
"Architectural features for coordinating system objectives" provide a way for a system to deal with competing objectives.
Methods of preventing failure propagation between systems
1. Contain the failure with margin - provide enough spare capacity that a failure will not take a system below required
functionality. (Have 4 power supplies when only 1 is necessary for operation.)
2. Report the loss of functionality to a central authority that adjusts the objectives to route around the failure and then
gives the new objectives to affected systems. This is known as the "safing" approach.
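A toy sketch of the two methods (mine, with invented numbers): absorb the failure with margin when possible, otherwise
report the loss upward so objectives can be re-planned ("safing").

```python
REQUIRED_SUPPLIES = 1      # functionality needs at least one working supply
installed_supplies = 4     # margin: three spares

def on_supply_failure(working):
    working -= 1
    if working >= REQUIRED_SUPPLIES:
        # 1. contained by margin: still above required functionality, nothing propagates
        return working, None
    # 2. "safing": report the loss so a central authority can adjust objectives
    return working, "report loss of function; await re-routed objectives"

working, action = on_supply_failure(installed_supplies)
print(working, action)   # 3 None -> absorbed by spare capacity
```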
Transparency of Knowledge
A control system can be split into two parts: determination of state and the decision of how to proceed from the current
state to reach the objectives.
Ideally the control system is kept as simple as possible, so that it contains only the knowledge of the controlled system's state and the set of the controlled system's objectives.
The collection of state knowledge is not a simple task. State changes over time, and there is an inherent time lag between when the state is collected/measured and when it reaches the control system. To deal with the time lag, a behavior model is applied to the measurement to determine the actual state. This is why transparency of models is so important: if the models are flawed or in conflict, the determination of state may be incorrect.
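A tiny sketch of the time-lag point (my own, invented values): a lagged measurement is propagated to 'now' with a simple
rate model, so a flawed model gives a flawed state estimate.

```python
def estimate_current_state(measured_value, measured_at, now, rate_model):
    """Propagate a lagged measurement forward using a simple rate model."""
    lag = now - measured_at
    return measured_value + rate_model * lag   # if rate_model is wrong, so is the estimate

# e.g. a tank level measured 30 s ago, with a modeled drain rate of 0.2 units/s
print(estimate_current_state(measured_value=50.0, measured_at=0.0, now=30.0,
                             rate_model=-0.2))   # 44.0
```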
Another difficulty with the determination of state arises when subsystems report conflicting measurements. Conflicting measurements require the control system to decide which measurements to rely on, and that decision will rest on biases about the system state, which may lead to failures. It is important to establish a single source of truth for each measurement.
As time passes, a determination of state becomes more uncertain unless further evidence is presented. The producer of the data should therefore provide a confidence value along with it. This should be done by the data producer, not the control system; the control system should not contain assumptions about how the evidence changes over time.
"For transparency of knowledge then, it is essential to 1) make knowledge explicit, 2) represent it clearly
with honest representation of its timeliness and uncertainty, and 3) strive for a single source of truth."
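A sketch of what "transparent knowledge" might look like as a data structure (my own; the field names are made up): the
producer stamps each piece of knowledge with its time and confidence, and each quantity has exactly one authoritative
source.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StateKnowledge:
    quantity: str       # e.g. "battery_voltage"
    value: float
    measured_at: float  # timeliness, asserted by the producer
    confidence: float   # uncertainty, asserted by the producer, not the consumer
    source: str         # single authoritative producer for this quantity

truth = {}  # one entry per quantity -> single source of truth

def publish(k: StateKnowledge):
    existing = truth.get(k.quantity)
    if existing and existing.source != k.source:
        raise ValueError(f"{k.quantity} already owned by {existing.source}")
    truth[k.quantity] = k

publish(StateKnowledge("battery_voltage", 27.9, measured_at=120.0,
                       confidence=0.95, source="eps_telemetry"))
```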
Transparency of Control
Control decisions should only be made based on the knowledge of system behavior, the system states, and the objectives.
Using additional outside information points to a flaw in either the assigned objectives (there are outside objectives that
the system should follow) or the defined states and behaviors (the control system must attempt to reconcile the two sources
of knowledge). If the control system makes assumptions about a system by reconciling an external data source it can lead to
failures and make identifying the cause of a failure more difficult.
As a control system performs actions on a system, those actions should become part of the state knowledge. The control
system should not base actions directly on previous actions, because doing so makes assumptions about the desired behavior.
q - Is changing a system simply another datapoint to be considered when trying to reach future objectives?
"Control transparency, is consequently fairly easy to achieve, simply by 1) avoiding any basis for
decision besides knowledge of objective, behavior, and state"
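A trivial sketch of that rule (mine): the decision function takes only state knowledge, the behavior model, and the
objective - nothing else, not even its own previous actions. The lookup-table behavior model is hypothetical.

```python
def decide(state, behavior_model, objective):
    if state == objective:
        return None                                   # objective already met, no action
    # consult nothing but modeled behavior and the objective
    return behavior_model.get((state, objective), "no known action -> control error")

behavior_model = {("detumbled", "sun_pointed"): "slew_to_sun"}
print(decide("detumbled", behavior_model, "sun_pointed"))   # slew_to_sun
```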
Fault protection
Fault protection actions should not be treated differently from any other control action. All actions should be made
towards meeting the system's objectives and fault protection actions are simply a subset of the available actions that
can be performed.
As system complexity increases, fault protection actions often become normal actions. If a system is continuously
encountering new states, there is little difference between fault protection actions and other actions to meet the
objective.
The goal of all actions should be the preservation of function to meet the objective.
Risk analysis and fault trees are part of the same behavior modeling exercise as modeling expected 'normal' behavior. A fault is nothing more than a state the system may be in due to a set of prior behaviors.
Fault protection should not be tested separately from normal system behavior. Test the system's ability to remain functional
throughout all of the possible states, regardless of whether those states are failure states or normal operational states.
q - Why doesn't Jenkins test how our monitoring system operates with our software?
Evaluate systems (including fault protection systems) based on their ability to reach the system's objectives. Specific
error -> response evaluation of a fault protection system is not as useful because it does not take into account how the
response leads to meeting the objectives.
A supervisory system may be necessary in fault protection to manage faults to the control systems themselves. For example, moving a control system on failing hardware to backup hardware.
Deal with faults as locally as possible. If the correction for a fault will result in the inability to complete objectives, the correction must be handled by a higher-level system so that the knowledge of the reduction in capabilities is available across systems.
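A rough sketch of the local-vs-escalate rule (my own, invented scenario and numbers): handle the fault locally only if the
fix leaves the objective achievable; otherwise escalate so the lost capability becomes shared knowledge.

```python
def handle_fault(spares_remaining, capability_after_local_fix, capability_required):
    if spares_remaining > 0 and capability_after_local_fix >= capability_required:
        return "handled locally: swap to a spare, objectives unaffected"
    # the local fix cannot preserve the objective, so the fault must move up a level
    return "escalate: higher-level control must re-plan objectives around the loss"

print(handle_fault(spares_remaining=1, capability_after_local_fix=0.8,
                   capability_required=0.9))
```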
"Diagnosis of a failure is merely the impartial assertion of new
knowledge regarding some failure to satisfy design functionality expectations."
Two general strategies when a control system knows that an error has occurred and there are multiple possibilities as to why
it occurred and how to fix it:
1. Reduce complexity in order to remove ambiguity. Fall back to a simpler system so that the error state, if it still exists, is easier to identify.
2. Perform actions that will generate more information about the error state. The result of the action provides more evidence for why the error occurred.
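A small sketch of the second strategy (mine, with invented fault hypotheses): pick a probing action whose outcome
distinguishes between competing explanations.

```python
hypotheses = {"sensor_failed", "actuator_failed"}

def probe_with_redundant_sensor(reading_agrees_with_primary):
    # an action chosen because its outcome discriminates between the hypotheses
    if reading_agrees_with_primary:
        return hypotheses - {"sensor_failed"}    # primary sensor looks healthy
    return hypotheses - {"actuator_failed"}      # readings disagree -> suspect the sensor

print(probe_with_redundant_sensor(reading_agrees_with_primary=True))
# {'actuator_failed'}
```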
Objectives should be as explicit and as complete as possible. If the objectives provided are not the true objectives, then fault protection can select the wrong actions to correct the problem.
"The art in fault protection is not always to be right, but rather to be wrong as painlessly as possible."
Make models as simple as possible.