
@hairyhum
Last active June 19, 2019 21:24

Cluster state on disk:

  1. Mnesia schema:
     • db_nodes - nodes of the schema: either disc nodes or nodes to which tables are replicated.
     • extra_db_nodes - configuration telling Mnesia which nodes to connect to on startup.
     • running_db_nodes - nodes Mnesia is currently connected to. [1]
     • table nodes - nodes on which tables are replicated: each table has a list of "all nodes" and a list of "active" nodes. "All nodes" is a subset of db_nodes; "active nodes" is a subset of running_db_nodes. In a way, db_nodes and running_db_nodes are the "all nodes" and "active nodes" of the schema table.
  2. nodes_running_at_shutdown - a list of the nodes that are currently running. Similar to running_db_nodes, but maintained by the node monitor: it is modified when a node starts, joins/leaves the cluster, or when the rabbit process stops on a node.
  3. cluster_nodes.config - two lists, one containing all clustered nodes and one containing the disc nodes. Modified when a node joins or leaves the cluster.
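The pieces of state above can be modeled as a toy sketch. This is my own Python simplification, not RabbitMQ code; the class and field names are made up:

```python
# Toy model (not RabbitMQ code) of the on-disk cluster status described
# above: all clustered nodes, the subset of disc nodes, and the set of
# currently running nodes. Join/leave mutate the first two; the running
# set is the analogue of nodes_running_at_shutdown.

class ClusterStatus:
    def __init__(self):
        self.all_nodes = set()   # every clustered node
        self.disc_nodes = set()  # subset of all_nodes that persist to disc
        self.running = set()     # nodes currently running

    def join(self, node, disc=True):
        self.all_nodes.add(node)
        if disc:
            self.disc_nodes.add(node)
        self.running.add(node)

    def leave(self, node):
        self.all_nodes.discard(node)
        self.disc_nodes.discard(node)
        self.running.discard(node)

status = ClusterStatus()
status.join("rabbit@a")
status.join("rabbit@b", disc=False)
assert status.disc_nodes == {"rabbit@a"}
assert status.all_nodes == {"rabbit@a", "rabbit@b"}
```

The invariant to notice is that disc_nodes is always a subset of all_nodes, mirroring how the two lists in cluster_nodes.config relate.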

Monitors:

  1. mnesia_monitor - a process linked to the corresponding monitors on all db_nodes.

  2. rabbit_node_monitor - monitors nodes (net_kernel:monitor_nodes/2) and rabbit processes on remote nodes.

  3. All the queues/channels/gm can monitor state across nodes.

Messages:

Common:

  • nodedown - a message from the Erlang built-in node monitor. Handled by mnesia_monitor to keep track of down nodes (it does not directly remove them from running_db_nodes), and by rabbit_node_monitor to track how many nodes are running for the pause_minority and pause_if_all_down strategies; it also triggers check_partial_partition.

  • nodeup - the counterpart of nodedown. Handled by mnesia_monitor to check the cluster status; this handler may emit an inconsistent_database event. rabbit_node_monitor just logs the event and does nothing else.
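A rough sketch of how a single nodedown fans out to the two monitors. This is illustrative Python with invented names; the real handlers are Erlang gen_server callbacks:

```python
# Toy fan-out of a nodedown event to the two interested processes
# described above (names are mine, not RabbitMQ's).

class MnesiaMonitor:
    def __init__(self):
        self.down = set()

    def handle_nodedown(self, node):
        # Tracks down nodes; does NOT remove them from running_db_nodes.
        self.down.add(node)

class RabbitNodeMonitor:
    def __init__(self, running):
        self.running = set(running)
        self.partition_checks = []

    def handle_nodedown(self, node):
        self.running.discard(node)
        # Would fan out check_partial_partition to the other running nodes.
        self.partition_checks.append(node)
        # The running count feeds pause_minority / pause_if_all_down.
        return len(self.running)

mm = MnesiaMonitor()
rm = RabbitNodeMonitor({"rabbit@a", "rabbit@b", "rabbit@c"})
mm.handle_nodedown("rabbit@c")
left = rm.handle_nodedown("rabbit@c")
assert "rabbit@c" in mm.down
assert left == 2
```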

mnesia_monitor:

  • EXIT signal from a linked mnesia_monitor - updates running_db_nodes and the active nodes for all tables.

rabbit_node_monitor:

  • notify_node_up - notifies all nodes in running_db_nodes (except self) by sending them a node_up message.

  • DOWN from a rabbit process - updates the cluster status (removes the stopped node); cleans up transient queues, listeners, and alarms; updates partition tracking (handle_dead_rabbit).

  • node_up (not to be confused with nodeup) - sent by the node monitor on a freshly started remote node to notify the cluster (in a boot step). Updates the cluster status, updates alarms, and removes the started node from the recoverable slaves of mirrored queues (handle_live_rabbit).

  • joined_cluster/left_cluster - update the cluster status.

  • {mnesia_system_event, {inconsistent_database, running_partitioned_network, Node}} - this message is treated as a reconnect after a partial partition. Updates alarms, removes the started node from the recoverable slaves of mirrored queues (handle_live_rabbit), and records the partitioned state. I'm not sure this is the right message to signal a reconnect: it may be emitted multiple times and does not necessarily mean that a node has rejoined.
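The rabbit_node_monitor messages above boil down to a message-to-actions mapping, sketched here as a plain dict. This structure is hypothetical; the real monitor dispatches in Erlang handle_cast/handle_info clauses:

```python
# Hypothetical summary of the rabbit_node_monitor dispatch described above:
# which incoming message triggers which actions. Action names follow the
# notes (handle_live_rabbit / handle_dead_rabbit are real function names;
# the rest are my shorthand).

ACTIONS = {
    "node_up":               ["update_cluster_status", "update_alarms",
                              "handle_live_rabbit"],
    "DOWN":                  ["update_cluster_status", "cleanup_transient",
                              "handle_dead_rabbit"],
    "joined_cluster":        ["update_cluster_status"],
    "left_cluster":          ["update_cluster_status"],
    "inconsistent_database": ["update_alarms", "handle_live_rabbit",
                              "record_partition"],
}

# Both the "a node came up" paths converge on handle_live_rabbit.
assert "handle_live_rabbit" in ACTIONS["node_up"]
assert "handle_live_rabbit" in ACTIONS["inconsistent_database"]
```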

Partial partition handling:

  • check_partial_partition: this message is sent by a node handling a nodedown message to all the running nodes except the sender and the node that is reported "down". The message contains the GUIDs of these two nodes.

    A node that receives this message checks whether the "down" node is actually down, both by checking its status (in the node_monitor data) and by sending an RPC request calling rabbit:is_running/0. If the "down" node is running, the "checker" node responds to the "reporter" node with a partial_partition message containing the "checker" node and the "down" node. The RPC request is sent from a one-off process.

    This feels dangerous intuitively and not that easy to reason about.

  • partial_partition: this message tells a node that there is a partial partition. It contains the "checker" node and the "not_really_down" node. On receiving it, the node monitor forcibly disconnects from the "checker" node and sends it a partial_partition_disconnect message. The node may instead pause if it is in pause_minority or pause_if_all_down mode.

  • partial_partition_disconnect: the message tells a node to disconnect from another node.

The assumption here is that a node should be promoted to a full partition, disconnecting from the "checker" node and leaving the "checker" and the "down" nodes in a partition together.

But because DOWN messages are symmetric and there is no additional coordination, this process may leave the entire cluster disconnected or keep disconnecting nodes for some time.
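The detection flow above can be simulated in a few lines. This is my own simplification assuming a static reachability map; the real code performs the verification via an RPC to rabbit:is_running/0 from a one-off process:

```python
# Toy simulation of check_partial_partition: the reporter believes `down`
# is down and asks every other node it can reach to verify; a checker that
# can still reach `down` replies with a partial_partition message.

def check_partial_partition(reporter, down, reachable):
    """reachable[x] is the set of nodes x can currently talk to."""
    replies = []
    for checker in reachable[reporter] - {down}:
        # The checker verifies the "down" node (rpc rabbit:is_running/0
        # in the real code) and reports back if it is actually running.
        if down in reachable[checker]:
            replies.append(("partial_partition", checker, down))
    return replies

# A cannot see C, but B still can -> a partial partition is detected via B.
reachable = {
    "A": {"B"},
    "B": {"A", "C"},
    "C": {"B"},
}
assert check_partial_partition("A", "C", reachable) == \
    [("partial_partition", "B", "C")]
```

If C were genuinely down (no node could reach it), no checker would reply and the nodedown would be treated as a plain node failure.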

A note on disconnect:

When disconnecting, the nodes will disable reconnection for 1 second.

When some nodes are down, the node monitor pings the entire cluster every second.

It also casts a keepalive message to all running nodes every 10 seconds.

[1] running_db_nodes: this value is maintained by internal Mnesia monitors. A node is removed from this list when the mnesia_monitor process detects that another mnesia_monitor is "down". When the node is rediscovered, it will not be automatically re-added unless the schema is merged. This can be triggered explicitly with mnesia:change_config(extra_db_nodes, [Node]), or by restarting the node. You may need to set the same extra_db_nodes configuration that is already there to reconnect the cluster. When nodes are discovered, Mnesia sends a message like {mnesia_system_event, {inconsistent_database, running_partitioned_network, Node}} to all processes subscribed to such events. This may happen every time Mnesia checks schema consistency, both when the node is discovered to be up (e.g. a message is sent between nodes) and when connecting with mnesia:change_config(extra_db_nodes, ...).
