Kuutamod NEAR Validator

Kuutamod is a distributed supervisor for neard that implements failover for NEAR validators.

Following instructions on kuutamod/

Running a localnet setup consists of:

  • hivemind, which starts
    • consul as the Raft consensus layer
    • 3 separate NEAR localnet nodes to start the network
  • validator, with metrics available via curl localhost:2233/metrics
screen -S validator ./target/debug/kuutamod --neard-home .data/near/localnet/kuutamod0/ \
--voter-node-key .data/near/localnet/kuutamod0/voter_node_key.json \
--validator-node-key .data/near/localnet/node3/node_key.json \
--validator-key .data/near/localnet/node3/validator_key.json \
--near-boot-nodes $(jq -r .public_key < .data/near/localnet/node0/node_key.json)@
  • failover, with metrics available via curl localhost:2234/metrics
screen -S failover ./target/debug/kuutamod \
  --exporter-address \
  --validator-network-addr \
  --voter-network-addr \
  --neard-home .data/near/localnet/kuutamod1/ \
  --voter-node-key .data/near/localnet/kuutamod1/voter_node_key.json \
  --validator-node-key .data/near/localnet/node3/node_key.json \
  --validator-key .data/near/localnet/node3/validator_key.json \
  --near-boot-nodes $(jq -r .public_key < .data/near/localnet/node0/node_key.json)@

Initial check of the validator and failover metrics:

Validator: kuutamod_state{type="Validating"} 1
Failover: kuutamod_state{type="Voting"} 1
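The check above can be reduced to a single word per node. The following is a minimal sketch, assuming the exporter prints each state as a line of exactly the form kuutamod_state{type="..."} followed by 0 or 1, as in the output quoted here; the helper name active_state is made up for illustration and is not part of kuutamod.

```shell
# Hypothetical helper: read metrics text on stdin and print the state
# whose kuutamod_state gauge is 1.
# Assumes the line format quoted in this document, with no extra labels.
active_state() {
  sed -n 's/^kuutamod_state{type="\([^"]*\)"} 1$/\1/p'
}

# Usage, with the exporter ports from the setup above:
#   curl -s localhost:2233/metrics | active_state   # validator
#   curl -s localhost:2234/metrics | active_state   # failover
```

If the exporter adds more labels or formats the value differently, the pattern would need adjusting.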

Press Ctrl+C in the validator's screen session to shut down the main validator gracefully.

Check of the validator and failover metrics:

Validator: kuutamod_state{type="Validating"} 1
Failover: kuutamod_state{type="Validating"} 1

The failover has taken over the validating responsibilities of the initial validator.
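To see how long the takeover takes, the failover's exporter can be polled from another terminal. This is only a sketch: wait_for_state and the FETCH variable are hypothetical names, and the loop assumes the same metric line format as quoted above.

```shell
# Hypothetical helper: poll a metrics URL until the given state is active,
# then print the number of seconds waited. Returns 1 on timeout.
# FETCH is overridable so the loop can be tried without a running node.
FETCH=${FETCH:-"curl -s"}

wait_for_state() {
  url=$1
  state=$2
  timeout=${3:-300}            # give up after this many seconds
  start=$(date +%s)
  while :; do
    if $FETCH "$url" 2>/dev/null |
        grep -q "^kuutamod_state{type=\"$state\"} 1\$"; then
      echo $(( $(date +%s) - start ))
      return 0
    fi
    if [ $(( $(date +%s) - start )) -ge "$timeout" ]; then
      return 1
    fi
    sleep 1
  done
}

# Usage: start this, then stop the validator and read off the delay:
#   wait_for_state localhost:2234/metrics Validating
```

The resolution is about one second, which is enough to tell a quick failover from one that takes minutes.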

When problems with the initial validator are fixed, it can be restarted. It will start in a voting role and stay there until the failover dies, at which point it will take over validation.

Problems with Non-Graceful Termination

With everything running under screen, the command screen -X -S <session_name> kill forcefully kills the process in that session.

Running this against the validator session kills the validator, and the failover properly takes over validation, although with a considerable delay of roughly 1-2 minutes, compared with the quick failover after a graceful shutdown. The problem arises when trying to restart the validator process.

Restarting the validator process with the command above results in the following errors, which eventually kill the process:

note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
level=warn pid=174131 message="Neard finished unexpectly with signal: 6 (core dumped)" target="kuutamod::supervisor" node_id=node
level=info pid=174131 message="state changed: Voting -> Startup" target="kuutamod::supervisor" node_id=node
level=info pid=174131 message="state changed: Startup -> Syncing" target="kuutamod::supervisor" node_id=node
level=info pid=174131 message="state changed: Syncing -> Registering" target="kuutamod::supervisor" node_id=node
level=info pid=174131 message="state changed: Registering -> Voting" target="kuutamod::supervisor" node_id=node
2022-07-18T18:58:04.693039Z  INFO neard: version="1.27.0" build="nix:1.27.0" latest_protocol=54
2022-07-18T18:58:04.693659Z  INFO near: Opening store database at ".data/near/localnet/kuutamod0/data"
2022-07-18T18:58:04.767130Z  INFO db: Created a new RocksDB instance. num_instances=1
2022-07-18T18:58:04.768723Z  INFO db: Dropped a RocksDB instance. num_instances=0
thread 'main' panicked at 'Failed to open the database: DBError("IO error: While lock file: .data/near/localnet/kuutamod0/data/LOCK: Resource temporarily unavailable")', core/store/src/

The errors point to an I/O error on the LOCK file in the node's data directory. Presumably neard removes this LOCK when it is shut down gracefully, but not when it is forcefully killed.
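One way to test this presumption is to check whether a process still has the LOCK file open. "Resource temporarily unavailable" is the text for the EAGAIN error, which for a file lock usually means another live process holds the lock rather than that a stale file was left behind, for example a neard process that outlived its supervisor. That is a guess about this setup, not something the logs confirm. The sketch below assumes lsof is installed; lock_holders is a hypothetical helper name.

```shell
# Hypothetical helper: list the processes, if any, that have the given
# lock file open. Assumes lsof is available on the machine.
lock_holders() {
  lock=$1
  if [ ! -e "$lock" ]; then
    echo "no lock file at $lock"
    return 0
  fi
  pids=$(lsof -t "$lock" 2>/dev/null)   # -t prints PIDs only
  if [ -z "$pids" ]; then
    echo "no process holds $lock"
    return 0
  fi
  for pid in $pids; do
    ps -o pid=,comm= -p "$pid"          # show PID and command name
  done
}

# Usage, with the data directory from the panic message:
#   lock_holders .data/near/localnet/kuutamod0/data/LOCK
```

If a neard process shows up, stopping it before restarting kuutamod would be the next thing to try. If nothing holds the file, the cause lies elsewhere.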

