mksh/Beacon-Node-VC-Health-Checking.md

## Beacon-Node-VC-Health-Checking.md

      
This document explains how to build an automated health-checking proxy
that serves the beacon API to validator client processes.

An Ethereum validator client process needs to periodically perform actions on chain,
such as attesting to blocks proposed by other validators, publishing sync
committee messages, or, if lucky, proposing its own blocks.
Although the validator client always possesses the means to produce the
message data necessary for on-chain actions, it has no mechanism to
distribute that data to other Ethereum nodes. An Ethereum beacon node is a separate process,
which the validator client connects to over an HTTP API, and the beacon node
has all the means to bring data into the network over a suite of Ethereum P2P
protocols. Once the validator client has produced the necessary protocol data, it sends the data to
the beacon node, which then acts upon it and distributes the attestation, block or
sync committee message to its peers. This means that,
even though the validator client is an authoritative participant in the
blockchain network, within the current architecture of Ethereum proof-of-stake it cannot
act without a connection to a beacon node, and this connection is on the critical
path of operation for the validators that secure the Ethereum network.

Most Ethereum validator clients now support load balancing
across multiple beacon nodes to fulfill their network duties. This means that if a
beacon node becomes non-functional, in most cases the validator client will stop
using the connection to that node and fall back to the other nodes
specified in its configuration. For example, if the beacon node's HTTP API returns
503 instead of the expected return code, the validator client will not use it to produce
the next attestation or block, but will fail over to another node. There are other
scenarios in which a validator client will drop a beacon node, but the checks that the
validator client performs on the beacon node differ between Ethereum client
implementations. Every client checks whether the corresponding beacon node is currently syncing,
and will not perform any action through a node that is not synced to the latest head.
However, Ethereum has a dual nature, with both beacon
(consensus) and execution nodes, so to make sure block production works well
it is important to verify that the execution node is not syncing either, not only the consensus node.
In today's landscape of Ethereum client implementations, not every client
does this. At the time of writing, the Lighthouse validator client performs a custom check to ensure
the execution node is synced, and Teku does not.

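The fail-over behaviour described above can be pictured with a small sketch. This is a simplified illustration, not how any particular validator client is implemented: the function name `pick_beacon_node` and the `is_usable` probe are hypothetical, and real clients apply their own, differing checks.

```python
from typing import Callable, Optional, Sequence


def pick_beacon_node(
    nodes: Sequence[str], is_usable: Callable[[str], bool]
) -> Optional[str]:
    """Return the first configured node that passes the probe, else None.

    `is_usable` stands in for whatever checks a client performs, for
    example "did not answer 503" or "is not syncing" (assumption).
    """
    for node in nodes:
        if is_usable(node):
            return node
    return None


# Example: the first node is out of rotation, so the second one is chosen.
configured = ["http://bn1:5052", "http://bn2:5052"]  # hypothetical URLs
out_of_rotation = {"http://bn1:5052"}
chosen = pick_beacon_node(configured, lambda n: n not in out_of_rotation)
```

The point of the sketch is that the validator client can only fail over on the signals it actually checks, which is why the set of checks matters.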
There is one more challenge, which matters less to people who stake a few validators
and more to node operators who routinely connect several hundred validators
to a single beacon node. To produce and distribute attestations efficiently, the set of Ethereum
validators is randomly divided across 64 attestation subnets. Every attestation message is distributed among
the peer nodes that operate on the specific subnet, and is not delivered to nodes
that are not subscribed to that subnet. Larger node operators usually need to
subscribe to all attestation subnets, to make sure that their diverse validator set can deliver
messages to every necessary subnet each time an attestation duty is due. On the Ethereum
network, not every beacon node subscribes to every subnet, so to be present on every necessary subnet
a beacon node needs a sufficient number of peers. Otherwise, some attestation messages
could be lost and never delivered to the network, because the beacon node is not connected to the required subnet.
This is why, for bigger node operators connecting validator clients to beacon nodes,
it is also important to verify that a beacon node is connected to a sufficient number of peers before
letting validator clients use it. Operators must ensure that validators do not use nodes that
were started recently and do not yet have enough peers, or nodes that are
experiencing a networking incident and cannot maintain enough peers.

So, at least three important health checks need to be performed on the
connection between the beacon node and the validator client, to ensure the validator client can trust
the beacon node to distribute all the necessary protocol messages.
These checks are as follows:

- Beacon node is synced
- Execution node is synced
- Beacon node has sufficient number of peers to accommodate attestations on all subnets

To implement the health checks described above, validator clients and the
health-checking layer can use the following Ethereum APIs:

- Beacon node HTTP endpoint `/eth/v1/node/syncing`, to make sure the consensus node is not syncing
- JSON-RPC method `eth_syncing`, to make sure the execution node is not syncing
- Beacon node HTTP endpoint `/eth/v1/node/peer_count`, to get the peer count and ensure it is sufficient
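
The three calls above can be combined into a single health verdict. The following is a minimal sketch, assuming the usual response shapes of these APIs (a boolean `data.is_syncing`, a `result` of `false` from `eth_syncing` when the node is not syncing, and a string-encoded `data.connected` peer count). The function names and the `MIN_PEERS` threshold are illustrative assumptions, not values taken from any client.

```python
import json
import urllib.request

# Assumed threshold; a suitable value depends on the client and the network.
MIN_PEERS = 50


def beacon_is_synced(syncing_response: dict) -> bool:
    """Interpret the body of GET /eth/v1/node/syncing."""
    return syncing_response["data"]["is_syncing"] is False


def execution_is_synced(rpc_response: dict) -> bool:
    """Interpret an eth_syncing reply: a result of `false` means not syncing."""
    return rpc_response.get("result") is False


def has_enough_peers(peer_count_response: dict, min_peers: int = MIN_PEERS) -> bool:
    """Interpret the body of GET /eth/v1/node/peer_count (counts are strings)."""
    return int(peer_count_response["data"]["connected"]) >= min_peers


def get_json(url: str, timeout: float = 2.0) -> dict:
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def post_json_rpc(url: str, method: str, timeout: float = 2.0) -> dict:
    body = json.dumps(
        {"jsonrpc": "2.0", "id": 1, "method": method, "params": []}
    ).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def node_is_healthy(beacon_url: str, execution_url: str,
                    min_peers: int = MIN_PEERS) -> bool:
    """Run all three checks; any network or parsing failure counts as unhealthy."""
    try:
        return (
            beacon_is_synced(get_json(beacon_url + "/eth/v1/node/syncing"))
            and execution_is_synced(post_json_rpc(execution_url, "eth_syncing"))
            and has_enough_peers(
                get_json(beacon_url + "/eth/v1/node/peer_count"), min_peers
            )
        )
    except (OSError, ValueError, KeyError, TypeError):
        return False
```

Treating every error as "unhealthy" is a deliberate choice in this sketch: a node that cannot answer its own health queries should not be trusted with validator duties.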

In an ideal world, all Ethereum client implementations would perform all three checks on the
beacon node before starting to use it to produce attestations, blocks or sync committee messages.
This is not the case today: the health-checking functionality differs
between client implementations, as seen in the discrepancy in execution node checking between
Teku and Lighthouse noted above, and it continues to evolve.

To address this inconsistency in health-checking behavior between
Ethereum client implementations, a smart Ethereum node operator can employ a custom proxy layer
that health-checks the HTTP API the beacon node exposes. A custom proxy must be deployed for every
beacon node, and the Ethereum validator client connects to the proxy instead of connecting directly to the beacon node.
Every proxy performs the same set of health checks on its beacon node and, if any health check fails,
marks the beacon node as out of rotation, which results in a 503 response to
every validator client request over the HTTP API. As a result, no validator client
will use the faulty node to perform any validation duties. This works the same way regardless
of the client implementation, and can be extended with more health checks if
that becomes necessary as the Ethereum protocol evolves.

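The proxy behaviour described above can be sketched with the Python standard library alone. This is an illustration under stated assumptions rather than a production design: `HealthGate`, `make_handler` and `poll` are hypothetical names, the `check` callable is whatever health check the operator supplies, and the handler buffers whole responses, so long-lived streaming endpoints would need different handling.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class HealthGate:
    """Thread-safe flag; starts unhealthy until a check has passed."""

    def __init__(self):
        self._lock = threading.Lock()
        self._healthy = False

    def set(self, healthy: bool) -> None:
        with self._lock:
            self._healthy = healthy

    def is_healthy(self) -> bool:
        with self._lock:
            return self._healthy


def make_handler(gate: HealthGate, upstream: str):
    """Build a handler that answers 503 while the gate is closed."""

    class Handler(BaseHTTPRequestHandler):
        def _handle(self):
            if not gate.is_healthy():
                self.send_error(503, "beacon node out of rotation")
                return
            length = int(self.headers.get("Content-Length") or 0)
            body = self.rfile.read(length) if length else None
            req = urllib.request.Request(
                upstream + self.path, data=body, method=self.command
            )
            for name in ("Content-Type", "Accept"):
                if self.headers.get(name):
                    req.add_header(name, self.headers[name])
            try:
                with urllib.request.urlopen(req, timeout=10) as resp:
                    status, payload = resp.status, resp.read()
                    ctype = resp.headers.get("Content-Type", "application/json")
            except urllib.error.HTTPError as err:
                # Pass upstream error statuses through unchanged.
                status, payload = err.code, err.read()
                ctype = err.headers.get("Content-Type", "application/json")
            except OSError:
                self.send_error(503, "beacon node unreachable")
                return
            self.send_response(status)
            self.send_header("Content-Type", ctype)
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

        do_GET = do_POST = _handle

        def log_message(self, *args):  # keep the sketch quiet
            pass

    return Handler


def poll(gate: HealthGate, check, interval: float, stop: threading.Event) -> None:
    """Re-evaluate `check()` every `interval` seconds until `stop` is set."""
    while not stop.is_set():
        gate.set(bool(check()))
        stop.wait(interval)
```

To run it, one would start `poll` in a background thread and serve `ThreadingHTTPServer((host, port), make_handler(gate, upstream)).serve_forever()`, with the host, port and upstream beacon node URL chosen by the operator. Starting the gate closed means a freshly launched proxy answers 503 until the first check succeeds.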
The layout of health-checking proxies, beacon nodes, execution nodes and the validator client
that allows for consistent health checking is pictured in the ASCII diagram below.

               +-------------------+                   
               | Validator Client  |                   
               +--+---------------++                   
                  |               |                    
                  |               |                    
             Beacon API        Beacon API              
                  |               |                    
                  |               |                    
          +-------v-------+     +-v-------------+      
    +-----+Proxy (Health) |     |Proxy (Health) |      
    |     +---------+-----+     +---+-----------++     
    |               |               |            |     
    |            Beacon           Beacon HTTP    |     
    |            HTTP API         API            |     
    |          +----<-----+     +---<-----+      |     
    JSON       |Beacon    |     |Beacon   |      JSON  
    RPC        |Node      |     |Node     |      RPC   
    Health     |1         |     |2        |      Health
    Check      +----------+     +---------+      Check 
    |            JSON-RPC         JSON-RPC       |     
    |          +-----<----+    +----<-----+      |     
    |          |Execution |    |Execution |      |     
    |          |Node      |    |Node      |      |     
    |          |1         |    |2         <------+     
    +---------->          |    |          |            
               +----------+    +----------+              

There are multiple ways to implement such a health-checking layer for
Ethereum validation, and smart node operators should prefer
the software that is already used by their organization, be it
Istio, Traefik, HAProxy or any other proxy software capable of health checking. For me
personally, HAProxy is the best of the options, because it includes support for the Lua
programming language, which makes it possible to implement health checks in a way
that is both efficient at getting the job done and pleasant to program.

The problem of health checking beacon nodes is an interesting one to
solve. However, in an ideal world there would be a solution for such
health checking that is built into the validator clients and mandated by the
protocol. The only suggestion I have for the protocol developers is that,
if such a mechanism ever becomes an Ethereum protocol requirement, it should be extensible, so that
people working on future protocol extensions that need custom health checking
can easily implement new health checks.