benhenryhunter/PostMortem.md

## PostMortem.md

      
Overview

The bloXroute Blockchain Distribution Network (BDN) propagated partial blocks (blocks without their blobs) over p2p to Lighthouse nodes, which caused those nodes to treat the later HTTP submissions carrying the full payload as duplicates. As a result, many beacon nodes were left waiting for blobs to arrive over p2p even though the blobs were available in the HTTP submissions. The Lighthouse and bloXroute teams worked closely together to identify the full scope of this problem and have provided more information below, along with next steps for both teams.
Background

The bloXroute Blockchain Distribution Network (BDN) is heavily integrated with Lighthouse because of Lighthouse's high performance. As a consequence of this heavy reliance, a specific BDN behavior exposed a behavior in Lighthouse that caused block propagation errors when blocks were gossiped over p2p without their blobs. To optimize global propagation time, the BDN gossiped blocks over p2p to beacon node peers without gossiping the blobs over p2p. In Lighthouse (v5.1.2), a node that receives a block via gossip continues to gossip it to other nodes, but from that point on it only accepts blobs for that block from p2p. Because the BDN did not provide blobs over p2p, the beacon nodes never had them available, which caused missed slots with error messages ranging from “timeouts” to “data not available”.
During this time the relay was still publishing blocks, including the blobs, over HTTP to around a dozen beacon nodes, but it received a 202 response because these nodes had already seen the block on p2p. Lighthouse did not use the blobs from these HTTP requests because it considered the block a duplicate and was waiting for blobs from p2p. Thus the blobs were reaching a small number of Lighthouse nodes but were neither processed nor propagated to the wider network. The Lighthouse API was designed around the assumption that the block proposer would always use the HTTP API to publish whole blocks with blobs, and would not send fragments separately on gossip. This assumption was motivated by the presence of unbundling attacks, which require relays to validate each block with a beacon node prior to publishing. However, in bloXroute’s case, unbundling attacks for some proposals are ruled out through a trust relationship with the proposer, which makes publishing without validation sufficiently safe.
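
The stalemate described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Lighthouse code: the class `ToyBeaconNode`, its method names, and the block root `"0xabc"` are hypothetical, and the 200 status for a first-time publish is an assumption made for the model. Only the 202 response for an already-seen block and the discarding of the HTTP blobs are taken from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class ToyBeaconNode:
    """Toy model of the duplicate handling described above (hypothetical, not Lighthouse code)."""

    seen_blocks: set = field(default_factory=set)
    blobs: dict = field(default_factory=dict)  # block_root -> list of blobs

    def on_gossip_block(self, block_root: str) -> None:
        # A block arrives over p2p without its blobs.
        self.seen_blocks.add(block_root)

    def on_gossip_blobs(self, block_root: str, blobs: list) -> None:
        # Blobs arriving over p2p are always accepted in this model.
        self.blobs[block_root] = list(blobs)

    def on_http_publish(self, block_root: str, blobs: list) -> int:
        if block_root in self.seen_blocks:
            # The block is treated as a duplicate, so the blobs that came
            # with the HTTP submission are dropped along with it.
            return 202
        self.seen_blocks.add(block_root)
        self.blobs[block_root] = list(blobs)
        return 200  # assumed status for a first-time publish

    def is_data_available(self, block_root: str, expected_blobs: int) -> bool:
        return len(self.blobs.get(block_root, [])) == expected_blobs


def simulate_incident() -> tuple:
    node = ToyBeaconNode()
    node.on_gossip_block("0xabc")  # BDN gossips the block without blobs
    status = node.on_http_publish("0xabc", ["blob0", "blob1"])  # relay publishes the full payload
    return status, node.is_data_available("0xabc", expected_blobs=2)
```

In this model `simulate_incident()` yields a 202 status together with unavailable data: the node holds the block, has been handed the blobs once, and still waits for them on p2p.
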
Timeline

Initially after the Dencun upgrade the BDN had a bug in accepting blocks published by the relay and did not begin propagating blocks until March 18th. After that date we received a few reports of blobs not being available, with blame placed on the relay for not publishing blocks with blobs included. We investigated this behavior of the relay intensively, tracking the full publish payload and sharing successful responses from beacon nodes within relay operator Telegram groups.
During this time, another source of confusion and misdirection was that a small percentage of the impacted blocks never came through our relays at all. They still reached our BDN from another relay’s beacon node, and the behavior described above caused them to propagate through the network quickly enough to produce the same stalemate of unavailable blobs.
Impact

A peak of 13% missed slots occurred during this incident.
Resolution

After the first release of the BDN behavior we were alerted to some cases of missed slots due to blobs not being seen. We thoroughly investigated the relay, providing logs, including beacon node logs, to prove that the relay had in fact published blobs to beacon nodes and provided them back to the validator. The real problem was uncovered when we began to get more feedback from CL client developers, who suggested that blocks without blobs were arriving on p2p gossip before the block publication requests. This pointed to the previously mentioned BDN behavior, and Lighthouse confirmed that this behavior would cause Lighthouse to have issues propagating blocks.
We disabled headers (stopped offering bids to proposers) and then payloads (stopped propagating blocks that were offered by other relays) on the relay, and found that this solved the problem entirely. We then started using beacon nodes which were not connected to the BDN and published only to those nodes, and saw that the issue did not occur again. We noticed slow propagation after removing the BDN from the flow and shut off our relay for the night (headers and payloads).
The following day the BDN team disabled the propagation of blocks with blobs, and we added our original beacon nodes back into the flow along with some other necessary performance improvements for publishing outside of the BDN. The issue was fully resolved, and publishing time improved significantly after we released a couple of changes to the beacon nodes and the publishing lifecycle.
Moving Forward

The BDN team is defining new criteria for testing and feature validation prior to release. These will involve increased use of tools like Kurtosis, improved usage of testnets, and closer collaboration with client teams.
The Lighthouse team is also working to make Lighthouse compatible with the BDN’s prior behavior, in order to improve resilience. The assumption about whole block publishing proved to be too strong in the presence of relay optimizations. The updated Lighthouse API can continue to provide strong protection against unbundling attacks while being less strict with duplicate checks and more liberal about publishing parts of a block that have already been seen.
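
One way to picture "less strict with duplicate checks" is a publish handler that tracks the block and each blob separately, and still accepts and gossips whatever parts of an HTTP submission the node has not yet seen. The following is a minimal sketch of that idea only. The function `publish_lenient` and its data structures are hypothetical and do not reflect the actual Lighthouse change, and the validation that protects against unbundling is deliberately left out.

```python
def publish_lenient(seen_blocks: set, seen_blobs: dict, block_root: str, blobs: list) -> list:
    """Return the parts of an HTTP submission that still need to be gossiped.

    Hypothetical sketch: the block and every blob index are de-duplicated
    independently, so an already-seen block no longer causes its blobs to
    be discarded.
    """
    to_gossip = []
    if block_root not in seen_blocks:
        seen_blocks.add(block_root)
        to_gossip.append(("block", block_root))
    known = seen_blobs.setdefault(block_root, {})  # blob index -> blob
    for index, blob in enumerate(blobs):
        if index not in known:
            known[index] = blob
            to_gossip.append(("blob", index))
    return to_gossip


def demo() -> list:
    # The block was already seen on p2p, but none of its blobs were.
    seen_blocks = {"0xabc"}
    seen_blobs = {}
    first = publish_lenient(seen_blocks, seen_blobs, "0xabc", ["blob0", "blob1"])
    second = publish_lenient(seen_blocks, seen_blobs, "0xabc", ["blob0", "blob1"])
    return [first, second]
```

In `demo()`, the first submission still yields both blobs for gossip although the block itself is a duplicate, while the identical second submission yields nothing further.
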