@janx
Last active November 4, 2023 12:05
Postmortem: CKB Interruption on Light Client Activation

Summary

The CKB mainnet experienced an abnormal block interval of about 1 hour and 50 minutes, starting from block 11,333,728 on November 1, 2023, at 11:12:44 UTC. The interruption was caused by an incompatibility between the new block hash calculation activated by the light client protocol and mining clients with customized block construction logic. Most mining clients resumed block production once they upgraded their code to support the new block hash calculation algorithm.

Timeline

  • 2023-11-01 11:12:44 UTC - Block 11,333,728 was mined.
  • 2023-11-01 11:21:44 UTC - The core team was alerted that no new block had been produced for more than 5 minutes.
  • 2023-11-01 11:28:00 UTC - An Accident Response Team (ART) was assembled.
  • 2023-11-01 11:37:00 UTC - The ART suspected the problem was related to the light client soft fork activation and started contacting mining pools for help with debugging.
  • 2023-11-01 12:10:00 UTC - The ART confirmed the root cause (see the root cause analysis below) and started working with mining pools on fixes.
  • 2023-11-01 13:02:21 UTC - A patch was successfully deployed by 2miners. A few seconds later, block 11,333,729 was mined.
  • Over the next 4 hours, as more and more mining pools were restored, the block interval returned to normal. The ART was then disbanded.
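The durations quoted in this report can be checked directly against the timeline timestamps. The following is a minimal sketch; the variable names are illustrative, and the stall is measured only up to the patch deployment, since the exact timestamp of block 11,333,729 (mined a few seconds later) is not listed above.

```python
from datetime import datetime, timezone

def utc(ts: str) -> datetime:
    """Parse a 'YYYY-MM-DD HH:MM:SS' timestamp from the timeline as UTC."""
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)

# Timestamps copied from the timeline above.
last_block = utc("2023-11-01 11:12:44")      # block 11,333,728 mined
alerted = utc("2023-11-01 11:21:44")         # core team alerted
patch_deployed = utc("2023-11-01 13:02:21")  # 2miners patch deployed

time_to_alert = alerted - last_block
stall_until_patch = patch_deployed - last_block

print("time to alert:    ", time_to_alert)
print("stall until patch:", stall_until_patch)
```

The stall up to the patch deployment comes out just under 1 hour and 50 minutes, which matches the figure in the summary.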

Figure: Block Interval, the Moving Average of the Last 25 Blocks

Root Cause Analysis

CKB mining clients request block templates from CKB nodes through RPC. They then search for a nonce that solves the PoW puzzle and submit the solved block back to the CKB node. This architecture allows miners to tweak block construction as they please. It also means that professional miners, such as mining pools, may develop and use their own block construction code to better integrate with their own infrastructure.
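The request, search, and submit cycle described above can be sketched as a toy loop. Everything here is an assumption made for illustration: `ToyNode` is a hypothetical in-memory stand-in for a node (not the real RPC interface or its payload format), and `toy_pow_value` uses BLAKE2b as a placeholder rather than CKB's actual PoW function.

```python
import hashlib

def toy_pow_value(header: bytes, nonce: int) -> int:
    """Toy PoW: hash header plus nonce and read the digest as an integer."""
    data = header + nonce.to_bytes(16, "little")
    digest = hashlib.blake2b(data, digest_size=32).digest()
    return int.from_bytes(digest, "big")

class ToyNode:
    """Hypothetical stand-in for a node's mining RPC (not the real API)."""

    def __init__(self, target: int) -> None:
        self.target = target
        self.accepted = []  # nonces of blocks the node accepted

    def get_block_template(self) -> dict:
        # A real template carries transactions, header fields, and so on.
        return {"header": b"toy-header-bytes", "target": self.target}

    def submit_block(self, template: dict, nonce: int) -> bool:
        # The node re-verifies the work itself before accepting the block.
        ok = toy_pow_value(template["header"], nonce) <= template["target"]
        if ok:
            self.accepted.append(nonce)
        return ok

def mine_once(node: ToyNode, max_tries: int = 1_000_000) -> bool:
    """Request a template, search for a nonce, and submit the solved block."""
    template = node.get_block_template()
    for nonce in range(max_tries):
        if toy_pow_value(template["header"], nonce) <= template["target"]:
            return node.submit_block(template, nonce)
    return False

node = ToyNode(target=2**248)  # toy difficulty: about 1 in 256 nonces succeeds
mined = mine_once(node)
print("block accepted:", mined)
```

The point of the sketch is the division of labor: the node verifies whatever the miner submits, so a miner that assembles or hashes the block differently from the node does valid-looking work that the node then rejects.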

The CKB2021 hard fork included a new feature, RFC0031, that modifies the algorithm for calculating a field in the block header. The new algorithm allows a mining client to ignore the newly added extension field in the block hash calculation when the field is empty. When the extension field is not empty, however, the mining client must include its value in the block hash calculation rather than ignore it. This made the addition of the extension field backward compatible with all existing mining clients (whether mining pools with customized infrastructure or solo miners using the bundled default mining function) and left them time to upgrade and handle the new field properly.

The CKB2021 hard fork introduced the new field, but it was always empty until the light client activation. The new light client protocol puts the extension field into use: on activation, it requires a 32-byte light client value in the extension field (see RFC0044). This new requirement caused the incident, because most mining clients with customized block construction were still running the old block hash calculation algorithm, which cannot correctly build blocks with a non-empty extension field.
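The compatibility rule, and the way it stops protecting legacy clients once the extension becomes non-empty, can be illustrated with a small sketch. This is not the exact construction specified in RFC0031: the function names, the use of plain BLAKE2b, and the way the extension is mixed into the hash are all assumptions chosen to show the empty versus non-empty behavior only.

```python
import hashlib

def toy_hash(data: bytes) -> bytes:
    """Placeholder 32-byte hash (an assumption, not CKB's exact hash setup)."""
    return hashlib.blake2b(data, digest_size=32).digest()

def commitment_new(base: bytes, extension: bytes) -> bytes:
    """New rule: ignore an empty extension, commit to a non-empty one."""
    if len(extension) == 0:
        return toy_hash(base)
    return toy_hash(base + toy_hash(extension))

def commitment_legacy(base: bytes, extension: bytes) -> bytes:
    """Legacy rule: the client predates the extension and never hashes it."""
    return toy_hash(base)

base = b"toy-block-contents"
empty_extension = b""
light_client_extension = bytes(32)  # placeholder for the 32-byte value

# Before activation: both rules agree, so legacy clients keep working.
agree_before = (commitment_new(base, empty_extension)
                == commitment_legacy(base, empty_extension))

# After activation: the rules diverge, so legacy clients build invalid blocks.
agree_after = (commitment_new(base, light_client_extension)
               == commitment_legacy(base, light_client_extension))

print("agree with empty extension:    ", agree_before)
print("agree with non-empty extension:", agree_after)
```

The sketch shows why the incompatibility stayed invisible between the hard fork and the light client activation: as long as the extension was empty, an un-upgraded client produced exactly the same result as an upgraded one.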

Block 11,333,729 was the first to include a non-empty extension, which triggered the subsequent sharp drop in network hashrate. In proof-of-work networks, a significant decline in network hashrate leads to extremely long block intervals, because the difficulty does not adjust immediately.
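The relationship between hashrate and block interval is simple arithmetic: with difficulty held constant, the expected time to find a block is inversely proportional to the hashrate that is producing valid blocks. The sketch below uses a hypothetical 10-second normal interval and made-up hashrate fractions; neither is a figure measured during this incident.

```python
def expected_block_interval(normal_interval_s: float,
                            remaining_hashrate_fraction: float) -> float:
    """Expected block interval at fixed difficulty when only a fraction of
    the original hashrate is still producing valid blocks."""
    if not 0 < remaining_hashrate_fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return normal_interval_s / remaining_hashrate_fraction

NORMAL_INTERVAL_S = 10.0  # hypothetical value, for illustration only

for fraction in (1.0, 0.5, 0.1, 0.01):
    interval = expected_block_interval(NORMAL_INTERVAL_S, fraction)
    print(f"{fraction:>5.0%} of hashrate valid -> ~{interval:,.0f} s per block")
```

As the valid fraction approaches zero the expected interval grows without bound, which is consistent with no block being produced at all until the first patched pool came back online.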

Impact

No new block was produced for nearly 2 hours after the light client activation. This abnormal block interval seriously disrupted the user experience. It could also cause security issues for users and dApps that must have their transactions included in time, or that rely on block header timestamps to determine timeouts.

Mining clients with an incompatible block construction algorithm wasted two hours mining invalid blocks.

Remediation

Since the problem was an incompatibility between the new protocol and mining clients with customized block construction, the ART decided that the best solution was to work with the affected mining clients to upgrade.

The ART worked with every CKB mining pool it was able to contact to identify the problem, create a customized patch, and deploy it.

The ART cannot contact all mining clients because the network is anonymous. Some mining clients with customized block construction may still be running an incompatible algorithm.

Preventative Measures

  • Prioritize mining client compatibility when designing new protocol upgrades.
  • Set up a staging environment to test new protocol upgrades with mining clients.

Lessons Learned

  1. The CKB core team did not communicate the changes and their consequences well to the community. Effective communication and coordination are vital in a decentralized community. The team should communicate better with known mining clients to ensure they are aware of upcoming changes, have enough time to upgrade their code, and have an environment in which to test the upgrade.
  2. The successful activation of the light client on the CKB testnet was helped by the fact that no mining clients with customized block construction run there. Conditions on the CKB mainnet are much more complicated and diverse than on the testnet. A staging environment that resembles the CKB mainnet as closely as possible is required.
  3. CKB mining pools are professional and respond swiftly. They were very supportive and helpful.