Our current upgrade module uses the gov module to trigger a planned upgrade. This has a few potential downsides:
- if a hot-fix release is needed, the network needs to wait for the full voting window
- if validators need to postpone an upgrade after the governance vote due to some issues found in testing, they can't do that
- there is no built-in way to abort an upgrade in case the upgrade handler fails
A signalling method has been discussed in the past, but wasn't formally specified. Here is a proposed approach.
Validators, rather than governance, "signal" upgrades and the state machine responds based on signalling thresholds:
    type MsgSignalUpgrade struct {
        // the self-delegate address of the validator
        Validator sdk.AccAddress
        // set to 0 if the upgrade should happen as soon as the quorum is reached, or future height for a planned upgrade
        UpgradeHeight uint64
        // the name of the upgrade - the new binary must have a handler with this name to apply migrations
        UpgradeName string
        // set this value to something greater than 2/3 to indicate that this upgrade requires a higher quorum,
        // the rationale being that we usually actually want 80 or 90%+ of the network ready to do a smooth upgrade,
        // in the case of security hot-fixes we may need to leave the low threshold of 2/3
        QuorumNeeded sdk.Dec
    }
Upgrades will only be triggered at the UpgradeHeight (or, when it is set to 0, as soon as the quorum is reached) if a QuorumNeeded weight of validators have signalled that they will upgrade. Validators could remove their signal even a few blocks before the height to postpone or abort.
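To make the threshold check concrete, here is a rough sketch of how the tally could work. It is not SDK code: the `Signal` type, `quorumReached`, the two quorum variables, and the plain voting-power map are all hypothetical, and I am assuming the threshold is inclusive (>=) and that a single quorum value applies per upgrade, neither of which is settled above.

```go
package main

import (
	"fmt"
	"math/big"
)

// Signal is a simplified, hypothetical stand-in for a stored MsgSignalUpgrade.
type Signal struct {
	Validator   string
	UpgradeName string
}

// Example thresholds taken from the discussion: 2/3 for hot-fixes, 90% for a smooth upgrade.
var (
	hotfixQuorum = big.NewRat(2, 3)
	smoothQuorum = big.NewRat(9, 10)
)

// quorumReached reports whether the validators that signalled for name hold
// at least the quorum fraction of the total voting power.
// Assumption: the comparison is inclusive (>=); the proposal does not say.
func quorumReached(name string, signals []Signal, power map[string]int64, quorum *big.Rat) bool {
	var total, signalled int64
	for _, p := range power {
		total += p
	}
	if total <= 0 {
		return false
	}
	seen := make(map[string]bool) // count each validator at most once
	for _, s := range signals {
		if s.UpgradeName != name || seen[s.Validator] {
			continue
		}
		seen[s.Validator] = true
		signalled += power[s.Validator]
	}
	return big.NewRat(signalled, total).Cmp(quorum) >= 0
}

func main() {
	power := map[string]int64{"val-a": 40, "val-b": 30, "val-c": 30}
	signals := []Signal{{"val-a", "v2"}, {"val-b", "v2"}}

	fmt.Println(quorumReached("v2", signals, power, hotfixQuorum)) // 70% signalled vs 2/3
	fmt.Println(quorumReached("v2", signals, power, smoothQuorum)) // 70% signalled vs 90%
	// A validator withdrawing its signal is modelled by dropping it from the slice.
	fmt.Println(quorumReached("v2", signals[:1], power, hotfixQuorum))
}
```

Removing a signal just shrinks the set that is tallied, so a few validators withdrawing shortly before the height is enough to drop below the threshold and postpone the upgrade.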
Aborting would be handled by some --abort-upgrade <upgrade-name> command line flag to the daemon. If set, this would cause the current binary (without the upgrade handler) to not expect an upgrade handler and continue processing blocks. This would likely only be used in cases when the upgrade handler itself hit some error. In the error case, currently the only fix would be to release a new binary with either bug fixes or a no-op handler. The abort flag, instead, would basically alter the state machine behavior of the current binary to allow a smooth abort in error cases, such as what happened with cosmos-hub-3.
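The decision the daemon would make at the upgrade height could look roughly like the sketch below. Everything here is illustrative: the flag does not exist today, and `Action`, `decide`, and their parameters are hypothetical names. I am assuming the abort flag only matters for a binary that has no handler for the scheduled upgrade.

```go
package main

import (
	"flag"
	"fmt"
)

// Action is what the node does at a given height (hypothetical type).
type Action int

const (
	Continue    Action = iota // no upgrade due at this height
	RunHandler                // binary has the handler: apply migrations
	Halt                      // upgrade due but no handler: stop and wait for the new binary
	SkipUpgrade               // abort flag names this upgrade: ignore it and keep processing blocks
)

// decide picks the action for the current height given the scheduled plan,
// whether this binary registered a handler, and the value of the abort flag.
func decide(planName string, planHeight, height uint64, hasHandler bool, abortName string) Action {
	switch {
	case planName == "" || height < planHeight:
		return Continue
	case hasHandler:
		return RunHandler
	case abortName == planName:
		return SkipUpgrade
	default:
		return Halt
	}
}

func main() {
	// Parse a fixed argument list so the example is deterministic.
	fs := flag.NewFlagSet("daemon", flag.ContinueOnError)
	abort := fs.String("abort-upgrade", "", "name of a scheduled upgrade to ignore (proposed flag)")
	if err := fs.Parse([]string{"--abort-upgrade", "v2"}); err != nil {
		fmt.Println("flag error:", err)
		return
	}

	fmt.Println(decide("v2", 100, 100, false, *abort) == SkipUpgrade) // old binary, abort set
	fmt.Println(decide("v2", 100, 100, false, "") == Halt)            // old binary, no abort
}
```

Matching on the upgrade name means a stale flag left over from an earlier abort would not silently skip a later, differently named upgrade.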
I think the abort idea is good. Whether a command line flag is too dangerous (and we need a different binary) is an open question, but we should definitely have the choice of "perform scheduled upgrade A with binary XYZ" or "ignore upgrade A and continue with old binary". If this is not coordinated well, it can lead to a fork, so maybe the flag should have a longer name, like --unsafe-abort-upgrade-i-am-sure <upgrade-name>. This takes care of the last two points.