Our current upgrade module uses the gov module to trigger a planned upgrade. This has a few potential downsides:
- if a hot-fix release is needed, the network needs to wait for the full voting window
- if validators need to postpone an upgrade after the governance vote due to some issues found in testing, they can't do that
- no built-in way to abort an upgrade in case the upgrade handler fails
A signalling method has been discussed in the past, but wasn't formally specified. Here is a proposed approach.
Validators, rather than governance, "signal" upgrades and the state machine responds based on signalling thresholds:
type MsgSignalUpgrade struct {
	// the self-delegate address of the validator
	Validator sdk.AccAddress
	// set to 0 if the upgrade should happen as soon as the quorum is reached, or future height for a planned upgrade
	UpgradeHeight uint64
	// the name of the upgrade - the new binary must have a handler with this name to apply migrations
	UpgradeName string
	// set this value to something greater than 2/3 to indicate that this upgrade requires a higher quorum,
	// the rationale being that we usually actually want 80 or 90%+ of the network ready to do a smooth upgrade,
	// in the case of security hot-fixes we may need to leave the low threshold of 2/3
	QuorumNeeded sdk.Dec
}
Upgrades will only be triggered at the `UpgradeHeight` if a `QuorumNeeded` weight of validators have signalled that they will upgrade. Validators could remove their signal even a few blocks before the height to postpone or abort.
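To make the threshold check concrete, here is a rough sketch of how the state machine might tally signals at a given height. Everything in it is an assumption for illustration: `Signal` is a dependency-free stand-in for `MsgSignalUpgrade`, the quorum is expressed in basis points instead of `sdk.Dec`, and when validators request different quorums the strictest one wins (the proposal does not say how that should be resolved). The `UpgradeHeight == 0` case is left out.

```go
package main

import "fmt"

// minQuorumBps approximates the 2/3 floor in basis points (assumption).
const minQuorumBps = 6667

// Signal is a simplified, hypothetical stand-in for MsgSignalUpgrade.
type Signal struct {
	Validator     string
	UpgradeName   string
	UpgradeHeight uint64
	QuorumBps     uint64 // requested quorum in basis points, e.g. 8000 = 80%
}

// shouldUpgrade reports whether upgrade `name` triggers at `height`.
// power maps each validator to its voting power.
func shouldUpgrade(signals []Signal, power map[string]uint64, name string, height uint64) bool {
	var total uint64
	for _, p := range power {
		total += p
	}
	if total == 0 {
		return false
	}
	var signalled uint64
	quorum := uint64(minQuorumBps)
	seen := make(map[string]bool)
	for _, s := range signals {
		// Count each validator once, and only for this exact name and height.
		if s.UpgradeName != name || s.UpgradeHeight != height || seen[s.Validator] {
			continue
		}
		seen[s.Validator] = true
		signalled += power[s.Validator]
		if s.QuorumBps > quorum {
			quorum = s.QuorumBps // assumption: the strictest request wins
		}
	}
	return signalled*10000 >= quorum*total
}

func main() {
	power := map[string]uint64{"val-a": 40, "val-b": 30, "val-c": 30}
	signals := []Signal{
		{"val-a", "v2", 100, 6667},
		{"val-b", "v2", 100, 6667},
	}
	fmt.Println(shouldUpgrade(signals, power, "v2", 100))     // 70% signalled: true
	fmt.Println(shouldUpgrade(signals[:1], power, "v2", 100)) // val-b withdrew: false
}
```

Removing a signal before the height simply drops that validator's power from the tally, which is what lets validators postpone or abort late.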
Aborting would be handled by an `--abort-upgrade <upgrade-name>` command line flag to the daemon. If set, this would cause the current binary (without the upgrade handler) to not expect an upgrade handler and continue processing blocks. This would likely only be used in cases when the upgrade handler itself hit some error. In that error case, the only fix currently is to release a new binary with either bug fixes or a no-op handler. The abort flag would instead alter the state machine behavior of the current binary to allow a smooth abort in error cases, such as what happened with cosmos-hub-3.
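As a sketch of the abort path, the check the old binary runs at the upgrade height could look roughly like this. `Plan`, `checkUpgrade`, and `errUpgradeNeeded` are hypothetical names, not existing SDK APIs, and the flag is parsed with Go's standard `flag` package purely for illustration.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
)

// Plan is a simplified, hypothetical view of a scheduled upgrade.
type Plan struct {
	Name   string
	Height uint64
}

var errUpgradeNeeded = errors.New("upgrade needed: no handler registered in this binary")

// checkUpgrade decides what the running binary does at `height`.
// handlers lists the upgrade names this binary can apply; abortName is the
// value of the proposed --abort-upgrade flag ("" when unset).
// It returns true when the plan was aborted and block processing continues.
func checkUpgrade(plan Plan, height uint64, handlers map[string]bool, abortName string) (bool, error) {
	if height != plan.Height || handlers[plan.Name] {
		return false, nil // not at the upgrade height, or the handler exists and can run
	}
	if abortName == plan.Name {
		return true, nil // operator asked to abort: keep running the current code
	}
	return false, errUpgradeNeeded
}

func main() {
	fs := flag.NewFlagSet("daemon", flag.ContinueOnError)
	abort := fs.String("abort-upgrade", "", "name of the upgrade to abort")
	if err := fs.Parse([]string{"--abort-upgrade", "v2"}); err != nil {
		panic(err)
	}
	plan := Plan{Name: "v2", Height: 100}
	aborted, err := checkUpgrade(plan, 100, map[string]bool{}, *abort)
	fmt.Println(aborted, err)
}
```

Without the flag, the same call returns `errUpgradeNeeded`, i.e. the binary halts and waits for the new release.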
As to hotfixes, I think this is a rather complex solution, and maybe it just involves off-chain coordination.
If v0.x.0 -> v0.x.1 is just a hotfix patch that closes a security hole, people can just deploy it ASAP, and it doesn't need to be coordinated, as happened with a Cosmos Hub vulnerability. The only issue is if the hole was already exploited: for the AppHash to work out, we need to run the vulnerable binary until height H and the improved one after H.
This is trickier if it is still being actively exploited. In any case, I think this should be coordinated off-chain and not be an official upgrade.
One idea is to do something like Ethereum and have a feature gate (`if height < hotfix-height { do old code } else { do new code }`), with `hotfix-height` set in `app.toml` (a subjective file). The next official upgrade can remove this gate and just run the new code, but this allows a subjective feature switch to be coordinated at a height, as a supplement to the on-chain coordination, for fast security response. I think we should avoid all on-chain governance for critical fixes (the chain might be frozen, for example).
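A minimal sketch of such a height gate, assuming the node has already read `hotfix-height` from `app.toml` into a config struct. `Config`, `debitOld`, and `debitNew` are made-up placeholders for "the vulnerable code" and "the patched code", not real SDK functions.

```go
package main

import "fmt"

// Config holds the node-local setting, as if read from `hotfix-height` in app.toml.
type Config struct {
	HotfixHeight int64
}

// debitOld stands in for the vulnerable logic: it allows overdrafts.
func debitOld(balance, amount int64) int64 { return balance - amount }

// debitNew stands in for the patched logic: it leaves the balance untouched on overdraft.
func debitNew(balance, amount int64) int64 {
	if amount > balance {
		return balance
	}
	return balance - amount
}

// debit keeps the old behavior below the hotfix height, so replaying earlier
// blocks yields the same AppHash, and switches to the fix from that height on.
func debit(cfg Config, height, balance, amount int64) int64 {
	if height < cfg.HotfixHeight {
		return debitOld(balance, amount)
	}
	return debitNew(balance, amount)
}

func main() {
	cfg := Config{HotfixHeight: 100}
	fmt.Println(debit(cfg, 99, 10, 15))  // old code path: -5
	fmt.Println(debit(cfg, 100, 10, 15)) // new code path: 10
}
```

Since the height lives in a local config file rather than in consensus state, validators still have to agree on the value off-chain; nodes configured with different heights would diverge once the two code paths produce different results.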