@aaronc
Created September 26, 2019 13:50
An alternate upgrade coordination mechanism

Motivation

Our current upgrade module uses the gov module to trigger a planned upgrade. This has a few potential downsides:

  • if a hot-fix release is needed, the network needs to wait for the full voting window
  • if validators need to postpone an upgrade after the governance vote due to some issues found in testing, they can't do that
  • no built-in way to abort an upgrade in case the upgrade handler fails

Proposal

A signalling method has been discussed in the past, but wasn't formally specified. Here is a proposed approach.

Signals

Validators, rather than governance, "signal" upgrades and the state machine responds based on signalling thresholds:

type MsgSignalUpgrade struct {
  // the self-delegate address of the validator
  Validator sdk.AccAddress 
  // set to 0 if the upgrade should happen as soon as the quorum is reached, or future height for a planned upgrade
  UpgradeHeight uint64 
  // the name of the upgrade - the new binary must have a handler with this name to apply migrations
  UpgradeName string 
  // set this value to something greater than 2/3 to indicate that this upgrade requires a higher quorum,
  // the rationale being that we usually actually want 80 or 90%+ of the network ready to do a smooth upgrade,
  // in the case of security hot-fixes we may need to leave the low threshold of 2/3
  QuorumNeeded sdk.Dec
}

Upgrades will only be triggered at the UpgradeHeight if a QuorumNeeded weight of validators have signalled that they will upgrade. Validators could remove their signal even a few blocks before the height to postpone or abort.
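The threshold check described above could be sketched roughly as follows. This is only an illustration under assumptions, not part of the proposal: `Fraction` stands in for `sdk.Dec`, validators are plain strings, `ShouldUpgrade` and the `power` map are hypothetical names, and the rule for reconciling different `QuorumNeeded` values (take the highest one requested, with 2/3 as the floor) is a guess, since the proposal does not specify it.

```go
package main

import "fmt"

// Fraction is a hypothetical stand-in for sdk.Dec (Num/Den, Den > 0).
type Fraction struct{ Num, Den int64 }

// greaterThan compares two fractions by cross-multiplication.
func (f Fraction) greaterThan(g Fraction) bool { return f.Num*g.Den > g.Num*f.Den }

// Signal is a simplified MsgSignalUpgrade for this sketch.
type Signal struct {
	Validator     string
	UpgradeHeight uint64
	UpgradeName   string
	QuorumNeeded  Fraction // zero value means "no opinion, use the 2/3 floor"
}

// ShouldUpgrade reports whether the validators signalling upgrade `name` at
// `height` hold at least the required share of total voting power.
// Assumption: the highest QuorumNeeded among matching signals applies.
func ShouldUpgrade(signals []Signal, power map[string]int64, name string, height uint64) bool {
	var total int64
	for _, p := range power {
		total += p
	}
	required := Fraction{2, 3} // the low threshold mentioned in the message comments
	seen := map[string]bool{}  // count each validator once
	var signalled int64
	for _, s := range signals {
		if s.UpgradeName != name || s.UpgradeHeight != height || seen[s.Validator] {
			continue
		}
		seen[s.Validator] = true
		signalled += power[s.Validator]
		if s.QuorumNeeded.Den > 0 && s.QuorumNeeded.greaterThan(required) {
			required = s.QuorumNeeded
		}
	}
	// signalled/total >= required, written without division.
	return total > 0 && signalled*required.Den >= required.Num*total
}

func main() {
	power := map[string]int64{"val1": 40, "val2": 30, "val3": 30}
	signals := []Signal{
		{Validator: "val1", UpgradeHeight: 100, UpgradeName: "v2"},
		{Validator: "val2", UpgradeHeight: 100, UpgradeName: "v2"},
	}
	fmt.Println(ShouldUpgrade(signals, power, "v2", 100))
}
```

Removing a signal simply drops that validator's power from the tally, which is how a postponement a few blocks before the height would take effect in this model.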

Aborting

Aborting would be handled by some --abort-upgrade <upgrade-name> command line flag to the daemon. If set, this would cause the current binary (without the upgrade handler) to not expect an upgrade handler and continue processing blocks. This would likely only be used in cases when the upgrade handler itself hit some error. In the error case, currently the only fix would be to release a new binary with either bug fixes or a no-op handler. The abort flag, instead, would basically alter the state machine behavior of the current binary to allow a smooth abort in error cases, such as what happened with cosmos-hub-3.
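As a rough sketch of the decision a node would face at the upgrade height, assuming the abort flag has been parsed into a string: `AtUpgradeHeight`, the `Action` values, and the `handlers` map are hypothetical names for illustration, and letting the abort flag take precedence over a registered handler is an assumption the proposal does not settle.

```go
package main

import "fmt"

// Action is what the node does when it reaches a planned upgrade height.
type Action string

const (
	RunHandler Action = "run-handler"              // new binary applies migrations
	Continue   Action = "continue-without-upgrade" // abort: keep processing blocks
	Halt       Action = "halt"                     // old binary, no abort: stop
)

// AtUpgradeHeight decides what to do for upgrade `plan`.
// handlers lists the upgrade names this binary has handlers for;
// abort is the value passed to the proposed --abort-upgrade flag ("" if unset).
func AtUpgradeHeight(plan string, handlers map[string]bool, abort string) Action {
	switch {
	case abort != "" && abort == plan:
		return Continue // assumption: an explicit abort wins over a handler
	case handlers[plan]:
		return RunHandler
	default:
		return Halt
	}
}

func main() {
	oldBinary := map[string]bool{} // no handler for "v2"
	fmt.Println(AtUpgradeHeight("v2", oldBinary, ""))
	fmt.Println(AtUpgradeHeight("v2", oldBinary, "v2"))
}
```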

@ethanfrey

I think the abort idea is good. Whether a command line flag is too dangerous (and we need a different binary) is an open question, but we definitely should have the choice of "perform scheduled upgrade A with binary XYZ" or "ignore upgrade A and continue with old binary". If this is not coordinated well, it can lead to a fork, so maybe use a longer name, like --unsafe-abort-upgrade-i-am-sure <upgrade-name>.

This takes care of the second and third points in the motivation.

@ethanfrey

As to hotfixes, I think this is a rather complex solution, and maybe it just involves off-chain coordination.

If v0.x.0 -> v0.x.1 is just a hotfix patch that closes a security hole, people can just deploy it asap, and it doesn't need to be coordinated, as happened with a Cosmos Hub vulnerability. The only issue is if the hole was exploited: for the AppHash to work out, we need to run the vulnerable binary until height H and the improved one after H.

This is trickier if it is still being actively exploited. In any case, I think this should be coordinated off-chain and not an official upgrade.

One idea is to do something like Ethereum, and have a feature gate (if height < hotfix-height { do old code } else { do new code }), and set hotfix-height in app.toml (a subjective file).

The next official upgrade can remove this gate and just run the new code, but in the meantime this allows a subjective feature switch to be coordinated at a height, as a supplement to the on-chain coordination, for fast security response. I think we should avoid all on-chain governance for critical fixes (as the chain might be frozen, for example).
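A minimal sketch of such a height gate, assuming the hotfix height has already been read from the node's local config: `Config`, `HotfixHeight`, `Process`, and the two placeholder logic functions are hypothetical names, and treating 0 as "gate disabled" is an assumption.

```go
package main

import "fmt"

// Config holds a hypothetical per-node setting loaded from app.toml.
type Config struct {
	HotfixHeight int64 // assumption: 0 means the gate is not configured
}

// oldLogic and newLogic are placeholders for the vulnerable and patched paths.
func oldLogic(x int64) int64 { return x + 1 }
func newLogic(x int64) int64 { return x + 2 }

// Process runs the old code below the hotfix height and the new code from
// that height onward, so every node that sets the same height switches at the
// same block and keeps computing the same AppHash.
func Process(cfg Config, height, x int64) int64 {
	if cfg.HotfixHeight == 0 || height < cfg.HotfixHeight {
		return oldLogic(x)
	}
	return newLogic(x)
}

func main() {
	cfg := Config{HotfixHeight: 100}
	fmt.Println(Process(cfg, 99, 1), Process(cfg, 100, 1))
}
```

Because the height lives in a local file rather than in chain state, nodes that set different values would diverge at the lower of the two heights, which is why the value still has to be agreed off-chain.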

@ethanfrey

I like how irisnet uses signaling, but this requires people to actually run software that can handle, e.g., v1 and v2 simultaneously, so the signal in the block header is a guarantee that they have already upgraded.
