@aaronc
Created September 26, 2019 13:50
An alternate upgrade coordination mechanism

Motivation

Our current upgrade module uses the gov module to trigger a planned upgrade. This has a few potential downsides:

  • if a hot-fix release is needed, the network needs to wait for the full voting window
  • if validators need to postpone an upgrade after the governance vote due to some issues found in testing, they can't do that
  • no built-in way to abort an upgrade in case the upgrade handler fails

Proposal

A signalling method has been discussed in the past, but wasn't formally specified. Here is a proposed approach.

Signals

Validators, rather than governance, "signal" upgrades and the state machine responds based on signalling thresholds:

type MsgSignalUpgrade struct {
  // The self-delegate address of the validator.
  Validator sdk.AccAddress
  // Set to 0 if the upgrade should happen as soon as the quorum is reached,
  // or to a future height for a planned upgrade.
  UpgradeHeight uint64
  // The name of the upgrade. The new binary must have a handler with this name to apply migrations.
  UpgradeName string
  // Set this to something greater than 2/3 to indicate that this upgrade requires a higher quorum.
  // The rationale is that we usually want 80 or 90%+ of the network ready in order to do a smooth upgrade;
  // for security hot-fixes we may need to stay at the low threshold of 2/3.
  QuorumNeeded sdk.Dec
}

Upgrades will only be triggered at the UpgradeHeight if validators holding at least QuorumNeeded of the voting weight have signalled that they will upgrade. Validators could withdraw their signal, even a few blocks before that height, to postpone or abort the upgrade.
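
The threshold check described above could be sketched roughly as follows. This is a dependency-free illustration, not the upgrade module's actual code: the Signal type, the ShouldUpgrade and SignalledPower helpers, string addresses, and quorums expressed in basis points (in place of sdk.Dec) are all assumptions made for the example. The proposal does not say how differing QuorumNeeded values from different validators are reconciled, so the sketch simply takes the required quorum as a parameter.

```go
package main

// Signal is a simplified, hypothetical stand-in for MsgSignalUpgrade.
// A validator's latest signal replaces any earlier one; deleting the map
// entry models withdrawing the signal.
type Signal struct {
	UpgradeName   string
	UpgradeHeight uint64
}

// SignalledPower sums the voting power of every validator whose current
// signal matches the given upgrade name and height.
func SignalledPower(signals map[string]Signal, power map[string]uint64, name string, height uint64) uint64 {
	var total uint64
	for addr, s := range signals {
		if s.UpgradeName == name && s.UpgradeHeight == height {
			total += power[addr]
		}
	}
	return total
}

// ShouldUpgrade reports whether the signalled share of total voting power
// is at least quorumBP basis points (for example, 6667 is roughly 2/3 and
// 9000 is 90%). The basis-point encoding is an assumption of this sketch.
func ShouldUpgrade(signals map[string]Signal, power map[string]uint64, name string, height uint64, quorumBP uint64) bool {
	var totalPower uint64
	for _, p := range power {
		totalPower += p
	}
	if totalPower == 0 {
		return false
	}
	signalled := SignalledPower(signals, power, name, height)
	// Cross-multiply to avoid floating point: signalled/total >= quorumBP/10000.
	return signalled*10000 >= quorumBP*totalPower
}
```

With voting powers of 50, 30 and 20, signals from the first two validators cover 80% of the weight, which would satisfy a 2/3 quorum but not a 90% one. If the second validator withdraws its signal, the remaining 50% no longer meets either threshold, which is how a postponement would play out in this model.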

Aborting

Aborting would be handled by an --abort-upgrade <upgrade-name> command-line flag to the daemon. If set, this would cause the current binary (the one without the upgrade handler) to stop expecting an upgrade handler and to continue processing blocks. This would likely only be used when the upgrade handler itself hit some error. Currently, the only fix in that case is to release a new binary with either bug fixes or a no-op handler. The abort flag would instead alter the state machine behavior of the current binary to allow a smooth abort in error cases, such as what happened with cosmos-hub-3.
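
As a rough sketch of the decision the daemon would make when it reaches a planned upgrade height: the Action type and the AtUpgradeHeight function are hypothetical names for this example, and the set of registered handlers is modelled as a plain map. Letting the abort flag take precedence even when a handler is present is also an assumption; the proposal only describes the binary that lacks the handler.

```go
package main

// Action is what the node does when it reaches a planned upgrade height.
type Action int

const (
	// Continue keeps processing blocks with the current binary.
	Continue Action = iota
	// RunHandler applies the upgrade handler's migrations.
	RunHandler
	// Halt stops the node so operators can switch binaries.
	Halt
)

// AtUpgradeHeight decides what to do for the planned upgrade planName.
// handlers holds the upgrade names this binary has handlers for, and
// abortName is the value passed to --abort-upgrade ("" if the flag is unset).
func AtUpgradeHeight(planName string, handlers map[string]bool, abortName string) Action {
	switch {
	case abortName != "" && abortName == planName:
		// The operator asked to abort this specific upgrade: skip it.
		return Continue
	case handlers[planName]:
		// The new binary is running and knows how to migrate.
		return RunHandler
	default:
		// Old binary, no handler, no abort requested: stop and wait.
		return Halt
	}
}
```

Matching on the upgrade name means an abort flag left over from an earlier upgrade would not silently skip a later one.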

@ethanfrey

I like how irisnet uses signaling, but this requires people to actually run software that can handle e.g. v1 and v2 simultaneously, so that the signal in the block header is a guarantee they have already upgraded.
