@philipcmonk
Created May 1, 2023 23:08

"Stepwisdom" and "step nomadism" are two competing philosophies of how to sequence code upgrades, and they have a very long history in Urbit.

Stepwisdom is the idea that the system should guarantee that every update to a piece of code is run stepwise, in order. That is, you only ever upgrade from version n to n+1. This is very nice for developers -- instead of considering n different possible upgrade scenarios, you can consider only one. This sounds so nice that we've always planned to do it, though somehow we've never quite got there.

Step nomadism is the idea that you have no guarantee that any previous update was run -- you must be able to upgrade directly from any previous version. This is very nice for the upgrade system -- you can use sentences like "if that upgrade doesn't work, try this one" and "don't run old code". However, devs can't take as many shortcuts around upgrades.
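As a rough illustration of the difference, here is a sketch in Python rather than Hoon. The function names are hypothetical and nothing here corresponds to an actual Arvo interface; versions are modeled as plain increasing integers. The two philosophies differ in which upgrades the system promises to run:

```python
from __future__ import annotations


def stepwise_path(installed: int, latest: int) -> list[int]:
    """Stepwisdom: every intermediate version is applied, in order."""
    if installed > latest:
        raise ValueError("installed version is newer than latest")
    return list(range(installed + 1, latest + 1))


def nomadic_path(installed: int, latest: int) -> list[int]:
    """Step nomadism: jump straight to the newest version, skipping the rest."""
    if installed > latest:
        raise ValueError("installed version is newer than latest")
    return [] if installed == latest else [latest]
```

Under the first function a dev writes one migration per version and trusts the system to chain them. Under the second, the newest version must itself know how to load every older state.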

While stepwisdom is still the official story and has never been officially rejected, step nomadism has become ascendant -- in non-kelvin cases, the system no longer makes any specific attempt to keep upgrades stepwise, and often they are not stepwise. Kelvins are the last bastion of support for stepwisdom, and even there we do not actually have hard guarantees of stepwisdom. And yet, even though practically we live in a step nomadic world, we still write apps and Arvo as though we make stepwise guarantees.

I'm going to argue that stepwisdom should be rejected completely, and that we should follow the ramifications of that all the way through the system.

Stepwisdom puts stringent requirements on the update system. For example, we must have a clear linear concept of update ordering for a particular app, with no disagreements between parties. This includes synchronizing the version included on boot with what's found on the network. These are doable, but we don't do them yet because they introduce significant complexity.

However, the worst requirement is that stepwisdom needs either a certain amount of online-ness or the ability to run old code (and not just as a pure function -- as an agent with IO capabilities).

Consider the case of an app which wants to receive stepwise updates.

  • You suspend the app at 417K
  • The app receives update U at 417K, but you don't apply it because it's suspended.
  • You upgrade to 416K
  • The app receives update V, making it compatible with 416K
  • You try to unsuspend the app

In this case, you can't directly apply update U, because by the time you're even considering it, you're on 416K, and U was written for 417K.

You have only a few options:

  • Require that you unsuspend and update any apps before a kelvin update (not practical), or else you'll have to nuke those apps and reinstall when desired, with no state migration. This violates the idea that you can suspend an app or your ship and turn it on again years later and everything still works.

  • Require that every required upgrade be rewritten for every kelvin, so that you can apply a version of update U forward-ported to 416K. This is very onerous for devs, because the amount of migration code goes up non-linearly.

  • Require that future %base kelvins always be able to run apps at past kelvins. I believe this is impractical with Arvo as currently designed -- there are far too many random bits of state and invariants to uphold.

  • Apply update V, forswearing stepwisdom and embracing step nomadism.

Step nomadism seems like the easy way out -- "do nothing". However, I believe the only practical way to make step nomadism safe is to reject the common architecture of keeping much of the app state implicit in the system. For example, it's not uncommon for an app to have critical information stored in a wire for an outstanding request instead of in its formal state. To update far into the future requires not just that the app receive its old formal state, but that every counterparty maintain the same semantics as before, returning responses in at least some way so that we can retrieve this implicit state.

This is totally impractical, and we should have abandoned it long ago. This suggests the following principle:

"Outstanding IO should always be ephemeral, and we should never rely on it returning. Any leases on the external world must be reflected in the formal state, and you must be able to rebuild those leases from scratch."

You rely on a Behn timer coming back? Write it in your formal state, so you can recreate it. You rely on a subscription staying alive? Write it in your formal state, so you can recreate it. You made some bespoke protocol where you sent some poke and you rely on receiving some other poke in response? Write it in your formal state, and ensure that your protocol has some way to recreate that situation.
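A minimal sketch of what "write it in your formal state" could look like, in Python rather than Hoon. The `AppState` shape, the effect tuples, and `rebuild_leases` are all hypothetical names chosen for illustration, not Gall's actual types:

```python
from dataclasses import dataclass, field


@dataclass
class AppState:
    # Ordinary application data.
    data: dict = field(default_factory=dict)
    # Leases on the outside world, recorded explicitly (hypothetical shapes).
    timers: set = field(default_factory=set)         # wake-up times
    subscriptions: set = field(default_factory=set)  # (ship, path) pairs


def rebuild_leases(state: AppState) -> list:
    """Return the IO requests needed to recreate every lease from scratch."""
    effects = [("set-timer", t) for t in sorted(state.timers)]
    effects += [("subscribe", ship, path)
                for ship, path in sorted(state.subscriptions)]
    return effects
```

Because the leases live in the state, `rebuild_leases` depends on nothing but the state. It gives the same answer whether the app was suspended for a minute or for years.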

Some of this can be automatically stored by Gall. For example, it knows what subscriptions you had open before the upgrade. However, it cannot automatically restore those subscriptions -- who knows if they're meaningful anymore? Actually, that's not a rhetorical question -- the app dev is the only one who knows what they meant before and what the modern equivalent is.

So what this looks like ultimately is that your +on-load gets not just a vase of your old state, but also a list of your old subscriptions, a list of timers you had set, a list of files you were subscribed to, etc. Crucially, none of this IO is still pending anymore. This is an archive, and it's your job to recreate those subscriptions, set those timers, etc.
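To make the shape of that concrete, here is a hedged sketch in Python rather than Hoon. `IoArchive`, the `on_load` signature, the effect tuples, and the `/v1/` to `/v2/` path rename are all assumptions made up for illustration; the real +on-load arm takes only a vase today:

```python
from dataclasses import dataclass, field


@dataclass
class IoArchive:
    """Record of IO that was outstanding before the upgrade; none is pending."""
    subscriptions: list = field(default_factory=list)  # (ship, path) pairs
    timers: list = field(default_factory=list)         # wake-up times
    files: list = field(default_factory=list)          # watched file paths


def on_load(old_state: dict, archive: IoArchive, now: int):
    """Return (new_state, effects); the app decides which old IO still matters."""
    new_state = dict(old_state)
    effects = []
    for ship, path in archive.subscriptions:
        # Assumption for illustration: the old /v1/ paths were renamed to /v2/.
        effects.append(("subscribe", ship, path.replace("/v1/", "/v2/", 1)))
    for t in archive.timers:
        # A timer that expired while the app was off is set for the present.
        effects.append(("set-timer", max(t, now)))
    for f in archive.files:
        effects.append(("watch-file", f))
    return new_state, effects
```

The point of the design is that translating an old subscription into its modern equivalent happens in app code, where the knowledge of what it meant actually lives.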

(As an optimization, there may be two reload modes: one for small updates which does not cancel all your IO, and one for large updates (maybe triggered by kelvin changes and/or major version number changes to your app) which does.)
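One possible way to choose between the two modes, sketched in Python. The trigger rule is only the "maybe" suggested above, and the mode names and the version-tuple representation are hypothetical:

```python
def reload_mode(old_kelvin: int, new_kelvin: int,
                old_version: tuple, new_version: tuple) -> str:
    """Pick "full" (cancel and archive all IO) or "light" (leave IO in place)."""
    kelvin_changed = old_kelvin != new_kelvin
    major_changed = old_version[0] != new_version[0]
    return "full" if kelvin_changed or major_changed else "light"
```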

This obviously sacrifices a certain way of working. It's a repudiation of the long-time Urbit ideal of "just send this IO, and you're guaranteed some kind of response, even if it's years later." It's my belief that this is in conflict with another Urbit ideal: if you shut down your app for years and then turn it back on, will it still work?

Whenever we run into these conflicts, we have to choose one to abandon, and I think the permanence property is more important than the developer convenience of not having to maintain their state explicitly.

The requirement that every app be able to respond to a "turn it off and on again" event (kind of like a %born) is a radical change, but it's very reminiscent of how we switched from claiming that a subscription would get you every fact in order (no connections!) to requiring that every app be able to handle arbitrary kicks and loss of data by implementing a "fetch the backlog" flow.

It also opens up whole frontiers of usability. For example, it implies that when you look at the state of an app, you're looking at the real state of the app -- if there's a problem, you'll see it. You literally could suspend an app, tweak its state in a noun editor (e.g. the dojo), unsuspend it, and you would expect it to work. Want to delete some state that you know you don't need to free up space? Just delete it! We can hardly imagine doing that now, because it would break a ton of invariants across your ship and the network.

The basic thrust of this proposal is to reject implicit state and make all state explicit. State is much simpler than code, and interpreting old state is much easier than interpreting old code. It's just a noun, so you can literally migrate it with a pure function.
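A small sketch of migrating state with a pure function, in Python rather than Hoon. The version tags and field names are invented for illustration. The function takes any known old state and returns the current shape directly, with no IO and no old code involved:

```python
def migrate(old: dict) -> dict:
    """Pure function: any known old state shape in, current shape (v3) out."""
    version = old.get("version")
    if version == 1:
        # Hypothetical v1: a bare list of messages, no record of leases.
        return {"version": 3, "messages": list(old["messages"]),
                "subscriptions": []}
    if version == 2:
        # Hypothetical v2: kept a single subscription under "sub", not a list.
        subs = [old["sub"]] if old.get("sub") else []
        return {"version": 3, "messages": list(old["messages"]),
                "subscriptions": subs}
    if version == 3:
        return old
    raise ValueError(f"unknown state version: {version!r}")
```

Each branch goes straight to the newest shape, which is the step nomadic contract: no branch assumes that any other upgrade ran first.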

I would like us to commit to step nomadism, so that we can stop pretending that it's okay to cut corners on upgrades.

@jackfoxy

The basic thrust of this proposal is to reject implicit state and make all state explicit.

You hit it on the head. How much of this could be done by Gall?