c42f/StructuredConcurrency

## StructuredConcurrency
Chris Foster  4 days ago
Current notes are in my fork of the Juleps repo at
https://github.com/c42f/Juleps/blob/cjf/structured-concurrency/StructuredConcurrency.md
but most people seem to ignore the main Juleps repo so I haven't sent a PR yet (also it's very WIP).
We could move the notes to Jameson's julep wiki or somewhere else more conducive to collaboration? (edited)

Kiran Pamnany  3 days ago
This is nicely written -- good literature survey.
Kiran Pamnany  3 days ago
I'd be very interested in your (or anyone's) thoughts on Erlang's approach.

Chris Foster  3 days ago
I'm definitely interested in learning from everywhere, including the way that Erlang and Elixir supervisors work on OTP
Chris Foster  3 days ago
(Note that people like njsmith and elizarov seem quite aware of the Erlang prior art.)
Chris Foster  3 days ago
What would be best would be to learn from other language developers directly. So once we feel like we've got a decent survey I suggest we cross post it to the Structured Concurrency forum.

Jameson  3 days ago
That does look nice. Although I continue to reject the premise that Tasks need to be cancellable

Takafumi Arakaki  3 days ago
I think it is important to discuss cancellation and what njsmith calls the "black box rule" (which is IMHO more fundamental) separately. I think that's why he discussed cancellation in a separate (also long) blog post.
Takafumi Arakaki  3 days ago
I posted a longer comment on the "black box rule"  here: https://github.com/c42f/Juleps/pull/1

Stefan Karpinski  2 days ago
Great writeup
Stefan Karpinski  2 days ago
@vtjnash what’s your issue with tasks being cancellable?
Stefan Karpinski  2 days ago
I’m curious what the counterargument to structured concurrency is
Stefan Karpinski  2 days ago
As in what are you giving up?
Stefan Karpinski  2 days ago
I feel like all the discussions focus on the benefits, which are compelling, but what are you giving up? It must make some kinds of programs harder to write; getting into that side would be helpful
Stefan Karpinski  2 days ago
Some examples of patterns that can’t be written anymore and how you express them instead

Jameson  2 days ago
I just don’t think it has anything to do with structured concurrency

Stefan Karpinski  2 days ago
It’s not necessary, since the essential aspect is nesting of function execution, but two of the key benefits are that errors flow up the task tree and cancellation flows down
Stefan Karpinski  2 days ago
Given structured concurrency, why not support systematic cancellation? (edited)
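
To make "errors flow up the task tree and cancellation flows down" concrete, here is a minimal sketch in Python's asyncio. It is only an illustration, not a proposal for Julia's API: the names `worker`, `failing`, `parent` and `run_demo` are hypothetical, and the parent is hand-rolled with `asyncio.wait` rather than taken from any structured concurrency library.

```python
import asyncio

async def worker(name, log):
    try:
        await asyncio.sleep(10)          # stands in for long-running work
    except asyncio.CancelledError:
        log.append(f"{name} cancelled")  # the child observes the cancellation
        raise                            # and lets it keep propagating

async def failing():
    await asyncio.sleep(0.01)
    raise ValueError("child failed")

async def parent(log):
    # Hypothetical parent scope: it does not return until every child is done.
    tasks = [asyncio.create_task(c)
             for c in (worker("a", log), worker("b", log), failing())]
    await asyncio.wait(tasks, return_when=asyncio.FIRST_EXCEPTION)
    for t in tasks:
        if not t.done():
            t.cancel()                   # cancellation flows down to the siblings
    results = await asyncio.gather(*tasks, return_exceptions=True)
    for r in results:
        if isinstance(r, Exception) and not isinstance(r, asyncio.CancelledError):
            raise r                      # the error flows up to the parent's caller

def run_demo():
    log = []
    try:
        asyncio.run(parent(log))
    except ValueError as err:
        log.append(f"parent saw: {err}")
    return log
```

Calling `run_demo()` returns the two "cancelled" entries followed by the error seen by the caller. The point of the nesting is that `parent` cannot finish while any child is still running.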

Jameson  2 days ago
I should write a blog post counterpoint titled “cancellation considered harmful” that talks about how structured concurrency doesn’t need cancellation
Jameson  2 days ago
Or maybe not, because njs (Trio) already seems to have written it: https://vorpus.org/blog/timeouts-and-cancellation-for-humans/#an-escape-hatch

Stefan Karpinski  2 days ago
You keep saying that one doesn’t need cancellation to do structured concurrency, which is clearly true, but you haven’t made the case for why you shouldn’t have cancellation given structured concurrency, which provides a sensible behavior for it

Jameson  2 days ago
njs says it at the bottom of that post under the “asyncio” analysis
Jameson  2 days ago
Also, this is pretty good, if very long https://trio.discourse.group/t/graceful-shutdown/93 (after reading the whole thing, I think the top post covers everything, but the subsequent dialog is still a good discussion)

Chris Foster  1 day ago
My current feeling is that cancellation at arbitrary points is deeply problematic in most systems which have tried to do it. But there are some exceptions, for example normal OS processes can generally be killed without much consequence. And I've got a suspicion that Erlang supervisors might achieve something similar. What are the essential aspects of isolation which are required to make hard cancellation work? (edited)
Chris Foster  1 day ago
I do agree with @vtjnash that it's not obvious we must support timely cancellation for every task for us to reap many of the benefits of SC.  For example, we could support it only for IO operations. But by the same token this feels incomplete to me.

Jameson  1 day ago
kill is a horrible PITA and basically impossible to use safely. It’s a great hammer, but hard to use well. Mostly you just need to depend on the system (kernel) running garbage collection after you die (RIP)
Jameson  1 day ago
All IO objects inherently support cancellation already, we just don’t currently implement the ability to group them conveniently with @sync
Jameson  1 day ago
FWIW, I don’t know of any cancellation system (sans kill and Java terminate) that actually provides timely cancellation. They all seem to discuss how it should be done and is required, but is impossible.

Stefan Karpinski  1 day ago
Is the hangup here the guarantee?

Jameson  1 day ago
the other problem is that all of the cleanup you actually would want to do (closing I/O objects, deleting files, logging the error) would also get canceled, so no cleanup would actually occur
Jameson  1 day ago
Oh, and much of the time it won’t actually stop. It’s very easy to write something that just gets stuck unaware.
Jameson  1 day ago
kill -9 is basically fine, that’s rather like exit(137), and the kernel garbage collector will be left to clean up the pieces

Chris Foster  1 day ago
So what is it that makes kill -9 work when other things don't? It's the strong isolation between resources of one process and another allowing them to be gc'd.
But even kill -9 doesn't work sometimes. I mean, it can stop the process but that might take other parts of the system down when the process is unexpectedly missing.

Jameson  1 day ago
Right. I think the difference in practice is that sending kill -9 actually has the effect of closing all of the open resources. That’s a scenario the other processes are usually written to handle in some way.
Jameson  1 day ago
That it also stops running the code is somewhat irrelevant, because it’s already been cut off from all side-effect channels
Jameson  1 day ago
But cancel sort of does exactly the opposite of what you would ever want. It’s like you sent kill to the garbage collector, then resumed executing the code until it had finished cleaning up manually.
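
The point that kill -9 closes the victim's open resources, and that peers see this as an ordinary end-of-stream, can be sketched with Python's standard library. This assumes a POSIX system (`signal.SIGKILL` does not exist on Windows), and the function name `kill_and_observe` is hypothetical.

```python
import signal
import subprocess
import sys

def kill_and_observe():
    # The child inherits the write end of a pipe and would sleep for a minute.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"],
        stdout=subprocess.PIPE,
    )
    child.send_signal(signal.SIGKILL)  # the `kill -9` of the discussion
    # The child runs no cleanup code of its own. The kernel closes its end
    # of the pipe, so this read returns with end-of-file instead of hanging.
    leftover = child.stdout.read()
    status = child.wait()              # negative signal number on POSIX
    child.stdout.close()
    return leftover, status
```

The parent only has to handle a closed pipe, which it must be prepared for anyway.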

Stefan Karpinski  1 day ago
Can we maybe think constructively about how it could be made to work?

Chris Foster  2 hours ago
Yeah, that's the intention. I'm trying to understand Jameson's point of view :slightly_smiling_face:
What I think is interesting is that in the real world cancellation is definitely a thing: if your process starts behaving weirdly you kill -9 it.
If that locks up some parts of your OS, you restart the machine.
If that causes a distributed system to fail, you might need to restart a whole bunch of machines.
Outside of computing, if a business unit is failing to make a profit, it's restructured or terminated and its duties redistributed. If that causes the company to fail to service its debts, it goes into bankruptcy and gets restructured.
It's all very messy, but there's some kind of interesting informal hierarchy of supervision going on, and I've got a feeling that this is where I need to start reading about Erlang supervisors.
Chris Foster  20 minutes ago

    the other problem is that all of the cleanup you actually would want to do (closing I/O objects, deleting files, logging the error) would also get canceled, so no cleanup would actually occur

@vtjnash I think you mean that all IO operations will return errors because cancellation is level-triggered? So you can't even call close, or send a goodbye message down a socket? Trio has a kind-of answer to this: cleanup scopes exist to explicitly protect against cancellation for the duration of the cleanup. I think this is a consistent answer because only people who are thinking carefully about cleanup will use this feature, so they are in the correct mindset when they enable it explicitly. I assume it can also be nested with timeouts, etc. (edited)

Jameson  13 hours ago
Right, IO is generally shorthand for any externally visible side effect
Jameson  13 hours ago
But now you’ve just directly contradicted everything you claimed earlier about cancellation. That’s not a consistent answer.
Jameson  13 hours ago
Just consider the example in Trio about how to use it correctly:

with trio.move_on_after(TIMEOUT):
    conn = make_connection()
    try:
        await conn.send_hello_msg()
    finally:
        with trio.move_on_after(CLEANUP_TIMEOUT) as cleanup_scope:
            cleanup_scope.shield = True
            await conn.send_goodbye_msg()

Jameson  13 hours ago
It’s a rambling mess complaining about how this is bad (“shooting yourself in the foot”) and incompatible with the trio design
Jameson  13 hours ago
But also that you need to do it.

Chris Foster  2 hours ago
Yeah, I think I've written a few bone headed things above but that's just part of the process of understanding the nature of the problem. (edited)
Chris Foster  2 hours ago
BTW, I don't mean to claim the trio way you've quoted above is bad or inconsistent; it makes perfect sense to me. Do you think the Trio cancellation semantics make sense for Julia? (edited)
Chris Foster  2 hours ago
I do realize (now) that preemptive hard cancellation of the kill -9 variety is a completely different approach to the problem of cancelling things. (edited)
Chris Foster  1 hour ago
There seem to be at least three different flavors of cancellation:
1. Cooperative with task-defined handlers (Trio style) where cancel points are explicit, and tasks may run cleanup code.
2. Hard preemptive cancellation (kill -9 style) where you need enough isolation so that resources can be GC'd. Incompatible with task-defined cleanup. Has several known in-process APIs which have been a historical disaster; Thread.stop etc.
3. Preemptive with task-defined cleanup handlers (InterruptException style) - a weird mixture where cancel points are "almost" everywhere, but task-defined cleanup can still run. Known to be a reliability disaster if just tossed into a language without thinking very hard about what "almost" means.
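
Flavor 1 can be sketched with Python's asyncio, used here only because it ships with the standard library. The names `job` and `run_demo` are hypothetical. Note one difference from Trio: asyncio delivers a cancellation request once (edge-triggered), so the `await` inside `finally` below runs without any shield, whereas under Trio's level-triggered rule it would be cancelled again unless shielded.

```python
import asyncio

async def job(log):
    try:
        log.append("start")
        await asyncio.sleep(0)    # checkpoint: cancellation could be delivered here
        log.append("compute")     # no await, so this step cannot be interrupted
        await asyncio.sleep(10)   # checkpoint: the cancel request lands here
        log.append("unreachable")
    finally:
        log.append("cleanup started")
        await asyncio.sleep(0)    # stands in for sending a goodbye message
        log.append("goodbye sent")

async def main(log):
    task = asyncio.create_task(job(log))
    await asyncio.sleep(0.01)     # let the job reach its second checkpoint
    task.cancel()                 # a request, honored at the job's next await
    try:
        await task
    except asyncio.CancelledError:
        log.append("cancelled")

def run_demo():
    log = []
    asyncio.run(main(log))
    return log
```

`run_demo()` returns `["start", "compute", "cleanup started", "goodbye sent", "cancelled"]`. A job that never reaches an `await` would never see the request, which is the "gets stuck unaware" failure mode mentioned earlier in the thread.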

Jameson  20 minutes ago
Yes, the dialog here has been good. I’m just very opinionated about this subject, haha.
Jameson  19 minutes ago
1. Yes, but I think it’s faulty to say it’s explicit in Trio. It’s perhaps explicit in Go, but in Trio it’s merely limited. This design seems to be most nearly a clone of pthread_cancel?
Jameson  13 minutes ago
2. Is known to excel in certain cases, and might also be called the Erlang model. As you point out, it’s also a known foot-gun if the language isn’t sufficiently restricted.
Jameson  9 minutes ago
3. This seems uncommon, although C does (allow for) defining it. It mentions that this is very limited in what functionality is permissible on such a child thread however
Jameson  6 minutes ago
I think you’re missing 4: Julia-style cancellation, aka OO cancellation. Any IO object can be cancelled by calling close, which initiates termination of outstanding requests.
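
The "cancel by calling close" idea can be sketched as a small Python analogue. This is not Julia's implementation: the `Channel` and `ClosedError` names are hypothetical, and threads stand in for tasks. The resource itself is the cancellation handle, so closing it wakes every outstanding request with an error.

```python
import threading
from collections import deque

class ClosedError(Exception):
    """Raised to requests that are outstanding when the channel is closed."""

class Channel:
    def __init__(self):
        self._items = deque()
        self._closed = False
        self._cond = threading.Condition()

    def put(self, item):
        with self._cond:
            if self._closed:
                raise ClosedError("channel closed")
            self._items.append(item)
            self._cond.notify()

    def take(self):
        with self._cond:
            # Block until there is data or someone closes the channel.
            while not self._items and not self._closed:
                self._cond.wait()
            if self._items:
                return self._items.popleft()
            raise ClosedError("channel closed")

    def close(self):
        with self._cond:
            self._closed = True
            self._cond.notify_all()   # terminate all outstanding take() calls

def run_demo():
    ch = Channel()
    outcome = []

    def reader():
        try:
            ch.take()                 # no data will ever arrive
        except ClosedError as err:
            outcome.append(str(err))

    t = threading.Thread(target=reader)
    t.start()
    ch.close()                        # cancels the pending take()
    t.join(timeout=5)
    return outcome
```

`run_demo()` returns `["channel closed"]`: the reader is never signalled as a task, it simply sees its request fail and handles that like any other IO error.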